The present application is based upon and claims priority to Chinese Patent Application No. 202210483572.4, filed on May 5, 2022, the entire content of which is hereby incorporated by reference.
The present invention relates to the technical field of intelligent traffic scheduling, and in particular to a method for intelligent traffic scheduling based on deep reinforcement learning, which achieves energy-saving and high-performance traffic scheduling in a data center environment.
With the rapid development of the Internet, global data center traffic is increasing explosively. A data center network carries thousands of services, and demands for network service traffic are non-uniformly distributed and vary dynamically over a wide range, so that network infrastructures face a problem of huge energy consumption. Existing research shows that in recent years the energy consumption of data centers has accounted for 8% of global electricity consumption, of which the network infrastructures account for 20% of the energy consumption of the data center. In the face of ever more complex and changeable network application services and the rapid increase in the energy consumption of network infrastructures, a conventional routing algorithm aiming only at high-performance network service quality cannot meet the application requirements. Therefore, on the premise of guaranteeing the demand for network services, network energy-saving optimization should also be treated as an optimization target in order to reduce the influence of the high energy consumption of the network infrastructures.
Current data center traffic shows a distribution of elephant flows (80%-90% of the traffic volume) and mice flows (10%-20%). An elephant flow usually has a long working time and carries a large data volume: less than 1% of the flows can carry more than 90% of the traffic, and less than 0.1% of the flows can last for 200 s. A mice flow usually has a short working time and carries a small data volume: the number of mice flows reaches 80% of the total number of flows, and the transmission time of each mice flow is less than 10 s. Therefore, by processing the elephant flow and the mice flow differently in traffic scheduling, energy-saving and high-performance traffic scheduling can be realized.
Aiming at the technical problems that a conventional routing algorithm has poor real-time performance, unbalanced resource distribution and high energy consumption and cannot meet the application requirements of existing data center networks, the present invention provides a method for intelligent traffic scheduling based on deep reinforcement learning. By using a deep deterministic policy gradient (DDPG) in deep reinforcement learning as the energy-saving traffic scheduling framework, the convergence efficiency is improved. Flows are divided into elephant flows and mice flows for dynamic energy-saving scheduling, thus effectively improving the energy-saving percentage and the network performance in terms of delay, throughput and packet loss rate, and demonstrating the important application value of the present invention in energy saving of data center networks.
In order to achieve the above purpose, the technical scheme of the present invention is implemented as follows: Provided is a method for intelligent traffic scheduling based on deep reinforcement learning, comprising:
step I: collecting flows in a data center network topology in real time, and dividing the flows into elephant flow or mice flow according to different types of flow features;
step II: establishing traffic scheduling models with energy saving and performance of the elephant flow and the mice flow as targets for joint optimization based on the elephant flow/mice flow existing in a network traffic;
step III: establishing a deep deterministic policy gradient (DDPG) intelligent routing traffic scheduling framework based on convolutional neural network (CNN) improvement, and performing environment interaction based on the environmental perception and deep learning decision-making ability of deep reinforcement learning;
step IV: state mapping: collecting state messages of a link transmission rate, a link utilization rate and a link energy consumption in a data plane, and jointly inputting the three state messages as a state set into the CNN for training;
step V: action mapping: setting an action as a comprehensive weight of energy saving and performance of each path under the condition of uniform transmission of flows in time and space according to a network state and reward value feedback information, and selecting transmission paths for the elephant flow or the mice flow according to the weight; and
step VI: reward value mapping: designing reward value functions for the elephant flow and the mice flow according to a network energy saving and performance effect of the link.
In the step I, information data of a link bandwidth, a delay, a throughput and a network traffic in the network topology are collected in real time; if a bandwidth demand of a current traffic exceeds 10% of the link bandwidth, the flow is determined as the elephant flow, and otherwise the flow is determined as the mice flow.
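As a purely illustrative sketch of the classification rule described above, the following Python example compares a flow's bandwidth demand with 10% of the link bandwidth; the Flow class, its field name and the bandwidth units are assumptions introduced only for this example.

```python
# Illustrative sketch only: classify a flow as elephant or mice using the
# 10%-of-link-bandwidth rule described above. The Flow class, its field name and
# the bandwidth units (Mbps) are assumptions introduced for this example.
from dataclasses import dataclass

@dataclass
class Flow:
    bandwidth_demand: float  # current bandwidth demand of the flow, in Mbps (assumed unit)

def classify_flow(flow: Flow, link_bandwidth: float) -> str:
    """Return 'elephant' if the flow's demand exceeds 10% of the link bandwidth, else 'mice'."""
    return "elephant" if flow.bandwidth_demand > 0.10 * link_bandwidth else "mice"

# Example: on a 1000 Mbps link, a 150 Mbps flow is an elephant flow.
print(classify_flow(Flow(bandwidth_demand=150.0), link_bandwidth=1000.0))  # -> elephant
```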
An optimization target min ϕelephant of the traffic scheduling model of the elephant flow is: min ϕelephant = ηPowertotal′ + τLosselephant′ + ρ/Throughtelephant′;
an optimization target min ϕmice of the traffic scheduling model of the mice flow is: min ϕmice = ηPowertotal′ + τLossmice′ + ρDelaymice′;
in the formula, η, τ and ρ represent energy saving and performance parameters of the data plane, and η, τ and ρ are all between 0 and 1; Powertotal′ is a normalization result of total network energy consumption Powertotal in a network traffic transmission process; Losselephant′ is a normalization result of an average packet loss rate Losselephant of the elephant flow; Throughtelephant′ is a normalization result of an average throughput Throughtelephant of the elephant flow; Lossmice′ is a normalization result of an average packet loss rate Lossmice of the mice flow; Delaymice′ is a normalization result of an average end-to-end delay Delaymice of the mice flow;
traffic transmission constraint for both the traffic scheduling model of the elephant flow and the traffic scheduling model of the mice flow is:
in the formula, ci is a traffic size of a flow in a transmission interval from start time p′i to end time q′i; u is a sending node of the flow; v is a receiving node of the flow; Γ(u) is a neighbor node set of the sending node u; fiuv is a flow sent by the node u; fivu is flow received by the node v; si represents a source node of the flow; and di represents a destination node of the flow.
The total network energy consumption Powertotal in the network traffic transmission process is:
in the formula, p′i and q′i respectively represent the start time and the end time of the flow in an actual transmission process; Eα represents a set of active links, i.e., links with traffic transmission; e is an element in the link set; P represents the total number of transmitted network flows in a current link; sj(t) is a transmission rate of a single network flow; i refers to the ith network flow; j refers to the jth network flow; σ represents an energy consumption of the link in an idle state; μ represents a link rate correlation coefficient; α represents a link rate correlation index and α > 1; (re1 + re2)^α > re1^α + re2^α, wherein re1 and re2 are respectively link transmission rates of the same link at different times or of different links; 0 ≤ re(t) ≤ βR, wherein β is a link redundancy parameter in a range of (0, 1), and R is the maximum transmission rate of the link;
a network topology structure of the data center is a set G=(V,E,C), wherein V represents a node set of the network topology; E represents a link set of the network topology; C represents a capacity set of each link; an elephant flow set transmitted in the network topology is Flowelephant={fm|m∈N+}, and a mice flow set is Flowmice={fn|n∈N+}, wherein m represents the number of elephant flows; n represents the number of mice flows; N+ represents a positive integer set; in flow fi=(si,di,pi,qi,ri), si represents a source node of the flow; di represents a destination node of the flow; pi represents the start time of the flow; qi represents the end time of the flow; ri represents a bandwidth demand of the flow;
the average packet loss rate of the elephant flow is Losselephant = (1/m)Σ loss(fi), fi∈Flowelephant;
the average throughput of the elephant flow is Throughtelephant = (1/m)Σ throught(fi), fi∈Flowelephant;
the average end-to-end delay of the mice flow is Delaymice = (1/n)Σ delay(fj), fj∈Flowmice;
the average packet loss rate of the mice flow is Lossmice = (1/n)Σ loss(fj), fj∈Flowmice;
wherein delay( ) is an end-to-end delay function in the network topology; loss( ) is a packet loss rate function; throught( ) is a throughput function;
and the normalization results are Powertotal′, Losselephant′, Throughtelephant′, Lossmice′ and Delaymice′, obtained by normalizing Powertotal, Losselephant, Throughtelephant, Lossmice and Delaymice, respectively.
In the DDPG intelligent routing traffic scheduling framework based on CNN improvement, a conventional neural network in the DDPG is replaced with the CNN, such that a CNN update process is merged with an online network and a target network in the DDPG.
An update process of the online network and the target network in the DDPG and an interaction process with the environment are as follows:
firstly, updating the online network, the online network comprising an Actor online network and a Critic online network, wherein the Actor online network generates a current action αt=μ(st|θμ), i.e., a link weight set, according to a state st of the link transmission rate, the link utilization rate and the link energy consumption and a random initialization parameter θμ, and interacts with the environment to acquire a reward value rt and a next state st+1; the state st and the action αt are jointly input into the Critic online network, and the Critic online network iteratively generates a current action value function Q(st,αt|θQ), wherein θQ is a random initialization parameter; the Critic online network provides gradient information grad[Q] for the Actor online network and helps the Actor online network to update the network; and
then updating the target network, wherein the Actor target network selects a next-time state st+1 from an experience replay buffer tuple (st,αt,rt,st+1), and obtains a next optimal action αt+1=μ′(st+1) through iterative training, wherein μ′ represents a deterministic behavior policy function; the network parameter θμ′ is obtained by regularly copying the Actor online network parameter θμ; the action αt+1 and the state st+1 are jointly input into the Critic target network; the Critic target network performs iterative training to obtain a target value function Q′(st+1, μ′(st+1|θμ′)|θQ′); the parameter θQ′ is obtained by regularly copying the Critic online network parameter θQ.
The Critic online network updates the network parameters by minimizing a calculation error through an error equation, and the error is L = (1/N)Σt[yt − Q(st,αt|θQ)]²,
wherein yt is a target return value calculated by the Critic target network; L is a mean square error; N is the number of random samples from the experience replay buffer.
The Critic target network provides the target return value yt=rt+γQ′(st+1, μ′(st+1|θμ′)|θQ′) for the Critic online network, and γ represents a discount factor.
The action set in the step V is Action={αw1, αw2, . . . αwi, . . . , αwz}, wi∈W;
wherein W is an optional transmission path set of network traffic; wi represents the ith path in the optional transmission path set; αwi represents an action value in the action set and refers to a path weight value of the ith path; z represents the total number of optional transmission paths;
if the network traffic is detected to be the elephant flow, the traffic is transmitted in a multipath manner, and the elephant flow is distributed according to proportions of different link weights in a total link weight;
if the network traffic is detected to be the mice flow, the traffic is transmitted in a single-path manner; a path with a large link weight is selected as a traffic transmission path, i.e., a path with the maximum link weight is selected as a transmission path for the mice flow through the action set.
An implementation method of the step IV comprises: mapping state elements in the state set into a state feature of the CNN; and selecting a link transmission rate sLR, a link utilization rate sLU and a link energy consumption sLE of each link as the state elements.
The proportion calculation method comprises: in a traffic transmission from the source node s to the target node d through n paths, calculating a traffic distribution proportion αwi/(αw1+αw2+ . . . +αwn) of each path from the source node s to the target node d, i.e., the weight of the ith path divided by the sum of the weights of the n paths.
The reward value function of the elephant flow is: η/Powertotal′ + τ/Losselephant′ + ρThroughtelephant′;
the reward value function of the mice flow is: η/Powertotal′ + τ/Lossmice′ + ρ/Delaymice′;
wherein the sum of reward value factor parameters η, τ and ρ is 1; Powertotal′ is a normalization result of the total network energy consumption Powertotal in the flow transmission process; Losselephant′ is a normalization result of the average packet loss rate Losselephant of the elephant flow; Throughtelephant′ is a normalization result of the average throughput Throughtelephant of the elephant flow; Lossmice′ is a normalization result of the average packet loss rate Lossmice of the mice flow; Delaymice′ is a normalization result of the average end-to-end delay Delaymice of the mice flow.
Compared with the prior art, the present invention has the following beneficial effects: In order to jointly optimize the network energy saving and performance of a data plane on the basis of a software defined network technology, scheduling energy saving and performance optimization models for elephant flow and mice flow are designed. The DDPG in deep reinforcement learning is used as an energy-saving traffic scheduling framework, and a CNN is introduced in the DDPG training process to achieve continuous traffic scheduling and optimization for energy saving and performance. The present invention achieves better convergence efficiency by adopting the DDPG based on CNN improvement. By combining environmental features such as the link transmission rate, the link utilization rate and the link energy consumption in the data plane, the present invention divides the flows into elephant flows and mice flows for traffic scheduling, and takes the energy saving and packet loss rate of traffic transmission as targets for joint optimization according to the high-throughput demand of the elephant flow and the low-delay demand of the mice flow, such that the flows are uniformly transmitted in time and space. Compared with the routing algorithm DQN-EER, the energy saving percentage is increased by 13.93%. Compared with the routing algorithm EARS, the delay is reduced by 13.73%, the throughput is increased by 10.91% and the packet loss rate is reduced by 13.51%.
In order to more clearly illustrate the technical solutions in the embodiments of the present invention or in the prior art, the drawings required to be used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the description below are some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings according to the drawings provided herein without creative efforts.
The technical schemes in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention but not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort shall fall within the protection scope of the present invention.
To address the problems that existing routing algorithms optimize routing only for the quality of network service and the user experience quality while ignoring the energy consumption of a data center network, the present invention provides a method for intelligent traffic scheduling based on deep reinforcement learning, and the flow of the method is shown in the drawings.
Step I: collecting data flows in a data center network topology in real time, and dividing the data flows into elephant flow or mice flow.
Step II: establishing intelligent traffic scheduling models with energy saving and performance as targets for joint optimization based on the elephant flow/mice flow existing in a network traffic.
The present invention takes traffic scheduling of a data center as an example. The network traffic in the conventional data center adopts unified traffic scheduling, without distinguishing elephant flow and mice flow, which inevitably causes the problems of low scheduling instantaneity, unbalanced resource distribution, high energy consumption and the like. In order to ensure the balance of traffic in user services, the present invention further divides the traffic into elephant flow/mice flow for dynamic scheduling. Therefore, according to different types of traffic features, different optimization methods are established for the elephant flow and the mice flow so as to achieve intelligent traffic scheduling of the elephant flow and the mice flow.
In the present invention, when the network topology of the data center is confirmed, and activation and dormancy of the links and the switches are clear, energy saving traffic scheduling is performed. On this basis, a network energy consumption model can be simplified into a link rate level energy consumption model, and a link power consumption function is recorded as Power(re), wherein re(t) is a link transmission rate. The calculation process is as shown in formula (1).
Power(re) = σ + μre^α(t), 0 ≤ re(t) ≤ βR (1)
In the formula, σ represents an energy consumption of the link in an idle state; μ represents a link rate correlation coefficient; α represents a link rate correlation index and α > 1; (re1 + re2)^α > re1^α + re2^α, wherein re1 and re2 are respectively link transmission rates of the same link at different times or of different links; Power(·) can be superimposed; β is a link redundancy parameter in a range of (0,1), and R is the maximum transmission rate of the link. Therefore, it can be seen from formula (1) that the link energy consumption is minimized when the traffic is uniformly transmitted in time and space. A calculation process of the total network energy consumption Powertotal in the network traffic transmission process is shown in formula (2).
In the formula, p′i and q′i respectively represent the start time and the end time of the flow in an actual transmission process; Eα represents a set of active links, i.e., links with traffic transmission; e is an element in the link set, which can be used as one edge in the network topology; P represents the total number of transmitted network flows in a current link; sj(t) is a transmission rate of a single network flow; i refers to the ith network flow; and j refers to the jth network flow.
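The following minimal sketch illustrates the link power model of formula (1) and why, with α > 1, transmitting the same amount of traffic uniformly in time consumes less energy than transmitting it in a burst; the concrete values of σ, μ and α are assumptions chosen only for demonstration.

```python
# Illustrative sketch of the link power model Power(re) = sigma + mu * re**alpha of
# formula (1). The concrete values of sigma, mu and alpha are assumptions chosen only
# for demonstration; the model itself only requires alpha > 1.
sigma, mu, alpha = 1.0, 1.0, 2.0   # idle power, rate coefficient, rate exponent

def link_power(rate: float) -> float:
    return sigma + mu * rate ** alpha

# Because alpha > 1, (r1 + r2)**alpha > r1**alpha + r2**alpha, so spreading the same
# total traffic uniformly over time (or links) lowers the rate-dependent energy.
burst = link_power(8.0) + link_power(0.0)      # all traffic in one interval -> 66.0
uniform = link_power(4.0) + link_power(4.0)    # same traffic spread evenly  -> 34.0
print(burst, uniform)
```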
The network topology structure of the data center is defined as a set G=(V,E,C), wherein V represents a node set of the network topology; E represents a link set of the network topology; C represents a capacity set of each link. It is assumed that the elephant flow set transmitted in the network topology is Flowelephant={fm|m∈N+}, and the mice flow set is Flowmice={fn|n∈N+}, wherein m represents the number of elephant flows and n represents the number of mice flows. In flow fi=(si,di,pi,qi,ri), si represents a source node of the flow; di represents a destination node of the flow; pi represents the start time of the flow; qi represents the end time of the flow; ri represents a bandwidth demand of the flow. An end-to-end delay in the network topology is recorded as delay(x); a packet loss rate is recorded as loss(x); a throughput is recorded as throught(x); and x represents a variable, which refers to the network flow. Calculation processes of an average packet loss rate Losselephant and an average throughput Throughtelephant of the elephant flow and an average end-to-end delay Delaymice and an average packet loss rate Lossmice of the mice flow are respectively shown in formulas (3), (4), (5) and (6).
The optimization target of the present invention is the energy saving and performance routing traffic scheduling of the data plane. Main optimization targets include: (1) a weighted minimum of the network energy consumption, the average packet loss rate of the elephant flow and the reciprocal of the average throughput of the elephant flow; and (2) a weighted minimum of the network energy consumption, the average packet loss rate of the mice flow and the average end-to-end delay of the mice flow. In order to simplify the calculation method, dimensional expressions are converted into dimensionless quantities, i.e., the energy saving and performance parameters of the data plane are normalized. Calculation processes are shown in formulas (7), (8), (9), (10) and (11).
In the formulas, Powertotal, Losselephant, Throughtelephant, Lossmice and Delaymice are normalized to obtain Powertotal′, Losselephant′, Throughtelephant′, Lossmice′ and Delaymice′, respectively.
After the normalization is completed, network energy saving and performance optimization targets min ϕelephant and min ϕmice for elephant flow and mice flow scheduling are established, and the calculation processes are shown in formulas (12) and (13).
In the formula, η, τ and ρ represent energy saving and performance parameters of the data plane, and η, τ and ρ are all between 0 and 1. In order to ensure that the above traffic scheduling process is not affected by the environment, in the present invention, traffic transmission constraints are defined as shown in formulas (14) and (15).
In the formula, ci is a traffic size of a flow in a transmission interval from start time p′i to end time q′i; u is a sending node of the flow; v is a receiving node of the flow; Γ(u) is a neighbor node set of the sending node u; fiuv is a flow sent by the node u; fivu is flow received by the node v. si represents a source node of the flow and di represents a destination node of the flow.
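As a rough illustration of how the optimization targets of formulas (12) and (13) can be evaluated once the metrics are measured, the sketch below combines normalized energy and performance values into the two weighted objectives; the min-max normalization scheme, the sample metric values and the weight values η, τ and ρ are assumptions made only for this example.

```python
# Rough illustration of evaluating the weighted objectives of formulas (12) and (13).
# The min-max normalization, the sample metric values and the weights eta/tau/rho are
# assumptions for this example only.
def normalize(x, lo, hi):
    """Assumed min-max normalization to [0, 1]."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

# Assumed measured values for one scheduling interval (placeholder numbers).
power_total = normalize(520.0, 0.0, 1000.0)     # total network energy consumption
loss_elephant = normalize(0.02, 0.0, 0.10)      # average packet loss rate of elephant flows
thrpt_elephant = normalize(850.0, 0.0, 1000.0)  # average throughput of elephant flows
loss_mice = normalize(0.01, 0.0, 0.10)          # average packet loss rate of mice flows
delay_mice = normalize(3.0, 0.0, 20.0)          # average end-to-end delay of mice flows

eta, tau, rho = 0.4, 0.3, 0.3                   # energy-saving / performance weights (assumed)

phi_elephant = eta * power_total + tau * loss_elephant + rho / max(thrpt_elephant, 1e-6)
phi_mice = eta * power_total + tau * loss_mice + rho * delay_mice
print(phi_elephant, phi_mice)                   # smaller values indicate a better schedule
```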
Step III: establishing a deep deterministic policy gradient (DDPG) intelligent routing traffic scheduling framework based on convolutional neural network (CNN) improvement according to the environmental perception and deep learning decision-making ability of deep reinforcement learning.
In the present invention, a conventional neural network in the DDPG is replaced with a CNN, such that a CNN update process is merged with an online network and a target network in the DDPG, and the system convergence efficiency can be effectively improved by utilizing the high-dimensional data processing advantage of the CNN. The DDPG uses a Fat Tree network topology structure as a data center network environment. The DDPG intelligent routing traffic scheduling framework based on CNN improvement is shown in the drawings, and its update process and interaction with the environment are as follows.
Firstly, updating the online network: the online network consists of an Actor online network and a Critic online network, wherein the Actor online network generates a current action αt=μ(st|θμ), i.e., a link weight set, according to a state st of the link transmission rate, the link utilization rate and the link energy consumption and a random initialization parameter θμ, and interacts with the environment to acquire a reward value rt and a next state st+1. The state st and the action αt are jointly input into the Critic online network, and the Critic online network iteratively generates a current action value function Q(st,αt|θQ), wherein θQ is a random initialization parameter. The Critic online network provides gradient information grad[Q] for the Actor online network and helps the Actor online network to update the network. In addition, the Critic online network updates its network parameters by minimizing a calculation error L = (1/N)Σt[yt − Q(st,αt|θQ)]²,
wherein yt is a target return value calculated by the Critic target network; L is a mean square error; N is the number of random samples from the experience replay buffer.
Secondly, updating the target network: the Actor target network selects a next-time state st+1 from an experience replay buffer tuple (st,αt,rt,st+1), and obtains a next optimal action αt+1=μ′(st+1) through iterative training, wherein μ′ represents a deterministic behavior policy function; the network parameter θμ′ is obtained by regularly copying the Actor online network parameter θμ; the action αt+1 and the state st+1 are jointly input into the Critic target network; the Critic target network performs iterative training to obtain a target value function Q′(st+1, μ′(st+1|θμ′)|θQ′); the parameter θQ′ is obtained by regularly copying the Critic online network parameter θQ. The Critic target network provides the target return value yt = rt + γQ′(st+1, μ′(st+1|θμ′)|θQ′) for the Critic online network, and γ represents a discount factor. The DDPG training process is completed after the Actor-Critic online networks and target networks are updated.
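For readability, a compact PyTorch sketch of the Actor-Critic update loop described above is given below. The convolution layout, layer widths, learning rates, batch size and the soft target-update rate are assumptions for illustration (the text describes regularly copying parameters to the target networks; a soft update is used here as a common stand-in); only the online/target structure, the target return yt = rt + γQ′(st+1, μ′(st+1)), the mean-square Critic error and the copying of parameters to the target networks follow the description.

```python
# Compact PyTorch sketch of the CNN-improved Actor-Critic update described above.
# Convolution layout, layer widths, learning rates, batch size and the soft target-update
# rate are assumptions for illustration; the text describes regularly copying parameters
# to the target networks, and a soft update is used here as a common stand-in.
import copy
import torch
import torch.nn as nn

N_LINKS, N_PATHS = 16, 4          # assumed numbers of links (state width) and candidate paths (actions)

class CNNActor(nn.Module):
    """Maps the 3-channel link state to a set of path weights in (0, 1)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(8 * N_LINKS, N_PATHS), nn.Sigmoid())
    def forward(self, s):          # s: (batch, 3, N_LINKS)
        return self.net(s)

class CNNCritic(nn.Module):
    """Maps a (state, action) pair to a scalar Q value."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.Flatten())
        self.head = nn.Sequential(nn.Linear(8 * N_LINKS + N_PATHS, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s, a):
        return self.head(torch.cat([self.conv(s), a], dim=1))

actor, critic = CNNActor(), CNNCritic()
actor_tgt, critic_tgt = copy.deepcopy(actor), copy.deepcopy(critic)   # target networks
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma, tau_copy = 0.99, 0.01      # discount factor and (assumed) soft target-copy rate

def ddpg_update(s, a, r, s_next):
    """One update on a batch (s, a, r, s_next) sampled from the experience replay buffer."""
    with torch.no_grad():          # target return y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
        y = r + gamma * critic_tgt(s_next, actor_tgt(s_next))
    critic_loss = nn.functional.mse_loss(critic(s, a), y)    # mean square error over the N samples
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()                 # policy gradient via the Critic's Q estimate
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):   # copy parameters to the target networks
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1 - tau_copy).add_(tau_copy * p.data)

# Example call with a random placeholder batch of 32 transitions.
B = 32
ddpg_update(torch.rand(B, 3, N_LINKS), torch.rand(B, N_PATHS),
            torch.rand(B, 1), torch.rand(B, 3, N_LINKS))
```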
Step IV: state mapping: collecting state messages of a link transmission rate, a link utilization rate and a link energy consumption in a data plane, and jointly inputting the three state features as a state set statet = {sLR, sLU, sLE} into the CNN for training.
In the present invention, energy saving and network performance of the data plane are used as targets for joint optimization, which are mainly related to the link transmission rates, the link utilization rates and the link energy consumption information at the current time and historical times. It is assumed that there are m links. In the present invention, the three state features are jointly used as a state set statet = {sLR, sLU, sLE}, wherein sLR, sLU and sLE respectively denote the link transmission rates, the link utilization rates and the link energy consumption of the m links, and the state set is input into the CNN for training.
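A small sketch of the state mapping described above: the three per-link measurements are stacked into a three-channel input for the CNN. The number of links m and the random values are placeholders.

```python
# Small sketch of the state mapping: the three per-link measurements are stacked into a
# three-channel CNN input. m = 16 links and the random values are placeholders.
import torch

m = 16
s_LR = torch.rand(m)   # link transmission rate of each link (normalized, placeholder)
s_LU = torch.rand(m)   # link utilization rate of each link (placeholder)
s_LE = torch.rand(m)   # link energy consumption of each link (normalized, placeholder)

state_t = torch.stack([s_LR, s_LU, s_LE]).unsqueeze(0)   # shape (1, 3, m): one CNN input sample
print(state_t.shape)   # torch.Size([1, 3, 16])
```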
Step V: action mapping: setting actions of the elephant flow and the mice flow as a comprehensive weight of energy saving and performance of each link under the condition of uniform transmission of flows in time and space.
The present invention sets the actions as a comprehensive weight of performance and energy saving of each link under the condition of uniform transmission of flows in time and space according to a network state and reward value feedback information. A specific action set is shown in formula (16).
Action = {αw1, αw2, . . . , αwi, . . . , αwz}, wi∈W (16)
In the formula, W is an optional transmission path set of network traffic; wi represents the ith path in the optional transmission path set; αwi represents an action value in the action set and refers to a path weight value of the ith path; z represents the total number of optional transmission paths. In the present invention, flows are divided into the elephant flow and the mice flow for traffic scheduling. As such, if the controller (arranged in the control plane) detects that the network traffic is an elephant flow, the traffic transmission is conducted in a multipath manner, and the elephant flow is distributed according to proportions of different link weights in a total link weight. For example, a traffic transmission may be conducted from a certain source node s to a target node d through n paths; in this case, a traffic distribution proportion αwi/(αw1+αw2+ . . . +αwn) of each path from the source node s to the target node d can be calculated, i.e., the weight of each path divided by the sum of the weights of the n paths.
if the controller detects that the network traffic is the mice flow, the traffic is transmitted in a single-path manner. A path with a large link weight is preferred as the traffic transmission path, i.e., the path with the maximum link weight is selected from the action set {αw1, αw2, . . . , αwi, . . . , αwz} as the transmission path for the mice flow.
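A short sketch of the action mapping described above: an elephant flow is split over the candidate paths in proportion to their weights, while a mice flow uses only the single path with the largest weight. The path weights shown are illustrative values.

```python
# Sketch of the action mapping: an elephant flow is split across candidate paths in
# proportion to their weights; a mice flow uses only the path with the largest weight.
# The path weights are illustrative values.
def split_elephant(weights):
    """Traffic share of each path: weight_i / sum of all path weights."""
    total = sum(weights)
    return [w / total for w in weights]

def route_mice(weights):
    """Index of the single path with the maximum weight."""
    return max(range(len(weights)), key=lambda i: weights[i])

action = [0.6, 0.3, 0.1]       # path weights produced by the agent for 3 candidate paths
print(split_elephant(action))  # [0.6, 0.3, 0.1] -> 60% / 30% / 10% of the elephant flow
print(route_mice(action))      # 0 -> the mice flow is sent over path 0
```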
Step VI: reward value mapping: designing reward value functions or reward value accumulation standards for the elephant flow and the mice flow according to a network energy saving and performance effect of the link.
In consideration of the features of different data flows, the reward value functions of the elephant flow and the mice flow are set. Main optimization targets of the elephant flow are low energy consumption, low packet loss rate and high throughput. As such, values of normalized energy consumption, packet loss rate and throughput are used as reward value factors. A smaller optimization target indicates a larger reward value. In order to directly read accumulated reward value gains, reciprocals of the energy consumption and the packet loss rate are selected as reward value factors during setting of a reward value. A specific calculation process is shown in formula (17).
In the formula, the reward value factor parameters η, τ and ρ are all between 0 and 1, including 0 and 1. Each parameter represents the weight of the corresponding term in the formula and can be selected according to the relative importance of the energy consumption, the packet loss rate and the throughput for the elephant flow. Similarly, the mice flow takes low energy consumption, low packet loss rate and low delay as the optimization targets, and reciprocals of the three normalized elements are used as reward value factors. A specific calculation process is shown in formula (18).
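A small sketch of the reward shaping described above for formulas (17) and (18): reciprocals are used for the terms to be minimized, so a smaller optimization target yields a larger reward. The epsilon guard against division by zero and the weight values are assumptions made for this example.

```python
# Sketch of the reward shaping for elephant and mice flows: reciprocals are used for the
# terms to be minimized, so a smaller optimization target gives a larger reward. The EPS
# guard and the weight values are assumptions for this example.
EPS = 1e-6

def reward_elephant(power_n, loss_n, throughput_n, eta=0.4, tau=0.3, rho=0.3):
    """power_n, loss_n, throughput_n: normalized energy, packet loss rate and throughput."""
    return eta / (power_n + EPS) + tau / (loss_n + EPS) + rho * throughput_n

def reward_mice(power_n, loss_n, delay_n, eta=0.4, tau=0.3, rho=0.3):
    """power_n, loss_n, delay_n: normalized energy, packet loss rate and end-to-end delay."""
    return eta / (power_n + EPS) + tau / (loss_n + EPS) + rho / (delay_n + EPS)

# Placeholder inputs: lower power/loss/delay and higher throughput -> larger reward.
print(reward_elephant(0.5, 0.2, 0.9))
print(reward_mice(0.5, 0.1, 0.15))
```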
After the training is converged, the method further tests the convergence, the energy saving percentage, the delay, the throughput, the packet loss rate and the like of the system.
In order to test the energy saving and network performance advantages of the method for intelligent traffic scheduling disclosed herein, in the testing process, the present invention is compared with an existing energy-saving routing algorithm, a high-performance intelligent routing algorithm and a heuristic energy-saving routing algorithm. An energy-saving effect evaluation index based on lpi and lpfull is adopted,
wherein lpi represents the network link energy consumption consumed by the current routing algorithm, and lpfull is the total link energy consumption consumed under a full load of the link. In order to test the energy saving and network performance effects of the present invention in a real network scenario, network load environments with different traffic intensities are set in the test process. The network energy consumption, the delay, the throughput and the packet loss rate are used as optimization targets. In the process of testing energy saving, the parameter weight η is set as 1, and the parameter weights τ and ρ are set as 0.5. In the process of testing performance, the parameter weight η is set as 0.5, and the parameter weights τ and ρ are set as 1; in the energy consumption function, α is set as 2, and μ is set as 1; and periodic traffics are set as 20%, 40%, 60% and 80%. Test results are shown in the drawings.
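As an illustration of the energy-saving evaluation index built from lpi and lpfull, the sketch below assumes the index is the fraction of full-load link energy avoided by the current routing algorithm; this (1 − lpi/lpfull) form is an assumption consistent with the definitions above, not a formula reproduced from the text.

```python
# Sketch of an energy-saving evaluation index built from lp_i and lp_full. The
# (1 - lp_i / lp_full) form is an assumption consistent with the definitions above,
# not a formula reproduced from the text.
def energy_saving_percentage(lp_i: float, lp_full: float) -> float:
    """Percentage of full-load link energy avoided by the current routing algorithm."""
    return (1.0 - lp_i / lp_full) * 100.0

print(energy_saving_percentage(lp_i=620.0, lp_full=1000.0))  # -> 38.0 (% saved)
```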
The above mentioned contents are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent substitution, improvement, etc., made within the spirit and principle of the present invention shall all fall within the scope of protection of the present invention.