The present disclosure belongs to the technical field of network management and control, and particularly relates to a reinforcement learning agent training method, a modal bandwidth resource scheduling method and an apparatus.
In a polymorphic smart network, a variety of network protocols run at the same time, and each technology system is a network modal. The network modals share network resources. If they are not well managed or controlled, they compete directly for network resources such as bandwidth, which directly affects the communication transmission quality of some key modals. Therefore, reasonable control of each modal in the network is one of the necessary prerequisites for the stable operation of a polymorphic smart network.
At present, the prevailing technology for the above requirements is to control the proportion of bandwidth used in switch ports and limit the size of traffic at the export to avoid network overload.
In the process of implementing the present disclosure, the inventor found that the prior art has at least the following problem:
Static strategies (such as capping the bandwidth usage ratio at a fixed maximum) cannot adapt to dynamic changes of the network modals. In an actual network, the traffic of individual modals is very likely to increase due to business changes, at which point the original static strategy is no longer applicable.
It is an object of the embodiment of the application to provide a reinforcement learning agent training method, a modal bandwidth resource scheduling method and an apparatus, so as to solve the technical problem that modal resources in a polymorphic smart network cannot be intelligently controlled in the related art.
A first aspect of an embodiment of the present disclosure provides a reinforcement learning agent training method in a polymorphic smart network, including:
S11, constructing a state of a global network characteristic, an action and a deep neural network model required for training the reinforcement learning agent, the deep neural network model including a new execution network, an old execution network and an action evaluation network;
S12, setting a maximum number of steps in a round of training;
S13, acquiring the state of the global network characteristic in each step, inputting the state of the global network characteristic into the new execution network, controlling Software Defined Network (SDN) switches to execute actions output by the new execution network, acquiring the state of the global network characteristic and reward values after the SDN switches execute the actions, and storing the actions, the reward values and the states in two periods before and after the actions are executed in an experience pool;
S14, updating network parameters of the action evaluation network according to all the reward values and the states before the actions are executed in the experience pool;
S15, assigning network parameters of the new execution network to the old execution network, and updating the network parameters of the new execution network according to all actions and the states before the actions are executed in the experience pool;
S16, repeating steps S13-S15 until the bandwidth occupied by each modal in the polymorphic smart network ensures the communication transmission quality without overloading a network export.
Further, the global network characteristic state includes a number of packets in each modal, an average packet size of each modal, an average delay of each flow, a number of packets in each flow, a size of each flow and an average packet size in each flow.
Further, the action is a sum of an average value and noises of action vectors selected under the state of the corresponding global network characteristics.
Further, the step of updating network parameters of the action evaluation network according to all the reward values and the states before the actions are executed in the experience pool includes:
Further, the step of updating the network parameters of the new execution network according to all actions and the states before the actions are executed in the experience pool includes:
A second aspect of an embodiment of the present disclosure provides a reinforcement learning agent training apparatus in a polymorphic smart network. The apparatus is applied to a reinforcement learning agent, the apparatus including:
A third aspect of an embodiment of the present disclosure provides a modal bandwidth resource scheduling method in a polymorphic smart network, including:
applying a reinforcement learning agent trained by the reinforcement learning agent training method in a polymorphic smart network according to any one of claims 1 to 5 to the polymorphic smart network;
A fourth aspect of an embodiment of the present disclosure provides an apparatus for scheduling modal bandwidth resources in a polymorphic smart network, including:
A fifth aspect of an embodiment of the present disclosure provides an electronic device, including:
A sixth aspect of an embodiment of the present disclosure provides a computer-readable storage medium on which computer instructions are stored, wherein the instructions, when executed by a processor, implement the steps of the reinforcement learning agent training method in a polymorphic smart network or the modal bandwidth resource scheduling method in a polymorphic smart network.
The technical solution provided by the embodiment of the application may have the following beneficial effects.
As can be seen from the above embodiments, the application uses the idea of reinforcement learning algorithms to construct global network characteristic states, execution actions and reward functions suitable for the polymorphic smart network, so that the reinforcement learning agent can continuously interact with the network and output the optimal execution actions according to changes in the network states and reward values. The allocation of polymorphic smart network resources thus meets expectations and the network operation performance is guaranteed, which has strong practical significance for promoting the intelligent management and control of the polymorphic smart network.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not intended to limit the present disclosure.
The attached drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with this application and serve to explain the principles of this application together with the description.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the attached drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application.
The terminology used in this application is for the purpose of describing specific embodiments only and is not intended to limit this application. The singular forms “a”, “said” and “the” used in this application and the appended claims are also intended to include the plural forms, unless the context clearly indicates other meaning. It should also be understood that the term “and/or” as used herein refers to and includes any or all possible combinations of one or more associated listed items.
It should be understood that although the terms “first”, “second”, “third”, etc. may be used in this application to describe various information, this information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other. For example, without departing from the scope of this application, the first information can also be called the second information, and similarly, the second information can also be called the first information. Depending on the context, the word “if” as used herein can be interpreted as “when” or “in case of” or “in response to a determination”.
S11, a state of a global network characteristic, an action and a deep neural network model required for training the reinforcement learning agent are constructed; the deep neural network model comprises a new execution network, an old execution network and an action evaluation network;
S12, a maximum number of steps is set in a round of training;
S13, the global network characteristic state is acquired in each step, the global network characteristic state is input into the new execution network, SDN switches are controlled to execute actions output by the new execution network, network states and reward values after the SDN switches execute the actions are acquired, and the actions, the reward values and the states in two periods before and after the actions are executed are stored in an experience pool;
S14, network parameters of the action evaluation network are updated according to all the reward values and the states before the actions are executed in the experience pool;
S15, network parameters of the new execution network are assigned to the old execution network, and the network parameters of the new execution network are updated according to all actions and the states before the actions are executed in the experience pool;
S16, steps S13-S15 are repeated until the bandwidth occupied by each modal in the polymorphic smart network ensures the communication transmission quality without overloading a network export.
As can be seen from the above embodiments, the application uses the idea of reinforcement learning algorithms to construct global network characteristic states, execution actions and reward functions suitable for the polymorphic smart network, so that the reinforcement learning agent can continuously interact with the network and output the optimal execution actions according to changes in the network states and reward values. The allocation of polymorphic smart network resources thus meets expectations and the network operation performance is guaranteed, which has strong practical significance for promoting the intelligent management and control of the polymorphic smart network.
In the concrete implementation of step S11, a state of a global network characteristic, an action and a deep neural network model required for training the reinforcement learning agent are constructed, and the deep neural network model includes a new execution network, an old execution network and an action evaluation network:
In an embodiment, the global network characteristic state includes the number of packets in each modal, the average packet size in each modal, the average delay in each flow, the number of packets in each flow, the size of each flow, and the average packet size in each flow. These characteristics constitute the global network state of the current time interval of Δt seconds. $s_t$ is used to represent the global network characteristics in the $t$th interval of Δt seconds.
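As a rough illustration of how such a state could be assembled, the following Python sketch aggregates per-packet records collected during one Δt interval into the per-modal and per-flow statistics listed above. The record layout (the keys `modal`, `flow`, `size` and `delay`) and the function name `build_state` are assumptions made for this example and are not part of the disclosure.

```python
from collections import defaultdict


def build_state(packets):
    """Aggregate packet records from one sampling interval into features.

    Each record is a dict with hypothetical keys:
    'modal' (str), 'flow' (str), 'size' (bytes), 'delay' (seconds).
    """
    modal_count = defaultdict(int)
    modal_bytes = defaultdict(int)
    flow_count = defaultdict(int)
    flow_bytes = defaultdict(int)
    flow_delay = defaultdict(float)

    for pkt in packets:
        modal_count[pkt["modal"]] += 1
        modal_bytes[pkt["modal"]] += pkt["size"]
        key = (pkt["modal"], pkt["flow"])
        flow_count[key] += 1
        flow_bytes[key] += pkt["size"]
        flow_delay[key] += pkt["delay"]

    # Per-modal features: packet count and average packet size.
    modal_features = {
        m: {"packets": n, "avg_packet_size": modal_bytes[m] / n}
        for m, n in modal_count.items()
    }
    # Per-flow features: packet count, flow size, average packet size, average delay.
    flow_features = {
        k: {
            "packets": n,
            "size": flow_bytes[k],
            "avg_packet_size": flow_bytes[k] / n,
            "avg_delay": flow_delay[k] / n,
        }
        for k, n in flow_count.items()
    }
    return modal_features, flow_features
```

In practice these dictionaries would still have to be flattened into a fixed-length numeric vector before being fed to a neural network; that step is omitted here.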
In an embodiment, the action is the sum of the average value and the noise of the action vectors selected under the corresponding global network characteristic state. $a_t$ represents the action in the $t$th interval of Δt seconds. The action adjusts the bandwidth of each flow and thereby schedules the resources occupied by each modal, so that the network communication quality meets the expected goal. The physical meaning of the action is the ratio of each flow to the export area in each modal. $P$ represents the number of modals running in the network; since a modal corresponds to a network protocol, the number of modals running in the network is assumed to be fixed. $F_m$ represents the maximum number of flows in each modal, so the dimension of the output action space is $P \times F_m$. $F(p,t)$ represents the number of flows of the $p$th modal in the $t$th interval of Δt seconds, which satisfies $F(p,t) < F_m$. Therefore, in the $t$th interval of Δt seconds, only $P \times F(p,t)$ elements have corresponding flows, each with a value of 0.1-1, while the other elements have no actual flows and take a value of 0.
In the concrete implementation, for convenience, the same architecture can be adopted for the new execution network, the old execution network and the action evaluation network; for example, a deep neural network, a convolutional neural network, a recurrent neural network or another architecture can be adopted. The parameters are initialized randomly after the networks are constructed.
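The fixed-size action layout described above can be sketched as follows. The helper `pack_action` and its argument names are hypothetical; the sketch only shows how per-modal flow ratios might be padded with zeros so that every modal occupies the same number of slots in the action vector.

```python
def pack_action(ratios_per_modal, max_flows):
    """Flatten per-modal flow ratios into a vector of length P * max_flows.

    ratios_per_modal: list of P lists; entry p holds the ratios of the
    flows currently active in modal p (at most max_flows of them).
    Slots without an actual flow are filled with 0.
    """
    action = []
    for ratios in ratios_per_modal:
        if len(ratios) > max_flows:
            raise ValueError("more active flows than the per-modal maximum")
        action.extend(ratios)
        # Pad the remaining slots of this modal with zeros.
        action.extend([0.0] * (max_flows - len(ratios)))
    return action
```

For example, with two modals and at most three flows per modal, `pack_action([[0.5, 1.0], [0.2]], 3)` yields a six-element vector in which the unused slots are zero.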
In the concrete implementation of step S12, the maximum number of steps in one round of training is set;
In an embodiment, the maximum number of steps T is set for each round of training. In practice, the value of T depends on the number of modals in the network and other factors, so a suitable value has to be found through repeated trials during training. For example, when the number of modals in the network is 8, T = 120 is found to be optimal after many attempts.
In the concrete implementation of step S13, in each step, the global network characteristic state is acquired and input into the new execution network, the SDN switches are controlled to execute the actions output by the new execution network, the global network characteristic state and the reward values after the actions are executed are acquired, and the actions, the reward values and the states before and after the actions are executed are stored in the experience pool;
In an embodiment, in each step, the reinforcement learning agent acquires, through a controller, the global network characteristics over a period of Δt seconds at a sampling interval of Δt seconds. The current network state $s_t$ is input into the new execution network, which outputs the mean value $\mu(s_t|\theta^{\mu})$ and the variance $N$ of the execution action based on the current parameter $\theta^{\mu}$, and the output execution action is expressed as
$$a_t = \mu(s_t|\theta^{\mu}) + N$$
where $\mu(s_t|\theta^{\mu})$ represents the average value of the action vectors selected by the reinforcement learning agent in a certain state $s_t$, $\theta^{\mu}$ represents the parameter of the new execution network, and $N$ represents the noise, which is a normal function that decays with time.
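The action selection above can be illustrated with a short Python sketch. The exponential decay schedule, the default values `initial=0.3` and `decay=0.995`, and the function names are assumptions chosen for illustration; the text only states that the noise is normal and decays with time.

```python
import random


def noise_scale(step, initial=0.3, decay=0.995):
    """Standard deviation of the exploration noise after `step` steps.

    An exponential decay is assumed here; the schedule is not specified
    in the description.
    """
    return initial * decay ** step


def select_action(mean_action, step, rng=random):
    """Return a_t = mu(s_t) + N, with N drawn from a zero-mean normal distribution."""
    sigma = noise_scale(step)
    return [mu + rng.gauss(0.0, sigma) for mu in mean_action]
```

Here `mean_action` stands for the output of the new execution network. A real implementation would probably also clamp each component to the valid range of the action, which this sketch leaves out.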
The SDN controller sets the bandwidth for each flow according to the proportion set in the execution action, converts it into an instruction recognizable by the SDN switches, and issues the configuration. The SDN switches receive the configuration and forward the flows of the various modals according to the configured bandwidth. If a flow needs more bandwidth than the configured bandwidth, part of the flow is randomly discarded to meet the allocated bandwidth.
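The random discarding described above could look roughly like the following Python sketch. The function `police_flow`, its arguments, and the per-packet keep probability are assumptions for illustration; the sketch keeps the forwarded volume within the limit only on average and is not a model of how a particular switch enforces bandwidth.

```python
import random


def police_flow(packet_sizes, limit_bytes, rng=random):
    """Randomly discard packets so the forwarded volume fits the limit on average.

    packet_sizes -- sizes of the packets offered by one flow in an interval
    limit_bytes  -- volume the flow may forward in that interval (hypothetical unit)
    """
    offered = sum(packet_sizes)
    if offered <= limit_bytes:
        # The flow stays within its configured bandwidth; nothing is dropped.
        return list(packet_sizes)
    # Keep each packet with probability limit / offered.
    keep_prob = limit_bytes / offered
    return [size for size in packet_sizes if rng.random() < keep_prob]
```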
The reinforcement learning agent obtains the new state $s_{t+1}$ and the reward value $r_t$ of the network after the action is executed, and stores $(s_t, a_t, r_t, s_{t+1})$ in the experience pool. For a round of training, the reinforcement learning agent goes through the process of step S13 T times, during which the network parameters are not updated, where the reward value $r_t$ is the value of the reward function calculated by the reinforcement learning agent. The reward function is defined as follows
where $\eta_p$ is the weight coefficient of the $p$th modal, the value of which is set manually according to the network operation quality target,
is the velocity of the ith flow in the pth modal in the tth Δt seconds, which can be obtained from the global network characteristic state. βp(i,t) is the proportion of the ith flow in the pth modal reaching the server in the tth Δt seconds, which can be obtained from the execution action. ξ is the upper limit of the flows that can be carried by the export area during normal operation.
The above reward function allocates appropriate bandwidth according to the communication transmission situation of the different modals in the network, while avoiding the network overload caused by modals preempting network resources. For bandwidth allocation, the proportion of flows arriving at the server in each modal characterizes the transmission of that modal. If the transmission of a modal is congested, the reward function pushes subsequent actions to allocate more bandwidth to that modal, even if its weight coefficient is not high or the network as a whole is not congested for the moment. If multiple modals in the network are congested, the modal with the higher weight coefficient gets more bandwidth, which matches the actual need to give priority to more important communication services. To avoid network overload, a penalty value of −1 gives negative feedback to the previous action so that the allocated bandwidth is reduced. The above reward function can therefore ensure the normal operation of the network while dynamically adjusting the bandwidth resource allocation according to the transmission situation of each modal.
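The reward formula itself is not reproduced in the text above, so the following Python sketch shows only one hypothetical form that is consistent with the description: a weighted sum of the mean arrival proportion of each modal while the carried traffic stays within the export limit, and the penalty value −1 otherwise. The exact expression, the function name and the argument layout are assumptions and should not be read as the formula of the disclosure.

```python
def reward(weights, rates, ratios, capacity):
    """Hypothetical reward: weighted mean arrival ratio, or -1 on overload.

    weights[p]   -- weight coefficient of modal p
    rates[p][i]  -- rate of flow i in modal p during the interval
    ratios[p][i] -- proportion of flow i in modal p reaching the server
    capacity     -- upper limit of traffic the export area can carry
    """
    # Traffic actually carried through the export area (assumed: rate * proportion).
    carried = sum(
        v * b
        for modal_rates, modal_ratios in zip(rates, ratios)
        for v, b in zip(modal_rates, modal_ratios)
    )
    if carried > capacity:
        return -1.0  # penalty for overloading the export
    total = 0.0
    for w, modal_ratios in zip(weights, ratios):
        if modal_ratios:
            total += w * sum(modal_ratios) / len(modal_ratios)
    return total
```

With this form, a congested modal (low arrival proportions) lowers the reward in proportion to its weight, which pushes later actions toward giving it more bandwidth, while any action that exceeds the export limit is penalized.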
In the concrete implementation of step S14, the network parameters of the action evaluation network are updated according to all the reward values and the state before the action is executed in the experience pool;
In an embodiment, as shown in
Step S21, all the states in the experience pool before executing actions are input into the action evaluation network to obtain corresponding expected values;
In an embodiment, the state $s_t$ of each sample in the experience pool is input into the action evaluation network to obtain the corresponding expected value $V(s_t)$, $t = 1, 2, \ldots, T$. The expected value represents the evaluation of the network state at time $t$, that is, the instantaneous value of the current state for achieving the goal set by the reward function.
Step S22, the discount reward in the state before each action is calculated according to the expected value, the corresponding reward value and the preset decay rate;
In an embodiment, the discount reward for each $s_t$ is calculated as
$$R(t) = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots + \gamma^{T-1-t} r_{T-1} + \gamma^{T-t} V(s_T), \quad t = 1, 2, \ldots, T,$$
where $\gamma$ is the decay rate, which is set manually. Since each round of training goes through T steps, the long-term value of the current network state for the subsequent network state changes needs to be known in order to achieve the goal set by the reward function.
Step S23, the difference between the discount reward and the expected value is calculated, the mean square error is calculated according to all the differences, and the obtained mean square error is taken as the first loss value to update the network parameters of the action evaluation network;
In an embodiment, $R(t) - V(s_t)$, $t = 1, 2, \ldots, T$, is calculated for the samples, and the mean square error of these differences is taken as the first loss value for updating the parameters of the action evaluation network. This difference represents the gap between the instantaneous value and the long-term value; the gap is used to adjust the subsequent parameters of the action evaluation network and optimize the output execution action. The smaller the gap is, the closer the action network is to the optimum.
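Steps S21 to S23 can be sketched in Python as follows. The sketch assumes one particular reading: the bootstrapped return $r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t} V(s_T)$ is computed first by a backward recursion, and $V(s_t)$ is subtracted only once, when the difference of step S23 is formed. The function names and this reading are assumptions for illustration.

```python
def discounted_returns(rewards, bootstrap_value, gamma):
    """Backward recursion R_t = r_t + gamma * R_{t+1}, seeded with V(s_T).

    rewards         -- reward values r_t of one round, in time order
    bootstrap_value -- expected value V(s_T) of the final state
    gamma           -- decay rate
    """
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def critic_loss(returns, values):
    """Mean square error between discounted returns and expected values."""
    diffs = [ret - v for ret, v in zip(returns, values)]
    return sum(d * d for d in diffs) / len(diffs)
```

The recursion avoids recomputing powers of the decay rate for every step, and the resulting loss would then be minimized by whatever optimizer updates the action evaluation network.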
In the concrete implementation of step S15, the network parameters of the new execution network are assigned to the old execution network, and the network parameters of the new execution network are updated according to all actions and the states before the actions are executed in the experience pool;
In an embodiment, the parameters of the old and new execution networks need to be compared constantly, and the parameters of the execution networks updated, so that the output actions are continuously optimized until the parameters of the new execution network are optimal and the optimal actions are output.
In an embodiment, as shown in
Step S31, all the states before execution of the actions in the experience pool are input into the old execution network and the new execution network respectively to obtain an old execution action distribution and a new execution action distribution;
In an embodiment, $s_t$ in the samples stored in the experience pool is input into the old execution network and the new execution network to obtain the old execution action distribution and the new execution action distribution, both of which are normal distributions over actions. The old and new execution networks are based on the same neural network architecture and differ only in their parameters. The input of both networks is the network state sample $s_t$, and the output is the mean value $\mu(s_t|\theta^{\mu})$ and variance $N$ of the current optimal execution action. Since the probability distribution of actions is generally assumed to be normal, the old and new probability distributions of actions can be determined from the outputs of the two execution networks.
Step S32, a first probability and a second probability that each action in the experience pool appears in the corresponding old execution action distribution and new execution action distribution are calculated;
In an embodiment, a first probability $p_{old}(a_t)$ and a second probability $p_{new}(a_t)$ of each stored action $a_t$, $t = 1, 2, \ldots, T$, in the corresponding distribution are calculated; these two probabilities respectively represent the probability that the action stored in the experience pool is selected for execution by the old and new execution networks.
Step S33, the ratio of the second probability to the first probability is calculated;
In an embodiment,
$$\mathrm{ratio}_t = \frac{p_{new}(a_t)}{p_{old}(a_t)}, \quad t = 1, 2, \ldots, T$$
is calculated; the ratio represents the parameter difference between the old and new execution networks. If the parameters of the old and new execution networks are consistent, it means that the execution networks have been updated to the best. Since it is desirable that the parameters of the networks be continuously updated and optimized, the ratio is calculated to update the network parameters.
Step S34, all the ratios are multiplied by the corresponding differences and averaged to obtain a second loss value to update the network parameters of the new execution network;
In an embodiment, for $t = 1, 2, \ldots, T$, $\mathrm{ratio}_t$ is multiplied by $R(t) - V(s_t)$ and the products are averaged as the second loss value to update the parameters of the new execution network. $\mathrm{ratio}_t$ represents the updating direction of the action network, and $R(t) - V(s_t)$ represents the updating direction of the parameters of the evaluation network; since the optimization of the output execution action needs to be combined with the change of network state, the product of the two is used to update the parameters of the new execution network, so that the latest network state can be learned and an action suitable for the network state is output in the next step.
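Steps S31 to S34 can be illustrated with the following Python sketch for a single action component. It evaluates the normal probability density of each stored action under the old and new distributions, forms the ratio, multiplies it by the difference $R(t) - V(s_t)$ and averages. Using a density in place of a probability, treating the action as a scalar, and the names `normal_pdf` and `actor_loss` are simplifying assumptions for this example.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def actor_loss(actions, old_params, new_params, advantages):
    """Average of ratio_t * (R(t) - V(s_t)) over the stored samples.

    actions[t]    -- scalar action stored at step t (one component, for brevity)
    old_params[t] -- (mean, std) from the old execution network for s_t
    new_params[t] -- (mean, std) from the new execution network for s_t
    advantages[t] -- the difference R(t) - V(s_t)
    """
    total = 0.0
    for a, (mu_old, sd_old), (mu_new, sd_new), adv in zip(
        actions, old_params, new_params, advantages
    ):
        ratio = normal_pdf(a, mu_new, sd_new) / normal_pdf(a, mu_old, sd_old)
        total += ratio * adv
    return total / len(actions)
```

When the old and new parameters coincide, every ratio equals 1 and the value reduces to the mean of the differences. Policy-gradient methods that use a similar ratio often clip it to limit the size of each update; the description does not mention clipping, so it is omitted here.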
In the concrete implementation of step S16, steps S13-S15 are repeated until the bandwidth occupied by each modal in the polymorphic smart network ensures the communication transmission quality and does not overload the network export;
In an embodiment, the process of S13-S15 is one round of training, and further rounds of training are carried out until each modal occupies the bandwidth reasonably, so that the communication transmission quality is ensured without overloading the network export. After sufficient training, the reinforcement learning agent has learned the optimal strategy in different network environments, that is, the execution action that achieves the set expected goal.
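The overall flow of steps S13 to S16 can be summarized with the following Python skeleton. All callables passed to `train` (`observe_state`, `env_step`, `select_action`, `update_critic`, `sync_old_policy`, `update_actor`, `is_done`) are hypothetical placeholders for the components described above, and `max_rounds` is an added safety bound that the description does not mention.

```python
def train(observe_state, env_step, select_action, update_critic,
          sync_old_policy, update_actor, is_done, max_steps, max_rounds):
    """Run rounds of training until the stop condition of step S16 holds.

    Returns the number of rounds that were executed.
    """
    for round_index in range(max_rounds):
        pool = []                      # experience pool for this round
        state = observe_state()
        for _ in range(max_steps):     # step S13, repeated T times
            action = select_action(state)
            next_state, reward = env_step(action)
            pool.append((state, action, reward, next_state))
            state = next_state
        update_critic(pool)            # step S14: action evaluation network
        sync_old_policy()              # step S15: copy new parameters to old network
        update_actor(pool)             # step S15: update new execution network
        if is_done(pool):              # step S16: quality met without overload
            return round_index + 1
    return max_rounds
```

No parameters are updated inside the inner loop, which mirrors the statement that the network parameters stay fixed while the T steps of a round are collected.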
Corresponding to the aforementioned embodiment of the reinforcement learning agent training method in a polymorphic smart network, the application also provides an embodiment of the reinforcement learning agent training apparatus in a polymorphic smart network.
Step S41, applying a reinforcement learning agent trained by the reinforcement learning agent training method in a polymorphic smart network according to Embodiment 1 to the polymorphic smart network;
Step S42, scheduling resources occupied by each modal according to a scheduling strategy output by the reinforcement learning agent.
According to the above embodiment, the application applies the trained reinforcement learning agent to the modal bandwidth resource scheduling method, which can be adaptive to networks with different characteristics, can be used for intelligent management and control of polymorphic smart networks, and has good adaptability and scheduling performance.
In an embodiment, the reinforcement learning agent training method in the above-mentioned polymorphic smart network has been described in detail in Embodiment 1, and the application of the reinforcement learning agent to a polymorphic smart network and scheduling according to the scheduling strategy output by reinforcement learning agent are both conventional technical means in this field, and will not be repeated here.
Corresponding to the aforementioned embodiment of the modal bandwidth resource scheduling method in the polymorphic smart network, the application also provides an embodiment of the modal bandwidth resource scheduling apparatus in a polymorphic smart network.
With regard to the apparatus in the above embodiment, the specific way in which each module performs operations has been described in detail in the embodiment of the method, and will not be described in detail here.
For the apparatus embodiment, because it basically corresponds to the method embodiment, it is only necessary to refer to part of the description of the method embodiment for the relevant points. The apparatus embodiments described above are only schematic, in which the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed to multiple network units. Some or all of the modules can be selected according to the actual needs to achieve the purpose of the application solution. Those skilled in the art can understand and implement it without creative labor.
Correspondingly, the application also provides an electronic device, which includes one or more processors; a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the reinforcement learning agent training method in the polymorphic smart network or the modal bandwidth resource scheduling method in the polymorphic smart network as described above. As shown in
Correspondingly, the application also provides a computer-readable storage medium, on which computer instructions are stored, which, when executed by a processor, implement the reinforcement learning agent training method in the polymorphic smart network or the modal bandwidth resource scheduling method in a polymorphic smart network. The computer-readable storage medium can be an internal storage unit of any apparatus with data processing capability as described in any of the previous embodiments, such as a hard disk or a memory. The computer-readable storage medium can also be an external storage device of the apparatus, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, a Flash Card and the like provided on the device. Further, the computer-readable storage medium can also include both internal storage units and external storage devices of any device with data processing capability. The computer-readable storage medium is used for storing the computer program and other programs and data required by any equipment with data processing capability, and can also be used for temporarily storing data that has been output or will be output.
Other embodiments of the present disclosure will easily be conceived by those skilled in the art after considering the specification and practicing the disclosure herein. This application is intended to cover any variations, uses or adaptations of this application, which follow the general principles of this application and include common sense or common technical means in this technical field that are not disclosed in this application.
It shall be construed that this application is not limited to the precise structure described above and shown in the drawings, and various modifications and changes can be made without departing from its scope.
Number | Date | Country | Kind |
---|---|---|---|
202210782477.4 | Jul 2022 | CN | national |
The present disclosure is a continuation of International Application No. PCT/CN2022/130998, filed on Nov. 10, 2022, which claims priority to Chinese Application No. 202210782477.4, filed on Jul. 5, 2022, the contents of both of which are incorporated herein by reference in their entireties.
Number | Date | Country | |
---|---|---|---|
20240015079 A1 | Jan 2024 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/CN2022/130998 | Nov 2022 | US |
Child | 18359862 | US |