The present disclosure is related to obtaining a policy, for load balancing a communication system, from previous policies.
The present application relates to support of cellular communications particularly with respect to supporting traffic volume (or traffic for short) efficiently with a deployment of base stations. An example of cellular communications is 5G.
A problem occurs in a radio communication system when a base station (BS) has a poor configuration with respect to present traffic demand. The base station may not be able to provide the spectral bandwidth requested by user equipment devices (UEs). Alternatively, a base station may be inefficiently allocated more spectrum than necessary to meet demand.
An embodiment may address these problems by using a policy bank to provide a policy for choosing load balancing parameters for a network characterized by a previously unseen scenario, where a scenario is a description of a network layout and traffic demand statistics.
Communication load balancing balances the communication load between different network resources, e.g., frequencies. Load balancing improves the quality of service of the communications systems. For example, efficient load balancing improves the system's total IP throughput and minimum IP throughput of all frequencies.
The text and figures are provided solely as examples to aid the reader in understanding the invention. They are not intended and are not to be construed as limiting the scope of this invention in any manner. Although certain embodiments and examples have been provided, it will be apparent to those skilled in the art based on the disclosures herein that changes in the embodiments and examples shown may be made without departing from the scope of embodiments provided herein.
Embodiments provide a reduction in the required minimum computational power and a reduction in the required computer memory at a network server 5-8. These computational hardware improvements are provided by the embodiments performed by the policy selection server 3-2 disclosed below. The policy selection server 3-2 efficiently selects a policy to be implemented by the network server 5-8 in balancing load in a communication system 3-4 described below. The balanced load is in terms of camping UEs 5-2 and active UEs 5-4. By balancing the load, fewer base stations are needed in the communication system 3-4 to support a demanded amount of traffic service at a given performance level (e.g., bit error rate, dropped call rate, or waiting time to be provided a channel).
At operation 1-4, load balancing parameters 1-8 for the communication system 3-4 are determined based on policy πs.
At operation 1-6, the load of the communication system 3-4, which is a target network, is balanced using the load balancing parameters 1-8.
At time (1), data from known networks is obtained, and the policy bank 2-8 and the policy selector 2-10 are trained.
At time (2), a system set of parameters 3-6 is obtained from the communication system 3-4. The system set of parameters 3-6 is an example of current traffic state. At time (3), the policy selection server 3-2 selects a policy πs based on data from the previously unseen traffic scenario (new traffic scenario) of communication system 3-4.
At time (4), the action selection server 3-11 takes action 3-8 based on the policy πs. The action 3-8 includes providing the load balancing parameters 1-8 (including updated reselection parameters 10-12 and updated handover parameters 10-14) for use in the cells 5-6 of the communication system 3-4. The action 3-8 is an example of a first action.
At time (5), the communication system 3-4 applies the updated reselection parameters 10-12 and updated handover parameters 10-14 and performance metrics 3-12 are achieved because the load of the target network has been balanced. The performance metrics 3-12 reflect a first reward obtained by moving, based on the first action, from a first state of the communication system 3-4 to a second state.
In some embodiments of
Additionally, in some embodiments of
A policy πs is then selected by the policy selector 2-10. Load balancing parameters 1-8 are determined as an action 3-8 based on the state of the network with the unseen scenario. Then, among other load balancing events and referring to time (4) of
In some embodiments, and again referring to time (1), the data from known networks includes a plurality of traffic profiles. In an example, a first traffic profile of the plurality of traffic profiles is a time series of traffic demand values and the first traffic profile is associated with a first channel of a first base station situated at a first geographic location of a known network of the known networks.
In some embodiments and referring to time (2) of
One cluster of the plurality of profiles is associated with one Markov decision process (MDP). MDP is discussed in detail below.
Each state representation includes a plurality of state vectors corresponding to a traffic profile and a policy. The plurality of state representations are a training set. The policy selector 2-10 is trained based on the training set. Further details of training the policy selector 2-10 are given in
Referring to times (1) and (2), receiving the state of the communication system 3-4, a state which represents an unseen scenario, is an example of deploying the plurality of policies of the policy bank 2-8 and the policy selector 2-10 to the target network (communication system 3-4), see the event of “deploy” from 2-10 to 2-6 in
At the top of
During day i, operation (1) occurs a single time, followed by alternating occurrences of operations (2) and (3).
Load balancing occurs in a given environment. An environment is a configuration of base stations and a pattern of traffic demands made on the base stations, which are also referred to as traffic patterns.
While communication system 3-4 is operating, observed traffic patterns are stored in a buffer as shown in
A state st is a communication network state at a time step t. The state indicates the number of active UEs, the number of camping UEs, the data throughput (also referred to as IP throughput), and the resource usage (for example, the number of physical resource blocks, PRBs) of each cell of the communication system 3-4. A current state of the network may be referred to as network state 1-10.
A baseline policy π0 is a policy not obtained by reinforcement learning (RL) and is for example, a rule-based policy. A rule-based policy sets reselection parameters 10-12 and handover parameters 10-14 to some predetermined values without training the parameters based on traffic experiences. For example, a rule-based policy may set the parameters based on expert knowledge.
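As a non-limiting illustration (not the actual rule set of any embodiment), a rule-based baseline policy may simply return fixed, expert-chosen values for the control parameters regardless of the observed traffic state; the parameter names and numeric values in the following sketch are hypothetical placeholders.

```python
# Minimal sketch of a rule-based baseline policy pi_0 (hypothetical values).
# It ignores the observed traffic state and always returns the same
# expert-chosen reselection and handover parameters for every cell pair.

def baseline_policy(state, cell_pairs):
    """Return fixed load balancing parameters for every (i, j) in cell_pairs."""
    params = {}
    for (i, j) in cell_pairs:
        params[(i, j)] = {
            "alpha": 0.0,    # handover offset alpha_i,j (e.g., CIO), in dB
            "beta": -100.0,  # reselection threshold beta_i,j, in dBm
            "gamma": -96.0,  # reselection threshold gamma_i,j, in dBm
        }
    return params
```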
A set of policies (elements of the policy bank 2-8) may be trained using Algorithm 1 as provided in Table 1.
For item 6 of Table 1, a proximal policy optimization (PPO) algorithm may be used to operate on Ei to find πi. PPO is a non-limiting example, and other reinforcement learning algorithms can be used.
A policy is initialized as πθ0.
For each set of trajectories D, an advantage estimate Ât is computed using a value function vϕ.
The advantage estimate is used to compute the next iteration of the policy parameters θi.
The value function vϕ is then updated based on the rewards observed up to time T.
For further details of implementation of a proximal policy optimization, see J. Schulman, et al., “Proximal Policy Optimization Algorithms,” Aug. 27, 2017, Cornell Archive paper number arXiv:1707.06347v2.
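The following non-limiting sketch illustrates the iteration described above: trajectories D are collected under the current policy, advantages are estimated using the value function vϕ, the policy parameters θ are updated, and vϕ is refit to the observed returns. The environment interface and the helper methods (policy.act, policy.update, value_fn.predict, value_fn.fit) are assumptions for illustration, not a verbatim implementation of the disclosed algorithms.

```python
import numpy as np

def collect_trajectory(env, policy, horizon):
    """Roll out the current policy for `horizon` steps.
    The env.reset()/env.step() interface shown here is a simplified
    assumption, not the disclosed simulator API."""
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(horizon):
        a = policy.act(s)
        s_next, r, done = env.step(a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = env.reset() if done else s_next
    return states, actions, rewards

def advantage_estimates(rewards, values, gamma=0.99):
    """Simple advantage estimate: discounted return-to-go minus the value
    baseline v_phi(s_t). (Practical PPO implementations often use GAE.)"""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(values, dtype=float), returns

def ppo_iteration(env, policy, value_fn, horizon=24, gamma=0.99):
    """One outer iteration: collect D, compute advantage estimates with
    v_phi, update theta via the clipped objective (see the Eq. (8) sketch
    below), then refit v_phi to the observed returns."""
    states, actions, rewards = collect_trajectory(env, policy, horizon)
    values = [value_fn.predict(s) for s in states]
    advantages, returns = advantage_estimates(rewards, values, gamma)
    policy.update(states, actions, advantages)  # clipped surrogate step
    value_fn.fit(states, returns)               # regression on returns-to-go
```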
In
The communication system 3-4 includes cells 5-6 (also see
The policy selection server 3-2 determines a policy by using Algorithms 1, 2 and 3 and providing the input data as shown in
The network server 5-8 performs the function of the action selection server 3-11: it applies the policy πs based on the network state 1-10 to obtain the load balancing parameters 1-8, including updated reselection parameters 10-12 and updated handover parameters 10-14.
The geographic layout and bandwidth allocation of the communication system 3-4, along with the statistics of traffic demand from the camping UEs 5-2 and active UEs 5-4, represent a scenario. For a scenario, the statistics are stationary. Examples of a scenario include an urban cell layout on a weekday, an urban cell layout on a weekend day, a rural cell layout on a weekday, or a rural cell layout on a weekend day.
Load balancing includes redistributing user equipment (UEs) between cells. A cell is a service entity serving UEs on a certain carrier frequency and within a certain direction range relative to the base station it resides on. A base station can host multiple cells serving at different non-overlapping direction ranges (or sectors). Load balancing can be triggered between cells in the same (or different) sector(s).
The UEs have two states: active and idle. A UE is active when it is actively requesting network resources. For example, such a user might be streaming video or making a call. When a UE is not in such a state, it is idle. There are two types of load balancing methods: (1) active UE load balancing (AULB), which is done through handover, and (2) idle UE load balancing (IULB), which is done through cell-reselection. The first results in instantaneous changes in the load distribution at the cost of system overhead. The second is relatively more lightweight, and it affects the load distribution when UEs change from idle to active.
1) Active UE load balancing (AULB): AULB, such as mobility load balancing, moves active UEs, by handover, from their serving cells to neighboring cells if better signal quality can be reached.
A handover occurs if Eq. (1) is true:
Fj>Fi+αi,j+H Eq. (1)
H is the handover hysteresis, αi,j is a control parameter such as the Cell Individual Offset (CIO), and Fi and Fj are the measured signal qualities of serving cell i and neighboring cell j, respectively. Equation (1) shows that by decreasing αi,j, the system can more easily hand over UEs from cell i to cell j, thereby offloading from i to j, and vice-versa. Therefore, finding the best αi,j value for different combinations of traffic status at cells i and j allows the system to perform AULB optimally.
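The Eq. (1) trigger may be illustrated with the following non-limiting sketch; the measurement and parameter values in the example are hypothetical.

```python
def handover_triggered(f_i, f_j, alpha_ij, hysteresis):
    """Eq. (1): hand over a UE from serving cell i to neighbor j when the
    neighbor's measurement F_j exceeds F_i + alpha_i,j + H.
    Decreasing alpha_ij makes offloading from i to j easier."""
    return f_j > f_i + alpha_ij + hysteresis

# Hypothetical example: F_i = -95 dBm, F_j = -91 dBm, alpha_i,j = -2 dB, H = 3 dB
print(handover_triggered(-95.0, -91.0, -2.0, 3.0))  # True
```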
2) Idle UE load balancing (IULB): IULB moves idle UEs from their camped cell to a neighboring cell based on cell-reselection. From the cell it is camped on, an idle UE can receive minimal service, such as broadcast service. Once it turns into active mode, it stays at the cell it camped on, and can be moved to another cell through AULB.
Generally, cell-reselection is triggered when the following condition is true:
Fi<βi,j and Fj>γi,j Eq. (2)
where βi,j and γi,j are control parameters. By increasing βi,j and decreasing γi,j, the system can more easily move idle UEs from cell i to cell j, and vice-versa. Hence, optimally controlling these parameters will allow the system to balance the anticipated load and reduce congestion when idle UEs become active.
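Similarly, a non-limiting sketch of the Eq. (2) reselection condition (measurement and threshold values are hypothetical):

```python
def reselection_triggered(f_i, f_j, beta_ij, gamma_ij):
    """Eq. (2): an idle UE re-selects from camped cell i to neighbor j when
    F_i < beta_i,j and F_j > gamma_i,j. Raising beta_ij or lowering gamma_ij
    makes it easier to move idle UEs from i to j."""
    return f_i < beta_ij and f_j > gamma_ij

# Hypothetical example: F_i = -102 dBm, F_j = -93 dBm, beta = -100 dBm, gamma = -96 dBm
print(reselection_triggered(-102.0, -93.0, -100.0, -96.0))  # True
```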
3) Performance metrics: Let C be the group of cells over which the system balances the load. To achieve this goal and to ensure that the system enhances the overall performance of the network, four throughput-based system metrics are considered.
Gavg describes the average throughput over all cells in C, defined in Eq. (3).
where Δt is the time interval length and Ac is the total throughput of cell c during that time interval. Maximizing Gavg means increasing the overall performance of the cells in C.
Gmin is the minimum throughput among all cells in C, see Eq. (4).
Maximizing Gmin improves the worst-case cell performance.
Gsd is the standard deviation of the throughput, see Eq. (5).
Minimizing Gsd reduces the performance gap between the cells, allowing them to provide more fair services.
Gcong quantifies the ratio of uncongested cells, see Eq. (6).
where 1(·) is the indicator function returning 1 if the argument is true and 0 otherwise, and ϵ is a small threshold value. Maximizing Gcong discourages cells from getting into a congested state. An example of ϵ is 1 Mbps.
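The four metrics can be computed from per-cell throughput values, as in the following non-limiting sketch. The sketch assumes each cell's throughput over the interval Δt has already been computed; it follows the textual definitions above rather than reproducing Eq. (3)-Eq. (6) verbatim.

```python
import numpy as np

def performance_metrics(cell_throughputs, eps=1.0):
    """Compute G_avg, G_min, G_sd and G_cong for the cell group C.
    `cell_throughputs` holds one throughput value per cell in C (e.g., Mbps);
    `eps` is the small congestion threshold (e.g., 1 Mbps)."""
    a = np.asarray(cell_throughputs, dtype=float)
    g_avg = a.mean()            # Eq. (3): average over cells in C
    g_min = a.min()             # Eq. (4): worst-case cell
    g_sd = a.std()              # Eq. (5): spread across cells
    g_cong = np.mean(a > eps)   # Eq. (6): ratio of uncongested cells
    return g_avg, g_min, g_sd, g_cong

# Hypothetical example: five cells, throughput in Mbps
print(performance_metrics([12.0, 8.5, 0.4, 15.2, 9.9], eps=1.0))
```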
Sequential decision-making is an important problem in the field of machine learning. It covers a wide range of applications such as telecommunications, finance, and self-driving cars. In short, sequential decision-making describes the task where, given some past experience, an intelligent agent is expected to make a decision in an uncertain environment in order to achieve a given objective.
Reinforcement learning (RL) is a formal framework for sequential decision making. The core idea of RL is that, by mimicking a biological agent, an artificial agent can learn from its past experience by optimizing objectives given in the form of cumulative rewards. Formally speaking, a general RL problem is a discrete-time stochastic control process. In this framework, a common assumption is that the control process satisfies the Markov property, that is, the future of the process depends only on the current state.
The solution to an RL problem (the policy) is a function π which maps from S to A. To obtain this solution, the agent seeks to maximize the expected cumulative reward.
There are two main types of approaches: value-based methods and policy gradient-based methods. A value-based method builds a value function or an action-value function, i.e., an estimate of the accumulated reward, and then generates a policy from the estimated value function by taking the argmax over the action space; significant work includes Q-learning, Q-learning with function approximators, and deep Q-networks (DQN). A policy gradient method leverages a function approximator (e.g., a neural network) to model the policy and directly optimizes the policy with respect to a performance objective (typically the expected cumulative reward).
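For context, the following non-limiting sketch shows a single tabular Q-learning update, the classic value-based approach mentioned above; the discrete state/action encoding is hypothetical.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q[s, a] toward the bootstrapped
    target r + gamma * max_a' Q[s_next, a']. A greedy policy is then
    obtained by taking argmax over the action dimension."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Hypothetical example: 4 discrete states, 3 discrete actions
Q = np.zeros((4, 3))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.1
```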
One heuristic found in training policy networks (i.e., neural-network-parameterized policies) is that if a parameter update changes the policy too much in one step, it is often detrimental to the training process; this is known as policy collapse. By enforcing a KL divergence constraint between successive updates, trust region policy optimization (TRPO) adopts this idea and guarantees a monotonic improvement over policy iterations. However, TRPO is often criticized for its complex structure and its incompatibility with some common deep learning structures. To alleviate this problem, a clipped surrogate objective was introduced in the proximal policy optimization (PPO) method. PPO only requires first-order optimization and still retains performance similar to TRPO. The method is much simpler to implement and, more importantly, has better sample complexity than TRPO, which is of great importance for real-world applications. Reported drawbacks of the PPO framework include stability issues over continuous action domains, for which simple workarounds have been proposed. Embodiments use PPO as an example RL algorithm because it reduces the risk of policy collapse and provides more stable learning.
L(θ)=Êt(rt(θ)Ât) Eq. (7)
where rt(θ) is the probability ratio between the updated policy and the previous policy, and Ât is an estimator of the advantage function at timestep t. The expression does not include the KL constraint proposed in TRPO, so maximizing L directly can lead to an excessively large policy update. In PPO, the ratio rt(θ) is therefore clipped to a small interval near 1, as follows.
LCLIP(θ)=Êt[min(rt(θ)Ât, clip(rt(θ), 1−ϵ, 1+ϵ)Ât)] Eq. (8)
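A non-limiting sketch of computing the clipped surrogate objective of Eq. (8) is given below; the log-probability tensors stand in for outputs of an actual policy network.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP of Eq. (8).
    r_t(theta) = exp(logp_new - logp_old) is the probability ratio between
    the updated and the previous policy; the ratio is clipped to
    [1 - eps, 1 + eps], the minimum of the two terms is taken, and the
    negative mean is returned for use with a gradient-descent optimizer."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Hypothetical example with three sampled timesteps
logp_new = torch.tensor([-1.0, -0.7, -1.2])
logp_old = torch.tensor([-1.1, -0.9, -1.0])
adv = torch.tensor([0.5, -0.3, 1.2])
print(ppo_clip_loss(logp_new, logp_old, adv))
```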
At each time step t, an action at, containing new load balancing parameter values, is chosen according to the network state st. After applying at, the network transitions from st to st+1 according to the dynamics of the network, captured by a transition probability function P(st+1|st, at). The MDP is defined as a tuple (S, A, R, P, μ) as follows:
S is the state space, where each state is a high-dimensional vector of network status information over the last k time steps, describing the recent traffic pattern. The network status information contains the number of active UEs in every cell, the bandwidth utilization of every cell, and the average throughput of every cell. These features are averaged over the time interval between applications of a new action. In an example, each time step is one hour and k=3 (a non-limiting sketch of assembling such a state vector is provided after the MDP definition below).
A is the action space, in which each action is a concatenation of the load balancing parameters αi,j, βi,j and γi,j for all i, j ∈ C.
R is the reward, which is a weighted average of the performance metrics of Eq. (3)-Eq. (6). Although R can be directly computed from the state, the reward is an observed outcome of the chosen action.
P is the transition probability function between states, P(st+1|st, at).
μ is the initial distribution over all states in S, μ=P (s0).
While S, A and R are the same for all scenarios, P and μ are, in general, different for different scenarios. Because an RL policy is trained to maximize the long-term reward, it will inevitably be biased by P and μ; therefore a policy trained on one scenario may not be optimal on another.
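As a non-limiting illustration of the state and action spaces defined above, the state may be assembled by stacking per-cell features from the last k time steps, and an action concatenates the αi,j, βi,j and γi,j values; the feature names and data layout are assumptions for illustration.

```python
import numpy as np

def build_state(history, k=3):
    """Build s_t from the last k time steps of per-cell features.
    `history` is a list of dicts, each holding arrays of length n_cells:
    active UE counts, bandwidth (PRB) utilization and average throughput,
    averaged over the interval between actions."""
    features = []
    for snapshot in history[-k:]:
        features.extend([snapshot["active_ues"],
                         snapshot["prb_utilization"],
                         snapshot["avg_throughput"]])
    return np.concatenate(features)  # flat vector in S

def build_action(alpha, beta, gamma):
    """Concatenate the load balancing parameters alpha_i,j, beta_i,j and
    gamma_i,j for all cell pairs (i, j) in C into one action vector in A."""
    return np.concatenate([np.ravel(alpha), np.ravel(beta), np.ravel(gamma)])
```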
One of the main challenges for load balancing with a data-driven solution is generalization of the learning to diverse communication traffic patterns. To ensure that the policy bank can cover a wide range of traffic conditions, traffic scenarios may be clustered based on their daily traffic patterns to discover different traffic types. For this step, the daily traffic pattern is described as a sequence of states over, for example, 24 hours, and K-Means is used to perform the clustering. Then, a subset of scenarios is randomly picked from each type to form the set of M traffic scenarios. PPO is then applied using the MDP formulation on each scenario to obtain policy πi ∈ Π. The policies are learned by maximizing the expected sum of discounted rewards:
πi=argmaxπ Eπ(Σt=1n λt−1Rt) Eq. (9)
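A non-limiting sketch of building the policy bank as described above: daily traffic patterns are clustered with K-Means, representative scenarios are drawn from each cluster, and one policy is trained per selected scenario. The scenario objects and the train_policy helper (e.g., repeated applications of the PPO iteration sketched earlier) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_policy_bank(daily_patterns, scenarios, n_types=4, per_type=2, seed=0):
    """`daily_patterns[i]` is scenario i's 24-hour state sequence flattened
    to one vector; `scenarios[i]` is the corresponding training environment.
    Returns the selected scenario indices and the trained policies."""
    rng = np.random.default_rng(seed)
    X = np.stack([np.ravel(p) for p in daily_patterns])
    labels = KMeans(n_clusters=n_types, n_init=10, random_state=seed).fit_predict(X)

    selected = []
    for c in range(n_types):
        members = np.flatnonzero(labels == c)
        take = min(per_type, len(members))
        selected.extend(rng.choice(members, size=take, replace=False).tolist())

    # train_policy is a hypothetical helper that trains one PPO policy
    # on scenario i (e.g., by looping the ppo_iteration sketch above).
    policy_bank = [train_policy(scenarios[i]) for i in selected]
    return selected, policy_bank
```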
The policy selector aims to find the traffic scenario that is most similar to the target scenario. The policy πi that is trained on scenario i (also referred to as Xi) is chosen to execute in the target scenario. A policy trained on a scenario that is similar to the target scenario results in better performance.
When testing on an unseen traffic scenario, the system feeds the state descriptions from the last T time steps into the traffic identifier to pick the best policy in the policy bank Π. In some embodiments, T=24 hours, allowing the system to capture the peaks and valleys in the regular daily traffic pattern observed in traffic scenario data.
The policy selector, in some embodiments, is a feed-forward neural network classifier with 3 hidden layers and 1 output layer.
In some embodiments, each of the first three layers is followed by batch normalization. The numbers of neurons for these layers are 128, 64 and 32, respectively. Some embodiments use rectified linear unit (ReLU) activation for the first three layers and softmax for the last layer. The number of layers, the number of neurons in each layer, and the activation functions are chosen using cross-validation.
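The described classifier may be rendered, as a non-limiting sketch, in PyTorch as follows; the input dimension (e.g., T time steps times the number of per-cell features) and the example sizes are assumptions.

```python
import torch.nn as nn

def make_policy_selector(input_dim, num_policies):
    """Feed-forward classifier that maps the last T state descriptions
    (flattened) to a probability distribution over the M policies in the bank.
    Three hidden layers (128, 64, 32), each with batch norm and ReLU,
    followed by a softmax output layer."""
    return nn.Sequential(
        nn.Linear(input_dim, 128), nn.BatchNorm1d(128), nn.ReLU(),
        nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(),
        nn.Linear(64, 32), nn.BatchNorm1d(32), nn.ReLU(),
        nn.Linear(32, num_policies), nn.Softmax(dim=1),
    )

# Hypothetical example: 24 hourly snapshots x 21 per-cell features, bank of 8 policies
selector = make_policy_selector(24 * 21, 8)
```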
An example of a simulation scenario is shown in
In an example, and to mimic real-world data, a portion of UEs is concentrated uniformly at specified regions while the remaining UEs are uniformly distributed across the environment. These dense traffic locations change each hour. All UEs follow a random walk process with an average speed of 3 m/s. The packet arrivals follow a Poisson process with packet sizes between 50 Kb and 2 Mb and inter-arrival times between 10 and 320 ms. Both are specified at each hour to create the desired traffic condition.
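A non-limiting sketch of the per-hour traffic generation described above (Poisson arrivals realized as exponential inter-arrival times, uniform packet sizes between 50 Kb and 2 Mb); all constants are illustrative.

```python
import numpy as np

def generate_hour_of_traffic(mean_interarrival_ms, rng=None):
    """Generate one hour of packet arrivals for a single UE.
    Poisson arrivals imply exponential inter-arrival times; sizes are drawn
    uniformly between 50 Kb and 2 Mb per the scenario description."""
    rng = rng if rng is not None else np.random.default_rng()
    t_ms, arrivals = 0.0, []
    while True:
        t_ms += rng.exponential(mean_interarrival_ms)
        if t_ms >= 3_600_000:            # one hour in milliseconds
            break
        size_kb = rng.uniform(50, 2000)  # 50 Kb .. 2 Mb
        arrivals.append((t_ms, size_kb))
    return arrivals

# Hypothetical busy-hour example: mean inter-arrival time of 40 ms
packets = generate_hour_of_traffic(40.0, np.random.default_rng(1))
print(len(packets))
```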
Table 2 provides average performance over six days for several algorithms.
Table 3 provides Algorithm 4 for applying clustering to traffic profiles.
The steps of Table 3, in a non-limiting example, correspond to obtaining a plurality of traffic profiles, wherein a first traffic profile of the plurality of traffic profiles comprises a first time series of traffic demand values and the first traffic profile is associated with a first base station situated at a first geographic location, wherein a second traffic profile of the plurality of traffic profiles comprises a second time series of traffic demand values and the second traffic profile is associated with a second base station situated at a second geographic location different from the first geographic location; obtaining, by clustering over the plurality of traffic profiles, a vector, each element of the vector corresponding to a plurality of representative traffic profiles; and obtaining, for each plurality of representative traffic profiles, one policy of the plurality of policies.
In
A given policy is trained using, for example, Algorithm 2. The collection of trained policies forms the policy bank 2-8, shown in
At operation 8-2, a plurality of traffic profiles is obtained. The plurality of traffic profiles corresponds to {D1, D2, . . . , DN} of Table 1 and of Algorithm 4.
At operation 8-4, clustering is performed.
At operation 8-6 policy training is performed, one policy per cluster.
At operation 8-8, a policy is selected and load balancing is performed. Also see
Operation 8-10 indicates that performance metrics 3-12 of the communication system 3-4 are improved by the load balancing.
Policy j is used to act on scenario i. See the inner loop indexed by t in Algorithm 3,
At 10-1, the policy selection server 3-2 builds the policy bank 2-8 using data of known networks.
With respect to a previously unseen scenario (communication system 3-4), network state 1-20 is sent by the network server 5-8 to the policy selection server 3-2 in a message 10-2. This information is buffered, see
At 10-4, policy determination takes place and a message 10-5 carries an indicator of policy πs to the network server 5-8. Based on network state, the network server 5-8 takes action 10-8 configuring the load balancing parameters 1-8. This will happen several times a day as shown in
At 10-11 and 10-13, the reselection parameters 10-12 and handover parameters 10-14 (subsets of the load balancing parameters 1-8) are updated to improve load balancing.
At 10-20 and 10-22, the behavior of UEs 5-2 and UEs 5-4, respectively, is influenced by the load balancing parameters 1-8.
Based on network events, another policy may be selected, see 10-28.
As indicated at 10-32, radio resource utilization of the communication system 3-4 is improved by the contributions of the policy selection server 3-2.
At algorithm state 1 of the overall algorithm flow, network state values related to cells 5-6, camping UEs 5-2 and active UEs 5-4 of known networks are obtained.
At algorithm state 2, the policy selection server 3-2 formulates the policy bank 2-8 and trains the policy selector 2-10 using data from the known networks. The policy selection server 3-2 then selects a policy πs for the previously unseen scenario of the communication system 3-4. The selection is based on the network state of the target network, communication system 3-4. The known networks of algorithm state 1 may be considered to exhibit source traffic profiles. The communication system 3-4 may be considered to exhibit target traffic profiles. There may be overlap, in a statistical sense, between the source traffic profiles and the target traffic profiles.
At algorithm state 3, network server 5-8 updates the load balancing parameters 1-8 of the target network, communications system 3-4.
Thus, algorithm flow 11-1 balances the communication load between different network resources, e.g., frequencies of target network 3-4. Load balancing improves the quality of service of the camping UEs 5-2 and the active UEs 5-4. For example, efficient load balancing improves the system's total IP throughput and minimum IP throughput of all frequencies.
Hardware for performing embodiments provided herein is now described with respect to
This application claims benefit of priority of U.S. Provisional Application No. 63/227,951 filed Jul. 30, 2021, the contents of which are hereby incorporated by reference.