The present disclosure is related to obtaining a policy, for load balancing a communication system, with a learning technique using multi-objective reinforcement learning and meta-learning.
The fast-increasing traffic demand in a cellular communication system may cause uneven distribution of load across the network in the cellular communication system. Load balancing allocates load according to available resources such as bandwidth and base stations. Allocating includes redistributing the traffic load between different available resources. Load balancing requires automatic adjustment of several parameters to improve key performance indicators (KPIs). Maximizing one KPI such as minimum throughput (Tmin) over all base stations may lead to poor performance in another KPI such as standard deviation of throughput (Tstd).
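As a non-limiting numerical illustration of this trade-off, the following sketch computes Tmin and Tstd for two hypothetical allocations of per-base-station throughput; the values are invented for illustration only.

```python
import numpy as np

# Hypothetical per-base-station throughputs (Mbps); the numbers are illustrative only.
throughput_a = np.array([12.0, 11.5, 11.0, 45.0])  # higher minimum, but very uneven
throughput_b = np.array([11.0, 11.5, 12.0, 13.0])  # similar minimum, far more even

for name, tput in [("A", throughput_a), ("B", throughput_b)]:
    t_min = tput.min()   # Tmin: minimum throughput over all base stations
    t_std = tput.std()   # Tstd: standard deviation of throughput across base stations
    print(f"Allocation {name}: Tmin={t_min:.1f} Mbps, Tstd={t_std:.2f}")
```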
Alternative solutions may consider multiple KPIs simultaneously but still fail to provide sufficient performance for certain KPIs. An example of an alternative solution is the radial algorithm (RA) of S. Parisi, M. Pirotta, N. Smacchia, L. Bascetta and M. Restelli, “Policy gradient approaches for multi-objective sequential decision making: A comparison,” IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014, pp. 1-8.
In one example, the present application relates to supporting cellular communications, particularly with respect to handling traffic volume efficiently with an effective artificial intelligence model (AI model). Supporting includes regulating and handling cellular communications. In embodiments, an AI model may also be referred to as a policy. An example of cellular communications is 5G. The application also relates to the problem of energy saving in a telecommunications network.
In embodiments, an artificial intelligence model may be referred to as a task. An AI model, also referred to as a learned reinforcement learning model, may also be referred to as a policy in embodiments. Further, in embodiments, in reinforcement learning, experiences or histories of a policy performing in an environment may be referred to as trajectories.
Embodiments provided herein apply multi-objective reinforcement learning (MORL) to simultaneously increase Tmin while reducing Tstd. In some embodiments, meta-MORL load balancing (also referred to as MeMo LB) is used to efficiently learn from available data. In some embodiments, a distilled policy from tasks with a variety of KPI goals is used to initialize the MeMo LB solution.
Embodiments provided herein outperform comparative approaches such as the radial algorithm cited above. Thus KPIs for a cellular communication system are improved and bandwidth and base stations are more effectively used in providing cellular communications.
In the discussion below, there may be more than two preference vectors.
Provided herein is a method of obtaining a key performance indicator (KPI) fast-adaptive artificial intelligence (AI) model, the method including: receiving first KPI preference setting information; obtaining a first AI model based on the first KPI preference setting information; receiving second KPI preference setting information; obtaining a second AI model based on the second KPI preference setting information; obtaining a distilled AI model by knowledge distillation based on the first AI model and the second AI model; and obtaining the KPI fast-adaptive AI model by meta learning based on the distilled AI model, the first KPI preference setting information and the second KPI preference setting information. The preferences for policy distillation and meta-learning are not necessarily the same.
In some embodiments, the method includes applying the KPI fast-adaptive AI model to perform load balancing in a cellular communications system.
In some embodiments, the method includes initializing the KPI fast-adaptive AI model with the distilled policy.
In some embodiments, the method includes performing, using the KPI fast-adaptive AI model, task adaptation for a first task associated with a first preference vector and a second task associated with a second preference vector to obtain a plurality of first task parameters and a plurality of second task parameters, wherein the first preference vector indicates a first weighting over a plurality of KPIs and the second preference vector indicates a second weighting over the plurality of KPIs; collecting one or more first validation trajectories and one or more second validation trajectories, wherein the first task is associated with a first task policy and with first task parameters, and the second task is associated with a second task policy and with second task parameters; and updating a plurality of meta parameters of the KPI fast-adaptive AI model using the one or more first validation trajectories and the one or more second validation trajectories.
In some embodiments, the obtaining the distilled AI model includes using a distillation loss function.
In some embodiments, the obtaining of the distilled AI model by knowledge distillation based on the first AI model and the second AI model includes: training the first AI model, wherein the first AI model corresponds to a first teacher; training the second AI model, wherein the second AI model corresponds to a second teacher; collecting a plurality of trajectories using the first teacher and the second teacher; and training the distilled policy to match state-dependent action probability distributions of the first teacher and the second teacher using the distillation loss function.
In some embodiments, the distillation loss function expresses a Kullback-Leibler (KL) divergence loss.
In some embodiments, the distillation loss function expresses a negative log likelihood loss.
In some embodiments, the distillation loss function expresses a mean-squared error loss.
In some embodiments, the method includes fine tuning the KPI fast-adaptive AI model to approximate a Pareto front.
In some embodiments, obtaining the KPI fast-adaptive AI model by meta learning includes: the performing the task adaptation includes: sampling one or more first training trajectories using the KPI fast-adaptive AI model; updating the plurality of first task parameters of the first task policy based on the one or more first training trajectories; sampling one or more second training trajectories using the KPI fast-adaptive AI model; and updating the plurality of second task parameters of the second task policy based on one or more second training trajectories; and the collecting includes: obtaining the one or more first validation trajectories using the KPI fast-adaptive AI model; and obtaining the one or more second validation trajectories using the KPI fast-adaptive AI model.
Also provided herein is a server for obtaining a key performance indicator (KPI) fast-adaptive artificial intelligence (AI) model, the server including: one or more processors; and one or more memories, the one or more memories storing a program, wherein execution of the program by the one or more processors is configured to cause the server to at least: receive first KPI preference setting information; obtain a first AI model based on the first KPI preference setting information; receive second KPI preference setting information; obtain a second AI model based on the second KPI preference setting information; obtain a distilled AI model by knowledge distillation based on the first AI model and the second AI model; and obtain the KPI fast-adaptive AI model by meta learning based on the distilled AI model, the first KPI preference setting information and the second KPI preference setting information.
Also provided herein is a non-transitory computer readable medium configured to store a program for obtaining a key performance indicator (KPI) fast-adaptive artificial intelligence (AI) model, wherein execution of the program by one or more processors of a server is configured to cause the server to at least: receive first KPI preference setting information; obtain a first AI model based on the first KPI preference setting information; receive second KPI preference setting information; obtain a second AI model based on the second KPI preference setting information; obtain a distilled AI model by knowledge distillation based on the first AI model and the second AI model; and obtain the KPI fast-adaptive AI model by meta learning based on the distilled AI model, the first KPI preference setting information and the second KPI preference setting information.
Provided herein is a method of multi-objective reinforcement learning load balancing, the method including: initializing a meta policy; performing, using the meta policy, task adaptation for a first task associated with a first preference vector and a second task associated with a second preference vector to obtain a plurality of first task parameters and a plurality of second task parameters. The first preference vector indicates a first weighting over a plurality of key performance indicators and the second preference vector indicates a second weighting over the plurality of KPIs. The method also includes collecting one or more first validation trajectories and one or more second validation trajectories, wherein the first task is associated with a first task policy and with first task parameters, and the second task is associated with a second task policy and with second task parameters; updating a plurality of meta parameters of the meta policy using the one or more first validation trajectories and the one or more second validation trajectories; and applying the meta policy to perform load balancing in a cellular communications system.
Also provided herein is a server for multi-objective reinforcement learning load balancing, the server including: one or more processors; and one or more memories, the one or more memories storing a program, wherein execution of the program by the one or more processors is configured to cause the server to at least: initialize a meta policy; perform, using the meta policy, task adaptation for a first task associated with a first preference vector and a second task associated with a second preference vector to obtain a plurality of first task parameters and a plurality of second task parameters. The first preference vector indicates a first weighting over a plurality of key performance indicators and the second preference vector indicates a second weighting over the plurality of KPIs. The program is further configured to cause the server to collect one or more first validation trajectories and one or more second validation trajectories, wherein the first task is associated with a first task policy and with first task parameters, and the second task is associated with a second task policy and with second task parameters; update a plurality of meta parameters of the meta policy using the one or more first validation trajectories and the one or more second validation trajectories; and apply the meta policy to perform load balancing in a cellular communications system.
Also provided herein is a non-transitory computer readable medium configured to store a program for multi-objective reinforcement learning load balancing, wherein execution of the program by one or more processors of a server is configured to cause the server to at least: initialize a meta policy; perform, using the meta policy, task adaptation for a first task associated with a first preference vector and a second task associated with a second preference vector to obtain a plurality of first task parameters and a plurality of second task parameters. The first preference vector indicates a first weighting over a plurality of key performance indicators and the second preference vector indicates a second weighting over the plurality of KPIs. The stored program is further configured to cause the server to collect one or more first validation trajectories and one or more second validation trajectories, wherein the first task is associated with a first task policy and with first task parameters, and the second task is associated with a second task policy and with second task parameters; update a plurality of meta parameters of the meta policy using the one or more first validation trajectories and the one or more second validation trajectories; and apply the meta policy to perform load balancing in a cellular communications system.
The text and figures are provided solely as examples to aid the reader in understanding the invention. They are not intended and are not to be construed as limiting the scope of this invention in any manner. Although certain embodiments and examples have been provided, it will be apparent to those skilled in the art based on the disclosures herein that changes in the embodiments and examples shown may be made without departing from the scope of embodiments provided herein.
Knowledge distillation 1-40 combines knowledge from different AI models into a single distilled AI model.
The resulting distilled policy 1-3 is used as a starting point for meta training 1-60. Meta training 1-60 approximates an optimal solution in a multi-objective MDP. The approximation is the KPI fast-adaptive AI model, also referred to as meta policy 1-7. The exact optimal solution is a Pareto front and the exact Pareto front may be difficult to achieve due to computational requirements and a volume of training data required. Meta training 1-60 is an efficient approach to finding a solution to the load balancing problem for the cellular communications system 1-11 when there are multiple objectives represented by different sampled tasks 1-62 corresponding to possible different operator objectives for the cellular communications system 1-11. The sampled tasks 1-62 correspond to KPI preferences and may be different than the KPI preference setting information used in the knowledge distillation 1-40.
Overall, meta training 1-60 is an efficient and accurate solution for the cellular communication system 1-11, and its efficiency is assisted by knowledge distillation 1-40.
More specifically, knowledge distillation 1-40 includes obtaining KPI preference setting information 1-48 and KPI preference setting information 1-50. In general, there may be more than two KPI preference setting information quantities.
Details of knowledge distillation 1-40 are provided below.
Meta training 1-60 is used to initialize sampled tasks 1-62, which then interact with an environment 1-63. A task corresponds to a Markov Decision Process for a given weight vector corresponding to a KPI preference. Environment 1-63 may be the same as or different from environment 1-53. The interactions provide interaction histories 1-64, which are then used in meta adaptation training 1-66 to produce a KPI fast-adaptive AI model. This model is also referred to herein as meta policy 1-7. Details of meta training 1-60 are provided below.
In embodiments, a policy is a function that maps states (system states of communications systems) to actions (load balancing control parameters). In embodiments, a state is a vector that describes the current status of a system or an environment. For wireless systems, the state can contain information about active users, current IP (data, such as Internet Protocol data) throughput, and/or current cell PRB (physical resource block) usage.
In reinforcement learning (RL), a policy is often approximated using a neural network with learnable parameters θ. A policy may be referred to as π or πθ. The objective of RL methods is to learn optimal parameters θ* by maximizing an agent's expected return (the accumulation of rewards received at different time steps). To do so, the agent interacts with its environment by applying the policy πθ, collecting interaction data D=(st, at, rt), and performing gradient ascent to maximize its expected return; (st, at, rt) refer to the state, action and reward at time step t, respectively.
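As a non-limiting illustration of this interaction-and-update loop, the following sketch assumes a small discrete-action policy network and a generic environment object with reset() and step() methods; the dimensions, horizon, and environment interface are placeholders rather than the specific system described elsewhere herein.

```python
import torch
import torch.nn as nn

# Minimal policy network pi_theta mapping a state vector to a distribution over discrete actions.
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def collect_and_update(env, horizon=24, gamma=0.99):
    """Collect one trajectory D = {(s_t, a_t, r_t)} and take one policy-gradient step."""
    log_probs, rewards = [], []
    state = env.reset()                                  # env is a hypothetical placeholder
    for _ in range(horizon):
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())    # assumed (state, reward, done) interface
        rewards.append(reward)
        if done:
            break
    # Discounted return-to-go for each time step.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns)
    # Gradient ascent on the expected return == gradient descent on the negated objective.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```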
In general, a policy is approximated using a neural network or a model with learnable parameters θ. Model initialization refers to how these parameters are initialized or first set at the beginning of the learning process. One initialization technique is to randomly sample the parameters from a given distribution. However, learning a neural network using random initialization may take excessive time. To speed up the learning process, the parameters θ can be initialized using other parameters, such as the parameters θPD of a distilled policy πPD (item 1-3).
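As a non-limiting illustration of the two initialization options, the following sketch assumes, for illustration only, that the distilled policy and the newly initialized policy share the same architecture.

```python
import torch.nn as nn

def make_policy():
    # Same architecture for the distilled policy and the meta policy (an assumption of this sketch).
    return nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))

# Option 1: random initialization (parameters drawn from the layers' default distributions).
meta_policy = make_policy()

# Option 2: initialize the meta policy from the distilled-policy parameters theta_PD.
distilled_policy = make_policy()                       # assumed to have been trained by distillation
meta_policy.load_state_dict(distilled_policy.state_dict())
```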
Continuing with logic 1-21, at operation 1-30 the logic includes updating meta parameters θmeta of the meta policy 1-7 using the validation trajectories.
Finally, at operation 1-32, the logic includes applying the meta policy 1-7. In some embodiments, the meta policy 1-7 is applied to perform load balancing in a cellular communications system 1-11.
Further details of the re-selection offset between cells can be found in 3GPP TS 38.304, “User Equipment (UE) procedures in idle mode and in RRC Inactive state.” For active UEs, handover is controlled by condition A2, which is Xc<ThA2, and condition A5, which is Xc<ThA5,1 ∧ Xn>ThA5,2.
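As a non-limiting illustration, the A2 and A5 conditions stated above may be checked as in the following sketch; the measurement values and thresholds are hypothetical, and in practice the thresholds are operator-configured parameters.

```python
def event_a2(x_c: float, th_a2: float) -> bool:
    # Event A2: serving-cell measurement Xc falls below a threshold.
    return x_c < th_a2

def event_a5(x_c: float, x_n: float, th_a5_1: float, th_a5_2: float) -> bool:
    # Event A5: serving cell below threshold 1 AND neighbor cell above threshold 2.
    return (x_c < th_a5_1) and (x_n > th_a5_2)

# Illustrative values only; real thresholds are configured by the operator.
print(event_a2(x_c=-105.0, th_a2=-100.0))                                # True
print(event_a5(x_c=-105.0, x_n=-90.0, th_a5_1=-100.0, th_a5_2=-95.0))    # True
```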
While cellular communication system 1-11 is operating, observed network states are stored in a buffer. The stored data is a trajectory τ of length T, where each point in the trajectory is a tuple of the form (sk, ak, rk) and k indexes time (the kth time step). The trajectory may thus be expressed as τ={(s0, a0, r0), . . . , (sT, aT, rT)}. Thus a trajectory is a sequence of states, actions and rewards obtained by running some policy π on an environment E (for example, cellular communication system 1-11) for T consecutive time steps. The environment E is defined by a geographical placement of base stations, resource allocation in terms of bandwidth, and a set of statistics indicating traffic demand for a geographical distribution of camping UEs and a geographical distribution of active UEs (for example, see the discussion above of camping UEs 2-2 and active UEs 2-4).
A task corresponds to a Markov Decision Process for a given weight vector ωi corresponding to KPI preferences. In other words, for a given weight vector ωi, the reward function becomes the weighted sum of the objectives. Different weight vectors ωi result in different reward functions and thus different policies. In an example, a task is defined by a weight vector. Embodiments disclose learning a single policy πmeta that can perform well for different tasks and hence for different weight vectors over the system KPIs. A policy trained on a weight vector ωi does not necessarily perform well on another task ωj. Hence, embodiments provided herein use meta-reinforcement learning as a solution concept for the multi-objective load balancing problem.
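As a non-limiting illustration of a task defined by a weight vector, the following sketch computes the weighted-sum reward for two hypothetical preference vectors; the reward values and weights are invented for illustration and are not the exact rewards used by the embodiments.

```python
import numpy as np

def scalarized_reward(reward_vec, omega):
    """Weighted-sum reward for a task defined by preference vector omega."""
    omega = np.asarray(omega, dtype=float)
    # Each weight element is positive and the elements sum to 1 (see the task-adaptation discussion).
    assert np.all(omega >= 0) and np.isclose(omega.sum(), 1.0)
    return float(omega @ np.asarray(reward_vec, dtype=float))

# Hypothetical two-objective reward vector [r_Tmin, r_Tstd] and two tasks (weight vectors).
r = [0.8, 0.3]
print(scalarized_reward(r, omega=[0.9, 0.1]))  # task i, favoring the Tmin objective
print(scalarized_reward(r, omega=[0.2, 0.8]))  # task j, favoring the Tstd objective
```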
As mentioned above, policy distillation (also referred to as knowledge distillation) is a mechanism to combine knowledge from different expert policies into a single policy. The basic policy distillation algorithm consists of two main stages.
Expert policies (teacher policies) are obtained as follows. In a first stage, p expert policies are learned for p different tasks. A task corresponds to a specific weight vector ωi for multiple KPIs. An expert policy is an RL policy trained on a given preference weight vector to maximum performance. After this stage, each expert policy achieves the best solution for its given weight vector ωi.
The distilled policy is obtained to mimic the behaviors of the expert policies. To do so, the algorithm uses data collection and policy distillation. In data collection, the algorithm collects interaction data De={st, at, st+1, rt} from the expert or teacher policies (as mentioned before, these expert policies are learned on different KPI preference vectors) and stores the data in a common memory buffer. During policy distillation, the distilled policy πPD is initialized randomly (i.e., the parameters of πPD are random samples governed by some distribution). The distilled parameters θPD are then learned using the collected data from the experts. Specifically, the distilled parameters are updated using gradient descent to minimize the difference between the experts' actions and the distilled policy's actions. Various loss functions may be used; see Equations 7, 8 and 9. Once the optimization is done, the distilled policy represents an aggregate of knowledge from all the experts and achieves similarly good performance on all the considered tasks.
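As a non-limiting illustration of the policy distillation stage described above, the following sketch assumes the teachers and their collected state buffers (one buffer per teacher) are already available and uses a KL-divergence loss as one of the loss choices of Equations 7, 8 and 9; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def distill(student, teachers, state_buffers, epochs=10, lr=1e-3):
    """Train the distilled policy pi_PD to match the teachers' state-dependent action
    distributions using a KL-divergence loss.  `teachers` are the expert policies and
    `state_buffers` holds the states collected with each teacher (one tensor per teacher)."""
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for teacher, states in zip(teachers, state_buffers):
            with torch.no_grad():
                teacher_probs = F.softmax(teacher(states), dim=-1)      # expert action distribution
            student_log_probs = F.log_softmax(student(states), dim=-1)
            # KL(teacher || student), averaged over the buffered states.
            loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```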
In an MDP framework (for example, a Multi-Objective MDP (MOMDP)), S is a state space, A is an action space, P(st+1|st, at) is the transition probability function, R(st, at) is the reward function returning a vector of m rewards [r1, . . . , rm]T where m is the number of objectives (number of KPIs), Z is the discount factor and ϕ0 is the initial state distribution. For a given policy π, the expected discounted return is defined as Jπ=[J1π, . . . , Jmπ]T such that the result of Eq. (1) is obtained.
Jiπ=E[Σt=0H Zt ri(st, at)|s0˜ϕ0, at˜π]   Eq. 1
Equation (1) defines one objective of the multi-objective problem addressed by MORL.
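As a non-limiting illustration, the following sketch computes a Monte-Carlo estimate of the vector of discounted returns of Eq. 1 from a single trajectory of per-objective reward vectors; the reward values and discount factor are hypothetical.

```python
import numpy as np

def discounted_return_vector(reward_vectors, discount):
    """Single-trajectory estimate of J^pi = sum_{t=0..H} Z^t * r(s_t, a_t)  (cf. Eq. 1),
    where each r(s_t, a_t) is a vector of m per-objective rewards."""
    rewards = np.asarray(reward_vectors, dtype=float)    # shape (H+1, m)
    weights = discount ** np.arange(rewards.shape[0])    # Z^t for t = 0..H
    return weights @ rewards                             # shape (m,): one return per objective

# Hypothetical trajectory with m=2 objectives over 3 time steps.
print(discounted_return_vector([[1.0, 0.2], [0.5, 0.4], [0.8, 0.1]], discount=0.9))
```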
The policy π which solves the maximization below may be referred to as providing a Pareto front. Maximizing the expected discounted return requires solving the problem maxπ Jπ, where the maximization of the vector-valued return Jπ is understood in the Pareto sense.
Embodiments provide performance approaching the solution of Eq. 1 using a meta policy approach. The meta policy approach approximates the Pareto front. In embodiments provided herein, an initial meta policy is fine-tuned for a set of preferences for a small number of iterations.
To learn the meta-parameters θmeta, embodiments start by sampling N weight vectors and training N policies, one for each task, using K gradient updates. This step is called task adaptation. At the end of the task adaptation, the algorithm has N sets of policy parameters θi, one for each task i. Each policy πi performs well for a corresponding weight vector ωi. The next step is updating the meta-policy using the obtained parameters {θi}i=1N. The meta policy πmeta is updated by aggregating the errors from the N tasks. These two steps are repeated for a given number of meta iterations Nmeta and, at the end, the algorithm obtains the meta-control policy πmeta.
As mentioned above, in meta-RL, an agent strives to learn a policy with parameters θ that solves multiple tasks from a given distribution p(T). Each task Ti is an MDP defined by its inputs si, its outputs ai, a loss function Li, a transition function Pi, a reward function Ri and an episode length Hi. Generally, meta-RL methods have two steps and two task sets: meta-training on the meta-training tasks Ttrain, and meta-testing or fine tuning where the agent is evaluated on a set of test tasks Ttest. It is assumed that both the training and testing task sets are drawn from the same distribution p(T), but Ttest can be different from Ttrain. Each task Ti has both training and validation data Di={Ditrain, Dival}. For each task, the goal is to learn task-specific parameters θi=Alg(θ, Ditrain), starting from θ and using Ditrain, such that the loss Li on the validation set Dival is minimized. The final general policy obtained as θmeta may be referred to, during training, either as θ with no subscript or as θmeta. Alg(·) refers to the algorithm used to update the task-specific parameters θi. For example, gradient-based meta-RL methods such as Model Agnostic Meta Learning (MAML) may be used. The meta-training phase is a bi-level optimization problem where the objective is to learn the optimal meta-parameters as shown in Eq. 2 and Eq. 3.
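Equations 2 and 3 themselves are not reproduced in this text. Based on the surrounding description (an outer level that minimizes the sum of validation losses and an inner level that runs Alg on the training data), a plausible reconstruction, offered only as a reading aid and not as a verbatim copy of the original equations, is:

```latex
% Plausible reconstruction of the bi-level meta-training objective (cf. Eq. 2 and Eq. 3);
% inferred from the surrounding description, not copied from the original equations.
\theta_{\mathrm{meta}} \in \arg\min_{\theta}\;
  \sum_{\mathcal{T}_i \sim p(\mathcal{T})}
  \mathcal{L}_i\bigl(\theta_i,\, D_i^{\mathrm{val}}\bigr)
  \qquad \text{(cf. Eq.~2)}
\qquad \text{subject to} \qquad
  \theta_i = \mathrm{Alg}\bigl(\theta,\, D_i^{\mathrm{train}}\bigr)
  \qquad \text{(cf. Eq.~3)}
```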
The inner optimization (in the argument of Li) may be solved using one or more gradient descent steps using Eq. 4, in which β is the step size of the inner level optimization.
θi=Alg(θ, Ditrain)=θ−β∇θLi(θ, Ditrain)   Eq. 4
For the multi-objective load balancing problem, each task Ti is an MDP corresponding to a specific weight vector ωi. In one example, the solution is a gradient-based meta-RL method as described in Equations 2, 3 and 4.
The state, action and reward of the learning problem of Eq. 2 correspond to the network state 2-16, the chosen parameters 1-9 and functions related to system performance such as Tmin and Tstd. In an example, the state value includes the number of active UEs per frequency channel, the load for each frequency channel and the throughput per frequency channel. The action is selection of the parameters 1-9 (see Table 1). The rewards, in a non-limiting example, are scaled functions of the KPIs such as Tmin and Tstd.
The coefficients 4.9 and 2.4 are only examples and do not limit the embodiments. The rewards have different scales and the scaled reward functions allow them to be combined. The scaling factors can be found by a grid search over a set of plausible values. The best factors in terms of rewards are selected. The technique of reward engineering can be used.
The learning problem expressed in Eq. 2 is solved by alternating two optimization steps: (i) task adaptation Alg (inner level), where a number of task-specific policies are learned starting from the meta-policy parameters θmeta, and (ii) meta-adaptation (outer level), which adjusts the meta-parameters using trajectories sampled from the adapted policies.
For task adaptation, N preference vectors are randomly sampled from a specific distribution p(ω) such that each weight element (ωi)j is positive and the elements of ωi sum to 1. For each ωi, the loss function is given by Eq. 5.
L1(θ1, ω1)=−E{st,at}˜πmeta[Σt=0H1 ω1T r̂(st, at)]   Eq. 5
where the sum Σ is over t=0 to H1, r̂ is a reward, st is a state in a first MDP at time t, at is an action in the first MDP at time t, and E is an expectation operator over states and actions defined by the meta policy πmeta. To estimate the gradients of the loss in Eq. 5, trajectories Ditrain are collected by running the meta policy in an environment governed by a Markov Decision Process of the ith task Ti; a training trajectory is represented as {s1, a1, r1, . . . , sH, aH, rH}∈Ditrain, Ditrain comprises a set of training trajectories, and H is an episode horizon for the ith task Ti. Task-specific parameters θi are obtained using one or more gradient steps of Eq. 4.
Meta adaptation is performed as follows. A meta-learner aggregates trajectories Dival sampled using policies πθ
θ→θ−η∇θΣi=1NLi(θi,ωi) Eq. 6
Meta MORL may be implemented as described by Algorithm 1 as provided in Table 2.
The “Meta Adaptation” portion of Table 2 is performed Nmeta times (see the “t loop”). That is, meta adaptation includes performing, for example, an iteration of meta adaptation to improve the meta policy by repeating: i) performing, using the meta policy, task adaptation (line 9 of Table 2), ii) collecting, as a non-limiting example, first validation trajectories and second validation trajectories (line 10 of Table 2), and iii) updating, as a non-limiting example, the plurality of meta parameters of the meta policy using the first validation trajectories and the second validation trajectories (line 13 of Table 2).
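As a non-limiting illustration of this alternation (Table 2 itself is not reproduced in this text), the following sketch performs task adaptation per Eq. 4 and a simplified, first-order meta update in the spirit of Eq. 6. The helper functions sample_preferences, collect_trajectories and task_loss are hypothetical placeholders, and the exact update of Eq. 6 differentiates through the task adaptation rather than using this first-order simplification.

```python
import copy
import torch

def meta_morl(meta_policy, env, sample_preferences, collect_trajectories, task_loss,
              n_meta=500, n_tasks=5, k_adapt=1, beta=0.1, eta=0.01):
    """Sketch of the meta-training alternation: task adaptation followed by meta adaptation."""
    meta_opt = torch.optim.SGD(meta_policy.parameters(), lr=eta)   # eta is an illustrative outer step size
    for _ in range(n_meta):                                        # meta iterations (Nmeta)
        meta_opt.zero_grad()
        for omega in sample_preferences(n_tasks):                  # one task per sampled preference vector
            # Task adaptation (inner level, Eq. 4): start from the meta parameters.
            task_policy = copy.deepcopy(meta_policy)
            task_opt = torch.optim.SGD(task_policy.parameters(), lr=beta)
            for _ in range(k_adapt):
                d_train = collect_trajectories(env, task_policy, omega)
                loss = task_loss(task_policy, d_train, omega)      # weighted-sum loss (cf. Eq. 5)
                task_opt.zero_grad()
                loss.backward()
                task_opt.step()
            # Meta adaptation (outer level, cf. Eq. 6): accumulate gradients from validation trajectories
            # sampled with the adapted policy.
            d_val = collect_trajectories(env, task_policy, omega)
            val_loss = task_loss(meta_policy, d_val, omega)
            val_loss.backward()
        meta_opt.step()                                            # theta <- theta - eta * aggregated gradient
    return meta_policy
```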
The challenging task of learning a general meta-policy is accomplished by embodiments provided herein. First, the task adaptation step explained above includes the collection of trajectories for each task. Generally, the number of these trajectories is limited to ensure the adaptation with few samples. Further, θmeta and θi have the same parameter space which could be in the order of millions in deep neural networks. Additionally, learning one initial condition for a large family of tasks is not trivial. To account for these challenges, some of the embodiments provided herein use policy distillation to combine the knowledge from different tasks into a single policy which will be used to initialize the meta-training. This provides better task-specific policies with fewer samples since the algorithm assumes that some of the preferences encountered during the task adaptation phase can be similar to the tasks used during policy distillation.
An achievement of MORL LB (and Meta MORL LB) is using known tasks to learn a task policy for a new task, when the new task is a previously-unseen task.
As mentioned above, in some embodiments, policy distillation is used to initialize the meta policy (that is, to initialize πmeta). The policy distillation stage starts by selecting P≠N preferences {ω1, . . . , ωP} and training P task-specific policies, one for each weight vector, to maximum performance. The task-specific policies may be referred to as teachers or experts. Next, the trained teachers with parameters θi, for task i (task Ti, also referred to as Ei), are used to collect trajectories which are saved in separate memory buffers. The distilled policy πPD is learned to match the teachers' state-dependent action probability distributions πθi, for example using the Kullback-Leibler (KL) divergence loss of Eq. 7.
An alternative expression of the KL divergence uses a temperature parameter, with the KL divergence then expressed in terms of temperature-softened distributions over the Q values.
The Q values of this expression are described below in the discussions of negative log likelihood loss and mean-squared-error loss.
In an example using Eq. 7, the AI model parameters of a first task policy π1 are first task parameters θ1 and are found using model-free reinforcement learning methods.
In an example related to Eq. 7, the distillation loss function expresses a Kullback-Leibler (KL) divergence loss.
Alternative loss functions may be used. The loss functions of Eq. 8 and Eq. 9 use q values. As background, Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy. The algorithm computes a function corresponding to the expected rewards, q (also called Q-values), for an action taken in a given state.
In Eq. 8, a negative log likelihood loss is provided. Negative log likelihood loss is a loss function that measures how well the new student policy can perform; the lower the better. In Eq. 8, D is the data set, ai is an action, ai,best is the highest-value action, xi is the state (for example, the network state 2-16), and θS are the student model parameters (for example, θi). The state of the network, an input to the AI model πmeta, can contain information about active users, current IP (data, such as Internet Protocol data) throughput, and/or current cell PRB (physical resource block) usage.
In Eq. 8, ai,best=argmax (qi), where qi is a vector of unnormalized Q-values with one value per action.
L=−Σi=1D log P(ai=ai,best|xi, θS)   Eq. 8
In an example of Eq. 8, the distillation loss function expresses a negative log likelihood loss.
In Eq. 9, a mean-squared-error loss is provided, describing a squared loss between a student model (the distilled policy) and a teacher model (πi with parameters θi for task Ti). The mean-squared-error loss is a loss function that measures the distance between the outputs (actions) determined by the student policy and those determined by the teacher policy. In Eq. 9, qiT refers to the Q-value of the teacher for the ith input data and qiS refers to the Q-value of the student for the ith input data.
L=Σi=1D ∥qiT−qiS∥22   Eq. 9
In an example of Eq. 9, the distillation loss function expresses a mean-squared error loss.
When considering a reinforcement learning problem, a suitable loss function can be chosen based on whether the outputs are discrete values (use negative log likelihood loss or mean-squared error loss) or continuous (use KL divergence loss).
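As a non-limiting illustration, the three loss choices discussed above may be sketched as follows; q_teacher and q_student denote batches of unnormalized Q-values as in Eq. 8 and Eq. 9, the temperature-based KL form corresponds to the alternative expression mentioned with respect to Eq. 7, and the numerical values are hypothetical.

```python
import torch
import torch.nn.functional as F

def kl_distillation_loss(q_teacher, q_student, temperature=1.0):
    """KL divergence between the temperature-softened teacher distribution and the student (cf. Eq. 7)."""
    teacher_probs = F.softmax(q_teacher / temperature, dim=-1)
    student_log_probs = F.log_softmax(q_student, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

def nll_distillation_loss(q_teacher, student_logits):
    """Negative log likelihood of the teacher's highest-value action under the student (cf. Eq. 8)."""
    a_best = q_teacher.argmax(dim=-1)                    # a_{i,best} = argmax(q_i)
    return F.nll_loss(F.log_softmax(student_logits, dim=-1), a_best)

def mse_distillation_loss(q_teacher, q_student):
    """Mean-squared error between teacher and student Q-values (cf. Eq. 9)."""
    return F.mse_loss(q_student, q_teacher)

# Illustrative batch of 2 states and 3 actions (values are hypothetical).
qt = torch.tensor([[1.0, 2.0, 0.5], [0.2, 0.1, 0.9]])
qs = torch.tensor([[0.8, 1.5, 0.4], [0.3, 0.2, 0.7]], requires_grad=True)
print(kl_distillation_loss(qt, qs), nll_distillation_loss(qt, qs), mse_distillation_loss(qt, qs))
```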
Policy distillation, in some embodiments, is a first stage. The second stage of MeMoPD LB is the training of πmeta using πPD as initialization.
At an operation 3-10, the BS 2-12 performs load balancing using parameters derived either directly from πmeta 1-7 or parameters derived after fine-tuning πmeta (that is, directly from πnew 3-9).
Logic flow 6-1 illustrates obtaining πmeta by using πPD as the initial policy.
BS 2-12 may provide a graphical user interface (GUI) for entry of KPI weights preferred by the operator of BS 2-12. The BS 2-12 then sends KPI preference set 1 through KPI preference set N to the parameter server 2-8, where they are input to module 1. Module 2 then performs the i loop of Table 2 based on the target system.
In an implementation example, base stations (including BS 2-12) interconnect with each other via an LTE X2 interface. Once a handover decision is made from one BS to another BS, relevant information is exchanged through the X2 interface. In some embodiments, BS 2-12 uses the parameters 1-9 to perform load balancing in the cellular communications system 1-11.
Whether the BS 2-12 obtains the parameters 1-9 from the parameter server 2-8 or locally using πnew, the BS 2-12 is then able to achieve improved load balancing of the different frequency bands within each sector of cellular communication system 1-11.
In some embodiments, Meta-MORL for load balancing is deployed in three phases: offline phase, staging phase and online phase.
In the offline phase, field data is collected to generate real-world traffic patterns and performance records. These traffic scenarios are used to calibrate simulation parameters to mimic real-world dynamics. Specifically, πmeta is trained over degrees of freedom such as the number of UEs per frequency channel, traffic conditions such as request interval and file size, variations in demand over the different hours of the day, and traffic volume (for example, high traffic volume or low traffic volume).
In an example, the πmeta AI model has three hidden layers of 256 units each. Policy gradients may be computed using REINFORCE, as is known in the art. Trust-region policy optimization (TRPO) may be used for meta-adaptation. The value function, used in both the task-adaptation and meta-adaptation phases, is a linear feature model fitted separately for each task. The learning rate β may be 0.1 during meta-training and 0.003 during the fine-tuning phase. The episode length may be H=24 time steps. In each meta iteration, N=5 tasks may be used and K=10 trajectories may be sampled. Preferences may be sampled from a Gaussian distribution, restricted to be positive and L1-normalized. In an example, the πmeta policies for MeMo-LB and MeMoPD-LB are trained for 500 meta-iterations. For policy distillation, the teacher and student models may have the same architecture as the meta-policy. In an example, p=3 expert (teacher) policies are trained using proximal policy optimization (PPO).
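These example hyperparameters may be collected into a single configuration, for example as sketched below; the entries merely restate the example values above and are illustrative rather than prescriptive.

```python
# Example hyperparameters from the text; the values are illustrative, not prescriptive.
META_MORL_CONFIG = {
    "hidden_layers": [256, 256, 256],   # pi_meta architecture: three hidden layers of 256 units
    "policy_gradient": "REINFORCE",     # inner-loop policy-gradient estimator
    "meta_adaptation": "TRPO",          # outer-loop (meta-adaptation) optimizer
    "beta_meta_training": 0.1,          # learning rate during meta-training
    "beta_finetuning": 0.003,           # learning rate during fine-tuning
    "episode_length_H": 24,             # time steps per episode
    "tasks_per_meta_iteration_N": 5,
    "trajectories_sampled_K": 10,
    "meta_iterations": 500,
    "num_teachers_p": 3,                # expert policies trained with PPO for policy distillation
}
```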
A comparison of results is provided below in Table 3. It is good for Tmin to be high and Tstd to be low.
The Pareto front has been considered in order to evaluate the quality of the approximated Pareto fronts.
A measure of the quality of the approximated Pareto front is the hypervolume indicator, see Table 4.
Embodiments thus provide a better approximation in the multi-objective problem situation than the baselines.
Augmenting the meta-training with policy distillation provides better performing individual policies. In an example for 300 different preferences, ωi, the realized rewards are higher for MeMoPD-LB compared to MeMo-LB for 90% of the preferences. This is with the same number of samples and gradient steps.
Thus, embodiments provide a general parameterized policy, πmeta, which can be adapted to new preferences with fewer samples and gradient steps. This meta policy is a differentiable solution which can be optimized end-to-end over a set of training preferences. MeMo-LB and MeMoPD-LB can be applied to complex, high-dimensional real-world control problems even with a limited number of samples and tasks (preferences). The multi-objective approach of embodiments is more effective at improving cellular network performance than single-policy and traditional rule-based approaches. Policy distillation improves the generalization of the meta-policy by providing a task-specific starting point for the meta-training.
To assist in understanding, a list of symbols is provided in Table 5.
The embodiments provided above are not limited to cellular systems. For example, the embodiments described above are also applicable to intelligent transportation networks in which traffic congestion is a problem and load balancing will relieve traffic congestion. In terms of KPIs 1-5, objectives are traffic waiting time, delay, and queue length. For example, waiting time is given by the sum of the times that vehicles are stopped, delay is the difference between the waiting times of continuous green phases, and queue length is calculated for each lane in an intersection. Embodiments above simultaneously optimize the different metrics and quickly adapt to new unseen preferences depending on the intersection and the region.
For another example, the embodiments described above are also applicable to a smart grid or smart home in which energy consumption is a problem and load balancing will reduce energy consumption. In terms of KPIs 1-5, objectives are the operational cost of the smart grid and the environmental impact (e.g., greenhouse gas emission). Embodiments above can be applied to provide intelligent optimal policies for improved energy consumption.
Hardware for performing embodiments provided herein includes, for example, the one or more processors and the one or more memories of a server such as the parameter server 2-8, as well as the BS 2-12.
Provided herein is a method of multi-objective reinforcement learning load balancing, the method comprising: initializing a meta policy; performing, using the meta policy, task adaptation for a first task associated with a first preference vector and a second task associated with a second preference vector to obtain a plurality of first task parameters and a plurality of second task parameters; collecting one or more first validation trajectories and one or more second validation trajectories, wherein the first task is associated with a first task policy and with first task parameters, and the second task is associated with a second task policy and with second task parameters; updating a plurality of meta parameters of the meta policy using the one or more first validation trajectories and the one or more second validation trajectories; and applying the meta policy to perform load balancing in a cellular communications system.
In some embodiments, the method further comprises initializing the first task parameters based on the plurality of meta parameters using policy distillation.
In some embodiments initializing the first task parameters comprises: selecting a first plurality of preferences; training a first plurality of task policies, wherein the first plurality of task policies correspond to a first plurality of teachers; collecting a first plurality of trajectories using the first plurality of teachers; training the distilled policy to match state-dependent action probability distributions of the first plurality of teachers; and initializing the first task parameters using the distilled policy.
In some embodiments, the multi-objective reinforcement learning load balancing uses known tasks to learn a task policy for a new task, wherein the new task is a previously unseen task.
In some embodiments, the method further comprises fine tuning the meta policy using one or more first training trajectories.
In some embodiments, the performing the task adaptation comprises: sampling one or more first training trajectories using the meta policy; updating the plurality of first task parameters of the first task policy based on the one or more first training trajectories; sampling one or more second training trajectories using the meta policy; and updating the plurality of second task parameters of the second task policy based on one or more second training trajectories; and the collecting comprises: obtaining the one or more first validation trajectories using the meta policy; and obtaining the one or more second validation trajectories using the meta policy.
In some embodiments, the sampling the one or more first training trajectories using the meta policy comprises: running the meta policy in an environment governed by a Markov Decision Process of the first task, wherein a first training trajectory of the one or more first training trajectories is represented as {s1, a1, r1, . . . , sH, aH, rH}∈Dtrain, Dtrain comprises a plurality of training trajectories, and H is an episode horizon for the first task.
In some embodiments, the updating the meta policy comprises adjusting the plurality of meta parameters, θ, by a gradient expression θ−η∇θ{L1(θ1; ω1)+L2(θ2; ω2)}, wherein η is a step size, ∇θ is a gradient operator with respect to the plurality of meta parameters, L1 and L2 are a first loss function and a second loss function of the first task and the second task, respectively, θ1, θ2∈θ, ω1 is the first preference vector and ω2 is the second preference vector.
In some embodiments, the first loss function is of a form L1(θ1, ω1)=−Σ{s
In some embodiments, the one or more first training trajectories correspond to a daily traffic pattern for a low traffic scenario for a first configuration of base stations in a first geographic area.
In some embodiments, the method further comprises initializing the first task parameters randomly.
This application claims benefit of priority of U.S. Provisional Application No. 63/242,417 filed Sep. 9, 2021, the contents of which are hereby incorporated by reference.