The present application claims priority to Indian Provisional Patent Application No. 202321076223 filed on Nov. 8, 2023, the entirety of which is incorporated by reference herein.
The present disclosure is related to 5G wireless networks, and relates more particularly to optimized radio resource management using machine learning approaches in O-RAN networks.
In the following sections, an overview of the Next Generation Radio Access Network (NG-RAN) architecture and the 5G New Radio (NR) protocol stacks will be provided. 5G NR user and control plane functions with a monolithic gNB (gNodeB) are shown in
In addition, as shown in
For the control plane (shown in
NG-Radio Access Network (NG-RAN) architecture from 3GPP TS 38.401 is shown in
In this section, an overview of NOTIFY elementary procedure will be provided. As per 3GPP TS 38.473, the purpose of the NOTIFY procedure is to enable the gNB-DU to inform the gNB-CU that the QoS of an already established GBR DRB cannot be fulfilled any longer or that it can be fulfilled again. The procedure uses UE-associated signaling. As shown in
In this section, an overview of Layer 2 (L2) of 5G NR will be provided in connection with
Open Radio Access Network (O-RAN) is based on disaggregated components which are connected through open and standardized interfaces based on 3GPP NG-RAN. An overview of O-RAN with disaggregated RAN CU (Centralized Unit), DU (Distributed Unit), and RU (Radio Unit), near-real-time Radio Intelligent Controller (Near-RT-RIC) and non-real-time RIC is illustrated in
As shown in
A cell site can comprise multiple sectors, and each sector can support multiple cells. For example, one site could comprise three sectors and each sector could support eight cells (with eight cells in each sector on different frequency bands). One CU-CP (CU-Control Plane) could support multiple DUs and thus multiple cells. For example, a CU-CP could support 1,000 cells and around 100,000 User Equipments (UEs). Each UE could support multiple Data Radio Bearers (DRB) and there could be multiple instances of CU-UP (CU-User Plane) to serve these DRBs. For example, each UE could support 4 DRBs, and 400,000 DRBs (corresponding to 100,000 UEs) may be served by five CU-UP instances (and one CU-CP instance).
The DU could be located in a private data center, or it could be located at a cell-site. The CU could also be in a private data center or even hosted on a public cloud system. The DU and CU, which are typically located at different physical locations, could be tens of kilometers apart. The CU communicates with a 5G core system, which could also be hosted in the same public cloud system (or could be hosted by a different cloud provider). A RU (Radio Unit) (shown as O-RU 803 in
The E2 nodes (CU and DU) are connected to the near-real-time RIC 132 using the E2 interface. The E2 interface is used to send data (e.g., user and/or cell KPMs) from the RAN to the near-real-time RIC 132, and to deploy control actions and policies from the near-real-time RIC 132 to the RAN. The applications or services at the near-real-time RIC 132 that deploy the control actions and policies to the RAN are called xApps. During the E2 setup procedures, the E2 node advertises the metrics it can expose, and an xApp in the near-RT RIC can send a subscription message specifying the key performance metrics which are of interest. The near-real-time RIC 132 is connected to the non-real-time RIC 133 (which is shown as part of Service Management and Orchestration (SMO) Framework 805 in
In this section, PDU sessions, DRBs, and quality of service (QoS) flows will be discussed. In 5G networks, the PDU connectivity service is a service that provides exchange of PDUs between a UE and a data network identified by a Data Network Name (DNN). The PDU connectivity service is supported via PDU sessions that are established upon request from the UE. The DNN defines the interface to a specific external data network. One or more QoS flows can be supported in a PDU session. All the packets belonging to a specific QoS flow have the same 5QI (5G QoS Identifier). A PDU session consists of the following: a Data Radio Bearer (DRB), which is between the UE and the CU in the RAN; and an NG-U GTP tunnel, which is between the CU-UP and the UPF (User Plane Function) in the core network.
The following should be noted for 3GPP 5G network architecture, which is illustrated in
In this section, the standardized 5QI to QoS characteristics mapping will be discussed. As per 3GPP TS 23.501, the one-to-one mapping of standardized 5QI values to 5G QoS characteristics is specified in Table A shown below. The first column represents the 5QI value. The second column lists the resource type, i.e., one of Non-GBR, GBR, or Delay-critical GBR. The third column (“Default Priority Level”) represents the priority level Priority_5QI, for which the lower the value, the higher the priority of the corresponding QoS flow. The fourth column represents the Packet Delay Budget (PDB), which defines an upper bound for the time that a packet may be delayed between the UE and the N6 termination point at the UPF. The fifth column represents the Packet Error Rate (PER). The sixth column represents the maximum data burst volume for delay-critical GBR resource types. The seventh column represents the averaging window for GBR and delay-critical GBR resource types.
For example, as shown in Table A below, 5QI value 1 represents GBR resource type with the default priority value of 20, PDB of 100 ms, PER of 0.01, and averaging window of 2000 ms. Conversational voice falls under this category. Similarly, as shown in Table A, 5QI value 7 represents non-GBR resource type with the default priority value of 70, PDB of 100 ms and PER of 0.001. Voice, video (live streaming), and interactive gaming fall under this category.
In this section, Radio Resource Management (RRM), e.g., per-DRB RRM, will be discussed.
Once one of the above methods is used to compute scheduling priority of a logical channel corresponding to a UE in a cell, the same method is used for all other UEs.
In the above expressions, the parameters are defined as follows:
P_GBR = remData/targetData
In this section, a general overview of reinforcement learning (RL) will be provided. Reinforcement learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty. In reinforcement learning, the agent learns automatically using this feedback, without the use of any labelled data. Since there is no labelled data, the agent is bound to learn from its experience only. The primary goal of an agent in reinforcement learning is to improve its performance by obtaining the maximum overall reward (or minimum overall cost).
Some of the terms used in connection with reinforcement learning technique are listed below:
In this section, the Markov Decision Process (MDP) will be discussed. An MDP is used to formalize reinforcement learning (RL) problems. If the environment is completely observable, then its dynamics can be modelled as a Markov process. In MDP, which is illustrated in
An MDP uses the Markov property, i.e., the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. A Markov process is a memoryless process with a sequence of random states that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state space S and a transition probability matrix P. These two components (S and P) can define the dynamics of the system.
Reinforcement learning uses MDPs where the probabilities or costs are unknown. Reinforcement learning can solve MDPs without explicit specification of the transition probabilities; the values of the transition probabilities are needed in value and policy iteration. Reinforcement learning can also be combined with function approximation to address problems with a very large number of states.
In this section, some of the example approaches used in RL will be discussed, which example approaches include: Value-based approach (Value iteration methods); Q-learning; Deep Q Neural Network (DQN); and Policy-based approach (Policy iteration methods).
The value-based approach is about finding the optimal value function, which is the maximum value (or, equivalently for costs, the minimum cost) at a state under any policy; that is, the agent seeks the expected long-term return at any state s under policy π. V(s), which is specified below, indicates an example value-based approach. Q-learning, which is discussed later and is a variant of the value-based approach, takes one additional parameter, a current action "a". The system of equations below for the state space is called the Bellman equations or optimality equations, which characterize values and optimal policies in infinite-horizon models:
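In the cost-minimization form used throughout this disclosure, a standard statement of these equations (with C(s,a) denoting the immediate cost, P(s′|s,a) the transition probability, and γ the discount factor discussed later) is:

```latex
V(s) \;=\; \min_{a}\Big[\, C(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \,\Big], \qquad \forall\, s \in S .
```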
Q-learning involves learning the value function Q(s, a), which characterizes how good it is to take an action "a" at a particular state "s". The main objective of Q-learning is to learn the policy which can inform the agent what actions should be taken to minimize the overall cost. The goal of the agent in Q-learning is to optimize the value of Q. The value of Q can be derived from the Bellman equation. Q represents the quality of the action at each state, so instead of using a value at each state, we use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more beneficial than others, and according to the best Q-value, the agent takes its next move. The Bellman equation can be used for deriving the Q-value.
After performing an action “a”, the agent will incur a cost C(s, a), and the agent will end up at a certain state, so the Q-value equation will be:
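One standard form consistent with this description (the learning-rate parameter α is introduced here only for illustration of the sample-based update) is:

```latex
Q(s,a) \;=\; C(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, \min_{a'} Q(s',a'),
\qquad
Q(s,a) \;\leftarrow\; Q(s,a) \;+\; \alpha\Big[\, C(s,a) + \gamma \min_{a'} Q(s',a') - Q(s,a) \,\Big].
```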
The flowchart shown in
Deep Q Neural Network (DQN) is Q-learning using neural networks. For a large state space environment, it is a challenging and complex task to define and update a Q-table. To solve this issue, we can use the DQN algorithm, in which, instead of defining a Q-table, a neural network approximates the Q-value for each action and state.
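As a minimal, illustrative sketch of the tabular Q-learning described above (not the DQN variant), the following Python fragment maintains a small Q-table and applies the cost-minimizing update; the state-space size, cost values, and placeholder environment step() are hypothetical and are not part of the disclosed system:

```python
import random

# Illustrative sizes only; the actual state space in this disclosure is the
# set of quantized performance-measurement combinations described later.
NUM_STATES = 8
ACTIONS = (0, 1)      # 0 = do not schedule, 1 = schedule
ALPHA = 0.1           # learning rate
GAMMA = 0.9           # discount factor
DELTA = 0.1           # exploration probability

# Q[s][a] approximates the long-term discounted cost of action a in state s
# (lower is better, since the agent minimizes cost).
Q = [[0.0 for _ in ACTIONS] for _ in range(NUM_STATES)]

def step(state, action):
    """Placeholder environment: returns (next_state, immediate_cost)."""
    next_state = random.randrange(NUM_STATES)
    cost = random.random() + (0.5 if action == 0 else 0.0)
    return next_state, cost

state = 0
for _ in range(10_000):
    # Explore with probability DELTA, otherwise take the minimum-cost action.
    if random.random() < DELTA:
        action = random.choice(ACTIONS)
    else:
        action = min(ACTIONS, key=lambda a: Q[state][a])

    next_state, cost = step(state, action)

    # Cost-minimizing Q-learning update derived from the Bellman equation.
    Q[state][action] += ALPHA * (cost + GAMMA * min(Q[next_state]) - Q[state][action])
    state = next_state
```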
The policy-based approach is to find the optimal policy for the minimum future costs without using the value function. It converges for a finite state and action set. It has two steps: policy evaluation and policy improvement. In policy improvement, if the policy is the same at two consecutive epochs, then it is the optimal policy.
The policy-based approach involves mainly two types of policies: 1) a deterministic policy, whereby the same action is produced by the policy for any given state; and 2) a stochastic policy, whereby for each state there is a distribution over the set of actions possible at that state.
In this section, the application of artificial intelligence (AI) and/or machine learning (ML) in RAN will be discussed. 3GPP TR 37.817 focuses on the analysis of data needed at the Model Training function and Model Inference function from Data Collection. Where AI/ML functionality resides within the current RAN architecture depends on the deployment and the specific use cases. The Model Training and Model Inference functions should be able to request, if needed, specific information to be used to train or execute the AI/ML algorithm and to avoid reception of unnecessary information. The nature of such information depends on the use case and on the AI/ML algorithm. The Model Inference function should signal the outputs of the model only to nodes that have explicitly requested them (e.g., via subscription), or nodes that take actions based on the output from Model Inference. An AI/ML model used in a Model Inference function has to be initially trained, validated and tested by the Model Training function before deployment. In this context, NG-RAN stand-alone (SA) is prioritized, while EN-DC and MR-DC are less prioritized, but not precluded from 3GPP Release 18.
In this section, the functional framework for RAN intelligence, which is illustrated in
Given the increasing demand for mobile data traffic and the heterogeneity of Quality of Service requirements, meeting the stringent throughput and latency targets of various applications is an important challenge for existing 5G networks. To solve this issue, a base station requires intelligent Radio Resource Management (RRM) which can cope with the existing and emerging QoS challenges in the near future. Conventional RRM methods (e.g., as previously described above) include those which are hosted at the DU and are very computation intensive.
Accordingly, there is a need for an improved system and method for optimizing RRM methods in O-RAN networks.
Accordingly, what is desired is an improved system and method for optimizing RRM methods, e.g., using machine learning approaches, in O-RAN networks.
According to a first example embodiment of the system and method according to the present disclosure, the RRM analytics module can be hosted at one of the following components, for example: 1) RIC (e.g., Near-RT RIC) server; 2) gNodeB (e.g., at CU-UP or CU-CP); 3) 5G Network Data Analytics Function (NWDAF) server; or 4) Operations, Administration and Maintenance (OAM) server. According to the first example embodiment, various performance measurements are communicated from the gNodeB-DU to the RRM analytics module. These performance measurements include the following for each UE: Buffer Occupancy (BO) at 5QI level; Channel State Information (CSI); UE throughput; packet delay at 5QI level; Packet Error Rate (PER) at 5QI level; and PF metric. These performance measurements are analyzed at the RRM analytics module for each UE.
According to a first variant of the first example system and method, the RRM analytics module is located at Near-RT RIC. In this first variant, the following steps are implemented:
According to a second variant of the first example system and method, the RRM analytics module is located at the CU-UP. In this second variant, the following steps are implemented:
According to an example embodiment of the method according to the present disclosure, the RRM-related decision-making process is mapped to a reinforcement learning problem by formalizing it using an MDP, which involves four elements: States; Actions; Costs/Rewards; and Transition Probabilities.
According to an example embodiment of the method according to the present disclosure, at time t, the immediate cost for UEi when this UEi is in state "s" with action "a" is represented as C(si(t)=s, ai(t)=a)=Ci(s(t),a(t)). The cost can be written as a function of costs associated with each state variable, as shown below in the expanded cost equation. The overall cost is a weighted sum of the costs associated with the BO, GBR, PDB, PER, and PF state variables and the cell level cost. Here, CCL denotes the cell level cost component.
According to an example embodiment, a method for optimizing RRM methods includes a Value iteration policy (Policy 1). For the unknown transition probability matrix, we initialize the matrix with zero and update it in the following manner. At state “s”, after taking action “a”, the system lands in state s′, then we update P(s′|s, a)=1. Later, at the same state “s”, taking the same action “a”, the system lands in s″, then we update P(s′|s, a)=P(s″|s, a)=0.5. We update the transition probabilities based on i) the different states in which the system landed and ii) the number of times the system landed in each state.
According to an example embodiment, a method for optimizing RRM methods includes a Policy 2. After initial learning of transition probabilities as mentioned in Policy 1, we can run the RL with exploration and exploitation strategy. In the exploration stage, we choose the action to schedule the UE or not to schedule the UE uniformly at random. In the exploitation stage, we choose the action that incurred the minimum cost until now (i.e., during the training phase). Here, delta is a parameter controlling the amount of exploration vs. exploitation.
In the present disclosure, system and/or network components (e.g., the RRM analytics module, the RIC, and RAN nodes, including CU-CP, CU-UP, and DU) can be implemented as i) software executed on processor(s) and/or servers, and/or ii) dedicated hardware modules comprising processor(s) executing software, to implement the respective functionalities of the system and/or network components.
For this application, the following terms and definitions shall apply:
The term “network” as used herein includes both networks and internetworks of all kinds, including the Internet, and is not limited to any particular type of network or inter-network.
The terms “first” and “second” are used to distinguish one element, set, data, object or thing from another, and are not used to designate relative position or arrangement in time.
The terms “coupled”, “coupled to”, “coupled with”, “connected”, “connected to”, and “connected with” as used herein each mean a relationship between or among two or more devices, apparatus, files, programs, applications, media, components, networks, systems, subsystems, and/or means, constituting any one or more of (a) a connection, whether direct or through one or more other devices, apparatus, files, programs, applications, media, components, networks, systems, subsystems, or means, (b) a communications relationship, whether direct or through one or more other devices, apparatus, files, programs, applications, media, components, networks, systems, subsystems, or means, and/or (c) a functional relationship in which the operation of any one or more devices, apparatus, files, programs, applications, media, components, networks, systems, subsystems, or means depends, in whole or in part, on the operation of any one or more others thereof.
The above-described and other features and advantages of the present disclosure will be appreciated and understood by those skilled in the art from the following detailed description, drawings, and appended claims.
According to a first example embodiment of the system and method according to the present disclosure, the RRM analytics module can be hosted at one of the following components, for example: 1) RIC (e.g., Near-RT RIC) server; 2) gNodeB (e.g., at CU-UP or CU-CP); 3) 5G Network Data Analytics Function (NWDAF) server; or 4) Operations, Administration and Maintenance (OAM) server. According to the first example embodiment, various performance measurements are communicated from the gNodeB-DU to the RRM analytics module. These performance measurements include the following for each UE: Buffer Occupancy (BO) at 5QI level; Channel State Information (CSI); UE throughput; packet delay at 5QI level; Packet Error Rate (PER) at 5QI level; and PF metric. These performance measurements are analyzed at the RRM analytics module for each UE.
The RRM analytics module forms an MDP. Forming an MDP requires states, actions, transition probabilities and costs. Each performance measurement takes a certain range of values. We quantize (or classify) the range of values taken by each performance measurement into n levels, where n is a finite value. Here, we consider each performance measurement as a state variable of an MDP. There are finite state variables, and each state variable takes finite values, so the state space is finite. The DU and the RRM analytics module (e.g., hosted at CU-UP, CU-CP, or Near-RT RIC) agree upon a common state space table, the RRM Parameters Set List, as shown in Table 2 below. The RRM Parameters Set List IE contains the sets of performance measurements the gNB-DU can indicate to the RRM analytics module. We index the entries in the table using the Performance Measurement Index, shown in Table 1 below, for easy communication across different interfaces. The Performance Measurement Index IE indicates the index of the item within the RRM Parameters Set List corresponding to the current state of the UE after taking an action. The actions at each state for any UE are binary, i.e., {schedule (1), not schedule (0)}. For any active UE, if the DU wants to communicate its current performance measurements to the RRM analytics module, the DU first chooses one entry from the RRM Parameters Set List based on the state it landed upon (i.e., the current state after the action). There will be a cost associated with each action. For the case where the transition probabilities are unknown, the transition probabilities are obtained during the training phase and can be used at a later stage, as explained below in the section discussing RRM-related decisions using reinforcement learning (as part of Method 2).
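As one illustrative sketch (not the normative encoding of Tables 1-3) of how quantized performance measurements could be enumerated into an RRM Parameters Set List and addressed by a single Performance Measurement Index, the following Python fragment enumerates all level combinations; the number of levels per measurement is a hypothetical assumption:

```python
from itertools import product

# Hypothetical number of quantization levels per performance measurement
# (BO, CQI, throughput, packet delay, PER, PF metric); the real levels are
# whatever the DU and the RRM analytics module agree upon in Tables 1 and 2.
LEVELS = {"bo": 4, "cqi": 16, "throughput": 4, "delay": 3, "per": 4, "pf": 4}

# Enumerate every combination of levels once, so that the DU and the RRM
# analytics module share the same RRM Parameters Set List and can refer to
# an entry by a single Performance Measurement Index.
rrm_parameters_set_list = list(product(*(range(n) for n in LEVELS.values())))
index_of = {combo: idx for idx, combo in enumerate(rrm_parameters_set_list)}

def performance_measurement_index(bo, cqi, throughput, delay, per, pf):
    """Index of the UE's current quantized state after taking an action."""
    return index_of[(bo, cqi, throughput, delay, per, pf)]

# Example: a UE whose current measurements quantize to these levels.
idx = performance_measurement_index(bo=2, cqi=9, throughput=1, delay=0, per=1, pf=3)
```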
In the section discussing RRM related decisions using reinforcement learning (as part of Method 2), various policies that will be computed at the RRM analytics module are discussed. In this context, POLICY IE is implemented as a lookup table with (state, action) pair, i.e., for each state assigning a near-optimal action to achieve optimal allocation of radio resources. The lookup table (i.e., POLICY) will be sent to DU from the RRM analytics module based on certain triggers (and/or periodically). In this context, a Binary Action Array (which is a bit string of zeroes and ones) is included as part of the POLICY (as shown in Table 3 below) and sent with size equal to the length of Performance Measurement Index to represent the near optimal action for each state. This Binary Action Array will be used at the DU for scheduling until the next policy is received from the RRM analytics module. This will reduce the computational burden at the DU for allocating radio resources, since no computation at the DU is required for resource allocation. It is a lightweight approach from the DU point of view.
Method 1A (Machine Learning based RRM analytics module at near-RT-RIC): In the below section, a first variant of the first example system and method is presented, in which first variant the RRM analytics module is located at Near-RT RIC. In this first variant, the following steps are implemented (as shown in
It should be noted that the Near-RT RIC sends the policy (i.e., lookup table) based on RIC event trigger (e.g., specified periodicity), as shown by the process arrow 2104 in
Method 1B (Reinforcement learning based RRM analytics module at CU-UP): In the below section, a second variant of the first example system and method is presented, in which second variant the RRM analytics module is located at the CU-UP. In this second variant, the following steps are implemented (as shown in
Method 2 (RRM-related decisions using reinforcement learning): In this section, we describe the mapping of the RRM-related decision making process to a reinforcement learning problem by formalizing it using an MDP. As mentioned earlier, MDP involves four elements: States; Actions; Costs/Rewards; and Transition Probabilities. These elements are represented as follows:
We quantize (or classify) the range of values taken by each state variable to n levels, where n is a finite value (e.g., n=4, 16 or a higher number, depending on the parameter being quantized).
Normalized Buffer Occupancy (BO_normalized): We consider normalized BO for DRBs of different 5QIs. For a particular 5QI DRB of the UEi, the normalized BO is BO for the particular 5QI DRB of the UEi divided by the average packet size for that 5QI in the cell. At the UE level, the normalized BO is a summation over all the normalized BOs of different 5QI traffics carried by the UE. For example, if UEi is carrying the traffic of two different DRBs (one with 5QI 1 and another DRB with 5QI 7), e.g., 5QI 1-VoNR DRB with its BO as V1 and 5QI 7 video streaming DRB with its BO as V2, with average packet sizes as A1 for 5QI 1 DRBs in that cell and A2 for 5QI 7 DRBs in that cell, then the BO_normalized for UEi, which is represented as BO_normalizedi, is equal to (V1/A1)+(V2/A2). The range of values taken by BO_normalized for all the UEs is quantized to n1 levels (for example 4 levels). For example, if the BO_normalized is in the range of [0, B], we divide them into four equal levels, and value between [0, B/4] is taken as 0, [B/4, B/2] is taken as B/4, [B/2, 3B/4] is taken as B/2, and [3B/4, B] is taken as 3B/4.
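A minimal sketch of the BO_normalized computation and four-level quantization described above (the function names, example buffer sizes, and the value of B are illustrative assumptions):

```python
def bo_normalized(per_5qi_bo, avg_pkt_size_per_5qi):
    """Sum of per-5QI buffer occupancy divided by the cell-level average
    packet size for that 5QI (as in the V1/A1 + V2/A2 example above)."""
    return sum(bo / avg_pkt_size_per_5qi[qi] for qi, bo in per_5qi_bo.items())

def quantize_bo(value, b_max, n_levels=4):
    """Quantize BO_normalized in [0, B] into n_levels equal bins and return
    the lower edge of the bin (0, B/4, B/2, 3B/4 for four levels)."""
    bin_width = b_max / n_levels
    level = min(int(value // bin_width), n_levels - 1)
    return level * bin_width

# Example: UE carrying a 5QI 1 (VoNR) DRB and a 5QI 7 (video) DRB.
bo = bo_normalized({1: 1200.0, 7: 36000.0}, {1: 300.0, 7: 12000.0})  # 4 + 3 = 7
quantized = quantize_bo(bo, b_max=16.0)  # 7 falls in [B/4, B/2) -> returns 4.0
```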
UE Channel State Information (CSI): The time and frequency resources that can be used by the UE to report CSI are controlled by the gNB. CSI can consist of the Channel Quality Indicator (CQI), precoding matrix indicator (PMI), CSI-RS resource indicator (CRI), SS/PBCH Block Resource indicator (SSBRI), layer indicator (LI), rank indicator (RI), L1-RSRP or L1-SINR. As per 3GPP TS 38.214, CQI shall be calculated conditioned on the reported PMI, RI, and CRI. CQI reporting may be periodic or aperiodic. We consider that the average CQI, or some alternate estimate of CQI, is used to represent the UE channel conditions in each timeslot where the CQI is not reported (the CQI index ranges from 0 to 15 in the 3GPP table, and the higher the CQI index, the better the channel quality).
Delay observed by packets of a DRB in the DU: We consider the packet delay budget from the 5QI table and estimate a target delay for packets of each DRB in the DU. For example, with a PDB of 300 ms for a 5QI 9 (video streaming) DRB, we could have a target delay in the DU for this DRB of 230 ms in a deployment scenario. We refer to this target delay as the target DU PDB (e.g., the target DU PDB is equal to 230 ms in this example). We quantize the delay experienced by packets of a DRB into n2 different levels. We give an example for three levels here (i.e., n2=3). For UEi with a DRB of 5QI 9, if the waiting time of a packet in the DU for that DRB is below X1 percent of its target DU PDB (e.g., X1=80%), it is classified as level 1 (e.g., "excellent"). If the waiting time in the DU is between X1 and X2 percent of the target DU PDB (e.g., X1=80% and X2=105%), it is classified as level 2. If the waiting time in the DU is above X2 percent of the target DU PDB, it is classified as level 3.
Throughput observed by DRBs of a UE at the DU: For UEi with GBR DRB traffic, if the achieved throughput exceeds the target by at least Y1 percent (e.g., if Y1 is 10%, the achieved throughput is 110% or more of the target GBR throughput), it is classified as level 1 (e.g., "excellent"). If the achieved throughput is between the target and the target plus Y1 percent (e.g., Y1=10%, the achieved throughput is between 100% and 110% of the target throughput), it is classified as level 2 (e.g., "good"). If the achieved throughput is between the target minus Y1 percent and the target (e.g., Y1=10%, the achieved throughput is between 90% and 100% of the target throughput), it is classified as level 3 (e.g., "average"). If the achieved throughput is below the target minus Y1 percent (e.g., below 90% of the target throughput), it is classified as level 4 (e.g., "poor").
Packet error rate (PER) observed by DRBs of a UE: We consider the packet error rate (PER) from the 5QI table as the target PER for packets of each DRB in the DU. For UEi with traffic of any 5QI (e.g., VoNR), if the observed PER is at most Z1% of the target PER (e.g., if Z1 is equal to 50% and the target PER for VoNR is 10^-2, meaning one packet error out of every 100 packets, the observed PER should be at most 0.5*10^-2, i.e., one packet error out of every 200 or more packets), it is classified as level 1 (e.g., "excellent"). If the observed PER is between Z1% and Z2% of the target PER (e.g., for VoNR, if Z1 is equal to 50% and Z2 is equal to 100%, the observed PER is between 0.5*10^-2 and 10^-2), it is classified as level 2 (e.g., "good"). If the observed PER is between Z2% and Z3% of the target PER (e.g., for VoNR, if Z2 is equal to 100% and Z3 is equal to 150%, the observed PER is between 10^-2 and 1.5*10^-2), it is classified as level 3 (e.g., "average"). If the observed PER is above Z3% of the target PER (e.g., for VoNR, if Z3 is equal to 150%, the observed PER is more than 1.5*10^-2), it is classified as level 4 (e.g., "poor"). If there are multiple 5QI flows for a given UE, then we consider the worst-case PER among the 5QIs as the UE level PER and classify it into one of the multiple levels described above.
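The level classifications for packet delay, GBR throughput, and PER described above can be sketched as follows (the threshold values X1/X2, Y1, and Z1-Z3 default to the example values given above and are configurable; the function names are illustrative):

```python
def classify_delay(waiting_ms, target_du_pdb_ms, x1=0.80, x2=1.05):
    """Level 1/2/3 per the X1/X2 percentages of the target DU PDB."""
    ratio = waiting_ms / target_du_pdb_ms
    if ratio < x1:
        return 1          # "excellent"
    if ratio <= x2:
        return 2
    return 3

def classify_gbr_throughput(achieved, target, y1=0.10):
    """Level 1..4 per the Y1 percentage bands around the GBR target."""
    ratio = achieved / target
    if ratio >= 1.0 + y1:
        return 1          # "excellent"
    if ratio >= 1.0:
        return 2          # "good"
    if ratio >= 1.0 - y1:
        return 3          # "average"
    return 4              # "poor"

def classify_per(observed_per, target_per, z1=0.5, z2=1.0, z3=1.5):
    """Level 1..4 per the Z1/Z2/Z3 percentages of the target PER."""
    ratio = observed_per / target_per
    if ratio <= z1:
        return 1          # "excellent"
    if ratio <= z2:
        return 2          # "good"
    if ratio <= z3:
        return 3          # "average"
    return 4              # "poor"

# Example: 5QI 9 DRB with target DU PDB of 230 ms and a 180 ms waiting time.
level = classify_delay(180.0, 230.0)   # 180/230 ≈ 0.78 < 0.80 -> level 1
```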
Action (A): We consider the action space as binary for each UE. For UEi with action space represented as ai(t)∈{0,1}, ai(t)=1 represents scheduling the UEi (i.e., allocating PRBs based on various policies), and ai(t)=0 represents not scheduling the UEi (i.e., not allocating any PRBs to UEi). The common action across all K UEs in the cell is represented as an array of the actions for each individual UE, i.e., (a1, a2, . . . , aK). It should be noted that, at time t, if there is no data for a particular UE, then this UE is not scheduled, and the corresponding action for this UE is ai(t)=0.
Transition probabilities (P): For the K UEs in the cell, if the realized state at time t represented as
Table 4 below provides an illustrative example to show the dimensions of the lookup table. In this example, let's assume there are 3 UEs in the cell and each of the state variables for any UE can take two values, as follows: Normalized_BO can be {0, 1}; CQI (of CSI) can be {3, 10}; packet delay can be {Good, Bad}; throughput can be {Good, Bad}; packet error rate can be {Good, Bad}; and PF metric can be {Low, High}. In this case, the state space can be represented as shown in Table 4 with the 64 possible states. There are six state variables and each state variable takes two values, which means 2^6 = 64 combinations are possible per UE. In addition, the state space is finite, and at each state there are two possible actions, A = {0, 1}, so the transition probability matrix is of size (64*2)^3 = 2,097,152.
We can use hashing and/or other techniques to reduce the computational effort of this method.
For the sake of clarity and easier explanation, we have considered binary CQI indices. However, shown below in Table 5 is an actual CQI table from 3GPP TS 38.214, with indices ranging from 0 to 15, which represents a bigger state space.
Costs (C(s,a)): At time t, the immediate cost for UEi when this UEi is in state 's' with action 'a' is represented as C(si(t)=s, ai(t)=a)=Ci(s(t),a(t)). The cost of all the K UEs in the cell, C(s(t), a(t)), is the sum of the individual UE level costs Ci(s(t), a(t)) over the K UEs in the cell.
Below, we define Ci(s(t),a(t)) as the individual UE level cost for each of the state variables; summing these over all the active UEs in the cell results in the cell level cost as described above. For inactive UEs, the optimal action is not to schedule, and the associated cost is always zero as long as they stay inactive. We now define each of these cost functions below.
The BO immediate cost: The BO immediate cost for a UE when the action is equal to zero (i.e., this UE is not scheduled) is BO_normalizedi. The BO immediate cost when the action is equal to one is sum of ciBO and min {0, BO_normalizedi−Sched_data_normalizedi}, where ciBO is the cost for scheduling UEi with its BO and Sched_data_normalizedi is summation over normalized scheduled data per 5QI level (similar to that of BO_normalizedi). At UE level, normalized scheduled data is summation over all the normalized scheduled data of different 5QI traffics carried by the UE. For example, if UEi is carrying traffic of two different DRBs (e.g., one with 5QI 1 and another DRB with 5QI 7), 5QI 1-VoNR DRB with its scheduled data as U1 and 5QI 7 Video streaming DRB with its scheduled data as U2, with average packet sizes as A1 for 5QI 1 DRBs in that cell and A2 for 5QI 7 DRBs in that cell, then the normalized scheduled data for UEi which is represented as Sched_data_normalizedi is equal to (U1/A1)+(U2/A2). Accordingly, the following expanded equation for the BO immediate cost is provided:
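In case-wise form (one consistent way of writing the description above, with ciBO, BO_normalizedi, and Sched_data_normalizedi as already defined):

```latex
C_i^{BO}(s,a) =
\begin{cases}
\mathrm{BO\_normalized}_i, & a_i(t) = 0,\\[4pt]
c_i^{BO} + \min\{0,\ \mathrm{BO\_normalized}_i - \mathrm{Sched\_data\_normalized}_i\}, & a_i(t) = 1.
\end{cases}
```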
The PDB immediate cost: The PDB immediate cost for a UEi when the action is equal to zero is the maximum of zero and the difference of the RLC queuing delay (QdelayRLC,i) and the PDB. When the action is equal to one, the PDB immediate cost is simply the cost for taking the action equal to one, which is ciPDB. QdelayRLC,i=(ti−TRLC,i) is the delay of the oldest RLC packet in the QoS flow that has not been scheduled yet and is calculated as the difference in time between the SDU insertion in the RLC queue and the current time, where ti := current time instant and TRLC,i := time instant when the oldest SDU was inserted in the RLC queue. The PDB also corresponds to the oldest RLC packet in the QoS flow that has not been scheduled yet. Accordingly, the following expanded equation for the PDB immediate cost is provided:
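In case-wise form (one consistent notation for the description above):

```latex
C_i^{PDB}(s,a) =
\begin{cases}
\max\{0,\ \mathrm{Qdelay}_{RLC,i} - \mathrm{PDB}_i\}, & a_i(t) = 0,\\[4pt]
c_i^{PDB}, & a_i(t) = 1,
\end{cases}
\qquad \mathrm{Qdelay}_{RLC,i} = t_i - T_{RLC,i}.
```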
The GBR immediate cost: The GBR immediate cost for a UE when the action is equal to zero is the remaining data. The GBR immediate cost when the action is equal to one is the sum of ciGBR and the difference of remaining data (remData) and scheduled data (schedData).
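Written case-wise in the same notation (one consistent reading of the description above):

```latex
C_i^{GBR}(s,a) =
\begin{cases}
\mathrm{remData}_i, & a_i(t) = 0,\\[4pt]
c_i^{GBR} + \left(\mathrm{remData}_i - \mathrm{schedData}_i\right), & a_i(t) = 1.
\end{cases}
```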
The PER immediate cost: The PER immediate cost for a UE when the action is equal to zero is the previous slot PER value, i.e., PER(t−1). The PER immediate cost when the action is equal to one is the sum of ciPER and current slot PER value, i.e., PER(t). As per 3GPP TS 23.501, PER defines an upper bound for the rate of PDUs (e.g., IP packets) that have been processed by the sender of a link layer protocol (e.g., RLC in RAN of a 3GPP access) but that are not successfully delivered by the corresponding receiver to the upper layer (e.g., PDCP in RAN of a 3GPP access). The PER defines an upper bound for a rate of non-congestion related packet losses. PER at time t is the ratio of number of packets not successfully delivered by the receiver to the upper layers to the number of packets transmitted. Accordingly, PER(t−1) and PER(t) can be expressed as follows:
In the above expressions, pkts_unsuccessful_t is the number of packets not successfully delivered in time-slot t, pkts_transmitted_t is the number of packets transmitted in time-slot t. The following expanded equation for the PER immediate cost is provided:
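Using the quantities just defined, PER(t−1), PER(t), and the PER immediate cost can be written (in one consistent notation) as:

```latex
\mathrm{PER}_i(t) = \frac{\mathrm{pkts\_unsuccessful}_t}{\mathrm{pkts\_transmitted}_t},
\qquad
\mathrm{PER}_i(t-1) = \frac{\mathrm{pkts\_unsuccessful}_{t-1}}{\mathrm{pkts\_transmitted}_{t-1}},

C_i^{PER}(s,a) =
\begin{cases}
\mathrm{PER}_i(t-1), & a_i(t) = 0,\\[4pt]
c_i^{PER} + \mathrm{PER}_i(t), & a_i(t) = 1.
\end{cases}
```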
The PF immediate cost: The PF immediate cost for a UE when the action is equal to zero is the difference of PFmax and the ratio of ri(t) to Rave,i(t−1). The PF immediate cost for a UE when the action is equal to one is the difference of ciPF+PFmax and the ratio of ri(t) to Rave,i(t−1).
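In case-wise form (with ri(t) and Rave,i(t−1) as in the PF metric above; the notation is illustrative):

```latex
C_i^{PF}(s,a) =
\begin{cases}
PF_{\max} - \dfrac{r_i(t)}{R_{ave,i}(t-1)}, & a_i(t) = 0,\\[8pt]
c_i^{PF} + PF_{\max} - \dfrac{r_i(t)}{R_{ave,i}(t-1)}, & a_i(t) = 1.
\end{cases}
```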
The cell-throughput-related immediate cost: The cell throughput related immediate cost is defined as follows:
As mentioned earlier, the overall cost function C(s(t), a(t)) is the sum of the weighted cost functions of the state variables BO, PDB, GBR, PER, and PF, and the cell level cost, as shown below.
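One consistent way to write this weighted sum (the weight symbols w are illustrative labels for the tunable weight factors mentioned below) is:

```latex
C\big(\bar{s}(t), \bar{a}(t)\big) \;=\; \sum_{i=1}^{K}
\Big[\, w_{BO}\, C_i^{BO} + w_{PDB}\, C_i^{PDB} + w_{GBR}\, C_i^{GBR}
      + w_{PER}\, C_i^{PER} + w_{PF}\, C_i^{PF} \,\Big]
\;+\; w_{CL}\, C_{CL}\big(\bar{s}(t), \bar{a}(t)\big).
```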
We can have multiple variants of this cost function, e.g., by considering only some of the state variables and setting the remaining weight factors to zero based on the policies that we want to use. For example, one can consider the overall cost as a function of PDB, GBR and PF only, i.e., by setting the weight factors of the BO, PER, and cell level cost terms to zero.
From Value function, we can observe that for each state there is an associated immediate cost and the additional discounted costs associated with the subsequent state and action pairs. The discount factor essentially determines how much the reinforcement learning agents care about costs in the distant future relative to those in the immediate future. Normally, discount factor ranges from 0 to 1, i.e., γ∈[0,1).
Value iteration (Policy 1): For the unknown transition probability matrix, we initialize the matrix with zero and update it in the following manner. At state s, after taking action a, if the system lands in state s′, then we update P(s′|s, a)=1. Later, if at the same state s, taking the same action a, the system lands in s″, then we update P(s′|s, a)=P(s″|s, a)=0.5. In general, we update the transition probabilities based on i) the different states in which the system landed and ii) the number of times the system landed in each state.
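A minimal Python sketch of this Policy 1 procedure, under the simplifying assumptions of an enumerated finite state space and a learned immediate-cost table (both placeholders), is shown below; the empirical probabilities reproduce the update just described:

```python
from collections import defaultdict

GAMMA = 0.9
STATES = range(64)     # illustrative finite state space (cf. Table 4)
ACTIONS = (0, 1)       # 0 = do not schedule, 1 = schedule

# transition_counts[(s, a)][s_next] counts how often taking action a in state s
# led to s_next; the empirical probabilities below reproduce the Policy 1 update
# (1.0 after one observation, 0.5 each after landing in two different states, ...).
transition_counts = defaultdict(lambda: defaultdict(int))

def observe(s, a, s_next):
    transition_counts[(s, a)][s_next] += 1

def p(s_next, s, a):
    counts = transition_counts[(s, a)]
    total = sum(counts.values())
    return counts[s_next] / total if total else 0.0

def q_value(s, a, cost, V):
    """Immediate cost plus discounted expected value of the next state."""
    return cost[(s, a)] + GAMMA * sum(p(s2, s, a) * V[s2]
                                      for s2 in list(transition_counts[(s, a)]))

def value_iteration(cost, n_iter=200):
    """cost[(s, a)] is the learned immediate cost; returns (V, policy)."""
    V = {s: 0.0 for s in STATES}
    for _ in range(n_iter):
        V = {s: min(q_value(s, a, cost, V) for a in ACTIONS) for s in STATES}
    policy = {s: min(ACTIONS, key=lambda a: q_value(s, a, cost, V)) for s in STATES}
    return V, policy

# Example (illustrative): placeholder costs and two observed transitions.
cost = {(s, a): float(a == 0) for s in STATES for a in ACTIONS}
observe(0, 1, 3); observe(0, 1, 5)      # yields P(3|0,1) = P(5|0,1) = 0.5
V, policy = value_iteration(cost)
```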
Policy 2: After the initial learning of transition probabilities as mentioned in Policy 1, we run the reinforcement learning method with exploration and exploitation strategy. In the exploration stage, we choose the action from the set of all possible actions uniformly at random. For example, in a given time slot, if there are four active UEs, then there are sixteen possible combinations of actions, i.e., {0000, 0001 . . . , 1111}. If the policy chooses exploration, then it chooses any one of the actions from the 16 possible actions with 1/16 probability. In the exploitation stage, Policy 2 chooses the action that gave the minimum cost till now (during the training phase). For example, in a given time slot, if the policy chooses exploitation and has seen three actions {0000, 0011, 1111} till now, and the associated costs are 10, 20, and 30, respectively, then the policy chooses the action corresponding to the minimum cost, which is 10, i.e., the policy chooses action 0000. In this method, exploitation is chosen with probability (1 minus delta) and exploration is chosen with probability delta. Here, delta is a parameter controlling the amount of exploration vs. exploitation. This “delta” parameter could be a fixed parameter or it can be adjusted either according to a schedule (e.g., making the agent explore progressively less), or adaptively based on some policies.
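The exploration/exploitation selection of Policy 2 can be sketched as follows (the function names and the default delta value are illustrative assumptions):

```python
import random

def choose_joint_action(active_ues, best_known_action, delta=0.1):
    """Delta-greedy selection over the joint per-UE binary action space.

    With probability `delta` we explore: pick any of the 2^K schedule /
    not-schedule combinations uniformly at random (e.g., 16 combinations
    for 4 active UEs). Otherwise we exploit: reuse the joint action with
    the minimum cost observed so far during training.
    """
    if best_known_action is None or random.random() < delta:
        return tuple(random.choice((0, 1)) for _ in active_ues)
    return best_known_action

def best_action(observed_costs):
    """Return the joint action with minimum observed cost, e.g.,
    {(0,0,0,0): 10, (0,0,1,1): 20, (1,1,1,1): 30} -> (0,0,0,0)."""
    return min(observed_costs, key=observed_costs.get) if observed_costs else None

# Example usage for four active UEs.
costs_seen = {(0, 0, 0, 0): 10, (0, 0, 1, 1): 20, (1, 1, 1, 1): 30}
action = choose_joint_action(["ue1", "ue2", "ue3", "ue4"], best_action(costs_seen))
```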
For a finite state space with finite actions, we can obtain near-optimal action for each state. Once the policy is decided, it is used to create a candidate list of UEs which can be scheduled (for a given state) in a given slot. A subset of UEs is picked up from this candidate list and allocated resources based on some policies which could consider QoS requirements (such as delay, throughput, etc.), buffer depth (of pending packets at the base station for the corresponding logical channel), and other parameters for each UE.
In a given slot and for a given state of the system, candidate UEs to serve in the slot are selected as per Policy 1 or Policy 2 above. A given base station system can serve only a certain number of UEs in a slot (e.g., a maximum of Z UEs in a slot). If the number of selected candidate UEs is higher than what the base station can serve in that slot (i.e., Z in this example), the radio resource management (RRM) module selects Z UEs using various policies (e.g., selecting Z UEs which have packets that may miss their delay targets if not scheduled, or selecting Z UEs which may miss their throughput and delay targets if not scheduled). These selected UEs (i.e., Z UEs from the candidate set if the number of UEs in the candidate set is more than Z, or all the candidate UEs if the number of candidate UEs is less than Z) are allocated resources (e.g., PRBs) based on another set of policies (e.g., drain the full buffer for each selected UE, or serve a subset of packets queued for each UE and give more of the Z UEs the opportunity to be served in that slot) as long as PRBs are available in that slot.
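A minimal sketch of this candidate-selection step (the urgency metric and the value of Z are illustrative placeholders for the policies described above):

```python
def select_ues_to_serve(candidate_ues, z_max, urgency):
    """Pick at most z_max UEs from the policy's candidate list.

    `urgency(ue)` is any implementation-specific metric (e.g., how close the
    UE's pending packets are to missing their delay or throughput targets);
    higher urgency is served first when the candidate list exceeds z_max.
    """
    if len(candidate_ues) <= z_max:
        return list(candidate_ues)
    return sorted(candidate_ues, key=urgency, reverse=True)[:z_max]

# Example with hypothetical UE ids and urgency scores.
served = select_ues_to_serve([11, 12, 13, 14], z_max=2,
                             urgency=lambda ue: {11: 0.2, 12: 0.9,
                                                 13: 0.5, 14: 0.7}[ue])
# -> [12, 14]
```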
While the present disclosure has been described with reference to one or more exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the present disclosure. For example, although the example methods have been described in the context of 5G cellular networks, the example methods are equally applicable for 4G, 6G and other similar wireless networks. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the disclosure without departing from the scope thereof. Therefore, it is intended that the present disclosure not be limited to the particular embodiment(s) disclosed as the best mode contemplated, but that the disclosure will include all embodiments falling within the scope of the appended claims.
For the sake of completeness, a list of abbreviations used in the present specification is provided below: