Embodiments herein relate to a central node and a method therein. In some aspects they relate to controlling an exploration strategy associated to Reinforcement Learning (RL) in one or more RL modules in a distributed node in a Radio Access Network (RAN).
Embodiments herein further relate to computer programs and carriers corresponding to the above method and central node.
In a typical wireless communication network, wireless devices, also known as wireless communication devices, mobile stations, stations (STA) and/or User Equipment (UE), communicate via a Local Area Network such as a Wi-Fi network or a Radio Access Network (RAN) to one or more core networks (CN). The RAN covers a geographical area which is divided into service areas or cell areas, which may also be referred to as a beam or a beam group, with each service area or cell area being served by a radio network node such as a radio access node e.g., a Wi-Fi access point or a radio base station (RBS), which in some networks may also be denoted, for example, a NodeB, eNodeB (eNB), or gNB as denoted in 5G. A service area or cell area is a geographical area where radio coverage is provided by the radio network node. The radio network node communicates over an air interface operating on radio frequencies with the wireless device within range of the radio network node.
Specifications for the Evolved Packet System (EPS), also called a Fourth Generation (4G) network, have been completed within the 3rd Generation Partnership Project (3GPP) and this work continues in the coming 3GPP releases, for example to specify a Fifth Generation (5G) network also referred to as 5G New Radio (NR) or Next Generation (NG). The EPS comprises the Evolved Universal Terrestrial Radio Access Network (E-UTRAN), also known as the Long Term Evolution (LTE) radio access network, and the Evolved Packet Core (EPC), also known as System Architecture Evolution (SAE) core network. E-UTRAN/LTE is a variant of a 3GPP radio access network wherein the radio network nodes are directly connected to the EPC core network rather than to RNCs used in 3G networks. In general, in E-UTRAN/LTE the functions of a 3G RNC are distributed between the radio network nodes, e.g. eNodeBs in LTE, and the core network. As such, the RAN of an EPS has an essentially “flat” architecture comprising radio network nodes connected directly to one or more core networks, i.e. they are not connected to RNCs. To compensate for that, the E-UTRAN specification defines a direct interface between the radio network nodes, this interface being denoted the X2 interface.
Multi-antenna techniques may significantly increase the data rates and reliability of a wireless communication system. The performance is in particular improved if both the transmitter and the receiver are equipped with multiple antennas, which results in a Multiple-Input Multiple-Output (MIMO) communication channel. Such systems and/or related techniques are commonly referred to as MIMO.
Deep Reinforcement Learning (RL)
A neural network is essentially a Machine Learning model, more precisely a Deep Learning model, that is used in both supervised learning and unsupervised learning. A Neural Network is a web of interconnected entities known as nodes, wherein each node is responsible for a simple computation.
RL is a powerful technique to efficiently learn a behavior of a system within a dynamic environment. By incorporating recent advances in deep artificial neural networks, deep RL (DRL) has been shown to enable significant autonomy in complex real-world tasks. DRL uses deep learning and reinforcement learning principles to create efficient algorithms applied on areas like robotics, video games, computer science, computer vision, education, transportation, finance, healthcare, etc. As a result, DRL approaches are quickly becoming state-of-the-art in robotics and control, online planning, and autonomous optimization.
Despite its significant success, the intuition behind DRL is relatively simple. For an observed environment state, a DRL agent attempts to learn the optimal action by exploring the space of available actions. For an observed state ‘S[t]’ at time ‘t’, the DRL agent selects an action ‘a[t]’ that is predicted to maximize the cumulative discounted rewards over the next several time intervals. The heuristically-configured discounting factor avoids actions that maximize the immediate, short-term, reward but lead to poor states in the future. After taking an action, the DRL agent feeds back the reward into a learning module, typically a neural network, which learns to make better action choices in subsequent time intervals.
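For illustration only, and using notation that is an assumption herein rather than part of the embodiments, the cumulative discounted reward that the DRL agent attempts to maximize at time t may be written as:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}, \qquad 0 \le \gamma < 1
```

where γ is the heuristically-configured discounting factor and r denotes the reward. A γ close to 0 emphasizes the immediate, short-term reward, whereas a γ close to 1 also values rewards far in the future.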
At the beginning of its operation, the DRL agent has incomplete, often zero, knowledge of the system. Depending on the tolerance of the system to occasional failures, the agent may either choose to collect data for offline learning through an existing policy, which is safer, or select actions online in some randomized manner, which is efficient. In either case, the collected data is used to iteratively update the model, for example the weight and bias variables within a neural network. The training parameters, such as the size of the neural network, the number of iterative updates, and the parameter update scheme, are all configured heuristically based on empirical findings from state-of-the-art DRL implementations. As the DRL agent learns the true value of actions over time, the need for exploring random actions decreases as well. This decrease is encoded in an exploration rate variable whose value is slowly reduced to nearly zero with time.
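The following is a minimal, non-limiting sketch in Python of how such an exploration rate variable may be decayed over time while selecting actions; the names epsilon, decay and q_values are assumptions for illustration and not part of the embodiments.

```python
import random

epsilon = 1.0          # initial exploration rate: explore almost always
epsilon_min = 0.01     # exploration is reduced to nearly zero, not exactly zero
decay = 0.995          # heuristic per-step decay factor

def select_action(q_values):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

def step_done():
    """After each learning step, slowly reduce the exploration rate."""
    global epsilon
    epsilon = max(epsilon_min, epsilon * decay)
```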
The majority of radio network management and optimization problems are about tuning parameters to adapt to the local propagation environment, traffic patterns, service types and UE device capabilities. DRL is a promising technique to automate such tuning. In the context of radio networks, DRL has recently been proposed for several challenging cellular network problems, ranging from data rate selection and beam management to trajectory optimization for aerial base stations.
Machine Learning Architectures in Radio Networks
A radio network consists of multiple distributed base stations. The RL policy may be trained and/or inferred in a centralized, distributed or hybrid manner.
Further shown are a Training for global node 210, a Training for local distributed node 1 referred to as 211a, a Training for local distributed node n referred to as 211n, an Inference for local distributed node 1 referred to as 221a, and an Inference for local distributed node n referred to as 221n.
Yet further shown are a Global Training orchestrator, e.g. a learning orchestrator, referred to as 230, a Distributed node 1 referred to as 222a, and a Distributed node n referred to as 222n.
Solid lines illustrate data movement of training data. Dotted lines illustrate model deployments, i.e. from trained models to inference using the trained models. Dashed lines illustrate the communication of model weights and training also referred to as learning, hyper parameters.
In the distributed learning architecture in
Since the memory and computation power of the distributed nodes are usually limited, the training can be moved to a central node as shown in the centralized learning local inference architecture in
The hybrid learning architecture in
E-UTRAN and NG-RAN Architecture Options
The current 5G RAN (NG-RAN) architecture is depicted and described in 3GPP TS 38.401 v15.4.0 as follows. Mapped to the RL architecture, centralized learning functions may be located in either the Fifth Generation Core network (5GC) or the gNB-Central Unit (CU), and the gNB-Distributed Unit (DU) is an example of the distributed node.
A gNB may also be connected to an LTE eNB via the X2 interface. In this architectural option an LTE eNB connected to the Evolved Packet Core network is connected over the X2 interface with a so called nr-gNB. The latter is a gNB not connected directly to a CN and connected via X2 to an eNB for the sole purpose of performing dual connectivity.
In yet another architecture option a gNB may be connected to an eNB via an Xn interface. In this option both gNB and eNB are connected to the 5GC and can communicate over the Xn interface.
It is worth noticing that RAN nodes can not only communicate via direct interfaces such as the X2 and Xn but also via CN interfaces such as the NG and S1 interfaces. Such communication requires the involvement of CN nodes and/or transport nodes (such as IP packet routers, Ethernet switches, microwave links or optical ROADMs) to route and forward messages from the source RAN node to the target RAN node.
The architecture in
RL Exploration and Exploitation in Radio Networks
One challenge with the RL technique, compared with rule-based methods, is the risk of significant performance degradation in the radio network when taking random actions. For example, performance degradation in the form of coverage holes might be a result of an action of reducing cell transmission power. Such risks are rooted in the way an RL agent explores the environment.
The balance between exploration and exploitation is a key aspect of RL when deciding which action to take. While exploitation is about taking advantage of the learning in the past, exploration is a procedure to learn new knowledge, e.g. by taking random actions and observing the consequences. Usually, a RL agent applies a high exploration rate in the beginning phase of learning when the policy has only been trained with limited amount of data samples. As the training continues and the trained policy becomes more reliable, the exploration rate is gradually reduced to a value close to zero.
One way to reduce the risk of taking random actions during exploration is to craft the action space so that all actions are more or less safe to the system. To craft, as used herein, means to define a set of allowed actions for an individual state or a group of states. At least, no catastrophic consequences should occur by taking any action. In one prior-art method, a heuristic model is deployed in parallel to an RL policy. When the performance of the RL policy degrades below a threshold, the heuristic model is activated to replace the RL policy.
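A non-limiting sketch of such crafting of the action space is given below; the state groups and action names are assumptions for illustration only and are not taken from the embodiments.

```python
import random

# Only actions deemed safe for a given state or state group may be selected at
# random during exploration, so that no catastrophic consequence can occur.
ALLOWED_ACTIONS = {
    "high_load":   ["keep_power", "increase_power_slightly"],
    "normal_load": ["keep_power", "increase_power_slightly", "decrease_power_slightly"],
}

def safe_random_action(state_group, rng=random):
    """Sample a random action only among the actions allowed for this state group."""
    candidates = ALLOWED_ACTIONS.get(state_group, ["keep_power"])
    return rng.choice(candidates)
```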
Learning an RL strategy, also referred to as a policy or a model, that performs well requires proper exploration to produce rich training data samples. During exploration, an RL agent may follow a randomized exploration strategy to explore combinations of states and actions that would otherwise remain unknown. While this possibly allows the agent to learn better state-action combinations from which its policy can be improved, taking an action at random in a given state of the system may also lead to suboptimal behavior and therefore to a degradation of the user experience and/or of system availability, accessibility, reliability and retainability.
As a part of developing embodiments herein a problem was identified by the inventors and will first be discussed.
As such, while it is necessary to explore actions at random to learn unseen parts of the state-action space, the resulting RAN system performance, e.g. availability, accessibility, reliability and retainability, and user experience may be negatively affected by the exploration. It is therefore necessary to control and optimize the collection of data samples via proper exploration strategies, so as to minimize the system performance degradation due to exploration.
In addition to the exploration rate, efficient operation of DRL requires careful tuning of training parameters, including but not limited to the discount factor, the number of parameter update iterations, the parameter update scheme, etc. A discount factor, when used herein, means the weight of future rewards with respect to the immediate reward. It is computationally very expensive to obtain the optimal training parameters. The agent typically tries out different parameter configurations and selects those that best improve the learning performance. Hence, techniques that efficiently select the optimal training parameters lead to improvements in the overall system performance.
An object of embodiments herein is to provide an improved performance of a RAN using RL with low risk of instantaneous performance degradation due to the exploration.
According to an aspect, the object is achieved by a method performed by a central node for controlling an exploration strategy associated to RL in one or more RL modules in a distributed node in a RAN. The central node evaluates a cost of actions performed for explorations in the one or more RL modules, and a performance of the one or more RL modules. Based on the evaluation, the central node determines one or more exploration parameters associated to the exploration strategy. The central node controls the exploration strategy by configuring the one or more RL modules with the determined one or more exploration parameters to update its exploration strategy. This enforces the respective one or more RL modules to act according to the updated exploration strategy to produce data samples for the one or more RL modules in the distributed node.
According to another aspect, the object is achieved by a central node configured to control an exploration strategy associated to RL in one or more RL modules in a distributed node in a RAN. The central node is further configured to:
Thanks to the evaluated cost of actions performed for explorations in the one or more RL modules, and the evaluated performance of the one or more RL modules, which e.g. identify services of high importance or strict requirements, the central node may determine the one or more exploration parameters associated to the exploration strategy to achieve a reduced exploration in the presence of the identified services of high importance or strict requirements. In this way, a reduced impact of performance degradation of the RAN is achieved by a reduced exploration in the presence of services of high importance or strict requirements according to the evaluation. This in turn provides an improved performance of the RAN and an improved level of user satisfaction using RL.
a, b, and c are schematic block diagrams illustrating prior art.
a and b are schematic block diagrams depicting embodiments of a wireless communication network.
a and b are schematic block diagrams depicting embodiments in a central node.
An example of embodiments herein relates to methods for controlling exploration and training strategies associated to RL in a wireless communications network.
Embodiments herein are e.g. related to Radio network optimization, Network Management, Reinforcement Learning, and/or Machine Learning.
In some examples of embodiments herein it is provided a signaling method between a central node and a distributed node to exchange control messages to properly configure exploration and training parameters of an RL algorithm in the distributed node.
Network nodes, such as a distributed node 110, operate in the RAN 102. The distributed node 110 may provide radio access in one or more cells in the RAN 102. This may mean that the distributed node 110 provides radio coverage over a geographical area by means of its antenna beams. The distributed node 110 may be a transmission and reception point e.g. a radio access network node such as a base station, e.g. a radio base station such as a NodeB, an evolved Node B (eNB, eNode B), an NR Node B (gNB), a base transceiver station, a radio remote unit, an Access Point Base Station, a base station router, a transmission arrangement of a radio base station, a stand-alone access point, a Wireless Local Area Network (WLAN) access point, an Access Point Station (AP STA), an access controller, a UE acting as an access point or a peer in a Device to Device (D2D) communication, or any other network unit capable of communicating with a radio device within the cell served by network node 110 depending e.g. on the radio access technology and terminology used.
The distributed node 110 comprises one or more RL modules 111. The distributed node 110 is adapted to execute RL in the one or more RL modules 111.
UEs such as the UE 120 operate in the wireless communications network 100. The UE 120 may e.g. be an NR device, a mobile station, a wireless terminal, an NB-IoT device, an eMTC device, a CAT-M device, a WiFi device, an LTE device, or a non-access point (non-AP) STA, i.e. a STA, that communicates via e.g. the distributed node 110 and one or more RANs, such as the RAN 102, to one or more CNs. It should be understood by those skilled in the art that the UE 120 relates to a non-limiting term which means any UE, terminal, wireless communication terminal, user equipment, Device-to-Device (D2D) terminal, or node, e.g. a smart phone, laptop, mobile phone, sensor, relay, mobile tablet or even a small base station communicating within a cell.
Core network nodes, such as e.g. a central node 130, operate in the CN. The central node 130 is adapted to control exploration strategies associated to RL in the one or more RL modules 111 in the distributed node 110, e.g. by means of an exploration controller 132 in the central node 130.
Methods herein may e.g. be performed by the central node 130. As an alternative, a Distributed Node (DN) and functionality, e.g. comprised in a cloud 140 as shown in
In some example embodiments, the distributed node 110 is an eNB and/or gNB and the central node 130 may be an Operation and Maintenance (OAM) node. One or more RL modules 111 are located in the distributed node 110. The respective one or more RL module 111 is a module that trains a policy and uses the policy to infer an action, e.g. changing the values of one or multiple configuration parameters in the distributed node 110. An exploration controller 132 may be located in the central node 130. The exploration controller 132 is a unit that may decide the value of one or multiple exploration parameters for the RL modules 111.
The central node 130 has access to knowledge related to the cost of random actions taken by the RL modules 111 for exploration and the performance of the RL modules 111 in distributed nodes such as the distributed node 110.
Based on this knowledge, the central node 130 may
Based on this knowledge, the central node 130 may further
Exploration
The wordings exploration and exploration strategy when used herein e.g. mean the behaviour of the one or more RL modules 111 to probe state transitions and the resulting reward in an environment by randomly selecting an action.
The one or more exploration parameters to be determined herein will e.g. be used by the one or more RL modules 111 to decide the frequency of selecting a random action and/or the candidate actions that can be randomly selected in a given state.
Training
Compared to exploration and exploration strategy, the wordings training and training strategy when used herein e.g. mean the process of updating a policy based on the observed state transition and resulting reward after taking an action.
The one or more training parameters to be determined may e.g. be used by the RL module to control the training process by specifying the configuration of the methods for ML model update.
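As a purely illustrative example, assuming a simple key-value representation that is not defined by the embodiments, such training parameters could look as follows:

```python
# Hypothetical example of training parameters; the names and values are
# assumptions for illustration only, not a normative format.
training_parameters = {
    "discount_factor": 0.95,    # weight of future rewards vs. the immediate reward
    "update_iterations": 10,    # number of parameter update iterations per batch
    "update_scheme": "adam",    # parameter update scheme for the neural network
    "learning_rate": 1e-3,
    "batch_size": 64,
}
```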
The types and formats of the parameters associated to an exploration strategy that may be signaled with the control message are explained in more detail below.
Upon the reception of the message, the distributed node 110 applies the exploration and the training parameters configured by the central node 130 to the corresponding exploration strategy and training strategy for one or more RL modules 111.
Embodiments herein may provide following advantages:
Example embodiments of the provided method control the exploration strategy and possibly the training strategy in the distributed node 110, e.g. by the exploration controller 132 located in the central node 130 where richer knowledge is available, e.g. compared to the distributed node 110. The richer knowledge may comprise, in the serving area of the distributed node 110, whether there are prioritized users, whether the served traffic is critical, whether there is an important event, etc.
This results in:
The method comprises one or more of the following actions, which actions may be taken in any suitable order. Actions that are optional are marked with dashed boxes in the figure.
Action 401
The central node 130 evaluates a cost of actions performed for explorations in the one or more RL modules 111 and a performance of the one or more RL modules 111.
The cost of actions performed for explorations e.g. means degraded user experience with lower throughput and/or higher latency and degraded system performance with worse availability, accessibility, reliability and/or retainability. The cost of actions performed for explorations may e.g. be evaluated by predicting the outcome of the actions based on knowledge obtained from domain experts and/or past experiences.
The performance of the one or more RL modules 111 means the capability to achieve high rewards which is related to user experiences and system performance. The performance of the one or more RL modules 111 may e.g. be evaluated by the value of reward signals and/or Key Performance Indicators (KPIs) indicating user experience and system performance.
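A minimal, non-limiting sketch of such an evaluation is given below; the field names, KPIs and the way the cost is combined are assumptions for illustration only.

```python
def evaluate_rl_module(kpis, predicted_exploration_impact):
    """Return an exploration-cost estimate and a performance estimate.

    kpis: observed indicators, e.g. reward, throughput, latency, reliability.
    predicted_exploration_impact: predicted degradation caused by random
    actions, e.g. estimated from domain-expert knowledge or past experience.
    """
    # Performance: e.g. the observed reward, or a weighted KPI score.
    performance = kpis["reward"]

    # Cost of exploration: predicted user-experience and system degradation.
    cost = (predicted_exploration_impact["throughput_loss"]
            + predicted_exploration_impact["latency_increase"])
    return cost, performance
```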
Action 402
Based on the evaluation, the central node 130 determines one or more exploration parameters associated to the exploration strategy.
These one or more exploration parameters may later be used by the distributed node 110 for an exploration procedure according to the exploration strategy, i.e. the procedure to learn new knowledge, e.g. by taking random actions according to the determined one or more exploration parameters and observing the consequences.
In some embodiments the one or more exploration parameters is determined for a specific cell or group of cells controlled by the distributed node 110.
The one or more exploration parameters may be determined further based on any one or more out of: Which may mean that the cost of actions performed for explorations in the one or more RL modules 111 and the performance of the one or more RL modules 111 may comprise any one or more out of:
The one or more exploration parameters may comprise any one or more out of:
Action 403
The central node 130 controls the exploration strategy by configuring the one or more RL modules 111 with the determined one or more exploration parameters to update its exploration strategy. To update its exploration strategy e.g. means to change the frequency of selecting a random action and/or changing the candidate actions that may be randomly selected in a given state.
This enforces the respective one or more RL modules 111 to act according to the updated exploration strategy to produce data samples for the one or more RL modules 111 in the distributed node 110. To act according to the updated exploration strategy to produce data samples means to select an action according to the updated exploration strategy and observe the system transition and resulting reward.
It is an advantage that the central node 130 controls the exploration strategy since the central node 130 may possess more knowledge than the distributed node 110 to evaluate the cost of the exploration in the distributed node 110.
In some embodiments the central node 130 configures the one or more RL modules 111 with the determined one or more exploration parameters by sending the one or more exploration parameters in a first control message.
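As a purely hypothetical illustration of what such a first control message could carry, assuming a key-value encoding that is not defined by the embodiments:

```python
# Hypothetical first control message from the central node 130 to the
# distributed node 110; all field names and values are illustrative assumptions.
first_control_message = {
    "message_type": "EXPLORATION_CONFIG",
    "target_rl_modules": ["rl_module_111"],
    "cells": ["cell_1", "cell_2"],               # optional per-cell or cell-group scope
    "exploration_parameters": {
        "strategy": "epsilon_greedy",
        "epsilon": 0.05,                          # frequency of random actions
        "allowed_random_actions": ["keep_power", "increase_power_slightly"],
    },
}
```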
In some embodiments the method is further performed for controlling a training strategy associated to the RL in the one or more RL modules 111 in the distributed node 110. In these embodiments, the below actions 404-405 are performed.
Action 404
In these embodiments the central node 130 determines one or more training parameters based on the evaluation. The one or more training parameters are associated to the training strategy.
The one or more training parameters may be determined further based on any one or more out of: Which may mean that the cost of actions performed for explorations in the one or more RL modules 111 and the performance of the one or more RL modules 111 may in these embodiments comprise any one or more out of:
The one or more training parameters may comprise any one or more out of:
Action 405
In these embodiments the central node 130 further configures the one or more RL modules 111 with the determined one or more training parameters to update its training strategy. It is an advantage that the central node 130 controls the training strategy since the central node 130 may possess more knowledge than the distributed node 110 about the best strategy for training.
This enforces the respective one or more RL modules 111 in the distributed node 110 to act according to the updated training strategy to use the produced data samples to update an RL policy of the RL module. To act according to the updated training strategy to use the produced data samples to update an RL policy of the RL module means to apply the method and hyperparameters specified in the updated training strategy to update the RL policy of the RL module.
In some embodiments the central node 130 configures the one or more RL modules 111 with the one or more training parameters, by sending the one or more training parameters in a second control message.
The embodiments described above will now be further explained and exemplified. The example embodiments described below may be combined with any suitable embodiment above.
Method in the Central Node 130 and its Embodiments.
Example embodiments herein disclose methods performed in the central node 130 for optimizing and controlling the configuration of the exploration strategy, and possibly also the training strategy, associated to RL, also referred to as machine learning, algorithms executed by the distributed node 110. In one embodiment, the distributed node 110 is an eNB or gNB, and the central node 130 is an OAM node.
Exploration
As mentioned above the method may e.g. comprise the following related to the Actions described above:
In some embodiments, the central node 130 determines the one or more parameters associated to the exploration strategy for the one or more RL modules 111 of the distributed node 110 for a specific cell or group of cells controlled by the distributed node 110.
In some other embodiments of the method, the central node 130 determines the one or more parameters associated to the exploration strategy for the one or more RL modules 111 of a distributed node 110 based on network performance and/or service requirements associated to services and applications provided by the distributed node 110. Such examples may comprise:
For instance, in case the central node 130 detects critical or prioritized services, or VIP users, or services with stringent requirements in terms of data rate, latency, reliability, energy efficiency, etc. to be provided within the coverage area of one or more radio cells controlled by a distributed node 110 where exploration is configured, the central node 130 may determine to reduce the amount of exploration by changing the one or more parameters of the exploration strategy.
For example, with an ε-greedy exploration strategy, wherein a control policy is tasked to explore with probability ε ∈ [0, 1], i.e., acting according to a random probability distribution, such as taking an action with uniform probability among all available actions, and to act according to the control policy with probability 1−ε, the central node 130 may determine to reduce the current value of ε configured for the distributed node 110 so as to reduce the average number of actions taken according to a random probability distribution. Vice versa, when the central node 130 detects that there is no critical traffic or services to be supported in any of the cells controlled by the distributed node 110, the central node 130 may determine to increase the explorative behavior of the distributed node 110.
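A non-limiting sketch of the ε-greedy selection and of the central node's adjustment of ε is given below; the function and variable names and the example values are assumptions for illustration only.

```python
import random

def epsilon_greedy(policy_action, available_actions, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.choice(available_actions)   # explore: uniform random action
    return policy_action                       # exploit: follow the control policy

epsilon = 0.10
critical_services_detected = True              # e.g. derived from the evaluation in Action 401
if critical_services_detected:
    epsilon = min(epsilon, 0.01)               # reduce exploration for critical services
else:
    epsilon = min(1.0, epsilon * 2)            # increase explorative behavior otherwise
```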
In some embodiments of the method, the central node 130 determines the one or more parameters associated to the exploration strategy for the one or more RL modules 111 of the distributed node 110 based on network performance experienced in the coverage area of the radio cells controlled by the distributed node 110. For instance, if the network performance measured in the radio cells controlled by the distributed node 110 falls below a threshold or is lower compared to the performance of other radio cells controlled by other distributed nodes, for instance with similar deployment and radio conditions, the central node 130 may infer that the RL policy used by the distributed node 110 in one or more controlled radio cells is not sufficiently good, and may thereby determine to increase the explorative behavior of the distributed node 110 in one or more of its controlled cells in order to collect new data that could improve the current policy.
In some embodiments the central node 130 may determine to change exploration strategy for the distributed node 110. Examples of possible exploration strategies include, but are not limited to:
Recency-Based Exploration
Therefore, the central node 130 may signal to the distributed node 110 which exploration strategy to use and the corresponding one or more parameters. For instance, the central node 130 may signal an exploration strategy as one element of an enumerated list or using a bitmap with each bit indicating one specific exploration strategy and setting the bit equal to 1 only for the selected exploration strategy.
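A non-limiting sketch of such a bitmap encoding is given below; the set of strategies beyond ε-greedy and recency-based exploration is an assumption for illustration only.

```python
# Each bit position maps to one specific exploration strategy; only the bit of
# the selected strategy is set to 1.
STRATEGY_BIT = {
    "epsilon_greedy": 0,
    "recency_based":  1,
    "softmax":        2,   # hypothetical further strategy
}

def encode_strategy(name):
    return 1 << STRATEGY_BIT[name]          # e.g. "recency_based" -> 0b010

def decode_strategy(bitmap):
    for name, bit in STRATEGY_BIT.items():
        if bitmap == 1 << bit:
            return name
    raise ValueError("exactly one strategy bit must be set")
```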
In case the distributed node 110 is using an exploration strategy where the one or more parameters are changed dynamically and locally by the distributed node 110, the central node 130 may further:
For instance, if the distributed node 110 is configured to explore according to an ε-greedy exploration strategy with decaying and/or annihilating exploration over time, the value of the exploration parameter ε initially configured by the central node 130 for the distributed node 110 may be reduced by the distributed node 110 over time so as to reduce the amount of exploration. If the central node 130 has not configured the distributed node 110 with specific decaying and/or annihilating exploration parameters, the central node 130 may not be aware of the current value of the parameter ε governing the amount of exploration at the distributed node 110. The knowledge of such a parameter would be necessary for the central node 130 to determine whether the exploration strategy used by the distributed node 110, or its associated one or more parameters, need to be updated, e.g., due to critical or prioritized services or users according to other embodiments.
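A non-limiting sketch of such a locally decaying exploration parameter, and of reporting its current value to the central node 130 upon request, is given below; all names are assumptions for illustration only.

```python
epsilon = 0.5              # initial value configured by the central node 130
decay_factor = 0.999       # local per-step decay, reducing exploration over time

def on_learning_step():
    """Applied locally by the distributed node 110 after each learning step."""
    global epsilon
    epsilon *= decay_factor

def on_exploration_status_request():
    """Hypothetical response to a query from the central node 130, letting it
    decide whether the exploration strategy or its parameters need updating."""
    return {"strategy": "epsilon_greedy", "epsilon": epsilon}
```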
Training
As mentioned above the method may in some embodiments further comprise the following, related to the Actions described above:
In some embodiments, the central node 130 determines one or more training parameters such as one or more efficient training parameters. For example, the central node 130 may signal different learning parameters to each of different distributed nodes such as e.g. the distributed node 110. For a distributed node, e.g. the distributed node 110, that handles critical or prioritized traffic, the central node 130 may configure training parameters that have provided a high training performance, also referred to as learning performance, in previous instances. Learning performance when used herein may mean the achieved accuracy of the model prediction after trained with a given number of samples. For other distributed nodes, which in some embodiments also may be the distributed node 110, the central node 130 may configure training parameters for which the impact on learning performance is insufficiently known. In this manner, the central node 130 may efficiently obtain knowledge about the best training parameter configurations comprising the one or more training parameters, while minimizing the adverse impact on the overall system performance. Periodically, the central node 130 may update the training parameters for all or a subset of the distributed nodes, e.g. comprising the distributed node 110, in response to the type of traffic being currently served by that distributed node, and the knowledge about the training parameters collected so far. The central node 130 may choose training parameters based on, for example,
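A non-limiting sketch of such a selection of training parameters is given below; the field names and the selection criterion are assumptions for illustration only.

```python
import random

def choose_training_parameters(node, known_good_configs, untested_configs, rng=random):
    """Distributed nodes serving critical or prioritized traffic receive
    configurations with known good learning performance; other nodes receive
    configurations whose impact on learning performance is insufficiently known."""
    if node["serves_critical_traffic"]:
        # Exploit: best observed learning performance so far.
        return max(known_good_configs, key=lambda c: c["observed_accuracy"])
    # Explore: gather knowledge about configurations with unknown impact.
    return rng.choice(untested_configs)
```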
At the distributed node 110 the following actions may be performed.
Examples of embodiments herein provide:
To perform the actions mentioned above, the central node 130 may comprise the arrangement as shown in
The central node 130 may comprise a respective input and output interface 500 configured to communicate with e.g. the distributed node 110, see
The central node 130 may further be configured to, e.g. by means of an evaluating unit 510 in the central node 130, evaluate a cost of actions performed for explorations in the one or more RL modules 111, and a performance of the one or more RL modules 111.
The central node 130 may further be configured to, e.g. by means of a determining unit 511 in the central node 130, based on the evaluation, determine one or more exploration parameters associated to the exploration strategy.
The one or more exploration parameters may be adapted to be determined, e.g. by means of the determining unit 511, for a specific cell or group of cells controlled by the distributed node 110.
The central node 130 may further be configured to, e.g. by means of the determining unit 511, determine the one or more exploration parameters based on any one or more out of:
The one or more exploration parameters may be adapted to comprise any one or more out of:
The central node 130 may further be configured to, e.g. by means of the determining unit 511, determine one or more training parameters, which one or more training parameters are adapted to be associated to the training strategy.
The central node 130 may further be configured to, e.g. by means of the determining unit 511, determine the one or more training parameters based on any one or more out of:
The one or more training parameters may be adapted to comprise any one or more out of:
The central node 130 may further be configured to, e.g. by means of a configuring unit 512 in the central node 130, control the exploration strategy by configuring the one or more RL modules 111 with the determined one or more exploration parameters to update its exploration strategy, to enforce the respective one or more RL modules 111 to act according to the updated exploration strategy to produce data samples for the one or more RL modules 111 in the distributed node 110.
The central node 130 may further be configured to, e.g. by means of the configuring unit 512, configure the one or more RL modules 111 with the determined one or more training parameters to update its training strategy, to enforce the respective one or more RL modules 111 in the distributed node 110 to act according to the updated training strategy to use the produced data samples to update an RL policy of the RL module.
The central node 130 may further be configured to, e.g. by means of the configuring unit 512, any one or more out of:
The embodiments herein may be implemented through a processor or one or more processors, such as a processor 550 of a processing circuitry in the central node 130 in
The central node 130 may further comprise a memory 560 comprising one or more memory units. The memory 560 comprises instructions executable by the processor 550 in the central node 130. The memory 560 is arranged to be used to store, e.g. training parameters, exploration parameters, training strategy, control messages, data samples, RL policies, information, data, configurations, and applications, to perform the methods herein when being executed in the central node 130.
In some embodiments, a computer program 570 comprises instructions, which when executed by the at least one processor 550, cause the at least one processor 550 of the central node 130 to perform the actions above.
In some embodiments, a carrier 580 comprises the computer program 570, wherein the carrier 580 is one of an electronic signal, an optical signal, an electromagnetic signal, a magnetic signal, an electric signal, a radio signal, a microwave signal, or a computer-readable storage medium.
Those skilled in the art will also appreciate that the units described above may refer to a combination of analog and digital circuits, and/or one or more processors configured with software and/or firmware, e.g. stored in the central node 130, that when executed by the one or more processors, such as the processors or processor circuitry described above, perform as described above. One or more of these processors, as well as the other digital hardware, may be included in a single Application-Specific Integrated Circuit (ASIC), or several processors and various digital hardware may be distributed among several separate components, whether individually packaged or assembled into a system-on-a-chip (SoC).
Further Extensions and Variations
With reference to
The telecommunication network 3210 is itself connected to a host computer 3230, which may be embodied in the hardware and/or software of a standalone server, a cloud-implemented server, e.g. cloud 140, a distributed server or as processing resources in a server farm. The host computer 3230 may be under the ownership or control of a service provider, or may be operated by the service provider or on behalf of the service provider. The connections 3221, 3222 between the telecommunication network 3210 and the host computer 3230 may extend directly from the core network 3214 to the host computer 3230 or may go via an optional intermediate network 3220. The intermediate network 3220 may be one of, or a combination of more than one of, a public, private or hosted network; the intermediate network 3220, if any, may be a backbone network or the Internet; in particular, the intermediate network 3220 may comprise two or more sub-networks (not shown).
The communication system of
Example implementations, in accordance with an embodiment, of the UE, base station and host computer discussed in the preceding paragraphs will now be described with reference to
The communication system 3300 further includes a base station 3320 provided in a telecommunication system and comprising hardware 3325 enabling it to communicate with the host computer 3310 and with the UE 3330. The hardware 3325 may include a communication interface 3326 for setting up and maintaining a wired or wireless connection with an interface of a different communication device of the communication system 3300, as well as a radio interface 3327 for setting up and maintaining at least a wireless connection 3370 with a UE 3330 located in a coverage area (not shown) served by the base station 3320. The communication interface 3326 may be configured to facilitate a connection 3360 to the host computer 3310. The connection 3360 may be direct or it may pass through a core network (not shown in
The communication system 3300 further includes the UE 3330 already referred to. Its hardware 3335 may include a radio interface 3337 configured to set up and maintain a wireless connection 3370 with a base station serving a coverage area in which the UE 3330 is currently located. The hardware 3335 of the UE 3330 further includes processing circuitry 3338, which may comprise one or more programmable processors, application-specific integrated circuits, field programmable gate arrays or combinations of these (not shown) adapted to execute instructions. The UE 3330 further comprises software 3331, which is stored in or accessible by the UE 3330 and executable by the processing circuitry 3338. The software 3331 includes a client application 3332. The client application 3332 may be operable to provide a service to a human or non-human user via the UE 3330, with the support of the host computer 3310. In the host computer 3310, an executing host application 3312 may communicate with the executing client application 3332 via the OTT connection 3350 terminating at the UE 3330 and the host computer 3310. In providing the service to the user, the client application 3332 may receive request data from the host application 3312 and provide user data in response to the request data. The OTT connection 3350 may transfer both the request data and the user data. The client application 3332 may interact with the user to generate the user data that it provides.
It is noted that the host computer 3310, base station 3320 and UE 3330 illustrated in
In
The wireless connection 3370 between the UE 3330 and the base station 3320 is in accordance with the teachings of the embodiments described throughout this disclosure. One or more of the various embodiments improve the performance of OTT services provided to the UE 3330 using the OTT connection 3350, in which the wireless connection 3370 forms the last segment. More precisely, the teachings of these embodiments may improve the applicable RAN effects, e.g. data rate, latency and/or power consumption, and thereby provide benefits such as corresponding effects on the OTT service, e.g. reduced user waiting time, relaxed restrictions on file size, better responsiveness, and extended battery lifetime.
A measurement procedure may be provided for the purpose of monitoring data rate, latency and other factors on which the one or more embodiments improve. There may further be an optional network functionality for reconfiguring the OTT connection 3350 between the host computer 3310 and UE 3330, in response to variations in the measurement results. The measurement procedure and/or the network functionality for reconfiguring the OTT connection 3350 may be implemented in the software 3311 of the host computer 3310 or in the software 3331 of the UE 3330, or both. In embodiments, sensors (not shown) may be deployed in or in association with communication devices through which the OTT connection 3350 passes; the sensors may participate in the measurement procedure by supplying values of the monitored quantities exemplified above, or supplying values of other physical quantities from which software 3311, 3331 may compute or estimate the monitored quantities. The reconfiguring of the OTT connection 3350 may include message format, retransmission settings, preferred routing etc.; the reconfiguring need not affect the base station 3320, and it may be unknown or imperceptible to the base station 3320. Such procedures and functionalities may be known and practiced in the art. In certain embodiments, measurements may involve proprietary UE signaling facilitating the host computer's 3310 measurements of throughput, propagation times, latency and the like. The measurements may be implemented in that the software 3311, 3331 causes messages to be transmitted, in particular empty or ‘dummy’ messages, using the OTT connection 3350 while it monitors propagation times, errors etc.