This disclosure relates to reinforcement learning.
Reinforcement Learning (RL) is a type of machine learning (ML) that enables an agent to learn by trial and error using feedback based on the actions that the agent triggers. RL has made remarkable progress in recent years and is now used in many applications, including real-time network management, simulations, games, etc. RL differs from the commonly used supervised and unsupervised ML approaches. Supervised ML requires a training data set with annotations provided by an external supervisor, and unsupervised ML is typically a process of determining an implicit structure in a data set without annotations.
The concept of RL is straightforward: an RL agent is reinforced to make better decisions based on past learning experience. This method is similar to the different performance rewards that we encounter in everyday life. Typically, the RL agent implements an algorithm that obtains information about the current state of a system (a.k.a., “environment”), selects an action, triggers performance of the action, and then receives a “reward,” the value of which is dependent on the extent to which the action produced a desired outcome. This process repeats continually and eventually the RL agent learns, based on the reward feedback, the best action to select given the current state of the environment.
Although a designer sets the reward policy, that is, the rules of the game, the designer typically gives the RL agent no hints or suggestions as to which actions are best for any given state of the environment. It is up to the RL agent to figure out which action is best to maximize the reward, starting from totally random trials and finishing with sophisticated tactics. By leveraging the power of search and many trials, RL is an effective way to accomplish a task. In contrast to human beings, an RL agent can gather experience from thousands of parallel gameplays if a reinforcement learning algorithm is run on a sufficiently powerful computer infrastructure.
Q-learning is a reinforcement learning algorithm to learn the value of an action in a particular state (see, e.g., reference [2]). Q-learning does not require a model of the environment, and theoretically, it can find an optimal policy that maximizes the expected value of the total reward for any given finite Markov decision process. The Q-learning algorithm is used to find the optimal action-selection policy: Q: S×A→ℝ (Eq. 1).
In the Q-learning update rule, α is the learning rate, with 0<α≤1, and determines to what extent newly acquired information overrides old information, and γ is the discount factor, with 0<γ≤1, and determines the importance of future rewards.
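As a hedged illustration of how α and γ enter the computation, the standard tabular Q-learning update, Q(s,a) ← Q(s,a) + α·(r + γ·max Q(s',a') − Q(s,a)), can be sketched as follows. This is the textbook form, given here as an assumption rather than as the exact equation of the disclosure; the function name `q_update`, the table layout, and the numeric values are hypothetical.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one standard tabular Q-learning update in place and return q.

    q is a list of rows, one row per state, one entry per action (assumed layout).
    """
    # Target: immediate reward plus the discounted best value of the next state.
    td_target = reward + gamma * max(q[next_state])
    # Move the old estimate toward the target by a fraction alpha.
    q[state][action] += alpha * (td_target - q[state][action])
    return q


# Hypothetical problem with 3 states and 2 actions, starting from an all-zero table.
q_table = [[0.0, 0.0] for _ in range(3)]
q_update(q_table, state=0, action=1, reward=1.0, next_state=2)
```

With α close to 1 the new observation almost replaces the old estimate, and with γ close to 1 future rewards weigh nearly as much as the immediate reward.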
A simple way of implementing the Q-learning algorithm is to store the Q matrix in tables. However, this can be infeasible or inefficient when the number of states or actions becomes large. In this case, function approximation can be used to represent Q, which makes Q-learning applicable to large problems. One solution is to use deep learning for function approximation. Deep learning models consist of several layers of neural networks, which are in principle capable of performing more sophisticated tasks such as nonlinear function approximation of Q.
Deep Q-learning combines convolutional neural networks with the Q-learning algorithm. It uses a deep neural network with weights θ to achieve an approximated representation of Q. In addition, to improve the stability of the deep Q-learning algorithm, a method called experience replay was proposed to remove correlations between samples by using a random sample from prior actions instead of the most recent action to proceed (see, e.g., reference [3]). The deep Q-learning algorithm with experience replay proposed in reference [3] is shown in the table below. After performing experience replay, the agent selects and executes an action according to an ε-greedy policy. ε defines the exploration probability for the agent to perform a random action.
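The experience replay idea described above can be sketched as a fixed-capacity buffer from which past transitions are drawn at random. This is a minimal sketch, not the algorithm of reference [3]; the class name `ReplayMemory`, its method names, and the transition layout are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, so that a training batch is not
        # made up only of the most recent, strongly correlated transitions.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


# Hypothetical usage: capacity 3, five transitions stored, two sampled.
memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.store(step, 0, 1.0, step + 1)
batch = memory.sample(2, rng=random.Random(0))
```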
As noted above, reinforcement learning has been successfully used in many use cases (e.g., cart-pole problem solving, robot locomotion, Atari games, Go, etc.) where the RL agent is dealing with a relatively static environment (the set of states does not change), and it is possible to obtain all possible environment states, which is known as “full observability.”
Theoretically, RL algorithms can also cope with a dynamically changing environment if sufficient data can be collected to abstract the changing environment and there is sufficient time for training and trials. These requirements, however, can be difficult to meet in practice because large-scale data collection can be complex, costly, and time consuming, or even infeasible. In many cases, it is not possible to have full observability of the dynamic environment, e.g., when a quick decision needs to be made, or when it is difficult or infeasible to collect data for some features. One example is a public safety scenario, where an unmanned aerial vehicle (UAV) (a.k.a., drone) carrying a base station (“UAV-BS”) needs to be deployed quickly in a disaster area to provide wireless connectivity for mission critical users. It is important to adapt the UAV-BS's configuration and location to the real-time mission critical traffic situation. For instance, when the mission critical users move on the ground and/or when more first responders join the mission critical operation in the disaster area, the UAV-BS should quickly adapt its location and configuration to maintain the service continuity in this changing environment.
This disclosure aims at mitigating the above problem. Accordingly, in one aspect there is provided a method for dynamic RL. The method includes using an RL algorithm to select a first action and triggering performance of the selected first action. The method also includes after the first action is performed, obtaining a first reward value (R1) associated with the first action. The method also includes using R1 and/or a performance indicator (PI) to determine whether an algorithm modification condition is satisfied. The method further includes, as a result of determining that the algorithm modification condition is satisfied, modifying the RL algorithm to produce a modified RL algorithm. In this way, the RL algorithm adapts to changes in the environment.
In another aspect there is provided a computer program comprising instructions which, when executed by processing circuitry of an RL agent, cause the RL agent to perform any of the methods disclosed herein. In one embodiment, there is provided a carrier containing the computer program, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
In another aspect there is provided an RL agent node that is configured to use an RL algorithm to select a first action and trigger performance of the selected first action. The RL agent is also configured to, after the first action is performed, obtain a first reward value (R1) associated with the first action. The RL agent is also configured to use R1 and/or a performance indicator (PI) to determine whether an algorithm modification condition is satisfied. The RL agent is also configured to, as a result of determining that the algorithm modification condition is satisfied, modify the RL algorithm to produce a modified RL algorithm. In some embodiments, the RL agent comprises memory and processing circuitry coupled to the memory, wherein the memory contains instructions executable by the processing circuitry to configure the RL agent to perform the methods/processes disclosed herein.
An advantage of the embodiments disclosed herein is that they provide an adaptive RL agent that is able to operate well in a dynamic environment with limited observability of the environment and/or changing state sets over time. That is, embodiments can handle complex system optimization and decision-making problems in a dynamic environment with limited environment observability and a dynamic state space. Compared to a conventional non-adaptive RL agent, the embodiments disclosed herein can respond to changes in the environment and update the RL algorithm to achieve an acceptable level of service quality. In addition, conventional RL agents need to have their RL algorithm retrained completely from scratch when entering a different environment, whereas the embodiments can reuse part of the past learned experience with adjusted algorithm parameters to provide proper and timely decisions in the subsequent changing environments.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
Agent 201 is configured to adapt the RL algorithm that it employs to select the actions. This enables, among other things, fast decision making in a dynamic environment with limited and/or changing state sets over time. The agent 201, in one embodiment, performs the following steps: 1) the agent 201 monitors a first set of one or more parameters, 2) the agent 201, based on the monitored parameter(s), adjusts the RL algorithm (e.g., adjusts a second set of one or more parameters) to adapt the RL algorithm to the new environment, and 3) the agent 201 selects an action using the modified RL algorithm.
In one embodiment, the first set of parameters includes at least one or a combination of the following: 1) the received immediate reward r_t at a given time t; 2) an accumulated reward $\sum_{t=i}^{j} r_t$ during a time window, i.e., from time i to time j; and 3) a performance indicator (e.g., a key performance indicator (KPI)). For the UAV-BS in the public safety scenario described above, examples of KPIs include: the drop rate of mission critical users; the worst mission critical user throughput; the wireless backhaul link quality; etc.
With respect to the accumulated reward, in some embodiments the time window is decided based on: i) the correlation time (changing-scale) of the environment and/or ii) application requirements, e.g., the maximum allowed service interruption time. In other embodiments, the time window is the time duration from the beginning until the present time.
Dynamic changes in the environment (e.g., user equipment (UE) movements, UAV movements, and/or backhaul connection links of the UAV-BS in the public safety scenario) can result in a change in the value(s) of one or a combination of the first set of parameters. By detecting/observing such changes, the agent 201 can automatically adapt the RL algorithm to fit the new environment.
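The two window choices described above (a bounded window from time i to time j, or the whole history from the beginning) can be sketched as follows. This is an illustrative sketch only; the class name `RewardWindow` and the parameter `window_length` are hypothetical, and how the window length is derived from the correlation time or the allowed service interruption time is left open.

```python
from collections import deque


class RewardWindow:
    """Accumulates the immediate rewards received over a time window."""

    def __init__(self, window_length=None):
        # window_length=None keeps every reward, i.e., the window spans the
        # whole run; otherwise only the most recent window_length rewards count.
        self.rewards = deque(maxlen=window_length)

    def add(self, reward):
        self.rewards.append(reward)

    def accumulated(self):
        # Sum of r_t over the rewards currently inside the window.
        return sum(self.rewards)


# Hypothetical reward trace observed by the agent.
trace = [1.0, 2.0, 3.0, 4.0]
sliding = RewardWindow(window_length=3)
full_history = RewardWindow()
for r in trace:
    sliding.add(r)
    full_history.add(r)
```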
The triggering event for adjusting the second set of parameters at a given time t can be at least one or a combination of the following:
In one embodiment, the second set of parameters consists of algorithm related parameters. The second set of parameters can include at least one or a combination of the following: i) the exploration probability ε; ii) the learning rate α; iii) the discount factor γ; and iv) the replay memory capacity N.
In one example, the exploration probability ε can be increased to a certain value when an event (e.g., the immediate reward has dropped below a threshold) has triggered the update of the algorithm. In another example, the exploration probability ε can be reduced to a certain value when another event (e.g., the accumulated reward has reached an upper-bound threshold) has triggered the update of the algorithm.
In one example, the learning rate α can be increased to a certain value when an event (e.g., the immediate reward has dropped below a threshold) has triggered the update of the algorithm. In another example, the learning rate α can be reduced to a certain value when another event (e.g., the accumulated reward has reached an upper-bound threshold) has triggered the update of the algorithm.
In one example, the discount factor γ can be increased to a certain value when an event (e.g., the immediate reward has dropped below a threshold) has triggered the update of the algorithm. In another example, the discount factor γ can be reduced to a certain value when another event (e.g., the accumulated reward has reached an upper-bound threshold) has triggered the update of the algorithm.
In one example, the replay memory capacity N can be increased to a certain value when an event (e.g., the immediate reward has dropped below a threshold) has triggered the update of the algorithm. In another example, the replay memory capacity N can be reduced to a certain value when another event (e.g., the accumulated reward has reached an upper-bound threshold) has triggered the update of the algorithm.
The table below shows pseudo-code for a dynamic reinforcement learning process that is performed by agent 201 in one embodiment.
As seen from the above code, the exploration probability ε is adjusted when there is a reward value drop greater than a threshold (a.k.a., the “Drop” threshold). Following the completion of each learning iteration, the last reward value r_K is checked and compared to a pre-defined performance-drop tolerance threshold and an upper reward threshold. The exploration probability ε is then adjusted based on the reward value r_K and the two thresholds.
In this example, the first set of parameters includes the immediate reward r_K, and the second set of parameters consists of the exploration probability ε. There are two triggering events for updating this algorithm related parameter:
Stable connectivity is crucial for improving the situational awareness and operational efficiency in various mission-critical situations. In a catastrophe or emergency scenario, the existing cellular network coverage and capacity in the emergency area may not be available or sufficient to support mission-critical communication needs. In these scenarios, deployable-network technologies like portable base stations (BSs) on UAVs or trucks can be used to quickly provide mission-critical users with dependable connectivity.
In order to best serve the on-ground mission-critical users and, at the same time, maintain a good backhaul connection, agent 201 can be employed to autonomously configure the location of the UAV-BS and the electrical tilt for the access and backhaul antenna of the UAV-BS. By employing the RL algorithm adaptation processes disclosed herein, agent 201 is able to adapt its RL algorithm to the real-time changing environment (e.g., when mission-critical traffic moves on the ground), where traditional reinforcement learning algorithms are not applicable and would result in inappropriate UAV-BS configuration decisions. That is, the agent 201 can be used to automatically control the location of the UAV-BS 302 and the antenna configuration of the UAV-BS in a dynamically changing environment, in order to best serve the on-ground mission critical users and, at the same time, maintain a good backhaul connection between the UAV-BS and an on-ground donor base station.
In some embodiments, modifying the RL algorithm to produce the modified RL algorithm comprises modifying a parameter of the RL algorithm.
In some embodiments, modifying a parameter of the RL algorithm comprises modifying one or more of: an exploration probability of the RL algorithm, a learning rate of the RL algorithm, a discount factor of the RL algorithm, or a replay memory capacity of the RL algorithm. In some embodiments, using the RL algorithm to select the first action comprises selecting the first action based on the exploration probability, and modifying the RL algorithm to produce the modified RL algorithm comprises modifying the exploration probability.
In some embodiments, using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises one or more of i) comparing R1 to a first threshold, ii) comparing ΔR to a second threshold, wherein ΔR is a difference between R1 and a reward value associated with a second action selected using the RL algorithm, or iii) comparing the PI to a third threshold.
In some embodiments, process 400 also includes: i) before using the RL algorithm to select the first action and obtaining R1, using the RL algorithm to select a second action; ii) triggering performance of the selected second action; and iii) after the second action is performed, obtaining a second reward value, R2 (e.g., r_previous), associated with the second action, wherein using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises performing a decision process comprising: calculating ΔR=R2−R1 and determining whether ΔR is greater than a drop threshold.
In some embodiments, using the RL algorithm to select the first action comprises selecting the first action based on an exploration probability (ε). The exploration probability specifies the likelihood that the agent will randomly select an action, as opposed to selecting an action that is determined to yield the highest expected reward. For example, if ε is 0.1, then the agent is configured such that when the agent goes to select an action there is a 10% chance the agent will randomly select an action and a 90% chance that the agent will select an action that is determined to yield the highest expected reward.
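The role of the exploration probability described above can be sketched with a common ε-greedy selection rule. This is a minimal sketch under stated assumptions: the function name `select_action` and the list of per-action value estimates `q_values` are hypothetical, and the expected reward of each action is assumed to be available as a number.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Return an action index chosen with an epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: pick uniformly at random among all actions.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated reward.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical value estimates for three actions.
q_values = [0.2, 0.9, 0.4]
greedy_choice = select_action(q_values, epsilon=0.0)
exploring_choice = select_action(q_values, epsilon=1.0)
```

Note that a random pick can also land on the best action, so with ε = 0.1 the best action is chosen somewhat more often than 90% of the time.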
In some embodiments, the algorithm modification condition is satisfied when ΔR is greater than the drop threshold, and modifying the RL algorithm as a result of determining that the algorithm modification condition is satisfied comprises generating a new exploration probability, εnew, for the RL algorithm, wherein εnew equals εReStart, where εReStart is a predetermined exploration probability (e.g., εReStart=0.1).
In some embodiments, the decision process further comprises, as a result of determining that ΔR is not greater than the drop threshold, then determining whether R1 is less than a lower reward threshold.
In some embodiments, the decision process further comprises, as a result of determining that ΔR is not greater than the drop threshold, then determining whether R1 is greater than an upper reward threshold. In some embodiments, the algorithm modification condition is satisfied when ΔR is not greater than the drop threshold and R1 is greater than the upper reward threshold, and modifying the RL algorithm as a result of determining that the algorithm modification condition is satisfied comprises generating a new exploration probability, εnew, for the RL algorithm, wherein εnew equals εEnd, where εEnd is a predetermined ending exploration probability (e.g., εEnd=0.001).
In some embodiments, the algorithm modification condition is satisfied when ΔR is not greater than the drop threshold and R1 is not greater than the upper reward threshold, and modifying the RL algorithm as a result of determining that the algorithm modification condition is satisfied comprises generating a new exploration probability, εnew, for the RL algorithm, wherein εnew equals (ε×c), where c is a predetermined constant.
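The three outcomes of the decision process described above (restart, end, or scale the exploration probability) can be sketched as one function. This is a hedged sketch of one possible embodiment: the function and argument names are hypothetical, the defaults 0.1 and 0.001 are the example values given above, and the value 0.99 for the constant c and the thresholds used in the usage lines are assumptions. The lower reward threshold check is omitted because its outcome is not specified above.

```python
def update_exploration(epsilon, r_current, r_previous, drop_threshold,
                       upper_threshold, eps_restart=0.1, eps_end=0.001, c=0.99):
    """Return the new exploration probability after one learning iteration."""
    # Reward drop between the previous action (R2) and the current one (R1).
    delta_r = r_previous - r_current
    if delta_r > drop_threshold:
        # Large drop: the environment has likely changed, so explore again.
        return eps_restart
    if r_current > upper_threshold:
        # Reward is already high: reduce exploration to its ending value.
        return eps_end
    # Otherwise scale the current exploration probability by the constant c.
    return epsilon * c


# Hypothetical thresholds: drop tolerance 5.0, upper reward threshold 9.0.
after_drop = update_exploration(0.01, r_current=2.0, r_previous=8.0,
                                drop_threshold=5.0, upper_threshold=9.0)
after_high = update_exploration(0.05, r_current=9.5, r_previous=9.0,
                                drop_threshold=5.0, upper_threshold=9.0)
after_scale = update_exploration(0.05, r_current=5.0, r_previous=5.5,
                                 drop_threshold=5.0, upper_threshold=9.0)
```

With c below 1, repeated passes through the last branch decay ε gradually while the reward stays between the two thresholds.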
In some embodiments, process 400 further includes, prior to using the RL algorithm to select the first action: i) using the RL algorithm to select K−1 actions, where K>1; ii) triggering the performance of each one of the K−1 actions; and iii) for each one of the K−1 actions, obtaining a reward value associated with the action. In some embodiments, using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises: using R1 and said K−1 reward values to generate a reward value that is a function of these K reward values; and comparing the generated reward value to a threshold. In some embodiments, using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises: using R1 and said K−1 reward values to generate a reward value that is a function of these K reward values; and comparing ΔR to a threshold, wherein ΔR is a difference between the generated reward value and a previously generated reward value. In some embodiments, the generated reward value is: a sum of said K reward values, a weighted sum of said K reward values, a weighted sum of a subset of said K reward values, a mean of said K reward values, a mean of a subset of said K reward values, a median of said K reward values, or a median of a subset of said K reward values.
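The aggregation options listed above can be sketched with a small helper that turns K reward values into one generated reward value. This is an illustrative sketch; the function name `aggregate_rewards`, the `how` selector, and the sample numbers are hypothetical. A subset of the K values can be aggregated by passing a slice of the list.

```python
from statistics import mean, median


def aggregate_rewards(rewards, how="mean", weights=None):
    """Combine a list of reward values into a single generated reward value."""
    if how == "sum":
        return sum(rewards)
    if how == "weighted_sum":
        if weights is None or len(weights) != len(rewards):
            raise ValueError("weights must match the number of rewards")
        return sum(w * r for w, r in zip(weights, rewards))
    if how == "mean":
        return mean(rewards)
    if how == "median":
        return median(rewards)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical K = 3 reward values, the last one being R1.
k_rewards = [1.0, 2.0, 6.0]
total = aggregate_rewards(k_rewards, how="sum")
average = aggregate_rewards(k_rewards, how="mean")
middle = aggregate_rewards(k_rewards, how="median")
recent_mean = aggregate_rewards(k_rewards[-2:], how="mean")  # subset of the K values
```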
In some embodiments, the value of K is determined based on a correlation time of the environment and/or application requirements (e.g., the maximum allowed service interruption time). In some embodiments, the value of K is determined based on a maximum allowed service interruption time.
In some embodiments, one or more of the recited thresholds is dynamically changed based on environment changes and/or service requirement changes.
In some embodiments, process 400 further includes using the modified RL algorithm to select another action and triggering performance of the other action.
While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above-described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.
Number | Date | Country | Kind |
---|---|---|---|
PCT/CN2021/116892 | Sep 2021 | WO | international |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2021/086922 | 12/21/2021 | WO |