The present disclosure relates to a computer implemented method for training a policy for managing an environment in a communication network, wherein the environment is operable to perform a task. The method is performed by a training node, and the present disclosure also relates to a training node and to a computer program product configured, when run on a computer, to carry out a method for training a policy for managing an environment in a communication network.
Autonomous decision making and control of telecommunication systems and subsystems can use various Artificial Intelligence (AI) based techniques, of which Reinforcement Learning (RL) is one of the most prominent. RL is a technology according to which a decision-making model is trained by interacting with an environment, with the aim of optimizing some objective, formulated in terms of a reward function defined by a user. Using observations from the environment being controlled, an RL agent adapts and improves its policies (decision making capability) so as to maximise future reward. Stakeholders of such policies for communication network management may include network operators, network service providers, network users, etc. Stakeholders will often ask for explanations of actions taken as a result of an RL management policy, and the quality, accuracy, and usefulness of these explanations can significantly impact stakeholder acceptance and approval of management policies using RL, and AI in general.
An RL problem is usually modeled as a Markov Decision Process (MDP), which includes a state space S, an action space A, a transition function T and the reward function R.
The RL decision-making agent (or policy model) receives as input an observation of the environment, and produces as output an action for execution in the environment. The agent is trained using past trajectories of tuples comprising observation, action, reward, and next observation. An MDP assumes that a managing agent has full knowledge of the current underlying state s of the environment being managed. An MDP is consequently a special case of a Partially Observable Markov Decision Process (POMDP), in which this assumption of perfect state knowledge no longer holds.
In many application domains, including communication networks, full knowledge of environment states may not be possible, owing for example to noisy measurements giving faulty information, or leading to ambiguity in environment observations. A POMDP models uncertainty about the current state of the environment by introducing hidden states, i.e. states that are not directly observable by the agent. The only perception the agent has of the state of a controlled environment is through imperfect observations. The POMDP model is extended with respect to a standard MDP by introducing an observation space Z of what the agent can observe, and an observation function O, relating the observations to the hidden states through a probability distribution of the observations over the states. The POMDP also introduces the concept of a belief state, which is a probability distribution over the possible hidden states of the environment.
POMDP agents and probabilistic planning techniques consequently have multiple levels of uncertainty and complexity to be resolved in order to arrive at a working management policy. These uncertainties may encompass sensor observations, actions, agent state space, true environment state space and transition probabilities. As these uncertainties are typically woven into the reward function, the only feedback mechanism for improving belief states is via observations of the environment during policy evaluation. Uncertainties in observations also propagate through to action explanations provided to stakeholders, potentially undermining confidence in the RL system and reducing its usefulness.
It is an aim of the present disclosure to provide a method, a training node, and a computer program product which at least partially address one or more of the challenges discussed above. It is a further aim of the present disclosure to provide a method, a training node and a computer program product that cooperate to achieve at least one of improved belief inferencing in POMDP models using explanation feedback received during training or policy execution, and/or improved explanations of POMDP model actions through validation of belief states using explanation feedback.
According to a first aspect of the present disclosure, there is provided a computer implemented method for training a policy for managing an environment in a communication network, wherein the environment is operable to perform a task. The method is performed by a training node and comprises using a first policy function to select an action for execution in the environment, which action will maximise a future predicted value of a reward function for task performance, given a current belief state of the environment, wherein the current belief state of the environment comprises a probability distribution over possible operational states in which the environment may exist. The method further comprises recording an explanation tree for the selected action, wherein the explanation tree comprises a representation of the current belief state of the environment, available actions that could be executed in the environment, and predicted reward function values associated with the available actions. The method further comprises causing the selected action to be executed in the environment and obtaining an observation of the environment following execution of the selected action, the observation comprising a measure of task performance by the environment. The method further comprises using a second policy function to generate an updated current belief state of the environment based on the obtained observation, and on a probability that an explanation for the selected action would be accepted by an entity querying the selected action, the explanation being generated using the explanation tree for the selected action.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform a method according to any one or more of the aspects or examples of the present disclosure.
According to another aspect of the present disclosure, there is provided a training node for training a policy for managing an environment in a communication network, wherein the environment is operable to perform a task. The training node comprises processing circuitry that is configured to cause the training node to use a first policy function to select an action for execution in the environment, which action will maximise a future predicted value of a reward function for task performance, given a current belief state of the environment. The current belief state of the environment comprises a probability distribution over possible operational states in which the environment may exist. The processing circuitry is further configured to cause the training node to record an explanation tree for the selected action, wherein the explanation tree comprises a representation of the current belief state of the environment, available actions that could be executed in the environment, and predicted reward function values associated with the available actions. The processing circuitry is further configured to cause the training node to cause the selected action to be executed in the environment, and to obtain an observation of the environment following execution of the selected action, the observation comprising a measure of task performance by the environment. The processing circuitry is further configured to cause the training node to use a second policy function to generate an updated current belief state of the environment based on the obtained observation, and on a probability that an explanation for the selected action would be accepted by an entity querying the selected action, the explanation being generated using the explanation tree for the selected action.
According to another aspect of the present disclosure, there is provided a training node for training a policy for managing an environment in a communication network, wherein the environment is operable to perform a task. The training node comprises a policy module for using a first policy function to select an action for execution in the environment, which action will maximise a future predicted value of a reward function for task performance, given a current belief state of the environment. The current belief state of the environment comprises a probability distribution over possible operational states in which the environment may exist. The training node also comprises an explanation module for recording an explanation tree for the selected action, wherein the explanation tree comprises a representation of the current belief state of the environment, available actions that could be executed in the environment, and predicted reward function values associated with the available actions. The policy module is also for causing the selected action to be executed in the environment. The training node also comprises a transceiver module for obtaining an observation of the environment following execution of the selected action, the observation comprising a measure of task performance by the environment. The policy module is also for using a second policy function to generate an updated current belief state of the environment based on the obtained observation, and on a probability that an explanation for the selected action would be accepted by an entity querying the selected action, the explanation being generated by the explanation module using the explanation tree for the selected action.
Aspects of the present disclosure thus provide a method and nodes that facilitate robust environment management. Examples of the present disclosure incorporate feedback from a querying entity via explanation probabilities, so improving the accuracy of belief states for a managed environment, and consequently improving the quality of actions that are selected for execution in an environment, and overall environment task performance. The explanation feedback also assists with adapting a policy to changes in an environment.
For a better understanding of the present disclosure, and to show more clearly how it may be carried into effect, reference will now be made, by way of example, to the following drawings in which:
In management of communication network systems by RL agents, the provision of explanations for executed actions has typically been decoupled from operation and training of the RL policy itself, which can lead to deteriorating performance. Erroneous belief states are stored and cascaded, reducing the reliability of RL models in real deployments, as well as stakeholder confidence in such models. Examples of the present disclosure propose to include the stakeholder (or a model of stakeholder behaviour) as part of the environment. An explanation tree conveying information about the current state of the RL agent or policy is introduced as an additional output of a decision making step, in addition to an action for execution in the environment. A function of a response of a stakeholder, modelled as a querying entity, to an action explanation based on this explanation tree is included as an additional environment observation. The function of a querying entity response comprises a probability that the action explanation would be accepted by the querying entity. This additional observation promotes updating of belief states to those that are consistent with “correct” explanations, as validated or accepted by the querying entity. The additions to the standard POMDP model proposed in the present disclosure consequently enable faster learning convergence for the RL policy, as well as improved quality of explanations provided to a querying entity.
Actions executed in the system 104 are output by the autonomous agent to a network management function 106 and observed by one or more stakeholders 108. Network management function 106 can provide various analytics insights, and with machine reasoning algorithms, zero-touch decisions for actuation can also be incorporated. As discussed, a stakeholder may request explanations for actions executed in the system, and provision for this is illustrated in part (b) of
The method 200 is performed by a training node, which may comprise a physical or virtual node, and may be implemented in a computer system, computing device or server apparatus and/or in a virtualized environment, for example in a cloud, edge cloud or fog deployment. Examples of a virtual node may include a piece of software or computer program, a code fragment operable to implement a computer program, a virtualised function, or any other logical entity. The training node may for example be implemented in a core network of the communication network, and may be implemented in the Operation Support System (OSS). The training node may be implemented in an Orchestration And Management (OAM) system or in a Service Management and Orchestration (SMO) system. In other examples, the training node may be implemented in a Radio Access node, which itself may comprise a physical node and/or a virtualized network function that is operable to exchange wireless signals. In some examples, a Radio Access node may comprise a base station node such as a NodeB, eNodeB, gNodeB, or any future implementation of this functionality. The training node may be implemented as a function in an Open Radio Access Network (ORAN) or Virtualised Radio Access Network (vRAN). The training node may encompass multiple logical entities, as discussed in greater detail below, and may for example comprise a Virtualised Network Function (VNF). In some examples, the training node may encompass a POMDP agent and an explainer agent.
Referring to
In step 220, the method comprises recording an explanation tree for the selected action. As illustrated at 220a, the explanation tree comprises a representation of the current belief state of the environment, available actions that could be executed in the environment, and predicted reward function values associated with the available actions.
The explanation tree consequently provides a representation of the factors on the basis of which the first policy function selected an action at step 210.
In step 230, the method 200 comprises causing the selected action to be executed in the environment, and in step 240, the method 200 comprises obtaining an observation of the environment following execution of the selected action, the observation comprising a measure of task performance by the environment. The observation may for example comprise one or more KPIs for the environment and/or measurements of environment task performance, operation, etc. In step 250, the method 200 comprises using a second policy function to generate an updated current belief state of the environment based on the obtained observation, and on a probability that an explanation for the selected action would be accepted by an entity querying the selected action, the explanation being generated using the explanation tree for the selected action.
The method 200 introduces the concept of an explanation probability, that is a probability that an explanation for a selected action would be accepted by an entity querying the selected action, the explanation being generated using the explanation tree for the selected action. This explanation probability is used to update the belief state of the environment during training of a policy for managing the environment. The method 200 thus distinguishes between an explanation tree and an explanation. An explanation tree is recorded for each action selected by the policy. An explanation for an action is provided to an entity querying an action, and is generated using the explanation tree for that action. As discussed in greater detail below with reference to
In summary, the training node executing the method 200 uses a first policy function to select an action that is predicted to maximise a future value of a reward function. The training node also records an explanation tree for the selected action, on the basis of which an explanation for why the action was selected may be generated. The training node then uses a second policy function to update the belief state of the environment following execution of the action. The second policy function uses both an environment observation and explanation acceptance probabilities to update the belief state. The observation provides a clue directly from the environment as to what state the environment may be in, and the explanation acceptance probabilities represent input from a querying entity, which may be sparse and/or highly periodic, but which may assist with interpreting the observation to arrive at the correct belief state.
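The interaction of the first and second policy functions in the method 200 can be sketched in Python as follows. This is a minimal illustration only, under the assumption of a simple object interface; the names first_policy, second_policy, select_action, update_belief and so on are introduced purely for the example and do not correspond to any particular implementation.

# Minimal sketch of one iteration of method 200 (all names are illustrative assumptions).
def training_step(belief, environment, first_policy, second_policy, explanation_store):
    # Step 210: the first policy function selects the action predicted to maximise
    # the future value of the reward function, given the current belief state, and
    # returns the explanation tree describing the basis for that selection.
    action, explanation_tree = first_policy.select_action(belief)

    # Step 220: record the explanation tree for the selected action.
    explanation_store[action] = explanation_tree

    # Steps 230 and 240: cause the action to be executed and obtain an observation
    # comprising a measure of task performance by the environment.
    observation = environment.execute(action)

    # Step 250: the second policy function generates an updated belief state from the
    # observation and from the probability that an explanation for the selected action
    # would be accepted by a querying entity.
    p_accept = second_policy.explanation_acceptance_probability(action)
    return second_policy.update_belief(belief, action, observation, p_accept)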
The method 200 offers increased accuracy of belief state updating, by including when updating the belief state of an environment the probability that an explanation for a selected action will be accepted by a querying entity. The querying entity could be any entity that is authorized and able to query an action executed in the environment. The querying entity may be physical or virtual, including for example a stakeholder of the environment such as a network operator, network service provider or network user, and/or a model of such a stakeholder. If an explanation for an action is accepted by a querying entity, this suggests that the action was acceptable under the circumstances in which it was selected, and consequently that the belief state on which the action selection was based was correct. Conversely, if an explanation is not accepted, then the selected action was inappropriate for the environment circumstances, suggesting that the belief state on which the action selection was based may be incorrect. The probability that an explanation will be accepted therefore provides an indication of the likelihood that the current belief state is correct, and incorporating this probability into the updating of the belief state can improve the accuracy of the belief state, and consequently improve the quality of actions selected by the policy.
Referring to
In step 320, the training node records an explanation tree for the selected action, wherein the explanation tree comprises a representation of the current belief state of the environment, available actions that could be executed in the environment, and predicted reward function values associated with the available actions. As illustrated at 320a, the explanation tree for the selected action may comprise a representation of a sequence of historical belief states of the environment, the sequence leading to the current belief state of the environment, and a plurality of action branches departing from the current belief state of the environment. Each action branch may correspond to a possible action that could be executed in the environment in the current belief state, and may comprise a predicted reward function value associated with the possible action. In some examples, the sequence of belief states leading to the current belief state may comprise only those belief states that have been identified as “valid”, after an explanation for an action taken from the belief state and queried by an entity has been accepted. Identification of valid belief states is discussed in greater detail below. As illustrated at 320b, an action branch of the explanation tree may comprise at least one sub-branch, the sub-branch corresponding to a subsequent possible action that could be executed in the environment and a predicted reward function value associated with the subsequent possible action. Multiple sub-branches, representing possible actions and predicted reward values for future time steps, may be present in the explanation tree. The explanation tree may be recorded by saving the explanation tree to a memory, sending the explanation tree to another logical entity for storage, or in any other suitable manner.
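A possible data structure for the explanation tree recorded at step 320 is sketched below in Python. The class and field names are assumptions introduced for illustration; they simply mirror the elements described above (a sequence of historical belief states leading to the current belief state, and action branches carrying predicted reward values and optional sub-branches for future time steps).

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ActionBranch:
    # One possible action that could be executed from the current belief state,
    # together with its predicted reward function value.
    action: str
    predicted_reward: float
    # Sub-branches correspond to subsequent possible actions at future time steps.
    sub_branches: List["ActionBranch"] = field(default_factory=list)

@dataclass
class ExplanationTree:
    # Sequence of historical belief states (each a probability distribution over
    # possible operational states) leading to the current belief state; this may be
    # restricted to belief states previously identified as valid.
    belief_history: List[Dict[str, float]]
    # Current belief state from which the action was selected.
    current_belief: Dict[str, float]
    # Action branches departing from the current belief state.
    branches: List[ActionBranch]
    # The action actually selected by the first policy function.
    selected_action: str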
In step 330, the training node causes the selected action to be executed in the environment. This may comprise initiating execution of the action, either through direct communication with the relevant environment entity, or by sending an instruction to an environment system controller or other control function.
Referring now to
Example observations obtained at step 340 may include:
The observations may be specific to the environment or to a part of the environment, or they may encompass observations that reflect performance of the network on a wider level.
Example KPIs that may be included as observations may include:
Observations may be obtained from different network nodes, depending on the environment to be managed. For example, for network devices, the observations may be obtained from counters or logs at the individual devices. In addition, network monitoring packets may be interspersed with data packets to estimate observations. For radio access nodes, cell tower logs provide historical data. In addition, signalling between a UE and radio access nodes provides channel information that may be used to estimate signal strength. In robotics use cases, robotic sensors provide observations of location, imaging and mapping of the environment.
In some examples, features may be extracted from the obtained observations before continuing with subsequent method steps.
Referring still to
In step 350, the training node uses a second policy function to generate an updated current belief state of the environment based on the obtained observation, and on a probability that an explanation for the selected action would be accepted by an entity querying the selected action, the explanation being generated using the explanation tree for the selected action. As illustrated at 350a, this may comprise computing the updated belief state as the product of first and second conditional probability distributions over possible operational states in which the environment may exist, wherein the first conditional probability distribution is conditioned on the obtained observation, the selected action and the current belief state, and the second conditional probability distribution is conditioned on the explanation acceptance probability, the selected action and the current belief state.
Further discussion of belief state update calculation is provided below, with reference to example implementations of methods of the present disclosure.
In step 352, the training node receives from an entity a query relating to a selected action. As discussed above, the entity may comprise any logical entity that is authorized and able to query an action executed in the environment. The querying entity may be physical or virtual, including for example a stakeholder of the environment such as a network operator, network service provider or network user, and/or a model of such a stakeholder.
Referring now to
The step 356 of generating an explanation for the queried action from the retrieved explanation tree may be carried out as a series of sub-steps, as illustrated in
In another example, the environment may be subject to a plurality of operational requirements, and the reward function may be operable to reward compliance with the operational requirements. Operational requirements may for example include specifications of specific actions to be taken in specific states, states to be avoided, etc. In such examples, as illustrated on the right of
Referring again to
Referring now to
The rollback and pruning process path involves incrementally rolling back belief states leading to actions whose explanation was rejected and trying alternative actions until an explanation for a queried action is accepted. Incorrect action branches in the policy space of the policy, that is actions that resulted in an incorrect belief state, may be pruned from the policy space, to ensure an incorrect action is not selected again from the same belief state.
Referring to
As illustrated in step 372, the training node may additionally update the policy space of the policy such that the first policy function is prevented from selecting the incorrect action (action an−1) from the belief state in which the incorrect action was selected (bn−1). This may comprise pruning the policy space of the policy to remove the action branch that corresponds to selection of the incorrect action (an−1) in the relevant belief state (bn−1). For the purposes of the present disclosure, the policy space of the policy comprises the actions available for selection by the policy in the various belief states. Preventing the policy from selecting the incorrect action from the belief state in which it was selected may therefore be achieved by removing the policy branch that yielded the incorrect explanation. In the event of additional queries received and explanations rejected, the training node may repeat the rollback and pruning of steps 364 to 372 until an explanation for a queried action is accepted. The number of times that rollback and pruning has already been conducted determines how much further back in the belief states of the policy space the training node should explore. As discussed above, rollback and pruning will continue until an explanation for a queried action is accepted, or until a previously validated belief state is reached.
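The rollback and pruning of steps 364 to 372 may be illustrated with the following Python sketch. The representation of the policy space as a mapping from belief state identifiers to sets of available actions is an assumption made purely for the example.

def rollback_and_prune(belief_sequence, policy_space, incorrect_action):
    # belief_sequence:  identifiers of belief states, most recent last (..., b_{n-1}, b_n).
    # policy_space:     mapping from belief state identifier to the set of actions
    #                   available for selection in that belief state.
    # incorrect_action: the action a_{n-1} whose explanation was rejected.

    # Roll back from the current belief state b_n to the belief state b_{n-1}
    # in which the incorrect action was selected.
    belief_sequence.pop()
    previous_belief = belief_sequence[-1]

    # Step 372: prune the policy space so that the incorrect action can no longer
    # be selected from b_{n-1}.
    policy_space[previous_belief].discard(incorrect_action)

    # The remaining actions may then be tried from b_{n-1}; rollback and pruning is
    # repeated until an explanation is accepted or a previously validated belief
    # state is reached.
    return previous_belief, policy_space[previous_belief]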
Referring still to
As illustrated in the centre of
Referring still to
Regardless of the nature of the feedback received (acceptance or rejection of the explanation), and of the process path followed in the event of rejection, the training node then, following either step 372, step 378, or step 382, updates an explanation acceptance probability distribution over available actions for execution in the environment on the basis of the feedback from the entity. In this manner, the feedback from the entity informs the value of the explanation probability that is used to update the belief state at each iteration of the method 200 (following each action selection). Thus, while the feedback from a querying entity may be relatively sparse, for example if only a small proportion of executed actions are queried, this feedback informs each belief state update via the updated explanation probability distribution.
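One simple way of maintaining the explanation acceptance probability distribution from sparse accept/reject feedback is a smoothed per-action estimate, sketched below in Python. The exponential smoothing and the initial value of 0.5 are assumptions made for illustration; they are not prescribed by the method.

def update_explanation_acceptance(acceptance_probs, action, accepted, alpha=0.1):
    # acceptance_probs: mapping from action to the current probability that an
    #                   explanation for that action would be accepted.
    # accepted:         True if the querying entity accepted the explanation.
    # alpha:            smoothing factor (illustrative choice).
    target = 1.0 if accepted else 0.0
    current = acceptance_probs.get(action, 0.5)  # uninformative starting value
    acceptance_probs[action] = (1.0 - alpha) * current + alpha * target
    return acceptance_probs

# Example: sparse feedback on two queried actions.
probs = {}
update_explanation_acceptance(probs, "limit_incoming_traffic", accepted=True)
update_explanation_acceptance(probs, "process_more_traffic", accepted=False)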
As discussed above, the training node carrying out examples of the methods 200, 300 may encompass multiple logical entities. In some examples, these logical entities may comprise a POMDP agent and an explainer agent. Examples of the present disclosure also propose methods carried out by such agents, which methods cooperate to result in examples of the methods 200, 300.
According to one example of the present disclosure, there is provided a computer implemented method for training a policy for managing an environment in a communication network, wherein the environment is operable to perform a task, the method, performed by a POMDP agent, comprising:
According to another example of the present disclosure, there is provided a computer implemented method for facilitating training a policy for managing an environment in a communication network, wherein the environment is operable to perform a task, the method, performed by an explainer agent, comprising:
The method may further comprise identifying as either a valid or an invalid belief state the belief state in which the queried action was selected.
For the purpose of the present disclosure, an Agent comprises a physical or virtual entity that is operable to implement a policy for the selection of actions on the basis of an environment state. Examples of a physical entity may include a computer system, computing device, server etc. Examples of a virtual entity may include a piece of software or computer program, a code fragment operable to implement a computer program, a virtualised function, or any other logical entity. A virtual entity may for example be instantiated in a cloud, edge cloud or fog deployment.
As discussed above, the methods 200 and 300 may be performed by a training node, and the present disclosure provides a training node that is adapted to perform any or all of the steps of the above discussed methods. The training node may be a physical or virtual node, and may for example comprise a virtualised function that is running in a cloud, edge cloud or fog deployment. The training node may for example comprise or be instantiated in any part of a logical core network node, network management centre, network operations centre, Radio Access node etc. Any such communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the training node may be instantiated in one or more logical or physical functions of a communication network node.
Referring to
The discussion above provides an overview of methods which may be performed according to different examples of the present disclosure. These methods may be performed by a training node, as illustrated in
However, in the illustrated architecture, this equation is modified to take account of explanation probabilities, as discussed below.
The explainer agent provides, on request from a stakeholder of the environment, explanations for the actions executed in the environment. These explanations are either accepted or rejected by the stakeholder in explanation feedback provided to the explainer agent. This explanation feedback is used both to validate or invalidate belief states, and to improve the updating of belief states at each time step, by using a probability that a given explanation will be accepted as an additional input to the belief update equation. In this manner, the explanation probabilities act as an additional observation, representing stakeholder input. Additionally, invalid belief states can be pruned from the policy tree, and valid belief states can be saved as conceptual “milestones”, representing a moment at which input from the stakeholder has confirmed that the belief state is correct. Using the underlying belief states, explanations are generated (assuming correct reasoning through abduction or deduction) on the basis of an explanation tree recorded at the time an action is selected by the POMDP agent. The feedback received on these explanations is used as a baseline to propagate correct beliefs. Gradually, this will result in improved action selection, and consequently fewer rejections of explanations for queried actions, as actions are selected on the basis of belief states that have been validated, or generated using explanation probabilities based on stakeholder feedback.
The following sections discuss in detail the operation of the POMDP model, that is the use of first and second policy functions to select actions and update belief states, the generation of explanations, and the processing of explanation feedback.
As discussed above, a Partially Observable Markov Decision Process (POMDP) models the relationship between an agent and its environment.
Formally, a POMDP consists of the state space S, the action space A, the transition function T, the reward function R, the observation space Z and the observation function O introduced above.
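Written compactly, and assuming the usual inclusion of a discount factor γ weighting future rewards, this is the tuple:

\[ \langle S, A, T, R, Z, O, \gamma \rangle, \qquad T(s, a, s') = P(s' \mid s, a), \qquad O(s', a, o) = P(o \mid s', a). \]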
As the agent does not directly observe the environment's state, the agent must make decisions under uncertainty about the true environment state. However, by interacting with the environment and receiving observations, the agent may update its belief in the true state by updating the probability distribution over the current state. After having taken action a and received observation o, the agent needs to update its belief in the state the environment may be in. The belief update is Markovian, and so maintaining a belief over the states requires only knowledge of the previous belief state, the action taken, and the current observation.
The belief update process computes a new belief b′(s′) from an initial belief b(s), after the agent has taken action a and received observation o.
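In its standard form, which is assumed here, this update (referred to below as eq. 1) is:

\[ b'(s') = \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{P(o \mid a, b)}, \qquad P(o \mid a, b) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s). \]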
The value iteration over the state-action space is contingent on identifying the correct belief set that maximizes rewards.
Examples of the present disclosure propose updating the POMDP model described above with additional (multi-modal) explanation probabilities as observations. These probabilities are based on explanation feedback received from a querying entity, such as a stakeholder, and the POMDP model is updated with the probabilities that a given explanation will be accepted, based on the explanation feedback received. These explanation probabilities are used for belief state updates:
Initial belief: b0(s)=P(S=s). This remains the same as for a conventional POMDP.
Transition probabilities: T(s,a,s′)=P (s′|s,a). These are also the same as for a conventional POMDP.
Actions: Action a at state s enhanced to <a, e> where a is the action, and e is the explanation tree around s (past and future paths). This is not a design time construct but depends on the policy, so whenever a is selected by the policy, what is executed is <a, e>: action a is output to the managed system or environment for execution, and explanation tree e is output to a memory or other storage function. The explanation tree will serve as the basis for an explanation of the action a, should that action be queried.
Observation probabilities: O(s′,a,o)=P (o|s′,a). These remain the same as for a conventional POMDP.
Explanation feedback probabilities: E(s′,a,e)=P (e|s′, a). This is an additional (multi-modal) observation input provided to POMDP. Explanation probabilities are the probability that an explanation provided at a particular state would be accepted by the querying entity or stakeholder.
Belief state updating: b′(s′)=P(s′|o, a, b) P(s′|e, a, b). This is the product of a term using conventional observations and a term incorporating the explanation probabilities, and consequently incorporating insight from the explanation feedback.
Rewards: R(s,a,e). The rewards are updated with the explanation feedback incorporated.
The Bellman update function for the POMDP is then modified to make use of the extended rewards R(s,a,e) and the explanation probabilities.
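One plausible form of this update, assuming that the explanation probabilities E enter the expectation in the same way as the observation probabilities O and that the reward takes the extended form R(s, a, e), is:

\[ V(b) = \max_{a \in A} \Big[ \sum_{s \in S} b(s)\, \mathbb{E}_{e}\big[R(s, a, e)\big] + \gamma \sum_{o \in Z} \sum_{e} P(o, e \mid b, a)\, V\big(b'_{o, e, a}\big) \Big], \]

where the expectation over e is taken with respect to the explanation probabilities E, and b′_{o,e,a} denotes the belief obtained from b after taking action a, receiving observation o and explanation feedback e.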
The value iteration function is therefore updated by making use of the explanation feedback received from the stakeholders. The action and corresponding explanation go hand-in-hand. Negative feedback for an explanation triggers alternative actions/rollbacks as discussed below. Positive feedback validates belief states and can be used to construct explanations.
Explanations concerning the outputs of POMDP models may be created in the form of policy graphs or decision trees as illustrated in
The explanations are typically provided as a response to questions raised on the reason for a particular action, KPI level or path. The explanation tree on which an explanation is based may make use of the decision tree formalism, wherein the tree to N levels prior to current action presents the sets of beliefs, possible states, KPIs and objective functions that expose the reasoning behind the current decisions.
The current decision depends upon the current belief state, and upon the way it may evolve in the future depending upon possible inputs. There are consequently two sub-questions to be answered as part of the explanation process:
An explanation tree structure thus resembles a sequence leading to the current belief state and then a fan-out into the future. As illustrated in
Updating Belief States with Explanation Probabilities
Belief states are updated using an additional input observation in the form of explanation probabilities. Given a set of explanations E and the corresponding explanation probabilities Ep, the belief update equation (eq. 1) is now extended with an explanation probability term.
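A sketch of the modified update, consistent with the product form b′(s′) = P(s′|o, a, b)·P(s′|e, a, b) given above and with normalisation of the resulting distribution assumed, is:

\[ b'(s') \;\propto\; \Big[ O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s) \Big] \cdot \Big[ E(s', a, e) \sum_{s \in S} T(s, a, s')\, b(s) \Big]. \]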
This adds in the probability that an explanation for a given action will be accepted by a querying entity or stakeholder as an additional factor to consider for belief updates. It will be appreciated that there may be some transitions without any explanation probabilities, and in such transitions the original belief update equation, taking account only of observation and transition probabilities, may be used to update the belief state.
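A minimal Python sketch of this update, including the fallback to the original observation-only update when no explanation probability is available for a transition, is given below. The dictionary-based representation of T, O and E is an assumption made for the example.

def update_belief(belief, action, observation, T, O, E=None, explanation=None):
    # belief:      mapping state -> probability (current belief b).
    # T[s][a][s2]: transition probability P(s2 | s, a).
    # O[s2][a][o]: observation probability P(o | s2, a).
    # E[s2][a][e]: probability that explanation e would be accepted in s2 after a;
    #              if no explanation probability is available, the standard
    #              observation-only update is used.
    states = list(belief.keys())
    new_belief = {}
    for s2 in states:
        # Predicted probability of reaching s2 under the transition model.
        predicted = sum(T[s][action][s2] * belief[s] for s in states)
        weight = O[s2][action][observation] * predicted
        if E is not None and explanation is not None:
            # Multiply in the explanation probability term P(s2 | e, a, b).
            weight *= E[s2][action][explanation] * predicted
        new_belief[s2] = weight
    total = sum(new_belief.values())
    return {s2: w / total for s2, w in new_belief.items()} if total > 0 else new_belief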
While the explanation probabilities are used at each time step to update belief states, not every action will be queried, and so explanation feedback may be relatively sparse. On receiving explanation feedback (accepted or rejected explanation), there are various options for how this feedback may be incorporated into the policy training process:
Correct explanations matching with the underlying belief states may result in updating and/or addition of explanation probabilities, and/or additional reward being added to decisions that led to correct beliefs. Similar updating of explanation probabilities and reduction of reward may be performed following feedback regarding an incorrect explanation.
Two possible policy paths following feedback indicating a rejected explanation are illustrated in
Explanation feedback may also be used to update the reward function. Rejection of an explanation may indicate either that there is a change in the environment or that the reward has not been correctly specified. The reward function can be modified through explanations that identify defects in the underlying belief model.
In another example, an Epsilon-Greedy approach may be taken to the belief update. While generating a management policy, an agent generally acts according to its current knowledge in a greedy manner in the pursuit of rewards, a process referred to as exploitation. However, acting greedily based on limited knowledge of the environment makes it very unlikely that an agent will learn the optimal behaviour in the environment.
When an agent performs exploration, it does not necessarily act in accordance with its knowledge; instead it explores the different options available, as dictated by some exploration strategy. Epsilon-greedy (ε-greedy) is one of the most popular strategies to balance exploration and exploitation. Here, with probability ε the agent takes a random action, and with probability 1−ε the agent takes the best action according to its current knowledge. It will be appreciated that exploration is more important when the agent does not have enough information about the environment it is interacting with.
While traditional epsilon-greedy approaches decay the exploration rate in a uniform manner, examples of the present disclosure propose, in the example policy path illustrated in the lower part of the figure, to adjust the exploration rate on the basis of explanation acceptance.
Explanation acceptance ranges from −1 to +1, and the above procedure adds increased exploration when explanations are not accepted. Explanations are also improved with increased exploration, and this process assists with ensuring that the feedback from explanations is correctly incorporated into the POMDP update model.
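A possible realisation of this explanation-aware adjustment of the exploration rate is sketched below in Python. The specific scaling rule is an assumption made for illustration, chosen only so that rejected explanations (acceptance towards −1) increase exploration, while accepted explanations (acceptance towards +1) allow it to decay towards its minimum.

import random

def adaptive_epsilon(base_epsilon, explanation_acceptance, min_eps=0.01, max_eps=0.5):
    # explanation_acceptance lies in [-1, +1]; negative values (rejected explanations)
    # raise epsilon and so increase exploration, positive values reduce it.
    adjusted = base_epsilon * (1.0 - explanation_acceptance)
    return max(min_eps, min(max_eps, adjusted))

def epsilon_greedy(q_values, epsilon):
    # q_values: mapping from action to its current estimated value.
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore
    return max(q_values, key=q_values.get)    # exploit

# Example: a rejected explanation (acceptance -1) doubles the exploration rate.
eps = adaptive_epsilon(base_epsilon=0.1, explanation_acceptance=-1.0)
action = epsilon_greedy({"limit_incoming_traffic": 1.2, "process_more_traffic": 0.4}, eps)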
Given a target training domain DT and learning task TT, using the knowledge inputs provided by the stakeholder explanation feedback ST, the predictive function for task TT could be arrived at more quickly than with training data alone. This concept is similar to transfer learning, in which accurate inputs from other domains (here the explanation feedback from the stakeholder) can improve the learning process significantly.
As discussed above, the explanations provided to a querying entity may take different forms, including an extract of the explanation tree for a given action, or a mapping to operational requirements for the environment. In some examples, the query that triggers generation of the explanation may make use of templated questions of the contrastive form. The templates may for example take the form: “Why A rather than B?”, where A is the fact (what occurred in the plan) and B is the foil (the hypothetical alternative expected by the stakeholder). The formal questions may be templated as follows:
It will be appreciated that if the explanation provided in response to the above queries is accepted by the querying entity, then no changes will be enforced in the policy as a consequence of the query. The above noted changes are conditional upon the explanation for the respective query being rejected by the querying entity.
According to one example of the present disclosure, explanations may be generated using the Easy Approach to Requirements Syntax (EARS) process. EARS has been shown to drastically reduce or even eliminate the main problems typically associated with natural language (NL) requirements. EARS defines various types of operational requirements for the environment or system under consideration:
If no useful explanation feedback is provided (for example either the explanation feedback arrives after training is completed or the explanation feedback is itself incomplete), the POMDP agent can train exhaustively on the environment dataset without any stakeholder input. If suitable feedback is provided at certain periodic intervals or following certain events in the training phase, this external knowledge is incorporated within the model. This may improve the training for example by pruning away incorrect belief state branches. Off-policy updates could also be performed using collected batch inputs from the stakeholder.
It will be appreciated that, as discussed above, a querying entity, which may be a stakeholder of the environment, need not be a human entity such as a person asking questions, but may instead be an interface between the environment and the training agent. Alternative implementations of this interface may be envisaged; for example, statistics of valid/invalid explanations could be used to model a stakeholder “mimic” system that intervenes at appropriate periods during training.
For the avoidance of doubt, it will be appreciated that there exists an explicit difference between rewards, explanations, explanation feedback, and the explanation probabilities that are presented according to examples of the present disclosure as additional observations for the purpose of updating belief states. Rewards are pre-defined for each state and are dependent on the distribution of the environment/agent training inputs. Explanations are generated in response to queries of specific actions; explanation feedback is consequently sparse, but provides a more up-to-date view of the environment.
This can be used to modify the beliefs and the outputs of the agent with respect to the current state. Explanation feedback is also used to update the explanation probabilities that are used at each time step to update belief states. The use of explanation feedback allows for better adaptation to distribution shifts or dynamicity in the environment. As the agent has only “partial observability” of the environment, including this feedback also allows for better adaptation to unforeseen events. Using explanation feedback may also reduce the need for repeated re-training owing to minor changes in the environment. Beliefs may be updated dynamically using explanation feedback, and this understanding flows through each time step via the explanation probabilities.
There now follows a discussion of some example use cases for the methods of the present disclosure, as well as description of implementation of the methods of the present disclosure for such example use cases. It will be appreciated that the use cases presented herein are not exhaustive, but are representative of the type of problem within a communication network which may be addressed using the methods presented herein.
Many suitable use cases for the present disclosure fall into the class of use cases that may be described as network parameter optimization.
The first use case considers a set of networking devices as illustrated in
Considering initially the transition, observation and action probabilities of device 1, the internal state of the device is formulated to have three granular states: low load, mid load and high load, and the device can perform internal configuration changes based on observations from internal sensors or neighboring devices. The relevant parameters for device 1 can be specified in POMDP format, suitable for input to value iteration solvers:
Transition probabilities between states defined above (3×3 matrix describing state transitions with respect to each other following performance of a given action)
T: limit_incoming_traffic (state transition probabilities following execution of this action)
The probability of observing a given observation if a specific action is taken in any possible state
The probability that an explanation for a given action will be accepted when the action is taken in any possible state
It will be appreciated that in the above code snippet the Correct_Explanation is included as a further observation within the POMDP formulation of rewards. The second reward, based on receipt of correct explanation feedback, is only available when an action has been queried and the explanation accepted by the querying entity. The training node may consequently update the total reward for a given action to include the additional reward from the correct explanation feedback. The explanations and explanation probabilities represent the additional knowledge gained from the feedback provided by a querying entity.
The following example steps may be used to integrate the explanation feedback:
“Why is action limit_incoming_traffic used, rather than action process_more_traffic?”
Explanation Feedback: (From environment inputs or internal model):
This feedback is incorporated as an additional observation to update the belief probabilities.
At timestamp_(t−1) the belief state of the agent was: Device1_High_load
The action taken at timestamp_(t−1) was limit_traffic to maintain KPI
The current belief state at timestamp_(t) is Device1_Medium_load
The accuracy of the explanations is consequently also improved by belief propagation and revision.
According to one aspect of the present disclosure, there is provided a computer implemented method for training a policy for managing configuration of a communication network node, wherein the node is operable to process an input data flow, the method, performed by a training node, comprising:
the current belief state of the network node;
A detailed example of the steps that may be followed according to examples of the present disclosure is presented below.
As for the first use case, the POMDP model is updated with additional (multi-modal) explanations as observations, and with explanation feedback received. This feedback is used to generate explanation probabilities that are used for belief state updates.
Initial belief: b0(s)=P(S=s). This remains the same as for a conventional POMDP.
The states are compound states indicating whether Video QoS requirements are met or unmet, and whether network resource constraints are met or exceeded.
Action a at state s is enhanced to <a, e> where a is the original action, and e is the explanation tree around s (past and future paths). As discussed above, this is not a design time construct but depends on the policy, so whenever a is selected by the policy, what is executed is <a, e>: a is output to the system for execution, and e is recorded for future use in generating an explanation for a.
Transition probabilities: T(s,a,s′)=P (s′|s,a). Transition probabilities between states defined above (3×3 matrix describing state transitions with respect to each other following performance of a given action). This remains the same as for a conventional POMDP.
T: Change_5QI_Priority <Explanation: Priority Change> (transitions following execution of this action)
O(s′,a,o)=P (o|s′,a). The probability of observing a given observation if a specific action is taken in any possible state. This remains the same as for a conventional POMDP.
Explanation feedback probabilities: E(s′,a,e)=P (e|s′, a). This is an additional (multi-modal) observation input provided to the POMDP, derived from the explanation feedback. It represents the probability that an explanation provided at a particular state would be accepted by the stakeholder.
O: Change_5QI_Packet_Delay <Explanation Change Packet Delay>: Correct_Explanations 0.8
b′(s′)=P(s′|o, a, b) P(s′|e, a, b). This is the product of a term using conventional observations and a term incorporating the explanation probabilities, and follows from the above descriptions. The agent receives an observation from the environment. Periodically, it can also receive explanation feedback from the stakeholder. The feedback is used to update the explanation probabilities that are factored into the belief updates.
Correct explanations matching with the underlying belief states may result in additional rewards being added to decisions that led to correct beliefs, as well as update of explanation probabilities. Given a correct explanation and its validated belief state, explanations may be composed hierarchically based on such valid belief states (also referred to as milestones). The milestones may represent links between valid belief states and correct explanations:
In some examples, only parts of the decision tree that consist of correct beliefs, or belief states that have been identified as valid, may be used to generate the explanation. This increases the accuracy of the explanation and reduces state-space explosion over the belief space.
Incorrect beliefs, identified via rejected explanations for a given action, may result in pruning of some parts of the policy generated. This will roll back the beliefs to certain stages, for example, before re-computing the optimal policy, given additional explanation observations. The following example, taken from the present use case, illustrates how an incorrect explanation may result in a change of belief state:
At time t1, Observation QoS deterioration is obtained
A query of this action is received from an entity, an explanation is provided based on the explanation tree output with the selected action, and explanation feedback is received rejecting the explanation provided.
Explanation reward: R: Change_5QI_Packet_Delay <Explanation Change Packet Delay>: Video_QoS_Unmet_Resource_Constraints_Met:*:Incorrect_Explanation −20
Belief Update: Re-formulate Belief State
Other use cases that could be envisaged for methods according to the present disclosure include Radio Access Network optimization such as Cell shaping (antenna power and Remote Electronic Tilt), P0 Nominal PUSCH, Downlink Power Control, and Maximum Transmission Power, all of which could be optimized with respect to network level performance. Robotics use cases may also be envisaged, including agent trajectory learning, optimization of configuration parameters, etc.
Example methods according to the present disclosure thus propose exposing a part of the POMDP and policy, in the form of an explanation tree, as part of an action of the POMDP. The explanation tree for a given action may then serve as the basis for generating an explanation for the action if the action is queried. Feedback from explanations provided for queried actions may be incorporated to validate/invalidate calculated belief states, and as additional observations into the POMDP model, in the form of explanation probabilities. Validated belief states may be used as “milestones” to improve and compose explanations for POMDP agent behavior. Examples of the present disclosure thus seek to increase the probability of identifying the correct belief state, and consequently of selecting appropriate actions, using explanation feedback, as well as improving the explanations provided to a querying entity for any given action.
Incorporating insight from an external entity via belief correction and updates can help to improve both training and runtime policy deployment. Training time can be substantially reduced as incorrect belief states are pruned, and the result is a more accurate policy obtained with less computation. In addition, POMDP agents and explainer agents can be trained on cloud deployments for faster processing. Structuring explanations around validated belief states can also reduce the chance of explanation rejection. Overall system performance can be improved, resulting in improved trust in the policy on the part of the stakeholder.
The methods of the present disclosure may be implemented in hardware, or as software modules running on one or more processors. The methods may also be carried out according to the instructions of a computer program, and the present disclosure also provides a computer readable medium having stored thereon a program for carrying out any of the methods described herein. A computer program embodying the disclosure may be stored on a computer readable medium, or it could, for example, be in the form of a signal such as a downloadable data signal provided from an Internet website, or it could be in any other form.
It should be noted that the above-mentioned examples illustrate rather than limit the disclosure, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims or numbered embodiments. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim or embodiment, “a” or “an” does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims or numbered embodiments. Any reference signs in the claims or numbered embodiments shall not be construed so as to limit their scope.