The present disclosure is directed to implementing deep learning in real-world applications.
Embodiments described herein involve a method for providing human-understandable explanations for an action in a machine reinforcement learning framework. A policy based on a compound reward function comprising a sum of two or more reward terms is learned through a reinforcement learning algorithm at a learning network. The policy is used to choose an action of a plurality of possible actions. A state-action value network is established for each of the two or more reward terms. The state-action value networks are separated from the learning network. A human-understandable output is produced to explain, based on each of the state-action value networks, why the action was taken.
Embodiments described herein involve a system comprising a processor and a memory storing computer program instructions which, when executed by the processor, cause the processor to perform operations. The operations comprise learning, through a reinforcement learning algorithm at a learning network, a policy based on a compound reward function, the compound reward function comprising a sum of two or more reward terms. The policy is used to choose an action of a plurality of possible actions. A state-action value network is established for each of the two or more reward terms. According to various embodiments, the state-action value networks are separated from the learning network. A human-understandable output is produced to explain, based on each of the state-action value networks, why the action was taken.
The above summary is not intended to describe each embodiment or every implementation. A more complete understanding will become apparent and appreciated by referring to the following detailed description and claims in conjunction with the accompanying drawings.
The figures are not necessarily to scale. Like numbers used in the figures refer to like components. However, it will be understood that the use of a number to refer to a component in a given figure is not intended to limit the component in another figure labeled with the same number.
Embodiments described herein involve a way of using reward factorization in an auxiliary network to get explanations of deep learning-based reinforcement learners without compromising convergence of the network. These explanations help to explain why the agent did what it did. Embodiments described herein can be combined with state-of-the-art innovations in policy gradient learning (e.g., A3C) to get efficient, powerful learners that work with unstructured state and action spaces.
Embodiments described herein involve contexts where deep learning has been applied to high-dimensional visual inputs to solve tasks without encoding any task-specific features. For instance, a deep Q-learning network (DQN) can be trained on screen images of the Atari Pong game to learn how to move the paddles to score well in the game. The network learns to visually recognize and attend to the ball in screen images in order to play. The same network can be trained on screen images of Atari Space Invaders without any change. In this case, the network learns to visually recognize and attend to the aliens. The ability to automatically learn representations of the world and extended sequential behaviors from only an objective function is a very attractive and exciting prospect.
Researchers have observed that small changes to these Atari games can result in somewhat random behavior. For instance, deleting the ghosts from Pacman, which should make it easy for the agent to collect points without fearing attacks from ghosts, results in an agent that wanders somewhat aimlessly, suggesting that the system is not learning the same kinds of representations of the domain that a human does. For this reason, many researchers have been investigating ways of extracting explanations of agent behavior to understand if the agent's representations are likely to generalize.
Perturbation-based saliency methods, originally developed for image classification networks, attempt to get at these representations by determining how changes to coherent regions of the input image change the agent's action choices. Information about what visual features are being used can be helpful when trying to determine if the appropriate visual features are being represented. Saliency features, however, are not useful when trying to reason about why the agent chooses one action or another in a given situation. Researchers have attempted to uncover the structure of agent behavior by clustering latent state embeddings created by the networks, finding transitions between these clusters, and then using techniques from finite automata theory to minimize these state machines to make them more interpretable. These methods may rely on humans to supply semantic interpretations of the states based on watching the agent's behavior and trying to puzzle out how it relates to the abstract integer state of the finite automata. It is also unclear how interpretable these will be if the state machine becomes at all complex (which is likely, as integer state machines do not factorize environment state, resulting in combinatorial complexity as domain state variables interact). They also fail to shed light on how a particular action choice relates to the agent's goals.
One approach exploits the semantics of the reward function structure. The human engineer architects the reward function for a problem to explicitly relate features of the state, such as the successful kill of an alien in the game, to a reward value used to optimize the agent's policy. In many domains, this reward function can have rich structure. An agent might be trying to avoid being killed while simultaneously trying to minimize travel time, minimize artillery use, capture territory, and maximize the number of killed opponents. These terms may appear separately in the reward function. Researchers have exploited this structure to make behavior more interpretable. They observe that the linearity of the Q-function (which represents the expected future value of taking an action in a state) allows it to be decomposed. The Bellman equation defines how the value of an action in a state is equal to the immediate reward R(s,a) plus the discounted value of the states the agent might reach in the future, as shown in (1).
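A standard form of this Bellman relation, consistent with the description above and assuming a discount factor γ and transition distribution P(s′|s,a), is:

Q(s,a) = R(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[ \max_{a'} Q(s', a') \big]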
If the reward can be decomposed into terms for each concern of the agent (death, travel, bullets, etc.), the Q function can be expressed in terms of this decomposition as shown in (2).
Because Q-values are a linear function of rewards, the Q-function itself can be decomposed. The expected value can be computed with respect to a single concern as shown in (3).
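The decomposition described in (2) and (3) can be sketched, under the assumption that the reward is a sum of per-concern terms R_c(s,a), as:

R(s,a) = \sum_{c} R_c(s,a), \qquad Q_c(s,a) = \mathbb{E}_{\pi}\Big[ \sum_{t \ge 0} \gamma^{t} R_c(s_t, a_t) \;\Big|\; s_0 = s,\ a_0 = a \Big]

By linearity of expectation, summing the concern-specific terms recovers the total Q-value, which gives (4).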
The total Q-value of an action in a state can then be expressed as the sum of concern-specific Q functions as shown in (4).
Q(s,a) = Qdeath(s,a) + Qtravel(s,a) + Qbullets(s,a) + . . .   (4)
This allows an understanding of the value of a local atomic action in terms of its contribution to future reward associated with specific concerns. So an action might dominate at time t because it reduces travel or avoids death. At a high level, the idea is to find the minimal set of positive rewards for an action that dominates the negative rewards of alternative actions and to use this set as an explanation.
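A minimal sketch of this selection, in Python, under the assumption that factored Q-values for a single state are available as a dictionary mapping each concern to per-action values (the concern names and numbers below are hypothetical):

# Sketch: select the smallest set of reward concerns whose advantage for the
# chosen action outweighs its disadvantages versus an alternative action.
# `q` is a hypothetical dict: concern name -> {action: Q-value}.

def minimal_sufficient_concerns(q, chosen, alternative):
    # Per-concern advantage of the chosen action over the alternative.
    deltas = {c: qc[chosen] - qc[alternative] for c, qc in q.items()}
    deficit = -sum(d for d in deltas.values() if d < 0)   # disadvantage to overcome
    positives = sorted(((d, c) for c, d in deltas.items() if d > 0), reverse=True)

    explanation, covered = [], 0.0
    for d, c in positives:
        if covered >= deficit and explanation:
            break
        explanation.append(c)
        covered += d
    return explanation

q = {
    "death":  {"left": -1.0, "right": -4.0},
    "travel": {"left": -0.5, "right": -0.2},
    "kills":  {"left":  2.0, "right":  1.5},
}
print(minimal_sufficient_concerns(q, chosen="left", alternative="right"))  # ['death']

In this hypothetical state, the output indicates that "left" dominates "right" primarily because it avoids death, even though it costs slightly more travel.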
A first challenge of applying this to high-dimensional visual inputs is that it is already difficult and time consuming to train the networks using the diffuse signal provided by sparse rewards. Adding a large number of additional separate networks, each with its own errors and variances that will be added together, makes the system much harder to optimize. Second, in continuous action domains it is difficult to use Q-learning, as one would have to maximize a non-linear Q-function to obtain actions and define distributions over actions for exploration. Policy gradient methods, which do not compute Q-values, are therefore widely used in these contexts. For both of these reasons, this technique has not seen wide application to practical problems. This is unfortunate, as the only explicit semantic grounding present in the deep RL framework is the human-engineered reward function.
Embodiments described herein use the benefits of factored rewards for explanation while maintaining good convergence and being able to use policy gradients. This can be done by separating the learning and explanation functions while still retaining faithfulness of representation. This allows use of state-of-the-art learning algorithms while getting good convergence and still being able to get insight into why the agent does what it does. Embodiments described herein can be used to implement this concept for a policy gradient algorithm, which is the basis of many modern deep RL learners such as A3C and proximal policy optimization (PPO).
In policy gradient algorithms, a network, traditionally described by πθ(a|s), is used to assign a probability to each of the possible actions. Gradient descent is used to tune the parameters of this network to maximize the expected return J(θ). Policy gradient algorithms rely on the policy gradient theorem, which allows the computation of the gradient of the return without needing to take the derivative of the stationary distribution dπ(s), replacing an explicit expectation with samples drawn from the environment under the policy in question, Eπ. The gradient ∇θJ(θ) can then be used to update the policy network to maximize reward.
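One standard statement of the policy gradient theorem consistent with this description (a sketch in conventional notation, where Q^π denotes the action value under the current policy) is:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi}\big[ Q^{\pi}(s, a)\, \nabla_\theta \ln \pi_\theta(a \mid s) \big]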
In deep policy networks, this is implemented by passing images through convolutional neural networks to create latent features and then using a fully connected layer, or perhaps two, followed by a softmax layer to calculate policy probabilities.
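A minimal sketch of such a network, assuming PyTorch and an 84×84 single-channel input; the layer sizes are illustrative assumptions and do not correspond to any particular figure:

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Sketch: convolutional feature extractor followed by fully connected
    layers and a softmax over discrete actions (illustrative sizes only)."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(               # latent visual features
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.policy_head = nn.Sequential(             # policy probabilities
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
            nn.Softmax(dim=-1),
        )

    def forward(self, screen):                        # screen: (N, 1, 84, 84)
        return self.policy_head(self.features(screen))

probs = PolicyNetwork(n_actions=6)(torch.zeros(1, 1, 84, 84))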
Unfortunately, a textbook implementation may be unstable. Modern methods typically use an estimate of the value of states as a baseline in the action value calculation. The state value estimate function 160 is the maximum action value at each state (Vθ(s) = maxa Qθ(s,a)). The bias term in the policy loss 140 used to optimize the policy 130 can be updated using a standard Bellman loss 170. The overall flow is captured in
As shown in
While the theory behind policy gradient is concise and elegant, getting deep network-based reinforcement learning agents to converge in practice requires a number of tricks and patience to tune many hyperparameters. A single training episode can take days or weeks. It therefore may be undesirable to increase the complexity of the network by adding additional structure. Early on, adding extra network outputs can create noise when updating the core CNN representation that makes learning harder.
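As a point of reference, a minimal sketch of the baseline (advantage) and Bellman-loss computations described above for the base implementation, assuming PyTorch; the function name, tensor names, and the TD-style value target are illustrative assumptions rather than elements of any figure:

import torch
import torch.nn.functional as F

def actor_critic_losses(log_prob, value, next_value, reward, gamma=0.99):
    """Policy loss with a value baseline (cf. 140, 160) and a Bellman-style
    value loss (cf. 170); all arguments are 1-D tensors over a batch."""
    td_target = reward + gamma * next_value.detach()   # bootstrapped value target
    advantage = (td_target - value).detach()           # baseline-subtracted signal
    policy_loss = -(log_prob * advantage).mean()       # drives the policy update
    value_loss = F.mse_loss(value, td_target)          # trains the value estimate
    return policy_loss, value_loss

p_loss, v_loss = actor_critic_losses(torch.zeros(8), torch.zeros(8), torch.ones(8), torch.ones(8))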
Using embodiments described herein, the agent is trained using the base version of the policy gradient algorithm or one of its many derivatives (e.g., A3C) to get an optimal policy πθ*. This creates a policy 330 that the agent can follow to maximize the reward sum 350. The Q-value 360 is averaged over the episodes. Similarly to
In
Equation (7) illustrates the effect of substituting samples drawn from the environment for the expectation over transitions and of using a learning rate α.
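One standard sample-based update of the kind described, sketched for the death concern under the assumption that a′ is the action selected by the learned policy at s′, is:

Q_{death}(s,a) \leftarrow Q_{death}(s,a) + \alpha \big[ R_{death}(s,a) + \gamma\, Q_{death}(s', a') - Q_{death}(s,a) \big]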
These factorized Q-values can then be used to explain the long-term contribution of any local action to the agent's overall goals as defined by the reward function. These extra networks may be referred to herein as an auxiliary factored Q function. The gradient blocking node prevents training of the auxiliary network from affecting the underlying policy network, preserving optimality and stability. The coupling of the Q-network to the base implementation's feature generation CNNs aligns the representation used for calculating Q-values with that used for calculating policy probabilities, leading to increased faithfulness and likely better generalization.
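A minimal sketch of such an auxiliary factored Q function, assuming PyTorch and a policy network exposing a feature extractor like the one sketched earlier; the attribute name policy_net.features, the head sizes, and feat_dim are illustrative assumptions:

import torch.nn as nn

class AuxiliaryFactoredQ(nn.Module):
    """Sketch: one Q head per reward concern, fed by the trained policy
    network's convolutional features. Gradients are blocked with detach()
    so training these heads cannot disturb the converged policy."""
    def __init__(self, policy_net, concerns, n_actions, feat_dim=3136):
        super().__init__()
        self.policy_net = policy_net                   # trained network with shared CNN features
        self.q_heads = nn.ModuleDict({
            c: nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                             nn.Linear(256, n_actions))
            for c in concerns
        })

    def forward(self, screen):
        feats = self.policy_net.features(screen).detach()   # gradient blocking node
        return {c: head(feats) for c, head in self.q_heads.items()}

In this sketch, only the parameters of the Q heads would be passed to the optimizer; the detach() call plays the role of the gradient blocking node, so the converged policy is left untouched while the heads learn per-concern values from the same representation.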
According to embodiments described herein, it may be useful to explicitly plot action values in a tradeoff space as shown in
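For instance, a minimal sketch of such a plot, assuming matplotlib and hypothetical factored Q-values for three candidate actions in a single state:

import matplotlib.pyplot as plt

# Sketch: place each candidate action in a two-concern tradeoff space using
# hypothetical factored Q-values for one state.
actions  = ["left", "right", "fire"]
q_death  = [-1.0, -4.0, -0.5]
q_travel = [-0.5, -0.2, -0.8]

fig, ax = plt.subplots()
ax.scatter(q_travel, q_death)
for name, x, y in zip(actions, q_travel, q_death):
    ax.annotate(name, (x, y))
ax.set_xlabel("Q_travel(s, a)")
ax.set_ylabel("Q_death(s, a)")
plt.show()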
Using embodiments described herein, the agent is trained using the base version of the policy gradient algorithm or one of its many derivatives (e.g., A3C) to get an optimal policy πθ* based on visual input 310 received at a CNN 320. The output of the CNN provides high-level features to a policy network 330 that the agent can follow to maximize the reward sum 350. The Q-value 360 is averaged over the episodes. Similarly to
In
According to embodiments described herein, one can use the value node from the original policy gradient algorithm to provide a bootstrap estimate of the auxiliary Q-value functions during updates. This should accelerate convergence of the auxiliary networks compared to using an independent update, as shown in (8).
Qdeath(s,a) ← Qdeath(s,a) + α[(Rdeath(s,a) + γVpolicy_network(s′)) − Qdeath(s,a)]   (8)
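A minimal sketch of a per-concern loss built around the bootstrapped target in (8), assuming PyTorch; q_head, feats, actions, r_concern, and v_next are illustrative names for one concern's Q head, the shared features, the taken actions, the concern-specific rewards, and the policy network's value output at s′:

import torch.nn.functional as F

def factored_q_loss(q_head, feats, actions, r_concern, v_next, gamma=0.99):
    # Q-value of the action actually taken, for this concern.
    q_sa = q_head(feats).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bootstrapped target: concern reward plus the policy network's value of s'.
    target = r_concern + gamma * v_next.detach()
    return F.mse_loss(q_sa, target)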
According to various embodiments, one can alter the Q-networks so that they take both an action and a state as input, ƒ(s,a), to allow for continuous actions. These may be more difficult to optimize, as gradient ascent may be used to find an action that obtains a local maximum in value. In some embodiments, the policy learning and auxiliary explanation learning can be run at the same time. Due to the gradient blocking node, training of the Q-value functions will not affect learning or convergence of the agent. This could be useful in debugging the learning of the agent before it is fully converged. One could understand what tradeoffs an agent is making and whether or not these are rational.
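A minimal sketch of this continuous-action variant, assuming PyTorch; the network sizes, the zero initialization of the action, and the optimizer settings are illustrative assumptions:

import torch
import torch.nn as nn

class ContinuousQ(nn.Module):
    """Sketch: a Q-network that takes both state features and a continuous
    action as input, f(s, a) -> scalar value."""
    def __init__(self, feat_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + action_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, feats, action):
        return self.net(torch.cat([feats, action], dim=-1))

def best_action(q_net, feats, action_dim, steps=100, lr=0.05):
    # Gradient ascent over the action to find a local maximum of Q(s, a).
    action = torch.zeros(feats.shape[0], action_dim, requires_grad=True)
    opt = torch.optim.Adam([action], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-q_net(feats, action).sum()).backward()   # ascend Q by descending -Q
        opt.step()
    return action.detach()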
According to embodiments described herein, the explanation network might not share a representation with the underlying policy learner as shown in
In
Similarly to
The above-described methods can be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in
Unless otherwise indicated, all numbers expressing feature sizes, amounts, and physical properties used in the specification and claims are to be understood as being modified in all instances by the term “about.” Accordingly, unless indicated to the contrary, the numerical parameters set forth in the foregoing specification and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by those skilled in the art utilizing the teachings disclosed herein. The use of numerical ranges by endpoints includes all numbers within that range (e.g. 1 to 5 includes 1, 1.5, 2, 2.75, 3, 3.80, 4, and 5) and any range within that range.
The various embodiments described above may be implemented using circuitry and/or software modules that interact to provide particular results. One of skill in the computing arts can readily implement such described functionality, either at a modular level or as a whole, using knowledge generally known in the art. For example, the flowcharts illustrated herein may be used to create computer-readable instructions/code for execution by a processor. Such instructions may be stored on a computer-readable medium and transferred to the processor for execution as is known in the art.
The foregoing description of the example embodiments has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the inventive concepts to the precise form disclosed. Many modifications and variations are possible in light of the above teachings. Any or all features of the disclosed embodiments can be applied individually or in any combination, and are not meant to be limiting but purely illustrative. It is intended that the scope be limited by the claims appended hereto and not by the detailed description.