DEVICE AND METHOD FOR CONTROLLING A ROBOT

Information

  • Patent Application Publication Number: 20240198518
  • Date Filed: November 13, 2023
  • Date Published: June 20, 2024
Abstract
A method for training a control policy. The method includes estimating the variance of a value function which associates a state with a value of the state or a pair of state and action with a value of the pair by solving a Bellman uncertainty equation, wherein, for each of multiple states, the reward function of the Bellman uncertainty equation is set to the difference between the total uncertainty about the mean of the value of the subsequent state following the state and the average aleatoric uncertainty of the value of the subsequent state, and biasing the control policy in training towards regions for which the estimation gives a higher variance of the value function than for other regions.
Description
CROSS REFERENCE

The present application claims the benefit under 35 U.S.C. § 119 of European Patent Application No. EP 22 21 3403.3 filed on Dec. 14, 2022, which is expressly incorporated herein by reference in its entirety.


FIELD

The present disclosure relates to devices and methods for controlling a robot.


BACKGROUND INFORMATION

Reinforcement Learning (RL) is a machine learning paradigm that allows a machine to learn to perform desired behaviours with respect to a task specification, e.g., which control actions to take to reach a goal location in a robotic navigation scenario. Learning a policy that generates these behaviours with reinforcement learning differs from learning it with supervised learning in the way the training data is composed and obtained: While in supervised learning the provided training data consists of matched pairs of inputs to the policy (e.g., observations like sensory readings) and desired outputs (actions to be taken), there is no fixed training data provided in case of reinforcement learning. The policy is learned from experience data gathered by interaction of the machine with its environment, whereby a feedback (reward) signal is provided to the machine that scores/assesses the actions taken in a certain context (state). To effectively improve the control policy, the respective reinforcement learning agent should explore regions of the space of states of the controlled technical system where epistemic uncertainty is high. Therefore, efficient approaches for determining epistemic uncertainty in the training of a control policy are desirable.


The paper “The Uncertainty Bellman Equation and Exploration” by Brendan O'Donoghue, Ian Osband, Remi Munos, and Volodymyr Mnih, in International Conference on Machine Learning, pages 3836-3845, 2018, in the following referred to as reference 1, describes an uncertainty Bellman equation, which may be seen as a Bellman-style relationship that propagates the uncertainty (variance) of a Bayesian posterior distribution over Q-values across multiple time-steps.


The paper “Deep Model-Based Reinforcement Learning via Estimated Uncertainty and Conservative Policy Optimization” by Qi Zhou, HouQiang Li, and Jie Wang, in AAAI Conference on Artificial Intelligence, volume 34, pages 6941-6948, April 2020, in the following referred to as reference 2, describes solving an uncertainty Bellman equation for getting an upper bound of the value function variance.


SUMMARY

According to various embodiments of the present invention, a method for training a control policy is provided, including estimating the variance of a value function which associates a state with a value of the state or a pair of state and action with a value of the pair by solving a Bellman uncertainty equation, wherein, for each of multiple states, the reward function of the Bellman uncertainty equation is set to the difference between the total uncertainty about the mean of the value of the subsequent state following the state and the average aleatoric uncertainty of the value of the subsequent state, and including biasing the control policy in training towards regions for which the estimation gives a higher variance of the value function than for other regions.


According to various embodiments of the present invention, in other words, an uncertainty Bellman equation is solved for determining uncertainty in reinforcement learning where, in contrast to the approach of reference 2, the reward function is set to the difference between the total uncertainty about the mean values of the next state and the average aleatoric uncertainty. Redefining the reward function in the uncertainty Bellman equation with respect to reference 2 allows getting a tight estimation (rather than merely an upper bound) of the value function variance. Thus, more exact variance estimates can be obtained, which may be used for exploration (e.g. by means of determining the policy by optimistic optimization) to achieve lower total regret and better sample-efficiency in exploration for tabular reinforcement learning (RL) and to increase sample-efficiency and stability during learning for continuous-control tasks.


In particular, biasing the policy in training towards regions with high value function variance, based on this estimation of the value function variance, i.e., preferring, when exploring, states or pairs of states and actions, respectively, with high variance of the value function over those with low variance of the value function as given by this estimation, allows achieving a more efficient training: fewer episodes may be required to achieve the same quality of control, or a better control policy may be found which performs better in practical application.


According to various embodiments of the present invention, the reward function is a local uncertainty reward function. The solution of the uncertainty Bellman equation is the variance of the value function.


In the following, various examples of the present invention are given.


Example 1 is the method as described above.


Example 2 is the method of example 1, wherein the value function is a state value function and the control policy is biased in training towards regions of the state space for which the estimation gives a higher variance of the values of states than for other regions of the state space or wherein the value function is a state-action value function and the control policy is biased in training towards regions of the space of state-action pairs for which the estimation gives a higher variance of the value of pairs of states and actions than for other regions of the space of state-action pairs.


So, the approach may operate with a state value function as well as a state-action value function for selecting actions in training for exploration. The biasing may be done by considering not only the value of a state or state action pair when selecting an action but also its (estimated) value function variance.


Example 3 is the method of example 1 or 2, including setting the uncertainty about the mean of the value of the subsequent state following the state to an estimate of the variance of the mean of the value of the subsequent state and setting the average aleatoric uncertainty to the mean of an estimate of the variance of the value of the subsequent state.


Thus, the uncertainty about the mean of the value of the subsequent state and the average aleatoric uncertainty may be easily determined from data gathered in the training.


Example 4 is the method of any one of examples 1 to 3, wherein estimating the variance of the value function includes selecting one of multiple neural networks, wherein each neural network is trained to output information about a probability distribution of the subsequent state following a state input to the neural network and of a reward obtained from the state transition and determining the value function from outputs of the selected neural network for a sequence of states.


This makes it possible to model uncertainties of the transitions by sampling, from a set of multiple neural networks, a neural network that gives the transition probabilities. In particular, the mean of the variance of a subsequent state may in this way be estimated by sampling from multiple neural networks. The one or more neural networks may be trained during the training from the observed data (i.e. observed transitions).


Example 5 is the method of any one of examples 1 to 4, including solving the Bellman uncertainty equation by means of a neural network trained to predict a solution of the Bellman uncertainty equation in response to the input of a state or pair of state and action value.


For example, in case of the value function being a state-action value function, a neural network may be used that receives as input state-action pairs and outputs the predicted long-term variance of the Q-value for the given input (i.e. pair of state and action).


The variance of the value function may be determined for a certain episode (from data from the episode and earlier episodes) and the control policy may be updated for the next episode using the result of the determination. For a state-action value function, for example, this may include optimizing optimistic estimates of the Q-values, i.e., the Q-values may be enlarged by adding the predicted standard deviation (the square root of the neural network output for solving the uncertainty Bellman equation).


Example 6 is a method for controlling a technical system, including training a control policy according to any one of examples 1 to 5 and controlling the technical system according to the trained control policy.


Example 7 is a controller, configured to perform a method of any one of examples 1 to 5.


Example 8 is a computer program including instructions which, when executed by a computer, make the computer perform a method according to any one of examples 1 to 5.


Example 9 is a computer-readable medium including instructions which, when executed by a computer, make the computer perform a method according to any one of examples 1 to 5.





BRIEF DESCRIPTION OF THE DRAWINGS

In the figures, similar reference characters generally refer to the same parts throughout the different views. The figures are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the present invention. In the following description, various aspects are described with reference to the figures.



FIG. 1 shows a control scenario according to an example embodiment of the present invention.



FIG. 2 illustrates a simple Markov Reward Process for illustration of the above.



FIG. 3 shows a flow diagram illustrating a method for training a control policy according to an example embodiment of the present invention.





DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

The following detailed description refers to the accompanying drawings that show, by way of illustration, specific details and aspects of this disclosure in which the present invention may be practiced. Other aspects may be utilized and structural, logical, and electrical changes may be made without departing from the scope of the present invention. The various aspects of this disclosure are not necessarily mutually exclusive, as some aspects of this disclosure can be combined with one or more other aspects of this disclosure to form new aspects.


In the following, various examples will be described in more detail.



FIG. 1 shows a control scenario.


A robot 100 is located in an environment 101. The robot 100 has a start position 102 and should reach a goal position 103. The environment 101 contains obstacles 104 which should be avoided by the robot 100. For example, they may be impassable for the robot 100 (e.g. walls, trees or rocks) or should be avoided because the robot would damage or hurt them (e.g. pedestrians).


The robot 100 has a controller 105 (which may also be remote to the robot 100, i.e. the robot 100 may be controlled by remote control). In the exemplary scenario of FIG. 1, the goal is that the controller 105 controls the robot 100 to navigate the environment 101 from the start position 102 to the goal position 103. For example, the robot 100 is an autonomous vehicle but it may also be a robot with legs or tracks or other kind of propulsion system (such as a deep sea or mars rover).


Furthermore, embodiments are not limited to the scenario that a robot should be moved (as a whole) between positions 102, 103 but may also be used for the control of a robotic arm whose end-effector should be moved between positions 102, 103 (without hitting obstacles 104) etc.


Accordingly, in the following, terms like robot, vehicle, machine, etc. are used as examples for the "object", i.e. the computer-controlled system (e.g. machine), to be controlled. The approaches described herein can be used with different types of computer-controlled machines like robots or vehicles and others. The general term "robot device" is also used in the following to refer to all kinds of technical systems which may be controlled by the approaches described in the following. The environment may also be simulated, e.g. the control policy may for example be a control policy for a virtual vehicle or other movable device, e.g. in a simulation for testing another policy for autonomous driving.


Ideally, the controller 105 has learned a control policy that allows it to control the robot 100 successfully (from start position 102 to goal position 103 without hitting obstacles 104) for arbitrary scenarios (i.e. environments, start and goal positions), in particular scenarios that the controller 105 has not encountered before.


Various embodiments thus relate to learning a control policy for a specified (distribution of) task(s) by interacting with the environment 101. In training, the scenario (in particular environment 101) may be simulated but it will typically be real in deployment.


An approach to learn a control policy is reinforcement learning (RL) where the robot 100, together with its controller 105, acts as reinforcement agent.


The goal of a reinforcement learning (RL) agent is to maximize the expected return via interactions with an a priori unknown environment. In model-based RL (MBRL), the agent learns a statistical model of the environment 101, which can then be used for efficient exploration. The performance of deep MBRL algorithms was historically lower than that of model-free methods, but the gap has been closing in recent years. Key to these improvements are models that quantify epistemic and aleatoric uncertainty and algorithms that leverage model uncertainty to optimize the policy. Still, a core challenge in MBRL is to quantify the uncertainty in the long-term performance predictions of a policy, given a probabilistic model of the dynamics. Leveraging predictive uncertainty of the policy performance during policy optimization facilitates deep exploration—methods that reason about the long-term information gain of rolling out a policy—which has shown promising results in the model-free and model-based settings.


Various embodiments use a Bayesian perspective on MBRL, which maintains a posterior distribution over possible MDPs given observed data. This distributional perspective of the RL environment 101 induces distributions over functions of interest for solving the RL problem, e.g., the expected return of a policy, also known as the value function. This perspective differs from distributional RL, whose main object of study is the distribution of the return induced by the inherent stochasticity of the MDP and the policy. As such, distributional RL models aleatoric uncertainty, whereas various embodiments focus on the epistemic uncertainty arising from finite data of the underlying MDP, and how it translates to uncertainty about the value function. In particular, the variance of such a distribution of values is taken into account in the training of the control policy.


It should be noted that upper bounds on the posterior variance of the values can be obtained by solving a so-called uncertainty Bellman equation (UBE); see reference 1 and reference 2. However, these bounds over-approximate the variance of the values and thus may lead to inefficient exploration when used for uncertainty-aware optimization (e.g., risk-seeking or risk-averse policies). In principle, tighter uncertainty estimates have the potential to improve data-efficiency.


In the following, an approach is described which is based on the fact that, under two assumptions given below, the posterior variance of the value function obeys a Bellman-style recursion exactly. According to various embodiments, the controller 105 learns to solve this Bellman recursion and integrates it into training (e.g. into an actor-critic framework) as an exploration signal.


In the following, an agent is considered acting in an infinite-horizon MDP 𝓜 = {𝒮, 𝒜, p, ρ, r, γ} with finite state space |𝒮| = S, finite action space |𝒜| = A, unknown transition function p: 𝒮 × 𝒜 → Δ(S) that maps states and actions to the S-dimensional probability simplex, an initial state distribution ρ: 𝒮 → [0,1], a known and bounded reward function r: 𝒮 × 𝒜 → ℝ, and a discount factor γ ∈ [0,1).


As stated above, the agent may correspond to the robot 100 together with its controller, and the state space may include states that comprise configurations of the robot as well as states of the environment (e.g. the position of obstacles or, e.g. in a bin-picking scenario, of objects to be grasped or in general manipulated by the robot).


Although a known reward function is considered, the approach described below can be easily extended to the case where it is learned alongside the transition function. The one-step dynamics p(s′|s,a) denote the probability of going from state s to state s′ after taking action a. In general, the agent selects actions from a stochastic policy π: 𝒮 → Δ(A) that defines the conditional probability distribution π(a|s). At each time step of episode t the agent is in some state s, selects an action a ∼ π(·|s), receives a reward r(s, a), and transitions to a next state s′ ∼ p(·|s,a). The state value function V^{π,p}: 𝒮 → ℝ of a policy π and transition function p is defined as the expected sum of discounted rewards under the MDP dynamics,












V^{\pi,p}(s) = \mathbb{E}_{\tau\sim P}\!\left[\sum_{h=0}^{\infty} \gamma^h\, r(s_h, a_h) \,\middle|\, s_0 = s\right], \qquad (1)

where the expectation is taken over random trajectories τ drawn from the trajectory distribution P(τ) = \prod_{h=0}^{\infty} \pi(a_h \mid s_h)\, p(s_{h+1} \mid s_h, a_h).
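For illustration only (not part of the claimed subject matter), the value function (1) of a small tabular MDP can be obtained by solving the linear Bellman system. The following Python sketch assumes that the transition probabilities, rewards and policy are given as numpy arrays of the indicated shapes:

    import numpy as np

    def policy_evaluation(p, r, pi, gamma):
        """Solve V = r_pi + gamma * P_pi V for a tabular MDP.

        p  : transition probabilities p(s'|s,a), shape (S, A, S)
        r  : rewards r(s,a), shape (S, A)
        pi : policy pi(a|s), shape (S, A)
        """
        S = p.shape[0]
        r_pi = np.einsum("sa,sa->s", pi, r)      # policy-averaged reward per state
        P_pi = np.einsum("sa,sat->st", pi, p)    # state-to-state transition matrix under pi
        return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)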


The training includes performing many such episodes (e.g. 1000 or more), wherein in each episode the agent starts from a starting state (which may vary) and performs actions and reaches new states as described above until it reaches a terminal state (which may also mean that a limit on the number of actions is reached). It is desirable that the agent efficiently explores the state space during these episodes such that it learns to properly react to all kinds of states that it may encounter in practical application. Therefore, for training, the control policy π(·|s) includes a degree of exploration, which means that the agent does not necessarily select the action which gives the highest value (according to its current knowledge) but tries out alternate paths (i.e. it "explores" to possibly find better actions than those it currently knows of). In practical application (i.e. after training), this may be removed, i.e. the agent then selects the action which gives the highest value.


The controller 105 uses a Bayesian setting where the transition function p is a random variable with some known prior distribution Φ_0. The transition data observed up to episode t is denoted as 𝒟_t. The controller 105 updates its belief about the random variable p by applying Bayes' rule to obtain the posterior distribution conditioned on 𝒟_t, which is denoted as Φ_t. The distribution of transition functions naturally induces a distribution over value functions. The main focus in the following explanation is an approach that the controller 105 may use for estimating the variance of the state value function V^{π,p} under the posterior distribution Φ_t, namely 𝕍_{p∼Φ_t}[V^{π,p}(s)]. The controller 105 may then use this quantity for exploring the state space (in particular the environment 101). It should be noted that the estimation of the variance is explained for the state value function V(s) but naturally extends to the case of state-action value functions Q(s, a). Herein, the term "value function" is used to refer to both a "state value function" and a "state-action value function".


The approach for estimating the variance of the state value function is based on the following two assumptions:

    • 1) (Independent transitions). p(s′|x,a) and p(s′|y,a) are independent random variables if x≠y.
    • 2) (Acyclic MDP). The MDP 𝓜 is a directed acyclic graph, i.e., states are not visited more than once in any given episode.


The first assumption holds naturally in the case of discrete state-action spaces with a tabular transition function, where there is no generalization. The second assumption is non-restrictive, as any finite-horizon MDP with cycles can be transformed into an equivalent time-inhomogeneous MDP without cycles by adding a time-step variable h to the state space. Similarly, for infinite-horizon MDPs an effective horizon

H = \frac{1}{1-\gamma}

may be considered to achieve the same. The direct consequence of these assumptions is that the random variables V^{π,p}(s′) and p(s′|s,a) are uncorrelated.


Other quantities considered in the following are the posterior mean transition function starting from the current state-action pair (s, a),












\bar{p}_t(\cdot \mid s, a) = \mathbb{E}_{p\sim\Phi_t}\!\left[p(\cdot \mid s, a)\right], \qquad (2)

and the posterior mean value function, for any s ∈ 𝒮,













\bar{V}^{\pi}_t(s) = \mathbb{E}_{p\sim\Phi_t}\!\left[V^{\pi,p}(s)\right], \qquad (3)

where the subscript t is included to be explicit about the dependency of both quantities on 𝒟_t. Using (2) and (3), the local uncertainty is defined as












w_t(s) = \mathbb{V}_{p\sim\Phi_t}\!\left[\sum_{a, s'} \pi(a \mid s)\, p(s' \mid s, a)\, \bar{V}^{\pi}_t(s')\right], \qquad (4)
and solving the UBE












W^{\pi}_t(s) = \gamma^2 w_t(s) + \gamma^2 \sum_{a, s'} \pi(a \mid s)\, \bar{p}_t(s' \mid s, a)\, W^{\pi}_t(s'), \qquad (5)

gives a unique solution W_t^π which satisfies W_t^π(s) ≥ 𝕍_{p∼Φ_t}[V^{π,p}(s)], i.e., it is an upper bound for the variance of the value function (see reference 2).


In the following, the gap between the upper bound and the variance of the value function is discussed and a UBE is given whose fixed-point solution is equal to the variance of the value function and thus allows a better estimation of the variance of the value function. As stated above, this is explained for the state value function V but may be analogously done for the state-action value function Q.


The values V^{π,p} are the fixed-point solution to the Bellman expectation equation, which relates the value of the current state s with the value of the next state s′. Further, under the above two assumptions, applying the expectation operator to the Bellman recursion results in V̄_t^π(s) = V^{π,p̄_t}(s). The Bellman recursion propagates knowledge about the local rewards r(s,a) over multiple steps, so that the value function encodes the long-term value of states if the policy π is followed. Similarly, a UBE is a recursive formula that propagates a notion of local uncertainty, u_t(s), over multiple steps. The fixed-point solution to the UBE, whose components are denoted as the U-values, encodes the long-term epistemic uncertainty about the values of a given state.


It should be noted that the approach for determining the upper bound (according to (5) and reference 2) differs from the approach given in the following only in the definition of the local uncertainty and results in U-values that upper-bound the posterior variance of the values. However, in the following approach, u_t is defined such that the U-values converge exactly to the variance of the values.


Specifically, it can be shown that, under the two assumptions given above, for any s ∈ 𝒮 and policy π, the posterior variance of the value function, U_t^π(s) = 𝕍_{p∼Φ_t}[V^{π,p}(s)], obeys the uncertainty Bellman equation











U^{\pi}_t(s) = \gamma^2 u_t(s) + \gamma^2 \sum_{a, s'} \pi(a \mid s)\, \bar{p}_t(s' \mid s, a)\, U^{\pi}_t(s'), \qquad (7)
where ut(s) is the local relative uncertainty defined as











u_t(s) = \mathbb{V}_{a, s' \sim \pi, \bar{p}_t}\!\left[\bar{V}^{\pi}_t(s')\right] - \mathbb{E}_{p\sim\Phi_t}\!\left[\mathbb{V}_{a, s' \sim \pi, p}\!\left[V^{\pi,p}(s')\right]\right]. \qquad (8)
One may interpret the U-values from (7) as the associated state-values of an alternate uncertainty MDP, 𝒰_t = {𝒮, 𝒜, p̄_t, ρ, γ²u_t, γ²}, where the agent accumulates the uncertainty rewards γ²u_t and transitions according to the mean dynamics p̄_t.
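As an illustration of this interpretation (a sketch under the tabular assumptions above, not the claimed implementation), a UBE such as (5) or (7) can be solved exactly like a policy evaluation problem in the uncertainty MDP, with reward γ²·u and discount γ²:

    import numpy as np

    def solve_ube(p_bar, pi, u, gamma):
        """Fixed point of U(s) = gamma^2 u(s) + gamma^2 sum_{a,s'} pi(a|s) p_bar(s'|s,a) U(s').

        p_bar : posterior mean dynamics, shape (S, A, S)
        pi    : policy, shape (S, A)
        u     : local uncertainty rewards, shape (S,)  (u_t for (7), w_t for (5))
        """
        S = p_bar.shape[0]
        P_pi = np.einsum("sa,sat->st", pi, p_bar)   # mean state-to-state dynamics under pi
        return np.linalg.solve(np.eye(S) - gamma**2 * P_pi, gamma**2 * u)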


Further, for a connection to the upper bound estimation according to (5), it can be shown that, under the above two assumptions, for any s ∈ 𝒮 and policy π, it holds that u_t(s) = w_t(s) − g_t(s), where g_t(s) = 𝔼_{p∼Φ_t}[𝕍_{a,s′∼π,p}[V^{π,p}(s′)] − 𝕍_{a,s′∼π,p}[V̄_t^π(s′)]].


Furthermore, the gap gt(s) is non-negative, thus ut(s)≤wt(s).


The gap gt(s) can be interpreted as the average difference of aleatoric uncertainty about the next values with respect to the mean values. The gap vanishes only if the epistemic uncertainty goes to zero, or if the MDP and policy are both deterministic.


The above two results can be directly connected via the equality














\underbrace{\mathbb{V}_{a, s' \sim \pi, \bar{p}_t}\!\left[\bar{V}^{\pi}_t(s')\right]}_{\text{total}} = \underbrace{w_t(s)}_{\text{epistemic}} + \underbrace{\mathbb{E}_{p\sim\Phi_t}\!\left[\mathbb{V}_{a, s' \sim \pi, p}\!\left[\bar{V}^{\pi}_t(s')\right]\right]}_{\text{aleatoric}}, \qquad (9)
which gives an interpretation: the uncertainty reward defined in (8) has two components. The first term corresponds to the total uncertainty about the mean values of the next state, which is further decomposed in (9) into an epistemic and an aleatoric component. When the epistemic uncertainty about the MDP vanishes, then w_t(s) → 0 and only the aleatoric component remains.


Similarly, when the MDP and policy are both deterministic, the aleatoric uncertainty vanishes and 𝕍_{a,s′∼π,p̄_t}[V̄_t^π(s′)] = w_t(s). The second term of (8) is the average aleatoric uncertainty about the value of the next state. When there is no epistemic uncertainty, this term is non-zero and exactly equal to the aleatoric term in (9), which means that u_t(s) → 0. Thus, u_t(s) can be interpreted as a relative local uncertainty that subtracts the average aleatoric noise out of the total uncertainty around the mean values. It should be noted that a negative u_t(s) may be allowed as a consequence of highly stochastic dynamics and/or action selection.


It can further be seen that, first, the upper bound (from (5)) is a consequence of the over-approximation of the reward function used to solve the UBE. Second, the gap between the exact reward function u_t(s) and the approximation w_t(s) is fully characterized by g_t(s) and brings interesting insights. In particular, the influence of the gap term depends on the stochasticity of the dynamics and the policy. In the limit, the term vanishes under deterministic transitions and action selection. In this scenario, the upper bound from (5) becomes tight.


So, by solving (7), the controller may determine the exact epistemic uncertainty about the values by considering the inherent aleatoric uncertainty of the MDP and the policy. The controller 105 can thus explore regions of high epistemic uncertainty, where new knowledge can be obtained. In other words, solving (7) allows the controller 105 to disentangle the two sources of uncertainty, which allows effective exploration. If the variance estimate fuses both sources of uncertainty, then the agent could be guided to regions of high uncertainty but with little information to be gained.



FIG. 2 illustrates a simple Markov Reward Process for illustration of the above.


The random variables δ and β indicate epistemic uncertainty about the MRP. State sT is an absorbing (terminal) state.


Assume δ and β to be random variables drawn from discrete uniform distributions δ ∼ Unif({0.7, 0.6}) and β ∼ Unif({0.5, 0.4}). As such, the distribution over possible MRPs is finite and composed of all possible combinations of δ and β.


It should be noted that the example satisfies the above two assumptions. Table 1 gives the results for the uncertainty rewards and solution to the respective UBEs (the results for s1 and s3 are trivially zero).















TABLE 1

States     u(s)      w(s)      Wπ(s)     Uπ(s)
s0         −0.6      5.0       21.3      15.7
s2         25.0      25.0      25.0      25.0
For state s2, the upper bound Wπ is tight and Wπ(s2) = Uπ(s2). In this case, the gap vanishes not because of lack of stochasticity, but rather due to lack of epistemic uncertainty about the next values. Indeed, the MRP is entirely known beyond state s2. On the other hand, for state s0 the gap is non-zero and Wπ overestimates the variance of the value by ~36%. The UBE formulation of (8) prescribes a negative reward to be propagated in order to obtain the correct posterior variance. The U-values converge to the true posterior variance of the values, while Wπ only obtains an upper bound.


The controller 105 may leverage the uncertainty quantification of Q-values for reinforcement learning as follows. It should be noted that the following is explained for the state-action value function Q rather than the state value function used in (8). However, as stated above, the above results hold analogously for the state-action value function Q by using











U^{\pi}_t(s, a) = \gamma^2 u_t(s, a) + \gamma^2 \sum_{a', s'} \pi(a' \mid s')\, \bar{p}_t(s' \mid s, a)\, U^{\pi}_t(s', a') \qquad (10)

and
and




u_t(s, a) = \mathbb{V}_{a', s' \sim \pi, \bar{p}_t}\!\left[\bar{Q}^{\pi}_t(s', a')\right] - \mathbb{E}_{p\sim\Phi_t}\!\left[\mathbb{V}_{a', s' \sim \pi, p}\!\left[Q^{\pi,p}(s', a')\right]\right] \qquad (11)

instead of (7) and (8), respectively.


A general setting with unknown rewards is assumed, and Γ_t denotes the posterior distribution over MDPs, from which the controller 105 can sample both reward and transition functions. Let Û_t^π be an estimate of the posterior variance over Q-values for some policy π at episode t (in particular, Û_t^π may be the solution of (10) which the controller 105 determines). Then, the controller 105 determines the policy (for training) by solving the upper-confidence bound (UCB) optimization problem











\pi_t = \arg\max_{\pi}\; \bar{Q}^{\pi}_t + \lambda \sqrt{\hat{U}^{\pi}_t}, \qquad (12)

where Q̄_t^π is the posterior mean value function and λ is a parameter that trades off exploration and exploitation.
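For the tabular case, a minimal sketch (illustrative only, not the only possible instantiation) of one way to act on the objective (12) is to be greedy with respect to the upper-confidence values; Q_bar and U_hat are assumed to be arrays of shape (S, A) holding the posterior mean Q-values and the variance estimate:

    import numpy as np

    def optimistic_greedy_policy(Q_bar, U_hat, lam):
        """Deterministic policy that is greedy w.r.t. Q_bar + lam * sqrt(U_hat)."""
        ucb = Q_bar + lam * np.sqrt(np.maximum(U_hat, 0.0))   # clip so the square root is defined
        S, A = Q_bar.shape
        pi = np.zeros((S, A))
        pi[np.arange(S), ucb.argmax(axis=1)] = 1.0
        return pi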


After training, the controller 105 may simply use π=argmaxπQTπ where QTπ is the final state-action value function the controller 105 obtains from training (i.e. from the last episode).


According to various embodiments, the controller 105 uses algorithm 1 to estimate Qtπ and Ûtπ.












Algorithm 1 Model-based Q-variance estimation

1: Input: Posterior MDP Γ_t, policy π.
2: {p_i, r_i}_{i=1}^N ← sample_mdp(Γ_t)
3: Q̄_t^π, {Q_i}_{i=1}^N ← solve_bellman({p_i, r_i}_{i=1}^N, π)
4: Û_t^π ← qvariance({p_i, r_i, Q_i}_{i=1}^N, Q̄_t^π, π)









So, to estimate the posterior variance of the Q-values, algorithm 1 takes as input: (i) a posterior distribution over plausible MDPs Γt and (ii) some current version of the policy π that is evaluated. Then algorithm 1 proceeds as follows:

    • 1. Sample a finite number of MDPs from Γ_t.
    • 2. Solve the Bellman expectation equation for each sampled MDP and compute the mean Q̄_t^π.
    • 3. Using all the above information, approximately solve the UBE (10) to obtain the variance estimate.


This variance estimate can then be plugged into a standard policy iteration method to solve the general optimistic RL problem of (12).


The controller 105 samples an ensemble of N MDPs from the current posterior Γ_t in line 2 of algorithm 1 and uses it to solve the Bellman expectation equation in line 3, resulting in an ensemble of N Q-functions and the posterior mean Q̄_t^π. Lastly, it estimates Û_t^π in line 4 via a generic variance estimation method qvariance, for which it may use one of three implementations (a sketch follows this list):

    • ensemble-var computes a sample-based approximation of the variance given by 𝕍[Q_i];
    • pombu uses the solution to the UBE (5); and
    • exact-ube uses the solution to the UBE (10) according to the approach described above.
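The following sketch (hypothetical helper names, tabular setting, not the only possible implementation) illustrates the flow of algorithm 1 with the ensemble-var estimator; the UBE-based estimators would instead compute local uncertainty rewards and call a UBE solver such as the one sketched further above:

    import numpy as np

    def q_values(p, r, pi, gamma):
        """Tabular Q-values of pi for one sampled MDP (p, r)."""
        S = p.shape[0]
        P_pi = np.einsum("sa,sat->st", pi, p)
        r_pi = np.einsum("sa,sa->s", pi, r)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        return r + gamma * np.einsum("sat,t->sa", p, V)

    def q_variance_estimation(sampled_mdps, pi, gamma):
        """Lines 2-4 of algorithm 1 with the ensemble-var estimator."""
        Qs = np.stack([q_values(p_i, r_i, pi, gamma) for p_i, r_i in sampled_mdps])
        Q_bar = Qs.mean(axis=0)   # posterior mean Q-values
        U_hat = Qs.var(axis=0)    # sample-based variance; exact-ube would solve the UBE (10) instead
        return Q_bar, U_hat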


The controller 105 may alternate between these three estimation approaches but, according to various embodiments, uses exact-ube for at least some episodes.


In practice, typical RL techniques for model learning may break the theoretical assumptions. For tabular implementations, flat prior choices like a Dirichlet distribution violate the second assumption while function approximation introduces correlations between states and thus violates the first assumption. A challenge arises in this practical setting: exact-ube may result in negative U-values, as a combination of (i) the assumptions not holding and (ii) the possibility of negative uncertainty rewards. While (i) cannot be easily resolved, the controller 105 may use a practical upper-bound on the solution of (7) or (10), respectively, such that the resulting U-values are non-negative and hence interpretable as variance estimates. Specifically, clipped uncertainty rewards ũt=max(umin,ut(s)) with corresponding U-values Ũtπ may be used. It can be seen that, if umin=0, then Wtπ(s)≥Ũtπ(s)≥Utπ(s), which means that using Ũtπ still results in a tighter upper-bound on the variance than Wtπ, while preventing non-positive solutions to the UBE.


In the following, this notation is dropped and it is assumed all U-values are computed from clipped uncertainty rewards. It should be noted that pombu does not have this problem, since wt(s) is already non-negative.


The controller 105 may use a Dirichlet prior on the transition function and a standard Normal prior for the rewards, which leads to closed-form posterior updates. After sampling N times from the MDP posterior (line 2 of algorithm 1), it obtains the Q-functions (line 3) in closed form by solving the corresponding Bellman equation. For the UBE-based approaches, the controller 105 may estimate the uncertainty rewards via approximations of the expectations and variances therein. Lastly, the controller 105 may solve (12) via policy iteration until convergence is achieved or until a maximum number of steps is reached.
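A minimal sketch (illustrative, with assumed variable names) of the Dirichlet posterior over the transition function mentioned above: observed transitions are accumulated as counts, the posterior mean gives p̄_t, and independent Dirichlet draws per state-action pair yield the sampled MDPs of line 2 of algorithm 1. The Normal posterior over rewards is handled analogously and omitted here.

    import numpy as np

    class DirichletDynamicsPosterior:
        """Per-(s, a) Dirichlet posterior over next-state probabilities."""

        def __init__(self, num_states, num_actions, prior_concentration=1.0):
            self.alpha = np.full((num_states, num_actions, num_states), prior_concentration)

        def update(self, s, a, s_next):
            self.alpha[s, a, s_next] += 1.0   # closed-form posterior update (count increment)

        def mean(self):
            return self.alpha / self.alpha.sum(axis=-1, keepdims=True)   # posterior mean dynamics

        def sample(self):
            """Draw one plausible transition function p ~ Phi_t."""
            S, A, _ = self.alpha.shape
            p = np.empty_like(self.alpha)
            for s in range(S):
                for a in range(A):
                    p[s, a] = np.random.dirichlet(self.alpha[s, a])
            return p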


According to various embodiments, a deep RL implementation is used, i.e. the controller 105 implements a deep RL architecture to scale for a continuous state-action space. In that case, approximating the sum of cumulative uncertainty rewards allows for uncertainty propagation.


For this, MBPO (Model-Based Policy Optimization) may be used as a baseline. In contrast to the tabular implementation, maintaining an explicit distribution over MDPs from which the controller 105 samples is intractable. Instead, the controller 105 considers Γ_t to be a discrete uniform distribution over N probabilistic neural networks, denoted p_θ, that output the mean and covariance of a Gaussian distribution over next states and rewards. In this case, line 2 of algorithm 1 amounts to sampling from the set of neural networks.


MBPO trains Q-functions represented as neural networks via TD-learning on data generated via model-randomized k-step rollouts from initial states sampled from 𝒟_t. Each forward prediction of the rollout comes from a randomly selected model of the ensemble and the transitions are stored in a single replay buffer 𝒟_model, which is then fed into a model-free optimizer like soft actor-critic (SAC). SAC trains a stochastic policy represented as a neural network with parameters ϕ, denoted π_ϕ. The policy's objective function is similar to (12) but with entropy regularization instead of the uncertainty term. In practice, the argmax is replaced by G steps of stochastic gradient ascent, where the policy gradient is estimated via mini-batches drawn from 𝒟_model. The present approach requires a few modifications of the MBPO methodology. To implement line 3 of algorithm 1, in addition to 𝒟_model, the controller 105 maintains N buffers {𝒟_model^i}_{i=1}^N filled with model-consistent rollouts, where each k-step rollout is generated under a single model of the ensemble, starting from initial states sampled from 𝒟_t. The controller 105 trains an ensemble of N value functions {Q_i}_{i=1}^N, parameterized by {ψ_i}_{i=1}^N, and minimizes the residual Bellman error with entropy regularization
















\mathcal{L}(\psi_i) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}^i_{\text{model}}}\!\left[\left(y_i - Q_i(s, a; \psi_i)\right)^2\right], \qquad (13)

where y_i = r + γ(Q_i(s′, a′; ψ̄_i) − α log π_ϕ(a′|s′)) and ψ̄_i are the target network parameters, updated via Polyak averaging for stability during training. The mean Q-values, Q̄_t^π, are estimated as the average value of the Q-ensemble.
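For illustration, a sketch of the target in (13) for a single ensemble member, in PyTorch-style Python; q_net, q_target, policy and alpha are assumed placeholders (e.g., policy.sample is assumed to return an action and its log-probability), not names prescribed by the method:

    import torch
    import torch.nn.functional as F

    def critic_loss(q_net, q_target, policy, batch, gamma, alpha):
        """Residual Bellman error with entropy regularization, cf. equation (13)."""
        s, a, r, s_next = batch                       # mini-batch from the i-th model-consistent buffer
        with torch.no_grad():
            a_next, log_prob = policy.sample(s_next)  # a' ~ pi_phi(.|s') and log pi_phi(a'|s')
            y = r + gamma * (q_target(s_next, a_next) - alpha * log_prob)
        return F.mse_loss(q_net(s, a), y)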


Further, according to various embodiments, to approximate the solution to the UBE, the controller 105 trains a neural network parameterized by a vector φ, denoted Uφ (informally, the U-net). The controller 105 biases the output of the U-net to be positive (and thus interpretable as a variance) by using a softplus output layer. The controller 105 carries out training by minimizing the uncertainty Bellman error:













\mathcal{L}(\varphi) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}_{\text{model}}}\!\left[\left(z - U(s, a; \varphi)\right)^2\right], \qquad (14)

with targets z = γ²u(s, a) + γ²U(s′, a′; φ̄) and target parameters φ̄ updated as for regular critics. Lastly, the controller optimizes π_ϕ as in MBPO via SGD (stochastic gradient descent) on the SAC policy loss, but with the uncertainty term from (12) added. Algorithm 2 gives a detailed example of this approach.
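Analogously, a sketch (illustrative names and interfaces) of the uncertainty Bellman error (14); u_reward stands for an estimate of the local uncertainty reward u(s, a) of (11) computed from the Q-ensemble for the mini-batch, and u_target is the target copy of the U-net:

    import torch
    import torch.nn.functional as F

    def u_net_loss(u_net, u_target, policy, batch, u_reward, gamma):
        """Uncertainty Bellman error, cf. equation (14)."""
        s, a, _, s_next = batch
        with torch.no_grad():
            a_next, _ = policy.sample(s_next)
            z = gamma**2 * u_reward + gamma**2 * u_target(s_next, a_next)   # UBE target
        return F.mse_loss(u_net(s, a), z)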












Algorithm 2 MBPO-style optimistic learning

1: Initialize policy π_ϕ, predictive model p_θ, critic ensemble {Q_i}_{i=1}^N, uncertainty net U_φ (optional), environment dataset 𝒟_t, model datasets 𝒟_model and {𝒟_model^i}_{i=1}^N
2: global step ← 0
3: for episode t = 0, . . . , T − 1 do
4:   for E steps do
5:     if global step % F == 0 then
6:       Train model p_θ on 𝒟_t via maximum likelihood
7:       for M model rollouts do
8:         Perform k-step model rollouts starting from s ∼ 𝒟_t; add to 𝒟_model and {𝒟_model^i}_{i=1}^N
9:     Take action in environment according to π_ϕ; add to 𝒟_t
10:    for G gradient updates do
11:      Update {Q_i}_{i=1}^N with mini-batches from {𝒟_model^i}_{i=1}^N via SGD on (13)
12:      (Optional) Update U_φ with mini-batches from 𝒟_model via SGD on (14)
13:      Update π_ϕ with mini-batches from 𝒟_model via stochastic gradient ascent on the optimistic values of (12)
14:   global step ← global step + 1









In particular,

    • In line 8, the controller 105 performs a total of N+1 k-step rollouts corresponding to both the model-randomized and model-consistent rollout modalities.
    • In line 11, the controller 105 updates the ensemble of Q-functions on the corresponding model-consistent buffer. MBPO trains twin critics (as in SAC) on mini-batches from 𝒟_model.
    • In line 12, the controller 105 updates the U-net for the UBE-based variance estimation methods.
    • In line 13, the controller 105 updates πϕ by maximizing the optimistic Q-values. MBPO maximizes the minimum of the twin critics (as in SAC). Both approaches include an entropy maximization term.


Table 2 gives the values of hyperparameters for an example application.











TABLE 2

Hyperparameter                                Value
T - # episodes                                75
E - # steps per episode                       400
G - policy updates per step                   20
M - # model rollouts per step                 400
F - frequency of model retraining (# steps)   250
retain updates                                1
N - ensemble size                             5
λ - exploration gain                          1
k - rollout length                            30
Model network                                 MLP with 4 layers of size 200, SiLU nonlinearities
Policy network                                MLP with 2 layers of size 64, tanh nonlinearities
Q and U networks                              MLP with 2 layers of size 256, tanh nonlinearities

It should be noted that in tabular RL, exact-ube solves N+2 Bellman equations, pombu solves two and ensemble-var solves N+1. In deep RL (such as algorithm 2), UBE-based approaches have the added complexity of training the neural network for solving the uncertainty Bellman equation, but it can be parallelized with the Q-ensemble training. Despite the increased complexity, the UBE-based approach performs well for small N, reducing the computational burden.


In summary, according to various embodiments, a method is provided as illustrated in FIG. 3.



FIG. 3 shows a flow diagram 300 illustrating a method for training a control policy according to an embodiment.


In 301, the variance of a value function which associates a state with a value of the state or a pair of state and action with a value of the pair is estimated. This is done by solving a Bellman uncertainty equation, wherein, for each of multiple states, the reward function of the Bellman uncertainty equation is set to the difference of the total uncertainty about the mean of the value of the subsequent state following the state and the average aleatoric uncertainty of the value of the subsequent state.


In 302, the control policy is biased in training towards regions for which the estimation gives a higher variance of the value function than for other regions.


The approach of FIG. 3 can be used to compute a control signal for controlling a technical system, e.g. a computer-controlled machine such as a robot, a vehicle, a domestic appliance, a power tool, a manufacturing machine, a personal assistant or an access control system. According to various embodiments, a policy for controlling the technical system may be learnt (by reinforcement learning) and then the technical system may be operated according to the learned (i.e. trained) policy.


The uncertainty Bellman equation can be seen as a Bellman equation that is applied to the variance of a value function (rather than the value function itself), i.e., it is a recursive equation (in the form of a Bellman equation) for the variance of a value function. The value function is replaced by the variance of the value function and the reward function is replaced by (or set to) the difference of the total uncertainty about the mean of the value of the subsequent state following the state and the average aleatoric uncertainty of the value of the subsequent state (see equations (8) and (11) for concrete examples of the reward and equations (7) and (10) for the resulting Bellman equation, for the case of the state value function and the state-action value function, respectively).


The approach of FIG. 3 may be used within the context of model-based reinforcement learning (MBRL), where the objective is to train a model of the environment and use it for optimizing a policy, whose objective is to maximize some notion of reward. It allows quantifying uncertainty in this context and leveraging the uncertainty representation to guide the policy training process, in order to effectively explore the environment and improve sample efficiency. This is useful in practice because in many applications (especially those involving physical systems) gathering data is expensive, and it is desirable to minimize the amount of interaction with the controlled system (including the environment) needed for training a performant policy.


Various embodiments may receive and use image data (i.e. digital images) from various visual sensors (cameras) such as video, radar, LiDAR, ultrasonic, thermal imaging, motion, sonar etc., for obtaining a discrete or continuous signal that provides information about the environment, for example to obtain observations about the states and/or rewards.


According to one embodiment, the method is computer-implemented.


Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific embodiments shown and described without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific embodiments discussed herein.

Claims
  • 1. A method for training a control policy, comprising the following steps: estimating a variance of a value function which associates: (i) a state with a value of the state, or (ii) a pair of state and action with a value of the pair, by solving a Bellman uncertainty equation, wherein, for each of multiple states, a reward function of the Bellman uncertainty equation is set to a difference of a total uncertainty about a mean of a value of a subsequent state following the state and an average aleatoric uncertainty of the value of the subsequent state; and biasing the control policy in training towards regions for which the estimation gives a higher variance of the value function than for other regions.
  • 2. The method of claim 1, wherein: (i) the value function is a state value function, and the control policy is biased in training towards regions of a state space for which the estimation gives a higher variance of values of states than for other regions of the state space, or (ii) the value function is a state-action value function and the control policy is biased in training towards regions of a space of state-action pairs for which the estimation gives a higher variance of a value of pairs of states and actions than for other regions of the space of state-action pairs.
  • 3. The method of claim 1, further comprising: setting an uncertainty about the mean of the value of the subsequent state following the state to an estimate of the variance of the mean of the value of the subsequent state, and setting the average aleatoric uncertainty to the mean of an estimate of the variance of the value of the subsequent state.
  • 4. The method of claim 1, wherein the estimating of the variance of the value function includes selecting one of multiple neural networks, wherein each of the neural networks is trained to output information about a probability distribution of a subsequent state following a state input to the neural network and of a reward obtained from a state transition and determining the value function from outputs of the selected neural network for a sequence of states.
  • 5. The method of claim 1, further comprising: solving the Bellman uncertainty equation using a neural network trained to predict a solution of the Bellman uncertainty equation in response to an input of a state or pair of state and action value.
  • 6. A method for controlling a technical system, comprising the following steps: training a control policy including: estimating a variance of a value function which associates: (i) a state with a value of the state, or (ii) a pair of state and action with a value of the pair, by solving a Bellman uncertainty equation, wherein, for each of multiple states, a reward function of the Bellman uncertainty equation is set to a difference of a total uncertainty about a mean of a value of a subsequent state following the state and an average aleatoric uncertainty of the value of the subsequent state, and biasing the control policy in training towards regions for which the estimation gives a higher variance of the value function than for other regions; and controlling the technical system according to the trained control policy.
  • 7. A controller configured to train a control policy, the controller configured to: estimate a variance of a value function which associates: (i) a state with a value of the state, or (ii) a pair of state and action with a value of the pair, by solving a Bellman uncertainty equation, wherein, for each of multiple states, a reward function of the Bellman uncertainty equation is set to a difference of a total uncertainty about a mean of a value of a subsequent state following the state and an average aleatoric uncertainty of the value of the subsequent state; and bias the control policy in training towards regions for which the estimation gives a higher variance of the value function than for other regions.
  • 8. A non-transitory computer-readable medium on which are stored instructions for training a control policy, the instructions, when executed by a computer, causing the computer to perform the following steps: estimating a variance of a value function which associates: (i) a state with a value of the state, or (ii) a pair of state and action with a value of the pair, by solving a Bellman uncertainty equation, wherein, for each of multiple states, a reward function of the Bellman uncertainty equation is set to a difference of a total uncertainty about a mean of a value of a subsequent state following the state and an average aleatoric uncertainty of the value of the subsequent state; and biasing the control policy in training towards regions for which the estimation gives a higher variance of the value function than for other regions.
Priority Claims (1)

Number          Date        Country   Kind
22 21 3403.3    Dec 2022    EP        regional