J. Ho and S. Ermon: “Generative adversarial imitation learning”, in D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pp. 4565-4573. Curran Associates, Inc., 2016, describe aspects of generative adversarial imitation learning (GAIL). GAIL is a method for training a strategy that is intended to imitate expert behavior.
A. Doerr, C. Daniel, D. Nguyen-Tuong, A. Marco, S. Schaal, M. Toussaint, and S. Trimpe: “Optimizing long-term predictions for model-based policy search”, in S. Levine, V. Vanhoucke, and K. Goldberg, editors, Proceedings of the 1st Annual Conference on Robot Learning, Vol. 78 of Proceedings of Machine Learning Research, pp. 227-238. PMLR, 13-15 Nov. 2017, describe aspects of long-term predictions for model-based learning of such strategies.
Kamyar Azizzadenesheli et al.: “Sample-Efficient Deep RL with Generative Adversarial Tree Search,” arXiv.org, Cornell University Library, Jun. 15, 2018, describe a generative adversarial tree search (GATS) algorithm, a sample-efficient deep reinforcement learning (DRL) method.
Ted Xiao et al.: “Generative Adversarial Networks for Model Based Reinforcement Learning with Tree Search,” Jan. 1, 2016, describe a generative adversarial network for learning a dynamics model that is then used for an online tree search. They also propose a model-based reinforcement learning framework that combines video prediction with GANs and online tree search methods.
Xinshi Chen et al.: “Generative Adversarial User Model for Reinforcement Learning Based Recommendation System,” arXiv.org, Cornell University Library, Dec. 27, 2018, describe a model-based reinforcement learning framework for recommendation systems in which a generative adversarial network is used to imitate the dynamics of user behavior and to learn the associated reward function.
Nir Baram et al.: “End-to-End Differentiable Adversarial Imitation Learning,” Proceedings of Machine Learning Research, Vol. 70, Aug. 6, 2017, pp. 390-399, describe a model-based generative adversarial imitation learning (MGAIL) algorithm.
Vaibhav Saxena et al.: “Dyna-AIL: Adversarial Imitation Learning by Planning,” arXiv.org, Cornell University Library, Mar. 8, 2019, describe a differentiable algorithm for adversarial imitation learning in a Dyna-like framework, in order to switch between model-based planning and model-free learning from expert data.
It is desirable to further improve aspects of these procedures.
This is achieved by way of the methods and apparatuses in accordance with example embodiments of the present invention.
In accordance with an example embodiment of the present invention, a computer-implemented method for training a parametric model of an environment, in particular a deep neural network, provides that the model determines a new model state depending on a model state, on an action, and on a reward; the reward being determined depending on an expert trajectory and on a model trajectory determined, in particular in accordance with a strategy, depending on the model state; and at least one parameter of the model being determined depending on the reward. As a result, the model learns a long-term behavior that matches the true system behavior of the modeled system particularly well.
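By way of illustration only, such a parametric model can be sketched, for example, as a small feed-forward network pθ that maps a model state and an action onto a distribution over the new model state; the class name, the layer sizes, and the Gaussian output head below are assumptions chosen for this sketch, not a prescribed implementation:

import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    # Parametric model p_theta(s' | s, a): predicts mean and log standard
    # deviation of a Gaussian over the new model state s'.
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        self.mean = nn.Linear(hidden_dim, state_dim)
        self.log_std = nn.Linear(hidden_dim, state_dim)

    def forward(self, s, a):
        h = self.net(torch.cat([s, a], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)

    def sample(self, s, a):
        # Draw a new model state s' and return its log-probability, which is
        # needed later for the policy-gradient update of the model.
        mean, log_std = self.forward(s, a)
        dist = torch.distributions.Normal(mean, log_std.exp())
        s_next = dist.rsample()
        return s_next, dist.log_prob(s_next).sum(-1)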
In accordance with an example embodiment of the present invention, provision is preferably made that a discriminator determines the reward depending on the expert trajectory and on the model trajectory; at least one parameter of the discriminator being determined with a gradient descent method depending on the expert trajectory and on the model trajectory. In a rollout, the expert trajectory is used as a reference; a generator, i.e., a specification device, determines, in accordance with any strategy and depending on the model state, the model trajectory for comparison with the reference. The discriminator is parameterized with the gradient descent method. This enables the discriminator to be parameterized in a first step independently of the training of the model.
The at least one parameter of the model is preferably learned, depending on the reward, with an episode-based policy search or a policy gradient method, in particular REINFORCE or TRPO. This allows the model to be trained in a second step independently of the training of the discriminator. Preferably, firstly the discriminator is trained and then the reward determined by the discriminator is used to train the model. These steps are preferably repeated alternatingly.
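The alternation between discriminator training and model training can be sketched, for example, as follows; collect_model_trajectories, discriminator_step, discriminator_reward, and model_step are hypothetical helper names standing in for the steps described above, not functions defined by the method itself:

def train_alternating(model, discriminator, strategy, expert_trajectories, num_iterations):
    # Alternating training: the discriminator is first updated on expert and
    # model trajectories; the reward log(D) derived from it is then used to
    # update the model, and both steps are repeated.
    for _ in range(num_iterations):
        model_trajectories = collect_model_trajectories(model, strategy)            # rollouts of (s_M, a, s'_M)
        discriminator_step(discriminator, expert_trajectories, model_trajectories)  # gradient descent for D_w
        rewards = [discriminator_reward(discriminator, tau) for tau in model_trajectories]
        model_step(model, model_trajectories, rewards)                              # e.g., REINFORCE or episode-based search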
The reward is preferably determined depending on a true expected value for a system dynamic of the environment, and depending on a modeled expected value for the model. The expected values represent an approximation, based on the training data points, of the actual system dynamic; this enables the training to be calculated more efficiently.
The expert trajectory is preferably determined in particular depending on a demonstration; an expert action, which an expert specifies in an environment state, in particular in accordance with an expert strategy, being acquired; the environment being converted by the expert action, with a probability, into a new environment state; and the environment state, the expert action, and the new environment state being determined as a data point of the expert trajectory. Supervised learning can thereby be implemented particularly efficiently.
In accordance with an example embodiment of the present invention, preferably, the action that is specified depending on a strategy is acquired in a model state; the model being converted by the action, with a probability, into the new model state; the reward being determined depending on the model state, on the action, and on the new model state. This allows the discriminator to be trained with model trajectories that the generator determines depending on the model. The model changes during training. The model trajectories of the generator therefore change even if the strategy that the generator uses is unchanged. This allows the discriminator to be adapted to the changed model. Training thus becomes more effective overall.
In accordance with an example embodiment of the present invention, a parametric model of an environment that encompasses a controlled system is preferably trained depending on the controlled system; at least one state variable or manipulated variable for applying control to the controlled system being determined depending on the model and depending on at least one acquired actual variable or observed state variable of the controlled system. The model can be used in untrained or partly trained fashion depending on the controlled system. Conventional methods for open- or closed-loop control can thereby be significantly improved, especially in terms of learning the model.
Preferably an action is determined, in particular by way of an agent, depending on a model state of the model, in accordance with a strategy; a reward being determined depending on the strategy, on the action, or on a new model state; the strategy being learned, depending on the reward, in a reinforcement learning process. A strategy can thereby also be learned more efficiently. The purpose of the agent is to maximize the reward in the reinforcement learning process. The reward is determined, for example, depending on an indication of the conformity of the strategy with a specified reference strategy, or the conformity of a model behavior with an actual behavior of the environment or with a reference model behavior.
In accordance with an example embodiment of the present invention, a computer-implemented method for applying control to a robot provides that in accordance with the method set forth above, the parametric model of the environment is trained, the strategy is learned, and control is applied to the robot depending on the parametric model and on the strategy. This means that control is applied to the robot in such a way that it imitates a human behavior. The strategy for this is learned. To learn the strategy, the model, i.e., an environment model that is likewise learned, is used.
In accordance with an example embodiment, an apparatus for applying control to a robot is embodied to execute the computer-implemented method for applying control to the robot. This apparatus is embodied to learn an environment model and a strategy with which human behavior can be imitated.
Further advantageous embodiments of the present invention are evident from the description below and from the figures.
The apparatus for applying control to the robot encompasses at least one memory for instructions, and at least one processor for executing the instructions. Execution of the instructions results in a determination of an action for the robot depending on a strategy for applying control to the robot. The apparatus encompasses, for example, at least one control application device for applying control to the robot depending on the action.
The robot can be an at least partly autonomous vehicle that performs the action.
The apparatus is embodied to acquire an expert trajectory τE for an environment 102. Expert trajectory τE encompasses several triplets (sU,aE,s′U) that together yield expert trajectory τE=(sE0,aE0,sE1,aE1, . . . , sET). The apparatus encompasses a first specification device 104 that is embodied to determine, depending on an expert strategy πE(aE|sU), an expert action aE for an environment state sU of environment 102. Expert action aE causes environment 102 to be converted, with a probability p(s′U|aE,sU), into a new environment state s′U. Specification device 104 can have a human-machine interface that is embodied to output state sU of environment 102 to an expert, and to acquire expert action aE depending on an input of the expert.
In the example, the apparatus is embodied to determine expert trajectory τE depending on a demonstration. During the demonstration, an expert action aE that the expert specifies in an environment state sU is acquired. Environment state sU, expert action aE, and new environment state s′U are determined as a data point of expert trajectory τE. The apparatus is embodied, for example, to repeat these steps in order to acquire data points for expert trajectory τE until the demonstration has ended. In the example, a first memory 106 for expert trajectory τE is provided.
Specification device 204 is embodied to acquire, during the training of model 202, a model trajectory τ constituting a sequence of several (sM,a,s′M) triplets that together yield model trajectory τ=(sM0,aM0,sM1,aM1, . . . ,sMT). The apparatus is embodied, for example, to repeat these steps in order to acquire data points for model trajectory τ until training of the model has ended. In the example, a second memory 206 for model trajectories τ is provided.
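By way of illustration only, a trajectory of either kind can be represented, for example, as a simple sequence of (state, action, new state) triplets; the class below is one possible sketch of such a representation, not a prescribed data structure:

from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class Trajectory:
    # A trajectory is a sequence of (state, action, new_state) triplets,
    # e.g., (s_U, a_E, s'_U) for the expert or (s_M, a, s'_M) for the model.
    triplets: List[Tuple[np.ndarray, np.ndarray, np.ndarray]] = field(default_factory=list)

    def add(self, state, action, new_state):
        self.triplets.append((state, action, new_state))

expert_memory = []   # corresponds to first memory 106 for expert trajectories
model_memory = []    # corresponds to second memory 206 for model trajectories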
The apparatus encompasses a discriminator 208 that is embodied to determine reward r depending on an expert trajectory τE and on model trajectory τ.
The apparatus encompasses a training device 210 that is embodied to determine at least one parameter w of discriminator 208, with a gradient descent method, depending on an expert trajectory τE and on model trajectory τ.
Training device 210 is embodied to learn at least one parameter θ of model 202, depending on reward r, with an episode-based policy search or with a policy gradient method, in particular REINFORCE or TRPO.
Training device 210 is embodied to determine reward r depending on a true expected value p for a system dynamic of environment 102 and depending on a modeled expected value pθ for model 202.
In one aspect, a device for model-based open- or closed-loop control is also provided.
A device for model-based learning of a strategy is schematically depicted in
In a step 502, expert trajectory τE is determined in particular depending on a demonstration.
In the example, expert action aE that the expert specifies in an environment state sU is acquired. Environment state sU, expert action aE, and new environment state s′U are determined, for example, as a data point of expert trajectory τE. This step is repeated, for example, until the demonstration has ended.
A step 504 is then executed. In step 504, discriminator 208 and model 202 are initialized, for example using random values for the respective parameters.
A step 506 is then executed. In step 506, a new model trajectory is generated using the model parameters of model 202. For example, new model states s′M are determined by model 202, depending on the respective model state sM and on the respective action a, as data points of model trajectory τ.
A step 508 is then executed. In step 508, reward r is determined depending on expert trajectory τE and on model trajectory τ. In the example, reward r is determined for the generated model trajectory τ. Discriminator 208 is used for this.
A step 510 is then executed. In step 510, model 202 is trained with the aid of reward r or with an accumulated reward R. In this context, at least one parameter θ of model 202 is learned, for example, with an episode-based policy search or with a policy gradient method, in particular REINFORCE or TRPO.
A step 512 is then executed. In step 512, discriminator 208 is trained with the aid of expert trajectories and model trajectories. In the example, at least one parameter w of discriminator 208 is determined with a gradient descent method, depending on expert trajectory τE and on model trajectory τ.
In the example, the steps are repeated. The accumulated reward R is determined from the rewards r that were determined for the various model trajectories.
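A minimal sketch of one way to accumulate the rewards r of a model trajectory into R, assuming a discounted sum with a discount factor γ; the accumulation actually used can differ:

def accumulated_reward(rewards, gamma=0.99):
    # Discounted sum R = sum_t gamma^t * r_t over one model trajectory.
    R = 0.0
    for t, r in enumerate(rewards):
        R += (gamma ** t) * r
    return R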
The initially unknown reward r is determined proceeding from a Markov decision process M=(S,A,p,rπ,μ0,γ); it is assumed that the expert trajectories were generated in this decision process M. The reward is, for example, a binary value, i.e., true or false. The purpose of the subsequent optimization is to learn a function pθ≈p, where p(s′U|aE,sU) indicates the actual system behavior of the environment.
The Markov decision process below, with an unknown reward r, is used for this:
M̃=(S̃,Ã,p̃π,r̃,μ̃0,γ̃)
where
action space Ã=S,
state space S̃={(s,a), s∈S, a∈A},
initial distribution μ̃0(s̃0)=μ̃0(s0,a0)=μ0(s0)πE(a0|s0), and
dynamic transition probability p̃π.
If a GAIL method is used in this decision process, reward r is determinable and model pθ is learnable.
Reward r is determined, for example, depending on a true expected value p for a system dynamic of environment 102 and depending on a modeled expected value pθ for model 202.
For the gradient descent method, rollouts are provided, for example, using any strategy π, and the gradient

∇w(E_p[log(1−Dw(sU,a,s′U))]+E_pθ[log(Dw(sM,a,s′M))])

is used; the first expectation being taken over expert transitions under the true system dynamic p, and the second over model transitions under model pθ, where

Dw=discriminator having at least one parameter w.
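By way of illustration, a single gradient descent step of this kind can be sketched as follows, following the sign convention above so that Dw is driven toward 1 on expert transitions and toward 0 on model transitions; the network architecture, layer sizes, and function names are assumptions for this sketch:

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    # D_w(s, a, s'): outputs a probability for a (state, action, new state) triplet.
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, s, a, s_next):
        return torch.sigmoid(self.net(torch.cat([s, a, s_next], dim=-1)))

def discriminator_descent_step(disc, optimizer, expert_batch, model_batch):
    # One gradient-descent step on
    #   E_p[log(1 - D_w(s_U, a, s'_U))] + E_p_theta[log(D_w(s_M, a, s'_M))],
    # which pushes D_w toward 1 on expert transitions and toward 0 on model
    # transitions, so that log(D_w) is usable as a reward for the model.
    s_e, a_e, sn_e = expert_batch
    s_m, a_m, sn_m = model_batch
    eps = 1e-8  # numerical safety for the logarithms
    loss = torch.log(1.0 - disc(s_e, a_e, sn_e) + eps).mean() \
         + torch.log(disc(s_m, a_m, sn_m) + eps).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # descent step on the loss above
    return loss.item()

For instance, a torch.optim.Adam instance over the discriminator parameters can serve as the optimizer passed in above.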
For example, log(Dw) is used as a reward r in order to train the model pθ, using the policy gradient

∇θ E_pθ[Σt γ^t·log(Dw(sM,t,at,s′M,t))]=E_pθ[Σt ∇θ log pθ(s′M,t|sM,t,at)·Q(sM,t,at,s′M,t)]

where

Q(ŝM,â,ŝ′M)=E_pθ,π[Σt γ^t·log(Dw(sM,t,at,s′M,t))|sM,0=ŝM, a0=â, s′M,0=ŝ′M]

=expected accumulated reward for a model trajectory proceeding from a model state s0=ŝM, an action a0=â, and a new model state s′0=ŝ′M.
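By way of illustration, the corresponding model update can be sketched, for example, as a plain REINFORCE step without a baseline, assuming the Gaussian transition model and the discriminator sketched above and torch tensors for the stored transitions; the function name and these simplifications are assumptions for this sketch:

import torch

def model_reinforce_step(model, disc, optimizer, trajectories, gamma=0.99):
    # REINFORCE-style update of the model parameters theta: each transition
    # (s_M, a, s'_M) receives the reward log(D_w(s_M, a, s'_M)); the
    # log-likelihood of the stored new model state under p_theta is weighted
    # by the discounted reward-to-go Q from that transition onwards.
    loss = torch.zeros(())
    for triplets in trajectories:                        # each trajectory: list of (s, a, s_next) tensors
        rewards = [torch.log(disc(s, a, s_next) + 1e-8).detach()
                   for (s, a, s_next) in triplets]
        Q, returns = torch.zeros(()), []
        for r in reversed(rewards):                      # discounted reward-to-go
            Q = r + gamma * Q
            returns.insert(0, Q)
        for (s, a, s_next), q in zip(triplets, returns):
            mean, log_std = model(s, a)                  # Gaussian head of the transition model
            dist = torch.distributions.Normal(mean, log_std.exp())
            log_prob = dist.log_prob(s_next).sum(-1)
            loss = loss - log_prob * q                   # maximize E[log p_theta * Q]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()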
An exemplifying algorithm that is based on an episode-based policy search is indicated below as pseudocode. The algorithm proceeds from an expert trajectory τE that is determined by way of expert strategy πE:
for i=1 to . . . do
 Sample model parameters θi[j]~Πωi
 for each model parameter θi[j] do
  Generate a model trajectory τ[j] with model pθi[j] in accordance with strategy π
  Determine the accumulated reward R[j] for τ[j] by way of discriminator Dwi
 end for
 Update discriminator from wi to wi+1 by maximizing
  E_p[log(Dwi(sU,a,s′U))]+E_pθi[log(1−Dwi(sM,a,s′M))]
 Update model parameter proposal distribution from Πωi to Πωi+1 using weighted maximum likelihood with the weights R[j]
end for
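A sketch of one possible weighted-maximum-likelihood update of a Gaussian model parameter proposal distribution Πω, as used in the episode-based policy search above; the exponential weighting with temperature beta and the helper evaluate_R are assumptions for illustration:

import numpy as np

def episode_based_search_step(mu, sigma, evaluate_R, num_samples=32, beta=5.0):
    # Pi_omega is taken to be a diagonal Gaussian N(mu, diag(sigma^2)) over the
    # model parameters theta. Sample candidates, evaluate their accumulated
    # reward R[j], and refit mu, sigma by weighted maximum likelihood.
    thetas = np.random.normal(mu, sigma, size=(num_samples, mu.shape[0]))
    R = np.array([evaluate_R(theta) for theta in thetas])      # R[j] from discriminator-based rewards
    w = np.exp(beta * (R - R.max()))                           # exponential weights, shifted for stability
    w = w / w.sum()
    new_mu = (w[:, None] * thetas).sum(axis=0)                 # weighted ML mean
    new_sigma = np.sqrt((w[:, None] * (thetas - new_mu) ** 2).sum(axis=0) + 1e-6)
    return new_mu, new_sigma

In this context, evaluate_R(theta) would, for example, build model pθ from the sampled parameters, generate a model trajectory in accordance with strategy π, and accumulate the rewards log(Dw) determined by discriminator 208.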
The following definitions of R, r, γ and λ are used in the algorithm:
In a subsequent step 704, action a is determined in particular by an agent 402 depending on model state sM of model 202, in accordance with strategy π.
In a step 706, a reward rπ is then determined based on strategy π, on action a, or on new model state s′M. In a step 708, strategy π is then determined depending on reward rπ in a reinforcement learning process.
This reward rπ is, for example, dependent on the task that is to be learned by way of strategy π. The reward can be defined as a function r(s,a) or r(s). It thus rewards the degree to which an action a in a state s contributes to performing the task, or how well the task is performed in a specific state.
If the task is, for example, navigating to a destination in a 2D environment that is modeled by model 202, the reward can then be, for example, the distance to the destination. A strategy for navigation for a vehicle can be correspondingly learned.
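By way of illustration only, such a reward could be defined, for example, as the negative Euclidean distance to the destination, so that maximizing the reward drives the strategy toward the destination; the assumption that the first two state entries are the x, y position is made only for this sketch:

import numpy as np

def navigation_reward(state, goal):
    # r(s): negative Euclidean distance between the (x, y) position in the
    # model state and the destination; maximal (zero) at the destination.
    position = np.asarray(state[:2])          # assumes the first two entries are x, y
    return -float(np.linalg.norm(position - np.asarray(goal)))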
Steps 702 to 708 are preferably repeated recursively. Provision can be made to repeat steps 704 and 706 recursively before step 708 is executed.
A computer-implemented method for applying control to a robot is described below. The method provides that parametric model 202 of environment 102 is trained in accordance with the method described with reference to
In one aspect, model 202 is trained, in accordance with the method described with reference to