The present application claims the benefit under 35 U.S.C. § 119 of German Patent Application No. DE 10 2022 209 845.5 filed on Sep. 19, 2022, which is expressly incorporated herein by reference in its entirety.
The present invention relates to methods for training an agent.
Reinforcement learning (RL) is a machine learning paradigm that enables an agent, e.g., a robot, to learn to perform a desired behavior relative to a task specification, e.g., which control measures to actuate in order to reach a destination in a robot navigation scenario.
Architectures that combine planning with reinforcement learning can be used effectively for decision problems (e.g., controlling a vehicle or a robot based on sensory inputs). They enable the incorporation of prior problem knowledge (e.g., a map of the environment) and can enable generalization across different problem instances (e.g., different environment layouts) through the planning part, while retaining the ability to deal with high-dimensional observations and unknown dynamics through the RL part.
The paper “Value Propagation Networks” by Nardelli et al., 2019 (https://arxiv.org/pdf/1805.11199.pdf), hereinafter referred to as Reference 1, describes an architecture with a planning module containing a neural network that, given a discrete map (image) of the environment and a target map, outputs a propagation map and a reward map that are used for the iterative planning of a value map. For action selection and training, an actor-critic control strategy that receives (excerpts from) the value map as input is added to the planning part. By back-propagating the gradients resulting from the actor-critic losses, including through the planning part, the entire architecture is trained end-to-end. VProp (for “Value Propagation”), as well as the variant MVProp (for “Max-Propagation”) also described in the paper, is proposed for problems with discrete state and action spaces, because a discretized map is required as input and the highest-valued state from a neighborhood in the planned value map must be selected as the action.
The paper “Addressing Function Approximation Error in Actor-Critic Methods” by Fujimoto et al., 2018 (https://arxiv.org/pdf/1802.09477.pdf), hereinafter referred to as Reference 2, describes an off-policy actor-critic algorithm referred to as TD3 (Twin Delayed Deep Deterministic Policy Gradient).
Training methods for agents that further improve agent performance, especially in special environments, e.g., with different terrain types, are desirable.
According to various embodiments of the present invention, a method is provided for training an agent that includes performing a plurality of control passes, wherein in each control pass:
the planning component being trained to reduce an auxiliary loss that contains, for each of a plurality of coarse-scale state transitions caused by the ascertained actions from a coarse-scale state to a coarse-scale successor state, a loss representing a deviation between a value outputted by the planning component for the coarse-scale state and the sum of a reward received for the coarse-scale state transition and at least a portion of the value of the coarse-scale successor state.
The auxiliary loss, also referred to as the planning component loss or, in the example embodiment of the present invention based on MVProp described below, the MVProp auxiliary loss, improves the training in that it yields better performance of the trained agent (a higher success rate in completing a task and a lower variance in performance between agents trained in independent training processes) when the agent is applied to a control decision process, such as a robot navigation task. In particular, the auxiliary loss for training the planning component (also referred to herein as the planning module) enables high success rates in application scenarios with different terrain types, which require learning a diversified propagation factor map for the environment.
The (fine-scale) states that can be taken in the environment are, for example, positions (e.g., in the case of a navigation task in which the environment is simply a 2D or 3D space of positions). However, the state space can also be more complicated (e.g., may include orientations), so that each state that can be taken in the environment comprises more than just a position in the environment (e.g., a pair of position and orientation, for example when controlling a robotic arm). Whether and how well a state can be traversed (i.e., the information about the traversability of a state, e.g., in the form of a propagation factor) is to be understood as meaning that the state (e.g., a certain orientation at a certain position) can be taken and that, starting from this state, another state can be reached again. Here, with respect to (coarse-scale) states, intermediate values (e.g., for the propagation factors) can result (e.g., between 0 and 1) that express how likely such a transition is (e.g., the risk of getting stuck in muddy terrain) or with what relative speed the state can be traversed (e.g., the slowing of the movement in sandy terrain).
The plurality of states reached in the environment by the agent starting from an output of the neural actor network need not be all the states reached by the agent (in the control pass). Rather, actions can be chosen randomly for some of the states for exploration. The coarse-scale state transitions caused by these actions can also be included in the loss for the training of the planning component. In other words, each state transition is due to an action of the agent that is either selected according to the learned policy (strategy), i.e., based on the output of the actor network, or is selected randomly for exploration purposes.
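As an illustration of this action selection, below is a minimal Python sketch, assuming TD3-style exploration as in Reference 2 (Gaussian noise added to the actor output and clipped to the action bounds); the function and parameter names (select_action, expl_noise, action_low, action_high) are illustrative assumptions and not taken from the original.

import numpy as np

def select_action(actor, state, action_low, action_high,
                  expl_noise=0.1, explore_randomly=False, rng=None):
    rng = rng or np.random.default_rng()
    # Either choose a purely random action for exploration ...
    if explore_randomly:
        return rng.uniform(action_low, action_high)
    # ... or take the actor's action and perturb it with Gaussian noise,
    # clipped to the valid action range (cf. the exploration used in Reference 2).
    action = np.asarray(actor(state), dtype=np.float64)
    action = action + rng.normal(0.0, expl_noise, size=action.shape)
    return np.clip(action, action_low, action_high)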
For example, the layout information includes information about the ground of the environment (e.g., terrain types), about obstacles in the environment, and/or about one or more destinations in the environment.
The formulation “at least a portion of the value of the coarse-scale successor state” is to be understood to mean that the value of the coarse-scale successor state, as it occurs in the sum, can be discounted as is standard in reinforcement learning (i.e., weighted by a discount factor, standardly (as below) designated γ, that is less than 1).
Various exemplary embodiments of the present invention are given below.
Embodiment 1 is a method for training an agent as described above.
Embodiment 2 is a method according to embodiment 1, wherein the planning component is trained to reduce an overall loss that includes, in addition to the auxiliary loss, an actor loss that penalizes the neural actor network for selecting actions to which a critic network assigns a low evaluation.
Thus, in the training of the planning component, the requirement of high control performance through the actions outputted by the actor network is taken into account.
Embodiment 3 is a method according to embodiment 1, wherein the planning component is trained to reduce an overall loss, which in addition to the auxiliary loss includes a critic loss that penalizes deviations of evaluations, provided by a critic network, of state-action pairs from evaluations that include sums of the rewards actually obtained by performing the actions of the state-action pairs in the states of the state-action pairs, and discounted evaluations, provided by a critic network, of successor state-successor action pairs, the successor actions to be used for the successor states being determined with the aid of the actor network for the successor states.
Thus, in the training of the planning component, the requirements for high accuracy of the critic network (or critic networks if, for example, a target critic network is also used) are taken into account.
Embodiment 4 is a method according to embodiment 1, wherein the planning component is trained to reduce an overall loss that includes, in addition to the auxiliary loss, an actor loss that penalizes the neural actor network for selecting actions to which a critic network gives a low evaluation, and a critic loss that penalizes deviations of evaluations, provided by a critic network, of state-action pairs from evaluations that include sums of the rewards actually obtained by performing the actions of the state-action pairs in the states of the state-action pairs, and discounted evaluations, provided by a critic network, of successor state-successor action pairs, the successor actions to be used for the successor states being determined with the aid of the actor network for the successor states.
Thus, in the training of the planning component, both the requirement of high control performance through the actions outputted by the actor network and the requirement of high accuracy of the critic network are taken into account.
Embodiment 5 is a method according to one of the embodiments 1 to 4, wherein the layout information includes information about the location of different terrain types in the environment, and the representation includes, for each terrain type, a map with binary values that indicates, for each of a plurality of locations in the environment, whether the terrain type is present at the location.
This improves performance for application scenarios with a plurality of terrain types compared to using a single channel for all terrain types.
Embodiment 6 is a method according to one of embodiments 1 to 5, wherein the values ascertained by the planning component for the neighborhood of coarse-scale states are normalized with respect to the mean value of these ascertained values and the standard deviation of these ascertained values.
This improves the performance of the trained agent for environments that are larger than those that occur in the training.
Embodiment 7 is a control device set up to carry out a method according to one of embodiments 1 to 6.
Embodiment 8 is a computer program having instructions that, when executed by a processor, cause the processor to carry out a method according to one of embodiments 1 to 6.
Embodiment 9 is a computer-readable medium that stores instructions that, when executed by a processor, cause the processor to carry out a method according to one of embodiments 1 to 6.
In the figures, similar reference signs generally refer to the same parts in all the different views. The figures are not necessarily to scale, the emphasis being instead generally on illustrating the principles of the present invention. In the following description, various aspects of the present invention are described with reference to the figures.
The following detailed description refers to the figures, which for the purpose of explanation show specific details and aspects of the present disclosure in which the present invention may be carried out. Other aspects may be used, and structural, logical, and electrical changes may be made, without departing from the scope of protection of the present invention. The various aspects of the present disclosure are not necessarily mutually exclusive, as some aspects of the present disclosure may be combined with one or more other aspects of the present disclosure to form new aspects.
Various examples are described in more detail below.
A controlled object 100 (e.g., a robot or a vehicle) is located in an environment 101. The controlled object 100 has a start position 102 and is supposed to reach a destination position 103. There are obstacles 104 in the environment 101 that are to be traveled around by controlled object 100. For example, the obstacles cannot be passed through by controlled object 100 (e.g., walls, trees, or rocks), or are to be avoided because the agent would damage or injure them (e.g., pedestrians).
The controlled object 100 is controlled by a control device 105 (where control device 105 may be located in the controlled object 100 or may be provided separately from it, i.e. the controlled object may be remotely controlled). In the example scenario of
Moreover, the embodiments are not limited to the scenario in which a controlled object such as a robot (as a whole) is to be moved between the positions 102, 103, but may also be used, for example, to control a robot arm whose end effector is to be moved between the positions 102, 103 (without running into obstacles 104), etc.
Accordingly, in the following, terms such as robot, vehicle, machine, etc., are used as examples of the object to be controlled or of the computer-controlled system (e.g. a robot with objects in its workspace). The approaches described here can be used with various types of computer-controlled machines such as robots or vehicles and others. The general term “agent” is also used below to refer in particular to all types of physical systems that can be controlled using the approaches described below. However, the approaches described below can be applied to any type of agent (e.g., including an agent that is only simulated and does not physically exist).
In the ideal case, control device 105 has learned a control strategy that allows it to successfully control the controlled object 100 (from the start position 102 to the destination position 103 without colliding with obstacles 104) for arbitrary scenarios (i.e., environments, start and destination positions) that the control device 105 has not encountered (during training), i.e., to select an action (here, a movement in the 2D environment) for each position. Mathematically, this can be formulated as a Markov decision process.
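As a point of reference (a standard formalization, not specific to the present example), such a Markov decision process can be written as the tuple

$$ (\mathcal{S}, \mathcal{A}, P, R, \gamma), $$

where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P(s' \mid s, a)$ the transition probability, $R(s, a)$ the reward, and $\gamma \in (0, 1)$ the discount factor; the objective is a policy $\pi$ that maximizes the expected discounted return $\mathbb{E}\bigl[\sum_t \gamma^t r_t\bigr]$.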
According to various embodiments, a control strategy is trained together with a planning module using reinforcement learning.
By combining a planning algorithm with a reinforcement learning algorithm such as TD3 (Twin Delayed Deep Deterministic Policy Gradient), efficient learning can be achieved, combining the advantages of both approaches. Neural networks are able to learn approximations of the value function and of the dynamics of an environment 101. These approximations can also be used to approximate planning operations such as value iteration. Neural networks trained to approximate these planning operations are called differentiable planning modules. These modules are fully differentiable because they are neural networks. Therefore, they can be trained end-to-end using reinforcement learning.
MVProp (Max Value Propagation) is an example of such a differentiable planning module. It mimics a value iteration using a propagation map and a reward map, and assigns a propagation factor and a reward factor to each state. The propagation factor p represents how well a state propagates (i.e., how well the agent can traverse the state): if a state does not propagate at all because it is an end state 103 or corresponds to an obstacle 104 (i.e., it cannot be taken by the particular robotic device), the propagation factor should be close to zero. If, on the other hand, the state propagates (i.e., can be taken (and thus also traversed) by the agent), the propagation factor should be close to 1. The propagation map models the transition function between two states. The reward factor represents the reward for entering a state and models the reward function. The value ν (in the sense of a “usefulness” for reaching the relevant destination; this can be viewed as the expected return of the state) of a state (indexed by a pair of indices ij) is iteratively (indexed by k) ascertained from the reward factor of the state according to equations (1) and (2).
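As a point of reference, the MVProp value iteration as described in Reference 1 has the following form (a sketch in the present notation, with ν the value, p the propagation factor, r the reward factor, and N(i, j) the neighborhood of cell ij; the exact equations (1) and (2) may differ in details):

$$
\nu^{(0)}_{ij} = r_{ij}, \qquad
\nu^{(k)}_{ij} = \max\Bigl(\nu^{(k-1)}_{ij},\ \max_{(i',j') \in N(i,j)} \bigl(r_{ij} + p_{ij}\,\bigl(\nu^{(k-1)}_{i'j'} - r_{ij}\bigr)\bigr)\Bigr).
$$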
According to various embodiments, the planning module, here MVProp, is trained together with an actor-critic RL method. This is explained in more detail below with reference to
As explained above, the architecture includes a planning module (here, MVProp) 201 with a neural network (referred to as a propagation network) 202 that is to be trained to ascertain, from feature information (or layout information)
of the state. For this purpose, planning module 201 (according to (1) and (2)) uses the reward factors
Here, the states z refer to coarse-scale (or “abstract” or “high-level”) states of a coarse-scale planning on a coarse, discrete (map) representation of the environment 101. For example, the coarse, discrete representation of the environment is a grid 106 (shown as dashed lines in
According to various embodiments, the items of feature information L are divided into binary feature maps, one for each feature type, e.g., a bitmap indicating for each state whether the state contains an obstacle (i.e., is not passable) and another bitmap indicating for each state whether the state is a destination state. These bitmaps together are referred to as a split feature layout
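As an illustration, a minimal Python sketch of how such a split feature layout can be built by one-hot encoding a grid of feature codes into one binary channel per feature type; the function name split_layout and the integer feature codes are illustrative assumptions, not taken from the original.

import numpy as np

def split_layout(feature_grid, num_feature_types):
    # One-hot encode an integer feature grid (H x W) into binary feature maps,
    # one channel per feature type (num_feature_types x H x W).
    h, w = feature_grid.shape
    layout = np.zeros((num_feature_types, h, w), dtype=np.float32)
    for t in range(num_feature_types):
        layout[t] = (feature_grid == t)
    return layout

# Example with the feature codes 0 = free, 1 = obstacle, 2 = destination:
grid = np.array([[0, 0, 1],
                 [0, 2, 1]])
channels = split_layout(grid, num_feature_types=3)  # shape (3, 2, 3)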
To train the planning module, the transitions between abstract states (from control passes) are stored in a replay buffer 207 for the abstract states, denoted by Bz. This replay buffer 207 stores tuples of layout information, abstract state, abstract reward, next abstract state (i.e., abstract successor state), and the information as to whether the control pass ends with this transition.
The sum of (fine-scale) rewards (i.e., rewards from state transitions s to s′) obtained during the transition from z to z′ (which, because of the coarser grid, may require many state transitions on the fine-scale grid) is designated the abstract reward rz.
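In formula form (a sketch; the time indices t0 and t1 are introduced here only for illustration), the abstract reward collected during an abstract transition from z to z′ can be written as

$$ r_z = \sum_{t = t_0}^{t_1 - 1} r_t, $$

where t0 is the fine-scale time step at which the agent is in the abstract state z and t1 is the first fine-scale time step at which it is in the abstract successor state z′.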
Using the values ν from the value map and the layout information L, the inputs for the actor 205 and the critic 206 are ascertained by two functions, Fπ(ν, L, s) for the actor and Fc(ν, L, s) for the critic.
The functions Fπ and Fc first map the state s (with a corresponding function M) to an abstract state z:
M(s)=z (5)
Using the abstract state z and the layout information L, the control device 105 can ascertain feature layout neighborhood information and value neighborhood information ν(z) (which contains the values in the neighborhood of z). Here the neighborhood of a state is formed by the horizontally, vertically, and diagonally adjacent states (i.e. tiles in the representation of
The value neighborhood information ν(z) is normalized by a normalizer 208 to form normalized value neighborhood information
ν̄(z) = { (ν − μ)/σ : ν ∈ ν(z) }. (7)
Here μ and σ are the mean value and standard deviation of the values in the neighborhood, respectively.
By subtracting the mean and dividing by the standard deviation, the actor 205 and the critic 206 are trained on the relative values of the neighborhood instead of the absolute values. Fπ forms the input state for the actor, ζπ, by concatenation of the normalized value neighborhood information
replay buffer 209 for the fine-scale states, which is designated by Bs. The successor state s′ results here from the interaction of the action a with the environment 210.
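The following minimal Python sketch illustrates the neighborhood extraction, the normalization according to (7), and the concatenation performed by Fπ, assuming a 2D grid of values, a 3×3 neighborhood, and a simple flattening of the neighborhood; the function names and the border padding are illustrative assumptions, not taken from the original.

import numpy as np

def value_neighborhood(value_map, z):
    # Extract the 3x3 neighborhood of the abstract state z = (i, j) from the value map,
    # padding at the border by repeating the edge values.
    padded = np.pad(value_map, 1, mode="edge")
    i, j = z
    return padded[i:i + 3, j:j + 3]

def normalize_neighborhood(nu_z, eps=1e-6):
    # Normalize the neighborhood values with respect to their mean and standard deviation, cf. (7).
    mu, sigma = nu_z.mean(), nu_z.std()
    return (nu_z - mu) / (sigma + eps)

def actor_input(value_map, z, s):
    # Form the actor input by concatenating the normalized value neighborhood
    # with the fine-scale state s.
    nu_bar = normalize_neighborhood(value_neighborhood(value_map, z))
    return np.concatenate([nu_bar.ravel(), np.asarray(s, dtype=np.float32).ravel()])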
Different losses are used to train the propagation network 202, the actor 205, and the critic 206. In the following, it is assumed that double-Q learning is used for the critic, so that there are two critic networks with two critic losses Lcritic
Here Φϕ designates the (trainable) mapping from
where, if target networks are used (i.e., two versions of each network, one being the target network whose parameters follow those of the other), the network parameters ψ, ϕ here are those of the target networks. The corresponding parameters are marked with the index “target” in the following.
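For reference, a sketch of the TD3 losses from Reference 2, assuming (as the context suggests) that θ denotes the actor parameters and ψ1, ψ2 the parameters of the two critics, and abbreviating the inputs produced by Fπ and Fc simply as the state; the exact form used in the embodiment may differ:

$$
\begin{aligned}
y &= r + \gamma \min_{i=1,2} Q_{\psi_{i,\mathrm{target}}}\bigl(s',\, \pi_{\theta_{\mathrm{target}}}(s') + \epsilon\bigr), \qquad \epsilon \sim \mathrm{clip}\bigl(\mathcal{N}(0, \tilde\sigma),\, -c,\, c\bigr),\\
L_{\mathrm{critic},i} &= \mathbb{E}_{(s,a,r,s') \sim B_s}\bigl[\bigl(Q_{\psi_i}(s, a) - y\bigr)^2\bigr],\\
L_{\mathrm{actor}} &= -\,\mathbb{E}_{s \sim B_s}\bigl[Q_{\psi_1}\bigl(s, \pi_\theta(s)\bigr)\bigr].
\end{aligned}
$$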
Another loss that is used is referred to as the MVProp auxiliary loss. It is a TD(0) loss with respect to the abstract states.
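Based on the description above, this loss has a form like the following sketch (the squared error, the discount factor γ, the termination indicator d, and the target version of the planned value are assumptions consistent with a TD(0) loss and the target update (17); the exact equation may differ):

$$
L_{\mathrm{MVProp}} = \mathbb{E}_{(L,\, z,\, r_z,\, z',\, d) \sim B_z}\Bigl[\bigl(\nu_\phi(z) - \bigl(r_z + \gamma\,(1 - d)\,\nu_{\phi_{\mathrm{target}}}(z')\bigr)\bigr)^2\Bigr],
$$

where νϕ(z) denotes the value planned for the abstract state z using the propagation network 202 with parameters ϕ.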
The networks are trained (i.e., the parameters θ, ψ, ϕ are adjusted) such that these losses are minimized (for training batches B sampled from the replay buffers 207, 209).
However, not all parameters need to be adjusted for each loss. In the following, four variants are described.
The first variant, referred to as CDPM-0 (CDPM: Continuous Differentiable Planning Module), minimizes each loss only over the parameters of the network to which the loss relates:
The second variant, referred to as CDPM actor, differs from CDPM-0 in that the parameters ϕ of the planning network 202 are also trained based on the actor loss:
In contrast, in the third variant, referred to as CDPM critic, the parameters ϕ of the planning network 202 are also trained based on the critic loss:
In the fourth variant, referred to as CDPM-B, the parameters ϕ of the planning network 202 are trained based on what is known as the MVProp loss, the actor loss, and the critic loss:
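The corresponding minimization rules are not reproduced here; schematically (a sketch, assuming that θ denotes the actor parameters, ψ the critic parameters, and ϕ the parameters of the planning network 202), the four variants differ only in which losses contribute gradients to ϕ, while θ is always updated by the actor loss and ψ by the critic losses:

$$
\begin{aligned}
\text{CDPM-0:}      &\quad \phi \ \text{updated by}\ L_{\mathrm{MVProp}}\ \text{only},\\
\text{CDPM actor:}  &\quad \phi \ \text{updated by}\ L_{\mathrm{MVProp}} + L_{\mathrm{actor}},\\
\text{CDPM critic:} &\quad \phi \ \text{updated by}\ L_{\mathrm{MVProp}} + L_{\mathrm{critic}},\\
\text{CDPM-B:}      &\quad \phi \ \text{updated by}\ L_{\mathrm{MVProp}} + L_{\mathrm{actor}} + L_{\mathrm{critic}}.
\end{aligned}
$$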
When target networks are used, their parameters are updated using Polyak averaging, for example; the target planning network is then updated according to
ϕtarget←τϕ+(1−τ)ϕtarget. (17)
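A minimal Python sketch of this update for parameters stored as a dictionary of arrays or scalars (the function name and the dictionary representation are illustrative assumptions):

def polyak_update(target_params, params, tau=0.005):
    # In-place Polyak averaging: target <- tau * params + (1 - tau) * target, cf. (17).
    for name, value in params.items():
        target_params[name] = tau * value + (1.0 - tau) * target_params[name]
    return target_params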
When the agent implemented by control device 105 is initialized in an environment 101, it receives the layout information L and the abstract reward rz is set to zero. The control device 105 then transforms the layout information
Algorithm 1 describes the generation of the training state transitions in pseudocode (with the standard English keywords if, for, while, return, not, end, etc.).
Algorithm 2 describes the training in pseudocode (with the standard English keywords if, for, while etc.).
In the above example, HER (Hindsight Experience Replay) is used. This refers to a technique in which the goal of a stored transition is replaced (e.g., by a state actually reached) so that the agent can learn from control runs in which the original goal was not reached. This is optional.
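As an illustration of such goal relabeling, a minimal Python sketch of the “final” relabeling strategy for goal-conditioned transitions; the transition fields, the function names, and the sparse reward convention (0 on reaching the goal, −1 otherwise) are illustrative assumptions, not taken from the original.

def her_relabel(episode, reward_fn):
    # Relabel every transition of an episode with the finally achieved state as the goal
    # and recompute the rewards accordingly (Hindsight Experience Replay, "final" strategy).
    achieved_goal = episode[-1]["next_state"]
    relabeled = []
    for transition in episode:
        new_transition = dict(transition)
        new_transition["goal"] = achieved_goal
        new_transition["reward"] = reward_fn(transition["next_state"], achieved_goal)
        relabeled.append(new_transition)
    return relabeled

def sparse_reward(state, goal, tol=1e-3):
    # Simple sparse reward: 0 if the goal is reached (within a tolerance), -1 otherwise.
    return 0.0 if all(abs(s - g) <= tol for s, g in zip(state, goal)) else -1.0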
When the trained agent is used, the processing takes place analogously to the training. However, the critic 206 is then no longer used, and no exploration takes place (i.e., the actions are always those selected by the (trained) actor 205).
As described above, the architecture of
Compared to MVProp, according to one embodiment, the original off-policy actor-critic RL algorithm using importance weighting is replaced by the off-policy actor-critic TD3 algorithm described in Reference 2. This further improves the training compared to MVProp.
In addition, as described above, an additional auxiliary loss (denoted in the above example by LMVProp) is used directly for the planning module 201; its value is a function only of the output of the network 202 that outputs the propagation factors (i.e., the outputs of the other networks do not enter into this loss).
In addition, MVProp is adapted for use with continuous states and continuous actions, in that
Moreover, according to various embodiments, the representations of inputs of the neural networks are adjusted:
These adjustments improve the performance (success rate) for specific application scenarios. For example, encoding the environment according to different terrain types (i.e. using different channels for different terrain types) improves the performance in application scenarios with multiple terrain types compared to using a single channel in which different terrain types are simply assigned different integers. Normalization improves generalization when applied to larger environments (layouts) than those that occur in training.
In summary, according to various embodiments, a method is provided as shown in
In 301, a plurality of control passes are performed, where, in each control pass,
In 306, the planning component is trained to reduce a loss (i.e. adjusted to reduce the loss) that includes, for each of a plurality of coarse-scale state transitions from a coarse-scale state to a coarse-scale successor state caused by the determined actions, an auxiliary loss that represents (or includes) a deviation between a value output by the planning component for the coarse-scale state and the sum of a reward received for the coarse-scale state transition and at least a portion of the value of the coarse-scale successor state.
According to one embodiment, a plurality of control passes are performed, wherein in each control pass
The method of
Thus, according to various embodiments, the method is in particular computer-implemented.
For example, the approach of
Various embodiments can receive and use sensor signals from various sensors such as video, radar, lidar, ultrasound, motion, thermal imaging, etc., for example to obtain sensor data regarding states and configurations and scenarios (including the layout of the environment, i.e. layout information). The sensor data can be processed. This can include classifying the sensor data or carrying out a semantic segmentation on the sensor data, for example in order to detect the presence of objects (in the environment in which the sensor data were obtained). Embodiments can be used to train a machine learning system and to control a robot, e.g. autonomous robotic manipulators, in order to achieve different manipulation tasks in different scenarios. In particular, embodiments are applicable to controlling and monitoring the execution of manipulation tasks, e.g. in assembly lines.
Although specific embodiments have been shown and described herein, it will be recognized by those skilled in the art that the specific embodiments shown and described herein may be exchanged for a variety of alternative and/or equivalent implementations without departing from the scope of protection of the present invention. The present application is intended to cover any adaptations or variations of the specific embodiments discussed herein.