The present application claims the benefit under 35 U.S.C. § 119 of German Patent Application No. DE 102020211648.2 filed on Sep. 17, 2020, which is expressly incorporated herein by reference in its entirety.
Various exemplary embodiments relate in general to a device and a method for controlling a robotic device.
Robotic devices may be controlled using robot control models. For this purpose, a robot control model may be trained with the aid of machine learning, for example reinforcement learning. In the process, the robot control model may select, with the aid of an objective-directed policy, an action to be carried out by the robotic device for a present state of the robotic device. The policy maps a particular state of multiple states onto an action of multiple actions. The policy may be updated during the training of the robot control model and/or during the inference of the trained robot control model. It may be desirable and/or necessary for the updated policy to remain similar to the initial policy within a predefined region (a trust region, for example).
Trust region policy optimization (TRPO) is described in Schulman et al., “Trust Region Policy Optimization,” ICML, Proceedings of Machine Learning Research, 37, 2015, in which a policy update takes place under a condition in such a way that the updated policy is within a trust region. The condition is based on a heuristic approximation of the Kullback-Leibler (KL) divergence between the initial policy and the updated policy, namely, an average KL divergence over the states.
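For orientation, the averaged trust region condition used by TRPO may be written in the following standard form, consistent with Schulman et al. (the notation, in particular the bound δ, is an assumption and may differ from that reference):

    % Average (state-averaged) KL trust region condition of TRPO;
    % pi_old is the initial policy, pi_theta the updated policy, delta the trust region bound.
    \mathbb{E}_{s}\left[ D_{\mathrm{KL}}\!\left( \pi_{\text{old}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta

Because only the average over the states is bounded, individual states may still violate the bound, which motivates the per-state conditions discussed below.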
However, in reinforcement learning the exploration-exploitation compromise (also referred to as exploration-exploitation dilemma) must be taken into account.
It is described in Abdolmaleki et al., “Model-based relative entropy stochastic search,” Advances in Neural Information Processing Systems, 2015, that if the entropy of the updated policy is not taken into account in updating the policy, this may result in a premature policy convergence due to intensified exploitation. For the policy update within the trust region, the entropy of the policy may also be considered as an additional condition.
Akrour et al., “Projections for Approximate Policy Iteration Algorithms,” ICML, 2019, build on the TRPO method described by Schulman et al. and on the additional policy entropy condition described by Abdolmaleki et al., wherein an updated condition-limited policy is projected onto a condition-unlimited policy.
However, the condition used for TRPO, and thus also the projection of same, is based on the average KL divergence over all states. Therefore, individual states of the projected policy could violate the condition of the trust region (for example, could be outside the trust region). As a result, it could be necessary to provide a method that is able to ensure the trust region for each state during the update of the policy.
In addition, the described trust region policy optimization and the projection of the policy are limited to the averaged KL divergence. Therefore, it could be helpful and/or necessary for the projection of a policy onto a trust region to use other mathematical methods, for example methods that are better suited mathematically (for example, mathematical methods with lower computational complexity, such as mathematical methods that may be solved in closed form).
Furthermore, it could be advantageous, and for an end-to-end training of the robot control model possibly necessary, to provide a method for projecting the policy onto a trust region with the aid of which the policy projection may be implemented as one or multiple differentiable layers in a neural network.
A method is described in Amos and Kolter, “OptNet: Differentiable Optimization as a Layer in Neural Networks,” 34th International Conference on Machine Learning, 2017, which allows optimization problems to be integrated as differentiable layers into a neural network (OptNet).
A method and device in accordance with example embodiments of the present invention may allow a robot control model to be trained in such a way that a trust region (a particular trust region, for example) is ensured for each state of the robot control model during an update of the policy of the robot control model.
Consequently, the device and the method for controlling a robotic device in accordance with example embodiments of the present invention are able to train the robot control model more efficiently (for example more quickly, for example with greater accuracy, for example with an improved exploration-to-exploitation ratio).
A robot control model may be a model that is based on machine learning. The robot control model may include a reinforcement learning algorithm, for example. According to various exemplary embodiments, at least a portion of the robot control model may be implemented with the aid of a neural network.
A robotic device may be any type of computer-controlled device, such as a robot (for example, a manufacturing robot, a maintenance robot, a household robot, a medical robot, etc.), a vehicle (an autonomous vehicle, for example), a household appliance, a production machine, a personal assistant, an access control system, etc.
Due to projecting the updated policy of the robot control model in such a way that the trust region is ensured for each state of the robot control model, the exploration-exploitation compromise may, for example, be controlled (for example improved, for example optimized) during the reinforcement learning.
In accordance with an example embodiment of the present invention, the ascertainment of the updated policy using the carried-out sequence of actions may include: ascertaining a particular reward for each carried-out action of the carried-out sequence of actions by applying a reward function to the particular resulting state; and ascertaining the updated policy, using the initial policy and the ascertained rewards, in such a way that an expected reward is maximized. The features described in this paragraph in combination with the first example form a second example.
In accordance with an example embodiment of the present invention, the projection of the updated policy onto the projected policy may include: projecting the updated policy onto the projected policy in such a way that for each state of the plurality of states of the projected policy: a similarity value according to the similarity metric between the projected policy and the updated policy is maximized, a similarity value according to the similarity metric between the projected policy and the initial policy is greater than the predefined threshold value, and an entropy of the projected policy is greater than or equal to a predefined entropy threshold value. The features described in this paragraph in combination with the first example or the second example form a third example.
The condition that the entropy for each state of the plurality of states of the projected policy is greater than or equal to a predefined entropy threshold value may result, for example, in not only the covariance, but also the expected value of the multivariate normal distribution of the projected policy being changed during updating of the policy.
In accordance with an example embodiment of the present invention, the initial policy may include an initial multivariate normal distribution of the plurality of actions. The updated policy may include an updated multivariate normal distribution of the plurality of actions. The projected policy may include a projected multivariate normal distribution of the plurality of actions. The projection of the updated policy onto the projected policy may include: projecting the updated policy onto the projected policy in such a way that for each state of the plurality of states of the projected policy: a similarity value according to the similarity metric between the projected multivariate normal distribution and the updated multivariate normal distribution is maximized, and a similarity value according to the similarity metric between the projected multivariate normal distribution and the initial multivariate normal distribution is greater than the predefined threshold value. The features described in this paragraph in combination with one or multiple of the first example through the third example form a fourth example.
The projection of the updated policy onto the projected policy may include: projecting the updated policy onto the projected policy in such a way that for each state of the plurality of states of the projected policy: a similarity value according to the similarity metric between the projected multivariate normal distribution and the updated multivariate normal distribution is maximized, and a similarity value according to the similarity metric between the projected multivariate normal distribution and the initial multivariate normal distribution is greater than the predefined threshold value; and an entropy of the projected multivariate normal distribution is greater than or equal to the predefined entropy threshold value. The features described in this paragraph in combination with the third example and the fourth example form a fifth example.
The projection of the updated policy onto the projected policy in such a way that for each state of the plurality of states of the projected policy, a similarity value according to the similarity metric between the projected multivariate normal distribution and the updated multivariate normal distribution is maximized, and a similarity value according to the similarity metric between the projected multivariate normal distribution and the initial multivariate normal distribution is greater than the predefined threshold value, may include: ascertaining the projected multivariate normal distribution using the initial multivariate normal distribution, the updated multivariate normal distribution, and the predefined threshold value with the aid of the Mahalanobis distance and the Frobenius norm. The features described in this paragraph in combination with the fourth example or the fifth example form a sixth example.
The projection of the updated policy onto the projected policy in such a way that for each state of the plurality of states of the projected policy, a similarity value according to the similarity metric between the projected multivariate normal distribution and the updated multivariate normal distribution is maximized, and a similarity value according to the similarity metric between the projected multivariate normal distribution and the initial multivariate normal distribution is greater than the predefined threshold value, may include: ascertaining the projected multivariate normal distribution using the initial multivariate normal distribution, the updated multivariate normal distribution, and the predefined threshold value with the aid of the Wasserstein distance. The features described in this paragraph in combination with the fourth example or the fifth example form a seventh example.
Use of the Mahalanobis distance and the Frobenius norm according to the sixth example or the Wasserstein distance according to the seventh example has the effect that the projection of the updated policy may be ascertained in a mathematically closed form. For example, the projected policy ascertained in this way may be integrated as a layer (or multiple layers) into a neural network.
The projection of the updated policy onto the projected policy in such a way that for each state of the plurality of states of the projected policy, a similarity value according to the similarity metric between the projected multivariate normal distribution and the updated multivariate normal distribution is maximized, and a similarity value according to the similarity metric between the projected multivariate normal distribution and the initial multivariate normal distribution is greater than the predefined threshold value, may include: ascertaining the projected multivariate normal distribution using the initial multivariate normal distribution, the updated multivariate normal distribution, and the predefined threshold value with the aid of a numerical optimizer. The features described in this paragraph in combination with the fourth example or the fifth example form an eighth example.
The numerical optimizer may ascertain the projected multivariate normal distribution using the Kullback-Leibler divergence. The feature described in this paragraph in combination with the eighth example forms a ninth example.
The ascertainment of the projected multivariate normal distribution may include a Lagrange multiplier method. The feature described in this paragraph in combination with one or multiple of the sixth example through the ninth example forms a tenth example.
The robot control model may be a neural network. The feature described in this paragraph in combination with one or multiple of the first example through the tenth example forms an eleventh example.
The projection of the updated policy onto the projected policy may be implemented as one or multiple layers (as differentiable layers, for example) in the neural network. The feature described in this paragraph in combination with the eleventh example forms a twelfth example.
The integration of the projection of the policy into a trust region for each state as one or multiple differentiable layers into a neural network has the effect that the neural network may be trained end-to-end using the policy projection, the condition of the trust region being ensured for each state during the training.
The adaptation of the robot control model for implementing the projected policy may include an adaptation of the robot control model with the aid of a gradient method. The feature described in this paragraph in combination with one or multiple of the first example through the twelfth example forms a thirteenth example.
The control of the robotic device using the adapted robot control model may include: carrying out one or multiple actions by the robotic device, using the adapted robot control model; updating the policy with the aid of a regression, using the carried-out one or multiple actions. The features described in this paragraph in combination with one or multiple of the first example through the thirteenth example form a fourteenth example.
The control of the robotic device using the adapted robot control model may include: carrying out one or multiple actions by the robotic device, using the adapted robot control model; updating the policy, using the carried-out one or multiple actions, in such a way that a difference between an expected reward and a similarity value according to the similarity metric between the projected policy and the updated policy is maximized. The features described in this paragraph in combination with one or multiple of the first example through the thirteenth example form a fifteenth example.
In accordance with an example embodiment of the present invention, a method for controlling a robotic device may include: carrying out a sequence of actions by the robotic device using a robot control model, the carrying out of each action of the sequence of actions including: ascertaining an action for a present state of a plurality of states of the robotic device with the aid of the robot control model, using an initial policy, carrying out the ascertained action by the robotic device, and ascertaining the state of the robotic device resulting from the carried-out action; ascertaining an updated policy using the carried-out sequence of actions; ascertaining a projected policy in such a way that a difference between a reward expected for the projected policy and a similarity value according to the similarity metric between each state of the plurality of states of the projected policy and the updated policy is maximized; adapting the robot control model for implementing the projected policy; and controlling the robotic device using the adapted robot control model. The method having the features described in this paragraph forms a sixteenth example.
In accordance with an example embodiment of the present invention, a method for controlling a robotic device may include: carrying out a sequence of actions by the robotic device using a robot control model, the carrying out of each action of the sequence of actions including: ascertaining an action for a present state of a plurality of states of the robotic device with the aid of the robot control model, using an initial policy, carrying out the ascertained action by the robotic device, and ascertaining the state of the robotic device resulting from the carried-out action; ascertaining an updated policy using the carried-out sequence of actions; ascertaining a projected policy in such a way that a difference between a reward expected for the projected policy and a similarity value according to the similarity metric between each state of the plurality of states of the projected policy and the updated policy is maximized; and controlling the robotic device with the aid of the robot control model, using the projected policy. The method having the features described in this paragraph forms a seventeenth example.
A computer program product may store program instructions which, when executed, carry out the method according to one or multiple of the first example through the seventeenth example. The computer program product having the features described in this paragraph forms a nineteenth example.
A nonvolatile memory medium may store program instructions which, when executed, carry out the method according to one or multiple of the first example through the seventeenth example. The nonvolatile memory medium having the features described in this paragraph forms a twentieth example.
A nonvolatile memory medium may store program instructions which, when executed, carry out the method according to one or multiple of the first example through the seventeenth example. The nonvolatile memory medium having the features described in this paragraph forms a twenty-first example.
Exemplary embodiments of the present invention are illustrated in the figures and explained in greater detail below.
In one specific embodiment, a “computer” may be understood as any type of logic-implementing entity, which may be hardware, software, firmware, or a combination thereof. Therefore, in one specific embodiment a computer may be a hard-wired logic circuit or a programmable logic circuit, such as a programmable processor, for example a microprocessor (for example, a CISC (processor including a large instruction set) or a RISC (processor including a reduced instruction set)). A computer may include one or multiple processors. A computer may also be software that is implemented or executed by a processor, for example any type of computer program, for example a computer program that uses a virtual machine code such as Java. In accordance with one alternative specific embodiment, any other type of implementation of the particular functions, described in greater detail below, may be understood as a computer.
Robotic devices may be controlled using reinforcement learning-based robot control models. To ensure an improved (optimal, for example) compromise of exploration and exploitation when updating the policy of the robot control model, it may be necessary to update the policy within a trust region. Various exemplary embodiments relate to a device and a method for controlling a robotic device which are able to train a robot control model in such a way that an updated policy is present within the trust region for each state of the robotic device. The trust region for each state of the robotic device may be taken into account and ensured when updating the policy of the robot control model.
Robotic device 101 includes robot members 102, 103, 104 and a base (or in general a mounting) 105 via which robot members 102, 103, 104 are supported. The term “robot member” refers to the movable parts of robotic device 101, whose actuation allows a physical interaction with the surroundings, for example to perform a task, for example to carry out an action.
For control, robotic device system 100 includes a control device 106 that is configured to achieve the interaction with the surroundings according to a control program. Last element 104 (viewed from base 105) of robot members 102, 103, 104 is also referred to as an end effector 104, and may include one or multiple tools such as a welding torch, a gripping tool, a painting device, or the like.
The other robot members 102, 103 (closer to base 105) may form a positioning device, so that together with end effector 104, a robot arm (or articulated arm) with end effector 104 at its end is provided. The robot arm is a mechanical arm that may fulfill functions similarly to a human arm (possibly including a tool at its end).
Robotic device 101 may include connecting elements 107, 108, 109 that connect robot members 102, 103, 104 to one another and to base 105. A connecting element 107, 108, 109 may include one or multiple articulated joints, each of which may provide a rotational movement and/or a translational movement (i.e., a displacement) for associated robot members relative to one another. The movement of robot members 102, 103, 104 may be initiated with the aid of actuators that are controlled by control device 106.
The term “actuator” may be understood as a component that is suitable for influencing a mechanism as a response to the actuator being driven. The actuator may convert instructions (so-called activation), output by control device 106, into mechanical movements. The actuator, for example an electromechanical converter, may be configured to convert electrical energy into mechanical energy as a response to the actuator being activated.
The term “control device” may be understood as any type of logical implementation unit, which may include a circuit and/or a processor, for example, that is able to execute software, firmware, or a combination of same stored in a memory medium, and that may issue instructions, for example to an actuator in the present example. The control device may be configured to control the operation of a system, in the present example a robot, using program code (software, for example).
In the present example, control device 106 includes a computer 110, and a memory 111 that stores code and data on the basis of which computer 110 controls robotic device 101. According to various specific embodiments, control device 106 controls robotic device 101 based on a robot control model 112 stored in memory 111.
According to various specific embodiments, robotic device system 100 may include one or multiple sensors 113. The one or multiple sensors 113 may be configured to provide sensor data that characterize a state of the robotic device. For example, the one or multiple sensors 113 may include an imaging sensor such as a camera (for example, a standard camera, a digital camera, an infrared camera, a stereo camera, etc.), a radar sensor, a LIDAR sensor, a position sensor, a speed sensor, an ultrasonic sensor, an acceleration sensor, a pressure sensor, etc.
Robotic device 101 may be in a state st of a plurality of states. According to various specific embodiments, at any point in time robotic device 101 may be in a present state of the plurality of states. The particular state of the plurality of states may be ascertained using the sensor data provided by the one or multiple sensors 113.
Robotic device 101 may be configured to carry out a plurality of actions. The actions of the plurality of actions may, for example, be predefined in the program code of control device 106. One or multiple actions of the plurality of actions may include, for example, a mechanical movement of one or multiple robot members 102, 103, 104. One or multiple actions of the plurality of actions may include, for example, an action of the end effector (for example gripping, for example releasing, etc.). According to various specific embodiments, carried-out action at in a present state st of robotic device 101 may result in a resulting state of the plurality of states of robotic device 101.
Robot control model 112 may be a reinforcement learning-based model. For example, robot control model 112 may implement a reinforcement learning algorithm.
Robot control model 112 may be configured to ascertain an action of the plurality of actions for a state of the plurality of states. For example, robot control model 112 may output an action of the plurality of actions in response to an input of a state of the plurality of states. Robot control model 112 may map from a state of the plurality of states onto an action of the plurality of actions. The states of the plurality of states may form a state space. The actions of the plurality of actions may form an action space. Robot control model 112 may map from the state space onto the action space.
According to various specific embodiments, robot control model 112 may include a policy π. For example, robot control model 112 may pursue a policy at any point in time. A particular policy may be associated with an objective and/or a task. For example, a particular policy may be a policy for achieving the objective or for fulfilling the task. According to various specific embodiments, a policy may output an action of the plurality of actions in response to an input of a state of the plurality of states. The policy used by robot control model 112 may map from the state space onto the action space.
A particular probability distribution (a normal distribution, for example) of the plurality of actions may be associated with each state of the plurality of states. According to various specific embodiments, a policy may include or be a multivariate normal distribution (also referred to as a multidimensional normal distribution and/or as a multivariate Gaussian distribution). A multivariate normal distribution may be defined by an expected value vector and a covariance matrix. The expected value vector of the multivariate normal distribution of a policy may include an expected value for each state of the plurality of states. The covariance matrix (also referred to herein as covariance) of the multivariate normal distribution of a policy may be dependent on the plurality of states (for example, may be a function of same).
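The state-dependent multivariate normal policy described above may be illustrated by the following minimal Python sketch. The class name GaussianPolicy, the callable mean_net, the diagonal covariance parameterization via log_std, and the use of NumPy are assumptions for illustration only and are not part of the described robot control model.

    import numpy as np

    class GaussianPolicy:
        """Minimal sketch of a state-dependent multivariate normal policy pi(a|s)."""

        def __init__(self, mean_net, log_std):
            # mean_net: callable mapping a state vector to the expected value vector mu(s)
            # log_std: log standard deviations; exp(log_std)**2 forms a diagonal covariance
            self.mean_net = mean_net
            self.log_std = np.asarray(log_std, dtype=float)

        def distribution(self, state):
            """Return the expected value vector and (diagonal) covariance for a state."""
            mu = np.asarray(self.mean_net(state), dtype=float)
            cov = np.diag(np.exp(self.log_std) ** 2)
            return mu, cov

        def sample_action(self, state, rng=None):
            """Sample an action a ~ N(mu(s), Sigma(s)) for the present state."""
            if rng is None:
                rng = np.random.default_rng()
            mu, cov = self.distribution(state)
            return rng.multivariate_normal(mu, cov)

For example, robot control model 112 could realize mean_net with a neural network that maps the sensor-based state to the expected value vector, as described herein.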
According to various specific embodiments, control device 106 may be configured to control robotic device 101 in such a way that robotic device 101 executes and/or carries out the action ascertained by robot control model 112 for the present state of robotic device 101, using the present policy.
Control device 106 may be configured to ascertain a reward for the state of robotic device 101 resulting from the carried-out action. According to various specific embodiments, control device 106 may ascertain the reward for a resulting state, using a reward function. The algorithm for carrying out the reward function may be stored in memory 111, for example. For example, robot control model 112 may be configured to carry out the reward function. The reward ascertained for the resulting state may, for example, be associated with the carried-out action in conjunction with the initial state of robotic device 101.
According to various specific embodiments, robotic device 101 may carry out a sequence of actions, using robot control model 112. Control device 106 may be configured to ascertain each action of the sequence of actions, using an initial policy πold.
Control device 106 may be configured to ascertain a particular reward for each carried-out action of the carried-out sequence of actions.
Control device 106 (for example, computer 110 of control device 106) may be configured to ascertain an updated policy πθ, using the carried-out sequence of actions. Control device 106 may be configured to ascertain updated policy πθ in such a way that the expected reward given in equation (1) is increased (maximized, for example). In equation (1), τ=(s0, a0, . . . ) is the trajectory of states st and of actions at that are run through using the policy for achieving the objective or for fulfilling the task, γ is the discount factor, and s0˜ρ(s0), at˜π(at|st), and st+1˜P(st+1|st,at).
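A standard form of the expected discounted reward that is consistent with the definitions of τ, γ, ρ, π, and P above is, for example, the following (a hedged reconstruction; the reward notation r(st, at) is an assumption and may differ from the exact form of equation (1)):

    % Expected discounted reward over trajectories tau = (s_0, a_0, s_1, a_1, ...);
    % r(s_t, a_t) denotes the reward for carrying out action a_t in state s_t.
    G(\pi) = \mathbb{E}_{\tau}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right],
    \qquad s_0 \sim \rho(s_0), \quad a_t \sim \pi(a_t \mid s_t), \quad s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)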
A policy πθ may be defined by parameters θ of robot control model 112; for example, πθ may be parameterized by θ.
According to various specific embodiments, an updated policy πθ may be ascertained using equation (2). In equation (2), πold is the initial policy (for example, the previously used policy), and Aπ(at,st) is the advantage function. The advantage function may be ascertained, for example, by Aπ(at,st)=Qπ(at,st)−Vπ(st), where Qπ(at,st) is the action value function and Vπ(st) is the value function.
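A surrogate objective that is consistent with the above description (importance sampling with respect to initial policy πold and weighting by the advantage function) is, for example, the following hedged reconstruction of equation (2):

    % Importance-sampled surrogate objective with respect to the initial policy pi_old;
    % A^{pi_old}(a_t, s_t) = Q^{pi_old}(a_t, s_t) - V^{pi_old}(s_t) is the advantage function.
    \max_{\theta} \; \hat{\mathbb{E}}_{t}\left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \, A^{\pi_{\text{old}}}(a_t, s_t) \right]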
According to various specific embodiments, updated policy πθ may be ascertained using importance sampling.
According to various specific embodiments, updated policy πθ may be subject to one or multiple conditions (for example boundary conditions, for example constraints) with regard to initial policy πold. For example, a condition may be that updated policy πθ is within a trust region with regard to initial policy πold (for example, robot control model 112 may implement trust region-based reinforcement learning). For example, a condition may be that a similarity value according to a similarity metric between updated policy πθ and initial policy πold for each state st of the plurality of states is greater than a predefined threshold value. For example, a similarity value according to the similarity metric between the policy to be used and initial policy πold for each state st of the plurality of states may be greater than a predefined threshold value if a distance d between updated policy πθ and initial policy πold is less than or equal to predefined threshold value ϵ. According to various specific embodiments, a particular predefined threshold value ϵ may be associated with each state of the plurality of states. For example, with regard to equation (2), the condition (s.t.) that a similarity value according to the similarity metric between updated policy πθ and initial policy πold for each state st of the plurality of states is greater than predefined threshold value ϵ may be described according to equation (3):
d(πθ(⋅|st), πold(⋅|st))≤ϵ(st) (3).
The updated policy may be limited for each point (state, for example) in the state space. Updating the policy within the trust region has the effect that the policy approaches the optimal policy in increments that are not too large (for example, converges with same). A measure for a change in the policy used may be limited.
According to various specific embodiments, a condition may be that an entropy of updated policy πθ for each state st of the plurality of states is greater than or equal to a predefined entropy threshold value β. For example, the condition with regard to equation (2) may be described according to equation (4):
H(π(⋅|st))≥β(st) (4).
Use of the conditions regarding updated policy πθ according to equation (3) and optionally also equation (4) allows control over exploration and exploitation of the reinforcement learning of robot control model 112.
According to various specific embodiments, equation (2) together with the conditions according to equations (3) and (4) may be combined into a single constrained optimization problem, defined with the aid of equation (5).
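A constrained formulation that combines equation (2) with the per-state conditions of equations (3) and (4) is, for example, the following hedged reconstruction (the exact notation of equation (5) may differ):

    % Surrogate objective subject to a per-state trust region condition and a per-state entropy condition;
    % H denotes the entropy, epsilon(s_t) and beta(s_t) are per-state threshold values.
    \max_{\theta} \; \hat{\mathbb{E}}_{t}\left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \, A^{\pi_{\text{old}}}(a_t, s_t) \right]
    \quad \text{s.t.} \quad d\!\left( \pi_{\theta}(\cdot \mid s_t), \pi_{\text{old}}(\cdot \mid s_t) \right) \le \epsilon(s_t),
    \quad H\!\left( \pi_{\theta}(\cdot \mid s_t) \right) \ge \beta(s_t) \quad \forall \, s_t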
With reference to FIG. 2, control device 106 may be configured to carry out a sequence of actions by robotic device 101, using initial policy πold (in 202). As described herein, control device 106 may ascertain updated policy πθ according to equation (2) (in 204), it being possible for updated policy πθ to be limited by the conditions defined in equation (3) and optionally also in equation (4). Updated policy πθ may be a limited updated policy πθ. Updated policy πθ may be limited for each state of the plurality of states (for example, subject to the condition according to equation (3)). Updated policy πθ may be an updated policy πθ that is limited for each individual state. For example, each state of the plurality of states may have a particular predefined threshold value ϵ, so that predefined threshold value ϵ may be a predefined threshold value vector.
Control device 106 may be configured to ascertain a projected policy {tilde over (π)} (in 206). Control device 106 may be configured to project updated policy πθ onto a projected policy {tilde over (π)}. Control device 106 may be configured to project updated policy πθ onto projected policy {tilde over (π)} in such a way that for each state of the plurality of states of projected policy {tilde over (π)}, a similarity value according to a similarity metric between projected policy {tilde over (π)} and updated policy πθ is increased (maximized, for example). Control device 106 may be configured to project updated policy πθ onto projected policy {tilde over (π)} in such a way that for each state of the plurality of states of projected policy {tilde over (π)}, a similarity value according to the similarity metric between projected policy {tilde over (π)} and updated policy πθ is increased (maximized, for example), and that for each state of the plurality of states of projected policy {tilde over (π)}, a similarity value according to the similarity metric between projected policy {tilde over (π)} and initial policy πold is greater than predefined threshold value ϵ. Control device 106 may be configured to project updated policy πθ onto projected policy {tilde over (π)} in such a way that for each state of the plurality of states of projected policy {tilde over (π)}, a similarity value according to the similarity metric between projected policy {tilde over (π)} and updated policy πθ is increased (maximized, for example), that for each state of the plurality of states of projected policy {tilde over (π)}, a similarity value according to the similarity metric between projected policy {tilde over (π)} and initial policy πold is greater than predefined threshold value ϵ, and that for each state of the plurality of states of projected policy {tilde over (π)}, an entropy of updated policy πθ for each state st of the plurality of states is greater than or equal to predefined entropy threshold value β.
According to various specific embodiments, a particular policy may be described with the aid of an associated multivariate normal distribution. For example, initial policy πold may include an initial multivariate normal distribution of the plurality of actions. The initial multivariate normal distribution may be described by: πold(a|s)=N(a|μold(s),Σold(s)), where μold(s) is the initial expected value vector and Σold is the initial covariance of the initial multivariate normal distribution. For example, updated policy πθ may include an updated multivariate normal distribution of the plurality of actions. The updated multivariate normal distribution may be described by: πθ(a|s)=N(a|μ(s),Σ(s)), where μ is the updated expected value vector and Σ is the updated covariance of the updated multivariate normal distribution. The initial expected value vector, the initial covariance, the updated expected value vector, and/or the updated covariance may be a function of the plurality of states. For example, projected policy {tilde over (π)} may include a projected multivariate normal distribution of the plurality of actions. The projected multivariate normal distribution may be described by: {tilde over (π)}(a|s)=N(a|{tilde over (μ)}(μold,μ,Σold,Σ,ϵ(s)),{tilde over (Σ)}(μold,μ,Σold,Σ,ϵ(s),β(s))), where {tilde over (μ)} is the projected expected value vector and {tilde over (Σ)} is the projected covariance of the projected multivariate normal distribution.
The projected expected value vector may be a function of the initial expected value vector, the updated expected value vector, the initial covariance, the updated covariance, the predefined threshold value, and/or the plurality of states. The projected covariance may be a function of the initial expected value vector, the updated expected value vector, the initial covariance, the updated covariance, the predefined threshold value, the plurality of states, and/or the predefined entropy threshold value.
Control device 106 may be configured to project updated policy πθ onto projected policy {tilde over (π)} in such a way that for each state of the plurality of states of projected policy {tilde over (π)}, a similarity value according to a similarity metric between the projected multivariate normal distribution and the updated multivariate normal distribution is increased (maximized, for example). Control device 106 may be configured to project updated policy πθ onto projected policy {tilde over (π)} in such a way that for each state of the plurality of states of projected policy {tilde over (π)}, a similarity value according to the similarity metric between the projected multivariate normal distribution and the updated multivariate normal distribution is increased (maximized, for example), and that for each state of the plurality of states of projected policy {tilde over (π)}, a similarity value according to the similarity metric between the projected multivariate normal distribution and the initial multivariate normal distribution is greater than predefined threshold value ϵ. Control device 106 may be configured to project updated policy πθ onto projected policy {tilde over (π)} in such a way that for each state of the plurality of states of projected policy {tilde over (π)}, a similarity value according to the similarity metric between the projected multivariate normal distribution and the updated multivariate normal distribution is increased (maximized, for example), that for each state of the plurality of states of projected policy {tilde over (π)}, a similarity value according to the similarity metric between the projected multivariate normal distribution and the initial multivariate normal distribution is greater than predefined threshold value ϵ, and that for each state of the plurality of states of the projected multivariate normal distribution, an entropy of the updated multivariate normal distribution for each state st of the plurality of states is greater than or equal to predefined entropy threshold value β.
As described herein, a similarity value may be described according to the similarity metric, using a distance d. According to various specific embodiments, the similarity value according to the similarity metric between the projected multivariate normal distribution of projected policy {tilde over (π)} and the updated multivariate normal distribution of updated policy πθ may have a distance d between the projected multivariate normal distribution of projected policy {tilde over (π)} and the updated multivariate normal distribution. According to various specific embodiments, the similarity value according to the similarity metric between the projected multivariate normal distribution of projected policy {tilde over (π)} and the initial multivariate normal distribution of initial policy πold may have a distance d between the projected multivariate normal distribution of projected policy {tilde over (π)} and the initial multivariate normal distribution of initial policy πold.
According to various specific embodiments, the optimization problem for ascertaining projected policy {tilde over (π)} may be described according to equations (6) through (8).
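A per-state projection problem that is consistent with the surrounding description (minimizing the distance to the updated distribution subject to the trust region condition and the entropy condition with respect to the initial distribution) is, for example, the following hedged reconstruction of equations (6) through (8):

    % Per-state projection of the updated Gaussian (mu, Sigma) onto the trust region around (mu_old, Sigma_old);
    % d is the chosen distance (similarity metric), H the entropy, epsilon(s) and beta(s) per-state thresholds.
    \min_{\tilde{\mu}(s), \tilde{\Sigma}(s)} \; d\!\left( \mathcal{N}(\tilde{\mu}(s), \tilde{\Sigma}(s)), \, \mathcal{N}(\mu(s), \Sigma(s)) \right)
    \quad \text{s.t.} \quad d\!\left( \mathcal{N}(\tilde{\mu}(s), \tilde{\Sigma}(s)), \, \mathcal{N}(\mu_{\text{old}}(s), \Sigma_{\text{old}}(s)) \right) \le \epsilon(s),
    \quad H\!\left( \mathcal{N}(\tilde{\mu}(s), \tilde{\Sigma}(s)) \right) \ge \beta(s)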
According to various specific embodiments, the projection of updated policy πθ onto projected policy {tilde over (π)} may take place in such a way that projected policy {tilde over (π)} is an unlimited projected policy {tilde over (π)}. Projected policy {tilde over (π)} may be ascertained in such a way that the conditions (cf. equation (6), for example) are met. Updated policy πθ is projected onto projected policy {tilde over (π)} in such a way that the projected multivariate normal distribution is as close as possible to the updated multivariate normal distribution (for example, that the distance between the projected multivariate normal distribution and the updated multivariate normal distribution is minimal), and that the projected multivariate normal distribution (and thus projected policy {tilde over (π)}) meets the conditions.
The projected multivariate normal distribution may be described with the aid of projected expected value vector {tilde over (μ)} and projected covariance {tilde over (Σ)}. The projection of updated policy πθ onto projected policy {tilde over (π)} may include ascertaining projected expected value vector {tilde over (μ)} and projected covariance {tilde over (Σ)}.
The updated multivariate normal distribution may be projected onto the projected multivariate normal distribution under the one or multiple conditions described herein. The updated multivariate normal distribution may be projected onto the projected multivariate normal distribution [in such a way] that the one or multiple conditions described herein are met for each state of the plurality of states.
Three examples of projection methods for ascertaining projected expected value vector {tilde over (μ)} and projected covariance {tilde over (Σ)} are described below:
(I) First Projection Method
According to various specific embodiments, equation (6) of the optimization problem may be described, using the Mahalanobis distance and the Frobenius norm, with regard to projected expected value vector {tilde over (μ)} and projected covariance {tilde over (Σ)} according to equation (9) (cf. the first two terms of equation (12), i.e., the Mahalanobis distance between the expected value vectors and the squared Frobenius norm of the covariance difference).
According to various specific embodiments, the expected value vector and the covariance may be independent of one another. For example, the expected value vector and the covariance may be considered independently. For example, the condition according to equation (7) for a predefined threshold value ϵμ of the expected value vector may be considered according to equation (10), and a predefined threshold value ϵΣ of the covariance may be considered according to equation (11):
(μold−{tilde over (μ)})TΣold−1(μold−{tilde over (μ)})≤ϵμ (10)
∥Σold−{tilde over (Σ)}∥F2≤ϵΣ (11).
A similarity value according to the similarity metric may be considered with regard to the expected value vector, and a similarity value according to the similarity metric may be considered with regard to the covariance. According to various specific embodiments, the optimization problem described according to equations (9) through (11) may be solved using a Lagrange multiplier method. For example, the Lagrange duality of equations (9) through (11) may be described according to Lagrange function L({tilde over (μ)}, {tilde over (Σ)}, ω, η) of equation (12):
L({tilde over (μ)},{tilde over (Σ)},ω,η)=(μ−{tilde over (μ)})TΣold−1(μ−{tilde over (μ)})+∥Σ−{tilde over (Σ)}∥F2+ω((μold−{tilde over (μ)})TΣold−1(μold−{tilde over (μ)})−ϵμ)+η(∥Σold−{tilde over (Σ)}∥F2−ϵΣ) (12),
where ω and η are Lagrange multipliers.
Solving equation (12) results in the projected expected value vector according to equation (13) and the projected covariance according to equation (14), where ω may be ascertained according to equation (15), and η may be ascertained according to equation (16).
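A closed-form projection of the kind referenced by equations (13) through (16) may be sketched in Python as follows. The function name project_mean_cov, the use of NumPy, and the concrete expressions for the multipliers ω and η (obtained by solving a Lagrangian of the form of equation (12) with active constraints) are assumptions of this sketch rather than a verbatim reproduction of equations (13) through (16).

    import numpy as np

    def project_mean_cov(mu_old, cov_old, mu, cov, eps_mu, eps_cov):
        """Project (mu, cov) back toward (mu_old, cov_old) so that the
        Mahalanobis bound (cf. equation (10)) and the Frobenius bound
        (cf. equation (11)) hold. Hedged reconstruction, not verbatim."""
        cov_old_inv = np.linalg.inv(cov_old)

        # Mean part: Mahalanobis distance between old and updated expected value vector.
        diff = mu - mu_old
        maha = float(diff @ cov_old_inv @ diff)
        if maha <= eps_mu:
            mu_proj = mu                                   # condition already met, no projection
        else:
            omega = np.sqrt(maha / eps_mu) - 1.0           # assumed multiplier (cf. equation (15))
            mu_proj = (mu + omega * mu_old) / (1.0 + omega)  # interpolation (cf. equation (13))

        # Covariance part: squared Frobenius distance between old and updated covariance.
        frob_sq = float(np.sum((cov - cov_old) ** 2))
        if frob_sq <= eps_cov:
            cov_proj = cov
        else:
            eta = np.sqrt(frob_sq / eps_cov) - 1.0           # assumed multiplier (cf. equation (16))
            cov_proj = (cov + eta * cov_old) / (1.0 + eta)   # interpolation (cf. equation (14))

        return mu_proj, cov_proj

In this sketch, the projection is applied only when the particular bound of equation (10) or (11) is violated, which corresponds to the layer behavior described further below (projection only if a condition is not met).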
(II) Second Projection Method
According to various specific embodiments, equation (6) of the optimization problem may be described according to equation (17), using the Wasserstein distance (for example, the scaled Wasserstein distance) with regard to projected expected value vector {tilde over (μ)} and projected covariance {tilde over (Σ)}, where tr is the trace of the matrix.
For two normal distributions, the Wasserstein distance includes a Euclidean distance of the expected values of the two normal distributions. Weighting this term with the inverse of initial covariance Σold, i.e., scaling the Wasserstein distance, results in the Mahalanobis distance (in this regard, cf. equation (17), for example).
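For reference, the squared 2-Wasserstein distance between two multivariate normal distributions has the following well-known form; the scaled variant referenced as equation (17) would additionally weight these terms with Σold−1, which is an assumption based on the description above:

    % Squared 2-Wasserstein distance between two Gaussians N(mu_1, Sigma_1) and N(mu_2, Sigma_2).
    W_2^2\!\left( \mathcal{N}(\mu_1, \Sigma_1), \, \mathcal{N}(\mu_2, \Sigma_2) \right)
    = \lVert \mu_1 - \mu_2 \rVert_2^2
    + \operatorname{tr}\!\left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2} \right)^{1/2} \right)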
As described herein, the expected value vector and the covariance may be considered independently of one another. For example, the condition according to equation (7) may be considered for predefined threshold value ϵμ of the expected value vector according to equation (18), and for predefined threshold value ϵΣ of the covariance according to equation (19), where I is the identity matrix (unit matrix).
According to various specific embodiments, the optimization problem described according to equations (17) through (19) may be solved using a Lagrange multiplier method. Reference is made to equations (13) and (15) for the solution with regard to projected expected value vector {tilde over (μ)}.
According to various specific embodiments, the optimization problem may be solved with regard to the root of the projected covariance. For example, the Lagrange duality of equations (17) and (19) may be described according to Lagrange function L({tilde over (Σ)}1/2, η) by equation (20):
L({tilde over (Σ)}1/2,η)=tr(Σold−1Σ+{tilde over (Σ)}1/2Σold−1{tilde over (Σ)}1/2−2Σ1/2Σold−1{tilde over (Σ)}1/2)+η(tr(I+{tilde over (Σ)}1/2Σold−1{tilde over (Σ)}1/2−2Σold−1/2{tilde over (Σ)}1/2)−ϵΣ) (20).
Solving equation (20) results in the projected covariance according to equation (21), where η may be ascertained according to equation (22).
The first projection method and the second projection method may thus be solved in a closed form (the projected multivariate normal distribution may be ascertained in a closed form).
(III) Third Projection Method
According to various specific embodiments, the optimization problem may be solved according to equations (6) through (8) with the aid of a numerical optimizer.
A multivariate normal distribution may be described with the aid of canonical parameter q (also referred to as natural parameter) and cumulant-generating function Λ.
Numerical optimizer 302 may be configured to solve the optimization problem according to equations (6) through (8) for a canonical parameter q and a cumulant-generating function Λ, in that numerical optimizer 302 ascertains a first optimized Lagrange multiplier η* and a second optimized Lagrange multiplier ω* for canonical parameter q and cumulant-generating function Λ. For example, numerical optimizer 302 may ascertain first optimized Lagrange multiplier η* and second optimized Lagrange multiplier ω*, using the KL divergence.
According to various specific embodiments, for updated covariance Σ 304, control device 106 may ascertain updated cumulant-generating function Λ 306 according to Λ=Σ−1. For example, control device 106 may ascertain updated canonical parameter q 310 according to q=Λμ for updated expected value vector μ 308 and updated cumulant-generating function Λ 306.
Numerical optimizer 302 may be configured to ascertain first optimized Lagrange multiplier η* and second optimized Lagrange multiplier ω*, using updated cumulant-generating function Λ 306 and updated canonical parameter q 310. Numerical optimizer 302 may be configured to ascertain first optimized Lagrange multiplier η* 316 and second optimized Lagrange multiplier ω* 318, using updated cumulant-generating function Λ 306, updated canonical parameter q 310, a first Lagrange multiplier η 312, and a second Lagrange multiplier ω 314. For example, first Lagrange multiplier η 312 and/or second Lagrange multiplier ω 314 may be predefined (set, for example). For example, numerical optimizer 302 may be configured to ascertain first Lagrange multiplier η 312 and/or second Lagrange multiplier ω 314.
Initial cumulant-generating function Λold may be ascertained using initial covariance Σold (for example, based on Λ=Σ−1). Initial canonical parameter qold may be ascertained using the initial cumulant-generating function and the initial expected value vector (for example, based on q=Λμ).
Projected canonical parameter {tilde over (q)} 320 may be ascertained according to equation (23). Projected cumulant-generating function {tilde over (Λ)} 322 may be ascertained according to equation (24).
Projected covariance {tilde over (Σ)} 324 may be ascertained using projected cumulant-generating function {tilde over (Λ)} 322 (for example, based on Λ=Σ−1). Projected expected value vector {tilde over (μ)} may be ascertained using projected canonical parameter {tilde over (q)} 320 and projected cumulant-generating function {tilde over (Λ)} 322 (for example, based on q=Λμ).
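The conversions between the moment parameters (μ, Σ) and the canonical parameters (q, Λ) stated above (Λ=Σ−1 and q=Λμ) may be sketched in Python as follows; the function names are assumptions. These conversions would be used before and after the ascertainment of {tilde over (q)} 320 and {tilde over (Λ)} 322 according to equations (23) and (24).

    import numpy as np

    def to_canonical(mu, cov):
        """Moment parameters (mu, Sigma) -> canonical parameters (q, Lambda):
        Lambda = Sigma^{-1}, q = Lambda @ mu."""
        lam = np.linalg.inv(cov)
        q = lam @ mu
        return q, lam

    def from_canonical(q, lam):
        """Canonical parameters (q, Lambda) -> moment parameters (mu, Sigma):
        Sigma = Lambda^{-1}, mu = Sigma @ q."""
        cov = np.linalg.inv(lam)
        mu = cov @ q
        return mu, cov

For example, from_canonical applied to {tilde over (q)} 320 and {tilde over (Λ)} 322 would return projected expected value vector {tilde over (μ)} and projected covariance {tilde over (Σ)} 324.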
It is pointed out that the projection of updated policy πθ onto projected policy {tilde over (π)} may also take place with the aid of other than the three projection methods described herein by way of example.
According to various specific embodiments, the projected policy is an optimized policy, and an optimal sequence of states of robotic device 101 and executed and/or carried-out actions of robotic device 101 may be ascertained with the aid of the projected policy and carried out by robotic device 101.
According to various specific embodiments, robot control model 112 may include or be a neural network. The projection of updated policy πθ onto projected policy {tilde over (π)} may be implemented as one or multiple layers in the neural network. For example, the projection of updated policy πθ onto projected policy {tilde over (π)} may be implemented as one or multiple differentiable layers in the neural network. According to various specific embodiments, the one or multiple layers may be configured in such a way that the projection described herein is carried out if one of the conditions for the updated policy is not met.
According to various specific embodiments, control device 106 may be configured to adapt robot control model 112 for implementing projected policy {tilde over (π)}.
According to various specific embodiments, robot control model 112 may include a neural network, and the adaptation of robot control model 112 may be a training of the neural network. For example, the neural network may be trained using a gradient method (a policy gradient method, for example). According to various specific embodiments, one or multiple gradients may be ascertained using projected policy {tilde over (π)} and initial policy πold.
The adaptation of the neural network using projected policy {tilde over (π)} may be an iteration of the training of the neural network. According to various specific embodiments, multiple iterations may be carried out. For example, the method described herein for adapting robot control model 112 may be carried out multiple times.
For example, the neural network of robot control model 112 may be adapted (trained, for example) with the aid of the gradient method, using the ascertained one or multiple gradients.
The first projection method and the second projection method may be solved in closed form. The one or multiple gradients may be ascertained directly. For the third projection method, the one or multiple gradients may be ascertained using the OptNet method described by Amos and Kolter, a layer of the neural network having a corresponding Lagrange duality.
According to various specific embodiments, the one or multiple gradients may be ascertained (computed, for example) by deriving the appropriate Karush-Kuhn-Tucker (KKT) conditions.
The stationary KKT condition may be described with the aid of equation (25), for example, where λ1 is a first KKT multiplier and λ2 is a second KKT multiplier.
The complementary slackness of the KKT multipliers may be described with the aid of equation (26), for example:
λ1(−η*)=0, λ2(−ω*)=0 (26).
According to various specific embodiments, the one or multiple gradients may be ascertained by deriving the Karush-Kuhn-Tucker (KKT) conditions. According to various specific embodiments, the one or multiple layers of the neural network may be configured in such a way that the projection is carried out if one of the conditions described herein (for example, the condition according to equation (3), for example the condition according to equation (4)) for updated policy πθ is not met. For example, the one or multiple gradients may be ascertained for the scenarios described below.
According to various specific embodiments, if at least one of the conditions is not met, the one or multiple layers of the neural network may project the policy as described herein. According to various specific embodiments, if at least one of the conditions is not met and if the third projection method is used, the one or multiple layers of the neural network may ascertain the one or multiple gradients.
One of the three projection methods may be implemented as one or multiple differentiable layers in a neural network, so that the neural network may be trained end-to-end in such a way that the one or multiple conditions (for example, the condition of the trust region) are ensured (met, for example) for each state of the plurality of states during the training.
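The conditional behavior of such a projection layer, i.e., passing the updated parameters through unchanged when all conditions are met and projecting them otherwise, may be sketched as follows for the first projection method. The helper project_mean_cov is the hypothetical function from the earlier sketch, and the use of plain NumPy is an assumption; in an automatic differentiation framework the same operations would be differentiable, so that gradients may flow through the layer.

    import numpy as np

    def trust_region_layer(mu_old, cov_old, mu, cov, eps_mu, eps_cov):
        """Sketch of a per-state trust region 'layer': pass the updated Gaussian through
        unchanged if both conditions hold, otherwise project it back into the trust region."""
        cov_old_inv = np.linalg.inv(cov_old)
        diff = mu - mu_old
        mean_ok = float(diff @ cov_old_inv @ diff) <= eps_mu       # condition of equation (10)
        cov_ok = float(np.sum((cov - cov_old) ** 2)) <= eps_cov    # condition of equation (11)

        if mean_ok and cov_ok:
            return mu, cov                      # all conditions met: no projection needed
        return project_mean_cov(mu_old, cov_old, mu, cov, eps_mu, eps_cov)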
According to various specific embodiments, control device 106 may be configured to control robotic device 101, using adapted robot control model 112.
Control device 106 may be configured to ascertain the present state of robotic device 101. Control device 106 may be configured to ascertain, with the aid of adapted robot control model 112, an action to be carried out for the present state, using the projected policy. The action to be carried out may be, for example, the action of the plurality of actions, described by the projected multivariate normal distribution, having the highest probability (for example, the action associated with the expected value of the present state). Control device 106 may be configured to control robotic device 101 corresponding to the action to be carried out, so that robotic device 101 executes and/or carries out the action. According to various specific embodiments, robotic device 101 may carry out one or multiple actions, using adapted robot control model 112.
According to various specific embodiments, control device 106 may update the policy, using the one or multiple carried-out actions. As described herein, an updated policy may be ascertained, and a projected policy may be ascertained using the updated policy. According to various specific embodiments, for example for the inference of robot control model 112 (the neural network, for example), the optimization problem may be solved according to equation (27).
Robot control model 112 may be adapted with the aid of a regression (for example, including one or multiple regression steps), using the one or multiple carried-out actions.
According to various specific embodiments, the projected policy may be ascertained according to the objective function given in equation (28). The projected policy may be ascertained in such a way that a difference between the expected reward (cf. equation (2), for example) and the similarity value according to the similarity metric between the projected policy and the updated policy is increased (maximized, for example). The similarity value according to the similarity metric between the projected policy and the updated policy may be ascertained, for example, via distance d({tilde over (π)}(at|st), πθ(at|st)) between the projected policy and the updated policy. The distance between the projected policy and the updated policy may be ascertained, for example, using the three projection methods described herein.
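An objective function that is consistent with the above description, i.e., maximizing the difference between the expected reward and the distance between the projected policy and the updated policy, is, for example, the following hedged reconstruction of equation (28); the weighting factor α is an assumption:

    % Penalized objective combining the expected reward with the distance between
    % the projected policy and the updated policy; alpha is an assumed weighting factor.
    \max_{\theta} \; \hat{\mathbb{E}}_{t}\left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \, A^{\pi_{\text{old}}}(a_t, s_t) \right]
    \; - \; \alpha \, \hat{\mathbb{E}}_{t}\left[ d\!\left( \tilde{\pi}(a_t \mid s_t), \, \pi_{\theta}(a_t \mid s_t) \right) \right]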
Method 400 may include carrying out a sequence of actions by the robotic device, using a robot control model (in 402). The carrying out of each action of the sequence of actions may include: ascertaining an action for a present state of a plurality of states of the robotic device with the aid of the robot control model, using an initial policy, carrying out the ascertained action by the robotic device, and ascertaining the state of the robotic device resulting from the carried-out action. According to various specific embodiments, the robot control model may be a reinforcement learning-based model (a reinforcement learning-based neural network, for example).
Method 400 may include an ascertainment of an updated policy, using the carried-out sequence of actions (in 404).
Method 400 may include a projection of the updated policy onto a projected policy (in 406). The projection of the updated policy onto a projected policy may take place in such a way that for each state of the plurality of states of the projected policy, a similarity value according to a similarity metric between the projected policy and the updated policy is increased (maximized, for example), and that for each state of the plurality of states of the projected policy, a similarity value according to the similarity metric between the projected policy and the initial policy is greater than a predefined threshold value. The projection of the updated policy onto a projected policy may take place in such a way that for each state of the plurality of states of the projected policy, a similarity value according to the similarity metric between the projected policy and the updated policy is increased (maximized, for example), that for each state of the plurality of states of the projected policy, a similarity value according to the similarity metric between the projected policy and the initial policy is greater than the predefined threshold value, and that for each state of the plurality of states of the projected policy, an entropy of the projected policy is greater than or equal to a predefined entropy threshold value.
Method 400 may include an adaptation of the robot control model for implementing the projected policy (in 408).
Method 400 may include a control of the robotic device, using the adapted robot control model (in 410).
Method 500 may include carrying out a sequence of actions by the robotic device, using a robot control model (in 502). The carrying out of each action of the sequence of actions may include: ascertaining an action for a present state of a plurality of states of the robotic device with the aid of the robot control model, using an initial policy, carrying out the ascertained action by the robotic device, and ascertaining the state of the robotic device resulting from the carried-out action. According to various specific embodiments, the robot control model may be a reinforcement learning-based model (a reinforcement learning-based neural network, for example).
Method 500 may include an ascertainment of an updated policy, using the carried-out sequence of actions (in 504).
Method 500 may include an ascertainment of a projected policy, so that a difference between a reward expected for the projected policy and a similarity value according to the similarity metric between each state of the plurality of states of the projected policy and the updated policy is increased (maximized, for example) (in 506).
Method 500 may include a control of the robotic device with the aid of the robot control model, using the projected policy (in 508).
According to various specific embodiments, method 500 may include an adaptation of the robot control model for implementing the projected policy, and a control of the robotic device, using the adapted robot control model.
Other Publications:
Yang, Tsung-Yen, et al., “Projection-Based Constrained Policy Optimization,” International Conference on Learning Representations, 2019.
Abdullah, Mohammed Amin, et al., “Wasserstein Robust Reinforcement Learning,” arXiv preprint arXiv:1907.13196, 2019.
Law, Marc T., et al., “Closed-Form Training of Mahalanobis Distance for Supervised Clustering,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Yang et al., “Projection-Based Constrained Policy Optimization” (blind review version), accessed Jun. 6, 2024.
OpenReview page detailing the original publication date of Yang et al., accessed Jun. 6, 2024.
Schulman et al., “Trust Region Policy Optimization,” Proceedings of the 31st International Conference on Machine Learning, JMLR: W&CP, vol. 37, 2015, pp. 1-9. <http://proceedings.mlr.press/v37/schulman15.pdf> Downloaded Sep. 10, 2021.
Abdolmaleki et al., “Model-Based Relative Entropy Stochastic Search,” Advances in Neural Information Processing Systems, 2015, pp. 1-9. <https://papers.nips.cc/paper/2015/file/36ac8e558ac7690b6f44e2cb5ef93322-Paper.pdf> Downloaded Sep. 10, 2021.
Akrour et al., “Projections for Approximate Policy Iteration Algorithms,” Proceedings of the 36th International Conference on Machine Learning, PMLR, vol. 97, 2019, pp. 1-10. <http://proceedings.mlr.press/v97/akrour19a/akrour19a.pdf> Downloaded Sep. 10, 2021.
Amos et al., “OptNet: Differentiable Optimization as a Layer in Neural Networks,” Proceedings of the 34th International Conference on Machine Learning, PMLR, vol. 70, 2017, pp. 1-10. <https://dl.acm.org/doi/pdf/10.5555/3305381.3305396> Downloaded Sep. 10, 2021.