This application is related to obtaining parameters of a target neural network.
Humans possess an ability to adapt their behavior to new situations. Beyond simple tuning, humans can adopt entirely novel ways of moving their bodies, for example walking on crutches, with little to no training after an injury. The learning process that generalizes across all past experience and modes of behavior to rapidly output the needed behavior policy for a new situation is a hallmark of human intelligence.
A neural network model pertaining to a Markov decision process (MDP) may include a policy for determining each articulation of joints in a robot arm several times per second. The policy may be a part of an artificial intelligence machine called an agent in the robot.
A problem in the realm of robots is that a policy, possibly trained with near-optimal reinforcement learning (RL), will not perform well on a related, but different task. The robot may be shipped from a robot factory to a place of deployment (a home or a factory) with the policy installed at the robot factory.
For example, a robot trained to pick up a hammer may not pick up a coffee cup using the hammer policy.
The coffee cup example can be accommodated by storing a separate coffee cup policy in the robot. This approach requires exhaustively anticipating the possible tasks.
However, storing one policy for each possible task is an approach limited to the tasks known before the robot is deployed. The robot will not be able to do a new task. Also, the memory required in the robot will grow excessively with the number of exhaustively-anticipated tasks.
Embodiments of the present disclosure may solve the above technical problems.
This application provides a strong zero-shot behavior generalization approach based on hypernetworks. Hypernetworks allow a deep hyper-learner to output all parameters of a target neural network.
Embodiments provided herein train on the full solutions of numerous RL problems in a family of MDPs, where either the reward or the dynamics (often both) can change between task instances. The trained policies, value functions, and rolled-out optimal behavior of each source task constitute the training information from which embodiments learn to generalize.
Hypernetworks of embodiments output the parameters of a fully-formed and highly performing policy without any experience in a related but unseen task, by conditioning on provided task parameters.
The differences between the tasks lead to large and complicated changes in the optimal policy and in the induced optimal trajectory distribution. Learning to predict new policies from this data requires powerful learners guided by helpful loss functions. Embodiments show that the abstraction and modularity properties afforded by hypernetworks allow them to approximate RL-generated solutions by mapping a parametrized MDP family to a set of optimal solutions.
Embodiments achieve strong zero-shot transfer to new rewards and dynamics settings by exploiting commonalities in the MDP structure.
Embodiments are applicable across families of continuous control environments which are parameterized by physical dynamics, task reward, or both.
Embodiments include contextual zero-shot evaluation, where the learner is provided the parameters of the test task, but is not given any training time—rather the very first policy execution at test time is used to measure performance.
Embodiments outperform selected well-known baselines, in many cases recovering nearly full performance without a single time step of training data on the target tasks.
Ablations show that hypernetworks are a critical element in achieving strong generalization and that a structured TD-like loss, see Equation 5, is additionally helpful in training these networks.
Embodiments disclose hypernetworks which are a scalable and practical approach for approximating RL algorithms as a mapping from a family of parameterized MDPs to a family of near optimal policies.
Some embodiments include a TD-based loss for regularization of the generated policies and value functions to be consistent with respect to the Bellman equation.
Embodiments are applicable to a series of modular and customizable continuous control environments for transfer learning.
Provided herein is a method of training a hypernetwork, the method including: initializing the hypernetwork; sampling a mini-batch of system parameter sets from a plurality of system parameter sets; generating, using the hypernetwork, policy weights for a policy; generating, using the hypernetwork, value function weights for a value function; calculating a first loss, L_pred, using the mini-batch; calculating a second loss, L_TD, using the mini-batch; updating the hypernetwork using the first loss and the second loss; and repeating the sampling through the updating until the hypernetwork has converged.
Also provided herein is an apparatus including: one or more processors; and one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least: initialize a hypernetwork; sample a mini-batch of system parameter sets from a plurality of system parameter sets; generate, using the hypernetwork, policy weights for a policy; generate, using the hypernetwork, value function weights for a value function; calculate a first loss, L_pred, using the mini-batch; calculate a second loss, L_TD, using the mini-batch; update the hypernetwork using the first loss and the second loss; and repeatedly perform the sample through update operations until the hypernetwork has converged.
Also provided herein is a non-transitory computer readable medium storing instructions, the instructions configured to cause an apparatus to at least: initialize a hypernetwork; sample a mini-batch of system parameter sets from a plurality of system parameter sets; generate, using the hypernetwork, policy weights for a policy; generate, using the hypernetwork, value function weights for a value function; calculate a first loss, L_pred, using the mini-batch; calculate a second loss, L_TD, using the mini-batch; update the hypernetwork using the first loss and the second loss; and repeatedly perform the sample through update operations until the hypernetwork has converged.
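Purely for illustration, a minimal sketch of this training procedure is given below, assuming a hypothetical hypernet module that maps sampled system parameter sets (contexts) to policy and value-function weights, a hypothetical dataset object, and hypothetical loss helpers compute_l_pred and compute_l_td; the actual implementation of embodiments may differ.

```python
import torch

def train_hypernetwork(hypernet, dataset, num_steps=100_000, lr=1e-4, tol=1e-5):
    """Sketch of the training method: sample, generate weights, compute losses, update."""
    optimizer = torch.optim.Adam(hypernet.parameters(), lr=lr)
    prev_loss = float("inf")
    for step in range(num_steps):
        # Sample a mini-batch of system parameter sets (contexts) with associated
        # near-optimal actions, values, and transitions from the dataset.
        contexts, batch = dataset.sample_minibatch()          # hypothetical dataset API
        # Generate, using the hypernetwork, the policy and value-function weights.
        policy_weights, value_weights = hypernet(contexts)
        # Calculate L_pred and L_TD (hypothetical helpers standing in for Equations 4 and 5).
        l_pred = compute_l_pred(policy_weights, value_weights, batch)
        l_td = compute_l_td(policy_weights, value_weights, batch)
        loss = l_pred + l_td
        # Update the hypernetwork using the first loss and the second loss.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Repeat until the hypernetwork has converged (loss no longer changes significantly).
        if abs(prev_loss - loss.item()) < tol:
            break
        prev_loss = loss.item()
    return hypernet
```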
The text and figures are provided solely as examples to aid the reader in understanding the invention. They are not intended and are not to be construed as limiting the scope of this invention in any manner. Although certain embodiments and examples have been provided, it will be apparent to those skilled in the art based on the disclosures herein that changes in the embodiments and examples shown may be made without departing from the scope of embodiments provided herein.
Some frequently used terms are discussed here.
A hypernetwork is a neural network that outputs the parameters of a target neural network, and is thereby able to synthesize a particular policy from a set of policies. The particular policy is a near-optimal solution for a particular robot arm activity.
Optimal trajectories may be obtained using software modeling or recording real-world data of proper (state, action, reward) points.
Once the Q-Values are known, the optimal policy is the one that, in each state, chooses the action with the highest Q-Value: π*(s) = argmax_a Q*(s, a), where Q*(s, a) is the sum of discounted future rewards the agent can expect on average after it reaches the state s and chooses the action a.
The TD learning algorithm updates a current value estimate toward a target formed from an observed transition: Vk+1(s) ← (1 − α)·Vk(s) + α·(r + γ·Vk(s′)), where α is a learning rate, r is the observed reward, and s′ is the next state.
Trajectory: one experience of moving through the MDP.
Policy: an algorithm a software agent uses to determine its actions. The policy may be a neural network. The parameters of the policy are specific to the application of the robot arm.
Context: for the i-th MDP Mi ∈ Mfamily, a context includes the reward function Rψ with parameters ψi and the dynamics function Tμ with parameters μi.
The Bellman optimality equation is: V*(s) = max_a Σ_s′ T(s, a, s′)·[R(s, a, s′) + γ·V*(s′)], for all s.
T(s, a, s′) is the transition probability from state s to state s′, given that the agent 20 chose action a. This may also be referred to as the dynamics function Tμ, where μ represents the parameters of the dynamics function under which the coordinate points of the trajectory are collected.
R(s, a, s′) is the reward that the agent 20 receives when it goes from state s to state s′ for the chosen action a. This may also be referred to as the reward function Rψ, where ψ represents the parameters of the reward function.
γ is a discount factor.
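Purely as an illustrative sketch of the terms above, a trajectory may be represented as a sequence of (state, action, reward, next state) samples and a context as the pair of reward parameters ψ and dynamics parameters μ; the field and type names below are assumptions, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class Transition:
    state: np.ndarray       # s
    action: np.ndarray      # a
    reward: float           # R(s, a, s')
    next_state: np.ndarray  # s'

# Trajectory: one experience of moving through the MDP, stored sample by sample.
Trajectory = List[Transition]

@dataclass
class Context:
    psi: np.ndarray  # reward parameters ψ of the reward function Rψ
    mu: np.ndarray   # dynamics parameters μ of the dynamics function Tμ
```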
At operation S11, the logic obtains the context 6 of the MDP for the specific task 7. At operation S12, the logic generates weights 8 for the policy and value function neural networks using the hypernetwork 10 (Hθ). The policy πk is defined by the weights 8.
Operation S13 indicates that the agent 20 is queried at time t for an action to take. The action will be found using the policy πk.
At operation S14, the robot 30, at time t, takes action at using the policy πk. The robot 30 has now moved to a new state (St+1). The logic returns to S13 to find the next action to take. The series of actions accomplishes the task 7 corresponding to πk. Task 7 corresponds to context 6. Thus, the robot takes a concrete action it previously was not configured to perform.
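A minimal sketch of operations S11 through S14, assuming a hypothetical hypernet object, a policy network with a load_generated_weights helper, and a simplified robot/environment interface, is given below; none of these names are part of the disclosure.

```python
import torch

def run_task(hypernet, policy_net, robot_env, context, max_steps=1000):
    """Sketch of operations S11-S14: generate a policy from the context, then act with it."""
    # S11/S12: the hypernetwork Hθ generates weights for the policy πk from the context.
    policy_weights, _ = hypernet(context)
    policy_net.load_generated_weights(policy_weights)  # hypothetical helper on the policy net
    state = robot_env.reset()
    for t in range(max_steps):
        # S13: the agent is queried at time t for an action to take.
        with torch.no_grad():
            action = policy_net(state)
        # S14: the robot takes action at and moves to the new state (St+1).
        state, done = robot_env.step(action)  # simplified robot/environment interface
        if done:
            break
```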
Referring generally to
At the factory, the hypernetwork 10 is trained over the family Mfamily. Each member of the family, Mi, is associated with a Reward function Rψ with parameters ψ and a Dynamics function Tμ with parameters μ. The parameterized family Mfamily is indicated on the left hand portion of
The RL algorithm can be used on a member of the family to find the near-optimal policy and near-optimal value function as shown in Equation 1.
Assuming that MDP Mi can be characterized by its parameters ψi and μi, Equation 1 can be simplified as Equation 2.
The near-optimal policy can be rolled out in an environment to obtain near-optimal trajectories as shown in Equation 3.
Two tasks are related if their reward parameters ψ exhibit a cross-correlation above a first predetermined cross-correlation threshold and if their dynamics function parameters μ exhibit a cross-correlation above a second predetermined cross-correlation threshold.
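Purely for illustration, such a relatedness check could be computed as follows; the thresholds and the use of Pearson correlation over flattened parameter vectors are assumptions, not part of the disclosure.

```python
import numpy as np

def tasks_related(psi_a, psi_b, mu_a, mu_b, psi_threshold=0.9, mu_threshold=0.9):
    """Illustrative relatedness check between two tasks using Pearson correlation."""
    psi_corr = np.corrcoef(np.ravel(psi_a), np.ravel(psi_b))[0, 1]
    mu_corr = np.corrcoef(np.ravel(mu_a), np.ravel(mu_b))[0, 1]
    return psi_corr > psi_threshold and mu_corr > mu_threshold
```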
The near-optimal reinforcement learning solution for any task is listed on the right hand side of
Using the hypernetwork 10, performance similar to that of the RL solution 49 is obtained, see
On the left hand portion of
Step 51 indicates obtaining trajectories. Further details of obtaining the trajectories are given in
After step 54, the robot 30 is deployed from the robot factory.
In use, the context 6 of a new task is given to the (trained) hypernetwork 10 to obtain a new policy. For new tasks indexed by 1, . . . , N, new policies πk for k=1, . . . , N are obtained; each of these corresponds to a related, but different, robot task.
Pseudocode for building up the dataset 53 is provided in Table 1.
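Table 1 is not reproduced here; the following is only an assumed sketch of such a dataset-building loop, with solve_with_rl and rollout as hypothetical stand-ins for an off-the-shelf RL solver and a rollout routine, and a hypothetical mdp_family interface.

```python
def build_dataset(mdp_family, num_tasks, num_rollouts):
    """Build training data: solve each sampled MDP with RL, then roll out trajectories."""
    dataset = []
    for _ in range(num_tasks):
        # Sample a member Mi of the family, characterized by its parameters (ψi, μi).
        psi, mu, env = mdp_family.sample_task()        # hypothetical family interface
        # As in Equations 1 and 2: obtain the near-optimal policy and value function.
        policy_i, value_i = solve_with_rl(env)         # stand-in RL solver
        # As in Equation 3: roll out the near-optimal policy to get near-optimal trajectories.
        trajectories = [rollout(env, policy_i) for _ in range(num_rollouts)]
        dataset.append({"psi": psi, "mu": mu,
                        "policy": policy_i, "value": value_i,
                        "trajectories": trajectories})
    return dataset
```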
If enough trajectories have been obtained, the logic flows to logic 79 of
Referring generally to
Some embodiments build up the dataset 53 using observed ideal actions, for example, successful robot arm articulations for picking up a coffee cup without crushing the cup and without spilling the coffee. Some embodiments include receiving a plurality of first trajectories; and solving for a plurality of first reward parameters (ψ) and a plurality of first transition dynamics parameters (μ) based on the plurality of first trajectories.
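Purely as an illustrative sketch (not necessarily how embodiments solve for ψ and μ), such parameters could be estimated by least-squares fitting of parameterized reward and dynamics models, here hypothetical callables reward_model and dynamics_model, to the observed transitions.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_parameters(transitions, reward_model, dynamics_model, psi0, mu0):
    """Estimate reward parameters ψ and dynamics parameters μ from observed (s, a, r, s') data."""
    s = np.array([t.state for t in transitions])
    a = np.array([t.action for t in transitions])
    r = np.array([t.reward for t in transitions])
    s_next = np.array([t.next_state for t in transitions])

    # Residuals of a parameterized reward model Rψ against the observed rewards.
    def reward_residuals(psi):
        return reward_model(s, a, s_next, psi) - r

    # Residuals of a parameterized dynamics model Tμ against the observed next states.
    def dynamics_residuals(mu):
        return (dynamics_model(s, a, mu) - s_next).ravel()

    psi_hat = least_squares(reward_residuals, psi0).x
    mu_hat = least_squares(dynamics_residuals, mu0).x
    return psi_hat, mu_hat
```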
During training, some embodiments perform learning to obtain the hypernetwork 10 (Hθ). In such embodiments, generating, using the hypernetwork, the policy weights for a policy (πi) comprises solving for a plurality of first policy parameters (θ) based on the plurality of first reward parameters (ψ) and the plurality of first transition dynamics parameters (μ), and based on a plurality of second reward parameters (ψ) and a plurality of second transition dynamics parameters (μ). Generating, using the hypernetwork, the value weights for a value function comprises solving for a plurality of first value parameters (ϕ) based on the plurality of first reward parameters (ψ) and the plurality of first transition dynamics parameters (μ), and based on the plurality of second reward parameters (ψ) and the plurality of second transition dynamics parameters (μ).
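A minimal sketch of a hypernetwork Hθ that maps a context (ψ, μ) to flat weight vectors for the policy and the value function is shown below, assuming simple fully-connected layers; the actual architecture of embodiments may differ.

```python
import torch
import torch.nn as nn

class Hypernetwork(nn.Module):
    """Hθ: maps a context (ψ, μ) to flat weight vectors for a policy net and a value net."""
    def __init__(self, context_dim, policy_param_count, value_param_count, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Separate heads emit the parameters of the two target networks.
        self.policy_head = nn.Linear(hidden, policy_param_count)
        self.value_head = nn.Linear(hidden, value_param_count)

    def forward(self, context):
        h = self.trunk(context)
        return self.policy_head(h), self.value_head(h)

# Usage sketch: context = torch.cat([psi, mu], dim=-1); w_pi, w_v = hypernet(context)
```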
At operation S72, weights 8 produced by the hypernetwork 10 for the policy and for the value function are obtained.
At operation S73, based on the policy and value function, the hypernetwork is updated using L_pred+L_TD (see equations 4 and 5 below).
At operation S74, a convergence test for the hypernetwork 10 is applied. Convergence may be detected by recognizing that the weights no longer change significantly after each mini-batch, or by determining that an error from a ground truth value is below a predetermined maximum allowable error.
The loss L_pred is given by Equation 4 and the loss L_TD is given by Equation 5.
In Equation 5,
When the hypernetwork 10 has converged, it may be installed in the robot 30 and the robot 30 shipped from the robot factory. If the hypernetwork 10 has not converged, another mini-batch of trajectories is sampled.
Referring generally to
In some embodiments, L_pred is based on predicting a near-optimal action and based on predicting a near-optimal value, and L_TD is based on moving the predicted target value toward a current value estimate.
Referring to operation S73 of
Also referring to operation S73 in
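Equations 4 and 5 are not reproduced here. The following is only an assumed sketch of losses with the stated properties: L_pred supervises the generated policy and value function against near-optimal actions and values from the dataset, and L_TD pulls the generated value prediction toward a bootstrapped Bellman target. The helpers functional_policy and functional_value, which evaluate networks using the hypernetwork-emitted weights, are hypothetical.

```python
import torch
import torch.nn.functional as F

def prediction_and_td_losses(policy_w, value_w, batch, gamma=0.99):
    """Sketch of an L_pred-like loss and an L_TD-like loss on hypernetwork-generated weights."""
    s, a_star, v_star = batch["s"], batch["a_star"], batch["v_star"]
    r, s_next = batch["r"], batch["s_next"]

    # Evaluate the generated networks with the weights emitted by the hypernetwork
    # (functional_policy / functional_value are hypothetical helpers).
    a_pred = functional_policy(policy_w, s)
    v_pred = functional_value(value_w, s)
    v_next = functional_value(value_w, s_next)

    # L_pred: predict the near-optimal action and the near-optimal value.
    l_pred = F.mse_loss(a_pred, a_star) + F.mse_loss(v_pred, v_star)

    # L_TD: keep the generated value function consistent with the Bellman equation by
    # pulling the prediction toward the bootstrapped target r + γ·V(s′).
    td_target = r + gamma * v_next.detach()
    l_td = F.mse_loss(v_pred, td_target)
    return l_pred, l_td
```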
Embodiments improve the performance of models rolled out to perform new robot tasks.
For example, the reward setting may be changed, such as a different target speed in a Cheetah environment.
The meta policy is a context-conditioned meta policy, trained to predict actions and evaluated for both zero-shot and few-shot transfer. The context-conditioned meta policy replaces the inferred task with the ground-truth task.
A conditional policy is a context-conditioned policy. It is trained to predict actions, similarly to imitation learning methods. The conditional policy+UVFA baseline additionally uses the TD loss term.
Hardware for performing embodiments provided herein is now described with respect to
This application claims benefit of priority to U.S. Provisional Application No. 63/434,034 filed in the USPTO on Dec. 20, 2022. The content of the above application is hereby incorporated by reference.