Many robots are explicitly programmed to utilize one or more end effectors to manipulate one or more environmental objects. For example, a robot may utilize a grasping end effector such as an “impactive” gripper or “ingressive” gripper (e.g., physically penetrating an object using pins, needles, etc.) to pick up an object from a first location, move the object to a second location, and drop off the object at the second location. Some additional examples of robot end effectors that may grasp objects include “astrictive” end effectors (e.g., using suction or vacuum to pick up an object) and one or more “contigutive” end effectors (e.g., using surface tension, freezing or adhesive to pick up an object), to name just a few.
Some implementations disclosed herein are related to using large-scale reinforcement learning to train a policy model that can be utilized by a robot in performing a robotic task in which the robot interacts with one or more environmental objects. One non-limiting example of such a robotic task is robotic grasping, which is described in various examples presented herein. However, implementations disclosed herein can be utilized to train a policy model for other non-grasping robotic tasks such as opening a door, throwing a ball, pushing objects, etc.
In implementations disclosed herein, off-policy deep reinforcement learning is used to train the policy model, and the off-policy deep reinforcement learning is based on self-supervised data collection (e.g., using only self-supervised data). On-policy deep reinforcement learning can also be used to train the policy model, and can optionally be interspersed with the off-policy deep reinforcement learning as described herein. The self-supervised data utilized in the off-policy deep reinforcement learning can be based on sensor observations from real-world robots in performance of episodes of the robotic task, and can optionally be supplemented with self-supervised data from robotic simulations of performance of episodes of the robotic task. Through off-policy training, large-scale autonomous data collection, and/or other techniques disclosed herein, implementations can learn policies that generalize effectively to previously unseen objects, previously unseen environments, etc.
The policy model can be a machine learning model, such as a neural network model. Moreover, as described herein, implementations of the reinforcement learning utilized in training the neural network model utilize a continuous-action variant of Q-learning. Accordingly, the policy model can represent the Q-function. Implementations disclosed herein train and utilize the policy model for performance of closed-loop vision-based control, where a robot continuously updates its task strategy based on the most recent vision data observations to optimize long-horizon task success. In some of those implementations, the policy model is trained to predict the value of an action in view of current state data. For example, the action and the state data can both be processed using the policy model to generate a value that is a prediction of the value of the action in view of the current state data.
As mentioned above, the current state data can include vision data captured by a vision component of the robot (e.g., a 2D image from a monographic camera, a 2.5D image from a stereographic camera, and/or a 3D point cloud from a 3D laser scanner). The current state data can include only the vision data, or can optionally include additional data such as whether a grasping end effector of the robot is open or closed. The action can include a pose change for a component of the robot, such as a pose change, in Cartesian space, for a grasping end effector of the robot. The pose change can be defined by the action as, for example, a translation difference (indicating a desired change in position) and a rotation difference (indicating a desired change in azimuthal angle). The action can further include, for example, a component action command that dictates a target state of a dynamic state of the component, where the dynamic state is in addition to translation and rotation of the component. For instance, the component action command can indicate whether a gripper is to be opened, closed, or adjusted to a target state between opened and closed (e.g., partially closed). The action can further include a termination command that dictates whether to terminate performance of the robotic task.
As described herein, the policy model is trained in view of a reward function that can assign a success reward (e.g., “1”) or a failure reward (e.g., “0”) at the last time step of an episode of performing a task. The last time step is one at which a termination action occurred, either as a result of an action determined based on the policy model indicating termination, or as a result of a maximum number of time steps being reached. Various self-supervision techniques can be utilized to assign the reward. For example, for a grasping task, at the end of an episode the gripper can be moved out of the view of the camera and a first image captured while it is out of the view. Then the gripper can be returned to its prior position and “opened” (if closed at the end of the episode) to thereby drop any grasped object, and a second image captured. The first image and the second image can be compared, using background subtraction and/or other techniques, to determine whether the gripper was grasping an object (e.g., the object would be present in the second image, but not the first)—and an appropriate reward assigned to the last time step. In some implementations, the reward function can assign a small penalty (e.g., −0.05) for all time steps where the termination action is not taken. The small penalty can encourage the robot to perform the task quickly.
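As a non-limiting illustration of the reward assignment just described, the following sketch assigns the sparse terminal reward and the optional per-step penalty; the constant names and values are assumptions for illustration only.

```python
# Minimal sketch of the sparse reward assignment described above.
# The constant names and exact values are illustrative assumptions.

SUCCESS_REWARD = 1.0   # assigned at the last time step if the task succeeded
FAILURE_REWARD = 0.0   # assigned at the last time step if the task failed
STEP_PENALTY = -0.05   # small penalty for every non-terminal time step

def assign_reward(is_terminal_step: bool, task_succeeded: bool) -> float:
    """Return the reward for a single time step of an episode."""
    if not is_terminal_step:
        # Encourage the robot to perform the task quickly.
        return STEP_PENALTY
    return SUCCESS_REWARD if task_succeeded else FAILURE_REWARD
```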
To enable the policy model to learn generalizable strategies, it is trained on a diverse set of data representing various objects and/or environments. For example, a diverse set of objects can be needed to enable the policy model to learn generalizable strategies for grasping, such as picking up new objects, performing pre-grasp manipulation, and/or handling dynamic disturbances with vision-based feedback. Collecting such data in a single on-policy training run can be impractical. For example, collecting such data in a single on-policy training run can require significant “wall clock” training time and resulting occupation of real-world robots.
Accordingly, implementations disclosed herein utilize a continuous-action generalization of Q-learning, which is sometimes referenced herein as “QT-Opt”. Unlike other continuous action Q-learning methods, which are often unstable, QT-Opt dispenses with the need to train an explicit actor, and instead uses stochastic optimization to select actions (during inference) and target Q-values (during training). QT-Opt can be performed off-policy, which makes it possible to pool experience from multiple robots and multiple experiments. For example, the data used to train the policy model can be collected over multiple robots operating over long durations. Even fully off-policy training can provide improved task performance, while a moderate amount of on-policy fine-tuning using QT-Opt can further improve performance. QT-Opt maintains the generality of non-convex Q-functions, while avoiding the need for a second maximizer network.
In various implementations, during inference, stochastic optimization is utilized to stochastically select actions to evaluate in view of a current state and using the policy model—and to stochastically select a given action (from the evaluated actions) to implement in view of the current state. For example, the stochastic optimization can be a derivative-free optimization algorithm, such as the cross-entropy method (CEM). CEM samples a batch of N values at each iteration, fits a Gaussian distribution to the best M<N of these samples, and then samples the next batch of N from that Gaussian. As one non-limiting example, N can be 64 and M can be 6. During inference, CEM can be used to select 64 candidate actions, those actions evaluated in view of a current state and using the policy model, and the 6 best can be selected (e.g., the 6 with the highest Q-values generated using the policy model). A Gaussian distribution can be fit to those 6, and 64 more actions selected from that Gaussian. Those 64 actions can be evaluated in view of the current state and using the policy model, and the best one (e.g., the one with the highest Q-value generated using the policy model) can be selected as the action to be implemented. The preceding example is a two iteration approach with N=64 and M=6. Additional iterations and/or alternative values of N and/or M can be utilized.
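The following sketch illustrates this two-iteration CEM procedure for action selection at inference time. It is a minimal illustration, assuming a q_function callable that returns one Q-value per candidate action (standing in for a forward pass of the trained policy model); the Gaussian sampling and default values mirror the example above but are not required.

```python
import numpy as np

def cem_select_action(q_function, state, action_dim,
                      num_samples=64, num_elites=6, iterations=2, seed=None):
    """Select an action via the cross-entropy method, using the policy model
    (Q-function) as the objective. `q_function(state, actions)` is an assumed
    interface returning one Q-value per candidate action."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(action_dim), np.ones(action_dim)

    for _ in range(iterations - 1):
        # Sample N candidate actions and evaluate them in view of the state.
        actions = rng.normal(mean, std, size=(num_samples, action_dim))
        q_values = q_function(state, actions)
        # Fit a Gaussian to the M best candidates.
        elites = actions[np.argsort(q_values)[-num_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6

    # Final batch: the candidate with the highest Q-value is implemented.
    actions = rng.normal(mean, std, size=(num_samples, action_dim))
    q_values = q_function(state, actions)
    return actions[int(np.argmax(q_values))]
```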
In various implementations, during training, stochastic optimization is utilized to determine a target Q-value for use in generating a loss for a state, action pair to be evaluated during training. For example, stochastic optimization can be utilized to stochastically select actions to evaluate in view of a “next state” that corresponds to the state, action pair and using the policy model—and to stochastically select a Q-value that corresponds to a given action (from the evaluated actions). The target Q-value can be determined based on the selected Q-value. For example, the target Q-value can be a function of the selected Q-value and the reward (if any) for the state, action pair being evaluated.
The above description is provided as an overview of only some implementations disclosed herein. These and other implementations are described in more detail herein.
In some implementations, a method implemented by one or more processors of a robot during performance of a robotic task is provided and includes: receiving current state data for the robot and selecting a robotic action to be performed for the robotic task. The current state data includes current vision data captured by a vision component of the robot. Selecting the robotic action includes: performing an optimization over candidate robotic actions using, as an objective function, a trained neural network model that represents a Q-function, and that is trained using reinforcement learning, where performing the optimization includes generating Q-values for a subset of the candidate robotic actions that are considered in the optimization. Generating each of the Q-values is based on processing of the state data and a corresponding one of the candidate robotic actions of the subset using the trained neural network model. Selecting the robotic action further includes selecting the robotic action, from the candidate robotic actions, based on the Q-value generated for the robotic action during the performed optimization. The method further includes providing commands to one or more actuators of the robot to cause performance of the selected robotic action.
These and other implementations may include one or more of the following features.
In some implementations, the robotic action includes a pose change for a component of the robot, where the pose change defines a difference between a current pose of the component and a desired pose for the component of the robot. In some of those implementations, the component is an end effector and the pose change defines a translation difference for the end effector and a rotation difference for the end effector. In some versions of those implementations, the end effector is a gripper and the robotic task is a grasping task.
In some implementations, the robotic action includes a termination command that dictates whether to terminate performance of the robotic task. In some of those implementations, the robotic action further includes a component action command that dictates a target state of a dynamic state of the component, where the dynamic state is in addition to translation and rotation of the component. In some versions of those implementations, the component is a gripper and the target state dictated by the component action command indicates that the gripper is to be closed. In some versions of those implementations, the component action command includes an open command and a closed command that collectively define the target state as opened, closed, or between opened and closed.
In some implementations, the current state data further includes a current status of a component of the robot. In some of those implementations, the component of the robot is a gripper and the current status indicates whether the gripper is opened or closed.
In some implementations, the optimization is a stochastic optimization. In some of those implementations, the optimization is a derivative-free method, such as a cross-entropy method (CEM).
In some implementations, performing the optimization over the candidate robotic actions includes: selecting an initial batch of the candidate robotic actions; generating a corresponding one of the Q-values for each of the candidate robotic actions in the initial batch; selecting an initial subset of the candidate robotic actions in the initial batch based on the Q-values for the candidate robotic actions in the initial batch; fitting a Gaussian distribution to the selected initial subset of the candidate robotic actions; selecting a next batch of the candidate robotic actions based on the Gaussian distribution; and generating a corresponding one of the Q-values for each of the candidate robotic actions in the next batch. In some of those implementations, the robotic action is one of the candidate robotic actions in the next batch, and selecting the robotic action, from the candidate robotic actions, based on the Q-value generated for the robotic action during the performed optimization includes: selecting the robotic action from the next batch based on the Q-value generated for the robotic action being the maximum Q-value of the corresponding Q-values of the next batch.
In some implementations, generating each of the Q-values based on processing of the state data and a corresponding one of the candidate robotic actions of the subset using the trained neural network model includes: processing the state data using a first branch of the trained neural network model to generate a state embedding; processing a first of the candidate robotic actions of the subset using a second branch of the trained neural network model to generate a first embedding; generating a combined embedding by tiling the state embedding and the first embedding; and processing the combined embedding using additional layers of the trained neural network model to generate a first Q-value of the Q-values. In some of those implementations, generating each of the Q-values based on processing of the state data and a corresponding one of the candidate robotic actions of the subset using the trained neural network model further includes: processing a second of the candidate robotic actions of the subset using the second branch of the trained neural network model to generate a second embedding; generating an additional combined embedding by reusing the state embedding, and tiling the reused state embedding and the second embedding; and processing the additional combined embedding using additional layers of the trained neural network model to generate a second Q-value of the Q-values.
In some implementations, a method of training a neural network model that represents a Q-function is provided. The method is implemented by a plurality of processors, and the method includes: retrieving a robotic transition, the robotic transition generated based on data from an episode of a robot performing a robotic task. The robotic transition includes: state data that includes vision data captured by a vision component at a state of the robot during the episode; next state data that includes next vision data captured by the vision component at a next state of the robot during the episode, the next state being transitioned to from the state; an action executed to transition from the state to the next state; and a reward for the robotic transition. The method further includes determining a target Q-value for the robotic transition. Determining the target Q-value includes: performing an optimization over candidate robotic actions using, as an objective function, a version of a neural network model that represents the Q-function. Performing the optimization includes generating Q-values for a subset of the candidate robotic actions that are considered in the optimization, where generating each of the Q-values is based on processing of the next state data and a corresponding one of the candidate robotic actions of the subset using the version of the neural network model. Determining the target Q-value further includes: selecting, from the generated Q-values, a maximum Q-value; and determining the target Q-value based on the maximum Q-value and the reward. The method further includes: storing, in a training buffer: the state data, the action, and the target Q-value; retrieving, from the training buffer: the state data, the action, and the target Q-value; and generating a predicted Q-value. Generating the predicted Q-value includes processing the retrieved state data and the retrieved action using a current version of the neural network model, where the current version of the neural network model is updated relative to the version. The method further includes generating a loss based on the predicted Q-value and the target Q-value and updating the current version of the neural network model based on the loss.
These and other implementations of the technology can include one or more of the following features.
In some implementations, the robotic transition is generated based on offline data and is retrieved from an offline buffer. In some of those implementations, retrieving the robotic transition from the offline buffer is based on a dynamic offline sampling rate for sampling from the offline buffer, where the dynamic offline sampling rate decreases as a duration of training the neural network model increases. In some versions of those implementations, the method further includes generating the robotic transition by accessing an offline database that stores offline episodes.
In some implementations, the robotic transition is generated based on online data and is retrieved from an online buffer, where the online data is generated by a robot performing episodes of the robotic task using a robot version of the neural network model. In some of those implementations, retrieving the robotic transition from the online buffer is based on a dynamic online sampling rate for sampling from the online buffer, where the dynamic online sampling rate increases as a duration of training the neural network model increases. In some versions of those implementations, the method further includes updating the robot version of the neural network model based on the loss.
In some implementations, the action includes a pose change for a component of the robot, where the pose change defines a difference between a pose of the component at the state and a next pose of the component at the next state.
In some implementations, the action includes a termination command when the next state is a terminal state of the episode.
In some implementations, the action includes a component action command that defines a dynamic state of the component in the next state of the episode, the dynamic state being in addition to translation and rotation of the component.
In some implementations, performing the optimization over the candidate robotic actions includes: selecting an initial batch of the candidate robotic actions; generating a corresponding one of the Q-values for each of the candidate robotic actions in the initial batch; selecting an initial subset of the candidate robotic actions in the initial batch based on the Q-values for the candidate robotic actions in the initial batch; fitting a Gaussian distribution to the selected initial subset of the candidate robotic actions; selecting a next batch of the candidate robotic actions based on the Gaussian distribution; and generating a corresponding one of the Q-values for each of the candidate robotic actions in the next batch. In some of those implementations, the maximum Q-value is one of the Q-values of the candidate robotic actions in the next batch and selecting the maximum Q-value is based on the maximum Q-value being the maximum Q-value of the corresponding Q-values of the next batch.
In some implementations, a method implemented by one or more processors of a robot during performance of a robotic task is provided and includes: receiving current state data for the robot, the current state data including current sensor data of the robot; and selecting a robotic action to be performed for the robotic task. Selecting the robotic action includes: performing an optimization over candidate robotic actions using, as an objective function, a trained neural network model that represents a learned optimal policy, where performing the optimization includes generating values for a subset of the candidate robotic actions that are considered in the optimization, and where generating each of the values is based on processing of the state data and a corresponding one of the candidate robotic actions of the subset using the trained neural network model. Selecting the robotic action further includes selecting the robotic action, from the candidate robotic actions, based on the value generated for the robotic action during the performed optimization. The method further includes providing commands to one or more actuators of the robot to cause performance of the selected robotic action.
In some implementations, a method of training a neural network model that represents a policy is provided. The method is implemented by a plurality of processors, and the method includes: retrieving a robotic transition, the robotic transition generated based on data from an episode of a robot performing a robotic task, and the robotic transition including state data and an action. The method further includes determining a target value for the robotic transition. Determining the target value includes performing an optimization over candidate robotic actions using, as an objective function, a version of a neural network model that represents the policy. The method further includes: storing, in a training buffer: the state data, the action, and the target value; retrieving, from the training buffer: the state data, the action data, and the target value; and generating a predicted value. Generating the predicted value includes processing the retrieved state data and the retrieved action data using a current version of the neural network model, where the current version of the neural network model is updated relative to the version. The method further includes generating a loss based on the predicted value and the target value and updating the current version of the neural network model based on the loss.
Other implementations may include a non-transitory computer readable storage medium storing instructions executable by one or more processor(s) (e.g., a central processing unit(s) (CPU(s)), graphics processing unit(s) (GPU(s)), and/or tensor processing unit(s) (TPU(s))) to perform a method such as one or more of the methods described above and/or elsewhere herein. Yet other implementations may include a system of one or more computers and/or one or more robots that include one or more processors operable to execute stored instructions to perform a method such as one or more of the methods described above and/or elsewhere herein.
It should be appreciated that all combinations of the foregoing concepts and additional concepts described in greater detail herein are contemplated as being part of the subject matter disclosed herein. For example, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the subject matter disclosed herein.
Example vision components 184A and 184B are also illustrated in
The vision component 184A has a field of view of at least a portion of the workspace of the robot 180A, such as the portion of the workspace that includes example objects 191A. Although resting surface(s) for objects 191 are not illustrated in
The vision component 184B has a field of view of at least a portion of the workspace of the robot 180B, such as the portion of the workspace that includes example objects 191B. Although resting surface(s) for objects 191B are not illustrated in
Although particular robots 180A and 180B are illustrated in
Robots 180A, 180B, and/or other robots may be utilized to perform a large quantity of grasp episodes and data associated with the grasp episodes can be stored in offline episode data database 150 and/or provided for inclusion in online buffer 112 (of a corresponding one of replay buffers 110A-N), as described herein. As described herein, robots 180A and 180B can optionally initially perform grasp episodes (or other task episodes) according to a scripted exploration policy, in order to bootstrap data collection. The scripted exploration policy can be randomized, but biased toward reasonable grasps. Data from such scripted episodes can be stored in offline episode data database 150 and utilized in initial training of policy model 152 to bootstrap the initial training.
Robots 180A and 180B can additionally or alternatively perform grasp episodes (or other task episodes) using the policy model 152, and data from such episodes provided for inclusion in online buffer 112 during training and/or provided in offline episode data database 150 (and pulled during training for use in populating offline buffer 114). For example, the robots 180A and 180B can utilize method 400 of
The data generated by a robot 180A or 180B during an episode can include state data, actions, and rewards. Each instance of state data for an episode includes at least vision-based data for an instance of the episode. For example, an instance of state data can include a 2D image when a vision component of a robot is a monographic camera. Each instance of state data can include only corresponding vision data, or can optionally include additional data such as whether a grasping end effector of the robot is open or closed at the instance. More formally, a given state observation can be represented as s∈S.
Each of the actions for an episode defines an action that is implemented in the current state to transition to a next state (if any). An action can include a pose change for a component of the robot, such as a pose change, in Cartesian space, for a grasping end effector of the robot. The pose change can be defined by the action as, for example, a translation difference (indicating a desired change in position) and a rotation difference (indicating a desired change in azimuthal angle). The action can further include, for example, a component action command that dictates a target state of a dynamic state of the component, where the dynamic state is in addition to translation and rotation of the component. For instance, the component action command can indicate whether a gripper is to be opened, closed, or adjusted to a target state between opened and closed (e.g., partially closed). The action can further include a termination command that dictates whether to terminate performance of the robotic task. The terminal state of an episode will include a positive termination command to dictate termination of performance of the robotic task.
More formally, a given action can be represented as a∈A. In some implementations, for a grasping task, A includes a vector in Cartesian space t∈R3 indicating the desired change in the gripper position, a change in azimuthal angle encoded via a sine-cosine encoding r∈R2, binary gripper open and close commands g_open and g_close, and a termination command e that ends the episode, such that a=(t, r, g_open, g_close, e).
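Purely as an illustration of this action representation, the following sketch packs the components of a=(t, r, g_open, g_close, e) into a single vector; the class and field names are hypothetical, and the dimensionalities follow the encoding described above.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GraspAction:
    """Illustrative container for the action a = (t, r, g_open, g_close, e)."""
    translation: np.ndarray   # t: desired change in gripper position, shape (3,)
    rotation: np.ndarray      # r: sine-cosine encoding of the azimuthal change, shape (2,)
    gripper_open: float       # g_open: binary gripper open command
    gripper_close: float      # g_close: binary gripper close command
    terminate: float          # e: binary command that ends the episode

    def to_vector(self) -> np.ndarray:
        """Flatten the action into a single vector for processing by the policy model."""
        return np.concatenate([
            self.translation,
            self.rotation,
            [self.gripper_open, self.gripper_close, self.terminate],
        ])
```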
Each of the rewards can be assigned in view of a reward function that can assign a success reward (e.g., “1”) or a failure reward (e.g., “0”) at the last time step of an episode of performing a task. The last time step is one at which a termination action occurred, either as a result of an action determined based on the policy model indicating termination, or as a result of a maximum number of time steps being reached. Various self-supervision techniques can be utilized to assign the reward. For example, for a grasping task, at the end of an episode the gripper can be moved out of the view of the camera and a first image captured while it is out of the view. Then the gripper can be returned to its prior position and “opened” (if closed at the end of the episode) to thereby drop any grasped object, and a second image captured. The first image and the second image can be compared, using background subtraction and/or other techniques, to determine whether the gripper was grasping an object (e.g., the object would be present in the second image, but not the first)—and an appropriate reward assigned to the last time step. In some implementations, the reward function can assign a small penalty (e.g., −0.05) for all time steps where the termination action is not taken. The small penalty can encourage the robot to perform the task quickly.
Also illustrated in
As mentioned herein, the policy model 152 can be a deep neural network model, such as the deep neural network model illustrated and described in
ε(θ) = E(s,a,s′)[D(Qθ(s, a), QT(s, a, s′))]   (1)
where QT(s, a, s′) = r(s, a) + γV(s′) is a target value, and D is some divergence metric.
This corresponds to double Q-learning with a target network, a variant on the standard Bellman error, where the target value V(s′) is computed using one or more lagged target versions of the parameter vector θ.
Q-learning with deep neural network function approximators provides a simple and practical scheme for reinforcement learning with image observations, and is amenable to straightforward parallelization. However, incorporating continuous actions, such as continuous gripper motion in grasping tasks, poses a challenge for this approach. Some prior techniques have sought to address this by using a second network that acts as an approximate maximizer, or by constraining the Q-function to be convex in a, making it easy to maximize analytically. However, such prior techniques can be unstable, which makes them problematic for large-scale reinforcement learning tasks where running hyperparameter sweeps is prohibitively expensive. Accordingly, such prior techniques can be a poor fit for complex manipulation tasks such as grasping, where the Q-function is far from convex in the input. For example, the Q-value may be high for actions that reach toward objects, but low for the gaps between objects.
Accordingly, the QT-Opt approach described herein is an alternative approach that maintains the generality of non-convex Q-functions while avoiding the need for a second maximizer network. In the QT-Opt approach, a state s and action a are inputs into the policy model, and the max in Equation (3) below is evaluated by means of a stochastic optimization algorithm that can handle non-convex and multimodal optimization landscapes.
Formally, let πθ(s) be the policy implicitly induced by the Q-function Qθ(s, a). Equation (3) can be recovered by substituting the optimal policy πθ(s)=arg maxa Qθ(s, a) in place of the arg max argument to the target Q-function. In QT-Opt, πθ(s) is instead evaluated by running a stochastic optimization over a, using Qθ(s, a) as the objective value. The cross-entropy method (CEM) is one algorithm for performing this optimization, which is easy to parallelize and moderately robust to local optima for low-dimensional problems. CEM is a simple derivative-free optimization algorithm that samples a batch of N values at each iteration, fits a Gaussian distribution to the best M<N of these samples, and then samples the next batch of N from that Gaussian. In some implementations, N=64 and M=6, and two iterations of CEM are performed. As described herein, this procedure can be used both to compute targets at training time, and to choose actions for exploration in the real world.
Turning now to
To effectively ingest and train on such large and diverse datasets, a distributed, asynchronous implementation of QT-Opt can be utilized.
Further, online transitions can optionally be pushed, from robots 180, to online buffer 112. The online transitions can also optionally be stored in offline episode data database 150 and later read by log readers 126A-N, at which point they will be offline transitions.
A plurality of Bellman updaters 122A-N operating in parallel sample transitions from the offline and online buffers 114 and 112. In various implementations, this is a weighted sampling (e.g., a sampling rate for the offline buffer 114 and a separate sampling rate for the online buffer 112) that can vary with the duration of training. For example, early in training the sampling rate for the offline buffer 114 can be relatively large, and can decrease with duration of training (and, as a result, the sampling rate for the online buffer 112 can increase). This can avoid overfitting to the initially scarce on-policy data, and can accommodate the much lower rate of production of on-policy data.
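One way such a sampling schedule might look is sketched below. The linear annealing, the start and end rates, and the annealing horizon are all illustrative assumptions; the only property taken from the description above is that the offline sampling rate decreases (and the online rate correspondingly increases) as training progresses.

```python
import random

def offline_sampling_rate(train_step, start_rate=0.9, end_rate=0.5,
                          anneal_steps=100_000):
    """Linearly anneal the probability of sampling from the offline buffer."""
    progress = min(train_step / anneal_steps, 1.0)
    return start_rate + progress * (end_rate - start_rate)

def sample_transition(offline_buffer, online_buffer, train_step):
    """Weighted sampling between the offline and online buffers."""
    if not online_buffer or random.random() < offline_sampling_rate(train_step):
        return random.choice(offline_buffer)
    return random.choice(online_buffer)
```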
The Bellman updaters 122A-N label sampled data with corresponding target values, and store the labeled samples in a train buffer 116, which can operate as a ring buffer. In labeling a given instance of sampled data with a given target value, one of the Bellman updaters 122A-N can carry out the CEM optimization procedure using the current policy model (e.g., with current learned parameters). Note that one consequence of this asynchronous procedure is that the samples in train buffer 116 are labeled with different lagged versions of the current model. In some implementations, Bellman updaters 122A-N can each perform one or more steps of method 500 of
A plurality of training workers 124A-N operate in parallel and pull labeled transitions from the train buffer 116 randomly and use them to update the policy model 152. Each of the training workers 124A-N computes gradients and sends the computed gradients asynchronously to the parameter servers 128A-N. In some implementations, training workers 124A-N can each perform one or more steps of method 600 of
Additional description of implementations of methods that can be implemented by various components of
At block 302, the system starts log reading. For example, log reading can be initialized at the beginning of reinforcement learning.
At block 304, the system reads data from a past episode. For example, the system can read data from an offline episode data database that stores states, actions, and rewards from past episodes of robotic performance of a task. The past episode can be one performed by a corresponding real physical robot based on a past version of a policy model. The past episode can, in some implementations and/or situations (e.g., at the beginning of reinforcement learning) be one performed based on a scripted exploration policy, based on a demonstrated (e.g., through virtual reality, kinesthetic teaching, etc.) performance of the task, etc. Such scripted exploration performances and/or demonstrated performances can be beneficial in bootstrapping the reinforcement learning as described herein.
At block 306, the system converts data into a transition. For example, the data read can be from two time steps in the past episode and can include state data (e.g., vision data) from a state, state data from a next state, an action taken to transition from the state to the next state (e.g., gripper translation and rotation, gripper open/close, and whether the action led to a termination), and a reward for the action. The reward can be determined as described herein, and can optionally be previously determined and stored with the data.
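A minimal sketch of this conversion is shown below. The Transition fields and the per-step record keys are hypothetical names chosen for illustration, and any image or action types stand in for the state data and action encodings described herein.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class Transition:
    """One (state, action, reward, next state) tuple read from a past episode."""
    state: Any        # e.g., vision data (and optional gripper status) at step t
    action: Any       # gripper translation/rotation, open/close, termination
    reward: float     # reward assigned to the action
    next_state: Any   # e.g., vision data at step t + 1
    terminal: bool    # whether the action led to termination of the episode

def episode_to_transitions(episode: List[Dict[str, Any]]) -> List[Transition]:
    """Convert a logged episode (a list of per-step records) into transitions."""
    transitions = []
    for step, next_step in zip(episode[:-1], episode[1:]):
        transitions.append(Transition(
            state=step["state"],
            action=step["action"],
            reward=step["reward"],
            next_state=next_step["state"],
            terminal=step["terminal"],
        ))
    return transitions
```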
At block 308, the system pushes the transition into an offline buffer. The system then returns to block 304 to read data from another past episode.
In various implementations, method 300 can be parallelized across a plurality of separate processors and/or threads. For example, method 300 can be performed simultaneously by each of 50, 100, or more separate workers.
At block 402, the system starts a policy-guided task episode.
At block 404, the system stores the state of the robot. For example, the state of the robot can include at least vision data captured by a vision component associated with the robot. For instance, the state can include an image captured by the vision component at a corresponding time step.
At block 406, the system selects an action using a current robot policy model. For example, the system can utilize a stochastic optimization technique (e.g., the CEM technique described herein) to sample a plurality of actions using the current robot policy model, and can select the sampled action with the highest value generated using the current robot policy model.
At block 408, the system executes the action using the current robot policy model. For example, the system can provide commands to one or more actuators of the robot to cause the robot to execute the action. For instance, the system provides commands to actuator(s) of the robot to cause a gripper to translate and/or rotate as dictated by the action and/or to cause the gripper to close or open as dictated by the action (and if different than the current state of the gripper). In some implementations the action can include a termination command (e.g., that indicates whether the episode should terminate) and if the termination command indicates the episode should terminate, the action at block 408 can be a termination of the episode.
At block 410, the system determines a reward based on the system executing the action using the current robot policy model. In some implementations, when the action is a non-terminal action, the reward can be, for example, a “0” reward—or a small penalty (e.g., −0.05) to encourage faster robotic task completion. In some implementations, when the action is a terminal action, the reward can be a “1” if the robotic task was successful and a “0” if the robotic task was not successful. For example, for a grasping task the reward can be “1” if an object was successfully grasped, and a “0” otherwise.
The system can utilize various techniques to determine whether a grasp or other robotic task is successful. For example, for a grasp, at termination of an episode the gripper can be moved out of the view of the camera and a first image captured when it is out of the view. Then the gripper can be returned to its prior position and “opened” (if closed at the end of the episode) to thereby drop any grasped object, and a second image captured. The first image and the second image can be compared, using background subtraction and/or other techniques, to determine whether the gripper was grasping an object (e.g., the object would be present in the second image, but not the first)—and an appropriate reward assigned to the last time step. In some implementations, the height of the gripper and/or other metric(s) can also optionally be considered. For example, a grasp may only be considered successful if the height of the gripper is above a certain threshold.
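The following sketch illustrates one such success check based on simple background subtraction, optionally gated by a gripper-height check; the pixel thresholds, the fraction of changed pixels, and the function interface are illustrative assumptions rather than prescribed values.

```python
import numpy as np

def grasp_succeeded(image_gripper_out_of_view, image_after_drop,
                    pixel_threshold=30, changed_fraction_threshold=0.01,
                    gripper_height=None, min_height=None):
    """Heuristic grasp-success check using background subtraction.

    The first image is captured with the gripper (and any held object) out of
    the camera's view; the second is captured after the gripper returns and
    opens, dropping any grasped object back into the scene.
    """
    if min_height is not None and gripper_height is not None:
        if gripper_height < min_height:
            # Gripper never lifted high enough for the grasp to count.
            return False

    diff = np.abs(image_after_drop.astype(np.int32) -
                  image_gripper_out_of_view.astype(np.int32))
    changed_fraction = float((diff > pixel_threshold).mean())
    # A dropped object shows up as a region of changed pixels.
    return changed_fraction > changed_fraction_threshold
```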
At block 412, the system pushes the state of block 404, the action selected at block 406, and the reward of block 410 to an online buffer to be utilized as online data during reinforcement learning. The next state (from a next iteration of block 404) can also be pushed to the online buffer. At block 412, the system can also push the state of block 404, the action selected at block 406, and the reward of block 410 to an offline buffer to be subsequently used as offline data during the reinforcement learning (e.g. utilized many time steps in the future in the method 300 of
At block 414, the system determines whether to terminate the episode. In some implementations and/or situations, the system can terminate the episode if the action at a most recent iteration of block 408 indicated termination. In some additional or alternative implementations and/or situations, the system can terminate the episode if a threshold quantity of iterations of blocks 404-412 have been performed for the episode and/or if other heuristics based termination conditions have been satisfied.
If, at block 414, the system determines not to terminate the episode, then the system returns to block 404. If, at block 414, the system determines to terminate the episode, then the system proceeds to block 402 to start a new policy-guided task episode. The system can, at block 416, optionally reset a counter that is used in block 414 to determine if a threshold quantity of iterations of blocks 404-412 have been performed.
In various implementations, method 400 can be parallelized across a plurality of separate real and/or simulated robots. For example, method 400 can be performed simultaneously by each of 5, 10, or more separate real robots. Also, although method 300 and method 400 are illustrated in separate figures herein for the sake of clarity, it is understood that in many implementations methods 300 and 400 are performed in parallel during reinforcement learning.
At block 502, the system starts training buffer population.
At block 504, the system retrieves a robotic transition. The robotic transition can be retrieved from an online buffer or an offline buffer. The online buffer can be one populated according to method 400 of
At block 506, the system determines a target Q-value based on the retrieved robotic transition information from block 504. In some implementations, the system determines the target Q-value using stochastic optimization techniques as described herein. In some implementations, the stochastic optimization technique is CEM and, in some of those implementations, block 506 may include one or more of the following sub-blocks.
At sub-block 5061, the system selects N actions for the robot, where N is an integer number.
At sub-block 5062, the system generates a Q-value for each action by processing each of the N actions for the robot and processing next state data of the robotic transition (of block 504) using a version of a policy model.
At sub-block 5063, the system selects M actions from the N actions based on the generated Q-values, where M is an integer number.
At sub-block 5064, the system selects N actions from a Gaussian distribution fit to the M actions.
At sub-block 5065, the system generates a Q-value for each action by processing each of the N actions and processing the next state data using the version of the policy model.
At sub-block 5066, the system selects a max Q-value from the generated Q-values at sub-block 5065.
At sub-block 5067, the system determines a target Q-value based on the max Q-value selected at sub-block 5066. In some implementations, the system determines the target Q-value as a function of the max Q-value and a reward included in the robotic transition retrieved at block 504.
At block 508, the system stores, in a training buffer, state data, a corresponding action, and the target Q-value determined at sub-block 5067. The system then proceeds to block 504 to perform another iteration of blocks 504, 506, and 508.
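A condensed sketch of blocks 504-508 is shown below. The max_q_over_next_actions callable is assumed to run the stochastic optimization of sub-blocks 5061-5066 (e.g., the CEM procedure sketched earlier) over candidate actions for the next state and return the resulting maximum Q-value; the discount factor is an illustrative assumption.

```python
def compute_target_q_value(transition, max_q_over_next_actions, gamma=0.9):
    """Label one retrieved transition with a Bellman-style target Q-value.

    `transition` is assumed to expose reward, next_state, and terminal fields
    (as in the Transition sketch above); the target is the reward plus the
    discounted maximum Q-value over actions at the next state.
    """
    if transition.terminal:
        # No next state to bootstrap from at the end of the episode.
        return transition.reward
    max_next_q = max_q_over_next_actions(transition.next_state)
    return transition.reward + gamma * max_next_q
```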
In various implementations, method 500 can be parallelized across a plurality of separate processors and/or threads. For example, method 500 can be performed simultaneously by each of 5, 10, or more separate threads. Also, although method 300, 400, and 500 are illustrated in separate figures herein for the sake of clarity, it is understood that in many implementations methods 300, 400, and 500 are performed in parallel during reinforcement learning.
At block 602, the system starts training the policy model.
At block 604, the system retrieves, from a training buffer, state data of a robot, action data of the robot, and a target Q-value for the robot.
At block 606, the system generates a predicted Q-value by processing the state data of the robot and an action of the robot using a current version of the policy model. It is noted that in various implementations the current version of the policy model utilized to generate the predicted Q-value at block 606 will be updated relative to the model utilized to generate the target Q-value that is retrieved at block 604. In other words, the target Q-value that is retrieved at block 604 will be generated based on a lagged version of the policy model.
At block 608, the system generates a loss value based on the predicted Q-value and the target Q-value. For example, the system can generate a log loss based on the two values.
At block 610, the system determines whether there are additional state data, action data, and target Q-values to be retrieved for the batch (where batch techniques are utilized). If it is determined that there are additional state data, action data, and target Q-values to be retrieved for the batch, then the system performs another iteration of blocks 604, 606, and 608. If it is determined that there are not, then the system proceeds to block 612.
At block 612, the system determines a gradient based on the loss(es) determined at iteration(s) of block 608, and provides the gradient to a parameter server for updating parameters of the policy model based on the gradient. The system then proceeds back to block 604 and performs additional iterations of blocks 604, 606, 608, and 610, and determines an additional gradient at block 612 based on loss(es) determined in the additional iteration(s) of block 608.
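The following sketch condenses blocks 604-612 into a single illustrative training-worker iteration. The log loss follows the example at block 608; `policy_model.predict`, `policy_model.gradients`, and `parameter_server.apply_gradients` are hypothetical interfaces standing in for whatever framework hosts the policy model and parameter servers.

```python
import numpy as np

def log_loss(predicted_q, target_q, eps=1e-7):
    """Binary log loss between a predicted and a target Q-value in [0, 1]."""
    p = np.clip(predicted_q, eps, 1.0 - eps)
    return -(target_q * np.log(p) + (1.0 - target_q) * np.log(1.0 - p))

def training_worker_step(batch, policy_model, parameter_server):
    """One illustrative iteration of blocks 604-612 for a batch of samples."""
    losses = []
    for state, action, target_q in batch:
        predicted_q = policy_model.predict(state, action)   # block 606
        losses.append(log_loss(predicted_q, target_q))      # block 608
    gradient = policy_model.gradients(np.mean(losses))      # block 612
    parameter_server.apply_gradients(gradient)              # sent asynchronously
```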
In various implementations, method 600 can be parallelized across a plurality of separate processors and/or threads. For example, method 600 can be performed simultaneously by each of 5, 10, or more separate threads. Also, although method 300, 400, 500, and 600 are illustrated in separate figures herein for the sake of clarity, it is understood that in many implementations methods 300, 400, 500, and 600 are performed in parallel during reinforcement learning.
At block 702, the system starts performance of a robotic task.
At block 704, the system receives current state data of a robot to perform the robotic task.
At block 706, the system selects a robotic action to perform the robotic task. In some implementations, the system selects the robotic action using stochastic optimization techniques as described herein. In some implementations, the stochastic optimization technique is CEM and, in some of those implementations, block 706 may include one or more of the following sub-blocks.
At sub-block 7061, the system selects N actions for the robot, where N is an integer number.
At sub-block 7062, the system generates a Q-value for each action by processing each of the N actions for the robot and processing current state data using a trained policy model.
At sub-block 7063, the system selects M actions from the N actions based on the generated Q-values, where M is an integer number.
At sub-block 7064, the system selects N actions from a Gaussian distribution fit to the M actions.
At sub-block 7065, the system generates a Q-value for each action by processing each of the N actions and processing the current state data using the trained policy model.
At sub-block 7066, the system selects a max Q-value from the generated Q-values at sub-block 7065, and selects the action corresponding to the max Q-value as the robotic action.
At block 708, the robot executes the selected robotic action.
At block 710, the system determines whether to terminate performance of the robotic task. In some implementations and/or situations, the system can terminate the performance of the robotic task if the action at a most recent iteration of block 706 indicated termination. In some additional or alternative implementations and/or situations, the system can terminate the episode if a threshold quantity of iterations of blocks 704, 706, and 708 have been performed for the performance and/or if other heuristics based termination conditions have been satisfied.
If the system determines, at block 710, not to terminate performance of the robotic task, then the system performs another iteration of blocks 704, 706, and 708. If the system determines, at block 710, to terminate performance of the robotic task, then the system proceeds to block 712 and ends performance of the robotic task.
In
In
The policy model 800 includes a plurality of initial convolutional layers 864, 866, 867, etc. with interspersed max-pooling layers 865, 868, etc. The vision data 861 is processed using the initial convolutional layers 864, 866, 867, etc. and max-pooling layers 865, 868, etc.
The policy model 800 also includes two fully connected layers 869 and 870 that are followed by a reshaping layer 871. The action 862 and optionally the gripper open value 863 are processed using the fully connected layers 869, 870 and the reshaping layer 871. As indicated by the “+” of
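The following sketch illustrates such a two-branch policy model, with a convolutional vision branch, a fully connected action branch that is reshaped and tiled over the spatial feature map, and additional layers that process the combined embedding into a single Q-value. The specific layer sizes, kernel sizes, and image/action dimensions are assumptions for illustration and do not correspond to the numbered layers of the figure.

```python
import torch
import torch.nn as nn

class PolicyModelSketch(nn.Module):
    """Illustrative two-branch Q-network in the spirit of policy model 800."""

    def __init__(self, action_dim: int = 7):
        super().__init__()
        # Vision branch: convolutional layers with interspersed max-pooling.
        self.vision_branch = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=6, stride=2), nn.ReLU(),
            nn.MaxPool2d(3),
            nn.Conv2d(64, 64, kernel_size=5), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Action branch: fully connected layers whose output is reshaped so it
        # can be tiled (broadcast) over the spatial feature map.
        self.action_branch = nn.Sequential(
            nn.Linear(action_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
        )
        # Additional layers applied to the combined embedding.
        self.head = nn.Sequential(
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1), nn.Sigmoid(),  # Q-value bounded in [0, 1]
        )

    def forward(self, image: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        state_embedding = self.vision_branch(image)                       # (B, 64, H, W)
        action_embedding = self.action_branch(action)[:, :, None, None]   # (B, 64, 1, 1)
        combined = state_embedding + action_embedding  # tile/broadcast and combine
        return self.head(combined)
```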
Turning now to
Operational components 940a-940n may include, for example, one or more end effectors and/or one or more servo motors or other actuators to effectuate movement of one or more components of the robot. For example, the robot 925 may have multiple degrees of freedom and each of the actuators may control actuation of the robot 925 within one or more of the degrees of freedom responsive to the control commands. As used herein, the term actuator encompasses a mechanical or electrical device that creates motion (e.g., a motor), in addition to any driver(s) that may be associated with the actuator and that translate received control commands into one or more signals for driving the actuator. Accordingly, providing a control command to an actuator may comprise providing the control command to a driver that translates the control command into appropriate signals for driving an electrical or mechanical device to create desired motion.
The robot control system 960 may be implemented in one or more processors, such as a CPU, GPU, and/or other controller(s) of the robot 925. In some implementations, the robot 925 may comprise a “brain box” that may include all or aspects of the control system 960. For example, the brain box may provide real time bursts of data to the operational components 940a-940n, with each of the real time bursts comprising a set of one or more control commands that dictate, inter alia, the parameters of motion (if any) for each of one or more of the operational components 940a-940n. In some implementations, the robot control system 960 may perform one or more aspects of methods 400 and/or 700 described herein.
As described herein, in some implementations all or aspects of the control commands generated by control system 960 in performing a robotic task can be based on an action selected based on a current state (e.g., based at least on current vision data) and based on utilization of a trained policy model as described herein. Stochastic optimization techniques can be utilized in selecting an action at each time step of controlling the robot. Although control system 960 is illustrated in
User interface input devices 1022 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 1010 or onto a communication network.
User interface output devices 1020 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computing device 1010 to the user or to another machine or computing device.
Storage subsystem 1024 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 1024 may include the logic to perform selected aspects of the method of
These software modules are generally executed by processor 1014 alone or in combination with other processors. Memory 1025 used in the storage subsystem 1024 can include a number of memories including a main random access memory (RAM) 1030 for storage of instructions and data during program execution and a read only memory (ROM) 1032 in which fixed instructions are stored. A file storage subsystem 1026 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 1026 in the storage subsystem 1024, or in other machines accessible by the processor(s) 1014.
Bus subsystem 1012 provides a mechanism for letting the various components and subsystems of computing device 1010 communicate with each other as intended. Although bus subsystem 1012 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.
Computing device 1010 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 1010 depicted in
Particular examples of some implementations disclosed herein are now described, along with various advantages that can be achieved in accordance with those and/or other examples.
In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, implementations disclosed herein enable closed-loop vision-based control, whereby the robot continuously updates its grasp strategy, based on the most recent observations, to optimize long-horizon grasp success. Those implementations can utilize QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage thousands (e.g., over 500,000) of real-world grasp attempts to train a deep neural network Q-function with a large quantity of parameters (e.g., over 500,000 or over 1,000,000) to perform closed-loop, real-world grasping that generalizes to a high grasp success rate (e.g., >90%, >95%) on unseen objects. Aside from attaining a very high success rate, grasping utilizing techniques described herein exhibits behaviors that are quite distinct from more standard grasping systems. For example, some techniques can automatically learn regrasping strategies, probe objects to find the most effective grasps, learn to reposition objects and perform other non-prehensile pre-grasp manipulations, and/or respond dynamically to disturbances and perturbations.
Various implementations utilize observations that come from a monocular RGB camera, and actions that include end-effector Cartesian motion and gripper opening and closing commands (and optionally termination commands). The reinforcement learning algorithm receives a binary reward for lifting an object successfully, optionally with no other reward shaping (or with only a small penalty for each non-terminal time step). The constrained observation space, constrained action space, and/or sparse reward based on grasp success can enable reinforcement learning techniques disclosed herein to be feasible to deploy at large scale. Unlike many reinforcement learning tasks, a primary challenge in this task is not just to maximize reward, but to generalize effectively to previously unseen objects. This requires a very diverse set of objects during training. To make maximal use of this diverse dataset, the QT-Opt off-policy training method is utilized, which is based on a continuous-action generalization of Q-learning. Unlike other continuous action Q-learning methods, which are often unstable due to actor-critic instability, QT-Opt dispenses with the need to train an explicit actor, instead using stochastic optimization over the critic to select actions and target values. Even fully off-policy training can outperform strong baselines based on prior work, while a moderate amount of on-policy joint fine-tuning with offline data can improve performance on challenging, previously unseen objects.
QT-Opt trained models attain a high success rate across a range of objects not seen during training. Qualitative experiments show that this high success rate is due to the system adopting a variety of strategies that would be infeasible without closed-loop vision-based control. The learned policies exhibit corrective behaviors, regrasping, probing motions to ascertain the best grasp, non-prehensile repositioning of objects, and other features that are feasible only when grasping is formulated as a dynamic, closed-loop process.
Current grasping systems typically approach the grasping task as the problem of predicting a grasp pose, where the system looks at the scene (typically using a depth camera), chooses the best location at which to grasp, and then executes an open-loop planner to reach that location. In contrast, implementations disclosed herein utilize reinforcement learning with deep neural networks, which enables dynamic closed-loop control. This allows trained policies to perform pre-grasp manipulation, to respond to dynamic disturbances, and to learn grasping in a generic framework that makes minimal assumptions about the task.
In contrast to framing closed-loop grasping as a servoing problem, implementations disclosed herein use a general-purpose reinforcement learning algorithm to solve the grasping task, which enables long-horizon reasoning. In practice, this enables autonomously acquiring complex grasping strategies. Further, implementations can be entirely self-supervised, using only automatically obtained grasp outcome labels, thereby incorporating long-horizon reasoning via reinforcement learning into a generalizable vision-based system trained on self-supervised real-world data. Yet further, implementations can operate on raw monocular RGB observations (e.g., from an over-the-shoulder camera), without requiring depth observations and/or other supplemental observations.
Implementations of the closed-loop vision-based control framework are based on a general formulation of robotic manipulation as a Markov Decision Process (MDP). At each time step, the policy observes the image from the robot's camera and chooses a gripper command. This task formulation is general and could be applied to a wide range of robotic manipulation tasks beyond grasping. The grasping task is defined simply by providing a reward to the learner during data collection: a successful grasp results in a reward of 1, and a failed grasp a reward of 0. A grasp can be considered successful if, for example, the robot holds an object above a certain height at the end of the episode. The framework of MDPs provides a powerful formalism for such decision making problems, but learning in this framework can be challenging. Generalization requires diverse data, but recollecting experience on a wide range of objects after every policy update is impractical, ruling out on-policy algorithms. Instead, implementations present a scalable off-policy reinforcement learning framework based around a continuous generalization of Q-learning. While actor-critic algorithms are a popular approach in the continuous action setting, implementations disclosed herein recognize that a more stable and scalable alternative is to train only a Q-function, and induce a policy implicitly by maximizing this Q-function using stochastic optimization. To handle the large datasets and networks, a distributed collection and training system is utilized that asynchronously updates target values, collects on-policy data, reloads off-policy data from past experiences, and trains the network on both data streams within a distributed optimization framework.
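The sparse, self-supervised grasp reward described above can be made concrete with a short sketch. The function name, the height threshold value, and the holding check are illustrative assumptions, not values specified in this disclosure.

```python
# A minimal sketch of the sparse grasp reward described above: the episode yields a
# reward of 1 only if the robot is holding an object above a height threshold when the
# episode terminates, and 0 otherwise. Names and the threshold are assumptions.
HEIGHT_THRESHOLD_M = 0.1  # assumed example threshold, in meters


def grasp_reward(episode_terminated: bool,
                 object_height_m: float,
                 gripper_holding_object: bool) -> float:
    """Binary grasp-success reward computed at the end of an episode."""
    if not episode_terminated:
        return 0.0  # no shaped reward during the episode
    success = gripper_holding_object and object_height_m > HEIGHT_THRESHOLD_M
    return 1.0 if success else 0.0
```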
The utilized QT-Opt algorithm is a continuous action version of Q-learning adapted for scalable learning and optimized for stability, to make it feasible to handle large amounts of off-policy image data for complex tasks like grasping. In reinforcement learning, s∈S denotes the state. As described herein, in various implementations the state can include (or be restricted to) image observations, such as RGB image observations from a monographic RGB camera. Further, a∈A denotes the action. As described herein, in various implementations the action can include (or be restricted to) robot arm motion, a gripper command, and optionally a termination command. At each time step t, the algorithm chooses an action, transitions to a new state, and receives a reward r(st, at). The goal in reinforcement learning is to recover a policy that selects actions to maximize the total expected reward. One way to acquire such an optimal policy is to first solve for the optimal Q-function, which is sometimes referred to as the state-action value function. The Q-function specifies the expected reward that will be received after taking some action a in some state s, and the optimal Q-function specifies this value for the optimal policy. In practice, a parameterized Q-function Qθ(s, a) can be learned, where θ can denote the weights in a neural network. The optimal Q-function can be learned by minimizing the Bellman error, given by equation (1) above, where QT(s, a, s′)=r(s, a)+γV(s′) is a target value and D is a divergence metric. The cross-entropy function can be used for D, since total returns are bounded in [0, 1]. The expectation is taken under the distribution over all previously observed transitions, and V(s′) is a target value computed using lagged target networks. Two target networks can optionally be utilized to improve stability, by maintaining two lagged versions of the parameter vector θ and computing the target value V(s′) from those lagged versions (e.g., as the minimum of the two lagged Q-networks' value estimates for s′).
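The following is a hedged sketch of the target value and cross-entropy divergence just described. The callables q_online, q_lagged_1, q_lagged_2 and the action maximizer are assumptions; the maximization over actions corresponds to the stochastic optimization discussed below, and taking the minimum of the two lagged networks is the example combination mentioned above.

```python
# A hedged sketch of the training target and loss described above: the target value
# QT(s, a, s') = r(s, a) + γ·V(s'), with V(s') computed from two lagged ("target")
# copies of the Q-network, and a cross-entropy divergence D since returns lie in [0, 1].
import numpy as np

GAMMA = 0.9  # assumed example discount factor
EPS = 1e-7


def target_value(next_state, q_lagged_1, q_lagged_2, maximize_actions):
    """V(s'): evaluate both lagged Q-networks at the best action and take the minimum."""
    best_action = maximize_actions(q_lagged_1, next_state)  # e.g., via stochastic optimization
    return min(q_lagged_1(next_state, best_action),
               q_lagged_2(next_state, best_action))


def bellman_error(q_online, q_lagged_1, q_lagged_2, maximize_actions, transition):
    """Cross-entropy divergence between Qθ(s, a) and the target QT(s, a, s')."""
    state, action, reward, next_state, terminal = transition
    q_t = reward if terminal else reward + GAMMA * target_value(
        next_state, q_lagged_1, q_lagged_2, maximize_actions)
    q_t = np.clip(q_t, 0.0, 1.0)                      # returns are bounded in [0, 1]
    q = np.clip(q_online(state, action), EPS, 1.0 - EPS)
    # Cross-entropy treating the target as a soft label and Qθ(s, a) as the prediction.
    return -(q_t * np.log(q) + (1.0 - q_t) * np.log(1.0 - q))
```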
Q-learning with deep neural network function approximators provides a simple and practical scheme for RL with image observations, and is amenable to straightforward parallelization. However, incorporating continuous actions, such as continuous gripper motion in a grasping application, poses a challenge for this approach. Prior work has sought to address this by using a second network that amortizes the maximization, or by constraining the Q-function to be convex in a, making it easy to maximize analytically. However, the former class of methods is notoriously unstable, which makes such methods problematic for large-scale RL tasks where running hyperparameter sweeps is prohibitively expensive. Action-convex value functions are a poor fit for complex manipulation tasks such as grasping, where the Q-function is far from convex in the input. For example, the Q-value may be high for actions that reach toward objects, but low for the gaps between objects.
The proposed QT-Opt approach presents a simple and practical alternative that maintains the generality of non-convex Q-functions while avoiding the need for a second maximizer network. The image s and action a are inputs into the network, and the arg max in Equation (1) is evaluated with a stochastic optimization algorithm that can handle non-convex and multimodal optimization landscapes. Let πθ(s) denote the policy implicitly induced by the Q-function; that is, πθ(s) is the action a that maximizes Qθ(s, a). In various implementations, this maximization is approximated with the cross-entropy method (CEM), a simple derivative-free optimization algorithm that iteratively samples a batch of candidate actions, fits a Gaussian distribution to the best-performing candidates, and samples the next batch of candidates from that fitted distribution.
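The following is a minimal sketch of using CEM to approximate the arg max over actions of Qθ(s, a), as discussed above. The iteration count, sample sizes, and the assumption of box-bounded actions are illustrative choices, not values specified in this disclosure.

```python
# A minimal CEM sketch: sample candidate actions from a Gaussian, keep the candidates
# with the highest Q-values, refit the Gaussian, and repeat. Parameters are assumptions.
import numpy as np


def cem_maximize(q_fn, state, action_dim, action_low, action_high,
                 iterations=2, num_samples=64, num_elites=6, rng=None):
    """Return an action that approximately maximizes q_fn(state, action)."""
    rng = rng or np.random.default_rng()
    mean = (action_low + action_high) / 2.0
    std = (action_high - action_low) / 2.0
    for _ in range(iterations):
        # Sample candidate actions from the current Gaussian and clip to the action bounds.
        samples = rng.normal(mean, std, size=(num_samples, action_dim))
        samples = np.clip(samples, action_low, action_high)
        values = np.array([q_fn(state, a) for a in samples])
        # Refit the Gaussian to the best-performing ("elite") candidates.
        elites = samples[np.argsort(values)[-num_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # the Gaussian mean after the final iteration
```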
Learning vision-based policies with reinforcement learning that generalize over new scenes and objects requires large amounts of diverse data. To effectively train on such a large and diverse dataset, a distributed, asynchronous implementation of QT-Opt is utilized. Transitions are stored in a distributed replay buffer database, which both loads historical data from disk and can accept online data from live ongoing experiments across multiple robots. The data in this buffer is continually labeled with target Q-values by a large set (e.g., more than 500 or more than 1,000) of “Bellman updater” jobs, which carry out the CEM optimization procedure using the current target network and then store the labeled samples in a second training buffer, which operates as a ring buffer. One consequence of this asynchronous procedure is that some samples in the training buffer are labeled with lagged versions of the Q-network. Training workers pull labeled transitions from the training buffer randomly and use them to update the Q-function. Multiple (e.g., more than 5 or more than 10) training workers can be utilized, each of which computes gradients that are sent asynchronously to parameter servers.
QT-Opt can be applied to enable dynamic vision-based grasping. The task requires a policy that can locate an object, position it for grasping (potentially by performing pre-grasp manipulations), pick up the object, potentially regrasping as needed, raise the object, and then signal that the grasp is complete to terminate the episode. To enable self-supervised grasp labeling in the real world, the reward only indicates whether or not an object was successfully picked up. This represents a fully end-to-end approach to grasping: no prior knowledge about objects, physics, or motion planning is provided to the model aside from the knowledge that it can extract autonomously from the data.
In order to enable the model to learn generalizable strategies that can pick up new objects, perform pre-grasp manipulation, and handle dynamic disturbances with vision-based feedback, it must be trained on a sufficiently large and diverse set of objects. Collecting such data in a single on-policy training run would be impractical. The off-policy QT-Opt algorithm described herein makes it possible to pool experience from multiple robots and multiple experiments. Since a completely random initial policy would produce a very low success rate with such an unconstrained action space, a weak scripted exploration policy can optionally be utilized to bootstrap data collection. This policy is randomized, but biased toward reasonable grasps, and achieves a grasp success rate around 15-30%. A switch to using the learned QT-Opt policy can then be made once it reaches a threshold success rate (e.g., of about 50%) and/or after a threshold quantity of iterations.
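The bootstrap-then-switch collection strategy just described can be sketched as follows. The policy interfaces, the iteration threshold, and the counter plumbing are assumptions for illustration; only the approximately 50% success-rate threshold comes from the description above.

```python
# A hedged sketch of the bootstrapping strategy above: collection starts with a weak,
# grasp-biased scripted policy and switches to the learned QT-Opt policy once the
# learned policy's measured success rate crosses a threshold (and/or after a minimum
# number of iterations). Interfaces and the iteration threshold are assumptions.
SWITCH_SUCCESS_RATE = 0.5      # example threshold from the description above
MIN_ITERATIONS = 10_000        # assumed example iteration threshold


def select_collection_policy(scripted_policy, learned_policy,
                             learned_success_rate, iterations_so_far):
    """Choose which policy drives data collection for the next episode."""
    if learned_success_rate >= SWITCH_SUCCESS_RATE or iterations_so_far >= MIN_ITERATIONS:
        return learned_policy   # learned QT-Opt policy takes over
    return scripted_policy      # randomized, grasp-biased scripted exploration
```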
This distributed design of the QT-Opt algorithm can achieve various benefits. For example, trying to store all transitions in the memory of a single machine is infeasible. The employed distributed replay buffer enables storing hundreds of thousands of transitions across several machines. Also, for example, the Q-network is quite large, and distributing training across multiple GPUs drastically increases research velocity by reducing time to convergence. Similarly, in order to support large scale simulated experiments, the design has to support running hundreds of simulated robots that cannot fit on a single machine. As another example, decoupling training jobs from data generation jobs allows training to be treated as data-agnostic, making it easy to switch between simulated data, off-policy real data, and on-policy real data. It also allows the speed of training and the speed of data generation to be scaled independently.
Online agents (real or simulated robots) collect data from the environment. The policy used for data collection can be the Q-network with Polyak averaged weights, i.e., weights maintained as an exponential moving average of the online network parameters θ.
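A minimal sketch of Polyak (exponential moving) averaging of network parameters, as referenced above, follows; the decay value and the dictionary-of-parameters representation are assumptions.

```python
# A minimal sketch of Polyak averaging for the data-collection policy's weights.
POLYAK_DECAY = 0.999  # assumed example decay rate


def polyak_update(averaged_params, online_params, decay=POLYAK_DECAY):
    """Move each averaged parameter a small step toward the corresponding online parameter."""
    return {
        name: decay * averaged_params[name] + (1.0 - decay) * online_params[name]
        for name in online_params
    }
```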
To support offline training, a log replay job can be executed. This job reads data sequentially from disk for efficiency reasons. It replays saved episodes as if an online agent had collected that data. This enables seamlessly merging off-policy data with on-policy data collected by online agents. Offline data comes from all previously run experiments. In fully off-policy training, the policy can be trained by loading all data with the log replay job, enabling training without having to interact with the real environment.
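The log replay job described above can be sketched as follows. The on-disk episode format, the reader, and the replay buffer interface are assumptions for illustration.

```python
# A hedged sketch of a "log replay" job: saved episodes are read sequentially from disk
# and pushed into the replay buffer as if an online agent had just collected them.
import glob
import pickle


def log_replay(episode_dir, replay_buffer, buffer_name="offline buffer"):
    """Stream previously saved episodes from disk into the (offline) replay buffer."""
    for path in sorted(glob.glob(f"{episode_dir}/*.pkl")):   # sequential read for efficiency
        with open(path, "rb") as f:
            episode = pickle.load(f)                         # assumed: a list of transitions
        for transition in episode:
            replay_buffer.add(buffer_name, transition)       # same path as online data
```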
Despite the scale of the distributed replay buffer, the entire dataset may still not fit into memory. In order to be able to visit each datapoint uniformly, the Log Replay can be continuously run to refresh the in-memory data residing in the Replay Buffer.
Off-policy training can optionally be performed first to obtain a good initial policy, with a switch then made to on-policy joint fine-tuning. To do so, fully off-policy training can be performed by using the Log Replay job to replay episodes from prior experiments. After training off-policy for a sufficient period, QT-Opt can be restarted, training with a mix of on-policy and off-policy data.
Real on-policy data is generated by real robots, where the weights of the policy Q-network executed on the robots are periodically refreshed with the latest trained parameters (e.g., the Polyak averaged weights described above).
Since the real robots can stop unexpectedly (e.g., due to hardware faults), data collection can be sporadic, potentially with delays of hours or more if a fault occurs without any operator present. This can unexpectedly cause a significant reduction in the rate of data collection. To mitigate this, on-policy training can also be gated by a training balancer, which enforces a fixed ratio between the number of joint fine-tuning gradient update steps and the number of on-policy transitions collected. The ratio can be defined relative to the speed of the GPUs and of the robots, which can change over time.
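The training balancer described above can be sketched as a simple gate. The ratio value and counter plumbing are assumptions; the key point is that fine-tuning updates pause when fresh on-policy data stops arriving.

```python
# A hedged sketch of a "training balancer": on-policy joint fine-tuning steps are allowed
# only while the ratio of gradient update steps to on-policy transitions collected stays
# at or below a fixed target. The ratio value is an illustrative assumption.
TARGET_STEPS_PER_TRANSITION = 0.25  # assumed example: at most 1 update per 4 new transitions


def may_take_gradient_step(num_gradient_steps: int,
                           num_onpolicy_transitions: int,
                           target_ratio: float = TARGET_STEPS_PER_TRANSITION) -> bool:
    """Gate fine-tuning updates on the amount of freshly collected on-policy data."""
    if num_onpolicy_transitions == 0:
        return False  # no new data yet (e.g., robots stopped due to a hardware fault)
    return (num_gradient_steps / num_onpolicy_transitions) <= target_ratio
```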
In various implementations, a target network can be utilized to stabilize deep Q-learning. Since target network parameters typically lag behind the online network when computing the TD error, the Bellman backup can be performed asynchronously in a separate process. The target value r(s, a)+γV(s′) can be computed in parallel on separate CPU machines, with the output of those computations stored in an additional buffer (the “train buffer”).
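A hedged sketch of one step of such a “Bellman updater” job follows. The buffer and network interfaces, and the discount factor, are assumptions; the maximization over actions corresponds to the CEM procedure described above.

```python
# A hedged sketch of a "Bellman updater" step: pull a transition from the replay buffer,
# compute the target r(s, a) + γ·V(s') with the most recently loaded (lagged) target
# network, and push the labeled sample into the "train buffer".
GAMMA = 0.9  # assumed example discount factor


def bellman_updater_step(replay_buffer, train_buffer, target_q_fn, maximize_actions):
    """Label one sampled transition with its target Q-value and store it."""
    state, action, reward, next_state, terminal = replay_buffer.sample()
    if terminal:
        target = reward
    else:
        best_next_action = maximize_actions(target_q_fn, next_state)   # e.g., via CEM
        target = reward + GAMMA * target_q_fn(next_state, best_next_action)
    # The labeled sample is what the training workers will regress Qθ(s, a) toward.
    train_buffer.add((state, action, target))
```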
Note that because several Bellman updater replicas are utilized, each replica will load a new target network at a different time. All replicas push their Bellman backups to the “train buffer” of the shared replay buffer. This makes the target Q-values effectively generated by an ensemble of recent target networks, sampled from an implicit distribution over recently used target network parameters.
The distributed replay buffer supports having named replay buffers, such as: “online buffer” that holds online data, “offline buffer” that holds offline data, and “train buffer” that stores Q-targets computed by the Bellman updater. The replay buffer interface supports weighted sampling from the named buffers, which is useful when doing on-policy joint fine-tuning. The distributed replay buffer is spread over multiple workers, which each contain a large quantity (e.g., thousands) of transitions. All buffers are FIFO buffers where old values are removed to make space for new ones if the buffer is full.
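The named sub-buffers, FIFO eviction, and weighted sampling described above can be sketched as follows. Capacities and the exact sampling interface are assumptions for illustration.

```python
# A hedged sketch of a replay buffer with named sub-buffers ("online buffer",
# "offline buffer", "train buffer"), FIFO eviction, and weighted sampling across names.
import random
from collections import deque


class NamedReplayBuffer:
    def __init__(self, capacity_per_buffer=100_000):
        # A deque with maxlen gives FIFO eviction: old values are dropped when full.
        self._buffers = {
            name: deque(maxlen=capacity_per_buffer)
            for name in ("online buffer", "offline buffer", "train buffer")
        }

    def add(self, name, item):
        self._buffers[name].append(item)

    def sample(self, weights):
        """Weighted sampling across named buffers, e.g., {"online buffer": 0.5, ...}."""
        names = [n for n in weights if self._buffers[n]]
        if not names:
            raise ValueError("all requested buffers are empty")
        name = random.choices(names, weights=[weights[n] for n in names], k=1)[0]
        return random.choice(list(self._buffers[name]))
```

Weighted sampling across the named buffers is what allows, for example, a fixed mix of on-policy and off-policy data during joint fine-tuning.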