Dexterous in-hand manipulation is the ability to move a grasped object from one pose to another desired pose. Humans routinely use in-hand manipulation to perform many tasks, such as re-orienting a tool from its initial grasped pose to a useful pose, securing a better grasp on the object, or exploring the shape of an unknown object, among other complex tasks. This makes robotic in-hand manipulation an important step towards the general goal of manipulating objects in cluttered and unstructured environments such as a kitchen or a warehouse, where such tasks are abundant. Despite significant advances in manipulation mechanics, hand design, and sensing, the problem of controlling dexterous hands for versatile in-hand manipulation remains a long-standing unsolved challenge.
Although reinforcement learning (RL) has been successful in demonstrating diverse in-hand manipulation skills both in simulation and on real hands, the policies are object-centric and require long training times. More importantly, these policies have not been demonstrated with arbitrary orientations of the hand, as it is expected that the palm supports the object during manipulation, a consequence of the policies being trained with the hand in a palm-up orientation to simplify training. The policies also require extensive external sensing involving multi-camera systems to track the fingers and/or the object, systems that are hard to deploy outside lab environments.
With regards to sensing, tactile feedback has strong potential for enabling versatile in-hand manipulation skills that resist perturbation forces and adapt to variations in object and contact properties, as well as to unmodelled differences such as transmission dynamics, inertia, and backlash that corrupt proprioceptive feedback. However, integrating tactile feedback with RL is a challenge in its own right. Besides the general difficulty of simulating the transduction involved, tactile feedback is often high-dimensional, which can prohibitively drive up the number of training samples required. Hence, prior works using RL for in-hand manipulation either entirely avoid using tactile feedback or consider tasks requiring fewer training samples, where it is feasible to learn directly on the real hand.
The disclosure described herein focuses on achieving arbitrary in-hand object re-orientation by combining in-grasp manipulation and finger-gaiting. In-grasp manipulation is a specific type of in-hand manipulation skill in which the fingertips re-orient the grasped object while maintaining contacts. The kinematic constraints of the hand while maintaining contacts restrict the range of re-orientation this skill can achieve on its own. Finger-gaiting, another in-hand manipulation skill, involves breaking and making contact to substitute one finger for another without dropping the object. It is an invaluable skill because it can overcome the kinematic constraints of in-grasp manipulation and achieve potentially limitless object re-orientation.
In one aspect, we combine the in-grasp manipulation and finger-gaiting in-hand manipulation skills for large in-hand re-orientation. Finger-gaiting involves making and breaking contact to achieve re-grasping and finger substitutions without dropping the object, which makes it an invaluable skill for large in-hand re-orientation: it does not face the kinematic constraints of in-grasp manipulation, and thus, in combination with in-grasp manipulation, we can achieve potentially limitless object re-orientation.
In another aspect, RL is also used to enable in-hand manipulation skills, but with a few significant differences. First and foremost, we train our policies to perform in-hand manipulation only using force-closed precision fingertip grasps (e.g., precision in-hand manipulation without requiring the presence of the palm underneath the object for support), and thus enable our policies to be used in arbitrary orientations of the hand. However, the task of learning to manipulate only via such precision grasps is a significantly harder problem: action randomization, responsible for exploration in reinforcement learning, often fails as the hand almost always drops the object. As a solution, we propose designing appropriate initial state distributions, initializing episode rollouts with a wide range of grasps as an alternative exploration mechanism. In addition, we train our policies to achieve continuous object re-orientation about a specified axis. First, we find this formulation to have significantly better sample efficiency for learning finger-gaiting. More importantly, it does not require knowledge of the absolute object pose, which in turn would require cumbersome external sensing. With this approach, we learn policies to rotate the object about the cardinal axes and combine them for arbitrary in-hand object re-orientation.
Critically, we learn our policies with low-dimensional contact location and force feedback, on the way towards future transfer of the policies to our hand equipped with tactile sensors that accurately predict contact location and force and are also easy to simulate (including noise). We also leave out object pose feedback, achieving in-hand manipulation purely using internal on-board sensing. We show that these policies are robust, i.e., they resist perturbations and cope with noise in feedback, and object-agnostic, i.e., they generalize to unseen objects, all while requiring significantly fewer samples to train.
According to one aspect, a system for generating a model-free reinforcement learning policy for a robotic hand for grasping an object is provided, including a processor; a memory; and a simulator implemented via the processor and the memory, performing: sampling a plurality of stable grasps relevant to reorienting the grasped object about a desired axis of rotation and using the stable grasps as initial states for collecting training trajectories; and learning finger-gaiting and finger-pivoting policies for each axis of rotation in the hand coordinate frame based on proprioceptive sensing in the robotic hand, wherein the finger-gaiting and finger-pivoting policies are implemented on the robotic hand.
In some embodiments, the sampling of a plurality of varied stable grasps comprises initializing the grasped object in a random pose and sampling a plurality of fingertip positions of the robotic hand. In some embodiments, the sampling is based on a number of fingertip contacts on the grasped object.
In some embodiments, the finger-gaiting and finger-pivoting policies for each axis of rotation are combined.
In some embodiments, the proprioceptive sensing provides current positions and controller set-point positions of the robotic hand. In some embodiments, the robotic hand is a fully-actuated and position-controlled robotic hand.
In some embodiments, a reward function associated with a critic of the simulator is based on the angular velocity of a grasped object along a desired axis of rotation. In some embodiments, a reward function associated with a critic of the simulator is based on the number of fingertip contacts on a grasped object and the separation between a desired and a current axis of rotation.
According to one aspect, a method for generating a model-free reinforcement learning policy for a robotic hand for grasping an object includes sampling a plurality of stable grasps relevant to reorienting the grasped object about a desired axis of rotation and using the stable grasps as initial states for collecting training trajectories; learning finger-gaiting and finger-pivoting policies for each axis of rotation in the hand coordinate frame based on proprioceptive sensing in the robotic hand; and implementing the finger-gaiting and finger-pivoting policies on the robotic hand.
In some embodiments, the sampling of a plurality of varied stable grasps comprises initializing the grasped object in a random pose and sampling a plurality of fingertip positions of the robotic hand. In some embodiments, the sampling is based on a number of fingertip contacts on the grasped object.
In some embodiments, the finger-gaiting and finger-pivoting policies for each axis of rotation are combined.
In some embodiments, the proprioceptive sensing provides current positions and controller set-point positions of the robotic hand.
In some embodiments, the method includes providing a reward function, associated with a critic of the simulator, that is based on the angular velocity of a grasped object along a desired axis of rotation. In some embodiments, the method includes providing a reward function, associated with a critic of the simulator, that is based on the number of fingertip contacts on a grasped object and the separation between a desired and a current axis of rotation.
According to one aspect, a robotic hand implementing a model-free reinforcement learning policy for grasping an object includes a processor; a memory storing finger-gaiting and finger-pivoting policies built by a simulator implemented via the processor and the memory by: sampling a plurality of stable grasps relevant to reorienting the grasped object about a desired axis of rotation and using the stable grasps as initial states for collecting training trajectories, and learning finger-gaiting and finger-pivoting policies for each axis of rotation in the hand coordinate frame based on proprioceptive sensing in the robotic hand; and a controller implementing the finger-gaiting and finger-pivoting policies on the robotic hand.
In some embodiments, the sampling of a plurality of varied stable grasps comprises initializing the grasped object in a random pose and sampling a plurality of fingertip positions of the robotic hand. In some embodiments, the sampling is based on a number of fingertip contacts on the grasped object.
In some embodiments, the finger-gaiting and finger-pivoting policies for each axis of rotation are combined.
In some embodiments, the proprioceptive sensing provides current positions and controller set-point positions of the robotic hand. In some embodiments, the robotic hand is a fully-actuated and position-controlled robotic hand.
In some embodiments, a reward function associated with a critic of the simulator is based on the angular velocity of a grasped object along a desired axis of rotation. In some embodiments, a reward function associated with a critic of the simulator is based on the number of fingertip contacts on a grasped object and the separation between a desired and a current axis of rotation.
The following includes definitions of selected terms employed herein. The definitions include various examples and/or forms of components that fall within the scope of a term and that may be used for implementation. The examples are not intended to be limiting. Further, one having ordinary skill in the art will appreciate that the components discussed herein, may be combined, omitted or organized with other components or organized into different architectures.
A “processor”, as used herein, processes signals and performs general computing and arithmetic functions. Signals processed by the processor may include digital signals, data signals, computer instructions, processor instructions, messages, a bit, a bit stream, or other means that may be received, transmitted, and/or detected. Generally, the processor may be a variety of various processors including multiple single and multicore processors and co-processors and other multiple single and multicore processor and co-processor architectures. The processor may include various modules to execute various functions.
A “memory”, as used herein, may include volatile memory and/or non-volatile memory. Non-volatile memory may include, for example, ROM (read only memory), PROM (programmable read only memory), EPROM (erasable PROM), and EEPROM (electrically erasable PROM). Volatile memory may include, for example, RAM (random access memory), synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), and direct RAM bus RAM (DRRAM). The memory may store an operating system that controls or allocates resources of a computing device.
A “disk” or “drive”, as used herein, may be a magnetic disk drive, a solid state disk drive, a floppy disk drive, a tape drive, a Zip drive, a flash memory card, and/or a memory stick. Furthermore, the disk may be a CD-ROM (compact disk ROM), a CD recordable drive (CD-R drive), a CD rewritable drive (CD-RW drive), and/or a digital video ROM drive (DVD-ROM). The disk may store an operating system that controls or allocates resources of a computing device.
A “bus”, as used herein, refers to an interconnected architecture that is operably connected to other computer components inside a computer or between computers. The bus may transfer data between the computer components. The bus may be a memory bus, a memory controller, a peripheral bus, an external bus, a crossbar switch, and/or a local bus, among others. The bus may also interconnect components using protocols such as Media Oriented Systems Transport (MOST), Controller Area Network (CAN), and Local Interconnect Network (LIN), among others.
A “database”, as used herein, may refer to a table, a set of tables, and a set of data stores (e.g., disks) and/or methods for accessing and/or manipulating those data stores.
An “operable connection”, or a connection by which entities are “operably connected”, is one in which signals, physical communications, and/or logical communications may be sent and/or received. An operable connection may include a wireless interface, a physical interface, a data interface, and/or an electrical interface.
A “computer communication”, as used herein, refers to a communication between two or more computing devices (e.g., computer, personal digital assistant, cellular telephone, network device) and may be, for example, a network transfer, a file transfer, an applet transfer, an email, a hypertext transfer protocol (HTTP) transfer, and so on. A computer communication may occur across, for example, a wireless system (e.g., IEEE 802.11), an Ethernet system (e.g., IEEE 802.3), a token ring system (e.g., IEEE 802.5), a local area network (LAN), a wide area network (WAN), a point-to-point system, a circuit switching system, a packet switching system, among others.
An “agent”, as used herein, may refer to a “robotic hand.” Additionally, “setting” as used herein, may be used interchangeably with “environment”. A “feature” as used herein, may include a goal.
The aspects discussed herein may be described and implemented in the context of non-transitory computer-readable storage medium storing computer-executable instructions. Non-transitory computer-readable storage media include computer storage media and communication media. For example, flash memory drives, digital versatile discs (DVDs), compact discs (CDs), floppy disks, and tape cassettes. Non-transitory computer-readable storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, modules, or other data.
The simulator 130 or the processor 110 may generate a policy network 144, which may be stored on the memory 120 of the system 100 for generating a model-free reinforcement learning policy. The system may optionally further include a communication interface 146 which enables the policy network 144 to be transmitted to other devices, such as a server 150, which may include a database 152. In this way, the policy network 144 generated by the system 100 for generating a model-free reinforcement learning policy may be stored on the database 152 of the server 150. Discussion providing greater detail associated with the building of the policy network 144 may be provided herein (e.g.,
The server may then optionally propagate the policy network 144 to one or more robotic hands 160. The robotic hand may be equipped with a communication interface 162, a storage device 164, a controller 166, and one or more finger manipulation systems 678, which may include actuators and/or sensors, for example. In some embodiments, the hands include the tactile sensors described in U.S. Pat. No. 10,663,362, which is incorporated by reference in its entirety herein. The storage device 164 may store the policy network 144 from the server, and the controller may operate the robotic hand in an autonomous fashion based on the policy network 144. In this way, the sensors of the robotic hand(s) may detect grasped objects and provide those as inputs (e.g., observations) to the policy network 144 developed by the simulator 130, which may then provide a suggested action for the robotic hand(s).
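As an illustration of this data flow, consider the minimal closed-loop sketch below. It is a sketch only, written against generic placeholder interfaces (read_sensors, command_setpoints, current_setpoints) that we assume for illustration; it is not the actual API of the hand, controller, or policy network described herein.

import numpy as np

def run_policy_on_hand(policy, hand, steps=1000):
    """Hypothetical deployment loop: sensors -> observation -> policy -> controller."""
    for _ in range(steps):
        # Proprioceptive and tactile readings from the hand's internal sensors.
        observation = hand.read_sensors()  # placeholder interface returning a dict of arrays
        obs_vector = np.concatenate([observation[key] for key in sorted(observation)])

        # The stored policy network suggests an action (joint set-point changes).
        delta_setpoints = policy(obs_vector)

        # The controller applies the updated set-points to the finger actuators.
        hand.command_setpoints(hand.current_setpoints() + delta_setpoints)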
In reinforcement learning, a model may refer to the different dynamic states of an environment and how these states lead to a reward. A policy may be a strategy generated to determine actions to take based on a current state. The overall outcome of reinforcement learning (or other types of learning) may be to develop a policy. Explained again, the policy may be a series of behaviors or actions to take when presented with a specific domain. Reinforcement may be applied by continually re-running or re-executing the learning process based on the results of prior learning, effectively updating an old policy with a newer policy to learn from the results and to improve the policy. In model-based reinforcement learning, a model may be utilized to represent the environment or domain to indicate states and possible actions. By knowing states, the policies may target these states and actions specifically in each repetition cycle, testing and improving the accuracy of the policy, to improve the quality of the model. The policy, on the other hand, may be the learnings on the behaviors, whereas the model may include the facts or scenario states that back up and confirm the learnings. According to one aspect, model-free reinforcement learning may be provided to build the policy. The policy may take information associated with grasping an object and output a suggested action for the robotic hand, such as finger-gaiting and finger-pivoting, for example.
For in-hand manipulation, we use model-free deep reinforcement learning (RL), including learning finger-gaiting (manipulation involving finger substitution and re-grasping) and finger-pivoting (manipulation involving holding the object in a hinge grasp) skills. Both skills are important for enabling large-angle in-hand object re-orientation: achieving an arbitrarily large rotation of the grasped object around a given axis, up to or even exceeding a full revolution. Such a task is generally not achievable by in-grasp manipulation (e.g., without breaking the contacts of the original grasp) and requires finger-gaiting or finger-pivoting (e.g., breaking and re-establishing contacts during manipulation); these are not restricted by the kinematic constraints of the hand and can achieve potentially limitless object re-orientation.
We are interested in achieving these skills exclusively through using fingertip grasps (e.g., precision in-hand manipulation) without requiring the presence of the palm underneath the object, which enables the policies to be used in arbitrary orientations of the hand. However, the task of learning to manipulate only via such precision grasps is a significantly harder problem: action randomization, responsible for exploration in RL, often fails as the hand can easily drop the object.
Furthermore, we would like to circumvent the need for cumbersome external sensing by only using internal sensing in achieving these skills. The challenge here is that the absence of external sensing implies we do not have information regarding the object such as its global shape and pose. However, internal sensing by itself can provide object information sufficient towards our goal.
Finger-gaiting and finger-pivoting skills can be achieved purely through intrinsic sensing in simulation, where we evaluate both proprioceptive feedback and tactile feedback. To this end, we consider the task of continuous object re-orientation about a given axis, aiming to learn finger-gaiting and finger-pivoting without object pose information. With this approach, we learn policies to rotate the object about the cardinal axes and combine them for arbitrary in-hand object re-orientation. To overcome challenges in exploration, we collect training trajectories starting from a wide range of grasps sampled from appropriately designed initial state distributions as an alternative exploration mechanism.
We learn finger-gaiting and finger-pivoting policies that can achieve large angle in-hand re-orientation of a range of simulated objects. Our policies learn to grasp and manipulate only via precision fingertip grasps using a highly dexterous and fully actuated hand, allowing us to keep the object in a stable grasp without the need for passive support at any instance during manipulation. We achieve these skills by making use of only intrinsic sensing such as proprioception and touch, while also generalizing to multiple object shapes. We present an exhaustive analysis of the importance of different internal sensor feedback for learning finger-gaiting and finger-pivoting policies in a simulated environment using our approach.
While a whole spectrum of methods have been considered for in-hand manipulation, online trajectory optimization methods and model-free deep RL methods stand out for highly actuated dexterous hands. Model-based online trajectory optimization methods have been exceptional in generating complex behaviors for dexterous robotic manipulation in general, but not for in-hand manipulation as these tasks fatally exacerbate their limitations: transient contacts introduce large non-linearities in the model, which also depends on hard-to-model contact properties.
Early works on finger-gaiting and finger-pivoting generally make simplifying assumptions such as 2D manipulation, accurate models, and smooth object geometries, which limit their versatility. Fan et al. and Sundaralingam et al. use model-based online optimization and demonstrate finger-gaiting in simulation. These methods either use smooth objects or require accurate kinematic models of the object, and they are also challenging to transfer to a real hand.
OpenAI et al. demonstrate finger-gaiting and finger-pivoting using RL, but as previously discussed, their policies cannot be used for arbitrary orientations of the hand. This can be achieved using only force-closed precision fingertip grasps, but learning in-hand manipulation using only these grasps is challenging, and few prior works address it. Li et al. learn 2D re-orientation using model-based controllers to ensure grasp stability in simulation. Veiga et al. demonstrate in-hand re-orientation with only fingertips, but these object-centric policies are limited to small re-orientations via in-grasp manipulation and still require external sensing. Shi et al. demonstrate precision finger-gaiting but only on a lightweight ball. Morgan et al. also show precision finger-gaiting, but with an under-actuated hand specifically designed for this task. We consider finger-gaiting with a highly actuated hand; our problem is considerably harder due to the increased degrees of freedom, leading to poor sample complexity.
Some prior works use human expert trajectories to improve sample complexity for dexterous manipulation. However, these expert demonstrations are hard to obtain for precision in-hand manipulation tasks, and even more so for non-anthropomorphic hands. Alternatively, model-based RL has also been considered for some in-hand manipulation tasks: Nagabandi et al. manipulate Baoding balls but use the palm for support; Morgan et al. learn finger-gaiting but with a task-specific underactuated hand. However, learning a reliable forward model for precision in-hand manipulation with a fully dexterous hand can be challenging. Collecting data involves random exploration, which, as we discuss later in this paper, has difficulty exploring in this domain.
Prior works using model-free RL for manipulation rarely use tactile feedback, as the tactile sensing available on real hands is often high-dimensional and hard to simulate. Hence, Hoof et al. propose learning directly on the real hand, but this naturally limits us to tasks learnable on the real hand. Alternatively, Veiga et al. learn a higher-level policy through RL, while low-level controllers exclusively deal with tactile feedback. However, this method deprives the policy of tactile feedback that can be beneficial in other challenging tasks. While Melnik et al. show that using tactile feedback improves sample complexity in such tasks, they use high-dimensional tactile feedback with full coverage of the hand that is hard to replicate on a real hand. We instead consider low-dimensional tactile feedback covering only the fingertips.
Contemporary to our work, Chen et al. also show in-hand re-orientation without support surfaces that generalizes to novel objects. The policies exhibit complex dynamic behaviors including occasionally throwing the object and regrasping it in the desired orientation. We differ from this work as our policies only use sensing that is internal to the hand, and always keep the object in a stable grasp to be robust to perturbation forces at all times. Furthermore, our policies require a number of training samples that is smaller by multiple orders of magnitude, a feature that we attribute to efficient exploration via appropriate initial state distributions.
In the present subject matter, we address two important challenges for precision in-hand re-orientation using reinforcement learning. First, a hand-centric decomposition method achieves arbitrary in-hand re-orientation in an object agnostic fashion. Next, collecting training trajectories starting at varied stable grasps alleviates the challenge of exploration for learning precision in-hand manipulation skills. We use these grasps to design appropriate initial state distributions for training. Our approach assumes a fully-actuated and position-controlled (torque-limited) hand.
Our method relies on intrinsic sensing and performs manipulation in a general fashion without assuming object knowledge. Thus, we operate in a hand-centric way: we learn to rotate the object around axes grounded in the hand frame. This means we do not need external tracking (which presumably needs to be trained for each individual object) to provide the object pose. We also find that rewarding angular velocity about the desired axis of rotation is conducive to learning finger-gaiting and finger-pivoting policies. However, learning a single policy for any arbitrary axis is challenging, as it involves learning goal-conditioned policies, which is difficult for model-free RL.
Our proposed method for wide arbitrary in-hand reorientation is to decompose the problem of achieving arbitrary angular velocity of the object into learning separate policies about the cardinal axes as shown in
We assume that proprioceptive sensing can provide current positions q and controller set-point positions qd. We note that the combination of desired positions and current positions can be considered as a proxy for motor forces, if the characteristics of the underlying controller are fixed. More importantly, we assume tactile sensing to provide absolute contact positions c_i ∈ ℝ³ and normal forces t^n_i ∈ ℝ on each fingertip i. With known fingertip geometry, the contact normals t̂^n_i ∈ ℝ³ can be derived from the contact positions c_i.
Our axis-specific re-orientation policies are conditioned only on proprioceptive and tactile feedback, as given by the observation vector

o = [q, qd, c_1, …, c_m, t̂^n_1, …, t̂^n_m, t^n_1, …, t^n_m]

where m is the number of fingers. Our policies command set-point changes Δqd.
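For illustration only, the sketch below assembles such an observation from proprioceptive and tactile readings. The spherical-fingertip approximation used to recover contact normals from contact positions is our assumption; the actual derivation depends on the specific fingertip geometry of the hand.

import numpy as np

def contact_normal(contact_pos, fingertip_center):
    """Contact normal from contact position, assuming a (roughly) spherical fingertip."""
    n = np.asarray(contact_pos, dtype=float) - np.asarray(fingertip_center, dtype=float)
    return n / (np.linalg.norm(n) + 1e-8)

def build_observation(q, qd, contact_positions, fingertip_centers, normal_forces):
    """Assemble the observation o from joint positions, set-points, and tactile feedback."""
    normals = [contact_normal(c, f) for c, f in zip(contact_positions, fingertip_centers)]
    return np.concatenate([
        np.asarray(q, dtype=float),    # current joint positions q
        np.asarray(qd, dtype=float),   # controller set-point positions qd
        np.concatenate([np.asarray(c, dtype=float) for c in contact_positions]),  # contact positions c_i
        np.concatenate(normals),       # derived contact normals (unit vectors)
        np.asarray(normal_forces, dtype=float),  # normal force magnitudes
    ])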
We now describe the procedure for learning in-hand re-orientation policies for an arbitrary but fixed axis. Let k̂ be the desired axis of rotation. To learn an axis-specific policy π_k̂ that continuously re-orients the object about the desired axis k̂, we use the object's angular velocity ω along k̂ as the reward, as shown in
The reward function is described in (2), where n_c is the number of fingertip contacts and φ is the separation between the desired and current axis of rotation. The symbols ∧, ∨, and I denote logical and, logical or, and the indicator function, respectively. We also use reward clipping to avoid local optima and idiosyncratic behaviors. In our setup, r_max and φ_max are both set to 0.5. Although the reward uses the object's angular velocity, we do not need additional sensing to measure it, as we only train in simulation with the intent of zero-shot transfer to hardware.
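The exact form of (2) is not reproduced above; the sketch below gives one consistent reading of it, in which the object's angular velocity along k̂ is clipped at r_max and gated by the contact-count and axis-deviation conditions. The minimum contact count and the choice of zero reward when those conditions fail are our assumptions.

import numpy as np

R_MAX = 0.5    # reward clipping threshold
PHI_MAX = 0.5  # maximum allowed separation between desired and current rotation axis (rad)

def reorientation_reward(omega, k_hat, n_contacts, min_contacts=2):
    """Reward for continuous re-orientation about the desired axis k_hat.

    omega      : object angular velocity (3,) in the hand frame
    k_hat      : desired (unit) axis of rotation in the hand frame
    n_contacts : number of fingertip contacts on the object
    """
    omega = np.asarray(omega, dtype=float)
    k_hat = np.asarray(k_hat, dtype=float)

    speed = np.linalg.norm(omega)
    # phi: angle between the current rotation axis and the desired axis
    phi = np.arccos(np.clip(omega @ k_hat / (speed + 1e-8), -1.0, 1.0))

    # Indicator: enough fingertip contacts and rotation axis close to k_hat
    valid = (n_contacts >= min_contacts) and (phi <= PHI_MAX)

    # Clipped angular velocity along the desired axis, gated by the indicator
    return float(valid) * float(np.clip(omega @ k_hat, -R_MAX, R_MAX))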
Enabling Exploration with Domain Knowledge
A fundamental issue in using reinforcement learning for learning precision in-hand manipulation skills is that a random exploratory action can easily disturb the stability of the object held in precision grasp, causing it to be dropped. This difficulty is particularly acute for finger-gaiting, which requires fingertips to break contact with the object and transition between different grasps, involving different fingertips, all while re-orienting the object. Intuitively, the likelihood of selecting a sequence of random actions that can accomplish this feat and obtain a useful reward signal is very low.
For a policy to learn finger-gaiting, it must encounter these diverse grasps within its training samples so that the policy's action distributions can improve at these states. Consider taking a sequence of random actions starting from a stable l-finger grasp. While it is possible to reach a stable grasp with an additional finger in contact (if available), it is more likely to lose one finger contact, then another, and so on until the object is dropped. Over multiple trials, we can expect to encounter most combinations of grasps with fewer than l fingers in contact. In this setting, it can be argued that starting from a stable grasp with all m fingers in contact leads to maximum exploration.
Our insight is to observe that through domain knowledge we are already aware of the states that a sufficiently exploratory policy might visit. Thus, we use our knowledge of relevant states in designing the initial states used for episode rollouts and show that it is critical for learning precision finger-gaiting and finger-pivoting.
We sample sufficiently-varied stable grasps relevant to re-orienting the object about the desired axis and use them as initial states for collecting training trajectories. These grasps must be well distributed in terms of number of contacts, contact positions relative to the object, and object poses relevant to the task. To this end, we first initialize the object in a random pose and then sample fingertip positions until we find a stable grasp as described in Stable Grasp Sampling (SGS) in Alg. 1, below:
Input: object
Output: simulator state of the sampled grasp
In SGS, we first sample an object pose and a hand pose, then update the simulator with the sampled poses towards obtaining a grasp. We advance the simulation for a short duration, ts, to let any transients die down. If the object has settled into a grasp with at least two contacts, the pose is used for collecting training trajectories. Note that the fingertips could be overlapping with the object or with each other, as we do not explicitly check for this. However, due to the soft-contact model used by the simulator (Emanuel Todorov, Tom Erez, and Yuval Tassa. “MuJoCo: A physics engine for model-based control”. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. October 2012, pp. 5026-5033), the inter-penetrations are resolved during simulation. An illustrative set of grasps sampled by SGS is shown in
To sample the hand pose, we start by sampling fingertip locations within an annulus that is centered on, and partially overlaps with, the object, as shown in
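A minimal sketch of SGS along these lines is given below, written against a generic simulator interface. The helper names (sample_random_object_pose, sample_fingertips_in_annulus, inverse_kinematics, reset_pose, advance, count_fingertip_contacts, object_dropped, get_state) and the settling time are placeholders and assumptions on our part, not the exact API or parameters used.

import numpy as np

T_SETTLE = 0.5    # seconds to let transients die down (assumed value for ts)
MIN_CONTACTS = 2  # grasp accepted with at least two fingertip contacts

def sample_stable_grasp(sim, rng=np.random):
    """Stable Grasp Sampling (SGS): return a simulator state usable as an initial state."""
    while True:
        # 1. Sample an object pose (random orientation, nominal position in the workspace).
        object_pose = sim.sample_random_object_pose(rng)

        # 2. Sample a hand pose: fingertip targets drawn from an annulus that partially
        #    overlaps the object, mapped to joint positions via inverse kinematics.
        fingertip_targets = sim.sample_fingertips_in_annulus(object_pose, rng)
        joint_positions = sim.inverse_kinematics(fingertip_targets)

        # 3. Write the sampled poses into the simulator and let it settle; the soft-contact
        #    model resolves any initial inter-penetrations.
        sim.reset_pose(object_pose, joint_positions)
        sim.advance(T_SETTLE)

        # 4. Accept the grasp if the object settled into enough fingertip contacts.
        if sim.count_fingertip_contacts() >= MIN_CONTACTS and not sim.object_dropped():
            return sim.get_state()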
For evaluating our method, we focus on learning precision in-hand re-orientation about the z- and x-axes for a range of regular object shapes. (The y-axis is similar to the x-axis, given the symmetry of our hand model.) Our object set, which consists of a cylinder, sphere, icosahedron, dodecahedron and cube, is designed so that we have objects of varying difficulty, with the sphere and cube being the easiest and hardest, respectively. For training, we use PPO (John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov. “Proximal Policy Optimization Algorithms”. July 2017). We chose PPO over other state-of-the-art methods such as SAC primarily for training stability.
For the following analysis, we use z-axis re-orientation as a case study. In addition to the above, we also train z-axis re-orientation policies without assuming joint set-point feedback qd. For all these policies, we study their robustness properties by adding noise and also by applying perturbation forces on the object. We also study the zero-shot generalization properties of these policies. Finally, through ablation studies we present a detailed analysis ascertaining the importance of different components of feedback for achieving finger-pivoting.
We note that, in simulation, the combination of qd and q can be considered a good proxy for torque, since simulated controllers have stable and known stiffness. However, this feature might not transfer to a real hand, where transmissions exhibit friction, stiction and other hard to model effects. We thus evaluate our policies both with and without joint set-point observations.
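As a concrete illustration of this proxy (an assumption about a stiffness-type position controller, not the exact controller model of our hand):

import numpy as np

def torque_proxy(q, qd, kp):
    """Approximate joint torque for a stiffness-based position controller.

    q  : current joint positions
    qd : controller set-point positions
    kp : per-joint stiffness gains, assumed fixed and known in simulation
    """
    return np.asarray(kp) * (np.asarray(qd) - np.asarray(q))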
However, when using a wide initial distribution of grasps (sampled via SGS), the policies learn finger-gaiting and achieve continuous re-orientation of the object with significantly higher returns. With our approach, we also learn finger-pivoting for re-orientation about the x-axis, with learning curves shown in
As expected, the difficulty of rotating the objects increases as we consider objects of lower rotational symmetry, from sphere to cube. In the training curves in
We also successfully learn policies for in-hand re-orientation without joint set-point position feedback, but these policies achieve slightly lower returns. However, they may have interesting consequences for generalization as we will discuss in Sec IV-C.
In particular, our policies show little drop in performance for noise in joint positions, but are more sensitive to noise in contact feedback. Nevertheless, they are still robust, and achieve high returns even at 5 mm error in contact position and 25% error in contact force. Interestingly, for noise in contact position, we found that the drop in performance arises indirectly through the error in the contact normal t̂^n_i (computed from the contact position c_i). As for perturbation forces on the object, we observe high returns even for high perturbation forces (1 N) equivalent to the weight of our objects. Our policies are robust even without joint set-point qd feedback, with similar robustness profiles.
We study the generalization properties of our policies by evaluating them on the different objects in the object set. We consider the transfer score, i.e., the ratio R_ij/R_jj, where R_ij is the average return obtained when evaluating the policy learned with object i on object j, and R_jj is the average return of the policy trained and evaluated on object j.
We are particularly interested to discover what aspects matter most in contact feedback. To answer such questions, we run a series of ablations holding out different components. For this, we again consider learning finger-gaiting on the cube as shown in
Based on this ablation study, we can make a number of observations. As expected, contact feedback is essential for learning in-hand re-orientation via finger-gaiting; the policy does not learn finger-gaiting with just proprioceptive feedback (#4). More interesting, and also more surprising, is that explicitly computing the contact normal t̂^n_i and providing it as feedback is critical when excluding the joint position set-point qd (#6 to #10). In fact, the policy learns finger-gaiting with just contact normal and joint position feedback (#10). However, while not critical, contact position and force feedback are still beneficial as they improve sample efficiency (#6, #7).
The techniques described herein focus on the problem of learning in-hand manipulation policies that can achieve large-angle object re-orientation via finger-gaiting. We consider sensing modalities intrinsic to the hand, such as touch and proprioception, with no external vision or tracking sensor providing object-specific information. Furthermore, we aim for policies that can achieve manipulation skills without using a palm or other surfaces for passive support, and which instead need to maintain the object in a stable grasp.
A component of our approach described herein is the use of appropriate initial state distributions during training, used to alleviate the intrinsic instability of precision grasping. We also decompose the manipulation problem into axis-specific rotation policies in the hand coordinate frame, allowing for object-agnostic policies. Combining these, we are able to achieve the desired skills in a simulated environment, the first instance in the literature of such policies being successfully trained with intrinsic sensor data.
Still another aspect involves a computer-readable medium including processor-executable instructions configured to implement one aspect of the techniques presented herein. An aspect of a computer-readable medium or a computer-readable device includes a computer-readable medium, such as a CD-R, DVD-R, flash drive, a platter of a hard disk drive, etc., on which is encoded computer-readable data. This encoded computer-readable data, such as binary data including a plurality of zeros and ones, in turn includes a set of processor-executable computer instructions configured to operate according to one or more of the principles set forth herein. In this implementation, the processor-executable computer instructions may be configured to perform a method, such as the method 500 of
As used in this application, the terms “component”, “module,” “system”, “interface”, and the like are generally intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processing unit, an object, an executable, a thread of execution, a program, or a computer. By way of illustration, both an application running on a controller and the controller may be a component. One or more components residing within a process or thread of execution and a component may be localized on one computer or distributed between two or more computers.
Further, the claimed subject matter is implemented as a method, apparatus, or article of manufacture using standard programming or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. The term “article of manufacture” as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. Of course, many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.
Generally, aspects are described in the general context of “computer readable instructions” being executed by one or more computing devices. Computer readable instructions may be distributed via computer readable media as will be discussed below. Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform one or more tasks or implement one or more abstract data types. Typically, the functionality of the computer readable instructions is combined or distributed as desired in various environments.
In other aspects, the computing device 510 includes additional features or functionality. For example, the computing device 510 may include additional storage such as removable storage or non-removable storage, including, but not limited to, magnetic storage, optical storage, etc. Such additional storage is illustrated in
The term “computer readable media” as used herein includes computer storage media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions or other data. Memory 514 and storage 518 are examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which may be used to store the desired information and which may be accessed by the computing device 510. Any such computer storage media is part of the computing device 510.
The term “computer readable media” includes communication media. Communication media typically embodies computer readable instructions or other data in a “modulated data signal” such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” includes a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
The computing device 510 may include input device(s) 522 such as keyboard, mouse, pen, voice input device, touch input device, infrared cameras, video input devices, or any other input device. Output device(s) 520 such as one or more displays, speakers, printers, or any other output device may be included with the computing device 510. Input device(s) and output device(s) may be connected to the computing device 510 via a wired connection, wireless connection, or any combination thereof. In one aspect, an input device or an output device from another computing device may be used as input device(s) 522 or output device(s) 524 for the computing device 510. The computing device 510 may include communication connection(s) 524 to facilitate communications with one or more other devices 540, such as through network 530, for example.
Reinforcement Learning (RL) of robot sensorimotor control policies has seen great advances in recent years, demonstrated for a wide range of motor tasks. In the case of manipulation, this has translated into higher levels of dexterity than previously possible, typically demonstrated by the ability to re-orient a grasped object in-hand using complex finger movements. However, training a sensorimotor policy is still a difficult process, particularly for problems where the underlying state space exhibits complex structure, such as “narrow passages” between the parts of the space that are accessible or useful. Manipulation is indeed such a problem: even when starting with the object secured between the digits, a random action can easily lead to a drop, and thus to an irrecoverable state. Finger-gaiting further implies transitions between different subsets of fingers used to hold the object, all while maintaining stability. This leads to difficulty in exploration during training, since random perturbations in the policy action space are unlikely to discover narrow passages in state space. Current studies address this difficulty through a variety of means: using simple, convex objects to reduce the difficulty of the task, reliance on support surfaces to reduce the chances of a drop, object pose tracking through extrinsic sensing, etc.
The difficulty of exploring problems with labyrinthine state space structure is far from new in robotics. In fact, the large and highly effective family of Sampling-Based Planning (SBP) algorithms was developed in the field to address this exact problem. By expanding a known structure towards targets randomly sampled in the state space of the problem (as opposed to the action space of the agent), SBP methods can explore even very high-dimensional state spaces in ways that are probabilistically complete, or guaranteed to converge to optimal trajectories. However, SBP algorithms are traditionally designed to find trajectories rather than policies. For problems with computationally demanding dynamics, SBP cannot be used online for previously unseen start states, or to quickly correct when unexpected perturbations are encountered along the way.
In this paper, we draw on the strength of both RL and SBP methods in order to train motor control policies for in-hand manipulation with finger gaiting. We aim to manipulate more difficult objects, including concave shapes, while securing them at all times without relying on support surfaces. Furthermore, we aim to achieve large re-orientation of the grasped object with purely intrinsic (tactile and proprioceptive) sensing. To achieve that, we explore multiple variants of the non-holonomic RRT algorithm with added constraints to find (approximate) trajectories that explore the useful parts of the problem state space. Then, we use these trajectories as reset distributions to train complete RL policies based on the full dynamics of the problem. Overall, the main contributions of this work include:
Exploration methods for general RL operate under the strict assumption that the learning agent cannot teleport between states, mimicking the constraints of the real world. Under such constraints, proposed exploration methods include using intrinsic rewards or improving action consistency via temporally correlated noise in policy actions or parameter space noise.
Fortunately, in cases where the policies are primarily trained in simulation, this requirement can be relaxed, and we can use our knowledge of the relevant state space to design effective exploration strategies. A number of these methods improve exploration by injecting useful states into the reset distribution during training. Nair et al. use states from human demonstrations in a block stacking task, while Ecoffet et al. use states previously visited by the learning agent itself for problems such as Atari games and robot motion planning. Tavakoli et al. evaluate various schemes for maintaining and resetting from the buffer of visited states. However, these schemes were evaluated only on benchmark continuous control tasks. From a theoretical perspective, Agarwal et al. show that a favorable reset state distribution provides a means to circumvent worst-case exploration issues, using sample complexity analysis of policy gradients.
Finding feasible trajectories through a complex state space is a well-studied motion planning problem. Of particular interest to us are sampling-based methods such as Rapidly exploring Random Trees (RRT) and Probabilistic Road Maps (PRM). These families of methods have proven highly effective, and are still being expanded. Stable Sparse-RRT (SST) and its optimal variant SST* are examples of recent sampling-based methods for high-dimensional motion planning with physics. However, the goal of these methods is finding (kinodynamic) trajectories between known start and goal states, rather than closed-loop control policies which can handle deviations from the expected states.
Several approaches have tried to combine the exploratory ability of SBP with RL, leveraging planning for global exploration while learning a local control policy via RL. These methods were primarily developed for and tested on navigation tasks, where nearby state space samples are generally easy to connect by an RL agent acting as a local planner. The LeaPER algorithm also uses plans obtained by RRT as reset state distribution and learns policies for simple non-prehensile manipulation. However, the state space for the prehensile in-hand manipulation tasks we show here is highly constrained, with small useful regions and non-holonomic transitions. Other approaches use trajectories planned by SBP as expert demonstrations for RL, but this requires that planned trajectories also include the actions used to achieve transitions, which SBP does not always provide. Alternatively, Jurgenson et al. and Ha et al. use planned trajectories in the replay buffer of an off-policy RL agent for multi-arm motion planning. However, it is unclear how off-policy RL can be combined with the extensive physics parallelism that has been vital in the recent success of on-policy methods for learning manipulation.
Turning specifically to the problem of dexterous manipulation, a number of methods have been used to advance the state of the art, including planning, learning, and leveraging mechanical properties of the manipulator. Leveroni et al. build a map of valid grasps and use search methods to generate gaits for planar reorientation, while Han et al. consider finger-gaiting of a sphere and identify the non-holonomic nature of the problem. Some methods have also considered RRT for finger-gaiting in-hand manipulation, but limited to simulation for a spherical object. More recently, Morgan et al. demonstrate robust finger-gaiting for object reorientation using actor-critic reinforcement learning and multi-modal motion planning, both in conjunction with a compliant, highly underactuated hand designed explicitly for this task. Bhatt et al. also demonstrate robust finger-gaiting and finger-pivoting manipulation with a soft, compliant hand, but these skills were hand-designed and executed in an open-loop fashion rather than autonomously learned.
Model-free RL has also led to significant progress in dexterous manipulation, starting with OpenAI's demonstration of finger-gaiting and finger-pivoting, trained in simulation and translated to real hardware. However, this approach uses extensive extrinsic sensing infeasible outside the lab, and relies on support surfaces such as the palm underneath the object. Khandate et al. show dexterous finger-gaiting and finger-pivoting skills using only precision fingertip grasps to enable both palm-up and palm-down operation, but only on a range of simple convex shapes and in a simulated environment. Makoviychuk et al. showed that GPU physics could be used to accelerate learning skills similar to OpenAI's. Allshire et al. used extensive domain randomization and sim-to-real transfer to re-orient a cube, but used the tabletop as an external support surface. Chen et al. demonstrated in-hand re-orientation for a wide range of objects under palm-up and palm-down orientations of the hand, with extrinsic sensing providing dense object feedback. Sievers et al. and Pitz et al. demonstrated in-hand cube reorientation to a desired pose with purely tactile feedback. Qi et al. used rapid motor adaptation to achieve effective sim-to-real transfer of in-hand manipulation skills for small cylindrical and cube-like objects. In our case, the exploration ability of SBP allows learning of policies for more difficult tasks, such as in-hand manipulation of non-convex and large shapes, with only intrinsic sensing. We also achieve successful, robust sim-to-real transfer without extensive domain randomization or domain adaptation, by closing the sim-to-real gap via tactile feedback.
In this paper, we focus on the problem of achieving dexterous in-hand manipulation while simultaneously securing the manipulated object in a precision grasp. Keeping the object stable in the grasp during manipulation is needed in cases where a support surface is not available, or the skill must be performed under different directions for gravity (i.e. palm up or palm down). However, it also creates a difficult class of manipulation problems, combining movement of both the fingers and the object with a constant requirement of maintaining stability. In particular, we focus on the task of achieving large in-hand object rotation, which we, as others before, believe to be representative of this general class of problems, since it requires extensive finger gaiting and object re-orientation.
Formally, our goal is to obtain a policy for issuing finger motor commands, rewarded by achieving large object rotation around a given hand-centric axis. The state of our system at time t is denoted by x_t = (q_t, p_t), where q ∈ ℝ^d is a vector containing the positions of the hand's d degrees of freedom (joints), and p ∈ ℝ^6 contains the position and orientation of the object with respect to the hand. An action (or command) is denoted by the vector a ∈ ℝ^d comprising new set-points for the position controllers running at every joint.
For parts of our approach, we assume that a model of the forward dynamics of our environment (i.e., a physics simulator) is available for planning or training. We denote this model by x_{t+1} = F(x_t, a_t). We will show, however, that our results transfer to real robots using standard sim-to-real methods.
We chose to focus on the case where the only sensing available is hand-centric, either tactile or proprioceptive. Achieving dexterity with only proprioceptive sensing, as biological organisms are clearly capable of, can lead to skills that are robust to occlusion and lighting and can operate in very constrained settings. With this directional goal in mind, the observation available to our policy consists of tactile and proprioceptive data collected by the hand, and no global object pose information. Formally, the observation vector is o_t = (q_t, q_t^s, c_t), where q_t, q_t^s ∈ ℝ^d are the current positions and set-points of the joints, and c_t ∈ {0, 1}^m is the vector representing binary (contact/no-contact) touch feedback for each of the m fingers of the hand.
As discussed above, we also require that the hand maintain a stable precision grasp of the manipulated object at all times. Overall, this means that our problem is characterized by a high-dimensional state space, but only small parts of this state space are accessible for us: those where the hand is holding the object in a stable precision grasp. Furthermore, the transition function of our problem is non-holonomic: the subset of fingers that are tasked with holding the object at a specific moment, as well as the object itself, must move in concerted fashion. Conceptually, the hand-object system must evolve on the complex union of high-dimensional manifolds that form our accessible states. Still, the problem state space must be effectively explored if we are to achieve dexterous manipulation with large object re-orientation and finger gaiting.
To effectively explore our high-dimensional state space characterized by non-holonomic transitions, we turn to the well-known Rapidly-Exploring Random Trees (RRT) algorithm. We leverage our knowledge of the manipulation domain to induce tree growth along the desired manifolds in state space. In particular, we expect two conditions to be met for any state: (1) the hand must maintain at least three fingers in contact with the object, and (2) the distribution of these contacts must be such that a stable grasp is possible. We note that these are necessary, but not sufficient conditions for stability; nevertheless, we found them sufficient for effective exploration.
Preservation of condition (1) during the transition between two states means that the object and the fingers that maintain contact with it must move in unison. Assume that we would like the system to evolve from state x_start = (q_start, p_start) towards state x_end = (q_end, p_end), with a desired change in state of Δx_des = (Δq_des, Δp_des) = x_end − x_start. Further assume that the set S comprises the indices of the fingers that are expected to maintain contact throughout the motion. The requirement of maintaining contact, linearized around x_start, can be expressed as:

J_S(q_start) Δq_des − G_S(p_start)^T Δp_des = 0

where J_S(q_start) is the Jacobian of the contacts on the fingers in set S computed at q_start, and G_S(p_start) is the grasp map matrix of those contacts computed at p_start. This is further equivalent to

N_S(x_start) Δx_des = 0

where N_S(x_start) = [J_S(q_start)  −G_S(p_start)^T].
It follows that, if the desired direction of motion in state space Δx_des violates this constraint, we can still find a similar movement that does not violate the constraint by projecting the desired vector into the null space of the matrix N_S as defined above:

x_new = x_start + α (I − N_S^+ N_S) Δx_des

where N_S^+ is the pseudoinverse of N_S and α is a constant determining the size of the step we are willing to take in the projected direction.

We note that this simple projection linearizes the contact constraint around the starting state. Even for small α, small errors due to this linearization can accumulate over multiple steps, leading the fingers to lose contact. Thus, in practice, we further modify x_new by bringing back into contact with the object any finger that is within a given distance threshold (in practice, we set this threshold to 5 mm).
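A minimal sketch of this projection step is given below, using a pseudoinverse-based null-space projector under the linearization described above. The routines supplying the contact Jacobian and grasp map are assumed to be provided by the environment model and are placeholders here.

import numpy as np

def project_onto_contact_manifold(x_start, dx_des, contact_jacobian, grasp_map, alpha=0.05):
    """Project a desired state-space step onto the manifold that keeps the
    selected fingers in contact with the object (linearized at x_start).

    contact_jacobian : J_S(q_start), shape (3k, d)
    grasp_map        : G_S(p_start), shape (6, 3k)
    """
    # N_S = [J_S  -G_S^T], so N_S @ [dq; dp] = 0 preserves the selected contacts.
    N = np.hstack([contact_jacobian, -grasp_map.T])

    # Null-space projector: P = I - N^+ N
    P = np.eye(N.shape[1]) - np.linalg.pinv(N) @ N

    dx_proj = P @ np.asarray(dx_des, dtype=float)
    return np.asarray(x_start, dtype=float) + alpha * dx_proj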
Maintaining at least three contacts with the object does not in itself guarantee a stable grasp. We take further steps to ensure that the contact distribution is appropriate for stability. Assume a set of k contacts, where each contact i has a normal direction n_i expressed in the global coordinate frame. We require that, if at least one contact j applies a non-zero normal contact force of magnitude c_j, the other contacts must be able to approximately balance it via normal forces of their own, minimizing the resulting net wrench applied to the object. This is equivalent to requiring that the hand have the ability to create internal object forces by applying normal forces at the existing contacts. We formulate this problem as a Quadratic Program:
If the resulting minimization objective falls below a chosen stability threshold, we deem the grasp to be stable.
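As an illustrative sketch of such a stability check, the code below fixes a unit normal force at one contact at a time and asks whether the remaining contacts, with bounded normal force magnitudes, can approximately cancel the resulting net wrench. The per-contact loop, the [0, 1] force bounds, the threshold value, and the use of a bounded least-squares solver in place of a generic QP solver are assumptions of this example.

```python
import numpy as np
from scipy.optimize import lsq_linear

def grasp_stability_residual(positions, normals, threshold=1e-2):
    """Approximate stability check on a set of point contacts.

    positions : (k, 3) contact locations in the object frame
    normals   : (k, 3) inward contact normals in the object frame
    Returns (is_stable, best_residual). For each contact j, a unit normal force
    is fixed there and the remaining contacts (forces bounded in [0, 1]) try to
    balance the resulting net wrench as well as possible.
    """
    k = len(normals)
    # Column i is the unit wrench produced by a unit normal force at contact i.
    W = np.vstack([normals.T, np.cross(positions, normals).T])   # shape (6, k)
    best = np.inf
    for j in range(k):
        rest = [i for i in range(k) if i != j]
        res = lsq_linear(W[:, rest], -W[:, j], bounds=(0.0, 1.0))
        best = min(best, np.linalg.norm(W[:, rest] @ res.x + W[:, j]))
    return best < threshold, best
```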
We note that this measure is conservative in that it does not rely on friction forces. Furthermore, it ensures that the fingers are able to generate internal object forces using contact normal forces, but does not specify which motor torques are appropriate for doing so. Nevertheless, we have found it effective in pushing exploration towards useful parts of the state space. We can now put together these constraints into the complete algorithm shown in Alg. 1 and referred to in the rest of this paper as M-RRT. The essence of this algorithm is the forward propagation in lines 7-11. Given a desired direction of movement in state space, we want to ensure that at least three fingers maintain contact with the object. We thus project the direction of motion onto each of the manifolds defined by the contact constraints of each possible set of three fingers that begin the transition in contact with the object. We then choose the projected motion that brings us closest to the desired state-space sample. Finally, we perform an analytical stability check on the new state in line 9 via eqs. (6-9).
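The forward-propagation step described above could be sketched as follows. The helper returning the contact Jacobian and grasp map for a finger subset is hypothetical, and the projection routine is the one sketched earlier; this is an illustration of the extension logic, not a reproduction of Alg. 1.

```python
from itertools import combinations
import numpy as np

def mrrt_extend(x_near, x_sample, fingers_in_contact, contact_models, alpha=0.05):
    """One extension step of M-RRT (a sketch of the forward propagation).

    x_near             : nearest tree node to the sampled state
    x_sample           : random state-space sample
    fingers_in_contact : indices of fingers touching the object at x_near
    contact_models     : callable returning (J_S, G_S) for a finger subset S at x_near
    """
    dx_des = x_sample - x_near
    best_x, best_dist = None, np.inf
    # Try every set of three fingers that could remain in contact during the move.
    for S in combinations(fingers_in_contact, 3):
        J_S, G_S = contact_models(S)
        dx_proj = project_onto_contact_manifold(dx_des, J_S, G_S, alpha)
        x_new = x_near + dx_proj
        dist = np.linalg.norm(x_sample - x_new)
        if dist < best_dist:               # keep the projection closest to the sample
            best_x, best_dist = x_new, dist
    return best_x   # caller re-establishes nearby contacts and runs the stability check
```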
We note that M-RRT does not make use of the environment's transition function F() (i.e., the system dynamics). In fact, both the projection method in eqs. (4-5) and the stability check via eqs. (6-9) can be considered approximations of the transition function, aiming to preserve movement constraints without explicitly computing and checking the system's dynamics. As such, they are fast to compute but approximate in nature. It is possible that some of the transitions in the resulting RRT tree are in fact invalid under full system dynamics, or require complex sequences of motor actions. As we will see in Sec. III-D, however, they are sufficient for helping learn a closed-loop control policy. Furthermore, for cases where F() is available and fast to evaluate, we also study a variant of our approach that makes explicit use of it in the next section.
For problems where the system dynamics F() are available and fast to evaluate, we also investigate the general non-holonomic version of the RRT algorithm, which determines, via random sampling, an action that moves the agent towards a desired sample in state space. Alg. 2 below is referred to as G-RRT.
The essence of this algorithm is the while loop in line 5: it grows the tree in a desired direction by sampling a number K_max of random actions, then using the transition function F() of our problem to evaluate which of these produces a new node as close as possible to a sampled target.
Our only addition to the general-purpose algorithm is the stability check in line 8: a new node gets added to the tree only if it passes a stability check. This check consists of advancing the simulation for an additional 1 s with no change in the action; if, at the end of this interval, the object has not been dropped (i.e., the height of the object is above a threshold), the new node is deemed stable and added to the tree. Assuming a typical simulation step of 2 ms, this implies 500 additional calls to F() for each sample; however, it does away with the need for the domain-specific analytical stability checks we used for M-RRT.
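A schematic version of this extension-plus-stability-check step is given below. The simulator interface (set_state, step, get_state, object_height) is a placeholder for whatever implementation of F() is available, and the action range, default K_max, and height threshold are assumptions of the example.

```python
import numpy as np

def grrt_extend(sim, x_near, x_sample, K_max=64, action_dim=16,
                hold_steps=500, height_threshold=0.05):
    """One extension step of G-RRT: sample K_max actions, keep the best, check stability."""
    best_x, best_action, best_dist = None, None, np.inf
    for _ in range(K_max):                        # sample K_max candidate actions
        a = np.random.uniform(-1.0, 1.0, size=action_dim)
        sim.set_state(x_near)
        sim.step(a)                               # one call to the transition function F()
        x_new = sim.get_state()
        dist = np.linalg.norm(x_sample - x_new)
        if dist < best_dist:                      # keep the node closest to the sample
            best_x, best_action, best_dist = x_new, a, dist
    # Stability check: hold the chosen action for roughly 1 s of simulated time.
    sim.set_state(best_x)
    for _ in range(hold_steps):                   # e.g., 500 steps of 2 ms each
        sim.step(best_action)
    if sim.object_height() > height_threshold:    # object not dropped, node deemed stable
        return best_x, best_action
    return None, None
```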
Overall, the great advantage of this algorithm lies in its simplicity and generality. The only manipulation-specific component is the aforementioned stability check. However, its performance can depend on K_max (i.e., the number of action samples at each iteration), and each of these samples requires a call to the transition function. This cost can be alleviated by highly efficient and massively parallel physics engines implementing the transition function, an important research direction complementary to our study.
While the RRT algorithms we have discussed so far have excellent abilities to explore the complex state space of in-hand manipulation, and to identify (approximate) transitions that follow the complex manifold structure of this space, they do not provide directly usable policies. In fact, M-RRT does not provide actions to use, and the transitions might not be feasible under the true transition function. G-RRT does find transitions that are valid, and also identifies the associated actions, but provides no mechanism to act in states that are not part of the tree, or to act under slightly different transition functions.
In order to generate closed-loop policies able to handle variability in the encountered states, we turn to RL algorithms. Critically, we rely on the trees generated by our sampling-based algorithms to ensure effective exploration of the state space during policy training. The specific mechanism we use to transfer information from the sampling-based tree to the policy training method is the reset distribution: we select relevant paths from the planned tree, then use the nodes therein as reset states for policy training.
We note that the sampling-based trees as described here are task-agnostic. Their effectiveness lies in achieving good coverage of the state space (usually within pre-specified limits along each dimension). Once a specific task is prescribed (e.g. via a reward function), we must select the paths through the tree that are relevant for the task. For the concrete problem chosen in this paper (large in-hand object reorientation) we rely on the heuristic of selecting the top ten paths from the RRT tree that achieve the largest angular change for the object around the chosen rotation axis. (Other selection mechanisms are also possible; a promising and more general direction for future studies is to select tree branches that accumulate the highest reward.) After selecting the task-relevant set of states from the RRT tree, we use a uniform distribution over these states as a reset distribution for RL.
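The path-selection heuristic and the resulting reset distribution could be realized along the following lines; the tree and node interfaces shown here are hypothetical stand-ins for the planner's actual data structures.

```python
import numpy as np

def select_reset_states(tree, rotation_axis=np.array([0.0, 0.0, 1.0]), top_k=10):
    """Pick reset states from an exploration tree for an axis-rotation task.

    The tree is assumed to provide root-to-leaf paths and, per path, the total
    object rotation accumulated about a given axis; both are placeholders.
    """
    paths = tree.root_to_leaf_paths()
    # Rank paths by total object rotation accumulated about the chosen axis.
    scored = sorted(paths, key=lambda p: tree.rotation_along(p, rotation_axis),
                    reverse=True)
    # Gather every node along the top-ranked paths as candidate reset states.
    return [node.state for p in scored[:top_k] for node in p]

def sample_reset(reset_states, rng=np.random.default_rng()):
    """Uniform reset distribution over the selected tree nodes."""
    return reset_states[rng.integers(len(reset_states))]
```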
Our approach is compatible with any RL method that alternates between collecting episode rollouts and updating the policy, as long as new episode rollouts can be started from a specified set of states. Thus, both off-policy and on-policy RL are feasible. However, we use on-policy learning due to its compatibility with GPU-based physics simulators and its relative training stability.
We use the robot hand shown in
We test our methods on the object shapes illustrated in
We found that both algorithms are able to effectively explore the state space. G-RRT is able to explore farther with fewer iterations, and its performance further increases with the number of actions tested at each iteration. We attribute this difference to the fact that M-RRT is constrained to taking small steps due to the linearization of constraints used in the extension projection. G-RRT, which uses the actual physics of the domain to expand the tree, is able to take larger steps at each iteration without the risk of violating the manipulation constraints.
As expected, the performance of G-RRT improves with the number K_max of actions tested at each iteration. Interestingly, the algorithm performs well even with K_max = 1; this is equivalent to a tree node growing in a completely random direction, without any bias towards the intended sample. However, we note that, at each iteration, the node that grows is the one closest to the state-space sample taken at the beginning of the loop. This encourages growth at the periphery of the tree and along the real constraint manifolds, and, as shown here, still leads to effective exploration.
Both these algorithms can be parallelized at the level of the main loop (line 1). However, the extensive sampling of possible actions, which is the main computational expense of G-RRT (line 5) also lends itself to parallelization. In practice, we use the IsaacGym parallel simulator to parallelize this algorithm at both these levels (32 parallel evaluations of the main loop, and 512 parallel evaluations of the action sampling loop). This made both algorithms practical for testing in the context of RL.
We then moved on to using paths from the planned trees in conjunction with RL training. Since our goal is finger gaiting for z-axis rotation, we planned additional trees with each method in which object rotation around the x- and y-axes was restricted to 0.2 radians. Then, from each tree, we select the paths that exhibit the most rotation around the z-axis and extract 2×10^4 nodes from them. On average, each such path comprises 100-400 nodes. In the case of G-RRT, we recall that all tree nodes are subjected to an explicit stability check under full system dynamics before being added to the tree; we can thus use each of them as is. If using M-RRT, we also apply the same stability check to the nodes of the selected paths at this time, before using them as reset states for RL as described next.
In our experiments, we compare the following approaches:
Fixed Initialization (FI): For completeness, we also tried restarting training from a single fixed state. As expected, this method also failed to learn, even in the simple cases. Additionally, we evaluated fixed initialization with a gravity curriculum (zero to full). The policy only learned in-grasp manipulation, reorienting the object by the maximum possible amount without breaking contact before dropping it. We found that the policy did not learn finger-gaiting even with zero gravity when using a fixed initialization. Thus, fixed initialization, with or without a gravity curriculum, does not help with learning finger-gaiting. We hypothesize that curriculum learning has limited power to address exploration issues because policies tend to converge to sub-optimal behaviors that are hard to overcome later in training.
Training results are summarized
We also studied the impact of the size of the tree used to extract reset states.
In addition, we performed an ablation study of policy feedback. In particular, we aimed to compare intrinsically available tactile feedback with object-pose feedback that would require external sensing.
To test the applicability of our method on real hardware, we attempted to transfer the learned policies for a subset of representative objects: cylinder, cube, cuboid, and L-shape. We chose these objects to span the range from simpler to more difficult manipulation skills.
For sim-to-real transfer, we take a number of additional steps. We impose velocity and torque limits in the simulation, mirroring those used on the real motors (0.6 rad/s and 0.5 N-m, respectively). We found that our hardware has a significant latency of 0.05 s, which we included in the simulation. In addition, we modified the angular velocity reward to maintain a desired velocity instead of maximizing the object's angular velocity. We also randomize joint origins (0.1 rad) and the friction coefficient (1-40), and train with perturbation forces (1 N). All of these changes are introduced successively via a curriculum.
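For reference, the sim-to-real settings listed above can be collected into a single configuration object, sketched below; the field names are illustrative, while the numeric values are those stated in the text (the target angular velocity used by the modified reward is not specified above and is left unset).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SimToRealConfig:
    joint_velocity_limit: float = 0.6           # rad/s, matching the real motors
    joint_torque_limit: float = 0.5             # N-m, matching the real motors
    control_latency: float = 0.05               # s, measured hardware latency modeled in sim
    joint_origin_noise: float = 0.1             # rad, randomization of joint origins
    friction_range: Tuple[float, float] = (1.0, 40.0)   # randomized friction coefficient
    perturbation_force: float = 1.0             # N, random perturbation forces during training
    target_angular_velocity: Optional[float] = None     # desired object speed (value not stated)
```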
For sensing, we used the current position and setpoint from the motor controllers with no additional changes. For tactile data, we found that information from our tactile fingers is most reliable for contact forces above 1 N. We thus did not use reported contact data below this threshold, and imposed a similar cutoff in simulation. Overall, we believe that a key advantage of exclusively using intrinsic (proprioceptive and tactile) data is a smaller sim-to-real gap compared to extrinsic sensors such as cameras. For the set of representative objects, we ran the respective policy ten consecutive times and counted the number of successful complete object revolutions achieved before a drop. In other words, five revolutions means the policy successfully rotated the object through 1,800° before dropping it. In addition, we also report the average object rotation speed observed during the trials. The results of these trials are summarized in Table I.
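The tactile cutoff described above amounts to a simple thresholding of reported contact forces, applied identically in simulation and on hardware; the function below is a hypothetical illustration.

```python
import numpy as np

def filter_contact_forces(forces, cutoff=1.0):
    """Zero out reported contact forces below the 1 N reliability threshold.
    Applying the same cutoff to simulated contacts keeps training and
    deployment observations consistent. (Function name is illustrative.)"""
    forces = np.asarray(forces, dtype=float)
    return np.where(forces >= cutoff, forces, 0.0)
```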
The results we have presented show that sampling-based exploration methods make it possible to achieve difficult manipulation tasks via RL. In fact, these two popular and widely used classes of algorithms are highly complementary in this case. RL is effective at learning closed-loop control policies that maintain the local stability needed for manipulation and, thanks to training on a large number of examples, are robust to variations in the encountered states. However, standard RL exploration techniques (random perturbations in action space) are ineffective in the highly constrained state space of manipulation tasks, with its complex manifold structure. Conversely, sampling-based planning (SBP) methods, which rely on a fundamentally different approach to exploration, can effectively discover the relevant regions of the state space and convey this information to RL training algorithms, for example via an informed reset distribution.
Since sampling-based exploration methods are not expected to generate directly usable trajectories, exploration can also rely on approximate models of the physical constraints, which can be informed by well-established analytical models of robotic manipulators. Interestingly, we found that the general-purpose exploration algorithm, which uses the full transition function of the environment, is still more sample-efficient than such analytical constraint models. Nevertheless, both are usable in practice, particularly with the advent of massively parallel physics simulators.
We use this approach to demonstrate finger-gaiting precision manipulation of both convex and non-convex objects, using only tactile and proprioceptive sensing. Using only these types of intrinsic sensors makes manipulation skills insensitive to occlusion, illumination, or distractors, and reduces the sim-to-real gap. We take advantage of this by demonstrating our approach both in simulation and on real hardware. We note that, while some applications naturally preclude the use of vision (e.g. extracting an object from a bag), we expect that in many real-life situations future robotic manipulators will achieve the best performance by combining touch, proprioception, and vision. Learning in-hand object reorientation to achieve a given desired pose may also potentially benefit from our approach, for example by leveraging information from the RRT tree to find appropriate trajectories for reaching a specific node. Another more general and promising direction for future work involves other mechanisms by which ideas from sampling-based exploration can facilitate RL, beyond reset distributions. Some SBP algorithms can also be used to suggest possible actions for transitions between regions of the state space, a feature that we do not take advantage of here (even though one of the exploration algorithms we use does indeed compute actions). Alternatively, sampling-based exploration techniques could be integrated directly into the policy training mechanisms, removing the need for two separate stages during training. We hope to explore all these ideas in future work.
Although the subject matter has been described in language specific to structural features or methodological acts, it is to be understood that the subject matter of the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example aspects.
Various operations of aspects are provided herein. The order in which one or more or all of the operations are described should not be construed as to imply that these operations are necessarily order dependent. Alternative ordering will be appreciated based on this description. Further, not all operations may necessarily be present in each aspect provided herein.
As used in this application, “or” is intended to mean an inclusive “or” rather than an exclusive “or”. Further, an inclusive “or” may include any combination thereof (e.g., A, B, or any combination thereof). In addition, “a” and “an” as used in this application are generally construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form. Additionally, at least one of A and B and/or the like generally means A or B or both A and B. Further, to the extent that “includes”, “having”, “has”, “with”, or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising”.
Further, unless specified otherwise, “first”, “second”, or the like are not intended to imply a temporal aspect, a spatial aspect, an ordering, etc. Rather, such terms are merely used as identifiers, names, etc. for features, elements, items, etc. Additionally, “comprising”, “comprises”, “including”, “includes”, or the like generally means comprising or including, but not limited to.
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives or varieties thereof, may be desirably combined into many other different systems or applications. Also that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.
This application is a continuation of International Application PCT/US2022/044618, entitled "ROBOTIC DEXTERITY WITH INTRINSIC SENSING AND REINFORCEMENT", filed Sep. 23, 2022, which claims priority to and the benefit of U.S. Provisional Application Ser. No. 63/247,719, entitled "ROBOTIC DEXTERITY WITH INTRINSIC SENSING AND REINFORCEMENT", filed Sep. 23, 2021, each of which is incorporated by reference herein in its entirety.
This invention was made with government support under N00014-21-1-4010, and N00014-19-1-2062 awarded by the Office of Naval Research, and 1551631, and 1734557 awarded by the National Science Foundation. The government has certain rights in the invention.
Number | Date | Country
---|---|---
63247719 | Sep 2021 | US
 | Number | Date | Country
---|---|---|---
Parent | PCT/US2022/044618 | Sep 2022 | WO
Child | 18613576 | | US