The present disclosure relates generally to a method for decoupling a high dimensional neural network into two or more neural networks of lower input dimensions and, more particularly, to a network modularization method to generate robot actions for high dimensional tasks which decomposes high degrees of freedom (DOF) actions into groups, and each of the grouped actions is searched individually by a neural network using specially designed data.
The use of industrial robots to perform a wide range of manufacturing, assembly and material movement operations is well known. One such application is a pick and place operation, where a robot picks up individual parts from a bin and places each part on a conveyor or a shipping container. An example of this application would be where parts which have been molded or machined are dropped into the bin and settle in random locations and orientations, and the robot is tasked with picking up each part and placing it in a predefined orientation (pose) on a conveyor which transports the parts for packaging or for further processing. Depending on the type of parts in the bin and other factors, finger-type graspers or suction-type grippers may be used as the robot tool. A vision system (one or more cameras) is typically used to identify the position and pose of individual parts in the bin.
It is known in the art to use trained neural network systems to compute grasping instructions for parts in a bin. However, existing neural network grasp learning systems suffer from drawbacks which limit their practical use. One known system encodes a top-down candidate grasp into an image patch and trains a network to predict the quality of a plurality of candidate grasps. This system requires a long time to compute candidate grasps, and can only produce top-down (vertical) grasps for parallel-jaw grippers. Moreover, this system cannot predict the effect of interference between parts in cluttered environments, as it is trained only with individual isolated parts/objects, not with a random jumble of parts in a bin.
Another known system removes the requirement of time-consuming grasp candidate calculation by training a network to take the original depth image and output the quality of each pixel. However, this system cannot make accurate predictions for each pixel due to the large number of pixels contained in each image. Thus, this system is not as accurate as the system discussed above. Furthermore, this system cannot handle the densely cluttered environment which is typical of parts in a bin, due to the ambiguity of gripper angle/width encoding. In addition, this system can only produce a straight top-down grasp solution. Finally, without predicting depth, this system can potentially drive the robot gripper into adjacent parts in the bin, and cause damage to the gripper or the parts.
Yet another existing system attempts to determine a six DOF grasp with a single network. However, this system cannot handle a cluttered grasping environment (such as a pile of parts) in the grasp evaluation network, and requires a grasp refinement step after the grasp evaluation in the neural network.
In light of the circumstances described above, there is a need for a method of decomposing high dimensional learning neural networks into two or more lower dimension networks, with the method being applicable to full-DOF grasp planning and other applications.
In accordance with the teachings of the present disclosure, a method for decoupling or modularizing high dimensional neural networks into two or more neural networks of lower input dimensions is described and shown. The disclosed network modularization method is particularly suited to generating full-DOF robot grasping actions based on images of parts in a bin to be picked. In one example, a first network encodes grasp positional dimensions and a second network encodes grasp rotational dimensions. The first network is trained to predict a position at which a grasp quality is maximized for any value of the grasp rotations. The second network is trained to identify the maximum grasp quality while searching only at the previously-identified position from the first network. In this way, the two networks collectively identify an optimal grasp, while each network's dimensional searching space is greatly reduced. Specifically, a large number of grasp positions and rotations can be evaluated in a total number of searches equaling the sum of the evaluated positions and rotations, rather than the product. The separation of dimensions between the networks may be designed to best suit a particular application, even including three neural networks instead of two in some applications.
Additional features of the presently disclosed devices and methods will become apparent from the following description and appended claims, taken in conjunction with the accompanying drawings.
The following discussion of the embodiments of the present disclosure directed to a neural network modularization technique to learn high dimensional robot tasks is merely exemplary in nature, and is in no way intended to limit the disclosed devices and techniques or their applications or uses.
The use of industrial robots for picking parts from a source and placing them at a destination is well known. In one common application, a supply of parts is provided in a bin, such as a bin full of parts which have just been cast or molded. Teaching a robot to recognize and grasp an individual part in a bin full of parts has always been challenging. Traditional methods teach robots manually in structured environments. For high dimensional tasks in unstructured environments, it is desired to learn a robust grasping skill by deep learning using a neural network trained for pattern recognition.
However, to learn a high dimensional robot task, the learning-based methods generally require encoding high dimensional states/actions and searching in high dimensional action space. For example, to learn a six degrees of freedom (DOF) general bin picking task, the neural network needs to encode the high dimensional observation and 6-DOF grasp actions before searching in the action space. This can increase the complexity of the network and introduce heavy computation load.
Concerning the challenges in high-dimensional learning, two existing methods reduce the searching to four dimensions and constrain the approach direction of the grasps to be top-down. Also, these learning-based methods are either not fast enough (due to the requirement of time-consuming candidate grasp calculation) or not accurate enough (because they try to predict too many dimensions, which is difficult for neural networks). Yet another existing method uses a single neural network for a six-DOF grasp proposal, but this method suffers from high search complexity, requires subsequent grasp refinement, and cannot handle a cluttered object environment as is typical of parts in a bin.
In order to overcome the shortcomings of existing methods and systems, the present disclosure describes a technique for modularizing or decoupling large, high dimensional neural networks into two or three smaller networks of lower dimension. Using this neural network modularization technique, searching accuracy can be maintained while network performance and efficiency are greatly improved. One application for the disclosed neural network modularization technique is in robotic part grasping, where all degrees of freedom (DOF) of a grasp are computed from images of a bin full of parts, and the computed grasp exceeds a quality threshold.
Motion of the robot 100 is controlled by a controller 110, which typically communicates with the robot 100 via a cable 112. The controller 110 provides joint motion commands to the robot 100 and receives joint position data from encoders in the joints of the robot 100, as known in the art. The controller 110 also provides commands to control operation of the gripper 102—including gripper rotation angle and width, and grip/ungrip commands.
A computer 120 is in communication with the controller 110. The computer 120 includes a processor and memory/storage configured with neural networks for computing a grasp proposal based on three dimensional (3D) camera images. In one embodiment, the computer 120 running the neural networks in execution or inference mode is the same computer on which the neural networks were previously trained. In another embodiment, the neural networks are trained on a different computer and provided to the computer 120 for use in live robotic grasping operations.
A pair of 3D cameras 130 and 132 communicate, via hard-wire connection or wirelessly, with the computer 120 and provide images of the workspace. In particular, the cameras 130/132 provide images of objects 140 in a bin 150. The images (including depth data) from the cameras 130/132 provide point cloud data defining the position and orientation of the objects 140 in the bin 150. When there are two of the 3D cameras 130 and 132 having different perspectives, it is possible to compute or project a 3D depth map of the objects 140 in the bin 150 from any suitable point of view. In another embodiment, only one of the 3D cameras (130) is used, such as oriented for a directly vertical line of sight.
The position of the bin 150 relative to the robot 100 is known, so that when a grasp of an object 140 at a location in the bin 150 is computed, the robot 100 can control the gripper 102 to execute the grasp. The task of the robot 100 is to pick up one of the objects 140 from the bin 150 and move the object to a conveyor 160. In the example shown, an individual part 142 is selected, grasped by the gripper 102 of the robot 100, and moved to the conveyor 160 along a path 180.
For each part picking operation, the computer 120 receives one or more images of the objects 140 in the bin 150, from the cameras 130/132. From the camera images, the computer 120 computes one or more depth maps of the pile of objects 140 in the bin 150. Using the depth maps, the neural networks running on the computer 120 determine a high quality, full-DOF grasp for one individual object in the bin 150. For example, an object on top of the pile of objects, with significant portions of its sides exposed and free from surrounding objects, would be a good grasp candidate.
When an object (such as the object 142) is identified as being in a position for a high quality grasp according to the techniques described in detail below, the computer 120 provides the individual object grasp data to the controller 110, which then commands the robot 100 to grasp and move the object. The individual object grasp data provided to the controller 110 by the computer 120 preferably includes 3D coordinates of the grasp target point, the angle of approach to be followed by the gripper 102, and the gripper angle of rotation and width (or positions of all finger joints).
Using the individual object grasp data, the controller 110 can compute robot motion instructions which cause the gripper 102 to grasp the identified object (e.g., the object 142) and move the object to the destination location along a collision-free path (the path 180). Instead of the conveyor 160, the destination location could be a shipping container in which the objects are placed in individual compartments, or any other surface or device where the objects are further processed in a subsequent operation.
After the object 142 is moved to the conveyor 160, new image data is provided by the cameras 130/132, as the pile of objects 140 will have changed. The computer 120 must then identify a new target object for grasping based on the new image data using the trained neural networks. The new target object must be identified by the computer 120 very quickly, because the object identification and path computation must be performed in real time as fast as the robot 100 can move one of the objects 140 and return to pick up the next. The efficient searching provided by lower-dimensional modularized neural networks enables the fast grasp computation needed in this grasping application.
The application described above, in which the robot computes full-DOF grasps of parts in a bin from depth images, is used in the following discussion as an example of a high dimensional task to which the disclosed network modularization method can be applied. In box 210, a conventional approach is depicted in which a single large neural network 220 is used to encode and search all of the grasp dimensions at once.
The neural network 220 in the box 210 encodes six dimensions of gripper pose defined by the input environment data associated with grasps {p,r}, and in the execution phase attempts to search depth images to identify an optimum (p*,r*) of all six dimensions. A neural network which encodes and searches this many dimensions becomes very complex (many layers), and as a result, the training and searching are very slow, and the search results may be imprecise or ambiguous.
In box 250, the disclosed modularized approach is depicted, in which the single neural network 220 is replaced by a first neural network 260 and a second neural network 270, each of which encodes and searches only a subset of the grasp dimensions; the first neural network 260 searches the grasp positions p, and the second neural network 270 searches the grasp rotations r at the position identified by the first network.
Another way to explain the above is as follows. The single neural network 220 searches for a high dimension robot action (predicts dimension values {r*,p*} = argmax_(r,p) Q(r,p)) by searching across all dimensions {r,p} for the values {r*,p*} which maximize a quality metric Q which is a function of both r and p. According to the presently disclosed techniques, the single high dimension neural network 220 can be decomposed into a modularization of the two neural networks 260 and 270, where the first neural network 260 predicts a maximal margin value p* = argmax_p Qr(p), where Qr(p) is the grasp quality projected along the r direction, and the second neural network 270 predicts the conditional behavior r* = argmax_r Q(r,p|p=p*).
Following is a detailed discussion of how a high dimensional search problem can be modularized into two neural networks where each network has a reduced dimension search space but the combined networks still find an optimum grasp quality value.
As discussed earlier, when p and r represent position and rotation dimensions of a grasp (gripper pose), p and r each include three dimensions or degrees of freedom. Thus, it can be easily envisioned that in order to find an optimal grasp candidate, many different values of p and r will have to be searched. That is, the values of i and j are likely to be at least in the hundreds. For example, if the x, y and z dimensions of p are each divided into ten increments, p will have a dimensional size of 10×10×10=1000. When searching of the grid 310 is performed by a single neural network (such as the network 220 discussed above), all i×j combinations of p and r must be encoded and evaluated in order to find the maximum grasp quality.
As also mentioned earlier, the present disclosure defines techniques for separating (modularizing) the one large, high dimension neural network into two (or more) simpler neural networks. A key to separating the one large neural network into two simpler neural networks is encoding the first neural network to find a value p* which yields the overall highest grasp quality at one of its corresponding values of r, so that the second neural network can then search the r dimension at an optimal location in p. Following is a discussion of this technique.
In box 320 is shown a first technique for neural network modularization, where the searching of both p and r dimensions of the grid 310 is separated into a search of the p dimension to find p*, followed by a search of the r dimension at p* to find the maximum quality grasp. As discussed above, p* can be found by p* = argmax_p Qr(p),
where Qr(p) is the grasp quality Q(r,p) projected along the r direction. Qr(p) hides r and is a function of p only. The technique described in the box 320 defines Qr(p)=∫Q(r,p)dr. When Qr(p) is defined in this way, the first neural network (shown at 330) finds the value of p* which has the best average quality Q, that is, the integral across all values of r. Based on the sizes of the quality dots in the grid 310, it can be seen that the value of p* in the box 320 is pi, which is the column with the highest average quality.
When the second neural network in the box 320, shown at 340, searches across all values of r at p*=pi to identify the maximum value of Q, all other values of p are hidden. Thus, the second neural network finds the maximum grasp quality for p*=pi, which occurs at r0. A visual inspection of the grid 310 reveals that the grasp quality at (r0,pi) is not the overall maximum grasp quality. Thus, the normal margin (average quality) technique shown in the box 320 is not reliably able to find the overall maximum value when used in neural network modularization.
In box 350 is shown a second technique for neural network modularization, where the searching of both p and r dimensions of the grid 310 is separated into a search of the p dimension to find p*, followed by a search of the r dimension at p* to find the maximum quality grasp. According to the present disclosure, the technique described in the box 350 uses a maximal margin technique which defines Qr(p) = max_r Q(r,p).
When Qr(p) is defined in this way, the first neural network (shown at 360) finds the value of p* which has the best overall quality Q, that is, the maximum individual quality across all values of r. In other words, the first neural network predicts that a particular p has a high score as long as there exists one r that performs well. Based on the sizes of the quality dots in the grid 310, it can be seen that the value of p* in the box 350 is p0, which is the column containing the cell with the highest individual quality.
When the second neural network in the box 350, shown at 370, searches across all values of r at p*=p0 to identify the maximum value of Q, all other values of p are hidden. Thus, the second neural network finds the maximum grasp quality for p*=p0, which occurs at rj. A visual inspection of the grid 310 reveals that the grasp quality at (rj,p0) is in fact the overall maximum grasp quality. Thus, the maximal margin technique shown in the box 350 is able to find a target value of one dimension (p*) which yields a maximum value when used in neural network modularization.
Because the second neural network 370 in the box 350 searches r only at p*=p0, and all other values of p are hidden, the search of the second neural network is much faster (by a factor of j) than a single neural network search across all dimensions of the grid 310. This huge improvement in neural network searching performance is very important in robotic grasping applications where the grasp proposal corresponding to an image of a pile of objects must be computed in real time to support robot control.
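As a minimal numerical sketch of the difference between the two projection techniques (using an invented 3×3 quality grid, not the actual data of the grid 310), the following Python snippet shows that projecting by the average over r can select a column whose best cell is not the global optimum, while the maximal margin projection lets the second-stage search over r recover the overall best quality:

```python
import numpy as np

# Toy grid of grasp qualities Q[r, p] (rows index the rotation r, columns index the
# position p), standing in for the grid 310; the numbers are invented for illustration.
Q = np.array([[0.2, 0.6, 0.1],
              [0.3, 0.7, 0.1],
              [0.9, 0.5, 0.1]])

# Average-quality projection (box 320): Qr(p) = mean over r.  The chosen column has
# the best average, but its best cell is not the overall best grasp.
p_avg = int(np.argmax(Q.mean(axis=0)))
assert Q[:, p_avg].max() < Q.max()

# Maximal margin projection (box 350): Qr(p) = max over r.  The chosen column is the
# one containing the single best cell, so the second-stage search can recover it.
p_star = int(np.argmax(Q.max(axis=0)))
r_star = int(np.argmax(Q[:, p_star]))       # search r only at p = p*
assert Q[r_star, p_star] == Q.max()

print(p_avg, p_star, r_star)                # 1 0 2
```

The two assertions make the contrast explicit: the column chosen by the average-based projection misses the global best grasp, while the maximal margin column contains it.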
Based on the preceding discussion, it can be seen how the maximal margin equations p* = argmax_p max_r Q(r,p) and r* = argmax_r Q(r,p|p=p*) can be employed effectively for neural network modularization. Following is a discussion of how this is done in a training phase and in an inference phase of neural network modularization.
As shown schematically at the top of the boxes 400 and 450, the first neural network 410 encodes the grouped dimension p such that, based on input data which describes the environment for {p}, the first neural network is able to predict p*—the value of p for which there is a maximum quality at some value of r. The second neural network 460 then encodes all dimensions (p,r) based on input data which defines the environment for all r at the previously identified p*, and identifies values (p*,r*) where a maximum grasp quality exists.
To train the first neural network 410, maximal margin data is first prepared as indicated at 420 and 430. The input data Ir(p) indicated at 420 represents the state to uniquely encode the action p. The action r is hidden, so Ir(p) is a function of p. The output data Qr(p) indicated at 430 is the quality Q(r,p) projected along the r direction with the maximal margin method (discussed above).
The 3D depth image can be divided into multiple layers, each at a different height (z0, z1, . . . , zN). The height z0, represented by a line 540, indicates the highest point in the depth image (e.g., the pile of parts). One depth image layer is provided for the height z0, which shows a complete depth map including all objects from the z0 level (top of highest object) all the way down to the zN level (bottom of the bin). In the depth image layer for z0, the z coordinates of the depth image have a reference origin set to z0—such that everything in the depth image has a negative z coordinate. The height z1, represented by a line 542, indicates a level slightly below the top of the depth image. Another depth image layer is provided for the height z1, which again shows a depth map including all objects from the z0 level down to the bottom zN level; however, in the z1 depth image layer, the z coordinates of the depth image have a reference origin set to z1—such that everything above z1 in the depth image has a positive z coordinate, and everything below z1 in the depth image has a negative z coordinate. Similarly, depth image layers are provided for additional levels zi (represented by a line 544). Each depth image layer is a complete depth image for the entire pile of parts, but each depth image layer has a different origin in the z direction. By slicing at different levels and providing multiple depth image layers, the z direction is encoded in training the neural network 410. The x and y encoding is naturally done by the two-dimensional information in each layer. This is illustrated by the depth image layer 600 discussed below.
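A brief sketch of this layering idea is given below; the helper name and the random stand-in depth image are assumptions for illustration, not the disclosed implementation:

```python
import numpy as np

def make_depth_layers(depth_image, num_layers):
    """Hypothetical helper sketching the layering described above: the single depth
    image is repeated with its z reference origin shifted to a series of slicing
    heights z0..zN, so that points above a layer's origin have positive z and points
    below it have negative z."""
    z_top = float(depth_image.max())        # z0: top of the highest object in the pile
    z_bottom = float(depth_image.min())     # zN: bottom of the bin
    origins = np.linspace(z_top, z_bottom, num_layers)
    layers = np.stack([depth_image - z for z in origins])
    return layers, origins

# Example with a random stand-in depth image; each returned layer is the complete
# depth map, and only its z origin differs, which is how z is encoded for training.
depth_image = np.random.rand(64, 64) * 0.3          # heights in meters (illustrative)
layers, origins = make_depth_layers(depth_image, num_layers=10)
print(layers.shape)                                  # (10, 64, 64)
```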
The curve 530 has a feature 550 and a feature 560. The features 550 and 560 are high spots in the depth map which indicate places where the pile of parts is higher due to the presence of one or more parts. The features 550 and 560 will be manifested in shapes in the depth image and in the grasp quality maps, as discussed below.
The depth image layer 600 includes a shape 620 and a shape 630, which correspond to the features 550 and 560, respectively, of the curve 530 discussed above.
To summarize the maximal margin data preparation for training the first neural network 410 (in the box 400): the inputs 420 are the depth image layers, which encode the grasp position p in x, y and z, and the outputs 430 are the corresponding quality maps, in which each pixel contains the maximal margin grasp quality Qr(p) for that position with the rotation r hidden.
To train the second neural network 460 (in the box 450), input data I(r,p) and output data Q(r,p) are prepared which encode both the grasp position p and the grasp rotation r.
For grasping applications, in the input data I(r,p) shown at 480 (the depth image crops 482, 484, 486, . . . ), p is encoded by crop centers (x and y from the location on a depth image layer, and z from the depth origin of that layer), and r is encoded by crop angles, both as determined from the quality maps 430. It is often advantageous to allow non-vertical grasp directions in order to provide the best bin picking capability. The approach direction of a grasp (that is, a non-vertical approach direction) may be encoded in the first and second neural networks (410/460) by feeding depth images 420 (used again at 470) of different view angles. The depth images of different view angles may be computed from point cloud data obtained from two 3D cameras having different positions and orientations, as described earlier for the cameras 130 and 132.
Output data Q(r,p) is shown at 490, which includes a quality metric value associated with each of the depth image crops 482/484/486. The quality metric is also provided by the external training data source (discussed further below).
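As a minimal sketch of how such inputs might be assembled (the function name, patch size and use of a standard image rotation routine are assumptions for illustration), the following snippet crops a patch at one candidate position on a depth layer and rotates it through a set of candidate gripper angles:

```python
import numpy as np
from scipy.ndimage import rotate

def crop_rotated_patches(depth_layer, center_xy, angles_deg, patch_size=32):
    """Hypothetical data-preparation helper: p is encoded by the crop center on a
    chosen depth layer (with z taken from that layer's origin) and r is encoded by
    the crop angle, yielding rotated image crops like those described at 480.
    Assumes the center lies far enough from the image border for the oversized crop."""
    cx, cy = center_xy
    big = patch_size                     # oversized half-width so rotation keeps data
    patch = depth_layer[cy - big:cy + big, cx - big:cx + big]
    half = patch_size // 2
    crops = []
    for angle in angles_deg:
        rotated = rotate(patch, angle, reshape=False, order=1, mode='nearest')
        mid = rotated.shape[0] // 2
        crops.append(rotated[mid - half:mid + half, mid - half:mid + half])
    return np.stack(crops)

# Example: 16 candidate gripper rotations at one candidate position on a layer.
layer = np.random.rand(128, 128)
patches = crop_rotated_patches(layer, center_xy=(64, 64),
                               angles_deg=np.arange(0, 180, 11.25))
print(patches.shape)                     # (16, 32, 32)
```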
The first neural network 410 may be a fully convolutional network (FCN)—which is best suited for “image-in/image-out” applications. The second neural network 460 may be a convolutional neural network (CNN)—which is best suited for “image-in/scalar-out” applications, where high accuracy is possible due to the low dimensional content of the output. Both the first neural network 410 and the second neural network 460 are trained using supervised learning, which means that the desired output data from the networks (the quality maps 430 from the first neural network 410, and the grasp quality metrics 490 from the second neural network 460) are provided as inputs for training. Following the training steps illustrated in the boxes 400 and 450, the two neural networks are ready to be run in inference mode for live robotic grasping.
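To illustrate the distinction between the two network types (an FCN producing an image-sized quality map, and a CNN producing a scalar quality), the following PyTorch sketch defines toy versions of each; the layer counts and sizes are illustrative assumptions, not the disclosed architectures:

```python
import torch
import torch.nn as nn

# Toy fully convolutional network (image in, quality map out) for the first network.
fcn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1),             # one quality value per pixel
)

# Toy convolutional network (image in, scalar out) for the second network.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(),
    nn.Linear(16 * 4 * 4, 32), nn.ReLU(),
    nn.Linear(32, 1),                             # one grasp quality per input patch
)

layers = torch.randn(8, 1, 64, 64)     # batch of depth image layers
quality_maps = fcn(layers)             # shape (8, 1, 64, 64): per-pixel qualities
patches = torch.randn(16, 1, 32, 32)   # rotated crops at a candidate position
qualities = cnn(patches)               # shape (16, 1): one scalar per rotation
```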
The neural networks 410 and 460, trained as described above, are then used in an inference phase to compute grasps in real time. A box 700 contains the inference steps involving the first neural network 410, and a box 750 contains the inference steps involving the second neural network 460.
In the box 700, the first step is preparing the input data Ir(p) that encodes p. In grasping applications, Ir(p) can be depth images associated with different p. Depending on encoding methods and network structure, there can be different input/output types. In the grasping example shown, multiple depth image layers are provided at 710 for depth encoding and a fully convolutional network structure is used. Therefore, the inputs are depth images (horizontal slices, encoding p in x and y) centered at different heights or layers (encoding p in z), as discussed previously.
When the inputs Ir(p) (710) are provided, the neural network 410 can be run in a "forward loop" in inference mode as shown in the box 700. The output from running the first neural network 410 in inference mode is multiple quality maps (one at each layer in z—as shown at 720) where each pixel shows the maximal margin quality Qr(p) = max_r Q(r,p) if grasping with associated p as discussed previously. Pixels in each quality map layer which have a grasp quality above a threshold are highlighted as dots or spots; all other pixels do not represent a quality grasp. Finally, the ultimate output from the first neural network 410 is the maximal margin value of p, obtained as shown in the equation at 730: p* = argmax_p Qr(p).
The margin value p* is provided to the second neural network 460 as shown by arrow 740.
The box 750 includes the steps involving the second neural network 460. This begins by preparing the data for the second neural network 460. The value of p*, provided from the first neural network 410 at the arrow 740, is applied to a particular depth image layer 760 as shown at arrow 762. By identifying a particular crop location (x and y) on a particular depth image layer (z), the value of p* is fully defined as input to the second neural network 460. The r space can then be searched by providing a plurality of rotated cropped image patches as shown at 770. Thus, the input to the second neural network 460 is I(r,p|p=p*). It is important to remember that during the inference phase, the second neural network 460 only searches in r space, as p space (the value p*) is already encoded from the first neural network 410.
The output of the second neural network 460 is the quality (scalar value) of each different r, as shown at 780. Thus, the output of the second neural network 460 is Q(r,p|p=p*). That is, a quality value (grasp quality in the example shown) is computed for each of the rotated cropped image patches 770. Finally, the ultimate output from the second neural network 460 is the value of r having the greatest value of Q, obtained as shown by the equation in the box 790: r* = argmax_r Q(r,p|p=p*).
The value p* from the first neural network 410 is concatenated with the value r* from the second neural network 460 to provide the full action (r*,p*)—which in the grasping example shown is a full six-DOF robot grasp of an object, where the grasp has the high quality Q which was found as described above.
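The inference flow just described can be condensed into a short sketch; the interfaces are assumptions, and random stand-ins take the place of the trained networks:

```python
import numpy as np

def modularized_grasp_search(quality_maps, layer_origins, evaluate_rotations):
    """Sketch of the two-stage inference flow in the boxes 700 and 750.  The interfaces
    are assumptions: quality_maps is the (layers, H, W) output of the first network,
    layer_origins gives the z origin of each layer, and evaluate_rotations stands in
    for the second network, returning one quality value per candidate rotation."""
    # Stage 1: p* = argmax_p Qr(p), searched over every pixel of every layer.
    k, y, x = np.unravel_index(int(np.argmax(quality_maps)), quality_maps.shape)
    p_star = (x, y, float(layer_origins[k]))

    # Stage 2: r* = argmax_r Q(r, p | p = p*), searching rotations only at p*.
    rotation_qualities = np.asarray(evaluate_rotations(p_star))
    r_star = int(np.argmax(rotation_qualities))
    return p_star, r_star

# Example with random stand-ins in place of the network outputs.
maps = np.random.rand(10, 64, 64)
origins = np.linspace(0.0, -0.3, 10)
p_star, r_star = modularized_grasp_search(maps, origins, lambda p: np.random.rand(16))
print(p_star, r_star)
```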
To explain once again what was done by the modularized neural networks 410 and 460: the first neural network 410 searched only the position dimensions p, using the maximal margin quality to hide the rotation dimensions, and identified the single best position p*; the second neural network 460 then searched only the rotation dimensions r at the fixed position p*, and identified r*. Together, the two networks identified a full six-DOF grasp of high quality, while each network searched a greatly reduced dimensional space.
The preceding discussion and the accompanying examples involved decomposing a six-DOF grasp search into two neural networks which separately handle the position and rotation dimensions. The same modularization technique can be applied to other groupings of dimensions; a simpler three-DOF example, in which a first network handles the x and y dimensions and a second network handles the z dimension, is described next.
A box 800 includes a first neural network 810 and its associated input and output data. A depth image 820, such as from one or more 3D cameras, is provided as input. In the training phase, a corresponding best quality grasp location is also provided for supervised learning. Through training using many of the depth images 820, the first neural network 810 learns to encode x and y from features of the depth image 820 to correspond to a best grasp quality. In the inference phase, the depth image 820 is provided to the first neural network 810, and the output is a single quality map 830 indicating a best quality grasp location in x and y dimensions. That is, the first neural network 810 encodes the maximal margin of Qz(xy). The first neural network 810 provides the x-y dimensions of the best grasp location (x*y*) to the second neural network 860 in the inference phase, as indicated at arrow 840.
A box 850 includes a second neural network 860 and its associated input and output data. The function of the second neural network 860 is to encode the z dimension. From a depth image 870 (which is the same as the depth image 820) and the input (x*y*) (which came from the first neural network 810 on the arrow 840), a depth image patch may be cropped at the best x and y position (x*y*) as shown in box 880 at 882. Another way to think of it is that, in the inference phase, the depth image 870 is cut into slices in the z direction as shown at 890, and the slices are evaluated at the best (x*y*) grasp location (shown at arrow 892) to determine the height z* at which a best quality grasp is found. Shown at 898 is one of the slices from the stack shown at 890, with a best grasp location circled, where the best grasp location in the slice 898 corresponds to the best grasp location in the quality map 830 from the first neural network 810 (which had not yet evaluated the z direction). It can thus be seen that the final three-dimensional grasp quality from the second neural network 860 (Q(xyz|xy=x*y*)) in the slice 898 agrees in the x and y directions with the maximum two-dimensional grasp quality (Qz(xy)) from the first neural network 810 in the quality map 830.
The final output grasp location (x*y*z*) includes the best x and y dimensions identified by the first neural network 810, and the best z dimension identified by the second neural network 860. The final output grasp location (x*y*z*) is provided to a robot controller which then provides commands to the robot to grasp the part at the identified coordinates. After the part is grasped, a new depth image would be provided to the neural networks 810 and 860, and coordinates of a new best grasp location computed.
By modularizing the 3-DOF grasp search into two networks—one network searching two dimensions, and another network searching one dimension—the overall search performance is improved. For example, consider a case where the x and y dimensions are each divided into a fairly coarse 20×20 grid, and the z dimension is divided into 10 layers. Using the disclosed network modularization techniques, the first neural network 810 searches a space of size 20×20=400, and the second neural network 860 searches a space of size 10; the resulting modularized search space has a size of 400+10=410. If all three dimensions were searched in a single network, that network would have a search space with a size of 20×20×10=4000.
A 3D depth image 910 (depicting a pile of objects in a bin, for example) is provided to a grasp proposal network 920. The grasp proposal network 920 is a fully convolutional network (FCN)—as it receives an image as input (the depth image 910) and provides an image as output (a grasp quality map 930). The grasp quality map 930 is provided to a grasp ranking network 940—which is a convolutional neural network (CNN), as it receives an image in and provides scalar data out (gripper width and rotation). The gripper width and rotation (shown at 950) from the grasp ranking network 940 are combined with the best grasp position (x/y/z) from the grasp quality map 930 produced by the grasp proposal network 920; together, this provides a 5-DOF grasp definition (x/y/z/w/θ) to be used by a robot controller.
As discussed in detail previously, the grasp proposal network 920 and the grasp ranking network 940 are first trained using supervised learning, and then operated in inference mode. In training, the grasp proposal network 920 is provided with depth images and corresponding grasp quality maps. The grasp ranking network 940 is trained by providing the depth image and corresponding quality maps as provided to the grasp proposal network 920, along with the desired outputs of gripper width/rotation and the final grasp quality. An automated method for performing this training is discussed further below.
In inference mode, the grasp proposal network 920 is provided with depth images only (and provides a quality map as output), while the grasp ranking network 940 is provided with the depth image and corresponding quality map as input (and provides outputs of gripper width/rotation and the final grasp quality associated with the location of the best quality grasp chosen from the quality map).
By modularizing the 5-DOF grasp search into two networks, with the grasp proposal network 920 searching the three position dimensions and the grasp ranking network 940 searching the gripper width and rotation, the overall search space is again reduced from the product of the dimension sizes to their sum, as in the earlier examples. Following is a description of how the training data for these networks is automatically generated.
In a first data preparation step at box 1010, automatic grasp searching on individual objects in a database is shown. Multiple grasps of an object 1012 by a gripper 1014 are illustrated. 3D solid or surface models of parts to be analyzed are provided, along with gripper data including geometry and operational parameters (finger joint locations, joint angle ranges, etc.). An iterative optimization method is used to produce robust grasp candidates based on part shape and gripper parameters. The step shown in the box 1010 provides a plurality of quality grasp positions and orientations for an individual part (the object 1012) by a particular gripper (the gripper 1014). These grasp poses can be computed automatically for many different objects using many different grippers.
In a second data preparation step at box 1020, robust grip simulation is performed, taking variation and interference into consideration. At this step, objects are further randomly sampled into dense clusters by simulating a stream of the objects tumbling into a bin and randomly settling in a pile of objects having various positions, orientations and entanglements. The pose of each object in the simulated pile is known, so the previously generated grasps (from the box 1010) can be tested to determine their effectiveness in simulated real-world conditions (entanglements and interferences). The success of each previously generated grasp is tested in this way, using a 3D depth image of the simulated pile of objects along with the previously generated grasps. The step shown at the box 1020 is a physical environment simulation which is performed entirely using mathematical simulations, not using actual parts and images. The simulated depth image, grasp location quality maps, grasp poses, and the success rates (collectively shown at box 1022) are stored and later used to train the grasp learning networks of
The grasp optimizations and simulations described above and depicted in the boxes 1010 and 1020 were disclosed in U.S. patent application Ser. No. 17/016,731, titled EFFICIENT DATA GENERATION FOR GRASP LEARNING WITH GENERAL GRIPPERS, filed 10 Sep. 2020 and commonly assigned with the present application, and hereby incorporated by reference in its entirety.
The grasp learning networks described above (the grasp proposal network 920 and the grasp ranking network 940) are trained using the data produced by the grasp optimization and physical environment simulation steps, as follows.
The training of the grasp proposal network 920 requires the depth image 910 as input. The depth image 910 can be provided from the box 1022, where the depth image 910 depicts the pile of objects from the physical environment simulation. For supervised learning, the training of the grasp proposal network 920 also requires the grasp quality map 930 depicting the quality of grasps at different pixel locations. The quality map 930 is also provided from the box 1022, where quality maps were computed from the physical environment simulation. The physical environment simulation shown in the box 1020 can be performed many times (thousands of times), with each random simulation providing a different random pile of objects, resulting in an ample quantity and diversity of the depth images 910 and corresponding quality maps 930 to train the grasp proposal network 920.
Next, the grasp ranking network 940 is trained. This network uses depth image crops at different angles (prepared from the quality map 930) as input, and outputs the gripper rotation angle (θ) and gripper width (w) as shown on the line 950, along with the corresponding grasp quality, also using the simulation results from the box 1022 as a basis for training (supervised learning). With the gripper rotation angle and gripper width included, the output at the box 960 now includes five grasping degrees of freedom. Stating again to be clear—the grasp optimization method of the box 1010 produces many different grasps for an object using a particular gripper; the physical environment simulation method of the box 1020 produces grasp quality simulation results for different grasp poses applied to randomly generated piles of objects; and the outputs of the physical environment simulation are used to train the grasp proposal network 920 and the grasp ranking network 940.
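A minimal supervised-training sketch along these lines is shown below; the tensors are random stand-ins for the simulation outputs of the box 1022, and the small stand-in network, loss function and optimizer settings are illustrative assumptions rather than the disclosed training procedure:

```python
import torch
import torch.nn as nn

# Random stand-ins for the simulation results from the box 1022 (shapes illustrative).
sim_depth_images = torch.randn(256, 1, 64, 64)   # simulated depth images of part piles
sim_quality_maps = torch.rand(256, 1, 64, 64)    # corresponding grasp quality maps

proposal_net = nn.Sequential(                    # toy stand-in for the grasp proposal FCN
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),
)
optimizer = torch.optim.Adam(proposal_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                           # regress the per-pixel grasp quality

for epoch in range(10):
    for i in range(0, len(sim_depth_images), 32):
        images = sim_depth_images[i:i + 32]
        targets = sim_quality_maps[i:i + 32]
        predictions = proposal_net(images)       # supervised learning: compare to targets
        loss = loss_fn(predictions, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```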
Following the training of the two neural networks (920, 940) as described above, the networks are run in inference mode on live depth images of parts in a bin to compute 5-DOF grasps for robot execution. The disclosed modularization technique can be further extended to three neural networks; a 7-DOF grasp search method using three networks is described next.
In box 1110 are the steps in the first phase of the method: choosing the best approach direction. Given an input scene 1112, which for example may be images from a pair of 3D cameras of parts in a bin, the associated point cloud is projected to multiple sampled approach directions by a direction encoding box 1120 to produce synthetic images 1122. The images of the input scene 1112 include depth information, allowing the algorithm in the direction encoding box 1120 to produce synthetic surface images as viewed from a plurality of randomly selected projection points of view. In other words, if the robot gripper approaches from a certain direction, what will the 3D surface image look like from that direction? These synthetic images are created for many different randomly sampled projection angles and provided in the synthetic images 1122, where the projection angles are within directional limits from which the robot may approach.
An approach direction proposal network 1130 is used to predict the overall quality if grasping from each approach direction proposed in the synthetic images 1122. In other words, in the network 1130, the grasp is hidden, and the approach direction containing a best grasp quality is determined using the maximal margin technique discussed earlier. The approach direction is defined as two vectors, v1 and v2, which may be azimuth and elevation angles in a polar coordinate system, or any other components which define a three-dimensional approach direction vector. The chosen approach direction (v1, v2) is stored as approach direction vector 1138, and will be used later by the robot controller. The depth image associated with the optimal approach direction is also saved and used in the next phase.
In box 1140 are the steps in the second phase of the method: deciding the best grasp position. A depth image 1142 is provided, which is the synthetic depth image (from the set of images 1122) associated with the optimal approach direction 1138 chosen above. The depth image 1142 is cut into slices at different heights at slice cutting box 1150, and the slices are sent to a grasp position proposal network 1160. The grasp position proposal network 1160 generates quality maps for the image slices of different heights, as shown at 1164 and as discussed earlier for the depth image layers. The grasp position (x,y,z) with the highest quality is selected from the quality maps and stored at 1168.
In box 1170 are the steps in the third and final phase of the method: deciding the grasp angle (θ) and width (w). A depth image 1172 is provided, which is the depth image associated with the desired approach direction 1138 (v1, v2) and the desired grasp position 1168 (x,y,z) selected above. At box 1180, image patches are cropped at different angles around the top-quality grasp position stored at 1168. These image patches (1182) are sent to a grasp ranking network 1190 to output the qualities and widths (1194) for each evaluated image patch and angle. The grasp angle (θ) and width (w) corresponding to the highest quality are selected and stored at 1198.
The desired approach direction (2 DOF—v1,v2) stored at 1138, the best grasp position (3 DOF—x,y,z) stored at 1168 and the best grasp width/angle (2 DOF—w,θ) stored at 1198 are sent to the robot controller for execution, as indicated at 1199. That is, the robot controller instructs the robot to grasp a part from the bin using robot motion commands computed from the approach direction, grasp position and grasp width/angle information. The robot then places the part in a prescribed location (such as on a conveyor, or in a shipping container). The method then returns to the box 1110 where a new image for a new input scene 1112 is provided, and the grasp selection steps and robot execution are repeated.
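The chaining of the three phases can be summarized in a brief sketch; the three callables are assumed stand-ins for the networks 1130, 1160 and 1190, and the example values are arbitrary:

```python
def seven_dof_grasp(point_cloud, propose_direction, propose_position, rank_grasp):
    """Sketch of chaining the three phases described above; the callables are assumed
    stand-ins for the trained networks, not the disclosed implementations."""
    # Phase 1: choose the approach direction (v1, v2) and keep its synthetic depth image.
    v1, v2, depth_image = propose_direction(point_cloud)
    # Phase 2: choose the grasp position (x, y, z) from sliced quality maps.
    x, y, z = propose_position(depth_image)
    # Phase 3: choose the gripper angle and width at that position.
    theta, width = rank_grasp(depth_image, (x, y, z))
    return (v1, v2, x, y, z, theta, width)   # the full 7-DOF grasp sent to the controller

# Trivial usage example with placeholder callables and arbitrary values.
grasp = seven_dof_grasp(
    point_cloud=None,
    propose_direction=lambda pc: (0.1, -0.2, "depth_image"),
    propose_position=lambda img: (0.30, 0.05, -0.04),
    rank_grasp=lambda img, pos: (45.0, 0.08),
)
print(grasp)   # (0.1, -0.2, 0.3, 0.05, -0.04, 45.0, 0.08)
```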
The use of three neural networks for a 7-DOF grasp search application, as just described, further illustrates the flexibility of the disclosed modularization technique: each network encodes and searches only its own group of dimensions, so the total search complexity grows with the sum of the group sizes rather than with their product.
The examples discussed above illustrate the general network modularization method, which can be summarized as follows. At box 1210, the dimensions of the high dimensional task are separated into two or more sets of grouped dimensions, with the grouping chosen to suit the particular application (for example, grasp positions in one group and grasp rotations in another).
At box 1220, two or more neural networks are provided, where one neural network is provided for each of the sets of grouped dimensions from the box 1210. The neural networks are concatenated in series (for inference mode) as shown in the preceding figures and discussed extensively. The neural networks run on a computer such as the computer 120 described above.
At box 1230, the two or more neural networks are independently trained using supervised learning. The supervised learning technique involves providing each network with a large number of training examples, where each example includes both inputs and desired outputs of the network. For example, in the grasping application, the first neural network is trained with depth image layers as inputs and grasp quality maps as desired outputs, while the second neural network is trained with cropped and rotated depth image patches as inputs and the corresponding scalar grasp qualities as desired outputs.
At box 1240, the neural networks are run in inference mode, where an input defining an environment of the problem is provided, and each of the neural networks searches only its corresponding set of grouped dimensions to find the target values. The target values output from each of the neural networks are used as inputs by others of the neural networks downstream in the series. For example, in a two-network system, the input depth image is the only input provided to the first neural network and is used by the first neural network to compute a grasp quality map output having the target values of position coordinates (first set of grouped dimensions) of high quality grasps. The input depth image and the quality map are then provided as input to the second neural network (further processed such as by cropping and rotation) and are used by the second neural network to compute the target values of rotations (second set of grouped dimensions) of high quality grasps.
At box 1250, the outputs of the two or more neural networks are combined to provide a final output. In the two-network example discussed with respect to the box 1240, the highest quality metric value is selected; the corresponding target values of the second set of grouped dimensions are then combined with the target values of the first set of grouped dimensions (from the first neural network) to make up the complete and final output. In the case of grasp searching from depth images, the final output is the concatenated sets of dimensions or degrees of freedom corresponding to the highest quality grasp.
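In code form, this general flow might look like the following sketch; the function and stage interfaces are assumptions, with trivial stand-ins in place of the trained networks:

```python
def run_modularized_search(environment, stages):
    """Generic sketch of the flow in boxes 1210-1250 (names are assumptions): each
    stage is a trained network that searches only its own group of dimensions,
    conditioned on the target values already chosen by the upstream stages."""
    chosen = {}                               # target values accumulated group by group
    for search_group in stages:               # one network per set of grouped dimensions
        chosen.update(search_group(environment, chosen))
    return chosen                             # concatenated dimensions of the final output

# Example with two stand-in stages mirroring the grasping case: positions, then rotations.
grasp = run_modularized_search(
    environment={"depth_image": None},
    stages=[lambda env, prev: {"p_star": (10, 20, -0.05)},
            lambda env, prev: {"r_star": 45.0}],
)
print(grasp)                                  # {'p_star': (10, 20, -0.05), 'r_star': 45.0}
```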
The disclosed methods for neural network modularization to learn high dimensional robot tasks offer many advantages over previously known methods. One great advantage of the disclosed methods is fast computation, because searching in a high-dimensional action space (e.g., 5 or more DOF) is avoided. Separation of the search dimensions into two or three neural networks offers a computation speed improvement factor of several orders of magnitude, as described above. This complexity reduction and speed improvement makes it possible to compute high-dimensional tasks that were simply not practical using existing methods.
The decoupling of the search space of the different neural networks allows the type of neural network to be optimally chosen for each task—such as fully convolutional networks for image-in/image-out computation, and a convolutional neural network to be used for a final scalar output computation. In addition, network design and performance are transparent and easy to analyze. In the disclosed technique, each network searches a separate portion of the control action space, and each network is trained independently from other networks. Therefore, the performance of each network can be analyzed independently without considering the outputs from other networks. The network decoupling or modularization is enabled by using the maximal margin technique for encoding one network's set of grouped dimensions while hiding others.
Throughout the preceding discussion, various computers and controllers are described and implied. It is to be understood that the software applications and modules of these computers and controllers are executed on one or more computing devices having a processor and a memory module. In particular, this includes a processor in the robot controller 110 which controls the robot performing the object grasping, in the computer 120 and in any other computer which is used for neural network training and inference/execution. Specifically, the processors in the computer(s) are configured to perform the image analysis, neural network training and execution in the manner described throughout the foregoing disclosure—for grasp learning or other neural network applications.
While a number of exemplary aspects and embodiments of the disclosed techniques for modularization of high dimension neural networks have been discussed above, those of skill in the art will recognize modifications, permutations, additions and sub-combinations thereof. It is therefore intended that the following appended claims and claims hereafter introduced are interpreted to include all such modifications, permutations, additions and sub-combinations as are within their true spirit and scope.