This disclosure relates generally to a system and method for obtaining a 3D pose of an object and, more particularly, to a robot system that obtains a 3D pose of an object that is part of a group of objects, where the system obtains an RGB image of the objects, segments the image using an image segmentation process, crops out the segmented images of the objects and uses a learning-based neural network to obtain the 3D pose of each object in the cropped images.
Robots perform a multitude of tasks, including pick and place operations in which the robot picks up objects from one location, such as a collection bin, and moves them to another location, such as a conveyor belt, where the location and orientation of each object in the bin, known as the object's 3D pose, is slightly different. Thus, in order to effectively pick up an object, the robot often needs to know the object's 3D pose. To identify the 3D pose of an object being picked up from a bin, some robot systems employ a 3D camera that generates 2D red-green-blue (RGB) color images of the bin and 2D gray-scale depth map images of the bin, where each pixel in the depth map image has a value that defines the distance from the camera to a point on an object, i.e., the closer the object is to the camera, the lower the pixel's value. The depth map image identifies distance measurements to points in a point cloud in the field-of-view of the camera, where a point cloud is a collection of data points defined in a certain coordinate system, each point having an x, y and z value. However, if the object being picked up by the robot is transparent, light is not accurately reflected from the surface of the object, so the point cloud generated by the camera is not effective and the depth map image is not reliable, and thus the object cannot be reliably identified to be picked up.
U.S. patent application Ser. No. 16/839,274, titled 3D Pose Estimation by a 2D Camera, filed Apr. 3, 2020, assigned to the assignee of this application and herein incorporated by reference, discloses a robot system for obtaining a 3D pose of an object using 2D images from a 2D camera and a learning-based neural network, which is able to identify the 3D pose of a transparent object being picked up. The neural network extracts a plurality of features of the object from the 2D images and generates a heatmap for each of the extracted features that identifies, by a color representation, the probability of the location of a feature point on the object. The method provides a feature point image that includes the feature points from the heatmaps overlaid on the 2D images, and estimates the 3D pose of the object by comparing the feature point image with a 3D virtual CAD model of the object. In other words, an optimization algorithm is employed to rotate and translate the CAD model so that the projected feature points of the model match the predicted feature points in the image.
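A common way to read feature point locations out of such heatmaps is to take the peak of each heatmap channel; the sketch below illustrates this step only, with assumed array names and shapes that are not part of the '274 disclosure.

```python
import numpy as np

def heatmaps_to_feature_points(heatmaps):
    """Convert a stack of per-feature heatmaps, shape (num_features, H, W),
    into (x, y) pixel coordinates by taking the location of the maximum
    response in each channel."""
    points = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)  # peak of this heatmap
        points.append((float(x), float(y)))
    return np.array(points)  # shape (num_features, 2)
```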
As mentioned, the '274 robotic system predicts multiple feature points on the images of the object being picked up by the robot. However, if the robot is selectively picking up an object from a group of objects, such as objects in a bin, there would be multiple objects in the image and each object would have multiple predicted feature points. Therefore, when the CAD model is rotated, its projected feature points may match the predicted feature points on different objects, thus preventing the process from reliably identifying the pose of a single object.
The following discussion discloses and describes a system and method for obtaining a 3D pose of objects to allow a robot to pick up the objects. The method includes obtaining a 2D red-green-blue (RGB) color image of the objects using a camera, and generating a segmentation image from the RGB image by performing an image segmentation process using a deep learning convolutional neural network that extracts features from the RGB image and assigns a label to the pixels in the segmentation image so that the pixels belonging to each object have the same label. The method also includes separating the segmentation image into a plurality of cropped images, where each cropped image includes one of the objects, estimating the 3D pose of each object in each cropped image, and combining the 3D poses into a single pose image. The steps of obtaining the color image, generating the segmentation image, separating the segmentation image, estimating the 3D pose of each object and combining the 3D poses are performed each time an object is picked up from the group of objects by the robot.
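A minimal sketch of this sequence of steps is given below; every callable is a hypothetical placeholder supplied by the surrounding system and is not part of the claimed method.

```python
def pick_cycle(capture_rgb, segment, crop_objects, estimate_pose, combine_poses, pick):
    """One pick cycle following the steps above; each argument is a callable
    standing in for the corresponding component (camera, segmentation network,
    cropping step, pose estimator, pose combiner and robot pick action)."""
    rgb = capture_rgb()                        # 2D RGB image of the group of objects
    seg = segment(rgb)                         # per-pixel labels, one label per object
    crops = crop_objects(seg, rgb)             # one cropped image per labeled object
    poses = [estimate_pose(c) for c in crops]  # 3D pose of the object in each crop
    pick(combine_poses(poses))                 # combine poses into one image, pick one object
    # the cycle is repeated each time an object is removed from the group
```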
Additional features of the disclosure will become apparent from the following description and appended claims, taken in conjunction with the accompanying drawings.
The following discussion of the embodiments of the disclosure, directed to a robot system that obtains a 3D pose of an object that is in a group of transparent objects, where the system obtains an RGB image of the objects, segments the image using an image segmentation process, crops out the segmented images of the objects and uses a learning-based neural network to obtain the 3D pose of each segmented object, is merely exemplary in nature and is in no way intended to limit the invention or its applications or uses. For example, the system and method have application for determining the position and orientation of a transparent object that is in a group of transparent objects. However, the system and method may have other applications.
In order for the robot 12 to effectively grasp and pick up the objects 16, it needs to be able to position the end-effector 14 at the proper location and orientation before it grabs the object 16. As will be discussed in detail below, the robot controller 22 employs an algorithm that allows the robot 12 to pick up the objects 16 without having to rely on an accurate depth map image. More specifically, the algorithm performs an image segmentation process using the different colors of the pixels in an RGB image from the camera 20. Image segmentation is a process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics. Thus, the segmentation process predicts which pixel belongs to which of the objects 16.
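As a concrete illustration (hypothetical values, not taken from the figures), a segmentation of a small image into two objects can be represented as a label map, where 0 marks background pixels and 1 and 2 mark the pixels predicted to belong to two different objects:

```python
import numpy as np

# hypothetical 4x4 label map: 0 = background, 1 and 2 = two segmented objects
segmentation = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 2],
])
object_one_pixels = np.argwhere(segmentation == 1)  # pixel coordinates of object 1
```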
Modern image segmentation techniques may employ deep learning technology. Deep learning is a particular type of machine learning that provides greater learning performance by representing a certain real-world environment as a hierarchy of increasingly complex concepts. Deep learning typically employs a software structure comprising several layers of neural networks that perform nonlinear processing, where each successive layer receives the output from the previous layer. Generally, the layers include an input layer that receives raw data from a sensor, a number of hidden layers that extract abstract features from the data, and an output layer that identifies a certain thing based on the features extracted by the hidden layers. The neural networks include neurons or nodes that each have a "weight" that is multiplied by the input to the node to obtain a probability of whether something is correct. More specifically, each of the nodes has a weight that is a floating point number that is multiplied with the input to the node to generate an output for that node that is some proportion of the input. The weights are initially "trained", or set, by causing the neural networks to analyze a set of known data under supervised processing and by minimizing a cost function so that the network obtains the highest probability of a correct output.
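The following is a minimal sketch of that layered, weighted computation using only numpy; the layer sizes and random weights are arbitrary examples, and a trained network would obtain its weights by minimizing a cost function over known data as described above.

```python
import numpy as np

def forward(x, weights):
    """Pass raw input x through successive layers: each layer multiplies its
    input by learned weights and applies a nonlinear function (ReLU here)."""
    for W in weights[:-1]:
        x = np.maximum(0.0, W @ x)   # hidden layers extract abstract features
    return weights[-1] @ x           # output layer produces a score per class

# arbitrary example: 4 raw inputs, one hidden layer of 8 nodes, 3 output classes
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
scores = forward(rng.normal(size=4), weights)
```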
The sliding window search produces a bounding box image 54 including a number of bounding boxes 52 that each surround a predicted object in the image 44, where the number of bounding boxes 52 in the image 54 is reduced each time the robot 12 removes one of the objects 16 from the bin 18. The module 50 parameterizes the center location (x, y), width (w) and height (h) of each box 52 and provides a prediction confidence value between 0% and 100% that an object 16 exists in the box 52. The image 54 is provided to a binary segmentation module 56 that estimates, using a neural network, whether each pixel in each of the bounding boxes 52 belongs to the object 16, which eliminates background pixels in the box 52 that are not part of the object 16. The remaining pixels in the image 54 in each of the boxes 52 are assigned a value for a particular object 16 so that a 2D segmentation image 58 is generated that identifies the objects 16 by different indicia, such as color. The image segmentation process as described is thus a modified form of a deep learning Mask R-CNN (convolutional neural network). The segmented objects in the image 58 are then cropped to separate each of the identified objects 16 in the image 58 into cropped images 60, each having only one of the objects 16.
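For illustration only, the detection, binary segmentation and cropping steps can be approximated with a stock Mask R-CNN from torchvision; the disclosure describes a modified network trained for the objects 16, so the model, score threshold and tensor layout below are assumptions rather than the disclosed implementation.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()  # stand-in for the modified network

def segment_and_crop(rgb, score_threshold=0.5):
    """rgb is a float tensor of shape (3, H, W) in [0, 1]; returns one cropped
    image per confidently detected object, mirroring the cropping of image 58."""
    with torch.no_grad():
        out = model([rgb])[0]                    # boxes (x1, y1, x2, y2), scores, masks
    crops = []
    for box, score in zip(out["boxes"], out["scores"]):
        if score < score_threshold:              # discard low-confidence bounding boxes
            continue
        x1, y1, x2, y2 = box.int().tolist()
        crops.append(rgb[:, y1:y2, x1:x2])       # region containing a single object
    return crops
```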
Each of the cropped images 60 is then sent to a separate 3D pose estimation module 70 that performs the 3D pose estimation of the object 16 in that image 60 to obtain an estimated 3D pose 72 in the same manner, for example, as in the '274 application.
The image 94 is then compared, in a pose estimation processor 98, to a nominal or virtual 3D CAD model of the object 16 that has the same feature points, to provide the estimated 3D pose 72 of the object 16. One suitable algorithm for comparing the image 94 to the CAD model is known in the art as perspective-n-point (PnP). Generally, the PnP process estimates the pose of an object with respect to a calibrated camera given a set of n 3D points of the object in the world coordinate frame and their corresponding 2D projections in an image from the camera 20. The pose includes six degrees-of-freedom (DOF) made up of the rotation (roll, pitch and yaw) and the 3D translation of the object with respect to the camera coordinate frame.
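One way the PnP step is commonly carried out is with OpenCV's solvePnP, sketched below; the camera intrinsic matrix and the point arrays are placeholders that would be supplied by the camera calibration and the feature prediction stages.

```python
import numpy as np
import cv2

def pnp_pose(model_points, image_points, camera_matrix):
    """Given n 3D feature points on the CAD model (n, 3) and their 2D
    projections in the camera image (n, 2), recover the 6-DOF pose of the
    object with respect to the calibrated camera."""
    ok, rvec, tvec = cv2.solvePnP(
        model_points.astype(np.float64),
        image_points.astype(np.float64),
        camera_matrix.astype(np.float64),
        distCoeffs=None)                  # assume an undistorted, calibrated camera
    R, _ = cv2.Rodrigues(rvec)            # rotation matrix (roll, pitch and yaw)
    return R, tvec                        # rotation plus 3D translation
```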
This analysis is depicted by equation (1) for one of the corresponding feature points between the images 108 and 116, where equation (1) is used for all of the feature points of the images 108 and 116.
where Vi is one of the feature points 104 on the CAD model 114, vi is the corresponding projected feature point 102 in the model image 116, ai is the corresponding feature point 102 on the object image 108, R is the rotation and T is the translation of the CAD model 114, both with respect to the camera 112, the symbol ′ is the vector transpose, and the subscript i refers to any feature point with index i. By solving equation (1) with an optimization solver, the optimal rotation and translation can be calculated, thus providing the estimation of the 3D pose 72 of the object 16.
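Read together with these definitions, equation (1) corresponds to a least-squares reprojection objective of the following general form, where π(·) denotes projection into the image by the camera 112 (a reconstruction under the stated definitions rather than a verbatim reproduction of the original equation):

\[
\min_{R,\,T} \sum_{i} \left(a_i - v_i\right)'\left(a_i - v_i\right),
\qquad v_i = \pi\!\left(R\,V_i + T\right)
\]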
All of the 3D poses 72 are combined into a single image 74, and the robot 12 selects one of the objects 16 to pick up. Once the object 16 is picked up and moved by the robot 12, the camera 20 takes new images of the bin 18 so that the next object 16 can be picked up. This process continues until all of the objects 16 have been picked up.
The discussion above describes identifying the 3D pose of objects in a group of objects of the same type or category, i.e., transparent bottles. However, the process described above also has application in identifying the 3D pose of objects in a group of objects of different types or categories. This is illustrated by a segmented image 124 shown in
As will be well understood by those skilled in the art, the several and various steps and processes discussed herein to describe the disclosure may be referring to operations performed by a computer, a processor or other electronic calculating device that manipulate and/or transform data using electrical phenomenon. Those computers and electronic devices may employ various volatile and/or non-volatile memories including non-transitory computer-readable medium with an executable program stored thereon including various code or executable instructions able to be performed by the computer or processor, where the memory and/or computer-readable medium may include all forms and types of memory and other computer-readable media.
The foregoing discussion discloses and describes merely exemplary embodiments of the present disclosure. One skilled in the art will readily recognize from such discussion and from the accompanying drawings and claims that various changes, modifications and variations can be made therein without departing from the spirit and scope of the disclosure as defined in the following claims.