This disclosure relates generally to computer vision and machine learning systems, and more particularly to image-based pose estimation for various objects.
In general, there are a variety of computer applications that involve object pose estimation with six degrees of freedom (6DoF), such as robotic navigation, autonomous driving, and augmented reality (AR) applications. For 6DoF object pose estimation, a prototypical methodology typically relies on the detection of semantic keypoints that are predefined for each object. However, there are a number of challenges with respect to detecting semantic keypoints for textureless or symmetric objects because some of their semantic keypoints may become interchanged. Accordingly, the detection of semantic keypoints for those objects across different frames can be highly inconsistent, such that the detections cannot contribute to valid 6DoF poses under the world coordinate system.
The following is a summary of certain embodiments described in detail below. The described aspects are presented merely to provide the reader with a brief summary of these certain embodiments and the description of these aspects is not intended to limit the scope of this disclosure. Indeed, this disclosure may encompass a variety of aspects that may not be explicitly set forth below.
According to at least one aspect, a computer-implemented method includes obtaining an image that displays a scene with a first object and a second object. The method includes generating a first set of two-dimensional (2D) keypoints corresponding to the first object. The method includes generating first object pose data based on the first set of 2D keypoints. The method includes generating camera pose data based on the first object pose data. The camera pose data corresponds to capture of the image. The method includes generating a keypoint heatmap based on the camera pose data. The method includes generating a second set of 2D keypoints corresponding to the second object based on the keypoint heatmap. The method includes generating second object pose data based on the second set of 2D keypoints. The method includes generating first coordinate data of the first object in world coordinates using the first object pose data and the camera pose data. The method includes generating second coordinate data of the second object in the world coordinates using the second object pose data and the camera pose data. The method includes tracking the first object in the world coordinates using the first coordinate data. The method includes tracking the second object in the world coordinates using the second coordinate data.
According to at least one aspect, a system includes at least a camera and a processor. The processor is in data communication with the camera. The processor is operable to receive a plurality of images from the camera. The processor is operable to obtain an image that displays a scene with a first object and a second object. The processor is operable to generate a first set of 2D keypoints corresponding to the first object. The processor is operable to generate first object pose data based on the first set of 2D keypoints. The processor is operable to generate camera pose data based on the first object pose data. The camera pose data corresponds to capture of the image. The processor is operable to generate a keypoint heatmap based on the camera pose data. The processor is operable to generate a second set of 2D keypoints corresponding to the second object based on the keypoint heatmap. The processor is operable to generate second object pose data based on the second set of 2D keypoints. The processor is operable to generate first coordinate data of the first object in world coordinates using the first object pose data and the camera pose data. The processor is operable to generate second coordinate data of the second object in the world coordinates using the second object pose data and the camera pose data. The processor is operable to track the first object based on the first coordinate data. The processor is operable to track the second object based on the second coordinate data.
According to at least one aspect, one or more non-transitory computer readable storage media store computer readable data with instructions that, when executed by one or more processors, cause the one or more processors to perform a method. The method includes obtaining an image that displays a scene with a first object and a second object. The method includes generating a first set of 2D keypoints corresponding to the first object. The method includes generating first object pose data based on the first set of 2D keypoints. The method includes generating camera pose data based on the first object pose data. The camera pose data corresponds to capture of the image. The method includes generating a keypoint heatmap based on the camera pose data. The method includes generating a second set of 2D keypoints corresponding to the second object based on the keypoint heatmap. The method includes generating second object pose data based on the second set of 2D keypoints. The method includes generating first coordinate data of the first object in world coordinates using the first object pose data and the camera pose data. The method includes generating second coordinate data of the second object in the world coordinates using the second object pose data and the camera pose data. The method includes tracking the first object in the world coordinates using the first coordinate data. The method includes tracking the second object in the world coordinates using the second coordinate data.
These and other features, aspects, and advantages of the present invention are discussed in the following detailed description in accordance with the accompanying drawings throughout which like characters represent similar or like parts.
The embodiments described herein have been shown and described by way of example, and many of their advantages will be understood from the foregoing description. It will be apparent that various changes can be made in the form, construction, and arrangement of the components without departing from the disclosed subject matter or sacrificing one or more of its advantages. Indeed, the described forms of these embodiments are merely explanatory. These embodiments are susceptible to various modifications and alternative forms, and the following claims are intended to encompass and include such changes and not be limited to the particular forms disclosed, but rather to cover all modifications, equivalents, and alternatives falling within the spirit and scope of this disclosure.
The system 100 includes at least a processing system 110 with at least one processing device. For example, the processing system 110 includes at least an electronic processor, a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), any suitable processing technology, or any number and combination thereof. As a non-limiting example, the processing system may include at least one GPU and at least one CPU, for instance, such that machine learning inference is performed by the GPU while other operations are performed by the CPU. The processing system 110 is operable to provide the functionalities of the semantic SLAM and 6DoF pose estimations as described herein.
The system 100 includes a memory system 120, which is operatively connected to the processing system 110. The memory system 120 includes at least one non-transitory computer readable storage medium, which is configured to store and provide access to various data to enable at least the processing system 110 to perform the operations and functionality, as disclosed herein. The memory system 120 comprises a single memory device or a plurality of memory devices. The memory system 120 may include electrical, electronic, magnetic, optical, semiconductor, electromagnetic, or any suitable storage technology that is operable with the system 100. For instance, in an example embodiment, the memory system 120 can include random access memory (RAM), read only memory (ROM), flash memory, a disk drive, a memory card, an optical storage device, a magnetic storage device, a memory module, any suitable type of memory device, or any number and combination thereof. With respect to the processing system 110 and/or other components of the system 100, the memory system 120 is local, remote, or a combination thereof (e.g., partly local and partly remote). For instance, in an example embodiment, the memory system 120 includes at least a cloud-based storage system (e.g. cloud-based database system), which is remote from the processing system 110 and/or other components of the system 100.
The memory system 120 includes at least a semantic SLAM framework 130, the machine learning system 140, training data 150, and other relevant data 160, which are stored thereon. The semantic SLAM framework 130 includes computer readable data with instructions that, when executed by the processing system 110, cause the processing system 110 to train, deploy, and/or employ one or more machine learning systems 140. The computer readable data can include instructions, code, routines, various related data, any software technology, or any number and combination thereof. In an example embodiment, as shown in
In an example embodiment, the machine learning system 140 includes a convolutional neural network (CNN), any suitable encoding and decoding network, any suitable artificial neural network model, or any number and combination thereof. Also, the training data 150 includes at least a sufficient amount of sensor data (e.g. video data, digital image data, cropped image data, etc.), timeseries data, various loss data, various weight data, and various parameter data, as well as any related machine learning data that enables the system 100 to provide the semantic SLAM framework 130 and the trained machine learning system 140, as described herein. Meanwhile, the other relevant data 160 provides various data (e.g. operating system, machine learning algorithms, computer-aided design (CAD) databases, etc.), which enables the system 100 to perform the functions as discussed herein. As aforementioned, the system 100 is configured to train, employ, and/or deploy at least one machine learning system 140.
The system 100 is configured to include at least one sensor system 170. The sensor system 170 includes one or more sensors. For example, the sensor system 170 includes an image sensor, a camera, a radar sensor, a light detection and ranging (LIDAR) sensor, a thermal sensor, an ultrasonic sensor, an infrared sensor, a motion sensor, an audio sensor, any suitable sensor, or any number and combination thereof. The sensor system 170 is operable to communicate with one or more other components (e.g., processing system 110 and memory system 120) of the system 100. For example, the sensor system 170 may provide sensor data, which is then processed by the processing system 110 to generate suitable input data (e.g., digital images) for the semantic SLAM framework 130 and the machine learning system 140. In this regard, the processing system 110 is configured to obtain the sensor data directly or indirectly from one or more sensors of the sensor system 170. The sensor system 170 is local, remote, or a combination thereof (e.g., partly local and partly remote). Upon receiving the sensor data, the processing system 110 is configured to process this sensor data (e.g., perform object detection via another machine learning system stored in the memory system 120 to obtain bounding boxes and object classes) and provide this processed sensor data in a suitable format (e.g., digital image data, cropped image data, etc.) in connection with the semantic SLAM framework 130, the machine learning system 140, the training data 150, or any number and combination thereof.
In addition, the system 100 may include at least one other component. For example, as shown in
In addition,
Referring back to
The keypoint network is configured to predict the 2D keypoint coordinates together with their uncertainty. In addition, to enable it to provide consistent keypoint tracks for symmetric objects, the keypoint network optionally takes prior keypoint heatmap inputs that are expected to be somewhat noisy. The backbone architecture of the keypoint network is the stacked hourglass network with a stack of two hourglass networks. The machine learning system 140 uses a multi-channel keypoint parameterization due to its simplicity. With this formulation, each channel is responsible for predicting a single keypoint, and all of the keypoints for the dataset are combined into one output tensor, thereby allowing a single keypoint network to be used for all of the objects.
Given the image and prior input cropped to a bounding box and resized to a static input resolution, the keypoint network predicts an N×H/d×W/d tensor p, where H×W is the input resolution, d is the downsampling ratio (e.g., four), and N is the total number of keypoints for the dataset. From p, a set of N 2D keypoints {u1, u2, . . . , uN} and 2×2 covariance matrices {Σ1, Σ2, . . . , ΣN} are predicted. A keypoint membership vector m∈[0,1]N is also predicted from the average pooled raw logits of p, which is trained to decide which keypoints belong to the object and are within the bounding box. Note that the keypoint network is trained to still predict occluded keypoints. Each channel pi of p is enforced to be a 2D probability mass by utilizing a spatial softmax. The predicted keypoint is taken as the expected value of the 2D coordinates over this probability mass, ui=Σu,v pi(u,v)[u v]T. Unlike the non-differentiable argmax operation, this allows us to use the keypoint coordinate directly in the loss function, which relates to the uncertainty estimation.
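For purposes of illustration only, the spatial softmax and coordinate expectation may be sketched as follows; this is a minimal PyTorch-style sketch, and the function and tensor names are hypothetical rather than part of the claimed implementation:

```python
import torch

def soft_argmax_keypoints(p_logits):
    """Illustrative sketch: convert raw per-keypoint logit maps into
    2D keypoint coordinates via a differentiable spatial softmax.

    p_logits: tensor of shape (N, Hd, Wd), one channel per keypoint.
    Returns: (N, 2) expected [u, v] coordinates in heatmap pixels.
    """
    n, hd, wd = p_logits.shape
    # Spatial softmax: each channel becomes a 2D probability mass.
    p = torch.softmax(p_logits.reshape(n, -1), dim=1).reshape(n, hd, wd)
    # Coordinate grids for the expectation.
    vs, us = torch.meshgrid(torch.arange(hd, dtype=p.dtype),
                            torch.arange(wd, dtype=p.dtype), indexing="ij")
    # Expected value of the 2D coordinates over the probability mass:
    # u_i = sum over (u, v) of p_i(u, v) * [u, v]^T
    u = (p * us).sum(dim=(1, 2))
    v = (p * vs).sum(dim=(1, 2))
    return torch.stack([u, v], dim=1)
```

Because the expectation is differentiable, gradients can flow through the predicted coordinates into the loss, which is what permits the uncertainty-aware training noted above.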
Also, to efficiently track the keypoints over time during deployment, the system 100 is configured to obtain keypoint predictions having a symmetry hypothesis that is consistent with the 3D scene. The machine learning system 140 includes N extra channels as input to the keypoint network, which contain a prior detection of the object's keypoints. To create the training prior, the 3D keypoints are projected into the image plane with a perturbed ground truth object pose δT·OCT in order to make the keypoint network robust to noisy prior detections; the keypoints are placed in the correct channel, and each heatmap is set to a 2D Gaussian with a fixed σ=15.
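For purposes of illustration, one channel of such a training prior may be rendered as follows; this is a minimal NumPy sketch under the stated σ=15 assumption, and the function name is hypothetical:

```python
import numpy as np

def gaussian_prior_heatmap(u, v, height, width, sigma=15.0):
    """Illustrative sketch: render one prior keypoint channel as a 2D
    Gaussian centered on the projected keypoint (u, v)."""
    vs, us = np.mgrid[0:height, 0:width]
    return np.exp(-((us - u) ** 2 + (vs - v) ** 2) / (2.0 * sigma ** 2))
```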
A set of symmetry transforms, S={S1, S2, . . . , SM}, is assumed to be known for each symmetric object.
Since a prior for the initial detection may not be obtained, the system 100 is configured to predict initial keypoints for symmetric objects when the prior is not available. For this reason, during training, the keypoint network is given a prior detection only half of the time. The question then arises of how to detect the initial keypoints for symmetric objects without the prior. The ground truth pose cannot simply be used to create the keypoint label, since many images will look the same but have different keypoint labels, thereby creating an ill-posed one-to-many mapping. As opposed to the mirroring technique and an additional symmetry classifier, the system 100 utilizes the set of symmetry transforms. Thus, when the prior is not given to the keypoint network during training, the system 100 alleviates the ill-posed problem by choosing the symmetry for keypoint labels that brings the 3D keypoints closest (in orientation) to those transformed into a canonical view {Oc} in the camera frame:
S*=argminS∈S Σk ∥OCR S P̄k/∥P̄k∥ − OcCR P̄k/∥P̄k∥∥2 [1]
In equation 1, P̄k denotes the kth point of a mean-subtracted point cloud, OCR denotes the ground truth object-to-camera rotation, and OcCR denotes the object-to-camera rotation of the canonical view {Oc}.
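For purposes of illustration, this selection may be sketched as follows; this is a minimal NumPy sketch in which the rotation-matrix arguments and function name are assumptions rather than the claimed implementation:

```python
import numpy as np

def choose_symmetry(R_co, R_canon, symmetries, points):
    """Illustrative sketch: pick the symmetry transform whose rotated
    keypoints are closest in orientation to the canonical view.

    R_co: 3x3 ground-truth object-to-camera rotation.
    R_canon: 3x3 object-to-camera rotation of the canonical view.
    symmetries: list of 3x3 symmetry rotations for the object.
    points: (K, 3) mean-subtracted object point cloud (nonzero points).
    """
    # Compare unit directions so that only orientation matters.
    dirs = points / np.linalg.norm(points, axis=1, keepdims=True)
    target = dirs @ R_canon.T
    best, best_cost = None, np.inf
    for S in symmetries:
        candidate = dirs @ (R_co @ S).T
        cost = np.sum((candidate - target) ** 2)
        if cost < best_cost:
            best, best_cost = S, cost
    return best
```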
As shown in
Besides the first image, whose camera frame becomes the global reference frame {G}, the system 100 is configured to estimate the camera pose GCT with the set of object PnP poses and the current estimates of the objects in the global frame. For each asymmetric object that is both detected in the current frame with a successful PnP pose OCTpnp and has an estimated global pose OGT, the system 100 is configured to create a hypothesis about the current camera's pose as GCThyp=OCTpnp OGT−1. The system 100 is then configured to project the 3D keypoints from all objects that have both a global 3D estimate and a detection in the current image into the current image plane with this camera pose, and to count inliers with a χ2 test using the detected keypoints and uncertainty. The system 100 is configured to take the camera pose hypothesis with the most inliers as the final GCT, and to reject any hypothesis that has too few. After this, any objects that have valid PnP poses but are not yet initialized in the scene are given an initial pose OGT=GCT−1 OCTpnp. With a rough estimate of the current camera, the system 100 is configured to create the prior detections for the keypoints of symmetric objects by projecting the 3D keypoints for these objects into the current image, and constructing the prior keypoint heatmaps for keypoint network input.
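For purposes of illustration, the inlier counting for one camera pose hypothesis may be sketched as follows; this is a minimal NumPy sketch in which the projection function and data layout are assumptions:

```python
import numpy as np

CHI2_95_2DOF = 5.991  # 95% threshold of the 2-DoF chi-square distribution

def count_inliers(T_cg_hyp, objects, project):
    """Illustrative sketch: count keypoint inliers for one camera-pose
    hypothesis T_cg_hyp (4x4 global-to-camera transform).

    objects: iterable of (points_world, detections, covariances) tuples,
             one per object with both a global estimate and a detection.
    project: perspective projection function (camera frame -> pixels).
    """
    inliers = 0
    for points_world, detections, covariances in objects:
        for X, u, cov in zip(points_world, detections, covariances):
            X_cam = T_cg_hyp[:3, :3] @ X + T_cg_hyp[:3, 3]
            r = u - project(X_cam)
            # Mahalanobis test against the predicted 2x2 covariance.
            if r @ np.linalg.solve(cov, r) < CHI2_95_2DOF:
                inliers += 1
    return inliers
```

The hypothesis with the largest count would then be kept as the final GCT, consistent with the selection described above.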
Since each object is initialized with a PnP pose, it is possible that the initialization is very poor due to a PnP failure, and, if the pose is bad enough (e.g., off by a large orientation error), the optimization cannot fix it because it only reaches a local minimum. To address this issue, the system 100 is configured to check whether the PnP pose from the current image yields more inliers over the last few views than the currently estimated pose, and, if so, the system 100 is configured to re-initialize the object with the new pose. After this, the system 100 is configured to perform a quick local refinement of the camera pose by fixing the object poses and optimizing just the current camera pose to better register it into the scene.
The back-end global optimization module 404 runs periodically to refine the whole scene (object and camera poses) based on the measurements from each image. Rather than reduce the problem to a pose graph (i.e., using relative pose measurements from PnP), the system 100 is configured to keep the original noise model of using the keypoint detections as measurements, which allows us to weight each residual with the covariance prediction from the network. The global optimization problem is formulated by creating residuals that constrain the pose GCjT of each camera j and the pose OlGT of each object l through the detected keypoints:
rj,l,k=uj,l,k−Πj,l(GCjT OlGT Xl,k) [2]
where Xl,k is the kth 3D keypoint of object l in the object frame, and Πj,l is the perspective projection function for the bounding box of object l in image j. Thus, the full problem becomes minimizing the cost over the entire scene:
C=Σj,l,k sj,l,k ρH(rj,l,kT Σj,l,k−1 rj,l,k) [3]
where Σj,l,k is the 2×2 covariance matrix for the keypoint uj,l,k, ρH is the Huber norm, which reduces the effect of outliers during the optimization steps, and sj,l,k∈{0,1} is a binary variable that is 1 if the measurement was deemed an inlier before the optimization started, and 0 otherwise. Both ρH and sj,l,k use the same outlier threshold τ, which is derived from the 2-dimensional χ2 distribution and is always set to the 95% confidence threshold τ=5.991. Thus, the outlier threshold does not need to be manually tuned as long as the covariance matrix Σj,l,k properly captures the true error of keypoint uj,l,k.
To provide robustness to the optimization against outliers, the process is split into four sub-optimizations, where the system 100 is configured to re-classify inliers and outliers by recomputing sj,l,k before each sub-optimization starts. This way, outliers can become inliers again after the optimization updates the variables, and inliers can become outliers. Halfway through the optimization, the system 100 may remove the Huber norm, since most, if not all, of the outliers have already been excluded.
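For purposes of illustration, the robust cost of equation 3 may be evaluated as follows; this is a minimal NumPy sketch in which the data layout is an assumption:

```python
import numpy as np

TAU = 5.991  # 95% threshold of the 2-DoF chi-square distribution

def huber(e, tau=TAU):
    """Huber norm applied to the squared Mahalanobis error e."""
    return e if e <= tau else 2.0 * np.sqrt(tau * e) - tau

def scene_cost(residuals, covariances, inlier_mask, use_huber=True):
    """Illustrative sketch of equation 3: sum the (optionally robust)
    Mahalanobis errors of all measurements classified as inliers."""
    total = 0.0
    for r, cov, s in zip(residuals, covariances, inlier_mask):
        if not s:  # s_{j,l,k} = 0: measurement classified as an outlier
            continue
        e = r @ np.linalg.solve(cov, r)  # r^T Sigma^{-1} r
        total += huber(e) if use_huber else e
    return total
```

Dropping the Huber norm halfway through, as described above, corresponds to calling scene_cost with use_huber=False once the outliers have been excluded.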
Referring to the use case shown in
Prior to the first pass, the system 100 is initialized with an initial pass through the pipeline 400. More specifically, during the initial pass, the machine learning system 140 is configured to receive a cropped image of each object in an image taken at time t0. In this case, the initial pass includes a stream that includes each object in that image (e.g., both asymmetrical and symmetrical objects). In response to each cropped image, the machine learning system 140 (e.g., the keypoint network) is configured to generate 2D keypoints for each object at time t0. The system 100 is also configured to generate object pose data at time t0 for each object via a PnP process using the 2D keypoints for that object and 3D keypoints corresponding to a 3D model (e.g., CAD model) for that object. In this regard, the system 100 (e.g., memory system 120) includes a CAD database, which includes CAD models of various objects including each of the objects in the image 406. The CAD database also includes a set of 3D keypoints for each CAD model. In addition, during this initial pass, the camera pose data is set to be the global reference frame {G} and is not calculated in this instance. Also, coordinate data is generated for each object based on the object pose data with respect to the global reference frame. After the initial pass is performed, the system 100 is configured to perform the first pass and the second pass of the pipeline 400.
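For purposes of illustration, the object PnP step may be sketched with OpenCV's solvePnP as follows; the function name and data layout are hypothetical, and the disclosed embodiments are not limited to this solver:

```python
import cv2
import numpy as np

def object_pose_from_keypoints(points_3d, keypoints_2d, K):
    """Illustrative sketch: recover the 6DoF object-to-camera pose from
    2D keypoint detections and the matching 3D CAD-model keypoints.

    points_3d: (N, 3) model keypoints in the object frame.
    keypoints_2d: (N, 2) detected keypoints in image pixels.
    K: 3x3 camera intrinsic matrix.
    """
    ok, rvec, tvec = cv2.solvePnP(
        points_3d.astype(np.float64),
        keypoints_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T  # camera-from-object transform (OCT in the notation above)
```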
With respect to the first pass of the pipeline 400, the machine learning system 140 receives the first stream of one or more images of one or more objects, which are identified as asymmetrical. In this case, the machine learning system 140 receives the second cropped image 422 and the third cropped image 424, which are associated with asymmetric labels, as input. In response to receiving an image as input, the machine learning system 140 is configured to generate 2D keypoints for the object in that image. The machine learning system 140 is agnostic to the choice of keypoint. For example,
With the object pose data of each asymmetric object in the current camera frame, the system 100 is configured to obtain a coarse estimate of the current camera pose in the global frame. More specifically, if the current frame is not the first frame, then the current camera pose is also estimated through another PnP process based on the correspondence between all of the 2D keypoints of the asymmetric objects and their previously recovered 3D locations. In this regard, the system 100 is configured to generate camera pose data via PnP using various keypoint data relating to the set of asymmetric objects in the first stream. More specifically, the system 100 is configured to generate camera pose data via PnP using the set of 2D keypoints 426 of the second object 410 at time tj, the set of 2D keypoints 428 of the third object 412 at time tj, a prior set of 3D keypoints of the second object O2 in world coordinates at time tj−1, and a prior set of 3D keypoints of the third object O3 in world coordinates at time tj−1. The prior set of 3D keypoints of the second object O2 in world coordinates at time tj−1 and the prior set of 3D keypoints of the third object O3 in world coordinates at time tj−1 may be obtained from the memory system 120 as prior knowledge that was given or previously generated.
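For purposes of illustration, this camera PnP step may be sketched analogously to the object PnP sketch above; the names are hypothetical, and because the 3D keypoints are in world coordinates, PnP returns the world-to-camera transform GCT:

```python
import cv2
import numpy as np

def camera_pose_from_objects(points_world, keypoints_2d, K):
    """Illustrative sketch: estimate the camera pose (world-to-camera
    transform GCT) from 2D keypoints of asymmetric objects at time t_j
    and their previously recovered 3D world locations at time t_{j-1}."""
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(points_world, dtype=np.float64),
        np.asarray(keypoints_2d, dtype=np.float64),
        np.asarray(K, dtype=np.float64),
        distCoeffs=None,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T  # maps world coordinates into the current camera frame
```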
With the camera pose data, the system 100 is configured to estimate the detections for 2D keypoints of each symmetric object at time tj by projecting the prior set of 3D keypoints at time tj−1 for each symmetric object into the current image, and constructing a keypoint heatmap for each symmetric object. For example, in
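For purposes of illustration, the projection of the prior 3D keypoints may be sketched as follows; this is a pinhole-model sketch with hypothetical names, and each projected keypoint may then be rendered as a 2D Gaussian channel as in the earlier heatmap sketch:

```python
import numpy as np

def project_world_keypoints(points_world, T_cg, K):
    """Illustrative sketch: project prior 3D keypoints (world coordinates)
    into the current image to seed the prior keypoint heatmaps.

    points_world: (N, 3) keypoints in world coordinates at time t_{j-1}.
    T_cg: 4x4 world-to-camera transform (the estimated camera pose).
    K: 3x3 camera intrinsic matrix.
    """
    X_cam = points_world @ T_cg[:3, :3].T + T_cg[:3, 3]
    uv = X_cam @ K.T                   # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]      # perspective divide
```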
In addition, the system 100 is configured to generate corresponding coordinate data of the second object 410 in world coordinates at time tj using the object pose data of that second object 410 at time tj and the camera pose data of the camera at time tj. The system 100 is also configured to generate corresponding coordinate data of the third object 412 in world coordinates at time tj using the object pose data of that third object 412 at time tj and the camera pose data of the camera at time tj. Upon completing the first pass of the pipeline 400, the system 100 is configured to provide at least the camera pose data of the camera in world coordinates, coordinate data of the second object 410 in world coordinates, and coordinate data of the third object 412 in world coordinates. After handling each asymmetric object in the image 406, the system 100 is configured to perform the second pass of the pipeline 400 with the camera pose data in world coordinates.
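For purposes of illustration, composing an object's world pose from these two quantities follows the relation OGT=GCT−1 OCTpnp noted above; a one-line NumPy sketch with hypothetical names:

```python
import numpy as np

def object_pose_in_world(T_cg, T_co):
    """Illustrative sketch: lift a camera-frame object pose into world
    coordinates, i.e., OGT = GCT^{-1} * OCT.

    T_cg: 4x4 world-to-camera transform (camera pose data).
    T_co: 4x4 camera-from-object transform (object pose data from PnP).
    """
    return np.linalg.inv(T_cg) @ T_co  # world-from-object transform
```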
With respect to the second pass of the pipeline 400, the machine learning system 140 receives the second stream of one or more images of one or more objects, which are identified as symmetrical. In this case, the second stream only includes a single symmetrical object (i.e., the first object 408). The machine learning system 140 thus receives the first cropped image 420 of the first object 408 as input. In addition, the machine learning system 140 also receives the keypoint heatmap 430 as input. Accordingly, the machine learning system 140 is configured to generate 2D keypoints for the first object 408 in response to the first cropped image 420 and the keypoint heatmap 430. In this regard,
In addition, the system 100 is configured to generate corresponding coordinate data of the first object 408 in world coordinates at time tj using the object pose data of that first object 408 and the camera pose data in world coordinates at time tj. As aforementioned, in this example, the camera pose data at time tj is generated during the first pass. Upon completing the second pass of the pipeline 400, the system 100 is configured to provide at least the coordinate data of the first object 408 in world coordinates at time tj. After handling each symmetric object taken from the image 406, the system 100 is configured to handle the next image or the next frame. In this regard, the system 100 is configured to update and track 6DoF camera pose estimations in world coordinates. Also, the system 100 is configured to update and track 6DoF object pose estimations of various objects in world coordinates.
In contrast,
The control system 620 is configured to receive sensor data from the sensor system 610. The processing system 640 includes at least one processor. For example, the processing system 640 includes an electronic processor, a CPU, a GPU, a microprocessor, an FPGA, an ASIC, processing circuits, any suitable processing technology, or any number and combination thereof. Upon receiving sensor data from the sensor system 610, the processing system 640 is configured to process the sensor data to provide suitable input data, as previously described, to the semantic SLAM framework 130 and the machine learning system 140. The processing system 640, via the semantic SLAM framework 130 and the machine learning system 140, is configured to generate coordinate data for the camera and the objects in world coordinates as output data. In an example embodiment, the processing system 640 is operable to generate actuator control data based on this output data. The control system 620 is configured to control the actuator system 630 according to the actuator control data.
The memory system 660 is a computer or electronic storage system, which is configured to store and provide access to various data to enable at least the operations and functionality, as disclosed herein. The memory system 660 comprises a single device or a plurality of devices. The memory system 660 includes electrical, electronic, magnetic, optical, semiconductor, electromagnetic, any suitable memory technology, or any combination thereof. For instance, the memory system 660 may include RAM, ROM, flash memory, a disk drive, a memory card, an optical storage device, a magnetic storage device, a memory module, any suitable type of memory device, or any number and combination thereof. In an example embodiment, with respect to the control system 620 and/or processing system 640, the memory system 660 is local, remote, or a combination thereof (e.g., partly local and partly remote). For example, the memory system 660 is configured to include at least a cloud-based storage system (e.g., cloud-based database system), which is remote from the processing system 640 and/or other components of the control system 620.
The memory system 660 includes the semantic SLAM framework 130 and the trained machine learning system 140. Also, in an example, the memory system 660 includes an application program 680. In this example, the application program 680 relates to computer vision and mapping. The application program 680 is configured to ensure that the processing system 640 is configured to generate the appropriate input data for the semantic SLAM framework 130 and the machine learning system 140 based on sensor data received from the sensor system 610. In addition, the application program 680 is configured to use the coordinate data of the camera and the coordinate data of the objects in world coordinates to contribute to computer vision and/or mapping. In general, the application program 680 enables the semantic SLAM framework 130 and the trained machine learning system 140 to operate seamlessly as a part of the control system 620.
Furthermore, as shown in
As described in this disclosure, the system 100 provides a number of advantages and benefits. For example, the system 100 is configured to provide a keypoint-based object SLAM system that jointly estimates the globally-consistent object pose data and camera pose data in real time even in the presence of incorrect detections and symmetric objects. In addition, the system 100 is configured to predict and track semantic keypoints for symmetric objects, thereby providing a consistent hypothesis about the symmetry over time by exploiting the 3D pose information from SLAM. The system 100 is also configured to train the keypoint network to estimate the covariance of its predictions in such a way that the covariance quantifies the true error of the keypoints. The system 100 is configured to show that utilizing this covariance in the SLAM system significantly improves the object pose estimation accuracy.
Also, the system 100 is configured to handle keypoints of symmetric objects in an effective manner for multi-view 6DoF object pose estimation. More specifically, the system 100 uses pose estimation data of one or more asymmetric objects to improve pose estimation data of one or more symmetric objects. Compared to a prototypical keypoint-based method, the system 100 provides greater consistency in semantic keypoint detection across frames, thereby leading to more accurate final results. The system 100 focuses on providing a real-time solution and is over ten times faster than iterative methods, which are impractically slow.
Furthermore, benefiting from the prior knowledge, the machine learning system 140 is configured to predict 2D keypoints of various objects with respect to sequential frames while providing more semantic consistency for symmetric objects, such that the overall fusion of the multi-view results is more accurate. More technically, the variance of the keypoint heatmap is determined by the uncertainty of the keypoints, as estimated by the machine learning system 140 and fused from previous multi-view results. Advantageously, the machine learning system 140 is trained to predict the semantic 2D keypoints and also the uncertainty associated with these semantic 2D keypoints.
Also, the system 100 provides a configuration that advantageously includes front-end processing and back-end processing. More specifically, the front-end processing is responsible for processing the incoming frames, running the keypoint network, estimating the current camera pose, and initializing new objects. Meanwhile, the back-end processing is responsible for refining the camera and object poses for the whole scene. In this regard, the system 100 advances existing methods in handling 6DoF object pose estimations for symmetric objects. The system 100 is configured to feed keypoint detections of asymmetric objects into a SLAM system such that a new camera pose can be estimated. Given the new camera pose, the previous detections of keypoints of symmetric objects can be projected onto the current frame to assist with the keypoint detection on the current frame. Given the prior knowledge of the previously determined symmetry, the keypoint estimation results across multiple frames can be more semantically consistent. Moreover, the 6DoF pose estimations may be used in a variety of applications, such as autonomous driving, robots, security systems, manufacturing systems, and augmented reality systems, as well as a number of other technologies that are not specifically mentioned herein.
That is, the above description is intended to be illustrative, and not restrictive, and provided in the context of a particular application and its requirements. Those skilled in the art can appreciate from the foregoing description that the present invention may be implemented in a variety of forms, and that the various embodiments may be implemented alone or in combination. Therefore, while the embodiments of the present invention have been described in connection with particular examples thereof, the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the described embodiments, and the true scope of the embodiments and/or methods of the present invention are not limited to the embodiments shown and described, since various modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims. Additionally or alternatively, components and functionality may be separated or combined differently than in the manner of the various described embodiments, and may be described using different terminology. These and other variations, modifications, additions, and improvements may fall within the scope of the disclosure as defined in the claims that follow.