The device and method disclosed in this document relate to image processing and, more particularly, to dense visual SLAM (Simultaneous Localization and Mapping) methods.
Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to the prior art by inclusion in this section.
Three-dimensional (3D) position tracking is one of the fundamental techniques required to align the virtual and real world together in Augmented Reality applications, but it is also applicable to many other useful applications. Visual SLAM (Simultaneous Localization and Mapping) is the dominant technology for reconstructing the camera trajectory and the environment, especially in the absence of any prior knowledge of the environment. A visual SLAM system typically consists of a front-end and a back-end. Visual odometry is the front-end that estimates the per-frame camera pose, which is essential in Augmented Reality applications to seamlessly align virtual objects with the real world. In the back-end, camera poses are normally refined through a global optimization. Some methods also reconstruct a model of the environment, which can be further used for many other purposes, such as 3D mapping, physically-based simulation, and animation. The majority of existing visual SLAM techniques can be grouped into two categories, frame-to-frame and frame-to-model approaches, based on how the back-end map information is maintained and utilized for front-end pose estimation.
Frame-to-frame approaches are typically keyframe-based and rely on pose-pose constraints between a frame and a keyframe for pose estimation (e.g., DVO and σ-DVO). Specifically, a set of keyframes are identified during the visual odometry. For each frame associated with a keyframe, a relative camera pose is computed with respect to this keyframe. When loop closure is detected, the current keyframe can be associated with previous keyframes to create more pose-pose constraints. Considering all the frames, a pose graph can be constructed that represents the pose-pose constraints across frames. Then a pose-graph optimization can be performed to refine the camera pose of each frame. However, frame-to-frame approaches do not maintain a global representation of the scene (e.g., point cloud) and suffer from accumulated camera drift.
In the frame-to-model approaches, such as PTAM, KinectFusion, and ElasticFusion, a global map of the scene is usually maintained and updated. Popular map representations include point clouds, surfel clouds, and volumetric fields. These approaches can provide accurate models of the environment. However, camera pose estimation is typically performed in a frame-to-model fashion, where only the pose-point constraint between the global map and the current frame observation is considered. Normally, the back-end optimization is performed to only optimize the keyframe poses and the global point cloud with pose-point constraints, which means that the other frames between keyframes are never optimized. Some frame-to-model approaches do not perform any form of global optimization at all. Consequently, the accuracy of the final camera trajectory is limited. For the approaches that only maintain sparse point clouds, generating accurate dense meshes is very challenging. Moreover, approaches using dense map representations, especially volumetric representation, typically suffer from low spatial resolution and rely heavily on GPU acceleration.
RGBD dense visual SLAM approaches have shown their advantages in robustness and accuracy in recent years. However, there are still several challenges, such as sensor noise and other inconsistencies in RGBD measurements across multiple frames, that could jeopardize the accuracy of both the camera trajectory and the scene reconstruction. It would be beneficial to provide a dense visual SLAM method that properly accounts for sensor noise and other inconsistencies in sensor measurements. It would be further advantageous if the method utilized both pose-pose and pose-point constraints in an efficient manner for back-end optimization.
A method for visual simultaneous localization and mapping is disclosed. The method comprises: storing, in a memory, a first data structure having a plurality of surfels representing a 3D environment, each surfel having a 3D position, an uncertainty of the 3D position, an intensity, an uncertainty of the intensity, and a surface normal, the 3D position of each surfel being in a first coordinate space; receiving, with a processor, a first image of the 3D environment from a camera, the first image having an array of pixels, each pixel having an intensity and a depth; estimating, with the processor, a first camera pose from which the camera captured the first image based on the first image and the first data structure; and updating, with the processor, at least one surfel in the plurality of surfels of the first data structure based on the first image and the first camera pose.
A visual simultaneous localization and mapping system is disclosed. The system comprises: a camera configured to capture a plurality of images of a 3D environment, each image having an array of pixels, each pixel having an intensity and a depth; a memory configured to store a first data structure having a plurality of surfels representing the 3D environment, each surfel having a 3D position, an uncertainty of the 3D position, an intensity, an uncertainty of the intensity, and a surface normal, the 3D position of each surfel being in a first coordinate space; and a processor. The processor is configured to: receive a first image of the plurality of images of the 3D environment from the camera; estimate a first camera pose from which the camera captured the first image based on the first image and the first data structure; and update at least one surfel in the plurality of surfels of the first data structure based on the first image and the first camera pose.
The foregoing aspects and other features of the method and system are explained in the following description, taken in connection with the accompanying drawings.
For the purposes of promoting an understanding of the principles of the disclosure, reference will now be made to the embodiments illustrated in the drawings and described in the following written specification. It is understood that no limitation to the scope of the disclosure is thereby intended. It is further understood that the present disclosure includes any alterations and modifications to the illustrated embodiments and includes further applications of the principles of the disclosure as would normally occur to one skilled in the art which this disclosure pertains.
Three-Dimensional Localization and Mapping System
With reference to
The camera 20 is configured to generate images of the scene 30 having both photometric information (intensity, color, and/or brightness) and geometric information (depth and/or distance). Particularly, in at least one embodiment, the camera 20 is an RGB-D camera configured to generate RGB-D images. Each RGB-D image comprises a two-dimensional array of pixels. In at least one embodiment, each pixel includes RGB color intensity values and a depth value. However, it will be appreciated that the camera 20 may comprise a monochromatic and/or greyscale camera which captures images having pixels with only an intensity value and a depth value.
The processor 12 is configured to process the captured images of the scene 30 to perform visual odometry to estimate the position, orientation, and trajectory of the camera 20 with respect to the scene 30 across several frames. Additionally, the processor 12 is configured to generate a three-dimensional model or map representation of the scene 30.
It will be appreciated that such mapping of the scene 30 and localization of the camera 20 within the scene 30, has many useful applications such as augmented reality, three-dimensional environment mapping or object scanning, autonomous robot navigation (e.g. vehicles, vacuum cleaners, flying drones), and other similar robotics applications. Accordingly, the three-dimensional localization and mapping system 10 may take a variety of different forms such as a smartphone, a tablet computer, a handheld camera, a head mounted display, or a sensor/processing system of a robot and/or vehicle.
In some embodiments, particularly for augmented reality applications, the three-dimensional localization and mapping system 10 further includes a display screen 16. The display screen 16 may comprise any of various known types of displays, such as an LCD or OLED screen. In some embodiments, the display screen 16 may comprise a touch screen configured to receive touch inputs from a user. In augmented reality applications, the processor 12 may be configured to operate the display screen 16 to display real-time images/video captured by the camera 20 with virtual objects superimposed thereon.
Method for Dense Visual SLAM Using Probabilistic Surfel Maps
Methods for operating the augmented reality system 10 are described below. In particular, methods for dense visual SLAM are described. In the description of the methods, statements that a method is performing some task or function refer to a controller or general purpose processor executing programmed instructions stored in non-transitory computer readable storage media operatively connected to the controller or processor to manipulate data or to operate one or more components in the three-dimensional localization and mapping system 10 to perform the task or function. Particularly, the processor 12 of the three-dimensional localization and mapping system 10 above may be such a controller or processor. Alternatively, the controller or processor may be implemented with more than one processor and associated circuitry and components, each of which is configured to perform one or more tasks or functions described herein. It will be appreciated that some or all of the operations of the method can also be performed by a remote server or cloud processing infrastructure. Additionally, the steps of the methods may be performed in any feasible chronological order, regardless of the order shown in the figures or the order in which the steps are described.
With continued reference to
In conventional map representations, a scene point encodes a set of attributes representing different properties. For SLAM, typical attributes include 3D position, intensity and surface normal. However, due to various factors such as sensor measurement noise, occlusion and change of lighting conditions, there are usually mismatches across multiple observations of a single scene point. Those mismatches are the major sources of the accumulated drift in visual odometry.
To reduce the camera drift and construct a more accurate and consistent map, the PSM data structure disclosed herein encodes both the attributes as well as their uncertainties. Mathematically, we denote each point i in the PSM according to the equation:

{pi, Σpi, Ii, σIi, ni},

which represents the (accumulated) measurement of a scene point. It includes a 3D position pi ∈ ℝ3, a positional uncertainty represented via a 3×3 covariance matrix Σpi, an intensity Ii, an intensity uncertainty σIi, and a 3D normal ni ∈ ℝ3. In at least one embodiment, the distribution of the 3D positional uncertainty Σpi is modeled with a three-dimensional Gaussian and the distribution of the intensity uncertainty σIi is modeled with a one-dimensional Gaussian, independently.
The PSM data structure is utilized in both the front-end visual odometry and the back-end map construction. Particularly, as will be discussed in greater detail below, the memory 14 is configured to store a global PSM 62 and a keyframe PSM 64. The global PSM 62 is a map representation of all portions of the scene 30 that have been captured in the images from the camera 20 and is used during the back-end optimization to provide pose-point constraints that are properly weighted through the encoded uncertainties. In contrast, the keyframe PSM 64 is a map representation of only a portion of the scene 30 that was captured in a most recent keyframe Kj and is used during visual odometry to provide a more accurate reference for relative camera pose estimation. As used herein, separate notations are used to denote the global PSM 62 and the keyframe PSM 64, respectively.
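For illustration only, the following is a minimal sketch of one way a PSM point and the two PSM instances might be represented in code; the class and field names are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class SurfelPoint:
    """One point of a Probabilistic Surfel Map (PSM): attributes plus their uncertainties."""
    position: np.ndarray      # 3D position p_i in the PSM (world) coordinate space
    position_cov: np.ndarray  # 3x3 covariance matrix modeling the positional uncertainty
    intensity: float          # accumulated intensity I_i
    intensity_var: float      # variance of the one-dimensional intensity Gaussian
    normal: np.ndarray        # 3D surface normal n_i


@dataclass
class ProbabilisticSurfelMap:
    """A collection of surfel points; used for both the global PSM and the keyframe PSM."""
    points: List[SurfelPoint] = field(default_factory=list)
```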
The method 100 begins with a step of receiving a current image frame from a camera (block 115). Particularly, with respect to the embodiments described in detail herein, the processor 12 of the three-dimensional localization and mapping system 10 is configured to execute instructions of the visual SLAM program 60 to receive an image frame from the camera 20. Particularly, in at least one embodiment, the processor 12 is configured to receive an RGB-D image frame from an RGB-D camera 20. As discussed above, each RGB-D image frame comprises a two-dimensional array of pixels. In at least one embodiment, each pixel includes RGB color values and a depth value.
The method 100 continues with a step of estimating a camera pose for the current frame using the keyframe PSM (block 120). Particularly, with respect to the embodiments described in detail herein, the processor 12 is configured to execute instructions of the visual SLAM program 60 to estimate a relative camera pose between the current RGB-D image frame and the keyframe PSM 64. More particularly, the processor 12 is configured to calculate the optimal relative camera pose that minimizes photometric and geometric errors between the points i of the keyframe PSM 64 and measurements of the current RGB-D image frame, when the points i of the keyframe PSM 64 are transformed into the camera coordinate space of the current RGB-D image frame using the relative camera pose.
Particularly, each RGB-D pixel of the current RGB-D image frame can be treated as a measurement and is modeled in a similar way as a point in the PSM data structure. That is, each measurement is modeled as two Gaussian distributions and one normal, denoted as {pi, Σpi, Ii, σIi, ni}.
In at least one embodiment, the processor 12 is configured to compute the positional uncertainty Σpi with a sensor noise model that models the uncertainty. For example, zero-mean multivariate Gaussian distributions can be used for a structured-light RGB-D camera 20. In many embodiments, the RGB-D camera 20 emits infra-red patterns and recovers depth from correspondences between two image views with a small parallax. During this process, the disparity is quantized into sub-pixels, which introduces a quantization error in the depth measurement. The noise due to quantization error is defined as:
where qpix is the sub-pixel resolution of the device, b is the baseline, and f is the focal length.
This error increases quadratically with range Zi, thus preventing the use of depth observations from far objects. The 3D sensor noise of RGB-D cameras can be modeled with a zero-mean multivariate Gaussian distribution whose covariance matrix has the following as diagonal components:
where σ33² is directed along the ray, and βx and βy denote the angular resolutions in the x and y directions. Note that the error in the ray direction increases quadratically.
Therefore, the positional uncertainty Σpi for each 3D point pi can be expressed as:
Σpi = Rray diag(σ11², σ22², σ33²) RrayT    (4),
where Rray denotes the rotation matrix between the ray and camera coordinates, and RrayT is the transpose of Rray.
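As a purely numerical illustration of how a covariance of the form of equation (4) could be assembled, the sketch below combines the quadratic ray-direction noise and range-proportional lateral noise described above; all constants (sub-pixel resolution, baseline, focal length, angular resolutions) are assumed placeholder values, not values from the disclosure.

```python
import numpy as np


def positional_covariance(p_cam, q_pix=0.125, b=0.075, f=580.0,
                          beta_x=0.0017, beta_y=0.0017):
    """Approximate positional covariance for a 3D point in camera coordinates.

    The component along the viewing ray grows quadratically with range (quantization
    error); the lateral components scale linearly with range and the angular
    resolutions.  All numeric defaults are illustrative assumptions.
    """
    Z = np.linalg.norm(p_cam)                # range along the viewing ray
    sigma_33 = (q_pix / (b * f)) * Z ** 2    # axial (ray-direction) standard deviation
    sigma_11 = beta_x * Z                    # lateral standard deviation in x
    sigma_22 = beta_y * Z                    # lateral standard deviation in y

    # Rotation R_ray aligning the camera z-axis with the viewing ray (Rodrigues formula).
    ray = p_cam / Z
    z_axis = np.array([0.0, 0.0, 1.0])
    v = np.cross(z_axis, ray)
    c = float(np.dot(z_axis, ray))
    if np.linalg.norm(v) < 1e-9:
        R_ray = np.eye(3)                    # ray already (anti)parallel to the z-axis
    else:
        vx = np.array([[0.0, -v[2], v[1]],
                       [v[2], 0.0, -v[0]],
                       [-v[1], v[0], 0.0]])
        R_ray = np.eye(3) + vx + vx @ vx * (1.0 / (1.0 + c))

    D = np.diag([sigma_11 ** 2, sigma_22 ** 2, sigma_33 ** 2])
    return R_ray @ D @ R_ray.T               # equation (4): R_ray diag(...) R_ray^T
```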
A camera pose is a rigid body transformation in the special Euclidean group SE(3). Thus, the relative camera pose between the current RGB-D image frame and the keyframe PSM 64 can be expressed as a 4×4 homogeneous transformation matrix T, which includes a 3×3 rotation matrix R ∈ SO(3) and a 3D translation vector t ∈ ℝ3. Given a point i in the keyframe PSM 64 and a camera pose T corresponding to a certain frame, we can transform the 3D position pi of the point i from the PSM coordinate space to the camera coordinate space as p′i: p′i = g(pi; T) = Rpi + t. The inverse transformation T−1 transforms p′i from the camera coordinate space back to the PSM coordinate space. Generally, for pose optimization, the transformation T is represented as a six dimensional twist vector ξ given by the Lie algebra se(3). The transformation T and the twist ξ are used interchangeably to denote a camera pose. As used herein, the term “PSM coordinate space” (which may also be referred to herein as the “world coordinate space” or “world space”) refers to the 3D coordinate space of the keyframe PSM 64 and the global PSM 62.
Projection of a point p′i in the camera coordinate space onto the 2D image space of an RGB-D image frame is denoted herein by the projection function xi = π(p′i). Similarly, back-projection of a 2D point xi with depth value Z(xi) into the camera coordinate space is denoted herein as p′i = π−1(xi·Z(xi)). For a projected 2D point xi, the corresponding intensity and depth measurements based on the respective RGB-D image are denoted herein as I(xi) and Z(xi), respectively, and may be determined as an interpolation of neighboring pixels in the RGB-D image.
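For concreteness, the following sketch implements the rigid transformation g, the projection π, and the back-projection π−1 for an assumed pinhole camera model; the intrinsic parameters are placeholders and are not part of the disclosure.

```python
import numpy as np

# Assumed pinhole intrinsics (focal lengths and principal point); placeholder values only.
FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5


def transform(p, R, t):
    """g(p; T) = R p + t: map a 3D point from the PSM (world) space into camera space."""
    return R @ p + t


def project(p_cam):
    """pi(p'): project a 3D point in camera coordinates onto the 2D image plane."""
    return np.array([FX * p_cam[0] / p_cam[2] + CX,
                     FY * p_cam[1] / p_cam[2] + CY])


def back_project(x, depth):
    """pi^-1(x * Z(x)): lift a pixel x with depth Z(x) back into camera coordinates."""
    return np.array([(x[0] - CX) * depth / FX,
                     (x[1] - CY) * depth / FY,
                     depth])
```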
As mentioned above, the processor 12 is configured to estimate a relative camera pose ξ between the current RGB-D image frame and the keyframe PSM 64. Particularly, the processor 12 is configured to calculate the optimal relative camera pose that minimizes photometric and geometric errors between the points i of the keyframe PSM 64 and the measurements of the current RGB-D image frame, when the points i of the keyframe PSM 64 are transformed into the camera coordinate space of the current RGB-D image frame using the relative camera pose. Given the camera projection model and rigid body motion, the photometric and geometric errors of each point are defined by a residual error term ri according to the equation:
where riI is the photometric error and riZ is the geometric error, each evaluated at the 2D point obtained by projecting the respective point into the current RGB-D image frame with the relative camera pose. However, with the keyframe PSM 64, the residual term of equation (5) can be re-defined with the following changes:

Īi = Ii    (6),

where π(g(pi,ξ)) represents a projection of a respective point i in the keyframe PSM 64 onto the current RGB-D image frame with the relative camera pose ξ, Ii denotes the photometric intensity of the respective point i, pi denotes the 3D position of the respective point i in the PSM coordinate space, and [g(pi,ξ)]z denotes the z component of the respective point i transformed into the camera coordinate space.
Using the notations described above, the calculation of the optimal relative camera pose ξ can be expressed mathematically according to the energy function:
ξ̂ = arg minξ Σi=1n riT Wi ri    (7),

where n is the total number of points i or the number of reliable points i in the keyframe PSM 64, riT is the transpose of the residual error term ri, and Wi is a 2×2 weighting matrix which weights the importance of each residual term. In at least one embodiment, the optimization of this energy function can be solved through an image-warping optimization framework, such as the Lucas-Kanade algorithm in its inverse-compositional formulation.
As the energy function of equation (7) is non-linear with respect to the camera pose ξ, the least squares optimization is sensitive to outliers. So the matrix Wi plays a vital role in rejecting/under-weighting outliers to achieve robust camera pose estimation. In at least one embodiment, the distribution of photometric residuals is modeled as being close to a zero-mean student-t distribution with five degrees of freedom (i.e., ν=5):
In contrast, in one embodiment, the sensor noise model described above is used to model the distribution of geometric residuals. Thus, in at least one embodiment, a hybrid weighting strategy is used, which applies student-t distribution for the photometric residuals and the sensor noise model for the geometric residuals. Such a hybrid weighting advantageously achieves a significant boost in robustness and accuracy. In the hybrid weighting strategy, the weight matrix Wi is modeled as
In equations (9)-(12), ωiI is a weight based on the intensity student-t distribution, while ωiZ is propagated from the sensor noise model via a first-order Taylor expansion. Σpi and Σp′i are the covariances of the 3D point pi and of its corresponding point p′i in the current image. Λ here is a normalizer of all the weights in Wi. By combining the equations (6)-(12) with equation (7), the calculation of the optimal relative camera pose with the hybrid approach can be described as:
ξ̂Hybrid = arg minξ Σi=1n riT Wi ri    (13),

where Wi is the hybrid weight matrix given by equations (9)-(12).
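To make the structure of this weighted least-squares problem concrete, the sketch below shows an illustrative student-t re-weighting function and a single Gauss-Newton step over stacked per-point residuals. The computation of the 2×6 Jacobians and of the sensor-noise-based geometric weights is omitted, and the function names are hypothetical; this is not the patented optimizer.

```python
import numpy as np


def student_t_weight(residual, sigma, nu=5.0):
    """Iteratively re-weighted least-squares weight for a photometric residual under a
    zero-mean student-t distribution with nu = 5 degrees of freedom and scale sigma."""
    return (nu + 1.0) / (nu + (residual / sigma) ** 2)


def gauss_newton_step(residuals, jacobians, weights):
    """One Gauss-Newton update of the 6-DoF twist for a weighted least-squares
    objective of the form of equations (7) and (13).

    residuals: list of 2-vectors r_i, jacobians: list of 2x6 matrices J_i,
    weights: list of 2x2 matrices W_i.  Solves
        (sum_i J_i^T W_i J_i) * delta = -(sum_i J_i^T W_i r_i).
    """
    H = np.zeros((6, 6))
    g = np.zeros(6)
    for r, J, W in zip(residuals, jacobians, weights):
        H += J.T @ W @ J
        g -= J.T @ W @ r
    return np.linalg.solve(H, g)
```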
The method 100 continues with a step of updating the keyframe PSM using the current frame (block 125). Particularly, with respect to the embodiments described in detail herein, once the relative camera pose ξ for the current RGB-D image frame is calculated, the processor 12 is configured to execute instructions of the visual SLAM program 60 to update the keyframe PSM 64 with the observation points of the current RGB-D image frame. More particularly, once the relative camera pose ξ for the current RGB-D image frame is calculated, the processor 12 is configured to update the 3D positions pi, the positional uncertainties Σpi, the intensities Ii, and the intensity uncertainties σIi of each point in the keyframe PSM 64 having an associated point in the current RGB-D image frame.
As discussed above, due to various factors including sensor noise, changes of lighting conditions, etc., the sensor measurements in any particular RGB-D image frame are considered to be biased observations modeled with uncertainties. However, over several frames, the processor 12 is configured to update the uncertainties via a fusion of their respective probabilistic density functions. Through the gradual update of the keyframe PSM 64 during visual odometry, the method 100 builds a consistent and reliable model of the environment.
Once the relative camera pose ξ for the current RGB-D image frame is calculated, the processor 12 is configured to update the keyframe PSM 64 with the observations from the current RGB-D image frame. First, the processor 12 is configured to calculate a plurality of observation points {pi, Σpi, Ii, σIi, ni} from the measurements of the current RGB-D image frame according to the equations:
p′i = π−1(xi·Z(xi))
pi = g(p′i, T−1)
Σpi = RT Σp′i R
Ii = I(xi)
σIi = σI(xi)
ni = g((pxni′ − p′i) × (pyni′ − p′i), T−1)    (14),

where p′i is the 3D position of the respective observation point in the camera coordinate space, pi is the 3D position of the respective observation point in the PSM coordinate space, Σpi is the positional uncertainty of the respective observation point transformed into the PSM coordinate space, Ii is the intensity of the respective observation point, σIi is the intensity uncertainty of the respective observation point, ni is the normal of the respective observation point transformed into the PSM coordinate space, and pxni′ and pyni′ are the back-projected neighboring points of p′i in the x and y directions.
It will be appreciated that each of these observation points is associated with a corresponding point i in the keyframe PSM 64. Therefore, given a new observation point, the processor 12 is configured to update the corresponding point i in the keyframe PSM 64 with a fusion of two Gaussian distributions, fusing the position Gaussian (pi, Σpi) and the intensity Gaussian (Ii, σIi) of the point i with those of the observation point.
We denote this update operation with the symbol ⊕. With the Gaussian distribution modeling, the update operation can be described in more detail as follows:
where pi is the updated 3D position of the respective point i in the keyframe PSM 64, Σpi is the updated positional uncertainty of the respective point i, Ii is the updated intensity of the respective point i, σIi is the updated intensity uncertainty of the respective point i in the keyframe PSM 64, and ni is the normal of the respective point i in the keyframe PSM 64.
By updating the keyframe PSM 64 after each image frame using the above operation, the uncertainties σIi and Σpi of the points i in the keyframe PSM 64 are gradually reduced as additional observations are fused, so that the keyframe PSM 64 becomes an increasingly reliable reference for camera pose estimation.
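The ⊕ update can be illustrated with the standard fusion of two Gaussian estimates shown below; this is a generic product-of-Gaussians sketch offered for intuition only and is not asserted to be identical to equations (15)-(17).

```python
import numpy as np


def fuse_position(p, cov, p_obs, cov_obs):
    """Fuse the position Gaussian of a PSM point with that of a new observation.
    Generic Kalman-style (product-of-Gaussians) update, shown for illustration only."""
    K = cov @ np.linalg.inv(cov + cov_obs)   # gain: the less uncertain estimate dominates
    p_new = p + K @ (p_obs - p)
    cov_new = (np.eye(3) - K) @ cov
    return p_new, cov_new


def fuse_intensity(I, var, I_obs, var_obs):
    """Fuse the scalar intensity Gaussian of a PSM point with that of a new observation."""
    w = var / (var + var_obs)
    return I + w * (I_obs - I), (1.0 - w) * var
```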
The steps 115, 120, and 125 collectively comprise the front-end 105 of the method 100, and are performed for each RGB-D image frame received by the processor 12 from the RGB-D camera 20. Accordingly, the keyframe PSM 64 is continually updated and refined until a new keyframe is selected, thereby improving the visual odometry performed with respect to each intermediate RGB-D image frame. Below we describe how the PSM data structure, in particular the global PSM 62, is utilized in the back-end 110 to improve performance in keyframe selection and global optimization.
The method 100 continues with a step of determining whether the current frame should be selected as a new keyframe (block 130). Particularly, with respect to the embodiments described in detail herein, the processor 12 is configured to execute instructions of the visual SLAM program 60 to determine whether the current RGB-D image frame should be treated as a new keyframe Kj+1 based on at least one criterion.
In one embodiment, the processor 12 is configured to select the current RGB-D image frame to be a new keyframe Kj+1 in response to the current RGB-D image frame being captured more than a predetermined threshold amount of time since the most recent keyframe Kj was captured.
In one embodiment, the processor 12 is configured to select the current RGB-D image frame to be a new keyframe Kj+1 in response to the calculated relative camera pose ξ for the current RGB-D image frame being greater than a predetermined spatial distance from the camera pose ξKj of the most recent keyframe Kj.
In one embodiment, the processor 12 is configured to select the current RGB-D image frame to be a new keyframe Kj+1 in response to the entropy of the calculated relative camera pose between the most recent keyframe Kj and the current RGB-D image frame being lower than a predetermined threshold ratio of the entropy of the camera pose between the most recent keyframe Kj and its adjacent frame.
In one embodiment, the processor 12 is configured to select the current RGB-D image frame to be a new keyframe Kj+1 in response to a number of points in the keyframe PSM 64 being active points and/or visible points in the current RGB-D image frame being lower than a predetermined threshold.
In one embodiment, the processor 12 is configured to select the current RGB-D image frame to be a new keyframe Kj+1 in response to an average uncertainty of intensity or an average uncertainty of depth within the keyframe PSM 64 of the current keyframe Kj being higher than a respective predetermined threshold, which indicates a significant mismatch between the map and the current frame.
We note that if the current RGB-D image frame is a first image frame and no keyframes have yet been selected, then the processor 12 is configured to automatically select the current RGB-D image frame as a new keyframe Kj+1.
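The keyframe selection criteria described above can be combined into a simple predicate such as the illustrative sketch below; all threshold names and values are hypothetical tuning parameters, and a given embodiment may use any subset of these criteria.

```python
def should_select_keyframe(elapsed_time, pose_distance, entropy_ratio,
                           visible_point_count, mean_intensity_var, mean_depth_var,
                           thresholds):
    """Return True if any of the keyframe criteria described above fires.
    `thresholds` is a hypothetical dictionary of tuning parameters."""
    return (elapsed_time > thresholds["max_time"]                      # time since last keyframe
            or pose_distance > thresholds["max_pose_distance"]         # spatial distance of poses
            or entropy_ratio < thresholds["min_entropy_ratio"]         # entropy ratio test
            or visible_point_count < thresholds["min_visible_points"]  # PSM points still visible
            or mean_intensity_var > thresholds["max_intensity_var"]    # average intensity uncertainty
            or mean_depth_var > thresholds["max_depth_var"])           # average depth uncertainty
```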
If a new keyframe is selected, the method 100 continues with a step of updating the global PSM based on the keyframe PSM (block 135). Particularly, with respect to the embodiments described in detail herein, once a new keyframe Kj+1 is selected, the processor 12 is configured to execute instructions of the visual SLAM program 60 to update the global PSM 62 with the points i of the keyframe PSM 64 of the previous keyframe Kj. More particularly, once a new keyframe Kj+1 is selected, the processor 12 is configured to update the 3D positions pi, the positional uncertainties Σpi, the intensities Ii, and the intensity uncertainties σIi of each point i in the global PSM 62 having an associated point i in the keyframe PSM 64. Additionally, in some embodiments, the processor 12 is configured to add any new points i of the keyframe PSM 64 that do not have an associated point i in the global PSM 62 to the global PSM 62.
The processor 12 is configured to update the global PSM 62 with the points of the keyframe PSM 64 in a similar manner as described above with respect to block 125 of the method 100. Particularly, the processor 12 is configured to update the attributes and the uncertainties of each point i in the global PSM 62 having an associated point i in the keyframe PSM 64 using the equations (15)-(17), in the manner discussed above.
The method 100 continues with a step of generating a new keyframe PSM for the new keyframe (block 140). Particularly, with respect to the embodiments described in detail herein, the processor 12 is configured to execute instructions of the visual SLAM program 60 to generate a new keyframe PSM 64 based on the global PSM and any new observations in the new keyframe Kj+1.
During visual odometry, the keyframe PSM 64 typically covers only part of the scene. With a new keyframe Kj+1 identified, it is very likely that there will be new observations of the scene that do not yet exist in the global PSM 62. The processor 12 is configured to generate the new keyframe PSM 64 for the new keyframe Kj+1 by combining the existing parts of the global PSM 62 that are visible in the new keyframe Kj+1 with the new observations from the new keyframe Kj+1. As shown in
In at least one embodiment, the processor 12 is configured to prune and/or remove the points i of the new keyframe PSM 64 having an excessively large positional uncertainty Σpi.
Returning to
In addition to returning to the processes of the front-end 105, the method 100 continues in the back-end 110 with a step of detecting whether there is a loop closure (block 145) and, if a loop closure is detected, with steps of adding pose-pose constraints (block 150) and performing local pose-graph optimization (block 155). Particularly, with respect to the embodiments described in detail herein, the processor 12 is configured to execute instructions of the visual SLAM program 60 to detect whether the newly selected keyframe Kj+1 comprises a loop-closure with respect to a previous keyframe K0, . . . , Kj and, if loop closure is detected, generate pose-pose constraints based on the detected loop-closure, and perform local pose-graph optimization based on the pose-pose constraints.
Particularly, in the back-end 110, the processor 12 is configured to maintain and/or store in the memory 14 a pose-graph having nodes and edges. As used herein, a “pose-graph” refers to a data structure having a plurality of nodes which define values or measurements and a plurality of edges connecting pairs of nodes which define a relationship between the values or measurements of the connected nodes.
As new keyframes Kj are selected, the processor 12 is configured to add their keyframe camera poses ξKj to the pose-graph as nodes, with edges between successive keyframes representing the relative transformations between them (pose-pose constraints).
To correct for the accumulated error in the trajectory, the processor 12 is configured to search for loop-closures to previously visited keyframes. As used herein, a loop-closure refers to when a keyframe pose is determined to be the same as a previous keyframe pose or within a predetermined closeness to a previous keyframe pose. To identify candidate previous keyframes that might form a loop closure with the new keyframe, the processor 12 is configured to search nearest neighbor keyframe poses in a kd-tree. In one embodiment, in order to validate a loop-closure, the processor 12 uses the entropy ratio test discussed above for keyframe selection. However, instead of the entropy of the transformation of the first frame to the keyframe, the processor 12 is configured to use the average entropy of all successful matches of intermediate frames to the keyframe. If a valid loop closure is detected, the processor 12 adds a new edge (pose-pose constraint) to the pose-graph. As shown in
Afterwards, the processor 12 is configured to correct errors in the keyframe poses ξKj by performing a local pose-graph optimization based on the pose-pose constraints of the pose-graph.
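As an illustration of the candidate search only (the entropy-based validation and the pose-graph update are not reproduced here), the sketch below queries a kd-tree of keyframe camera positions; the neighbor count and radius are assumed tuning parameters.

```python
import numpy as np
from scipy.spatial import cKDTree


def loop_closure_candidates(keyframe_positions, new_position, k=5, radius=2.0):
    """Find previously visited keyframes whose camera positions lie near the new
    keyframe, as candidates for loop-closure validation via the entropy ratio test.
    `k` and `radius` are illustrative tuning parameters."""
    if len(keyframe_positions) == 0:
        return []
    tree = cKDTree(np.asarray(keyframe_positions))
    dists, idxs = tree.query(np.asarray(new_position),
                             k=min(k, len(keyframe_positions)))
    dists, idxs = np.atleast_1d(dists), np.atleast_1d(idxs)
    return [int(i) for d, i in zip(dists, idxs) if d < radius]
```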
Returning to
If global optimization is to be performed, the method 100 continues with a step of iteratively alternating between pose-pose optimization and pose-point optimization (block 165). Particularly, with respect to the embodiments described in detail herein, the processor 12 is configured to execute instructions of the visual SLAM program 60 to add pose-point constraints to the pose-graph and perform global optimization based on the pose-point constraints and the pose-pose constraints.
Particularly, as discussed above, the processor 12 is configured to maintain and/or store in the memory 14 a pose-graph having a plurality of nodes which define values or measurements and a plurality of edges connecting pairs of nodes which define a relationship between the values or measurements of the connected nodes. As described above, a plurality of keyframe camera poses ξKj have already been added to the pose-graph as nodes, connected by edges representing pose-pose constraints.
When global optimization is performed, the global PSM 62 provides pose-point constraints that are properly weighted through the encoded uncertainties, which are used to improve both trajectory and map accuracy. Particularly, the processor 12 is configured to add points i from the global PSM 62 to the pose-graph and add pose-point constraints indicating the consistencies and/or differences between the PSM points i and their local observations from a transformation/projection to the camera coordinate space of each keyframe Kj in which they are visible.
For the pose-point constraints, the measurements from a PSM point i are the intensity Ii and the depth value of the point's projection in the keyframe, [g(pi,ξ)]z; the local measurements are the intensity and depth of the keyframe image at the projected pixel π(g(pi,ξ)). The error function for each pose-point constraint thus takes the same form as the residual term described above with respect to the front-end, with

Īi = Ii    (18).
Note that, in the equations (18), ξ represents the transformation from the PSM coordinate space of the global PSM 62 to the camera coordinate space of the corresponding keyframe Kj instead of a relative camera pose. By minimizing the error function, both camera poses and 3D positions of the points can be optimized.
Given the complete pose-graph having poses, points, pose-pose constraints, and pose-point constraints, the processor 12 is configured to correct errors in and/or optimize the keyframe poses ξKj and the 3D positions pi of the points of the global PSM 62.
However, the error function for pose-point constraints is non-linear and sensitive to noise and outliers. Accordingly, in at least one embodiment, the processor 12 is configured to perform an iterative optimization process which alternates between optimizing only the pose-pose constraints and optimizing the pose-point constraints. This alternating strategy is summarized in
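The alternation itself can be sketched as follows; the two optimization routines are passed in as callables because the disclosure does not tie the method to a particular solver, and the iteration count shown is an assumption.

```python
def alternating_global_optimization(pose_graph, optimize_pose_pose, optimize_pose_point,
                                    n_rounds=3):
    """Alternate between optimizing only the pose-pose constraints (keyframe poses) and
    optimizing the pose-point constraints (keyframe poses and PSM point positions).
    The two optimizer callables stand in for the underlying graph-optimization back-end,
    and `n_rounds` is an assumed iteration count."""
    for _ in range(n_rounds):
        optimize_pose_pose(pose_graph)   # refine keyframe poses using pose-pose edges only
        optimize_pose_point(pose_graph)  # refine poses and 3D points using pose-point edges
    return pose_graph
```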
Returning to
A common task for visual SLAM is to reconstruct the environment and generate a dense mesh as an output. In conventional dense visual SLAM approaches, the map is represented either in a volumetric form, which suffers from limited spatial resolution, or with a 3D point cloud or surfels.
If dense mesh reconstruction is requested, the processor 12 is configured to generate a consistent dense mesh by deforming the depth map of each keyframe Kj towards the global PSM 62 and fusing the deformed depth maps together to generate the dense mesh. More particularly, for each keyframe Kj, the processor 12 is configured to project the points that have registered pose-point constraints with the keyframe to the 2D image coordinate space. Each projected point x̂i has a depth value [g(pi,ξ)]z and an associated depth uncertainty σZi obtained by transforming the positional uncertainty Σpi into the camera coordinate space via RΣpiRT. The depth map of the keyframe Kj is then deformed towards these projected points in accordance with their respective uncertainties.
Next, the processor 12 fuses the deformed depth maps together to generate the dense mesh. Using this approach, the method 100 is able to produce a consistent dense mesh when requested, while also saving memory and computation during visual SLAM.
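A highly simplified sketch of the per-keyframe deformation step is given below: PSM points with pose-point constraints are projected into the keyframe and the stored depth is blended towards them with inverse-variance weights. The intrinsics, the constant sensor variance, and the nearest-pixel handling are all assumptions made only for illustration and do not reproduce the actual deformation scheme.

```python
import numpy as np


def deform_depth_map(depth, points_cam, depth_vars,
                     fx=525.0, fy=525.0, cx=319.5, cy=239.5, sensor_var=1e-3):
    """Blend a keyframe depth map towards the depths of projected PSM points
    using inverse-variance weights.  All numeric parameters are placeholders."""
    deformed = depth.copy()
    for p_cam, var in zip(points_cam, depth_vars):
        u = int(round(fx * p_cam[0] / p_cam[2] + cx))
        v = int(round(fy * p_cam[1] / p_cam[2] + cy))
        if 0 <= v < depth.shape[0] and 0 <= u < depth.shape[1] and depth[v, u] > 0:
            w = (1.0 / var) / (1.0 / var + 1.0 / sensor_var)  # weight of the PSM depth
            deformed[v, u] = w * p_cam[2] + (1.0 - w) * depth[v, u]
    return deformed
```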
While the disclosure has been illustrated and described in detail in the drawings and foregoing description, the same should be considered as illustrative and not restrictive in character. It is understood that only the preferred embodiments have been presented and that all changes, modifications and further applications that come within the spirit of the disclosure are desired to be protected.
This application claims the benefit of priority of U.S. provisional application Ser. No. 62/535,038, filed on Jul. 20, 2017, the disclosure of which is herein incorporated by reference in its entirety.
References Cited

Lin, B., "Visual SLAM and Surface Reconstruction for Abdominal Minimally Invasive Surgery," Graduate Theses and Dissertations, 2015.
Gauglitz, S. et al., "Live Tracking and Mapping from Both General and Rotation-Only Camera Motion," IEEE International Symposium on Mixed and Augmented Reality, pp. 13-22, 2012 (10 pages).
Kähler, O. et al., "Real-Time Large-Scale Dense 3D Reconstruction with Loop Closure," ECCV 2016, Part VIII, LNCS 9912, pp. 500-516, 2016 (17 pages).
Schöps, T. et al., "Semi-Dense Visual Odometry for AR on a Smartphone," IEEE International Symposium on Mixed and Augmented Reality, pp. 145-150, 2014 (6 pages).
Nistér, D. et al., "Visual Odometry," Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04), 2004 (8 pages).
Babu, B. W. et al., "σ-DVO: Sensor Noise Model Meets Dense Visual Odometry," In Mixed and Augmented Reality (ISMAR), 2016 IEEE International Symposium on, IEEE, 2016 (9 pages).
Kerl, C. et al., "Dense Visual SLAM for RGB-D Cameras," In Proc. of the Int. Conf. on Intelligent Robot Systems (IROS), 2013 (7 pages).
Kerl, C. et al., "Robust Odometry Estimation for RGB-D Cameras," In Int. Conf. on Robotics and Automation, 2013 (8 pages).
Kümmerle, R. et al., "g2o: A General Framework for Graph Optimization," In Robotics and Automation (ICRA), 2011 IEEE International Conference on, 2011 (7 pages).
Baker, S. et al., "Lucas-Kanade 20 Years on: A Unifying Framework," International Journal of Computer Vision 56 (3), 221-255, 2004 (35 pages).
Dai, A. et al., "BundleFusion: Real-time Globally Consistent 3D Reconstruction using On-the-fly Surface Re-integration," 2017, accessed and retrieved Jul. 16, 2018 from internet: https://arxiv.org/abs/1604.01093 (19 pages).
Endres, F. et al., "An Evaluation of the RGB-D SLAM System," in Proceedings of IEEE International Conference on Robotics and Automation, 2012 (6 pages).
Engel, J. et al., "Direct Sparse Odometry," in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017 (14 pages).
Engel, J. et al., "LSD-SLAM: Large-Scale Direct Monocular SLAM," In European Conference on Computer Vision, 2014 (16 pages).
Forster, C. et al., "SVO: Fast Semi-Direct Monocular Visual Odometry," In IEEE International Conference on Robotics and Automation (ICRA), 2014 (8 pages).
Handa, A. et al., "A Benchmark for RGB-D Visual Odometry, 3D Reconstruction and SLAM," In IEEE International Conference on Robotics and Automation, ICRA, 2014 (8 pages).
Keller, M. et al., "Real-time 3D Reconstruction in Dynamic Scenes using Point-based Fusion," 2013 International Conference on 3D Vision (3DV), 2013 (8 pages).
Klein, G. et al., "Parallel Tracking and Mapping for Small AR Workspaces," 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, 2007 (10 pages).
Konolige, K. et al., "Outdoor Mapping and Navigation using Stereo Vision," in Experimental Robotics, 2008 (12 pages).
Leonard, J. J. et al., "Simultaneous Map Building and Localization for an Autonomous Mobile Robot," IEEE International Workshop on Intelligent Robots and Systems, IROS '91, 1991 (6 pages).
Liu, H. et al., "Robust Keyframe-based Monocular SLAM for Augmented Reality," 2016 IEEE International Symposium on Mixed and Augmented Reality, 2016 (10 pages).
Mur-Artal, R. et al., "ORB-SLAM: a Versatile and Accurate Monocular SLAM System," IEEE Transactions on Robotics, vol. 31, Issue 5, 2015 (17 pages).
Newcombe, R. A. et al., "Live Dense Reconstruction with a Single Moving Camera," In Computer Vision and Pattern Recognition (CVPR), 2010 (8 pages).
Newcombe, R. A. et al., "KinectFusion: Real-Time Dense Surface Mapping and Tracking," in IEEE International Symposium on Mixed and Augmented Reality, 2011 (10 pages).
Newcombe, R. A. et al., "DTAM: Dense Tracking and Mapping in Real-Time," in IEEE International Conference on Computer Vision, 2011 (8 pages).
Pradeep, V. et al., "MonoFusion: Real-time 3D Reconstruction of Small Scenes with a Single Web Camera," in IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2013 (6 pages).
Strasdat, H. et al., "Double Window Optimisation for Constant Time Visual SLAM," in 2011 International Conference on Computer Vision, 2011 (8 pages).
Stückler, J. et al., "Multi-Resolution Surfel Maps for Efficient Dense 3D Modeling and Tracking," Journal of Visual Communication and Image Representation, vol. 25, Issue 1, 2014 (30 pages).
Sturm, J. et al., "A Benchmark for the Evaluation of RGB-D SLAM Systems," in IEEE International Conference on Intelligent Robots and Systems (IROS), 2012 (8 pages).
Sturm, P. et al., "A Factorization Based Algorithm for Multi-Image Projective Structure and Motion," in 4th European Conference on Computer Vision, 1996 (10 pages).
Tykkälä, T. et al., "Direct Iterative Closest Point for Real-time Visual Odometry," 2011 International Conference on Computer Vision Workshops (ICCV Workshops), 2011 (7 pages).
Whelan, T. et al., "Kintinuous: Spatially Extended KinectFusion," Computer Science and Artificial Intelligence Laboratory, Technical Report, Massachusetts Institute of Technology, 2012 (9 pages).
Whelan, T. et al., "Real-time large scale dense RGB-D SLAM with volumetric fusion," International Journal of Robotics Research, vol. 34, Issue 4-5, 2014 (29 pages).
Whelan, T. et al., "ElasticFusion: Dense SLAM Without a Pose Graph," in proceedings of Robotics: Science and Systems, 2015 (9 pages).
Whelan, T. et al., "ElasticFusion: Real-time dense SLAM and light source estimation," The International Journal of Robotics Research, vol. 35, No. 14, pp. 1697-1716, 2016 (20 pages).
Yan, Z. et al., "Dense Visual SLAM with Probabilistic Surfel Map," IEEE Transactions on Visualization and Computer Graphics, vol. 23, No. 11, pp. 2389-2398, 2017 (10 pages).
International Search Report corresponding to International Patent Application No. PCT/EP2018/069483 (5 pages).