The present application relates to neural networks for processing images. More particularly, the present application relates to systems and methods for generating a three-dimensional (3D) representation of a scene from a plurality of images of one or more viewpoints of the scene acquired using an imaging device.
The entire contents of 1 (one) computer program listing appendix electronically submitted with this application—mast3r_dust3r.txt, 941,771 bytes, the submitted file created 11 Oct. 2024—are hereby incorporated by reference.
The background description provided here is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Image-based 3D reconstruction from one or multiple views is a 3D image reconstruction task that aims at estimating the 3D geometry and camera parameters of a particular scene, given a set of images of the scene. Methods for solving such a 3D reconstruction task have numerous applications including: mapping, navigation, archaeology, cultural heritage preservation, robotics, and 3D vision. 3D reconstruction may involve assembling a pipeline of different methods including: keypoint detection and matching, robust estimation, Structure-from-Motion (SfM), Bundle Adjustment (BA), and dense Multi-View Stereo (MVS). SfM and MVS pipelines equate to solving a series of sub-problems including: matching points, finding essential matrices, triangulating points, and densely reconstructing the scene. One disadvantage of the above is that each sub-problem may not be solved faultlessly, possibly introducing noise to subsequent steps in the pipeline. Another disadvantage of the above is the inability to solve the monocular case (e.g., when a single image of a scene is available).
There is consequently a need for improved systems and methods for image-based 3D reconstruction.
In a feature, a computer-implemented method for reconstructing a scene in three dimensions from a plurality of images of one or more viewpoints of the scene acquired using an imaging device includes: receiving the plurality of images without receiving extrinsic or intrinsic properties of the imaging device; and processing the plurality of images using a neural network to produce a plurality of pointmaps of the scene that correspond to the plurality of images and that are aligned in a common coordinate frame, where each pointmap is a one-to-one mapping between pixels of one of the plurality of images and three-dimensional points of the scene.
In further features, the processing the plurality of images using the neural network further includes: processing the plurality of images using the neural network to produce a plurality of local feature maps that correspond to each of the plurality of images.
In further features, the method further includes performing one of the following applications using the plurality of pointmaps of the scene: (i) rendering a pointcloud of the scene for a given camera pose; (ii) recovering camera parameters of the scene; (iii) recovering depth maps of the scene for a given camera pose; and (iv) recovering three dimensional meshes of the scene.
In further features, the method further includes performing visual localization in the scene using the recovered camera parameters.
In further features, the processing the plurality of images using the neural network further includes: processing a plurality of image subsets of the plurality of images using the neural network, where each image subset of the plurality of image subsets includes different ones of the plurality of images; and aligning pointmaps from the plurality of image subsets into the plurality of pointmaps that are aligned in the common coordinate frame using a global aligner that performs regression based alignment.
In further features, the processing the plurality of images using the neural network further includes: processing a plurality of image subsets of the plurality of images using the neural network, where each image subset of the plurality of image subsets includes different ones of the plurality of images; and aligning pointmaps from the plurality of image subsets into the plurality of pointmaps that are aligned in the common coordinate frame using an alignment module that performs pixel correspondence based alignment using the plurality of local feature maps.
In further features, the processing the plurality of images using the neural network further includes: (a) for each of the plurality of images: (i) generating patches with a pre-encoder; (ii) encoding the generated patches with a transformer encoder to define token encodings that represent the generated patches; and (iii) decoding the token encodings with a transformer decoder to generate token decodings; (b1) for one of the token decodings, generating a pointmap corresponding to one of the plurality of images with a first regression head that produces pointmaps in a coordinate frame of the one of the plurality of images; and (c1) for each of other of the token decodings, generating a pointmap corresponding to each of the other of the plurality of images with a second regression head that produces pointmaps in the coordinate frame of the one of the plurality of images.
In further features, the processing the plurality of images using the neural network further includes: (b2) for the one of the token decodings, generating with a first descriptor head a local feature map of the one of the plurality of images; and (c2) for each of the other of the token decodings, generating with a second descriptor head local feature maps of the other of the plurality of images, respectively; where the first and second descriptor heads match features between image pixels of the plurality of images.
In further features, the processing the plurality of images using the neural network further includes: (a) initializing the plurality of pointmaps; (b) for each of the plurality of images: (i) generating image patches with a pre-encoder; and (ii) encoding the generated image patches with a transformer encoder to define image token encodings that represent the generated image patches; (c) for each of the plurality of pointmaps: (i) generating pointmap patches with a pre-encoder; and (ii) encoding the generated pointmap patches with a transformer encoder to define pointmap token encodings that represent the generated pointmap patches; (d) for each of the plurality of image token encodings and corresponding ones of the plurality of pointmap token encodings, aggregating each pair with a mixer to generate mixed token encodings; (e) for each of the generated mixed token encodings, decoding the mixed token encodings with a transformer decoder to generate mixed token decodings; (f) for each of the mixed token decodings, replacing the plurality of pointmaps corresponding to the plurality of images with pointmaps generated by a regression head that produces pointmaps in a coordinate frame that is common to the plurality of images; and (g) repeating (c)-(f) for a predetermined number of iterations.
In further features, each pointmap represents a two-dimensional field of three-dimensional points of the scene, and the processing the plurality of images using the neural network further includes generating a confidence score map for each pointmap.
In further features, processing the plurality of images further includes expressing the plurality of images using a co-visibility graph and processing the co-visibility graph using the neural network to produce the plurality of pointmaps of the scene that correspond to the plurality of images and that are aligned in the common coordinate frame.
In further features, the method further includes, based on the plurality of pointmaps of the scene, determining extrinsic or intrinsic parameters of the imaging device.
In further features, at least one of: the extrinsic parameters of the imaging device include rotation and translation of the imaging device; and the intrinsic parameters of the imaging device include skew and focal length.
In further features, the neural network performs cross attention between views of the scene.
In further features, a computer program product includes code instructions which, when the program is executed by a computer, cause the computer to carry out the method.
In further features, the processing includes: encoding the images into features corresponding to the images, respectively; determining similarities between pairs of the images based on the features, the similarity of each pair being determined based on the features of the images of that pair; generating a graph of the scene based on ones of the pairs; and determining the pointmaps of the scene that correspond to ones of the images of ones of the pairs and that are aligned in a common coordinate frame.
In further features, the processing includes: filtering out some ones of the pairs of the images based on the similarities, where the ones of the pairs include the ones of the pairs not filtered out.
In further features, the encoding the images includes: generating token features based on the images, respectively; applying whitening to the token features; quantizing the whitened token features according to a codebook; and aggregating and binarizing the residuals for each codebook element.
In further features, the codebook is obtained by k-means clustering.
In a feature, a computer-implemented method for reconstructing a scene in three dimensions from a plurality of images of one or more viewpoints of the scene acquired using one or more imaging devices includes: receiving the plurality of images without receiving extrinsic or intrinsic properties of the one or more imaging devices; encoding the images into features corresponding to the images, respectively; determining similarities between pairs of the images based on the features, the similarity of each pair being determined based on the features of the images of that pair; filtering out ones of the pairs of the images based on the similarities; generating a graph of the scene based on the pairs not filtered out; and determining pointmaps of the scene that correspond to ones of the images of the pairs not filtered out and that are aligned in a common coordinate frame, where each of the pointmaps is a one-to-one mapping between pixels of one of the plurality of images and three-dimensional points of the scene.
In further features, each vertex of the graph corresponds to one of the images of the pairs not filtered out.
In further features, an edge between two vertexes corresponds to an undirected connection between two likely overlapping images.
In further features, the features are tokens output from the encoding with whitening applied.
In further features, the encoding the images includes: generating token features based on the images, respectively; applying whitening to the token features; quantizing the whitened token features according to a codebook; and aggregating and binarizing the residuals for each codebook element.
In further features, the codebook is obtained by k-means clustering.
In further features, the determining the similarities includes determining the similarities based on summing a kernel function on binary representations over the common codebook elements.
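By way of illustration only, the encoding and similarity computation described in the preceding features (token features, whitening, quantization against a k-means codebook, aggregation and binarization of residuals, and a kernel summed over the shared codebook elements) may be sketched as follows. This is a minimal sketch in the spirit of aggregated selective match kernels; the codebook size, kernel exponent alpha, and selectivity threshold tau are illustrative assumptions, and the functions are not those of the computer program listing appendix.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(training_features, k=256):
    """Codebook obtained by k-means clustering of (whitened) token features."""
    return KMeans(n_clusters=k, n_init=10).fit(training_features).cluster_centers_

def whiten(features, mean, whitening_matrix):
    """Apply a precomputed whitening transform to token features."""
    return (features - mean) @ whitening_matrix.T

def encode_image(token_features, codebook):
    """Quantize whitened token features to their nearest codebook element, then
    aggregate and binarize (sign) the residuals for each visited element."""
    dists = ((token_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignment = dists.argmin(1)
    descriptor = {}
    for k in np.unique(assignment):
        residual = (token_features[assignment == k] - codebook[k]).sum(0)
        descriptor[int(k)] = residual > 0               # binary representation per codebook element
    return descriptor

def pairwise_similarity(desc_a, desc_b, alpha=3.0, tau=0.0):
    """Similarity determined by summing a kernel function on the binary representations
    over the codebook elements common to both images."""
    score = 0.0
    for k in desc_a.keys() & desc_b.keys():
        s = 2.0 * (desc_a[k] == desc_b[k]).mean() - 1.0  # agreement of binarized residuals, in [-1, 1]
        if s > tau:
            score += s ** alpha                          # selective monomial kernel
    return score

Evaluating pairwise_similarity for all image pairs yields the similarity matrix used for the keyframe selection and graph construction of the following features.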
In further features, the filtering out ones of the pairs of the images includes selecting a predetermined number of key ones of the images.
In further features, the selecting a predetermined number of key ones of the images includes selecting the key ones of the images using farthest point sampling (FPS) based on the similarities.
In further features, the generating the graph includes using the key ones of the images as connected nodes and connecting all other not filtered out images to their respective closest keyframe and their k nearest neighbors according to the similarities, where k is an integer greater than or equal to zero.
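A minimal sketch of the keyframe selection and graph construction described in the preceding features follows; it assumes a symmetric similarity matrix sim (for example, built with pairwise_similarity above), and the choice of the first keyframe and the fully connected keyframe core are illustrative assumptions.

import numpy as np

def select_keyframes(sim, num_key=20):
    """Farthest point sampling (FPS) on dissimilarity 1 - sim to pick key images."""
    dissim = 1.0 - sim
    selected = [int(np.argmax(sim.sum(axis=1)))]        # start from the most "central" image (assumption)
    min_d = dissim[selected[0]].copy()
    while len(selected) < min(num_key, sim.shape[0]):
        nxt = int(np.argmax(min_d))                     # image farthest from all selected keyframes
        selected.append(nxt)
        min_d = np.minimum(min_d, dissim[nxt])
    return selected

def build_scene_graph(sim, num_key=20, k=3):
    """Keyframes form a connected core; every other image is connected to its closest
    keyframe and to its k nearest neighbors according to the similarities."""
    n = sim.shape[0]
    keys = select_keyframes(sim, num_key)
    edges = {tuple(sorted((a, b))) for i, a in enumerate(keys) for b in keys[i + 1:]}
    for i in (j for j in range(n) if j not in keys):
        closest_key = max(keys, key=lambda kf: sim[i, kf])
        edges.add(tuple(sorted((i, closest_key))))
        for j in np.argsort(-sim[i])[: k + 1]:          # k nearest neighbors (skipping the image itself)
            if int(j) != i:
                edges.add(tuple(sorted((i, int(j)))))
    return sorted(edges)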
In further features, k is an integer greater than zero and less than 15.
In further features, the predetermined number of key ones of the images includes less than 25 images.
In further features, the method further includes aggregating the pointmaps into a canonical pointmap.
In further features, the method further includes determining a canonical depthmap based on the canonical pointmap.
In further features, aligning the pointmaps in the common coordinate frame.
In further features, the aligning the pointmaps includes aligning pixels of the pointmaps having matching three dimensional points in the scene.
In further features, the aligning the pointmaps further includes aligning the pointmaps using gradient descent based on minimizing a two dimensional reprojection error of three dimensional points of the imaging devices.
In further features, the aligning the pointmaps in the common coordinate frame includes aligning the pointmaps using a kinematic chain relating orientations of the imaging devices.
In further features, the kinematic chain includes a root node corresponding to one of the imaging devices and a set of directed edges relating the one of the images corresponding to the root node to the other ones of the imaging devices.
In further features, the method further includes rendering a three dimensional construction of the scene using the three dimensional points of the pointmaps.
In further features, the rendering the three dimensional construction of the scene includes constructing the three dimensional points using an inverse reprojection function as a function of the camera intrinsics, camera extrinsics, pixel coordinates, and depthmaps.
In further features, reparameterizing the camera extrinsics based on changing a rotation center of an imaging device from an optical center to a point at intersection of (a) a z vector from the imaging device center and (b) a median depth plane of the three dimensional points.
In further features, reparameterizing the camera extrinsics based on changing a rotation center of an imaging device from an optical center to a point at intersection of (a) a z vector from the imaging device center and (b) a point within a predetermined distance of a median depth plane of the three dimensional points.
In further features, controlling one or more propulsion devices of a mobile robot based on the pointmaps and navigating the scene.
In further features, controlling one or more actuators of a robot based on the pointmaps and interacting with one or more objects in the scene.
In further features, the plurality of images includes at least 500 images.
In further features, using the pointmaps for one or more of camera calibration, depth estimation, pixel correspondences, camera pose estimation and dense 3D reconstruction.
In a feature, a system includes: one or more processors; and memory including code that, when executed by the one or more processors, perform to: receive the plurality of images without receiving extrinsic or intrinsic properties of the one or more imaging devices; encode the images into features corresponding to the images, respectively; determine similarities between pairs of the images based on the features, the similarity of each pair being determined based on the features of the images of that pair; filter out ones of the pairs of the images based on the similarities; generate a graph of the scene based on the pairs not filtered out; and determine pointmaps of the scene that correspond to ones of the images of the pairs not filtered out and that are aligned in a common coordinate frame, where each of the pointmaps is a one-to-one mapping between pixels of one of the plurality of images and three-dimensional points of the scene.
In a feature, a system for reconstructing a scene in three dimensions from a plurality of images of one or more viewpoints of the scene acquired using an imaging device includes: one or more processors; and memory including code that, when executed by the one or more processors, perform to: receive the plurality of images without receiving extrinsic or intrinsic properties of the imaging device; and process the plurality of images using a neural network to produce a plurality of pointmaps of the scene that correspond to the plurality of images and that are aligned in a common coordinate frame, where each pointmap is a one-to-one mapping between pixels of one of the plurality of images and three-dimensional points of the scene.
In further features, the code, when executed by the one or more processors, further perform to: process the plurality of images using the neural network to produce a plurality of local feature maps that correspond to each of the plurality of images.
In further features, the code, when executed by the one or more processors, further perform one of the following applications using the plurality of pointmaps of the scene to: (i) render a pointcloud of the scene for a given camera pose; (ii) recover camera parameters of the scene; (iii) recover depth maps of the scene for a given camera pose; and (iv) recover three dimensional meshes of the scene.
Further areas of applicability of the present disclosure will become apparent from the detailed description, the claims and the drawings. The detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the disclosure.
The present disclosure will become more fully understood from the detailed description and the accompanying drawings, wherein:
among the drawings are figures reporting results on the Day-Night and InLoc datasets for different numbers of retrieved database images (topN).
In the drawings, reference numbers may be reused to identify similar and/or identical elements.
The disclosed methods for generating 3D representations of scenes from a plurality of images may be implemented within a system 100 architected as illustrated in
In one example, the server 101b (with processors 112e and memory 113e) shown in
The autonomous machine 202 may be powered, such as via an internal battery and/or via an external power source, such as alternating current (AC) power. AC power may be received via an outlet, a direct connection, etc. In various implementations, the autonomous machine 202 may receive power wirelessly, such as inductively. In alternate embodiments, the autonomous machine 202 may include alternate propulsion devices 226, such as one or more wheels, one or more treads/tracks, one or more propellers, and/or one or more other types of devices configured to propel the autonomous machine 202 forward, backward, right, left, up, and/or down. In operation, the control module 209 actuates the propulsion device(s) 226 to perform tasks issued by the inference module 207. In one example, a natural language description of a task is received via speaker 220, processed by an audio-to-text converter, and then input to the inference module 207, which provides input to the control module 209 to carry out the task.
With reference to
Advantageously, the neural network 304 and the scene generator 308 reconstruct from uncalibrated and unposed imaging devices, without prior information regarding the scene or the imaging devices, including extrinsic parameters (e.g., rotation and translation relative to some coordinate frame: (i) the absolute pose of the imaging device (i.e., the relation between the camera and a scene coordinate frame), (ii) relative pose of the different viewpoints of the scene (i.e., the relation between different camera poses)) and intrinsic parameters (e.g., camera lens focal length and distortion). The resulting scene representation is generated based on pointmaps 306 with properties that encapsulate (a) scene geometry, (b) relations between pixels and scene points and (c) relations between viewpoints. From aligned pointmaps 308 alone, scene parameters (i.e., cameras and scene geometry) may be recovered.
The neural network 304 uses an objective function that minimizes the error between ground-truth and predicted pointmaps 306 (after normalization) using a confidence score function. The neural network 304 in one example is based on or includes large language models (LLMs), which are large neural networks trained on large quantities of unlabeled data. The architecture of such neural networks may be based on the transformer architecture with a transformer encoder and decoder with a self-attention mechanism. Such a transformer architecture as used in an embodiment herein is described in Ashish Vaswani et al., “Attention is all you need”, In I. Guyon et al., editors, Advances in Neural Information Processing Systems 30, pages 5998-6008, Curran Associates, Inc., 2017, which is incorporated herein in its entirety. Additional information regarding the transformer architecture can be found in U.S. Pat. No. 10,452,978, which is incorporated herein in its entirety. Alternative attention-based architectures include recurrent, graph and memory-augmented neural networks. To apply the transformer network to images, the neural network 304 in an example herein is based on the Vision Transformer (ViT) architecture (see Alexey Dosovitskiy et al., entitled “An image is worth 16×16 words: Transformers for image recognition at scale”, in ICLR, 2021, which is incorporated herein in its entirety).
A pointmap X is a 2D field of 3D scene points that, in association with its corresponding RGB image of resolution W×H, forms a one-to-one mapping between image pixels and 3D scene points. At 606, one implementation of pointmap X is illustrated as a 2D field of 3D scene points 608, where mappings 609 for pointmap 306a are given by the position of the 2D field of 3D scene points 608 (i.e., a 5×5 matrix of 3D scene points) relative to the position of each pixel in the image 607 (i.e., a 5×5 matrix of 2D image pixels).
Further, examples disclosed hereunder assume that each camera ray hits a single 3D point (i.e., the case of translucent surfaces may be ignored). In addition, given camera intrinsics K∈R3×3, the pointmap X of the observed scene can be obtained from the ground-truth depthmap D∈RW×H as Xi,j=K−1 [iDi,j, jDi,j, Di,j]T, where (i, j)∈{1 . . . W}×{1 . . . H} denote the x-y pixel coordinates. Here, X is expressed in the camera frame. Herein, Xn,m may denote the pointmap Xn from camera n expressed in image m's coordinate frame: Xn,m=PmPn−1 h(Xn), with Pm, Pn∈R3×4 the world-to-camera poses for views n and m, and h: (x, y, z)→(x, y, z, 1) the homogeneous mapping.
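For illustration, these two relations may be sketched numerically as follows; treating the poses as 4×4 homogeneous world-to-camera matrices and the exact pixel indexing are implementation assumptions, and the function names are illustrative only.

import numpy as np

def depth_to_pointmap(depth, K):
    """Xi,j = K^-1 [i*Di,j, j*Di,j, Di,j]^T, expressed in the camera frame.
    depth: (H, W) ground-truth depthmap; K: (3, 3) camera intrinsics."""
    H, W = depth.shape
    i, j = np.meshgrid(np.arange(W), np.arange(H))        # i: x (width) index, j: y (height) index
    pix = np.stack([i * depth, j * depth, depth], axis=-1)
    return pix @ np.linalg.inv(K).T                       # (H, W, 3) pointmap

def change_coordinate_frame(X_n, P_n, P_m):
    """Xn,m = Pm Pn^-1 h(Xn): re-express pointmap Xn (frame of camera n) in camera m's frame.
    P_n, P_m: 4x4 world-to-camera poses; h is the homogeneous mapping (x, y, z) -> (x, y, z, 1)."""
    Xh = np.concatenate([X_n, np.ones(X_n.shape[:-1] + (1,))], axis=-1)
    T = P_m @ np.linalg.inv(P_n)
    return (Xh @ T.T)[..., :3]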
This Section sets forth a first neural network architecture of the neural network 304 shown in
At 702, (i) for each of the plurality of images 302 {I1, I2, . . . , IN} a pre-encoder 804 generates patches 805; (ii) a transformer encoder 806 encodes the patches 805 to generate token encodings 807 that represent the generated patches; and (iii) a transformer decoder 808 decodes, with decoder blocks 803, the token encodings 807 to generate token decodings 809 that are fed to regression head 810. In the case of a network 304a adapted to process two input images 302 {I1, I2} (or more generally more than one image), after pre-encoder 804 generates patches, the transformer encoder 806 then reasons over both sets of patches jointly (collectively). In one example, the decoder is a transformer network equipped with cross attention. Each decoder block 803 sequentially performs self-attention (each token of a view attends to tokens of the same view), then cross-attention (each token of a view attends to all other tokens of the other view). Information is shared between the branches during the decoder pass in order to output aligned pointmaps. Namely, each decoder block 803 attends to the tokens from the decoder block 803 of the other branch. Continuing with the example of two input images 302 {I1, I2}, this may be given by:
G^1_i=DecoderBlock^1_i(G^1_{i−1}, G^2_{i−1}) and G^2_i=DecoderBlock^2_i(G^2_{i−1}, G^1_{i−1}),
for i=1, . . . , B, where B is the number of decoder blocks 803 and where G^1_0 and G^2_0 are initialized with the token encodings 807 of the two images.
At 704, for one of the token decodings 809a, a pointmap 306a that corresponds to image 302a is generated by a first regression head 811a, which produces pointmaps in a coordinate frame 812a of the image 302a that is input to the regression head 811a. At 706, for each of the other token decodings 809b . . . 809n, pointmaps 306b . . . 306n that correspond to each of the other of the plurality of images 302b . . . 302n are generated by a second regression head 811b that produces pointmaps in the coordinate frame 812a (output by the first regression head 811a, not in the coordinate frames 812b . . . 812n, respectively, in which each image 302b . . . 302n was captured). More specifically, each branch is a separate regression head 811a and 811b which takes the set of decoder tokens D 809a and 809b . . . 809n and outputs at 708 pointmaps X 306a . . . 306n (in common reference frame 812a) and associated confidence maps C 814a . . . 814n, respectively. Returning to the example of two input images 302 {I1, I2}, the regression heads 811a and 811b may be given by:
X1,1, C1,1=Head1(G1) and X2,1, C2,1=Head2(G2),
where G1 and G2 are the input tokens from the token decodings D 809 and X1,1, C1,1 and X2,1, C2,1 are pairs of pointmaps 306 and confidence maps 814, respectively.
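By way of illustration, the two-branch flow just described may be sketched in PyTorch-style pseudocode; the class and argument names are placeholders (for the pre-encoder/encoder 804/806, the entangled decoder blocks 803, and the regression heads 811a, 811b), and the sketch is not the code of the computer program listing appendix.

import torch.nn as nn

class TwoViewPointmapNet(nn.Module):
    """Schematic two-branch network: a shared encoder, two entangled decoder branches
    (self-attention within a view, then cross-attention to the other view), and two
    regression heads producing pointmaps and confidences in the frame of image 1."""

    def __init__(self, encoder, decoder_blocks1, decoder_blocks2, head1, head2):
        super().__init__()
        self.encoder = encoder                              # patchify + transformer encoder (shared)
        self.decoder_blocks1 = nn.ModuleList(decoder_blocks1)
        self.decoder_blocks2 = nn.ModuleList(decoder_blocks2)
        self.head1, self.head2 = head1, head2

    def forward(self, img1, img2):
        g1, g2 = self.encoder(img1), self.encoder(img2)     # token encodings of the two views
        for blk1, blk2 in zip(self.decoder_blocks1, self.decoder_blocks2):
            g1, g2 = blk1(g1, g2), blk2(g2, g1)             # each block also attends to the other branch
        X11, C11 = self.head1(g1)                           # pointmap + confidence, frame of image 1
        X21, C21 = self.head2(g2)                           # pointmap of image 2, also in image 1's frame
        return (X11, C11), (X21, C21)

Because every decoder block in one branch attends to the tokens of the other branch, the two predicted pointmaps come out expressed in the same coordinate frame (that of the first image).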
The output pointmaps 306 are regressed up to a scale factor. Also, it should be noted that the DUSt3R architecture may not explicitly enforce any geometrical constraints. Hence, pointmaps 306 may not necessarily correspond to any physically plausible camera model. Rather, during training, the DUSt3R neural network 304a may learn all relevant priors present in the training set, which only contains geometrically consistent pointmaps. Using a generic architecture leverages such training.
The DUSt3R neural network model may be trained in a fully-supervised manner using a regression loss, leveraging large public datasets for which ground-truth annotations are either synthetically generated, reconstructed from Structure-from-Motion (SfM) software, or captured using sensors. A fully data-driven strategy based on a generic transformer architecture is adopted, not enforcing any geometric constraints at inference, but being able to benefit from powerful pretraining schemes. The DUSt3R neural network model learns strong geometric and shape priors, like shape from texture, shading or contours.
Additional details concerning the DUSt3R architecture described in this Section are set forth in Section B (below), including training and experimentation.
This Section sets forth an alternate example of the first neural network architecture shown in
The MASt3R architecture shown in
Similar to the first embodiment shown in
where G1 and G2 are the input tokens from the token decodings 809 and D1 and D2 are local feature maps 818∈RH×W×d of dimension d.
The MASt3R neural network model is trained using a loss function based on or including a regression loss and a local descriptor matching loss. Similar to the DUSt3R architecture, the MASt3R architecture may not explicitly enforce any geometrical constraint, in which case, pointmaps 306 do not necessarily correspond to any physically plausible camera model. However, scale invariance is not always desirable. Scale dependence (e.g., metric scale) may be desirable for some applications/tasks (e.g., visual localization without mapping, and monocular metric-depth estimation).
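As a hedged illustration of how the local feature maps output by the descriptor heads may be supervised, the following sketches an InfoNCE-style matching term over ground-truth pixel correspondences; the exact matching loss, correspondence sampling, and temperature used for MASt3R may differ, and all names here are assumptions.

import torch
import torch.nn.functional as F

def matching_loss(desc1, desc2, corr, temperature=0.07):
    """InfoNCE-style sketch: each matched pixel of image 1 should be most similar, among
    the sampled pixels of image 2, to its ground-truth correspondent (and vice versa).
    desc1, desc2: (H, W, d) local feature maps; corr: (N, 4) long tensor of (y1, x1, y2, x2)."""
    d1 = F.normalize(desc1[corr[:, 0], corr[:, 1]], dim=-1)   # descriptors at matched pixels of image 1
    d2 = F.normalize(desc2[corr[:, 2], corr[:, 3]], dim=-1)   # descriptors at matched pixels of image 2
    logits = d1 @ d2.t() / temperature                        # (N, N) similarity matrix
    target = torch.arange(len(corr), device=logits.device)    # the diagonal holds the true matches
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))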
Additional details concerning the MASt3R architecture described in this Section are set forth in Section C (below), including training and experimentation.
This Section sets forth a second neural network architecture of the neural network 304 shown in
At 902, a plurality of pointmaps 306 are initialized using random input generator 950 with random input (e.g., random noise). At 904, (i) for each of the plurality of images 302 {I1, I2, . . . , IN}, image patches 952 are generated with a pre-encoder 951, and (ii) the generated image patches 952 are encoded with a transformer encoder 953 to generate image token encodings 954 representative thereof. At 906, (i) for each of the plurality of pointmaps 306 initialized at 902, pointmap patches 956 are generated by a pre-encoder 955; and (ii) the generated pointmap patches 956 are encoded with a transformer encoder 957 to generate pointmap token encodings 958 representative thereof.
At 908, a mixer 959, e.g., a transformer decoder neural network, aggregates the image token encodings 954 and the pointmap token encodings 958 for each respective image of the plurality of images 302 {I1, I2, . . . , IN} to generate mixed token encodings 960. At 910, the mixed token encodings 960 are decoded by a transformer decoder 961 with decoder blocks 963 to generate mixed token decodings 962, respectively. Similar to the DUSt3R architecture, each decoder block 963 performs self-attention (each token of a view attends to tokens of the same view) and cross-attention (each token of a view attends to all other tokens of the other view).
At 912, for each of the mixed token decodings 962, the plurality of pointmaps 306 corresponding to the plurality of images 302 are replaced with pointmaps 306 generated by a regression head 964 that produces pointmaps 306 in a coordinate frame 965 that is common to the plurality of images 302, which may be different from the coordinate frame 967 from which the respective images 302 are captured. At 914, a determination is made whether a predetermined number of iterations has been performed. In the event the number of iterations has not been performed at 914, the pointmaps 306 generated by the regression head 964 now serve as input to the pre-encoder 955, and steps 906, 908, 910, 912 and 914 are repeated. In the event the number of iterations has been performed at 914, the pointmaps produced by the regression head 964 on the last iteration are output with confidence scores 966, corresponding to the plurality of images 302 that are aligned in the common coordinate frame 965.
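A schematic sketch of this iterative loop follows, with all modules passed in as placeholders for the pre-encoders, encoders, mixer, decoder, and regression head described above; it illustrates the control flow only and is not the implementation of the appendix.

import torch

def iterative_refinement(images, image_encoder, pointmap_encoder, mixer, decoder, head, n_iters=3):
    """Iteratively refine pointmaps in a common frame from randomly initialized pointmaps.
    images: list of (C, H, W) tensors; the module arguments are placeholder callables."""
    pointmaps = [torch.randn(*img.shape[-2:], 3) for img in images]   # random initialization (902)
    image_tokens = [image_encoder(img) for img in images]             # image token encodings (904)
    for _ in range(n_iters):
        pointmap_tokens = [pointmap_encoder(pm) for pm in pointmaps]  # pointmap token encodings (906)
        mixed = [mixer(it, pt) for it, pt in zip(image_tokens, pointmap_tokens)]  # mixing (908)
        decoded = decoder(mixed)                                      # cross-attention across views (910)
        outputs = [head(d) for d in decoded]                          # regression head (912)
        pointmaps = [pm for pm, _ in outputs]                         # replace the pointmaps
    confidences = [conf for _, conf in outputs]
    return pointmaps, confidences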
Additional details concerning the C-GAR architecture described in this Section are set forth in Section D (below), including training and experimentation.
With reference again to the example shown in
In contrast, the embodiment shown in
Aligning subsets of pointmaps 405 of a scene 301 processed by global aligner 407 in
After constructing a connectivity graph G, globally aligned pointmaps {Xn∈RW×H×3} are recovered for all camera viewpoints n=1 . . . N that captured images of the scene, by predicting, for each image pair e=(n,m)∈E, the pairwise pointmaps Xn,n, Xm,n and their associated confidence maps Cn,n, Cm,n. More specifically, denoting Xn,e:=Xn,n and Xm,e:=Xm,n, and since the goal involves rotating all pairwise predictions in a common frame, a pairwise pose Pe and scaling σe>0 associated with each edge e∈E are defined. Given the foregoing, the following optimization problem may be solved:
X* = arg min_{X, P, σ} Σ_{e∈E} Σ_{v∈e} Σ_{i=1..HW} C_i^{v,e} ∥X_i^v − σ_e P_e X_i^{v,e}∥.
Solving such global optimization may be carried out using gradient descent, which in an example converges after a few hundred steps, involving seconds on a standard GPU (Graphics Processing Unit). The idea is that, for a given pair e=(n,m), the same rigid transformation Pe should align both pointmaps Xn,e and Xm,e with the world-coordinate pointmaps Xn and Xm, since Xn,e and Xm,e are by definition both expressed in the same coordinate frame. To avoid the trivial optimum where σe=0, ∀e∈E, Πe σe=1 is enforced. An extension to this framework enables the recovery of all camera parameters: by replacing Xn:=Pn−1 h(Kn−1 [U Dn; V Dn; Dn]), all camera poses {Pn}, associated intrinsics {Kn} and depthmaps {Dn} for n=1 . . . N may be estimated.
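A minimal PyTorch-style sketch of this gradient-descent alignment follows. The pose parameterization (axis-angle rotations and translations per edge), the log-scale parameterization of σe, the zero/noise initializations, and the way the unit-product scale constraint is maintained are illustrative assumptions rather than the optimization of the appendix.

import torch

def skew(k):
    """3-vector -> skew-symmetric matrix (helper for the Rodrigues formula)."""
    z = torch.zeros((), dtype=k.dtype)
    return torch.stack([torch.stack([z, -k[2], k[1]]),
                        torch.stack([k[2], z, -k[0]]),
                        torch.stack([-k[1], k[0], z])])

def rodrigues(w):
    """Axis-angle 3-vector -> 3x3 rotation matrix."""
    theta = w.norm() + 1e-8
    K = skew(w / theta)
    return torch.eye(3, dtype=w.dtype) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def global_align(preds, confs, shapes, iters=300, lr=0.01):
    """Optimize a world pointmap X_n per image and a pose/scale per edge so that
    sigma_e * P_e * X^{v,e} matches X^v, weighted by confidence (a 3D projection error).
    preds[(n, m)] = (X_{n,e}, X_{m,e}); confs[(n, m)] = (C_{n,e}, C_{m,e}); shapes[n] = (H, W)."""
    images = sorted({v for e in preds for v in e})
    world = {n: torch.zeros(*shapes[n], 3, requires_grad=True) for n in images}  # in practice, init from predictions
    rot = {e: (1e-3 * torch.randn(3)).requires_grad_() for e in preds}
    trans = {e: torch.zeros(3, requires_grad=True) for e in preds}
    log_scale = {e: torch.zeros(1, requires_grad=True) for e in preds}
    params = (list(world.values()) + list(rot.values())
              + list(trans.values()) + list(log_scale.values()))
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = 0.0
        for e, (Xn_e, Xm_e) in preds.items():
            R, t, s = rodrigues(rot[e]), trans[e], log_scale[e].exp()
            for v, X_ve, C_ve in zip(e, (Xn_e, Xm_e), confs[e]):
                aligned = s * (X_ve @ R.T + t)                    # sigma_e * P_e applied to X^{v,e}
                loss = loss + (C_ve * (world[v] - aligned).norm(dim=-1)).sum()
        loss.backward()
        opt.step()
        with torch.no_grad():                                     # keep the product of the scales equal to 1
            mean_log = torch.stack(list(log_scale.values())).mean()
            for ls in log_scale.values():
                ls -= mean_log
    return {n: Xn.detach() for n, Xn in world.items()}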
Generally speaking, the neural networks disclosed herein reconstruct a 3D scene from un-calibrated and un-posed images by unifying monocular and binocular 3D reconstruction. The pointmap representation for Multi-View Stereo (MVS) applications enables the neural network to predict 3D shapes in a canonical frame, while preserving the implicit relationship between pixels and the scene. This effectively drops many constraints of the usual perspective camera formulation. Further, an optimization procedure may be used to globally align pointmaps in the context of multi-view 3D reconstruction by optimizing the camera pose and geometry alignment directly in 3D space. This procedure can extract intermediary outputs of existing Structure-from-Motion (SfM) and MVS pipelines. Finally, the neural networks disclosed herein are adapted to handle real-life monocular and multi-view reconstruction scenarios seamlessly, even when the camera is not moving between frames.
In addition to methods set forth for generating 3D representations of scenes from a plurality of images, the present application includes a computer program product comprising code instructions to execute the methods described herein (particularly data processors 112 of the servers 101 and the client devices 102), and storage readable by computer equipment (memory 113) provided with this computer program product for storing such code instructions.
Multi-view stereo reconstruction (MVS) in the wild involves first estimating, by one or more processors, the camera parameters (e.g., intrinsic and extrinsic parameters). These may be tedious and cumbersome to obtain, yet they are used to triangulate corresponding pixels in 3D space, which may be important. In this disclosure, an opposite stance is taken and DUSt3R is introduced, a novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction (DUSt3R) of arbitrary image collections (operating without prior information about camera calibration or viewpoint poses). The pairwise reconstruction problem is cast as a regression of pointmaps, relaxing the hard constraints of projective camera models. The present application shows that this formulation smoothly unifies the monocular and binocular reconstruction cases. In the case where more than two images are provided, this application proposes a simple yet effective global alignment strategy that expresses all pairwise pointmaps in a common reference frame. The disclosed network architecture is based on transformer encoders and decoders, which allows powerful pretrained models to be leveraged. The disclosed formulation directly provides a 3D model of the scene as well as depth information, and, interestingly, pixel matches and relative and absolute camera poses can be seamlessly recovered from it. Experiments on all these tasks showcase that DUSt3R can unify various 3D vision tasks and achieves strong performance on monocular/multi-view depth estimation as well as relative pose estimation. Advantageously, DUSt3R makes many geometric 3D vision tasks easy to perform.
Unconstrained image-based dense 3D reconstruction from multiple views is useful for computer vision. Generally speaking, the task aims at estimating the 3D geometry and camera parameters of a scene, given a set of images of the scene. Not only does it have numerous applications/tasks like mapping, navigation, archaeology, cultural heritage preservation, robotics, but perhaps more importantly, it holds a fundamentally special place among all 3D vision tasks. Indeed, it subsumes nearly all of the other geometric 3D vision tasks. Thus, some approaches for 3D reconstruction include keypoint detection and matching, robust estimation, Structure-from-Motion (SfM) and Bundle Adjustment (BA), dense Multi-View Stereo (MVS), etc.
SfM and MVS pipelines may involve solving a series of minimal problems: matching points, finding essential matrices, triangulating points, sparsely reconstructing the scene, estimating cameras and finally performing dense reconstruction. This rather complex chain may be a viable solution in some settings, but may be unsatisfactory here: each sub-problem may not be solved perfectly and adds noise to the next step, increasing the complexity and the engineering effort for the pipeline to work as a whole. In this regard, the absence of communication between the sub-problems may be telling: it would seem more reasonable if they helped each other, i.e., dense reconstruction may benefit from the sparse scene that was built to recover camera poses, and vice-versa. In addition, functions in this pipeline may be brittle. For instance, a stage of SfM that serves to estimate all camera parameters may fail in some situations, e.g., when the number of scene views is low, for objects with non-Lambertian surfaces, in case of insufficient camera motion, etc.
In this Section B, DUSt3R, a novel approach for Dense Unconstrained Stereo 3D Reconstruction from un-calibrated and un-posed cameras, is presented.
A component is a network that can regress a dense and accurate scene representation solely from a pair of images, without prior information regarding the scene nor the cameras (not even the intrinsic parameters). The resulting scene representation is based on 3D pointmaps with rich properties: they simultaneously encapsulate (a) the scene geometry, (b) the relation between pixels and scene points and (c) the relation between the two viewpoints. From this output alone, practically all scene parameters (i.e., cameras and scene geometry) can be extracted. This is possible because the disclosed systems and methods jointly process the input images and the resulting 3D pointmaps, thus learning to associate 2D structures with 3D shapes, and having the opportunity to solve multiple minimal problems simultaneously, enabling internal 'collaboration' between them.
As set forth above, the disclosed model may be trained in a fully-supervised manner using a regression loss, leveraging large public datasets for which ground-truth annotations are either synthetically generated, reconstructed from SfM software or captured using dedicated sensors. The disclosed embodiments drift away from integrating task-specific modules, and instead adopt a fully data-driven strategy based on a transformer architecture, not enforcing any geometric constraints at inference, but being able to benefit from powerful pretraining schemes. The network learns strong geometric and shape priors, like shape from texture, shading or contours.
To fuse predictions from multiple image pairs, bundle adjustment (BA) for the case of pointmaps may be used, thereby achieving full-scale MVS. The disclosed embodiments introduce a global alignment procedure that, contrary to BA, does not involve minimizing reprojection errors. Instead, the camera poses and geometry alignment are optimized directly in 3D space, which is fast and shows excellent convergence in practice. The disclosed experiments show that the reconstructions are accurate and consistent between views in real-life scenarios with various unknown sensors. The disclosed embodiments further demonstrate that the same architecture can handle real-life monocular and multi-view reconstruction scenarios seamlessly. Examples of reconstructions using the DUSt3R network shown in
The disclosed contributions are fourfold. First, the first holistic end-to-end 3D reconstruction pipeline from un-calibrated and un-posed images is presented that unifies monocular and binocular 3D reconstruction. Second, the pointmap representation for MVS applications is introduced that enables the network to predict the 3D shape in a canonical frame, while preserving the implicit relationship between pixels and the scene. This effectively drops many constraints of perspective camera formulations. Third, an optimization procedure to globally align pointmaps in the context of multi-view 3D reconstruction is introduced. The disclosed procedure can extract effortlessly all usual intermediary outputs of the classical SfM and MVS pipelines. The disclosed approaches unify 3D vision tasks and considerably simplify other reconstruction pipelines, making DUSt3R seem simple and easy in comparison. Fourth, promising performance is demonstrated on a range of 3D vision tasks, such as multi-view camera pose estimation.
Some related works in 3D vision are summarized in this Section. Additional related works are summarized in Section B.6.2 (below).
Structure-from-Motion (SfM) involves reconstructing sparse 3D maps while jointly determining camera parameters from a set of images. Some pipelines start from pixel correspondences obtained from keypoint matching between multiple images to determine geometric relationships, followed by bundle adjustment to optimize 3D coordinates and camera parameters jointly. Learning-based techniques may be incorporated into subprocesses. The sequential structure of the SfM pipelines persists, however, making them vulnerable to noise and errors in each individual component.
MultiView Stereo (MVS) involves the task of densely reconstructing visible surfaces, which is achieved via triangulation between multiple viewpoints. In a formulation of MVS, all camera parameters may be provided as inputs. Approaches may depend on camera parameter estimates obtained via calibration procedures, either during the data acquisition or using Structure-from-Motion approaches for in-the-wild reconstructions. In real-life scenarios, inaccuracy of pre-estimated camera parameters can be detrimental for proper performance. This present application proposes instead to directly predict the geometry of visible surfaces without any explicit knowledge of the camera parameters.
Direct RGB-to-3D. Some approaches may directly predict 3D geometry from a single RGB image. Neural networks that learn strong 3D priors from large datasets to solve ambiguities may be leveraged. These methods can be classified into two groups. A first group leverages class-level object priors. For instance, a model may be learned that can fully recover shape, pose, and appearance from a single image, given a large collection of 2D images. A second group may involve general scenes. Monocular depth estimation (MDE) networks may be systematically built. Depth maps encode a form of 3D information and, combined with camera intrinsics, can yield pixel-aligned 3D point-clouds. SynSin (see Wiles et al., "SynSin: End-to-end view synthesis from a single image", in CVPR, pp. 7465-7475, 2020), for example, performs new viewpoint synthesis from a single image by rendering feature-augmented depthmaps knowing all camera parameters. When camera intrinsics are not available, they can be inferred by exploiting temporal consistency in video frames, either by enforcing a global alignment or by leveraging differentiable rendering with a photometric reconstruction loss. Another way is to explicitly learn to predict camera intrinsics, which enables performing metric 3D reconstruction from a single image when combined with MDE networks. These methods are, however, intrinsically limited by the quality of depth estimates, which is poorly suited for monocular settings.
The proposed network processes two viewpoints simultaneously in order to output depthmaps, or rather, pointmaps. In theory, at least, this makes triangulation between rays from different viewpoints possible. The disclosed networks output pointmaps (i.e., dense 2D fields of 3D points), which handle camera poses implicitly and make the regression problem better posed.
Pointmaps. Using a collection of pointmaps as shape representation is counter-intuitive for MVS.
Before discussing the details of the disclosed method, this Section introduces some concepts of pointmaps introduced above.
Pointmap. In the following, a dense 2D field of 3D points is denoted as a pointmap X∈RW×H×3. In association with its corresponding RGB image I of resolution W×H, X forms a one-to-one mapping between image pixels and 3D scene points, i.e., Ii,j↔Xi,j, for all pixel coordinates (i, j)∈{1 . . . W}×{1 . . . H}. The disclosed embodiments assume that each camera ray hits a single 3D point (i.e., ignoring the case of translucent surfaces).
Camera and scene. Given the camera intrinsics K∈R3×3, the pointmap X of the observed scene can be obtained by one or more processors from the ground-truth depthmap D∈RW×H as Xi,j=K−1 [iDi,j, jDi,j, Di,j]T. Here, X is expressed in the camera coordinate frame. In the following, Xn,m is denoted as the pointmap Xn from camera n expressed in camera m's coordinate frame: Xn,m=PmPn−1 h(Xn), with Pm, Pn∈R3×4 the world-to-camera poses for views n and m, and h: (x, y, z)→(x, y, z, 1) the homogeneous mapping.
The disclosed embodiments describe a network that solves the 3D reconstruction task for the generalized stereo (multiple image) case through direct regression. To that aim, a network is trained that takes as input 2 RGB images I1, I2∈RW×H×3 and generates 2 corresponding pointmaps X1,1, X2,1∈RW×H×3 with associated confidence maps C1,1, C2,1∈RW×H based on the respective RGB images. Both pointmaps are expressed in the same coordinate frame of I1, which offers advantages as described herein. For the sake of clarity and without loss of generality, both images are assumed to have the same resolution W×H, but in practice their resolution can differ.
Network architecture. The networks 1204 may benefit from CroCo pretraining. Details on the Cross-view Completion "CroCo" architecture and pretraining are set forth in Weinzaepfel et al., (i) "CroCo: Self-Supervised Pre-Training for 3D Vision Tasks by Cross-View Completion", in NeurIPS, 2022 and (ii) "CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow", in ICCV, 2023 ((i) and (ii) also referred to herein as "Weinzaepfel et al. 2023"), and in (iii) U.S. patent application Ser. Nos. 18/230,414 and 18/239,739, each of which is incorporated herein in its entirety. The resulting token representations F1 and F2 of networks 1204a and 1204b, respectively, are passed to two transformer decoders 1206 that constantly exchange information via cross-attention and finally, two regression heads 1208 output the two corresponding pointmaps 1214 and associated confidence maps 1216. The two pointmaps 1214a and 1214b may be expressed in the same coordinate frame of the first image I1, and the network F is trained using a simple regression loss.
More specifically, as shown in
The network reasons over both token representations jointly in the decoder 1206. Similarly to CroCo, the decoder 1206 may be a transformer network equipped with cross attention. Each decoder block 1206 sequentially performs self-attention (each token of a view attends to tokens of the same view), then cross-attention (each token of a view attends to all other tokens of the other view), and finally feeds tokens to regression head 1208, such as a Multi-Layer Perceptron (MLP). Importantly, information is constantly shared between the two branches during the decoder pass/operation 1206. This is to output properly aligned pointmaps. Namely, each decoder block 1206 attends to tokens from the other branch, such as follows:
G^1_i=DecoderBlock^1_i(G^1_{i−1}, G^2_{i−1}) and G^2_i=DecoderBlock^2_i(G^2_{i−1}, G^1_{i−1}),
for i=1, . . . , B for a decoder with B blocks and initialized with encoder tokens G^1_0:=F1 and G^2_0:=F2. Here, DecoderBlock^v_i(G^1, G^2) denotes the i-th block in branch v∈{1,2}, G^1 and G^2 are the input tokens, with G^2 the tokens from the other branch. Finally, in each branch a separate regression head 1208 takes the set of decoder tokens and outputs a pointmap and an associated confidence map:
X1,1, C1,1=Head^1(G^1_0, . . . , G^1_B) and X2,1, C2,1=Head^2(G^2_0, . . . , G^2_B),
where X1,1 and X2,1 are output pointmaps and C1,1 and C2,1 are output confidence score maps.
The output pointmaps X1,1 and X2,1 are regressed up to a scale factor, such as by the regression heads. The disclosed architecture may not explicitly enforce any geometrical constraints. Hence, pointmaps may not necessarily correspond to any physically plausible camera model. Rather, the network is allowed to learn all relevant priors present in the training set, which only includes geometrically consistent pointmaps. Using the described architecture allows leveraging strong pretraining techniques, ultimately surpassing what task-specific architectures can achieve. The learning process is detailed in the next section.
3D Regression loss. A training objective is based on regression in the 3D space. The ground truth pointmaps are denoted as
To handle the scale ambiguity between prediction and ground-truth, the predicted and ground-truth pointmaps may be normalized (e.g., by a normalization module) by scaling factors z=norm (X1,1, X2,1) and
Confidence-aware loss. In reality, there may be ill-defined 3D points, e.g., in the sky or on translucent objects. More generally, some parts in the image may be harder to predict than others. The disclosed embodiments jointly learn to predict a score for each pixel which represents the confidence that the network has about this particular pixel. The final training objective is the confidence-weighted regression loss from Equation B2 over all valid pixels:
L_conf = Σ_{v∈{1,2}} Σ_{i∈Dv} Civ,1 ℓ_regr(v, i) − α log Civ,1,
where Dv is the set of valid pixels of view v, ℓ_regr(v, i) is the regression loss of Equation B2, Civ,1 is the confidence score for pixel i, and α is a hyper-parameter controlling the regularization term (see Wan et al., "Confnet: Predict with confidence", in ICASSP, pp. 2921-2925, 2018, which is incorporated herein in its entirety). To ensure a strictly positive confidence, Civ,1=1+exp Ĉiv,1>1 is defined, where Ĉiv,1 denotes the raw network output for that pixel. This has the effect of forcing the network to extrapolate in harder areas, e.g., those covered by a single view. Training network F with this objective allows the network to estimate confidence scores without explicit supervision. Examples of input image pairs with their corresponding outputs are shown in
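For illustration, this confidence-weighted objective may be sketched as follows; the normalization by the mean distance of valid points to the origin and the tensor shapes are assumptions consistent with the description above, not the training code of the appendix.

import torch

def confidence_regression_loss(pred, conf_logits, gt, valid, alpha=0.2):
    """Sketch of the confidence-aware regression loss.
    pred, gt: (B, H, W, 3) pointmaps; conf_logits: (B, H, W) raw confidence outputs;
    valid: (B, H, W) float mask of pixels with ground truth; alpha: regularization weight."""
    def scale(x):
        d = x.norm(dim=-1)                                     # distance of every point to the origin
        return ((d * valid).sum(dim=(1, 2), keepdim=True)
                / valid.sum(dim=(1, 2), keepdim=True).clamp(min=1)).unsqueeze(-1)
    regr = (pred / scale(pred) - gt / scale(gt)).norm(dim=-1)  # normalized per-pixel 3D regression error
    conf = 1 + conf_logits.exp()                               # strictly positive confidence C > 1
    loss = conf * regr - alpha * conf.log()                    # confidence weighting plus regularizer
    return (loss * valid).sum() / valid.sum().clamp(min=1)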
The rich properties of the output pointmaps allows various convenient operations/tasks to be performed using the pointmaps.
Establishing correspondences between pixels of two images can be achieved using nearest neighbor (NN) search in the 3D pointmap space. To minimize errors, reciprocal (mutual) correspondences M1,2 between images I1 and I2 may be retained, i.e., only pairs of pixels whose 3D points are mutual nearest neighbors of one another are kept.
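A short sketch of this reciprocal nearest-neighbor matching in 3D pointmap space follows; the use of a k-d tree is an implementation choice for illustration, not mandated by the description.

import numpy as np
from scipy.spatial import cKDTree

def reciprocal_matches(X1, X2):
    """Mutual nearest-neighbor correspondences between two pointmaps expressed in the
    same coordinate frame. X1, X2: (H, W, 3) arrays; returns flat pixel indices."""
    P1, P2 = X1.reshape(-1, 3), X2.reshape(-1, 3)
    nn12 = cKDTree(P2).query(P1)[1]      # for each pixel of image 1, nearest 3D point of image 2
    nn21 = cKDTree(P1).query(P2)[1]      # and vice versa
    idx1 = np.arange(len(P1))
    keep = nn21[nn12] == idx1            # retain only reciprocal (mutual) correspondences
    return idx1[keep], nn12[keep]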
The pointmap X1,1 is expressed in image I1's coordinate frame. It is therefore possible to estimate the camera intrinsic parameters by solving an optimization problem based on the pointmap. In this disclosure, it is assumed that the principal point is approximately centered and pixels are squares, hence only the focal length f1* remains to be estimated:
f1* = arg min_{f1} Σ_{i=1..W} Σ_{j=1..H} C_{i,j}^{1,1} ∥(i′, j′) − f1 (X_{i,j,0}^{1,1}, X_{i,j,1}^{1,1})/X_{i,j,2}^{1,1}∥,
with i′=i−W/2 and j′=j−H/2 denoting centered pixel coordinates.
Fast iterative solvers, e.g., based on the Weiszfeld algorithm (see Frank Plastria, “The Weiszfeld Algorithm: Proof, Amendments, and Extensions” in Foundations of Location Analysis, pp. 357-389, Springer, 2011, which is incorporated herein in its entirety), can be used to find the focal length f1* in a few iterations. For the focal length f2* of the second camera, an option is to perform the inference for the image pair (I2, I1) and use Equation B5 with pointmap X2,2 instead of pointmap X1,1.
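A sketch of such an iterative (Weiszfeld-style, iteratively re-weighted) estimate of the focal length from a camera-frame pointmap follows; the initialization, iteration count, and confidence weighting are illustrative assumptions.

import numpy as np

def estimate_focal(X, conf=None, iters=10):
    """Estimate the focal length from a camera-frame pointmap X of shape (H, W, 3),
    assuming a centered principal point and square pixels."""
    H, W = X.shape[:2]
    u, v = np.meshgrid(np.arange(W) - W / 2.0, np.arange(H) - H / 2.0)
    pix = np.stack([u, v], axis=-1).reshape(-1, 2)                  # centered pixel coordinates (i', j')
    xy = (X[..., :2] / X[..., 2:3].clip(min=1e-6)).reshape(-1, 2)   # (X/Z, Y/Z) per pixel
    w = np.ones(len(pix)) if conf is None else conf.reshape(-1)     # optional confidence weights
    f = 1.0
    for _ in range(iters):                                          # iteratively re-weighted least squares
        r = np.linalg.norm(pix - f * xy, axis=1).clip(min=1e-6)     # current residual per pixel
        wi = w / r                                                  # Weiszfeld-style weights ~ 1 / residual
        f = (wi * (xy * pix).sum(1)).sum() / (wi * (xy * xy).sum(1)).sum()
    return f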
Relative pose estimation can be achieved in several ways. One way is to perform 2D matching and recover intrinsics as described above, then estimate the Epipolar matrix and recover the relative pose. Another, more direct, way is to compare the pointmaps X1,1↔X1,2 (or, equivalently, X2,2↔X1,2) using Procrustes alignment (see Luo et al., "Procrustes alignment with the EM algorithm", in CAIP, vol. 1689 of Lecture Notes in Computer Science, pp. 623-631, Springer, 1999, which is incorporated herein in its entirety) to determine the relative pose P*=[R*|t*]:
R*, t* = arg min_{σ, R, t} Σ_i C_i^{1,1} C_i^{1,2} ∥σ(R X_i^{1,1} + t) − X_i^{1,2}∥²,
which can be achieved in closed-form. Procrustes alignment may be sensitive to noise and outliers. Another solution is to use RANSAC (Random Sample Consensus) with PnP (Perspective-n-Point), i.e., PnP-RANSAC (see Fischler et al., “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography”, in Commun. ACM 24(6): 381-95, 1981 and Lepetit et al., “EPnP: An accurate O(n) solution to the PnP problem”, in IJCV, 2009, which is incorporated herein in its entirety).
Absolute pose estimation, also termed visual localization, can likewise be achieved in several different ways. Let IQ denote the query image and IB the reference image for which 2D to 3D correspondences are available. First, intrinsics for IQ can be estimated from pointmap XQ,Q as discussed above. One solution includes obtaining 2D correspondences between IQ and IB, which in turn yields 2D-3D correspondences for IQ, and then running PnP-RANSAC. Another solution is to determine the relative pose between IQ and IB as described previously. Then, this pose is converted to world coordinates by scaling it appropriately, according to the scale between XB,B and the ground-truth pointmap for IB. A pose module may determine pose as described herein.
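Where such 2D-3D correspondences are available, the PnP-RANSAC route can be sketched with OpenCV as follows; the reprojection threshold and dtype handling are illustrative assumptions.

import numpy as np
import cv2

def localize_query(pts3d_ref, pts2d_query, K_query):
    """Absolute pose of the query image from 2D-3D correspondences via PnP-RANSAC.
    pts3d_ref: (N, 3) scene points; pts2d_query: (N, 2) matched query pixels; K_query: (3, 3) intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d_ref.astype(np.float32), pts2d_query.astype(np.float32),
        K_query.astype(np.float32), None, reprojectionError=5.0)
    R, _ = cv2.Rodrigues(rvec)                 # rotation matrix of the world-to-camera pose
    return ok, R, tvec.reshape(3), inliers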
The network presented so far in this Section B.3 can handle a pair of images. Presented now is a fast and simple post-processing optimization for entire scenes that enables the alignment of pointmaps predicted from multiple (e.g., more than two) images into a joint 3D space (i.e., global aligner 407 shown in
Pairwise graph. Given a set of images {I1, I2, . . . , IN} for a given scene, first a connectivity graph G (V, E) is generated where the N images form vertices V and each edge e=(n,m)∈E indicates that images In and Im share some visual content. To that aim, either an image retrieval method is used, or all pairs are passed through network F and their overlap is measured based on the average confidence in both pairs, then low-confidence pairs are filtered out.
Global optimization. The disclosed embodiments use the connectivity graph G to recover globally aligned pointmaps {Xn∈RW×H×3} for all cameras n=1 . . . N. To that aim, for each image pair e=(n,m)∈E, the pairwise pointmaps Xn,n, Xm,n and their associated confidence maps Cn,n, Cm,n are first predicted. For the sake of clarity, let the following be defined as: Xn,e:=Xn,n and Xm,e:=Xm,n. Since the disclosed goal involves rotating all pairwise predictions in a common coordinate frame, a pairwise pose Pe∈R3×4 and scaling σe>0 associated with each pair e∈E are introduced. Then the following optimization problem may be formulated:
χ* = arg min_{χ, P, σ} Σ_{e∈E} Σ_{v∈e} Σ_{i=1..HW} C_i^{v,e} ∥χ_i^v − σ_e P_e X_i^{v,e}∥,
where v∈e for v∈{n,m} if e=(n,m). For a given image pair e, the same rigid transformation Pe should align both pointmaps χn,e and χm,e with the world-coordinate pointmaps χn and χm, since χn,e and χm,e are by definition both expressed in the same coordinate frame. To avoid the trivial optimum where σe=0, ∀e∈E, Πe σe=1 is enforced.
Recovering camera parameters. An extension to this framework enables the recovery of all camera parameters. By replacing χ^n_{i,j}:=P^{−1}_n h(K^{−1}_n [iD^n_{i,j}; jD^n_{i,j}; D^n_{i,j}]) (i.e., enforcing a standard camera pinhole model as in Equation B1), all camera poses {Pn}, associated intrinsics {Kn} and depthmaps {Dn} for n=1 . . . N can be estimated.
Discussion. Different from bundle adjustment, the global optimization embodiments are fast and simple to perform in practice. The disclosed examples are not minimizing 2D reprojection errors, as in bundle adjustment, but 3D projection errors. The optimization may be carried out by a training module using gradient descent and typically converges after a few hundred steps, requiring mere seconds on a standard GPU.
Section B.4 Experiments with DUSt3R
Training data. In one embodiment, the disclosed network is trained with a mixture of eight datasets: Habitat (see Savva et al., "Habitat: A Platform for Embodied AI Research" in ICCV, 2019), MegaDepth (see Li et al., "Megadepth: Learning single-view depth prediction from internet photos", in CVPR, pp. 2041-2050, 2018), ARKitScenes (see Dehghan et al., "ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data", in NeurIPS Datasets and Benchmarks, 2021), Static Scenes 3D (see Mayer et al., "A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation", in CVPR, 2016), Blended MVS (Yao et al., "BlendedMVS: A Large-Scale Dataset for Generalized Multi-View Stereo Networks", in CVPR, 2020), ScanNet++ (see Yeshwanth et al., "ScanNet++: A high-fidelity dataset of 3d in-door scenes", in ICCV 2023), CO3Dv2 (see Reizenstein et al., "Common Objects in 3D: Large-Scale Learning and Evaluation of Real-Life 3D Category Reconstruction", in ICCV, 2021), and Waymo (see Sun et al., "Scalability in Perception for Autonomous Driving: Waymo Open Dataset", in CVPR, 2020). These datasets feature diverse scene types: indoor, outdoor, synthetic, real-world, object-centric, etc. When image pairs are not directly provided with the dataset, they are extracted based on the CroCo method. Specifically, image retrieval and point matching algorithms may be utilized to match and verify image pairs. In one embodiment 8.5 M pairs in total were extracted.
Training details. The training described herein may be performed by the training module. During each epoch, an equal number of pairs are randomly sampled from each dataset to equalize disparities in dataset sizes. In an embodiment, relatively high-resolution images, for example 512 pixels in the largest dimension, are fed to the disclosed network. To mitigate the high cost associated with such input, the disclosed network is trained sequentially, first on 224×224 images and then on larger 512-pixel images. The image aspect ratios are randomly selected for each batch (e.g., 16/9, 4/3, etc.), so that at test time the disclosed network is familiar with different image shapes. Images are cropped to the target aspect-ratio, and resized so that the largest dimension is 512 pixels.
Data augmentation techniques and the training set-up detailed in Section B.6.5 are used. The disclosed network architecture comprises a ViT-Large encoder (see Dosovitskiy et al.), a ViT-Base decoder and a DPT head (see Ranftl et al., “Vision transformers for dense prediction,” in ICCV, 2021, which is referred to hereinafter as “DPT” or “DPT-KITTI”). Note that Section B.6.5 (below) sets forth additional details on the training and the network architecture. Before training, the network is initialized with the weights of a CroCo pretrained model. CroCo is a pretraining paradigm that has been shown to excel on various downstream 3D vision tasks and is thus suited to the disclosed framework. In Section B.4.6, the impact of CroCo pretraining and of the increase in image resolution is ablated.
Evaluation. In the remainder of this Section, DUSt3R is benchmarked on a representative set of classical 3D vision tasks, each time specifying the datasets and metrics and comparing performance with other approaches. All results are obtained with the same DUSt3R model (the disclosed default model is denoted as ‘DUSt3R 512’; other DUSt3R models serve for the ablations in Section B.4.6), i.e., the disclosed model is not finetuned on a particular downstream task. During testing, all test images are rescaled to 512 pixels while preserving their aspect ratio. Since there may exist different ‘routes’ to extract task-specific outputs from DUSt3R, as described in Sections B.3.3 and B.3.4, the route employed is noted each time.
Qualitative results. DUSt3R yields high-quality dense 3D reconstructions even in challenging situations. See Section B.6.1 for visualizations of pairwise and multi-view reconstructions.
Dataset and metrics. DUSt3R is evaluated in this Section for the task of absolute pose estimation on the 7Scenes (see Shotton et al., “Scene coordinate regression forests for camera relocalization in RGB-D images”, in CVPR, pp. 2930-2937, 2013) and Cambridge Landmarks (see Kendall et al., “PoseNet: a Convolutional Network for Real-Time 6-DOF Camera Relocalization”, in ICCV, 2015) datasets. 7Scenes contains 7 indoor scenes with RGB-D images from videos and their 6-DOF camera poses. Cambridge Landmarks contains 6 outdoor scenes with RGB images and their associated camera poses, which are obtained via SfM. The median translation and rotation errors (in cm and °, respectively) are reported.
Protocol and results. To compute camera poses in world coordinates, DUSt3R is used as a 2D-2D pixel matcher (see Section B.3.3) between a query and the most relevant database images obtained using the known image retrieval method APGeM (see Revaud et al., “Learning with average precision: Training image retrieval with a listwise loss,” in ICCV, 2019). In other words, the raw pointmaps output from F(IQ, IB) are used without any refinement, where IQ is the query image and IB is a database image. The top 20 retrieved images are used for Cambridge Landmarks and the top 1 for 7Scenes, and the query intrinsics are leveraged. For results obtained without using ground-truth intrinsic parameters, refer to Section B.6.4 (below).
Obtained results are compared against others in the table in
DUSt3R is evaluated in this Section on multi-view relative pose estimation after the global alignment from Section B.3.4.
Datasets. Following, two multi-view datasets, CO3Dv2 and RealEstate10 k (Zhou et al., “Stereo Magnification: Learning View Synthesis Using Multiplane Images”, in SIGGRAPH, 2018) are used for the evaluation. CO3Dv2 contains 6 million frames extracted from approximately 37 k videos, covering 51 MS-COCO categories. The ground-truth camera poses are annotated using COLMAP (see Schonberger et al., “Structure-from-motion revisited”, in CVPR, 2016, and Schonberger et al, Pixelwise view selection for unstructured multi-view stereo”, in ECCV, 2016, which are hereinafter referred to as “COLMAP”) from 200 frames in each video. RealEstate10 k is an indoor/outdoor dataset with 10 million frames from about 80K video clips, the camera poses being obtained by SLAM (Simultaneous Localization and Mapping) with bundle adjustment. The protocol introduced in PoseDiffusion (see Wang et al., “PoseDiffusion: Solving Pose Estimation via Diffusion-Aided Bundle Adjustment” in ICCV, 2023) is followed to evaluate DUSt3R on 41 categories from CO3Dv2 and 1.8K video clips from the test set of RealEstate10 k. For each sequence, 10 frames are randomly selected and all possible 45 pairs are fed to DUSt3R.
Baselines and metrics. DUSt3R pose estimation results, obtained either from PnP-RANSAC or from global alignment, are compared against the learning-based RelPose (see Zhang et al., “RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild”, in ECCV, 2022), PoseReg and PoseDiffusion, and against the structure-based PixSFM (see Lindenberger et al., “Pixel-Perfect Structure-from-Motion with Featuremetric Refinement,” in ICCV, pp. 5967-5977, 2021) and COLMAP+SPSG (COLMAP extended with SuperPoint (see DeTone et al., “SuperPoint: Self-supervised Interest Point Detection and Description,” in CVPR Workshops, pp. 224-236, 2018) and SuperGlue (see Sarlin et al., “SuperGlue: Learning Feature Matching with Graph Neural Networks,” in CVPR, pp. 4937-4946, 2020)). Similar to PoseReg, the Relative Rotation Accuracy (RRA) and Relative Translation Accuracy (RTA) are reported for each image pair to evaluate the relative pose error, and a threshold τ=15 is selected to report RTA@15 and RRA@15. Additionally, the mean Average Accuracy (mAA)@30 is calculated, defined as the area under the accuracy curve of the angular differences at min(RRA@30, RTA@30).
Results. As shown in tables in
For this monocular task, the same input image I is fed to the network as F(I, I). By design, depth prediction is the z coordinate in the predicted 3D pointmap.
Datasets and metrics. DUSt3R is benchmarked on two outdoor (DDAD (see Guizilini et al., “3D packing for self-supervised monocular depth estimation”, in CVPR, pp. 2482-2491, 2020), KITTI (see Geiger et al., “Vision meets robotics: The KITTI dataset”, in Int. J. Robotics Res., 32(11): 1231-1237, 2013)) and three indoor (NYUv2 (see Silberman et al., “Indoor segmentation and support inference from RGBD images”, in ECCV, pp. 746-760, 2012), BONN (see Palazzolo et al., “Refusion: 3d reconstruction in dynamic environments for RGB-D cameras exploiting residuals”, in IROS, 2019), TUM (see Sturm et al., “A benchmark for the evaluation of RGB-D SLAM systems”, in IEEE IROS, pp. 573-580, 2012)) datasets. DUSt3R's performance is compared to other methods categorized into supervised, self-supervised and zero-shot settings, this last category corresponding to DUSt3R. Two metrics commonly used in monocular depth evaluation are used: the absolute relative error $\mathrm{AbsRel} = |y - \hat{y}|/y$ between target depth y and prediction ŷ (averaged over valid pixels), and the prediction threshold accuracy $\delta_{1.25}$, i.e., the fraction of pixels for which $\max(\hat{y}/y,\, y/\hat{y}) < 1.25$.
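By way of illustration only, a minimal Python sketch of these two monocular depth metrics is set forth below; the function name, the per-image median rescaling used for scale-invariant predictions, and the assumption that only valid pixels are passed in are choices of this sketch and not requirements of the disclosed systems and methods.

import numpy as np

def depth_metrics(gt, pred, align_scale=True):
    """Compute AbsRel and the delta<1.25 threshold accuracy over valid pixels.

    gt, pred: 1D arrays of ground-truth and predicted depths (valid pixels only).
    align_scale: if True, rescale predictions by the ratio of medians, since
    scale-invariant predictions are only defined up to a scale factor.
    """
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    if align_scale:
        pred = pred * (np.median(gt) / np.median(pred))
    abs_rel = np.mean(np.abs(gt - pred) / gt)        # AbsRel
    ratio = np.maximum(pred / gt, gt / pred)         # per-pixel max ratio
    delta_1 = np.mean(ratio < 1.25)                  # fraction of pixels under 1.25
    return abs_rel, delta_1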
Results. In the zero-shot setting, SlowTv (see Spencer et al., “Kick back & relax: Learning to reconstruct the world by watching slowtv”, in ICCV, 2023) performs relatively well. This approach collected a large mixture of curated datasets with urban, natural, synthetic and indoor scenes, and trained one common model. For every dataset in the mixture, camera parameters are known or estimated with COLMAP. As the tables in
DUSt3R is evaluated for the task of multi-view stereo depth estimation. Depthmaps are extracted as the z-coordinate of the predicted pointmaps. In the case where multiple depthmaps are available for the same image, all predictions are rescaled to align them together and then aggregated via an averaging weighted by the confidence.
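By way of illustration only, a minimal Python sketch of this confidence-weighted aggregation is set forth below; aligning scales via the ratio of medians to the first prediction is an assumption of the sketch, and any robust scale estimate could be substituted.

import numpy as np

def aggregate_depths(depth_preds, conf_preds):
    """Fuse several per-image depth predictions (H, W) into one depthmap.

    Each prediction is first rescaled to a common scale (here, the scale of the
    first prediction, via the ratio of medians), then averaged per pixel with
    weights given by the confidence maps.
    """
    ref_med = np.median(depth_preds[0])
    num = np.zeros_like(depth_preds[0])
    den = np.zeros_like(depth_preds[0])
    for depth, conf in zip(depth_preds, conf_preds):
        scaled = depth * (ref_med / np.median(depth))  # align scales
        num += conf * scaled
        den += conf
    return num / np.clip(den, 1e-8, None)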
Datasets and metrics. Following Schroppel et al. (in “A benchmark and a baseline for robust multi-view depth estimation”, in 3DV, pp. 637-645, 2022), evaluation is performed on the DTU, ETH3D, Tanks and Temples, and ScanNet (see Dai et al., “ScanNet: Richly-annotated 3d reconstructions of indoor scenes”, in CVPR, 2017) datasets. The Absolute Relative Error (rel) and the Inlier Ratio (τ) with a threshold of 1.03 are reported on each test set, along with the averages across all test sets. Note that neither the ground-truth camera parameters and poses nor the ground-truth depth ranges are leveraged, so the predictions herein are only valid up to a scale factor. In order to perform quantitative measurements, predictions are normalized using the medians of the predicted depths and of the ground-truth ones, as advocated by Schroppel et al.
Results. In the table in
Finally, the quality of the disclosed full reconstructions obtained after the global alignment procedure described in Section B.3.4 is measured. Again, it is emphasized that the disclosed systems and methods are the first to enable global unconstrained MVS, in the sense that there is no prior knowledge regarding the camera intrinsic and extrinsic parameters. In order to quantify the quality of the disclosed reconstructions, the predictions are aligned to the ground-truth coordinate system. This is done by fixing the known parameters as constants in the procedure of Section B.3.4. This leads to consistent 3D reconstructions expressed in the coordinate system of the ground-truth.
Datasets and metrics. The disclosed predictions are evaluated on the DTU dataset. The disclosed network is applied in a zero-shot setting, i.e., the disclosed model is applied as is, without performing any finetuning on the DTU training set. The table in
Results. Other methods all leverage GT (Ground Truth) poses and train specifically on the DTU training set whenever applicable. Furthermore, results on this task are usually obtained via sub-pixel accurate triangulation, requiring the use of explicit camera parameters, whereas the disclosed systems and methods use regression. Yet, without prior knowledge about the cameras, an average accuracy of 2.7 mm is reached, with a completeness of 0.8 mm, for an overall average distance of 1.7 mm. This level of accuracy is of great use in practice, considering the plug-and-play nature of the disclosed systems and methods.
The impact of the CroCo pretraining and image resolution on DUSt3R's performance was ablated. Results are set forth in the tables in
A novel paradigm has been presented to solve not only 3D reconstruction in the wild, without prior information about the scene or the cameras, but a whole variety of 3D vision tasks as well.
This Section provides additional details and qualitative results of DUSt3R. First, Section B.6.1 presents qualitative pairwise predictions of the presented architecture on challenging real-life datasets. Extended related works are set forth in Section B.6.2, encompassing a wider range of methodological families and geometric vision tasks. Section B.6.3 provides auxiliary ablative results on multi-view pose estimation, that are not set out in Section B.4. Then results are reported in Section B.6.4 on an experimental visual localization task, where the camera intrinsics are unknown. Finally, training and data augmentation procedures are detailed in Section B.6.5.
Point-cloud visualizations. Some visualization of DUSt3R's pairwise results are presented in
Note the scenes in
Section B.2 covered some other works. Because this work covers a large variety of geometric tasks, Section B.2 is completed in this Section with additional topics.
Implicit Camera Models. The disclosed systems and methods may not explicitly output camera parameters. Likewise, there are works aiming to express 3D shapes in a canonical space that is not directly related to the input viewpoint. Shapes can be stored as occupancy in regular grids, octree structures, collections of parametric surface elements, point cloud encoders, free-form deformations of template meshes, or per-view depthmaps. While these approaches arguably perform classification and not actual 3D reconstruction, all in all, they work only in very constrained setups, usually on ShapeNet (see Chang et al., “ShapeNet: An Information-Rich 3D Model Repository”, in arXiv:1512.03012, 2015), and have trouble generalizing to natural scenes with non object-centric views. The question of how to express a complex scene with several object instances in a single canonical frame had yet to be answered: in this disclosure, the reconstruction is expressed in a canonical reference frame, but due to the disclosed scene representation (pointmaps), a relationship is preserved between image pixels and the 3D space, and thus 3D reconstruction may be performed consistently.
Dense Visual SLAM. In visual SLAM, 3D reconstruction and ego-motion estimation may use active depth sensors. Dense visual SLAM from RGB video stream may be able to produce high-quality depth maps and camera trajectories, but they inherit the traditional limitations of SLAM, e.g., noisy predictions, drifts and outliers in the pixel correspondences. To make the 3D reconstruction more robust, R3D3 (see Schmied et al., “R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras”, in arXiv: 2308.14713, 2023) jointly leverages multi-camera constraints and monocular depth cues. Most recently, GO-SLAM (see Zhang et al., “GO-SLAM: Global optimization for consistent 3d instant reconstruction”, in ICCV, pp. 3727-3737 October 2023) proposed real-time global pose optimization by considering the complete history of input frames and continuously aligning all poses that enables instantaneous loop closures and correction of global structure. Still, all SLAM methods assume that the input consists of a sequence of closely related images, e.g., with identical intrinsics, nearby camera poses and small illumination variations. In comparison, the disclosed systems and methods handle completely unconstrained image collections.
3D reconstruction from implicit models has undergone advancements, such as by the integration of neural networks. Multi-Layer Perceptrons (MLP) may be used to generate continuous surface outputs with only posed RGB images. Others involve density-based volume rendering to represent scenes as continuous 5D functions for both occupancy and color, showing ability in synthesizing novel views of complex scenes. To handle large-scale scenes, geometry priors to the implicit model may be used, leading to much more detailed reconstructions. In contrast to the implicit 3D reconstruction, this disclosure focuses on the explicit 3D reconstruction and showcases that DUSt3R can not only have detailed 3D reconstruction but also provide rich geometry for multiple downstream 3D tasks.
RGB-pairs-to-3D takes its roots in two-view geometry and may be considered as a stand-alone task or as an intermediate step towards multi-view reconstruction. This process may involve estimating a dense depth map and determining the relative camera pose from two different views. This problem may be formulated either as pose and monocular depth regression or as pose and stereo matching. A goal is to achieve 3D reconstruction from the predicted geometry. In addition to reconstruction tasks, learning from two views also gives an advance in unsupervised pretraining; CroCo introduces a pretext task of cross-view completion from a large set of image pairs to learn 3D geometry from unlabeled data and to apply this learned implicit representation to various downstream 3D vision tasks. Instead of focusing on model pretraining, the systems and methods herein leverage this pipeline to directly generate 3D pointmaps from the image pair. In this context, the depth map and camera poses are only by-products of the disclosed pipeline.
Additional results are included for the multi-view pose estimation task from Section B.4.2. Namely, the pose accuracy is computed for a smaller number of input images (they are randomly selected from the entire test sequences). The table in
An example of reconstruction on RealEstate10K is shown in
Additional results of visual localization on the 7-Scenes and Cambridge-Landmarks datasets are included herein. Namely, experiments were performed with a scenario where the focal parameter of the querying camera is unknown. In this case, the query image and a database image are input into DUSt3R, and an un-scaled 3D reconstruction is output. The resulting pointmap is then scaled according to the ground-truth pointmap of the database image, and the pose is extracted as described in Section B.3.3. The table in
Ground-truth pointmaps. Ground-truth pointmaps $\hat{X}^{1,1}$ and $\hat{X}^{2,1}$ are obtained from the depthmaps $D^1, D^2$, the camera intrinsics $K_1, K_2$ and the world-to-camera poses $P_1, P_2$ as $\hat{X}^{1,1} = K_1^{-1}\,[\,U\cdot D^1;\; V\cdot D^1;\; D^1\,]$ and $\hat{X}^{2,1} = P_1 P_2^{-1}\, h\!\left(K_2^{-1}\,[\,U\cdot D^2;\; V\cdot D^2;\; D^2\,]\right)$, where X·Y denotes element-wise multiplication, $U, V \in \mathbb{R}^{W\times H}$ are the x, y pixel coordinate grids and h is the mapping to homogeneous coordinates (see Equation B1 in Section B.3).
Relation between depthmaps and pointmaps. As a result, the depth value $D^1_{i,j}$ at pixel (i, j) in image $I^1$ can be recovered as the z-coordinate $D^1_{i,j} = X^{1,1}_{i,j,2}$. Therefore, the depthmaps set forth in Section B are extracted from DUSt3R's output as $X^{1,1}_{:,:,2}$ and $X^{2,2}_{:,:,2}$ for images $I^1$ and $I^2$, respectively.
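By way of illustration only, the relations above may be sketched in Python as follows, assuming world-to-camera pose matrices P1, P2 of size 4×4, intrinsics K of size 3×3 and depthmaps stored as (H, W) arrays; the function names are illustrative only.

import numpy as np

def unproject_depth(depth, K):
    """Pointmap of an image in its own camera frame: X = K^{-1} [U*D; V*D; D]."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))           # pixel coordinate grids
    pix = np.stack([u * depth, v * depth, depth], axis=-1)    # (H, W, 3)
    return pix @ np.linalg.inv(K).T                           # apply K^{-1} per pixel

def pointmap_in_other_view(depth2, K2, P1, P2):
    """Pointmap of image 2 expressed in camera 1's frame (world-to-camera poses P)."""
    X22 = unproject_depth(depth2, K2)                         # camera-2 frame
    Xh = np.concatenate([X22, np.ones_like(X22[..., :1])], axis=-1)  # homogeneous
    T = P1 @ np.linalg.inv(P2)                                # camera-2 -> camera-1
    return (Xh @ T.T)[..., :3]

# The depthmap is recovered from a pointmap as its z-coordinate:
# depth = pointmap[..., 2]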
Dataset mixture. DUSt3R may be trained with a mixture of eight datasets: Habitat, ARKitScenes, MegaDepth, Static Scenes 3D, Blended MVS, ScanNet++, CO3Dv2 and Waymo. These datasets feature diverse scene types: indoor, outdoor, synthetic, real-world, object-centric, etc. Table B-1 shows the number of extracted pairs in each dataset (i.e., the dataset mixture and sample sizes for DUSt3R training), which amounts to 8.5 M in total.
Data augmentation. Data augmentation techniques may be used for the training, such as random color jittering and random center crops, the latter being a form of focal augmentation. Indeed, some datasets are captured using a single or a small number of camera devices, hence many images have practically the same intrinsic parameters. Centered random cropping thus helps in generating more focal lengths. Crops may also be centered so that the principal point is centered in the training pairs. At test time, little impact is observed on the results when the principal point is not exactly centered. During training, each training pair (I1, I2) as well as its inversion (I2, I1) are systematically fed to help generalization. Naturally, tokens from these two pairs do not interact.
The detailed hyperparameter settings used for training DUSt3R are reported in the table in
Image matching is a component of algorithms and pipelines in 3D vision. Yet despite matching being a 3D problem, intrinsically linked to camera pose and scene geometry, it may be treated as a 2D problem. This makes sense as the goal of matching is to establish correspondences between 2D pixel fields.
This disclosure takes a different stance and proposes to cast matching as a 3D task. This Section discloses another example of the DUSt3R architecture, referred to as the MASt3R architecture. Based on pointmap regression, the DUSt3R architecture displays impressive robustness when matching views with extreme viewpoint changes, yet with limited accuracy. The MASt3R architecture aims to improve the matching capabilities of such an approach while preserving its robustness. More specifically, the MASt3R architecture augments the DUSt3R network with a new head that outputs dense local features (local descriptors), trained with an additional matching loss. Further, the MASt3R architecture addresses the issue of the quadratic complexity of dense matching, which may become prohibitively slow for downstream applications if not treated carefully. In addition, the MASt3R architecture uses a fast reciprocal matching scheme that not only accelerates matching by orders of magnitude, but also comes with theoretical guarantees and, lastly, yields improved results. Extensive experiments show that the MASt3R architecture outperforms others on multiple matching tasks. In particular, it outperforms other methods by 30% (absolute improvement) in Virtual Correspondence Reprojection Error (VCRE) area under the curve (AUC) on the extremely challenging Map-free localization dataset.
Being able to establish correspondences between pixels across different images of the same scene, denoted as image matching, is a component of all 3D vision applications, spanning mapping, localization, navigation, photogrammetry and autonomous robotics/navigation, and others. Other methods for visual localization, for instance, overwhelmingly rely upon image matching during the offline mapping stage, e.g., using COLMAP, as well as during the online localization step, typically using PnP. This example focuses on this core task and aims at producing, given two images, a list of pairwise correspondences, denoted as matches. In particular, this example seeks to output highly accurate and dense matches that are robust to viewpoint and illumination changes because these are, in the end, the limiting factor for real-world applications.
Matching methods may be cast into a three-step pipeline including first extracting sparse and repeatable keypoints, then describing them with locally invariant features, and finally pairing the discrete set of keypoints by comparing their distance in the feature space. This pipeline has merits: keypoint detectors are precise under low-to-moderate illumination and viewpoint changes, and the sparsity of keypoints makes the problem computationally tractable, enabling very precise matching whenever the images are viewed under similar conditions. This explains the success and persistence of methods like SIFT (see David G. Lowe, “Distinctive image features from scale-invariant keypoints”, in IJCV, 2004) in 3D reconstruction pipelines like COLMAP.
Unfortunately, keypoint-based methods, by reducing matching to a bag-of-keypoints problem, discard the global geometric context of the correspondence task. This makes them especially prone to errors in situations with repetitive patterns or low-texture areas, which are in fact ill-posed for local descriptors. One way to remedy this is to introduce a global optimization strategy during the pairing step, typically leveraging some learned priors about matching. However, leveraging global context during matching might come too late if the keypoints and their descriptors do not already encode enough information. For this reason, another direction is to consider dense holistic matching, i.e., avoiding keypoints altogether and matching the entire image at once. Images are considered as a whole and the resulting set of correspondences is dense and more robust to repetitive patterns and low-texture areas. This provides positive results on benchmarks, such as the Map-free localization benchmark (see Arnold et al., “Map-Free Visual Relocalization: Metric Pose Relative to a Single Image”, in ECCV, 2022).
Nevertheless, some methods score a relatively disappointing VCRE precision of 34% on the Map-free localization benchmark. This may be because matching is treated as a 2D problem in image space. The matching task is intrinsically and fundamentally a 3D problem: pixels that correspond are pixels that observe the same 3D point. Indeed, 2D pixel correspondences and the relative camera pose in 3D space are two sides of the same coin, as they are directly related by the epipolar matrix. The fact that DUSt3R, which performs 3D reconstruction rather than matching and for which matches are only a by-product of the 3D reconstruction, yields correspondences that, even when obtained naively from its 3D output, outperform other keypoint- and matching-based methods on the Map-free benchmark provides additional evidence that this reasoning is correct.
While the DUSt3R architecture may be used for matching and is extremely robust to viewpoint changes, its correspondences are relatively imprecise. To remedy this limitation, the MASt3R architecture attaches a second head that regresses dense local feature maps, which is trained with an InfoNCE loss (see Oord et al., “Representation Learning with Contrastive Predictive Coding”, arxiv.org/abs/1807.03748, 2018, which is incorporated herein in its entirety). A coarse-to-fine matching scheme is also used, during which matching is performed at several scales to obtain pixel-accurate matches. Each matching step involves extracting reciprocal matches from dense feature maps, which may be more time consuming than computing the dense feature maps themselves. The solution set forth in this Section is an algorithm for finding reciprocal matches that is almost two orders of magnitude faster while improving pose estimation quality.
To summarize, this Section C sets forth in detail the MASt3R architecture, a 3D-aware matching approach building on the DUSt3R architecture, that outputs local feature maps, which advantageously enable accurate and robust matching, and which advantageously outperforms other methods on several absolute and relative pose localization benchmarks. In addition, this Section sets forth a coarse-to-fine matching scheme associated with a fast-matching algorithm that may be used with high-resolution images.
Keypoint-based matching has been an important feature of computer vision. Matching is carried out in three stages: keypoint detection, locally invariant description, and nearest-neighbor search in descriptor space. Departing from handcrafted methods like SIFT (see Lowe 2004), learning-based, data-driven schemes may be used for detecting keypoints, for describing them, or for both at the same time. Keypoint-based approaches remain widely used, underscoring their enduring value in tasks requiring high precision and speed. One notable issue, however, is that they reduce matching to a local problem, discarding its holistic nature. Global reasoning may be introduced in the last pairing step, leveraging stronger priors to guide matching, yet leaving the detection and description local. While successful, this is still limited by the local nature of keypoints and their inability to remain invariant to strong viewpoint changes.
Dense matching. In contrast to keypoint-based approaches, semi-dense and dense approaches offer a different paradigm for establishing image correspondences, considering all possible pixel associations. Coarse-to-fine schemes may be used to decrease the computational complexity. Matching may then be considered from a global perspective, at the cost of increased computational resources. Dense matching may be effective where detailed spatial relationships and textures are helpful for understanding scene geometry, leading to high performance on benchmarks that are especially challenging for keypoints, such as those with extreme changes in viewpoint or illumination. Matching is still cast as a 2D problem, however, which limits its usage for visual localization.
Camera Pose estimation techniques vary, but may be based on pixel matching. Camera pose estimation benchmarks include Aachen Day-Night (see Zhang et al., “Reference Pose Generation for Long-Term Visual Localization via Learned Features and View Synthesis”, in IJCV, 2021), InLoc (see Taira et al., “InLoc: Indoor Visual Localization with Dense Matching and View Synthesis”, in PAMI, 2019), CO3D (an earlier version of CO3Dv2) or Map-free, all featuring strong viewpoint and/or illumination changes. Another benchmark is Map-free (see Arnold et al.), a localization dataset for which a single reference image is provided but no map, with viewpoint changes up to 180°.
Grounding matching in 3D thus becomes important in challenging conditions where classical 2D-based matching utterly falls short. Leveraging priors about the physical properties of the scene in order to improve accuracy or robustness has been widely explored in the past, but most previous works settle for leveraging epipolar constraints for semi-supervised learning of correspondences without any fundamental change. Toft et al. (“Single-Image Depth Prediction Makes Feature Matching Easier”, in ECCV, 2020) proposes to improve keypoint descriptors by rectifying images with perspective transformations obtained from a known monocular depth predictor. Diffusion for pose or rays, although not matching approaches strictly speaking, show promising performance by incorporating 3D geometric constraints into their pose estimation formulation.
Given two images I1 and I2, respectively captured by two cameras C1 and C2 with unknown parameters, a set of pixel correspondences {(i, j)} are recovered where i, j are pixels i=(ui, vi),j=(uj, vj)∈{1, . . . , W}×{1, . . . , H}, with W, H being the respective width and height of the images. It is assumed for the purpose of explanation that the images have the same resolution for the sake of simplicity, yet without loss of generality, namely that the MASt3R network may handle image pairs of variable aspect ratios.
An example embodiment of the MASt3R architecture, illustrated in
The DUSt3R architecture, which is discussed in detail in Sections A.3 and B, jointly solves the calibration and 3D reconstruction problems from images alone. A transformer-based network predicts a local 3D reconstruction given two input images, in the form of two dense 3D point-clouds $X^{1,1}$ and $X^{2,1}$, denoted as pointmaps, where a pointmap $X^{a,b} \in \mathbb{R}^{H\times W\times 3}$ represents a dense 2D-to-3D mapping between each pixel i=(u, v) of the image $I^a$ and its corresponding 3D point $X^{a,b}_{u,v} \in \mathbb{R}^3$ expressed in the coordinate system of camera $C^b$. By regressing two pointmaps $X^{1,1}$, $X^{2,1}$ expressed in the same coordinate system of camera $C^1$, the DUSt3R architecture effectively solves the joint calibration and 3D reconstruction problem. In the case where more than two images are provided, a second step of global alignment merges all pointmaps into the same coordinate system. Note that this step is not used here, given that the example discussed in this Section is limited to the binocular case. Inference for the binocular case is now explained in greater detail with reference to
Both images 1202 are first encoded in a Siamese manner with a ViT encoder 1204 (see Dosovitskiy et al.), yielding two representations H1 and H2:
Then, two intertwined decoders 1206 process these representations jointly, exchanging information via cross-attention to ‘understand’ the spatial relationship between viewpoints and the global 3D geometry of the scene. The new representations augmented with this spatial information are denoted as H′1 and H′2:
Finally, two prediction heads 1208 regress the final pointmaps X and confidence maps C from the concatenated representations output by the encoder and decoder:
Regression loss. The DUSt3R network is trained in a fully-supervised manner using a regression loss:
where v∈{1,2} is the view and i is a pixel for which the ground-truth 3D point $\hat{X}^{v,1}_i \in \mathbb{R}^3$ is defined. In one formulation, normalizing factors z, $\hat{z}$ may be introduced to make the reconstruction invariant to scale. These are simply defined as the mean distance of all valid 3D points to the origin.
Scale-dependent predictions. An alternative to scale-invariant predictions is also provided, as some potential use-cases like map-free visual localization involve scale-dependent (e.g., metric-scale) predictions. In this alternate example, the regression loss may be modified to ignore normalization for the predicted pointmaps when the ground-truth pointmaps are scale-dependent.
That is, in the case of scale dependence, the normalizing factors are set such that $z := \hat{z}$ whenever the ground-truth is scale-dependent (e.g., metric-scale), so that:
As in Section B.3.2 discussing the DUSt3R architecture, the final confidence-aware regression loss $\mathcal{L}_{conf}$ is defined as:
To obtain reliable pixel correspondences from pointmaps, a solution is to look for reciprocal matches in some invariant feature space (see Wu, Sankaranarayanan, and Chellappa, “In Situ Evaluation of Tracking Algorithms Using Time Reversed Chains”, in CVPR 2007). While such a scheme works well with the DUSt3R architecture's regressed pointmaps (i.e., in a 3-dimensional space) even in presence of extreme viewpoint changes, the resulting correspondences are imprecise, yielding suboptimal accuracy. This may be a natural result (i) as regression is inherently affected by noise, and (ii) because the DUSt3R architecture was never explicitly trained for matching.
Matching head. For these reasons, a second descriptor head 1210 is added that outputs two dense feature maps $D^1$ and $D^2 \in \mathbb{R}^{H\times W\times d}$ of dimension d:
In one embodiment, the descriptor head 1210 is a 2-layer MLP interleaved with a non-linear Gaussian Error Linear Unit (GELU) activation function (see Hendrycks et al., “Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units”, in arXiv, 1606.08415, 2016, which is incorporated herein in its entirety), and each local feature is normalized to unit norm.
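By way of illustration only, a PyTorch sketch of such a descriptor head is set forth below; the hidden width and the per-pixel (B, H·W, C) feature layout are assumptions of the sketch rather than details prescribed by this disclosure.

import torch.nn as nn
import torch.nn.functional as F

class DescriptorHead(nn.Module):
    """Illustrative 2-layer MLP with a GELU non-linearity that maps per-pixel
    decoder features to d-dimensional local descriptors, normalized to unit norm."""

    def __init__(self, in_dim, hidden_dim=256, d=24):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, d),
        )

    def forward(self, feats):              # feats: (B, H*W, in_dim) per-pixel features
        desc = self.mlp(feats)             # (B, H*W, d)
        return F.normalize(desc, dim=-1)   # unit-norm local descriptors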
Matching objective. A matching objective is to encourage each local descriptor from one image to match with at most a single descriptor from the other image that represents the same 3D point in the scene. To this aim, the InfoNCE loss (see Oord et al.) is leveraged over the set of ground-truth correspondences $\mathcal{M} = \{(i, j) \mid \hat{X}^{1,1}_i = \hat{X}^{2,1}_j\}$ to define the matching loss $\mathcal{L}_{match}$ (Equation C11). Here, $\mathcal{P}^1 = \{i \mid (i, j) \in \mathcal{M}\}$ and $\mathcal{P}^2 = \{j \mid (i, j) \in \mathcal{M}\}$ denote the subsets of considered pixels in each image and τ is a temperature hyper-parameter. Note that this matching objective is essentially a cross-entropy classification loss: contrary to the regression in Equation C6, the network is only rewarded if it gets the correct pixel right, not a nearby pixel. This strongly encourages the network to achieve high-precision matching. Finally, both the regression loss $\mathcal{L}_{conf}$ and the matching loss $\mathcal{L}_{match}$ are combined to get the final training objective $\mathcal{L}_{total} = \mathcal{L}_{conf} + \beta\,\mathcal{L}_{match}$, where β is a hyperparameter that balances the two losses.
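By way of illustration only, a PyTorch sketch of an InfoNCE-style matching objective over ground-truth correspondences is set forth below; it illustrates the cross-entropy view described above, is not the exact loss of Equation C11, and the temperature value is an assumption of the sketch.

import torch.nn.functional as F

def matching_loss(desc1, desc2, corr, tau=0.07):
    """Symmetric InfoNCE-style matching loss over ground-truth correspondences.

    desc1, desc2: (N1, d) and (N2, d) unit-normalized descriptors of the pixels
    retained in each image (the sets P1 and P2). corr: (M, 2) long tensor of
    index pairs (i, j) into desc1/desc2 that observe the same 3D point.
    """
    sim = desc1 @ desc2.t() / tau             # (N1, N2) similarity matrix
    i, j = corr[:, 0], corr[:, 1]
    # each correspondence is a classification problem in both matching directions
    loss_12 = F.cross_entropy(sim[i], j)      # pixel i of image 1 -> pixel j of image 2
    loss_21 = F.cross_entropy(sim.t()[j], i)  # and vice versa
    return loss_12 + loss_21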
Given two predicted feature maps $D^1, D^2 \in \mathbb{R}^{H\times W\times d}$ (or pointmaps, or combinations thereof), a set of reliable pixel correspondences is extracted by the matching modules 1212 to perform tasks 1215 such as geometrical matching 1215a and feature-based matching 1215b, i.e., pixels that are mutual nearest neighbors of each other: $\mathcal{M} = \{(i, j) \mid j = \mathrm{NN}_2(D^1_i)\ \mathrm{and}\ i = \mathrm{NN}_1(D^2_j)\}$.
A naive implementation of reciprocal matching has a high computational complexity of $O(W^2H^2)$, since every pixel from one image must be compared to every pixel in the other image. While optimizing the nearest-neighbor (NN) search is possible (e.g., using K-D trees), this kind of optimization may become inefficient in high-dimensional feature spaces and, in all cases, remains orders of magnitude slower than the inference time of the MASt3R architecture needed to output $D^1$ and $D^2$.
Fast matching. In one embodiment, a fast matching approach based on subsampling may be performed by matching modules 1212. This embodiment is based on an iterated process that starts from an initial sparse set of k pixels U0={Un0}n=1k typically sampled regularly on a grid in the first image I1. Each pixel is then mapped to its NN on the second I2, yielding V1, and the resulting pixels are mapped back again to the first I1 in the same way:
The set of reciprocal matches, i.e., those which form a cycle, $\mathcal{M}^t_k = \{(U^t_n, V^t_n) \mid U^t_n = U^{t+1}_n\}$, is then collected. For the next iteration, pixels that already converged are filtered out, i.e., $U^{t+1} := U^{t+1} \setminus U^t_k$, and the process is iterated until (near-)convergence. The output set of correspondences is the union $\mathcal{M}_k = \bigcup_t \mathcal{M}^t_k$ of the reciprocal matches collected at each iteration, which is bounded in size by $|\mathcal{M}_k| \le k$. On the Map-free dataset, the performance-versus-time trade-off shows that performance actually improves, along with matching speed, when performing moderate levels of subsampling.
Theoretical. The overall complexity of the fast matching is $O(kWH)$, which is $WH/k$ times faster than the naive approach (denoted all). The study of the convergence guarantees of this algorithm, and of how it evinces outlier-filtering properties, explains why the end accuracy is actually higher than when using the full correspondence set $\mathcal{M}$.
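By way of illustration only, a Python sketch of the iterated reciprocal matching described above is set forth below; brute-force nearest-neighbor search and a simple seeding scheme are used for clarity, whereas an optimized index would be used in practice.

import numpy as np

def fast_reciprocal_matches(D1, D2, k=3000, max_iters=10):
    """Fast reciprocal matching by iterated nearest-neighbor ping-pong (sketch).

    D1, D2: (N1, d) and (N2, d) descriptor arrays (flattened feature maps).
    Starts from k seed pixels in image 1, maps them to their NN in image 2 and
    back, keeps the pairs that form a cycle, and removes converged seeds before
    the next iteration.
    """
    def nn(query, base):                        # index of nearest base row per query row
        d2 = ((query[:, None, :] - base[None, :, :]) ** 2).sum(-1)  # brute force for clarity
        return d2.argmin(axis=1)

    U = np.linspace(0, len(D1) - 1, num=min(k, len(D1)), dtype=int)  # seed pixels
    matches = []
    for _ in range(max_iters):
        if len(U) == 0:
            break
        V = nn(D1[U], D2)                       # image 1 -> image 2
        U_back = nn(D2[V], D1)                  # image 2 -> image 1
        converged = (U_back == U)               # cycle detected: reciprocal match
        matches += list(zip(U[converged], V[converged]))
        U = np.unique(U_back[~converged])       # continue from non-converged pixels
    return matches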
Due to the quadratic complexity of attention with respect to the input image area (W×H), an embodiment of the MASt3R architecture handles images of 512 pixels in their largest dimension. Larger images would require more compute power to train on, and ViTs may not yet generalize to larger test-time resolutions. As a result, high-resolution images (e.g., 1 Mpixel) may be downscaled to be matched, after which the resulting correspondences are upscaled back to the original image resolution. This can lead to some performance loss, sometimes sufficient to cause degradation in terms of localization accuracy or reconstruction quality.
Coarse-to-fine matching is a technique to preserve the benefit of matching high-resolution images with a lower-resolution algorithm, and may be used with the MASt3R architecture. The procedure starts by performing matching on downscaled versions of the two images. The set of coarse correspondences obtained with subsampling k is denoted as $\mathcal{M}^0_k$. Next, a grid of overlapping window crops $W^1$ and $W^2 \in \mathbb{R}^{w\times 4}$ is generated on each full-resolution image independently. Each window crop measures 512 pixels in its largest dimension and contiguous windows overlap by 50%. Then, enumerating the set of all window pairs $(w_1, w_2) \in W^1 \times W^2$, a subset is selected that covers most of the coarse correspondences $\mathcal{M}^0_k$. Specifically, window pairs are added one by one in a greedy fashion until 90% of the correspondences are covered. Finally, matching for each window pair is performed independently:
Correspondences obtained from each window pair are finally mapped back to the original image coordinates and concatenated, thus providing dense full-resolution matches.
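By way of illustration only, a Python sketch of the window generation and greedy window-pair selection described above is set forth below; the data layout of the coarse correspondences is an assumption of the sketch.

import numpy as np

def make_windows(W, H, size=512, overlap=0.5):
    """Grid of overlapping crops (x0, y0, x1, y1) over a full-resolution image."""
    step = int(size * (1 - overlap))
    xs = list(range(0, max(W - size, 0) + 1, step)) or [0]
    ys = list(range(0, max(H - size, 0) + 1, step)) or [0]
    return [(x, y, min(x + size, W), min(y + size, H)) for y in ys for x in xs]

def greedy_window_pairs(corr1, corr2, wins1, wins2, coverage=0.9):
    """Greedily pick window pairs until the requested fraction of coarse
    correspondences is covered. corr1/corr2: (M, 2) pixel coordinates of the
    coarse matches in image 1 / image 2."""
    def inside(pts, w):
        return (pts[:, 0] >= w[0]) & (pts[:, 1] >= w[1]) & (pts[:, 0] < w[2]) & (pts[:, 1] < w[3])

    covered = np.zeros(len(corr1), dtype=bool)
    chosen = []
    masks = {(a, b): inside(corr1, w1) & inside(corr2, w2)
             for a, w1 in enumerate(wins1) for b, w2 in enumerate(wins2)}
    while covered.mean() < coverage:
        best = max(masks, key=lambda p: (masks[p] & ~covered).sum())
        if (masks[best] & ~covered).sum() == 0:
            break                               # no remaining pair adds coverage
        chosen.append((wins1[best[0]], wins2[best[1]]))
        covered |= masks[best]
    return chosen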
This Section is organized as follows, initially the training procedure of MASt3R is detailed (Section C.4.1). Then, several inference tasks are evaluated, each time comparing with others, starting with visual camera pose estimation on the Map-Free Relocalization Benchmark (Section C.4.2), the CO3D and RealEstate datasets (Section C.4.3) and other standard Visual Localization benchmarks (Section C.4.4). Finally, MASt3R is leveraged for the task of Dense Multi-View Stereo (MVS) reconstruction (Section C.4.5).
Training data. The MASt3R network in one example is trained with a mixture of 14 datasets: Habitat, ARKitScenes (see Dehghan et al.), Blended MVS, MegaDepth (Li and Snavely 2018), Static Scenes 3D, ScanNet++, CO3Dv2, Waymo (Sun et al. 2020), Map-free (see Arnold et al.), WildRGB-D (Xia et al., “RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos”, arxiv.org/abs/2401.12592, 2024), Virtual KITTI (see Cabon et al., “Virtual KITTI 2”, in arXiv 2001.10773, 2020), Unreal4K (Tosi et al., “SMD-Nets: Stereo Mixture Density Networks”, in CVPR, 2021), TartanAir (see Wang et al., “TartanAir: A Dataset to Push the Limits of Visual SLAM”, in arXiv 2003.14338, 2020) and an internal dataset. These datasets feature diverse scene types: indoor, outdoor, synthetic, real-world, object-centric, etc. Among them, ten datasets have metric ground-truth. When image pairs are not directly provided with a dataset, they are extracted based on the method described in Weinzaepfel et al. 2023. Generally, in one example, image retrieval and point matching methods may be used to match and verify image pairs.
Training. As set forth above, the MASt3R model architecture is based on the DUSt3R model architecture, permitting the use of the same backbone (e.g., a ViT-Large encoder and a ViT-Base decoder). To benefit from the DUSt3R architecture's 3D matching abilities, the model weights are initialized to predetermined values from a DUSt3R checkpoint. During each epoch, 650 k pairs equally distributed between all datasets are randomly sampled. The MASt3R network is trained for 35 epochs with a cosine schedule and an initial learning rate set to 0.0001. Similar to the training of DUSt3R, the image aspect ratio is randomized at training time, ensuring that the largest image dimension is 512 pixels. The local feature dimension is set to d=24 and the matching loss weight to β=1. It is important that the network sees different scales at training time, because coarse-to-fine matching starts from zoomed-out images to then zoom in on details (see Section C.3.4). Consequently, aggressive data augmentation is performed during training in the form of random cropping. Image crops are transformed with a homography to preserve the central position of the principal point. While example training parameters are provided, the present application is also applicable to other values.
Correspondence sampling. To generate ground-truth correspondences for the matching loss (Equation C11), the reciprocal correspondences between the ground-truth 3D pointmaps are computed.
Fast nearest neighbors. For the fast reciprocal matching disclosed in Section C.3.3, the nearest-neighbor function NN(x) from Equation C15 may be implemented differently depending on the dimension of x. When matching 3D points $x \in \mathbb{R}^3$, NN(x) may be implemented using K-d trees (see Maneewongvatana et al., “Analysis of Approximate Nearest Neighbor Searching with Clustered Point Sets”, in DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1999). For matching local features with d=24, however, K-d trees may be inefficient due to the dimensionality. Therefore, the optimized FAISS library is relied on in such cases.
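By way of illustration only, a Python sketch of such a dimension-dependent nearest-neighbor routine is set forth below; the flat L2 FAISS index and the brute-force fallback are assumptions of the sketch.

import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbors(query, base):
    """Nearest-neighbor indices of each query row in base.

    For low-dimensional inputs (e.g., 3D points) a K-D tree is efficient; for
    higher-dimensional descriptors (e.g., d=24) a flat FAISS index is preferred.
    """
    if base.shape[1] <= 3:                                # 3D points: K-D tree
        return cKDTree(base).query(query, k=1)[1]
    try:
        import faiss                                      # optional dependency
        index = faiss.IndexFlatL2(base.shape[1])
        index.add(base.astype(np.float32))
        _, idx = index.search(query.astype(np.float32), 1)
        return idx[:, 0]
    except ImportError:                                   # brute-force fallback
        d2 = ((query[:, None, :] - base[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)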
Dataset description. Experiments begin with the Map-free relocalization benchmark (see Arnold et al.), an extremely challenging dataset aiming at localizing the camera in metric space given a single reference image without any map. It comprises a training, validation and test sets of 460, 65 and 130 scenes, respectively, each featuring two video sequences. Following the benchmark, evaluations are performed in term of Virtual Correspondence Reprojection Error (VCRE) and camera pose accuracy, see Arnold et al. for details.
Impact of subsampling. Coarse-to-fine matching may not be performed for this dataset, as the image resolution is already close to MASt3R's working resolution (720×540 vs. 512×384, respectively). As mentioned in Section C.3.3, computing dense reciprocal matching may be slow even with optimized code for searching nearest neighbors. Therefore, the set of reciprocal correspondences is subsampled, keeping at most k correspondences from the complete set (Equation C14).
Ablations on losses and matching modes. Results are reported on the validation set in the table in
First, it is noted that all proposed MASt3R methods outperform the others. All other things being equal, matching descriptors performs better than matching 3D points (II versus IV). Regression may be inherently unsuited to computing pixel correspondences; see Section C.3.2.
Also, the impact of training only with the single matching objective ($\mathcal{L}_{match}$ from Equation C11, III) is studied. In this case, the performance overall may degrade compared to training with both 3D and matching losses (IV), in particular in terms of pose estimation accuracy (e.g., a median rotation of 10.8° for (III) compared to 3.0° for (IV)). It is noted that this is in spite of the decoder now having more capacity to carry out a single task, instead of two when performing 3D reconstruction simultaneously, indicating that grounding matching in 3D is indeed crucial to improve matching. Lastly, it is observed that, when using the metric depth directly output by MASt3R, the performance largely improves. This suggests that, as for matching, the depth prediction task is largely correlated with 3D scene understanding, and that the two tasks strongly benefit from each other.
Comparisons on the test set are reported in the table in
Also provided are the results of direct regression with MASt3R, i.e., without matching, simply using PnP on the pointmap $X^{2,1}$ of the second image. These results are on par with the MASt3R matching-based variant, even though the ground-truth calibration of the reference camera is not used. As shown below, this does not hold true for other localization datasets, and computing the pose via matching (e.g., with PnP or the essential matrix) with known intrinsics may be safer.
Qualitative results.
Datasets and protocol. Next, the task of relative pose estimation on the CO3Dv2 and RealEstate10 k datasets is evaluated. CO3Dv2 includes 6 million frames extracted from approximately 37 k videos, covering 51 MS-COCO categories. Ground-truth camera poses are obtained using COLMAP from 200 frames in each video. RealEstate10 k is an indoor/outdoor dataset that features 80K video clips on YouTube totaling 10 million frames, camera poses being obtained via SLAM with bundle adjustment. Following PoseDiffusion, MASt3R is evaluated on 41 categories from CO3Dv2 and 1.8K video clips from the test set of RealEstate10 k. Each sequence is 10 frames long, relative camera poses are evaluated between all possible 45 pairs, not using ground-truth focal lengths.
Baselines and metrics. As before, matches obtained with MASt3R are used to estimate essential matrices and the relative pose. Note that predictions are made pairwise, contrary to all other methods that leverage multiple views (with the exception of DUSt3R-PnP). Comparisons are made against other data-driven approaches like RelPose, RelPose++ (see Lin et al., “Relpose++: Recovering 6d poses from sparse-view observations”, in arXiv:2305.04926, 2023), PoseReg and PoseDiff, the recent RayDiff (see Zhang et al., “Cameras as Rays: Pose Estimation via Ray Diffusion”, in ICLR, 2024) and DUSt3R. Results are also reported for more traditional SfM methods like PixSFM and COLMAP extended with SuperPoint (see DeTone et al.) and SuperGlue (COLMAP+SPSG). The Relative Rotation Accuracy (RRA) and Relative Translation Accuracy (RTA) are reported for each image pair to evaluate the relative pose error, and a threshold τ=15 is selected to report RTA@15 and RRA@15. Additionally, the mean Average Accuracy (mAA@30), defined as the area under the accuracy curve of the angular differences at min(RRA@30, RTA@30), is calculated.
Results. As shown in the tables in
Datasets. MASt3R is evaluated for the task of absolute pose estimation on the Aachen Day-Night and InLoc (see Taira et al.) datasets. Aachen includes 4,328 reference images taken with hand-held cameras, as well as 824 daytime and 98 nighttime query images taken with mobile phones in the old inner city of Aachen, Germany. InLoc (see Taira et al.) is an indoor dataset with challenging appearance variation between the 9,972 RGB-D+6DOF pose database images and the 329 query images taken from an iPhone 7.
Metrics. The percentage of successfully localized images within three thresholds are reported: (0.25 m, 2°), (0.5 m, 5°) and (5 m, 10°) for Aachen and (0.25 m, 10°), (0.5 m, 10°), (1 m, 10°) for InLoc.
Results are reported in the table in
Interestingly, MASt3R still performs well even with a single retrieved image (top1), showcasing the robustness of 3D grounded matching. Also included are direct regression results, which are poor, showing a striking impact of the dataset scale on the localization error, i.e., small scenes are much less affected (see results on Map-free in Section C.4.2). This confirms the importance of feature matching to estimate reliable poses.
Multi-View Stereo (MVS) is performed by triangulating the obtained matches. Note that the matching is performed in full resolution without prior knowledge of cameras, and the latter are only used to triangulate matches to 3D in ground-truth reference frame. To remove spurious 3D points, geometric consistency post-processing (see F. Wang et al., “PatchmatchNet: Learned Multi-View Patchmatch Stereo”, in CVPR, 2021) is applied.
Datasets and metrics. Predictions on the DTU dataset are evaluated. Contrary to all competing learning methods, the MASt3R network is applied in a zero-shot setting, i.e., no training or finetuning is performed on the DTU training set and the MASt3R model is applied as is. In the tables in
Results. Data-driven approaches trained on this domain significantly outperform handcrafted approaches, cutting the Chamfer error by half. Against this backdrop, MASt3R, applied in a zero-shot setting, still outperforms or competes with the others, all without leveraging camera calibration or poses for matching, and without having seen the camera setup before.
Grounding image matching in 3D with the MASt3R architecture raised the bar on camera pose and localization tasks on many public benchmarks, and improved the DUSt3R architecture with matching, advantageously achieving enhanced robustness while attaining and even surpassing what could be done with pixel matching alone. In addition, a fast reciprocal matcher and a coarse-to-fine approach for efficient processing are disclosed, allowing users to balance between accuracy and speed. The MASt3R architecture is believed to greatly increase the versatility of localization.
This Section discloses an alignment module 820 and procedure 855 therefor, with reference to
By way of overview, the method disclosed in this Section begins with a collection of N images $\mathcal{I} = \{I^n\}_{1\le n\le N}$ of views of a given 3D scene, with N potentially large (e.g., on the order of thousands of images). Each image $I^n$ is acquired by a camera $\mathcal{C}^n = (K^n, P^n)$, where $K^n \in \mathbb{R}^{3\times 3}$ denotes the intrinsic parameters (i.e., the camera's calibration in terms of focal length and principal point) and $P^n \in \mathbb{R}^{4\times 4}$ the camera's pose (i.e., rotation and translation, from world coordinates to camera coordinates). Given the set of images $\{I^n\}$ as input, the method recovers all camera parameters $\{\mathcal{C}^n\}$ as well as the underlying 3D scene geometry $\{X^n\}$, with $X^n \in \mathbb{R}^{W\times H\times 3}$ a pointmap relating each pixel y=(i, j) to its corresponding 3D point $X^n_y$ in the scene, in a common world coordinate frame. It is assumed for simplicity in this Section that all images have the same pixel resolution W×H, which may differ in practice. The method 850 of recovering camera parameters and 3D reconstruction is shown in
At 852, observing a scene to reconstruct may be expressed by the co-visibility graph module 303 as a scene graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where each vertex of the graph $I \in \mathcal{V}$ is an image, and each edge $e = (n, m) \in \mathcal{E}$ is a (directed) pairwise connection between likely-related images $I^n$ and $I^m$. In this embodiment, the edges of the scene graph are directed because the pairwise relationship (i.e., the inference pass of MASt3R; see Section C.6.2 and Equation C19 below) is asymmetric. The task performed at 852 may be reformulated as estimating the vertex properties (i.e., the camera parameters and view-dependent pointmaps) from the pairwise properties of the edges. Without prior information about the image views $\mathcal{V}$, each image 302 could be related to any other image (i.e., the approach might consider a graph where all possible edges are present). However, doing so would make the rest of the approach not scalable for large image collections, as the overall complexity of the next step at 854 is in $O(|\mathcal{E}|) = O(N^2)$.
At 852, a scalable pairwise image matcher $h(I^n, I^m) \to s$ is used that is able to provide an approximate co-visibility score $s \in \mathbb{R}$ between two images $I^n$ and $I^m$. In examples, image retrieval methods, efficient matching methods or averaged MASt3R confidence predictions may be used to compute the score. By evaluating all possible pairs and thresholding the score with $\tau_s$, irrelevant edges may be pruned from the scene graph $\mathcal{G}$ to retain only those edges where the score $h(I^n, I^m) > \tau_s$, while always ensuring that the scene graph $\mathcal{G}$ remains connected (i.e., there exists a path between any pair of vertices).
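By way of illustration only, a Python sketch of such scene-graph construction and pruning is set forth below; it uses an undirected graph and a maximum spanning tree over the scores as one simple way of guaranteeing connectivity, the score function and threshold are assumptions of the sketch, and the two inference directions per retained edge are handled downstream.

import itertools
import networkx as nx

def build_scene_graph(images, score_fn, tau_s=0.3):
    """Build a pruned co-visibility scene graph (sketch).

    score_fn(i, j) returns an approximate co-visibility score between two
    images (e.g., from image retrieval or averaged MASt3R confidences); tau_s
    is the pruning threshold. Maximum-spanning-tree edges over all scores are
    always kept so the pruned graph remains connected.
    """
    G = nx.Graph()
    G.add_nodes_from(range(len(images)))
    scored = nx.Graph()
    scored.add_nodes_from(range(len(images)))
    for n, m in itertools.combinations(range(len(images)), 2):
        s = score_fn(images[n], images[m])
        scored.add_edge(n, m, weight=s)
        if s > tau_s:                       # keep only sufficiently co-visible pairs
            G.add_edge(n, m, weight=s)
    mst = nx.maximum_spanning_tree(scored, weight="weight")
    G.add_edges_from(mst.edges(data=True))  # guarantee connectivity
    return G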
More generally, the co-visibility graph module 303 may be used to preprocess images 302 to generate a scene graph for input to the DUSt3R network 304a (shown in
At 854 (in
Sparse correspondences. For each image pair, sparse correspondences (or matches) are recovered by application of MASt3R's fast reciprocal matching disclosed in Section C.3.3. More specifically, here a fast reciprocal nearest-neighbor search (FastNN) searches for a subset of reciprocal correspondences from two feature maps $F^{n,n}$ and $F^{m,n}$ by initializing seeds on a regular pixel grid at intervals $s \in \mathbb{N}$ and iteratively converging to mutual correspondences $\mathcal{M}^{n,m}$ (as defined in Section C.3.3), where $\mathcal{M}^{n,m} = \{y^n_c \leftrightarrow y^m_c\}$ is a set of pixel correspondences between $I^n$ and $I^m$. The grid density parameter s sets a trade-off between coverage density and computation cost. In one embodiment, s=8 pixels is used as a compromise between these two aspects. Since both MASt3R and FastNN are order-dependent functions in terms of their parameters, they are typically computed in both directions and all unique correspondences are gathered (see Equation C20), where the upper line of a set $\overline{\mathcal{M}}^{n,m} = \{y^m_c \leftrightarrow y^n_c\}$ denotes the n↔m swap of the original set $\mathcal{M}^{n,m} = \{y^n_c \leftrightarrow y^m_c\}$.
At 856, coarse global alignment is performed in 3D space. Sections A.6 and B.3.4 above, with reference to the global aligner 407, disclose an alternate example of a global alignment procedure that aims to express all pairwise pointmaps in a common world coordinate frame using regression-based alignment. At 856, an alternate embodiment of global alignment is performed that takes advantage of explicit pixel correspondences. This alternate embodiment, which also aims to express all pairwise pointmaps in a common world coordinate frame, advantageously reduces the overall number of parameters.
Averaging per-image pointmaps. At 856, a first step is performed to obtain a canonical pointmap for each image, expressed in its own camera coordinate system. For example, consider an image $I^n$ and let $\varepsilon_n = \{e \mid e_0 = n\}$ be the set of all edges $e = (e_0, e_1)$ starting from image $I^n$. For each such edge $e = (n, m) \in \varepsilon_n$, there is a different estimate of $X^{n,n}$ and $C^{n,n}$, which is denoted herein as $X^{n,e}$ and $C^{n,e}$ with the convention that e=(n, m). One strategy to arrive at the canonical pointmap is to compute a per-pixel weighted average of all estimates using:
Because the initial estimates may not necessarily be well aligned in terms of scale and depth offset, an alternate strategy may be to compute a scale-aligned version of the canonical pointmap by minimizing the following robust energy function:
where $a_n$ denotes per-edge scale alignment factors over the edges $e \in \varepsilon_n$.
Global alignment. A goal of the coarse global alignment is to find the scaled rigid transformation of all canonical pointmaps $\{\tilde{X}^n\}$ such that any two matching 3D points are as close as possible in the world coordinate frame, where the rigid transformation for each image $I^n$ is expressed as a 6D pose $P^n \in \mathbb{R}^{4\times 4}$ with a scaling factor $\sigma_n > 0$. To avoid degenerate solutions, a normalization constraint on the scale factors is imposed, which may be achieved by setting $\sigma = \mathrm{softmax}(\sigma')$, $\sigma' \in \mathbb{R}^N$. In contrast to the global alignment performed by the global aligner 407, the objective here only involves the sparse pixel correspondences $\mathcal{M}_e = \{y^n_c \leftrightarrow y^m_c\}$ (see Equation C20), which are weighted by the geometric average of their respective confidences $q_c = \sqrt{Q^{n,e}_c\, Q^{m,e}_c}$ (here any matching pixels $y^n_c \leftrightarrow y^m_c$ are denoted as c, with a slight abuse of notation for the sake of clarity). This objective is minimized using standard gradient descent for simplicity, but other strategies such as second-order optimization (e.g., with a Levenberg-Marquardt (LM) solver) could be faster.
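By way of illustration only, a PyTorch sketch of such a coarse alignment by gradient descent over per-image similarity transforms is set forth below; the quaternion parameterization, the softmax-based scale normalization and the (x, y) pixel convention are assumptions of the sketch and not the exact optimization of the computer program listing appendix.

import torch

def coarse_alignment(canon_pts, matches, confidences, n_iters=300, lr=0.01):
    """Coarse global alignment by gradient descent (sketch).

    canon_pts[n]: (H, W, 3) torch tensor, canonical pointmap of image n.
    matches: list of (n, m, yn, ym) with yn/ym long tensors of matched (x, y) pixels.
    confidences: per-entry tensors of correspondence weights q_c.
    Each image gets a scale sigma_n, a rotation R_n (from a quaternion) and a
    translation t_n mapping its canonical points into the world frame.
    """
    N = len(canon_pts)
    quat = torch.zeros(N, 4, requires_grad=True)
    with torch.no_grad():
        quat[:, 0] = 1.0                               # identity rotations
    trans = torch.zeros(N, 3, requires_grad=True)
    log_sigma = torch.zeros(N, requires_grad=True)
    opt = torch.optim.Adam([quat, trans, log_sigma], lr=lr)

    def rot(q):                                         # unit quaternion -> rotation matrix
        q = q / q.norm(dim=-1, keepdim=True)
        w, x, y, z = q.unbind(-1)
        return torch.stack([
            1 - 2*(y*y + z*z), 2*(x*y - w*z),   2*(x*z + w*y),
            2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x),
            2*(x*z - w*y),     2*(y*z + w*x),   1 - 2*(x*x + y*y)], -1).view(-1, 3, 3)

    for _ in range(n_iters):
        opt.zero_grad()
        sigma = torch.softmax(log_sigma, 0) * N         # normalized scales, avoids sigma=0
        R = rot(quat)
        loss = 0.0
        for (n, m, yn, ym), q_c in zip(matches, confidences):
            pn = canon_pts[n][yn[:, 1], yn[:, 0]]       # matched canonical 3D points
            pm = canon_pts[m][ym[:, 1], ym[:, 0]]
            wn = sigma[n] * pn @ R[n].transpose(0, 1) + trans[n]   # to world frame
            wm = sigma[m] * pm @ R[m].transpose(0, 1) + trans[m]
            loss = loss + (q_c * (wn - wm).norm(dim=-1)).sum()     # 3D matching cost
        loss.backward()
        opt.step()
    return rot(quat).detach(), trans.detach(), (torch.softmax(log_sigma, 0) * N).detach()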
Coarse alignment is fast and simple, with good convergence in practice, but unfortunately it may not fix inaccuracies in the predictions of the canonical pointmaps. Yet, these are bound to happen, as these pointmaps originate from an approximate regression process that (i) is subject to depth ambiguities (e.g., in regions seen by only one of the two cameras) and (ii) was never meant to be extremely accurate to begin with. To further refine the camera parameters, a 2D reprojection error over all extracted correspondences is therefore minimized at 858,
where $\pi^n(\chi^m_{i,j})$ denotes the 2D reprojection of a 3D point $\chi^m_{i,j}$ from image $I^m$'s optimizable pointmap $\chi^m \in \mathbb{R}^{W\times H\times 3}$ onto the camera screen of $I^n$, and $\rho: \mathbb{R}^2 \to \mathbb{R}^{+}$ is a robust error function to deal with the potential outliers among all extracted correspondences. It is typically of the form $\rho(x) = \lVert x \rVert^{\gamma}$ with $0 < \gamma \le 1$ (e.g., γ=0.5), but other choices are possible. For a standard pinhole camera model with intrinsic and extrinsic parameter matrices $K^n$ and $P^n$, the reprojection function π is given as:
To ensure geometry-compliant reconstructions, the pointmaps $\{\chi^n\}$ are themselves obtained using the inverse reprojection function $\pi^{-1}$. For a given 3D point at pixel coordinate (i, j) and depth $z^n_{i,j} > 0$, then:
Since this method is only interested in a few pixel locations of $\chi^n$ (those with pixel correspondences $c \in \mathcal{M}^{n,m}$), it is not necessary to store all depth values explicitly during optimization, thereby reducing memory and computations and making the optimization fast and scalable. Note that this approach is compatible with different camera models (e.g., fisheye, omnidirectional, etc.) as long as the corresponding definition of the camera reprojection function π is set accordingly and the pairwise inference network has been trained accordingly. Again, this objective is minimized using standard gradient descent for simplicity, but other strategies such as second-order optimization (e.g., with an LM solver) could potentially be faster.
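By way of illustration only, a Python sketch of the pinhole reprojection function π and its inverse, consistent with the relations above, is set forth below; world-to-camera 4×4 poses are assumed.

import numpy as np

def reproject(X_world, K, P):
    """Pinhole reprojection pi: 3D world point(s) -> 2D pixel coordinates.
    P is the 4x4 world-to-camera pose, K the 3x3 intrinsics."""
    Xh = np.concatenate([X_world, np.ones_like(X_world[..., :1])], axis=-1)
    Xc = Xh @ P.T                          # to camera coordinates
    uvw = Xc[..., :3] @ K.T                # apply intrinsics
    return uvw[..., :2] / uvw[..., 2:3]    # perspective division

def inverse_reproject(i, j, z, K, P):
    """Inverse reprojection: pixel (i, j) with depth z -> 3D world point."""
    pix = np.stack([i * z, j * z, z], axis=-1)
    Xc = pix @ np.linalg.inv(K).T          # camera-frame point
    Xh = np.concatenate([Xc, np.ones_like(Xc[..., :1])], axis=-1)
    return (Xh @ np.linalg.inv(P).T)[..., :3]   # back to world coordinates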
Anchor points. Contrary to the 3D matching loss (see Equation C21), where 3D points are rigidly bundled together (embedded within canonical pointmaps), optimizing the individual positions of 3D points $\chi^n_{i,j}$ without constraints can lead to inconsistent 3D reconstructions. To remedy this, the traditional solution is to create point tracks (i.e., to assign the same 3D point to a connected chain of several correspondences that spans several images). While this is relatively straightforward with keypoint-based methods, the correspondences used here do not necessarily overlap with each other. One solution is to glue 3D points together via anchor points that provide a regularization on the type of deformations that can be applied to the pointmaps.
For each image I_n, a regular 2D grid of anchor points {ẏ^n_{u,v}} is created, spaced by the same sampling parameter s (in pixels) as used for FastNN matching, i.e., ẏ^n_{u,v} = (s·u + s/2, s·v + s/2). Then each correspondence y_c^n = (i, j) in I_n is tied with its closest anchor ẏ^n_{u,v}, at (u, v) = (⌊i/s⌋, ⌊j/s⌋). The corresponding 3D point is then fully characterized by the depth of its anchor point ż^n_{u,v} as

χ^n_{i,j} = π_n^{−1}(i, j, σ_n o^n_{i,j} ż^n_{u,v}),

where σ_n is a global scale factor for image I_n and the relative offset o^n_{i,j} is a constant obtained once for all from the canonical depthmap z̃^n_{i,j} = X̃^n_{i,j,2} and its subsampled version at the anchor points (i.e., the ratio between the canonical depth at pixel (i, j) and the canonical depth at its anchor). In other words, it is assumed that the canonical pointmaps are locally accurate in terms of relative depth values. On the whole, optimizing a pointmap χ^n ∈ ℝ^{W×H×3} only comes down to optimizing a small set of anchor depth values ż^n (of size WH/s²), a global scale factor σ_n > 0, and camera parameters K_n and P_n.
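As a small illustration of this anchor parameterization, the following sketch ties a pixel to its grid anchor and expresses its depth through the anchor depth, a fixed offset and the global scale; the scalar, per-pixel formulation and the names are illustrative assumptions.

```python
def closest_anchor(i, j, s):
    """Index (u, v) of the closest anchor for pixel (i, j), with anchors at (s*u + s/2, s*v + s/2)."""
    return i // s, j // s

def tied_depth(i, j, anchor_depths, canonical_depth, sigma, s):
    """Depth of pixel (i, j) expressed through its anchor: z = sigma * o * z_anchor,
    where o is a constant offset taken from the canonical depthmap at initialization."""
    u, v = closest_anchor(i, j, s)
    # Fixed relative offset between the pixel's canonical depth and its anchor's canonical depth.
    o = canonical_depth[i, j] / canonical_depth[s * u + s // 2, s * v + s // 2]
    return sigma * o * anchor_depths[u, v]
```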
Initialization. Since the energy function in Equation C21 may be non-convex and subject to falling into sub-optimal local minima, the coarse global alignment is leveraged as a good initialization, setting P_init = P*, σ_init = σ* and approximating χ_init ≃ X̃. Note that the canonical pointmap X̃^n may not be fully consistent with a camera model, so the optimal focal length f_n* is extracted from pointmap X̃^n as described in Section B.3.3 to get intrinsic camera parameters K_init (with centered principal point), and the anchor depth values ż^n are initialized from the subsampled canonical depthmap.
Low-rank representation of the pointmaps. The anchor-based 3D scene representation disclosed in this Section lowers the effective dimensionality of pointmaps from 3WH to WH/s². Yet, in the case where some regions of an image are completely unmatched by any other images, these regions will not receive any updates during the optimization (by definition of Equation C21). As a result, the pointmap for this image may gradually distort awkwardly, resulting in unsatisfying reconstructions. Intuitively, it seems more satisfying to have a quasi-rigid way of deforming pointmaps during optimization, akin to the coarse global alignment but with more leeway. One way of achieving this is to represent ż^n with a low-rank approximation ż^n ≃ U^n z̈^n, where for simplicity the dimensions of ż^n ∈ ℝ^D are flattened. Here, U^n ∈ ℝ^{D×D′} is a constant orthogonal low-rank basis and z̈^n ∈ ℝ^{D′} is a small set of coefficients encoding ż^n, with (U^n)^T U^n = I
and D′ << D. A solution is to compute a content-adaptive frequency decomposition of the original anchor depthmap ż^n using spectral analysis. More specifically, a graph is first built in which all anchor depth values ż^n_{u,v} ∈ ż^n constitute nodes, and edge similarities α_{(u,v),(u′,v′)} are computed using a Gaussian kernel on depth-invariant 3D distances between the anchor 3D points Ẋ^n_{u,v} and Ẋ^n_{u′,v′}, where Ẋ^n_{u,v} is defined w.r.t. X̃^n_{u,v} similarly to ż^n_{u,v}. Then the normalized D×D Laplacian matrix L = I − Λ^{−1/2} A Λ^{−1/2} is computed, where I is the identity matrix, A is the reshaped graph adjacency matrix with A_{(u,v),(u′,v′)} = α_{(u,v),(u′,v′)}, and Λ is the diagonal degree matrix with Λ_{(u,v),(u,v)} = Σ_{u′,v′} A_{(u,v),(u′,v′)}. Computing the eigenvectors associated with the lowest D′ eigenvalues yields a good low-frequency and boundary-aware orthogonal basis U^n. In one embodiment, only D′ = 64 basis vectors (i.e., coefficients) are retained to represent an anchored depthmap.
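The following NumPy sketch illustrates the spectral construction of such a basis; the Gaussian kernel bandwidth, the use of plain 3D distances between anchor points and the dense eigendecomposition are simplifying assumptions made for illustration.

```python
import numpy as np

def low_rank_basis(anchor_points, D_prime=64, tau=1.0):
    """anchor_points: (D, 3) array of anchor 3D positions; returns a (D, D_prime) orthogonal basis."""
    D = len(anchor_points)
    diff = anchor_points[:, None, :] - anchor_points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    A = np.exp(-dist2 / (2 * tau ** 2))              # Gaussian affinities between anchors
    np.fill_diagonal(A, 0.0)                         # no self-loops
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg + 1e-8))
    L = np.eye(D) - D_inv_sqrt @ A @ D_inv_sqrt      # normalized Laplacian I - D^-1/2 A D^-1/2
    eigvals, eigvecs = np.linalg.eigh(L)             # eigenvalues returned in ascending order
    return eigvecs[:, :D_prime]                      # lowest-frequency, boundary-aware basis
```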
In another embodiment, an approximation is obtained by computing U^n using a 2D Fourier transform on the anchor grid domain and only keeping the (flattened) low-frequency basis vectors. This has the advantage of being fast to compute and independent of the anchor depthmap content. On the other hand, such a basis is generic (the same for any image) and not tailored to the image at hand; in particular, it does not take into account the depth boundaries of the particular image, around which the reconstruction error is expected to be higher since such boundaries can only be well represented with high frequencies.
Compared with global alignment performed by global aligner 407 (shown in
This Section presents a novel approach for Multiple View Stereopsis (MVS) from un-calibrated and un-posed cameras. A disclosed network is trained to predict a dense and accurate scene representation from a collection of one or more images, without prior information regarding the scene or the cameras, not even the intrinsic parameters. This is in contrast to MVS where, given estimates of the camera intrinsic and extrinsic parameters, one needs to triangulate matching 2D points to recover the 3D information.
In real life, recovering accurate camera parameters from un-calibrated RGB images is still an open problem and is known to be extremely challenging, depending on the visual content and the acquisition scenario. Historically, tackling the problem as a whole has always been considered too hard of a task. This task may be split into a hierarchy of simpler sub-problems, a.k.a. minimal problems, sequentially aggregated together, as shown in the drawings.
Proposed in this Section D is a shift in paradigm shown in
To this aim, a multi-modal and multi-view Vision Transformer (ViT) network (see Dosovitskiy et al.) is leveraged, that takes both the RGB images and a scene representation as input, similar to that of SACReg (see Revaud et al., "SACReg: Scene-agnostic coordinate regression for visual localization" in arXiv 2307.11702, 2023, which is hereinafter referred to as "SACReg", which is incorporated herein in its entirety). In contrast however, it is embedded in a diffusion framework: starting from a random scene initialization, the disclosed network predicts scene updates and iteratively converges towards a satisfactory reconstruction. As scene representation, view-wise pointmaps are leveraged, i.e., each pixel of each image stores the first 3D point that it observes. In terms of architecture, each image and pointmap are processed through a Siamese ViT encoder and are then mixed together in a single representation. A decoder module performs cross attention between views to allow for global scene reasoning. The outputs are the pointmap updates for all views of the scene. The disclosed model is trained in a fully-supervised manner, where the ground-truth annotations are synthetically generated or, ideally, captured using other modalities such as Time-of-Flight sensors or scanners. Experiments show that this approach is viable: the reconstructions are accurate and consistent between views in real-life scenarios with various unknown sensors. This disclosure demonstrates that the same architecture can handle real-life monocular 2102 and multi-view 2104 reconstruction scenarios seamlessly, even when the camera is not moving between frames, as seen in the drawings.
Contributions are threefold: First, presented herein is a holistic end-to-end 3D reconstruction pipeline from uncalibrated and unposed images that unifies monocular and multi-view reconstruction. Second, introduced herein is the Pointmap representation for MVS applications, which enables the network to predict the 3D shape in a canonical frame while preserving the implicit relationship between pixels and the scene. This effectively drops many constraints of the usual perspective camera formulation. Third, leveraged herein is the iterative nature of diffusion processes for MVS: Gaussian noise is gradually turned into a global and coherent scene representation. This is believed to be a milestone with a broader impact than simply performing 3D reconstructions, as elaborated in Section D.5 (below).
Summarized in this Section are the main directions of MVS under the scope of geometric camera models.
MVS. Camera parameters may be considered to be explicit inputs to the reconstruction methods. MVS thus amounts to finding the surface along the defined rays, using the other views via triangulation. Whether fully handcrafted, based on more recent scene optimization, or learning based, these approaches all rely on perfect cameras obtained via complex calibration procedures, either during the data acquisition or using Structure-from-Motion approaches for in-the-wild reconstructions. Yet, in real-life scenarios, the inaccuracy of pre-estimated cameras can be detrimental for these algorithms to work properly. In various implementations, camera poses can be refined along with the dense reconstruction.
Refining the Cameras. Cameras may be refined by jointly optimizing for the scene and the cameras. The camera model may be a pinhole with known intrinsics, and the scene can be point clouds or implicit shape representations. These however are still limited and do not work well without either (i) strong assumptions about the camera motion and/or (ii) good initial camera intrinsics and poses. In fact, applying iterative optimization methods to camera parameters estimation may be prone to falling into sub-optimal local minima, and may only be effective when the current state is close to a good minimum. This is why SfM pipelines may use a hierarchy of sub-problems with RANSAC schemes in order to find valid solutions in the enormous search space, and only then try to optimize the scene and cameras jointly. Interestingly, explicit camera model formulations may be dropped, and more relaxed approaches may be used, purely driven by the data distribution, as explained in the following.
Implicit Camera Models. An idea regarding implicit camera models is to express shapes in a canonical space (e.g., not related to the input view points). Shapes can be stored as occupancy in regular grids, octree structures, collections of parametric surface elements, point cloud encoders, free-form deformations of template meshes, or view-wise depthmaps. While they arguably perform classification and not actual 3D reconstruction, all in all these approaches may only work in very constrained setups and have trouble generalizing to natural scenes with non object-centric views. The question of how to express a complex scene with several object instances in a single canonical frame had yet to be answered: in this Section D, the reconstruction is expressed in a canonical reference frame, but due to the disclosed scene representation (Pointmaps), a relationship between image pixels and the 3D space is preserved, and the disclosed systems and methods are able to reconstruct the whole scene consistently.
Pointmaps. Using a collection of pointmaps as shape representation is counter-intuitive for MVS, but may be used for Visual Localization tasks, either in scene-dependent optimization approaches or scene-agnostic inference methods. Similarly, view-wise modeling may be used in monocular 3D reconstruction, the idea being to store the canonical 3D shape in multiple canonical views to work in image space. Explicit perspective camera geometry may be leveraged, via rendering of the canonical representation.
In this Section, the present disclosure involves leveraging pointmaps as a relaxation of the hard projective constraints of the usual camera models to directly predict 3D scenes from un-calibrated and un-posed cameras without ever enforcing projective rules. Predictions made herein are purely data-driven yet the outputs still respect the underlying camera geometry, even though no camera center nor ray behavior is explicitly specified.
One component that makes the disclosed systems and methods feasible is the scene representation S. A Pointmap representation is used herein, e.g., each pixel of each image stores the coordinates of the 3D point it observes. It is a one-to-one mapping between pixel coordinates and 3D points, assuming that occupancy is a binary information in 3D space, e.g., no transparency, nor double/triple points.
Leveraging priors. The disclosed systems and methods consider both the pointmaps and the images, thus are able to leverage 2D structure priors relevant to the scene, like the relationship between image RGB gradients and 3D shape properties. This gives the networks the ability to learn shape priors, like shape from texture, from defocus, from shading, or from contours. Importantly, it can also compare 3D values between views, that live in a common reference frame, which is of importance to leverage 3D priors of the scenes.
Fundamental Motivation. Finally, the disclosed systems and methods may not rely on triangulation operations, which is in contrast to MVS. Triangulation may be noisy and inaccurate when rays are close to being colinear, i.e., when there is a very small camera translation between views. Even worse, it may collapse and be undefined when only a single view observes a region, or when rays are colinear (still camera, or pure rotation). Because triangulation relies on parallax, it may not be feasible when there is none or not enough. In contrast, the disclosed systems and methods can tackle pure rotations, still cameras and even the monocular case, in a common framework (see the drawings).
The camera pose estimation, and thus the reconstruction, may be performed in a 3D/4D world of arbitrary reference frame and scale. This means that infinitely many reconstructions may be valid explanations of the observed RGB data, and one has to be chosen at inference. In other words, a priori the mapping between pixels and scene coordinates is not known. To tackle this problem, diffusion processes may be leveraged: an initial reference coordinate system is defined by sampling random pointmaps in a normalized cube. At first all views will be inconsistent, but all coordinates live in the same 3D reference frame. The 3D diffusion process can then be seen as an update estimator that iteratively optimizes the scene representation towards a valid solution. Image diversity is not sought; rather, the present application aims to converge towards a sample in the infinite space of valid solutions. More formally, the disclosed diffusion model learns a series of state transitions, e.g., corrections, to map pure Gaussian noise S_T to S_0 that belongs to the target data distribution. A neural network is trained to reverse a Gaussian noising process that transforms S_0 into S_t by blending S_0 with noise ε ∼ N(0, I), where t ∼ U(0, T) is a continuous variable and λ(t) is a function that monotonically decreases from 1 to 0, effectively controlling how much of the target S_0 remains in S_t.
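As an illustration only, the following PyTorch sketch shows one common way such a noising process can be implemented; the specific schedule λ(t) and the variance-preserving blending convention are assumptions made for this example and are not taken from the disclosure.

```python
import torch

def noise_scene(S0, t, T=1000.0):
    """Blend the clean scene pointmaps S0 with Gaussian noise to obtain St (assumed schedule)."""
    t = torch.as_tensor(float(t))
    lam = torch.cos(0.5 * torch.pi * t / T) ** 2          # decreases from 1 (t=0) to 0 (t=T)
    eps = torch.randn_like(S0)                            # Gaussian noise, same shape as the scene
    St = torch.sqrt(lam) * S0 + torch.sqrt(1.0 - lam) * eps
    return St, eps
```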
Training. Diffusion models may be trained to predict either the target S_0, the noise ε, or the update to apply at each step in the v-prediction convention (see Salimans et al., "Progressive distillation for fast sampling of diffusion models", in ICLR, 2022). With no loss of generality, the present application may involve training a neural network f to predict S_0 from S_t, conditioned by the RGB observations c. This is possible as the update S_{t−Δ} may be derived from S_t and the predicted S̃_0. The objective L_s is thus a regression loss between the prediction f(S_t, t, c) and the target S_0.
In short, the network always tries to predict the target reconstruction.
Sampling. In order to generate samples from a learned model, a random sample ST is drawn and a series of (reverse) state transitions St-Δ is iterated by applying the denoising function f(St, t, c) on each intermediate state St. Again, with no loss of generality, the updates can be performed using the transition rules described e.g., in DDPM (see Ho et al., “Denoising diffusion probabilistic models”, in NeurIPS, 2020), or DDIM (see Song et al., “Denoising diffusion implicit models”, in ICLR, 2021) for faster inference, which are incorporated herein in their entireties.
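For illustration, a minimal sampling loop of this kind might look as follows; the step count, the schedule, and the simple re-noising transition are placeholders, and the DDPM or DDIM transition rules cited above can be substituted for the transition step.

```python
import torch

@torch.no_grad()
def sample(model, rgb_views, shape, steps=250, T=1000.0):
    """Iterate reverse state transitions from pure noise, conditioned on the RGB observations."""
    St = torch.randn(shape)                                    # random scene initialization
    times = torch.linspace(T, 0.0, steps + 1)
    for t, t_next in zip(times[:-1], times[1:]):
        S0_pred = model(St, t, rgb_views)                      # network predicts the clean scene
        if t_next > 0:
            lam = torch.cos(0.5 * torch.pi * t_next / T) ** 2  # same assumed schedule as training
            St = torch.sqrt(lam) * S0_pred + torch.sqrt(1.0 - lam) * torch.randn_like(St)
        else:
            St = S0_pred                                       # final reconstruction
    return St
```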
As depicted in
More specifically, the multiple view ViT architecture of CroCo is leveraged with a mixer block of the encoder 2308 to merge the different input modalities. Each image 2304 and each pointmap 2306 is encoded using a ViT encoder denoted ERGB and E3D respectively in the encoder 2308. For each view, both representations are mixed together using a Mixer module in the encoder 2308. This Mixer module also includes a ViT encoder equipped with cross attention: it updates the 3D representation of each view by cross-attending the RGB representation. All views are processed in a Siamese manner, meaning the weights of the encoders are shared for all views. From this encoder, N mixed representations are obtained. The network then reasons over all of them jointly in the decoder. Similar to CroCo, it is also a ViT, equipped with cross attention. It takes the representation of each view and sequentially performs self-attention (each token of a view attends to tokens of the same view) and cross-attention (each token of a view attends to all other tokens of other views). It was initially designed for pairwise tasks, but it can also cross-attend to more views by concatenating the tokens from other views. In a simplistic view, due to the global cross-attention, the network is able to reason over all possible matches between all views jointly, contrary to the classical approach that computes pairwise matches between sparse keypoints.
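For illustration only, the following PyTorch sketch mirrors this organization at a very high level: per-view RGB and pointmap tokens are encoded, mixed with cross attention, and then processed jointly before a linear head regresses the pointmap updates. Module names, layer counts, and the use of built-in transformer layers (with full attention over the concatenated tokens standing in for the alternating self/cross attention described above) are all simplifying assumptions.

```python
import torch
import torch.nn as nn

def vit_stack(dim, depth=6, heads=12):
    layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
    return nn.TransformerEncoder(layer, depth)

class Mixer(nn.Module):
    """Updates the pointmap tokens of one view by cross-attending its RGB tokens."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tok3d, tokrgb):
        attended, _ = self.cross(tok3d, tokrgb, tokrgb)
        return self.norm(tok3d + attended)

class SceneDenoiser(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.enc_rgb = vit_stack(dim)       # Siamese RGB encoder (weights shared across views)
        self.enc_3d = vit_stack(dim)        # Siamese pointmap encoder
        self.mixer = Mixer(dim)
        self.decoder = vit_stack(dim)       # joint reasoning over all views' tokens
        self.head = nn.Linear(dim, 3)       # linear prediction head: token -> 3D update

    def forward(self, rgb_tokens, pointmap_tokens):
        # rgb_tokens, pointmap_tokens: (views, tokens, dim), already patch-embedded
        mixed = [self.mixer(self.enc_3d(p[None]), self.enc_rgb(r[None]))[0]
                 for r, p in zip(rgb_tokens, pointmap_tokens)]
        joint = torch.stack(mixed).flatten(0, 1)[None]   # concatenate all views' mixed tokens
        return self.head(self.decoder(joint))            # per-token pointmap updates
```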
Positional Embedding. Similar to other known methods, Rotary Positional Embeddings (RoPE) (see Su et al., "RoFormer: Enhanced transformer with rotary position embedding", arXiv:2104.09864, 2021, which is incorporated herein in its entirety) may be leveraged. Each token could be associated with any other token in another view, meaning the relative displacement between views should not matter. In experiments herein, it can be seen that cross-attention neither benefits nor suffers from RoPE, which is therefore optional for this layer.
Prediction Head. A dense prediction head is added on top of the decoder. The dense prediction head may be, for example, a linear projection head, i.e., a single layer after the decoder that transforms the embeddings into 3D coordinates. It is possible to leverage DPT or other convolutional heads for this purpose, with no loss of generality.
The disclosed network is trained in a diffusion framework, in a fully supervised manner: the ground truth pointmaps associated with the RGB views are known. While it would be theoretically feasible to have a varying number of views for each batch during training, for simplicity, training is performed on N-tuples, where N may be between 1 and 4. Training for more views increases training complexity quickly and may become intractable. A more elegant solution for inference with more views is preferred, as detailed in Section D.3.5. In case real data is used, it is not always possible to have dense annotations. To overcome this, the missing annotations may be set to a constant, and a mask on the input RGB values may be added (either a predefined RGB value or a binary mask as a 4th channel in the pointmap). The loss is masked on undefined 3D coordinates. The disclosed experiments show that, using this strategy, the network can handle undefined coordinates flawlessly. The weights of the encoders and decoders may be initialized with CroCo pretrained weights. Training for 560K steps with 4 views may be performed on a single A100 GPU.
Data Augmentation. As usual, data augmentation may be applied to artificially increase the variability of the training data. As an example, the data augmentation may include applying a random 3D rotation to the pointmaps, in order to ensure a full coverage of the output distribution and achieve rotation equivariance. Additionally, RGB augmentations may be used such as color jitter, solarization, and style transfer; simple geometric transformations that preserve chirality are also allowed, like scaling, cropping, rotations or perspective augmentations, as long as they are performed similarly on the images and the pointmaps.
Win-Win. In the case of high resolution applications, training may be on N-tuples directly. However, it is possible to train the disclosed systems and methods for high resolution predictions using the Win-Win strategy (see Leroy et al., “Win-Win: Training high-resolution vision transformers from two windows”, ICLR 2023, which is incorporated herein in its entirety): at training time, each view only sees three square windows in a high resolution image, effectively decreasing the number of tokens for a more tractable training. The window selection ensures that a significant portion of the pixels observe similar 3D points, e.g., between 15 and 50%. At test time, temperatures of the softmax are quickly tuned for the expected output resolution.
After training of the diffusion model, samples may be conditioned by the observations at test time. For simplicity, the resolution may be kept fixed (224×224 in the disclosed examples), but again, it is possible to leverage the windowed masking of the Win-Win strategy for inference at higher resolution with reduced training costs.
Simplest Inference. In the simplest inference scenario, the same N-tuple size as during training is seen at test time. DDPM sampling is leveraged for faster inference: a random initialization is drawn and only 250 diffusion steps are performed (whereas 1000 steps were used during training). In the case where fewer views are available at test time, the available views may be duplicated to match the N-tuple size of training. For example, training may be performed with four views, as seen in the drawings.
More Views. When more views are available at test time, all of them are input to the disclosed network and a block masking is applied in the cross attention, such that each view only attends to the same number of views as seen during training. In practice, when trained with 4 views, the top-4 most similar views are computed using an Image Retrieval algorithm, and the complete attention matrix is sparsified via block masking, such that the number of compared tokens in the cross attention remains the same as that of training.
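A small sketch of how such a block mask could be built from retrieval scores is given below; the mask convention (True marks disallowed attention, as in PyTorch attention modules) and the top-k selection are illustrative assumptions.

```python
import torch

def build_cross_attention_mask(similarity, tokens_per_view, k=3):
    """similarity: (V, V) retrieval scores. Returns a (V*T, V*T) boolean mask where True marks
    token pairs that must NOT attend to each other (PyTorch attn_mask convention)."""
    V = similarity.shape[0]
    sim = similarity.clone().float()
    sim.fill_diagonal_(float("-inf"))                      # a view does not retrieve itself here
    topk = sim.topk(k, dim=1).indices                      # k most similar views for each view
    allowed = torch.eye(V, dtype=torch.bool)               # every view attends to its own tokens
    allowed[torch.arange(V)[:, None], topk] = True         # ...and to its retrieved views
    T = tokens_per_view
    return ~allowed.repeat_interleave(T, dim=0).repeat_interleave(T, dim=1)
```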
Incremental Reconstruction. It is also possible to perform incremental reconstruction: a scene has already been reconstructed from N views, and a new view comes in. In this case, the existing reconstruction may not be updated, and only the new view is added. At inference, the cross attention is thus modified such that only the new view queries the whole scene. This is in contrast to the scenario where the whole scene is cross-attending the whole scene. Still, the process starts from a random scene initialization. At each time step, the new view is updated with the network's prediction, and all the other views are obtained via Equation D1 of the diffusion process. This way, the prediction of the novel view is anchored on the existing scene model. When the existing model is a metric scene, it is first normalized in [α, 1−α] and the scaling factors are kept for un-normalization after prediction. α is a pre-determined constant that leaves "room" for the new view, in case it would increase the size of the bounding box of the scene.
MVS refinement. Note that scene refinement is also possible in a similar spirit. Starting from a noisy or low resolution 3D scene, the scene may be converted to a pointmap representation and blended with noise using the noising process (C1), taking t < T. It is then possible to refine the scene with a few diffusion iterations, "resuming" the diffusion process from t and not from pure Gaussian noise. This effectively anchors the reconstruction on the rough estimate.
Camera Localization. From any reconstruction, it is possible to recover the camera poses via PnP using a provided focal length f, as expected in a Visual Localization framework. This is possible since the pointmap representation associates pixel coordinates with 3D points. Matches may also be recovered between views by comparing the 3D coordinates of each pixel. Thus it is possible to extract Essential or Fundamental matrices, and even a sparse scene by only keeping 3D points that are geometrically consistent. All sub-problems of the SfM pipeline may thus be effectively solved from the direct prediction.
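For illustration, recovering a camera pose from a predicted pointmap can be sketched with OpenCV's PnP solver as follows; the centred-principal-point intrinsics and the use of all pixels as 2D-3D correspondences are simplifying assumptions.

```python
import cv2
import numpy as np

def pose_from_pointmap(pointmap, f, W, H):
    """pointmap: (H, W, 3) predicted 3D points; returns rotation (3x3) and translation (3,)."""
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pts2d = np.stack([jj, ii], axis=-1).reshape(-1, 2).astype(np.float64)  # (x, y) pixel coords
    pts3d = pointmap.reshape(-1, 3).astype(np.float64)
    K = np.array([[f, 0, W / 2], [0, f, H / 2], [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None)
    R, _ = cv2.Rodrigues(rvec)                            # world-to-camera rotation
    return R, tvec.ravel()
```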
For these experiments, training and testing occurred on a hard, synthetically generated dataset. N multi-view images of thousands of scenes are rendered using a simulator, such as the Habitat Simulator. For each N-tuple, the camera relative positions and focal lengths are randomized during the data generation process. In more detail, the Field Of View (FOV) varies between 18 and 100 degrees. The camera locations are adapted such that N-tuples reasonably overlap with each other, meaning that when focal lengths are drastically different, the relative displacement can be rather extreme compared to MVS (which most of the time consists of small displacements with the same, or similar, sensors). Even though the dataset may be purely synthetic, it is extremely challenging, due to the focal length variation, very ambiguous matching, a lack of salient features, and many planar and low-contrast regions. Training may be performed on 1M N-tuples, and testing on 1K samples of scenes unseen during training. The encoders, mixer and decoder are all ViT-Base, and a linear prediction head was used for its simplicity.
A convolutional head may improve the results, but this is not the purpose of this disclosure.
Because the reconstruction happens in a rotation- and scale-equivariant space, the prediction is re-aligned with the ground truth point cloud. Note that the latter is also expressed in the form of a pointmap, thus giving direct pixel-to-pixel correspondences. A Procrustes alignment with scale may be performed before measuring the reconstruction quality. Because it is a global alignment, any metric measured after alignment is a lower bound of the "true" metric over the whole reconstruction. Still, after visual inspection, the alignment is quite satisfactory, so the metrics are still relevant for evaluation. The prediction in normalized space is aligned to the ground truth in metric space, and Table D-1 reports the average Euclidean distance over the whole reconstruction, as well as the average view-wise median, both after Procrustes Alignment (PA), where reconstruction error is in mm, under different scenarios: the monocular, binocular and multi-view setups.
Note that these values are not directly comparable since they are not evaluated on the same number of views. These results prove that the approach is viable and that it is possible to directly reconstruct a scene from a collection of un-posed and un-calibrated cameras in a very challenging scenario.
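A similarity alignment of this kind can be computed in closed form with the classical Umeyama solution; the following NumPy sketch is an illustration and assumes the predicted and ground-truth pointmaps already provide the pixel-wise correspondences mentioned above.

```python
import numpy as np

def umeyama_alignment(X, Y):
    """Find s, R, t minimizing ||s * R @ x + t - y|| over corresponding Nx3 point sets X, Y."""
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    cov = Yc.T @ Xc / len(X)                              # cross-covariance of centred points
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                                      # guard against a reflection solution
    R = U @ S @ Vt
    var_x = (Xc ** 2).sum() / len(X)
    s = np.trace(np.diag(D) @ S) / var_x                  # optimal global scale
    t = mu_y - s * R @ mu_x
    return s, R, t
```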
Qualitative results of the C-GAR network are shown in
The predicted reconstructions obtained from the 4-view network on the test set, containing scenes and camera setups that were never observed during training, are reviewed. As depicted in
To further demonstrate the feasibility and effectiveness of the approach, the disclosed network (trained on synthetic 4-view data) is tested on real data. Several scenes of a room are reconstructed that were acquired with a handheld smartphone, downscaled and center-cropped at the network input resolution (224×224). This data is tested in two scenarios: the monocular and the multi-view setup. In the monocular case, instead of having a single image as input, the static camera case is simulated by replicating the image four times as input. The output is thus four pointmaps observed by the same camera. Interestingly, they are all very consistent and almost perfectly overlapping, as shown at 2108 in the drawings.
Stepping back. A scene representation may be expressed in a 3-dimensional (static scenes) or 4-dimensional (dynamic scenes) world, the former being a special case of the latter where the scene is constant through time. These representations can be point clouds, surface meshes, implicit forms, etc., but in all cases, a mapping between the 2D pixels and the 3D world is found in order to build them.
Coarsely put, geometric vision tasks include studying the mapping between 2D pixels and the world representation. For instance, static 3D reconstruction with depth maps aims at finding a one-to-one mapping between pixels and the observed surface points. In this case the surface is the interface of a binary indicator in 3D space distinguishing whether space is transparent (empty) or opaque (matter to be found). This mapping idea extends to novel-view synthesis, where the mapping is now one-to-many: each pixel is associated with a ray in 3D space, and its appearance is the integration of scene colors weighted by the transparency, e.g., the relaxed equivalent of binary occupancy. Visual Localization, which includes recovering camera poses in a mapped environment, can also be expressed in the form of a one-to-many mapping to be found. While it may be cast as a method to recover relative poses (rotation and translation) with a known camera model (usually pinhole), it is possible to reformulate it as the task of finding, for the 2D pixels in the image, the set of 3D points seen by each pixel. Because light travels in straight lines at human scale, it amounts to finding the parameters of the rays that traverse the scene representation. Note, however, that this is only valid for static scenes.
The disclosed philosophy. MVS may involve estimates of the camera parameters. Curiously, recovering approximate camera parameters in MVS involves solving a reconstruction problem (SfM) to obtain the camera parameters, which are then used to solve the reconstruction problem. Also, because these camera models are theoretical formulations, they may never be perfectly accurate, even when including more elaborate distortion parameters. This means 3D reconstruction is a highly non-convex optimization problem solved using inexact camera model constraints. In this Section, it is shown that such constraints can be relaxed and that the maximal problem can be tackled as a whole. By not enforcing any camera model, the network learns robust priors about the image formation process (including complex distortions) while being able to leverage 2D priors due to the convolutional nature of the architecture, and 3D priors due to the global attention mechanisms. Again, the disclosed approach is robust to pure camera rotation, still cameras and even monocular scenes, meaning that this Section effectively bridges monocular and multi-view reconstruction in a common framework.
Broader Impact. The present disclosure details significant steps toward a foundation model for geometric tasks: an input collection of images is used to build an explicit world representation. The fact that the output point clouds roughly respect the laws of projective geometry means that the network discovered a set of soft rules about the world, purely from the properties of the training data. Importantly for a foundation model, it is possible to extract from this output representation all the usual geometric measurements (i.e., downstream tasks). For instance, camera pose, depth, normals, 3D shape, and edges may be recovered. Adding RGB color to the pointmaps could even allow for view rendering via Gaussian Splatting approaches, if ever needed. In the context of autonomous navigation though, the explicit nature of this model may be explored: a well behaved navigating robot is expected to navigate properly in a complex environment, and the user might not care about the quality of its pixel matching algorithm, or the geometric accuracy of its internal representation. For a capable agent, an explicit 3D model might not be the end result, but rather yet another intermediate step towards autonomous navigation. As for this Section, the intermediate representation might not need to be explicit, and could instead be implicitly encoded in the network's representation. Yet, being able to output an explicit world model from the pretrained CroCo implicit representation is a clear indicator that the learned representation and matching mechanisms are relevant for geometric tasks. As another proof of CroCo's performance, the same approach with a random initialization struggles to converge and performs significantly worse.
Limitations. Predictions made herein live in a normalized 3D world, so there is no notion of scale. The disclosed systems and methods may also have a quadratic dependency with respect to the number of views. This may become problematic for real-world navigation, e.g., in SLAM scenarios. Of course, Image Retrieval (IR) and/or keyframes can be relied on to sparsify the dense attention graph. A possible direction to overcome this problem would be to work with transformer networks that could access and update a fixed-size memory. With each incoming observation, the network would build or update an implicit world model, with fixed computational and memory complexity. The scene reconstruction would then be extracted from this representation, if ever needed. This approach would only have a linear complexity with respect to the number of frames, whereas the disclosed approach without retrieval has a quadratic dependency on the number of views.
A large number of input images, however, yields an even larger number of possible unique image pairs (quadratic in the number of images). This makes the computational cost of SfM high.
According to the present application, a pairing module 4708 pairs sets of features for images and inputs the pairs to a similarity and filtering module 4712. The pairing module 4708 may determine and input each possible unique pair to the similarity and filtering module 4712.
The similarity and filtering module 4712 determines a similarity (e.g., value) between each pair received. The similarity may increase as closeness between the features of the respective input images increases and vice versa. The similarity and filtering module 4712 may determine the similarity, for example, using a cosine similarity (e.g., dot product) between the feature vectors of the pair or in another suitable manner.
For each input image (first image), the similarity and filtering module 4712 determines the Y most similar input images to that first image based on the similarities for the pairs with the first image. Y is an integer greater than or equal to 1. For example, the similarity and filtering module 4712 may determine the Y most similar images to the first image based on those pairs having the Y highest similarity scores when paired with the first image. In various implementations, the similarity and filtering module 4712 may determine the Y most similar images, for example, using nearest neighbor searching as discussed herein. The similarity and filtering module 4712 discards the other pairs with the first input image and proceeds in the same manner with each input image taken as the first image. The similarity and filtering module 4712 provides the Y most similar images to each first image (and not the other pairs) to a graphing module 4716.
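The following NumPy sketch illustrates this kind of cosine-similarity pairing and top-Y filtering; the descriptor shapes and the symmetric treatment of pairs are illustrative assumptions.

```python
import numpy as np

def top_y_pairs(features, Y):
    """features: (N, d) per-image descriptors. Returns the set of retained undirected pairs."""
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = F @ F.T                                          # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)                         # exclude self-pairs
    pairs = set()
    for n in range(len(F)):
        for m in np.argsort(-sim[n])[:Y]:                  # Y most similar images to image n
            pairs.add((min(n, int(m)), max(n, int(m))))    # store as an undirected pair
    return pairs
```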
The graphing module 4716 generates a sparse scene graph from the similar pairs. An example sparse scene graph is illustrated by 4604 in the drawings.
The present application involves a network that may be referred to as MASt3R-SfM, a fully-integrated SfM pipeline that can handle completely unconstrained input image collections, ranging from a single view to large-scale scenes, possibly without any information on camera motion. The network builds upon DUSt3R and more particularly on MASt3R that is able to perform local 3D reconstruction and matching in a single forward pass.
Since MASt3R processes each unique image pair, it may computationally scale poorly to large image collections. The network of this patent application uses the frozen MASt3R encoder (frozen being indicated by a snowflake in the drawings) to generate image features, as discussed further below.
The SfM optimization is carried out by the network in two successive gradient descents based on frozen local reconstructions output by MASt3R: first, using a matching loss in 3D space; then with a 2D reprojection loss to refine the previous estimate. Interestingly, the systems and methods described herein go beyond SfM, as they work even when there is no motion (the purely rotational case).
In summary, the examples of
As described above, SfM involves matching and Bundle Adjustment (BA). Matching involves the task of finding pixel correspondences across different images observing the same 3D points. Matching builds the basis to formulate a loss function to minimize during BA. BA aims at minimizing reprojection errors for the correspondences extracted during the matching phase by jointly optimizing the positions of 3D points and camera parameters. It may be expressed as a non-linear least squares problem. By triangulating 3D points to provide an initial estimate for BA, a scene may be built incrementally, adding images one by one by formulating hypotheses and discarding the ones that are not verified by the current scene state. Due to the large number of outliers, and the fact that the structure of other pipelines propagates errors rather than fixes them, robust estimators like RANSAC may be used for relative pose estimation, keypoint track construction and multi-view triangulation. The architecture of the present application, however, makes the use of RANSAC unnecessary.
Matching, as used in other SfM options, has a quadratic complexity which becomes prohibitive for large image collections. The present application, however, involves comparing the image features of pairs as described above. Image matching is cascaded in two steps: first, a coarse but fast comparison is carried out between all possible pairs (e.g., by computing the similarity between global image descriptors/features as discussed above), and for image pairs that are similar enough (e.g., similarity greater than a threshold), a second stage of keypoint matching is then carried out. This is much faster and scalable. The frozen MASt3R encoder(s) are used to generate features, considering the (token) features as local features and directly performing efficient retrieval, such as with Aggregated Selective Match Kernels (ASMK), by Tolias, Avrithis, and Jégou, 2013, which is incorporated herein in its entirety.
The MASt3R model, given two input images I_n, I_m ∈ ℝ^{H×W×3}, performs joint local 3D reconstruction and pixel-wise matching as discussed above. As discussed above, it is assumed here for simplicity that all images have the same pixel resolution W×H, but the present application is also applicable to images of different resolutions.
MASt3R can be viewed as the composition of two functions, Enc(·) and Dec(·, ·), where Enc(I) → F denotes the Siamese ViT encoder that represents image I as a feature map of dimension d, width w and height h, F ∈ ℝ^{h×w×d}, and Dec(F^n, F^m) denotes twin ViT decoders that regress pixel-wise pointmaps X and local features D for each image, as well as their respective corresponding confidence maps. These outputs intrinsically include rich geometric information from the scene, to the extent that camera intrinsics and (metric) depthmaps can straightforwardly be recovered from the pointmap. Likewise, sparse correspondences (or matches) can be recovered by application of the fastNN algorithm described in Vincent Leroy, et al., Grounding Image Matching in 3D with MASt3R, ECCV, 2024, which is incorporated herein in its entirety, with the regressed local feature maps D^n, D^m. More specifically, fastNN searches for a subset of reciprocal correspondences from two feature maps D^n and D^m by initializing seeds on a regular pixel grid and iteratively converging to mutual correspondences. These correspondences between I_n and I_m are denoted as C^{n,m} = {(y_c^n, y_c^m)}, where y_c^n, y_c^m ∈ ℕ^2 denote a pair of matching pixels.
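As a rough illustration of the idea (not the actual fastNN implementation), the following sketch seeds a regular grid in one feature map and keeps a match only if it is a mutual nearest neighbour; the single back-check, the grid spacing and the dot-product similarity are simplifying assumptions.

```python
import numpy as np

def reciprocal_matches(Dn, Dm, grid=8):
    """Dn, Dm: (H, W, d) local feature maps. Returns a list of matching ((i, j), (i', j')) pixels."""
    H, W, d = Dn.shape
    seeds = [(i, j) for i in range(grid // 2, H, grid) for j in range(grid // 2, W, grid)]
    Fn = Dn.reshape(-1, d)
    Fm = Dm.reshape(-1, d)
    matches = []
    for (i, j) in seeds:
        q = Dn[i, j]
        best_m = int(np.argmax(Fm @ q))                   # nearest neighbour in image m
        back = int(np.argmax(Fn @ Fm[best_m]))            # its nearest neighbour back in image n
        if back == i * W + j:                             # keep only mutual correspondences
            matches.append(((i, j), (best_m // W, best_m % W)))
    return matches
```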
Given an unordered collection of N images {I_n}_{1≤n≤N} of a static 3D scene, captured with respective cameras defined by intrinsic parameters K_n ∈ ℝ^{3×3} (calibration in terms of focal length and principal point) and world-to-camera poses P_n ∈ ℝ^{4×4}, a goal is to recover all camera parameters {(K_n, P_n)} as well as the underlying 3D scene geometry {X^n}, with X^n ∈ ℝ^{W×H×3} a pointmap relating each pixel y = (i, j) of I_n to its corresponding 3D point X^n_{i,j} in the scene, expressed in a world coordinate system.
The largescale 3D reconstruction approach is illustrated in
The network first aims at spatially relating scene objects seen under different viewpoints. The present application feeds a small but sufficient subset of all possible pairs to the graphing module 4716, which forms a scene graph. Formally, the scene graph (V, ε) is a graph where each vertex I ∈ V is an image, and each edge e = (n, m) ∈ ε is an undirected connection between two likely-overlapping images I_n and I_m. Importantly, the scene graph has a single connected component, i.e., all images are (perhaps indirectly) linked together.
Image retrieval. To select the right subset of pairs, the similarity and filtering module 4712 uses a scalable pairwise image matcher h(I_n, I_m) → s, able to predict the approximate co-visibility score s ∈ [0,1] between two images I_n and I_m. This may be done using the encoder(s) Enc(·). The encoder module(s) 4704, the pairing module 4708, and the similarity and filtering module 4712 may be implemented within the encoder in various implementations. The encoder, due to its role of laying foundations for the decoder, is implicitly trained for image matching. To that aim, the ASMK image retrieval method may be used, considering the token features output by the encoder as local features. Generally speaking, the output F of the encoder can be considered as a bag of local features, and the encoder may apply feature whitening, quantize the features according to a codebook previously obtained by k-means clustering, then aggregate and binarize the residuals for each codebook element, thus yielding high-dimensional sparse binary representations. The ASMK similarity between two image representations can be computed by summing a (e.g., small) kernel function on the binary representations over the common codebook elements. Note that this method is training-free, only involving the determination of the whitening matrix and the codebook once from a representative set of features. In various implementations, a projector may be included in the encoder on top of the encoder features following the HOW approach described in Giorgos Tolias, et al., Learning and Aggregating Deep Local Descriptors For Instance-Level Recognition, ECCV, 2020, which is incorporated herein in its entirety.
The output from the retrieval step may be a similarity matrix S ∈ [0,1]^{N×N}.
Graph construction. To get a small number of pairs while still ensuring a single connected component, the graphing module 4716 may build the graph as follows. The graphing module 4716 may first select a fixed number Nα of key images (or keyframes), such as using farthest point sampling (FPS) based on S. FPS is described in Yuval Eldar, et al., The Farthest Point Strategy for Progressive Image Sampling, ICPR, 1994, which is incorporated herein in its entirety.
These keyframes provide a core set of nodes and are densely connected together. All remaining images are then connected by the graphing module 4716 to their closest keyframe as well as their k nearest neighbors according to S. The graph includes O(N_α² + (k+1)N) = O(N) << O(N²) edges, which is linear in the number of images N. In various implementations, N_α = 20 and k = 10, although the present application is also applicable to other suitable values. While the retrieval step has quadratic complexity in theory, it is extremely fast and scalable in practice.
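The following NumPy sketch illustrates this graph construction, selecting keyframes by farthest point sampling on the similarity matrix, densely connecting them, and then attaching every remaining image to its closest keyframe and its k nearest neighbours; treating (1 − similarity) as a distance is an assumption of the sketch.

```python
import numpy as np

def build_scene_graph(S, num_key=20, k=10):
    """S: (N, N) similarity matrix in [0, 1]. Returns a set of undirected edges (n, m)."""
    N = len(S)
    dist = 1.0 - S                                        # treat dissimilarity as a distance
    key = [0]
    while len(key) < min(num_key, N):                     # farthest point sampling of keyframes
        d_to_key = dist[:, key].min(axis=1)
        d_to_key[key] = -1
        key.append(int(np.argmax(d_to_key)))
    edges = {(min(a, b), max(a, b)) for a in key for b in key if a != b}
    for n in (n for n in range(N) if n not in key):
        closest_key = key[int(np.argmax(S[n, key]))]      # attach to the closest keyframe
        edges.add((min(n, closest_key), max(n, closest_key)))
        for m in np.argsort(-S[n])[:k]:                   # ...and to the k nearest neighbours
            if m != n:
                edges.add((min(n, int(m)), max(n, int(m))))
    return edges
```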
The inference of the MASt3R decoder is executed for every pair e = (n, m) ∈ ε, yielding raw pointmaps and sparse pixel matches C^{n,m}. Since MASt3R may be order-dependent in terms of its input, C^{n,m} may be the union of correspondences obtained by running both f(I_n, I_m) and f(I_m, I_n). Doing so also obtains pointmaps X^{n,n}, X^{m,n}, X^{m,m} and X^{n,m}, where X^{n,m} ∈ ℝ^{H×W×3} denotes a 2D-to-3D mapping from pixels of image I_n to 3D points in the coordinate system of image I_m. Since the encoder features {F^n}_{n=1...N} have already been extracted and cached during scene graph construction, only the ViT decoder Dec(·, ·) needs to be executed, which substantially saves time and increases computational efficiency.
Canonical pointmaps. The network estimates an initial depthmap Z^n and camera intrinsics K_n for each image I_n. These can be recovered from a raw pointmap X^{n,n}, such as described in S. Wang et al., DUSt3R: Geometric 3D Vision Made Easy, CVPR, 2024, which is incorporated herein in its entirety. However, each pair (n, ·) or (·, n) ∈ ε may yield its own estimate of X^{n,n}. To average out regression imprecision, the decoder may aggregate these pointmaps into a canonical pointmap X̃^n. Let ε_n = {e | e ∈ ε ∧ n ∈ e} be the set of all edges connected to image I_n. For each edge e ∈ ε_n, there is a different estimate of X^{n,n} and its respective confidence map C^{n,n}, which are denoted as X^{n,e} and C^{n,e} in the following. The decoder may determine the canonical pointmap as a per-pixel weighted average of all estimates:

X̃^n_{i,j} = (Σ_{e∈ε_n} C^{n,e}_{i,j} X^{n,e}_{i,j}) / (Σ_{e∈ε_n} C^{n,e}_{i,j}).

From it, the decoder can recover the canonical depthmap Z̃^n = X̃^n_{:,:,3} and the focal length, such as using the Weiszfeld algorithm as described in S. Wang et al., DUSt3R: Geometric 3D Vision Made Easy, CVPR, 2024, which, assuming a centered principal point and square pixels, yields the canonical intrinsics K̃_n. A pinhole camera model without lens distortion may be assumed, but the present application is also applicable to other camera types.
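As an illustration of these two steps, the following NumPy sketch aggregates per-pair estimates into a canonical pointmap by a confidence-weighted average and estimates a focal length under a centred-principal-point assumption; the median-based focal estimator stands in for the Weiszfeld iterations of the cited work and is an assumption of this sketch.

```python
import numpy as np

def canonical_pointmap(pointmaps, confidences):
    """pointmaps: list of (H, W, 3); confidences: list of (H, W). Confidence-weighted average."""
    num = sum(c[..., None] * x for x, c in zip(pointmaps, confidences))
    den = sum(confidences)[..., None] + 1e-8
    return num / den

def estimate_focal(canonical, W, H):
    """Estimate f assuming a centred principal point: u = f * X / Z, v = f * Y / Z."""
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    u, v = jj - W / 2.0, ii - H / 2.0
    X, Y, Z = canonical[..., 0], canonical[..., 1], canonical[..., 2]
    with np.errstate(divide="ignore", invalid="ignore"):
        fx = u * Z / X
        fy = v * Z / Y
    f = np.nanmedian(np.concatenate([fx[np.abs(X) > 1e-6], fy[np.abs(Y) > 1e-6]]))
    return float(f)
```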
Constrained pointmaps. Camera intrinsics K, extrinsics P and depthmaps Z will serve as basic ingredients (or rather, optimization variables) for the global reconstruction phase performed by the 3D reconstruction module 310. Let π_n denote the reprojection function onto the camera screen of I_n, with π_n(x) = K_n P_n σ_n x for a 3D point x ∈ ℝ^3, where σ_n > 0 is a per-camera scale factor. In various implementations, scaled rigid transformations may be used. To ensure that pointmaps satisfy the pinhole projective model (they may otherwise be over-parameterized), a constrained pointmap χ^n ∈ ℝ^{H×W×3} may be defined explicitly as a function of K_n, P_n, σ_n and Z_n. Formally, the 3D point χ^n_{i,j} seen at pixel (i, j) of image I_n is defined using inverse reprojection as χ^n_{i,j} = π_n^{−1}(i, j, Z^n_{i,j}).
DUSt3R introduced a global alignment procedure aiming to rigidly move dense pointmaps in a world coordinate system based on pairwise relationships between them. In this application, this procedure is simplified and made computationally more efficient by taking advantage of pixel correspondences, thereby reducing the overall number of parameters and its memory and computational footprint.
Specifically, the global aligner 407 determines the scaled rigid transformations σ*, P* of every canonical pointmap χ^n = π_n^{−1}(σ_n, K̃_n, P_n, Z̃_n) (fixing intrinsics K = K̃ and depth Z = Z̃ to their canonical values) such that any pair of matching 3D points gets as close as possible:

σ*, P* = argmin_{σ, P} Σ_{(n,m)∈ε} Σ_{c∈C^{n,m}} q_c ‖χ^n_{y_c^n} − χ^m_{y_c^m}‖,

where c denotes the matching pixels in each respective image by a slight abuse of notation. In contrast to the global alignment procedure in DUSt3R, this minimization only applies to sparse pixel correspondences y_c^n ↔ y_c^m, weighted by their respective confidence q_c. To avoid degenerate solutions, the global aligner module 407 enforces a normalization of the scale factors, e.g., by reparameterizing them with a softmax as described above so that they are positive and sum to one. The global aligner module 407 may minimize this objective using the Adam optimizer for a fixed number v_1 of iterations.
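A compact PyTorch sketch of such a coarse alignment loop is shown below; the quaternion pose parameterization, the softmax scale normalization and the data layout of the matches are illustrative assumptions rather than the actual implementation.

```python
import torch

def quat_to_rot(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    q = q / q.norm()
    w, x, y, z = q
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z), 2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z), 1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y), 2*(y*z + w*x), 1 - 2*(x*x + y*y)])])

def coarse_align(matches, N, iters=300, lr=0.07):
    """matches: list of (n, m, Xn, Xm, q) with Xn, Xm (P, 3) matched canonical 3D points (torch
    tensors) and q (P,) confidences. Optimizes a pose and scale per image with Adam."""
    quats = torch.nn.Parameter(torch.tensor([[1.0, 0, 0, 0]] * N))
    trans = torch.nn.Parameter(torch.zeros(N, 3))
    sigma_raw = torch.nn.Parameter(torch.zeros(N))
    opt = torch.optim.Adam([quats, trans, sigma_raw], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        sigma = torch.softmax(sigma_raw, dim=0)            # positive scales summing to one
        loss = 0.0
        for n, m, Xn, Xm, q in matches:
            Pn = sigma[n] * (Xn @ quat_to_rot(quats[n]).T + trans[n])
            Pm = sigma[m] * (Xm @ quat_to_rot(quats[m]).T + trans[m])
            loss = loss + (q * (Pn - Pm).norm(dim=1)).sum()
        loss.backward()
        opt.step()
    return quats.detach(), trans.detach(), torch.softmax(sigma_raw.detach(), dim=0)
```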
Coarse alignment converges well and fast in practice, but may be restricted to rigid motion of the canonical pointmaps. Unfortunately, pointmaps may be noisy due to depth ambiguities during local reconstruction. To further refine cameras and scene geometry, the global aligner 407 may perform a second global optimization, similar to bundle adjustment, with gradient descent for v_2 iterations, starting from the coarse solution σ*, P* obtained above. In other words, the global aligner module 407 may minimize the 2D reprojection error of 3D points in all cameras, with ρ: ℝ^2 → ℝ^+ a robust error function able to deal with potential outliers among all extracted correspondences. In various implementations, ρ(x) = ‖x‖^λ may be used.
Optimizing each 3D point independently may have little effect, possibly because sparse pixel correspondences C^{m,n} rarely overlap exactly across several pairs. As an illustration, two correspondences y^m ↔ y^n_{i,j} and y^n_{i+1,j} ↔ y^l from image pairs (m, n) and (n, l) would independently optimize the two 3D points χ^n_{i,j} and χ^n_{i+1,j}, possibly moving them very far apart despite this being very unlikely as (i, j) ≃ (i+1, j). SfM may resort to forming point tracks, which is relatively straightforward with keypoint-based matching. In the present application, the global aligner module 407 forms pseudo-tracks by creating anchor points and rigidly tying together every pixel with its closest anchor point. This way, correspondences that do not overlap exactly are still both tied to the same anchor point with a high probability. Formally, anchor points are defined on a regular pixel grid spaced by δ pixels, ẏ_{u,v} = (δu + δ/2, δv + δ/2). The global aligner module 407 may then tie each pixel (i, j) in I_n with its closest anchor ẏ_{u,v} at coordinate (u, v) = (⌊i/δ⌋, ⌊j/δ⌋). Concretely, the global aligner module 407 may index the depth value at pixel (i, j) to the depth value Ż_{u,v} of its anchor point, and Z_{i,j} = o_{i,j} Ż_{u,v} may be defined, where o_{i,j} is a constant relative depth offset calculated at initialization from the canonical depthmap Z̃ (the ratio between the canonical depth at pixel (i, j) and the canonical depth at its anchor). Here, it may be assumed that canonical depthmaps are locally accurate. All in all, optimizing a depthmap Z^n ∈ ℝ^{W×H} thus may come down to optimizing a reduced set of anchor depth values Ż^n (reduced by a factor of 64 if δ = 8).
When building the sparse scene graph, in various implementations N_α = 20 key images and k = 10 non-keyframe nearest neighbors may be used. In various implementations, a grid spacing of δ = 8 pixels may be used for extracting sparse correspondences with FastNN and for defining anchor points. For the two gradient descents, in various implementations the Adam optimizer may be used with a learning rate of 0.07 (resp. 0.014) for v_1 = 300 iterations and λ_1 = 1.5 (resp. v_2 = 300 and λ_2 = 0.5) for the coarse (respectively refinement) optimization, each time with a cosine learning rate schedule and without weight decay. While examples are provided, the present application is also applicable to other examples. Shared intrinsics may be assumed for all cameras, and a single shared per-scene focal parameter may be optimized.
Other methods may crash when dealing with large sets of input images due to insufficient memory despite using 80 GB GPUs. Regardless, even choosing sets of input images that do not cause crashing, the present application performs better than other methods.
As described above, the graphing module may generate the scene graph based on the similarity matrix. Generating the scene graph includes building a small but complete graph of keyframes, and then connecting each image with its closest keyframe and with its k nearest non-keyframes. In various implementations, k = 13 may be used to compensate for missing edges. In various implementations, the scene graph can be generated using only the keyframes (k = 0). Generating the scene graph to include both short-range (k-NN) and long-range (keyframe) connections provides high performance.
Discussed above is the use of ASMK on the token features output from the MASt3R encoder, after applying whitening. In various implementations, a global descriptor representation per image may be used with a cosine similarity between image representations, as also discussed above. As discussed above, in various implementations, a projector is learned on top of the frozen MASt3R encoder features with ASMK, following an approach similar to HOW for training.
In various implementations, the whitening may include PCA-whitening. In various implementations, the training module may train the examples of
In various implementations, the optimization of anchor depth values (fixing depth to the canonical depthmaps) may be disabled. This may improve performance.
Regarding generating the scene graphs, increasing the number of key images (Nα) or nearest neighbors (k) may improve performance. The improvements however may saturate above Nα≥20 or k≥10.
A good parameterization of cameras can accelerate convergence. Above, a camera is described classically by its intrinsic and extrinsic parameters (K_n, P_n), where

K_n = [[f_n, 0, c_n^x], [0, f_n, c_n^y], [0, 0, 1]] and P_n = [[R_n, t_n], [0, 1]].

Here, f_n > 0 denotes the camera focal length, (c_n^x, c_n^y) is the optical center, R_n ∈ ℝ^{3×3} is a rotation matrix typically represented internally as a quaternion q_n ∈ ℝ^4, and t_n ∈ ℝ^3 is a translation.
Camera parameterization. During optimization, 3D points are constructed by the global aligner module 407 using the inverse reprojection function π−1(⋅) as a function of the camera intrinsics Kn, extrinsics Pn, pixel coordinates and depthmaps Zn. Small changes in the extrinsics however can induce larger changes in the reconstructed 3D points. For example, small noise on the rotation Rn can result in a potentially large absolute motion of 3D points, motion whose amplitude would be proportional to the points' distance to camera (their depth).
The present application may therefore reparameterize cameras so as to better balance the variations between camera parameters and 3D points. To do so, the global aligner module 407 may switch (or change) the camera rotation center from the optical center to a point 'in the middle' of the 3D point cloud generated by this camera, such as at the intersection of the z vector from the camera center and the median depth plane, or within a predetermined distance of the median depth plane. In more detail, the global aligner module 407 may determine the extrinsics P_n using a fixed post-translation T_n ∈ ℝ^{4×4} on the z-axis, as P_n = T_n P′_n, where the translation amount of T_n is the median canonical depth for image I_n modulated by the ratio of the current focal length f_n to the canonical focal length f̃_n, and P′_n is again parameterized as a quaternion and a translation. This way, rotation and translation noise are naturally compensated and have much less impact on the positions of the reconstructed 3D points.
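The following small sketch composes such a fixed post-translation with an optimized pose; the sign convention and the way the median depth is supplied are assumptions made for illustration.

```python
import numpy as np

def reparameterized_extrinsics(P_prime, median_canonical_depth, f, f_canonical):
    """P_prime: (4, 4) optimized world-to-camera pose; returns the effective extrinsics P_n."""
    z_bar = median_canonical_depth * (f / f_canonical)    # depth modulated by the focal ratio
    T = np.eye(4)
    T[2, 3] = z_bar                                       # fixed post-translation on the z-axis
    return T @ P_prime                                    # P_n = T_n @ P'_n
```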
Kinematic chain. A second source of potentially undesirable correlations between camera parameters stems from the intricate relationship between overlapping viewpoints. If two views overlap, then modifying the position or rotation of one camera will most likely also result in a similar modification of the second camera, since the modification will impact the 3D points shared by both cameras. Thus, instead of representing all cameras independently, the present application involves expressing the cameras relative to each other using a kinematic chain. This naturally conveys the idea that modifying one camera will impact the other cameras by design. In practice, the global aligner module 407 defines a kinematic tree over all cameras, consisting of a single root node r and a set of directed edges (n → m), with N − 1 edges since it is a tree. The pose of all cameras is then computed in sequence, starting from the root, by chaining the relative pose attached to each edge: the pose of camera m is obtained from the pose of its parent n as P_m = P_{n→m} P_n. Different strategies may be used to build the kinematic tree, as shown in the drawings. The scene graph and the kinematic tree may share no relation other than being defined over the same set of nodes.
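A minimal sketch of this pose chaining is given below; the tree is represented as a parent map and the relative-pose convention P_m = P_{n→m} P_n follows the description above, while the data structures themselves are illustrative.

```python
def compose_poses(root_pose, parents, relative_poses):
    """parents: dict child -> parent (root maps to None); relative_poses: dict child -> (4, 4)
    pose relative to its parent. Returns a dict of absolute world-to-camera (4, 4) poses."""
    poses = {}

    def pose_of(n):
        if n in poses:
            return poses[n]
        if parents[n] is None:
            poses[n] = root_pose                          # the root pose anchors the chain
        else:
            poses[n] = relative_poses[n] @ pose_of(parents[n])   # P_m = P_{n->m} @ P_n
        return poses[n]

    for n in parents:
        pose_of(n)
    return poses
```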
The above regarding
The components described herein functionally may be referred to as modules. For example, an encoder may be referred to as an encoder module, a decoder may be referred to as a decoder module, etc.
The foregoing description is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. The broad teachings of the disclosure can be implemented in a variety of forms. Therefore, while this disclosure includes particular examples, the true scope of the disclosure should not be so limited since other modifications will become apparent upon a study of the drawings, the specification, and the following claims. It should be understood that one or more steps within a method may be executed in different order (or concurrently) without altering the principles of the present disclosure. Further, although each of the embodiments is described above as having certain features, any one or more of those features described with respect to any embodiment of the disclosure can be implemented in and/or combined with features of any of the other embodiments, even if that combination is not explicitly described. In other words, the described embodiments are not mutually exclusive, and permutations of one or more embodiments with one another remain within the scope of this disclosure.
Spatial and functional relationships between elements (for example, between modules, circuit elements, semiconductor layers, etc.) are described using various terms, including “connected,” “engaged,” “coupled,” “adjacent,” “next to,” “on top of,” “above,” “below,” and “disposed.” Unless explicitly described as being “direct,” when a relationship between first and second elements is described in the above disclosure, that relationship can be a direct relationship where no other intervening elements are present between the first and second elements, but can also be an indirect relationship where one or more intervening elements are present (either spatially or functionally) between the first and second elements. As used herein, the phrase at least one of A, B, and C should be construed to mean a logical (A OR B OR C), using a non-exclusive logical OR, and should not be construed to mean “at least one of A, at least one of B, and at least one of C.”
In the figures, the direction of an arrow, as indicated by the arrowhead, generally demonstrates the flow of information (such as data or instructions) that is of interest to the illustration. For example, when element A and element B exchange a variety of information but information transmitted from element A to element B is relevant to the illustration, the arrow may point from element A to element B. This unidirectional arrow does not imply that no other information is transmitted from element B to element A. Further, for information sent from element A to element B, element B may send requests for, or receipt acknowledgements of, the information to element A.
In this application, including the definitions below, the term “module” or the term “controller” may be replaced with the term “circuit.” The term “module” may refer to, be part of, or include: an Application Specific Integrated Circuit (ASIC); a digital, analog, or mixed analog/digital discrete circuit; a digital, analog, or mixed analog/digital integrated circuit; a combinational logic circuit; a field programmable gate array (FPGA); a processor circuit (shared, dedicated, or group) that executes code; a memory circuit (shared, dedicated, or group) that stores code executed by the processor circuit; other suitable hardware components that provide the described functionality; or a combination of some or all of the above, such as in a system-on-chip.
The module may include one or more interface circuits. In some examples, the interface circuits may include wired or wireless interfaces that are connected to a local area network (LAN), the Internet, a wide area network (WAN), or combinations thereof. The functionality of any given module of the present disclosure may be distributed among multiple modules that are connected via interface circuits. For example, multiple modules may allow load balancing. In a further example, a server (also known as remote, or cloud) module may accomplish some functionality on behalf of a client module.
The term code, as used above, may include software, firmware, and/or microcode, and may refer to programs, routines, functions, classes, data structures, and/or objects. The term shared processor circuit encompasses a single processor circuit that executes some or all code from multiple modules. The term group processor circuit encompasses a processor circuit that, in combination with additional processor circuits, executes some or all code from one or more modules. References to multiple processor circuits encompass multiple processor circuits on discrete dies, multiple processor circuits on a single die, multiple cores of a single processor circuit, multiple threads of a single processor circuit, or a combination of the above. The term shared memory circuit encompasses a single memory circuit that stores some or all code from multiple modules. The term group memory circuit encompasses a memory circuit that, in combination with additional memories, stores some or all code from one or more modules.
The term memory circuit is a subset of the term computer-readable medium. The term computer-readable medium, as used herein, does not encompass transitory electrical or electromagnetic signals propagating through a medium (such as on a carrier wave); the term computer-readable medium may therefore be considered tangible and non-transitory. Non-limiting examples of a non-transitory, tangible computer-readable medium are nonvolatile memory circuits (such as a flash memory circuit, an erasable programmable read-only memory circuit, or a mask read-only memory circuit), volatile memory circuits (such as a static random access memory circuit or a dynamic random access memory circuit), magnetic storage media (such as an analog or digital magnetic tape or a hard disk drive), and optical storage media (such as a CD, a DVD, or a Blu-ray Disc).
The apparatuses and methods described in this application may be partially or fully implemented by a special purpose computer created by configuring a general purpose computer to execute one or more particular functions embodied in computer programs. The functional blocks, flowchart components, and other elements described above serve as software specifications, which can be translated into the computer programs by the routine work of a skilled technician or programmer.
The computer programs include processor-executable instructions that are stored on at least one non-transitory, tangible computer-readable medium. The computer programs may also include or rely on stored data. The computer programs may encompass a basic input/output system (BIOS) that interacts with hardware of the special purpose computer, device drivers that interact with particular devices of the special purpose computer, one or more operating systems, user applications, background services, background applications, etc.
The computer programs may include: (i) descriptive text to be parsed, such as HTML (hypertext markup language), XML (extensible markup language), or JSON (JavaScript Object Notation), (ii) assembly code, (iii) object code generated from source code by a compiler, (iv) source code for execution by an interpreter, (v) source code for compilation and execution by a just-in-time compiler, etc. As examples only, source code may be written using syntax from languages including C, C++, C#, Objective-C, Swift, Haskell, Go, SQL, R, Lisp, Java®, Fortran, Perl, Pascal, Curl, OCaml, Javascript®, HTML5 (Hypertext Markup Language 5th revision), Ada, ASP (Active Server Pages), PHP (PHP: Hypertext Preprocessor), Scala, Eiffel, Smalltalk, Erlang, Ruby, Flash®, Visual Basic®, Lua, MATLAB, SIMULINK, and Python®.
Number | Date | Country | Kind |
---|---|---|---|
2314650 | Dec 2023 | FR | national |
24305954 | Jun 2024 | EP | regional |
This application claims the benefit of U.S. Provisional Application No. 63/559,062, filed on Feb. 28, 2024, U.S. Provisional Application No. 63/633,125, filed on Apr. 12, 2024, U.S. Provisional Application No. 63/658,294, filed on Jun. 10, 2024, and U.S. Provisional Application No. 63/700,101, filed on Sep. 27, 2024. This application also claims the benefit of French Application No. 2314650, filed on Dec. 20, 2023, and European Application No. 24305954, filed on Jun. 17, 2024. The entire disclosures of the applications referenced above are incorporated herein by reference.
Number | Date | Country
---|---|---
63559062 | Feb 2024 | US
63633125 | Apr 2024 | US
63658294 | Jun 2024 | US
63700101 | Sep 2024 | US