This invention is related to the field of 3D vision which collects technologies and systems devoted to the extraction of a 3D geometrical model from a scene. More particularly, this invention relates to the extraction of feature correspondences from multiple images.
Feature matching among multiple images, which attempts to extract correspondences between feature points in distinct and spatially separated images, is a typical and important technique for almost every 3D vision mechanism, e.g. 3D reconstruction, motion estimation, etc. Despite the variety of 3D reconstruction systems, which differ in their specific assumptions, aims and scenarios, the acquisition of a set of feature points from multiple images and the establishment of feature correspondences therefrom are essential tasks at the early stage. However, the detection of the feature correspondences might be difficult and inefficient, which results in a high ratio of false detections (outliers).
In response to the problem, the random sample consensus (RANSAC) framework has been proposed and is nowadays integrated in most 3D reconstruction mechanisms. RANSAC is an iterative method to estimate parameters of a mathematical model from a set of observed data containing outliers [I]. It is a non-deterministic algorithm which achieves a predefined level of performance with a certain probability, and this probability increases as more iterations are allowed. Several refinements of RANSAC have been proposed, especially aiming at the problems arising in the field of computer vision, and have become standards for geometrical model estimation [II, III, IV]. Nevertheless, the iterative nature of RANSAC makes it a time-consuming method, and it is sensitive to an increasing ratio of outliers in the input data sample. Moreover, although RANSAC methods are widely used for the estimation of geometrical models between two images, they have not yet been successfully employed in a complete multi-view context. Therefore, the task of multi-view 3D reconstruction, at least in the early phase of feature matching, is generally tackled as a repetition of two-view estimation based on RANSAC.
Structure-from-motion (SfM), for instance, is a well-known 3D modeling technique that requires no a-priori knowledge of the camera poses and attempts to estimate camera poses and scene structure, in the form of a point cloud, from a sequence of uncorrelated images [V, VI, VII]. The SfM methods utilize the techniques of Sparse Bundle Adjustment (SBA), which is a variation of a Gauss-Newton numerical optimization scheme designed to exploit the sparse nature of the error function Jacobian matrix [VIII]. A progressive SfM method processes images according to their temporal sequence to track the camera pose along the overall camera trajectory and simultaneously updates the reconstructed scene. At the early stage of the method, the establishment of a reliable set of feature correspondences is crucial for the subsequent processes.
Normally the establishment of feature correspondences is performed according to only the temporal order of the images, in which case a severe drift of the camera path is likely to happen, making it infeasible to match a current image against the whole sequence. One possible solution is to extract and use key-images to overcome the camera drift and maintain the camera track on the actual trajectory. However, during this process, a massive amount of feature and match data would appear, and decisions must be continuously taken in order to remove a high number of outliers, which can influence the camera tracking and the result of the SBA process. As a result, a high number of features would be ignored and dropped as potential outliers merely because of insufficient information to support a reliable match between a 3D point and an image feature or between two feature points. The outcomes of this approach are the proliferation of compact clusters of 3D points in the reconstructed scene and the disappearance of correct points dropped as outliers soon after their instantiation, which would not be recovered afterwards. This is, however, opposite to the requirement of a robust input dataset for a successful exploitation of SBA, in which the 3D points are uniformly spread in the 3D volume and the features included in the input dataset are as many as possible.
Therefore, it is an objective of the present invention to propose a method and an apparatus to extract a reliable dataset of feature correspondences from images.
According to the invention, the method comprises: acquiring features of the images and preliminary feature correspondences of the features; generating at least one cluster of the features; and determining for each cluster primary feature correspondences of the features. In a same cluster, each feature is coupled to at least one other feature as preliminary feature correspondences.
In one embodiment, the method further comprises iterating said determining primary feature correspondences for each cluster. The iteration is terminated when the amount of the features not determined as primary feature correspondences is smaller than a threshold.
In one embodiment, the method is introduced as an additional stage within a standard SfM pipeline before performing an SBA refinement, aiming at the re-gathering of a more compact and exhaustive dataset as an input for the SBA processing. The attempt is to recover as many features as possible from those that have been previously dropped and to condense 3D points into compact clusters.
In another embodiment, the preliminary feature correspondences are extracted from the acquired features using a basic matcher, without assistance of any outlier-pruning technique. The acquired features are combined and reassembled into clusters, which are represented by undirected graphs. The features of the clusters are defined as nodes, and a consistency measure between two features is defined as the weight of an edge connecting two corresponding nodes of the two features. The graph weights, which represent the coherence of a match with the camera geometrical models, are computed using statistical distributions of the epipolar distance and the reprojection error determined by the matches. The set of graphs is then iteratively segmented using a spectral segmentation technique.
Accordingly, an apparatus configured to extract feature correspondences from images is introduced, which comprises an acquiring unit and an operation unit. The acquiring unit is configured to acquire features of the images and the preliminary feature correspondences of the features. The operation unit is configured to generate at least one cluster of the features and to determine for each cluster primary feature correspondences of the features.
Also, a computer readable storage medium has stored therein instructions for extracting feature correspondences from images, which when executed by a computer, cause the computer to: acquire features of the images and preliminary feature correspondences of the features; generate at least one cluster of the features; and determine for each cluster primary feature correspondences of the features.
The method of this invention provides an improved solution for the extraction of reliable matches from multiple localized views, exploiting simultaneously the constraints of the camera cluster geometry. The feature correspondences extracted according to the method provide a promising input for further processing, e.g. multi-view triangulation, Sparse Bundle Adjustment, etc. Such a technique can be easily and successfully integrated in any feature-based 3D vision application. For example, the method can be integrated directly within the progressive SfM processing as an innovative framework for feature tracking, and the refinement of the extraction of feature correspondences allows for a significant improvement of the overall accuracy achieved by the SfM processing.
For a better understanding, the invention shall now be explained in more detail in the following description with reference to the figures. It is understood that the invention is not limited to the disclosed exemplary embodiments and that specified features can also expediently be combined and/or modified without departing from the scope of the present invention as defined in the appended claims.
In one embodiment, the method further comprises iterating 13 said determining primary feature correspondences. The iterating 13 step can be repeated several times depending on different demands, and can be terminated according to various conditions. For example, the iterating 13 can be terminated when the amount of the features not determined as primary feature correspondences is smaller than a given threshold. The specific threshold can of course be given by a user or calculated and provided automatically by an apparatus.
Preferably, the input data further includes a set of camera poses in the 3D space, the statistical distributions and/or the epipolar distance and the reprojection errors responsive to the preliminary feature correspondences. Of course, the input data is not limited to the above mentioned data and can include other types of data.
The image dataset included in the input data can be acquired by any known method. For example, the images can be captured by a set of calibrated and fixed cameras, or by a multi-view stereo camera system. Alternatively, the images can also be extracted from a video sequence which is captured by a camera and subjected to an SfM processing.
In the preferred embodiment, the method is exemplarily implemented and applied as an intermediate stage of a general progressive SfM processing and is described in detail below. The implementation particularly aims at the refinement of an input dataset for an SBA stage of the SfM processing. It should be noted that this embodiment shows merely an exemplary implementation of the method of this invention, and the method can of course be used in any other suitable processing and techniques.
Upon the acquisition of the image dataset, preliminary feature correspondences are extracted and obtained by processing the images with a feature selector followed by a feature matcher. In this embodiment, SIFT techniques are utilized as the feature extractor to extract the features of the images, and the Nearest Neighbor Distance Ratio is used as a matching measure to match and select the preliminary feature correspondences. Alternatively, other techniques and other feature types can also be implemented for the extraction of feature correspondences, which are independent of and do not influence the subsequent steps 11, 12 of the method of this invention. Specifically, the preliminary feature correspondences are acquired without any specific outlier rejection scheme. In other words, the original matches of the features, which are considered as the preliminary feature correspondences and are included in the input data, can be acquired by any basic matcher.
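The Nearest Neighbor Distance Ratio test mentioned above can be sketched as follows. This is a minimal illustration using NumPy on generic descriptor arrays, not the implementation of the embodiment; the function name `nndr_matches` and the ratio value 0.8 are assumptions made for the example.

```python
import numpy as np

def nndr_matches(desc_a, desc_b, ratio=0.8):
    """Return (i, j) pairs where descriptor i of image A matches descriptor j of image B.

    A match is kept when the distance to the nearest neighbour is below
    `ratio` times the distance to the second-nearest neighbour.
    The default ratio of 0.8 is an assumed value, not taken from the source.
    """
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    if desc_b.shape[0] < 2:
        return []
    # Pairwise Euclidean distances, shape (len(desc_a), len(desc_b)).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(dists):
        nearest, second = np.argsort(row)[:2]
        if row[nearest] < ratio * row[second]:
            matches.append((i, int(nearest)))
    return matches
```

No outlier rejection is applied beyond the ratio test, which is in line with the statement that the preliminary correspondences can come from any basic matcher.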
For the computation of the feature correspondences, a subset of linked image pairs among the image dataset is selected according to their spatial proximity. In the case where the image dataset is captured by a set of static cameras, the subset of linked image pairs can be easily assembled by grouping cliques of neighboring views. For this embodiment implemented in an SfM processing, it is required to extract a spanning tree from a subset of chosen keyframes of an original video sequence, followed by selecting and connecting the paired views. A camera distance matrix proposed in PCT International Application with Publication Number WO2014/154533 by the same inventor is particularly used here for the computation of the minimum spanning tree that connects the keyframe set.
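The view-linking step can be illustrated with a minimum spanning tree computed over a camera distance matrix. The sketch below uses Prim's algorithm; the camera distance matrix of WO2014/154533 is not reproduced here, so the Euclidean distance between camera centres is used as a stand-in, which is an assumption of this example.

```python
import numpy as np

def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric distance matrix.

    Returns a list of (parent, child) edges linking all views.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_cost = dist[0].copy()            # cheapest link from the tree to each view
    best_parent = np.zeros(n, dtype=int)  # tree node that offers that cheapest link
    edges = []
    for _ in range(n - 1):
        cost = np.where(in_tree, np.inf, best_cost)
        child = int(np.argmin(cost))
        edges.append((int(best_parent[child]), child))
        in_tree[child] = True
        closer = dist[child] < best_cost
        best_cost = np.where(closer, dist[child], best_cost)
        best_parent = np.where(closer, child, best_parent)
    return edges

# Stand-in distance (assumption): Euclidean distance between camera centres.
centres = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.1, 0.0, 0.0], [1.0, 5.0, 0.0]])
dist = np.linalg.norm(centres[:, None] - centres[None, :], axis=2)
edges = minimum_spanning_tree(dist)
```

Each returned edge designates one linked image pair on which preliminary correspondences would be computed.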
As mentioned above, the choice of the camera- or view-linking strategy is independent of and does not influence the exploitation of the method and the subsequent steps 11, 12 thereof.
A set of camera poses is preferably included in the provided input data. In this embodiment, a sequence of 3×4 metric projection matrices is used, which represent the rigid motion between the reference frame of the camera system and the reference frame of the point cloud coordinates and are provided by a camera tracker during the typical SfM processing. The motion of the unconstrained moving camera is analyzed in order to extract a subset of keyframes, contributing to further processes and also reducing the visual information redundancy. In the case where a set of static cameras is used, the camera poses can be assumed available as a fixed input pre-computed via calibration, and all the captured images can be implicitly labeled as keyframes. It is assumed that, in either case, a set of images, which are dislocated in the 3D space and have a sufficient level of overlap among their fields of view, is located and provided in the input data. This assumption is reasonable and easily achieved by, for example, an SfM processing or a multi-view stereo reconstruction system.
The input data preferably further includes the statistical distribution of the error measures, i.e., epipolar distances and reprojection errors responsive to the preliminary feature correspondences, which are typically used as indicators for the reliability of the feature matching. The epipolar distance is computed from pairs of matched features in distinct images and represents the coherence of the match with the corresponding two-view epipolar geometry. The reprojection error measures the distance between the analytical projection of a 3D point on a single view and the corresponding image feature. When the cameras are arranged in a rigid cluster and the scene volume is always unchanged, irrespective of the inspected object, the epipolar distances and the reprojection errors can be regarded as random variables which are independent of the image content. A statistical model can thus be easily inferred from the database of the previously computed 3D reconstructions.
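The two error measures can be written down as short functions. The sketch below is illustrative only: the source does not state whether a one-sided or a symmetric epipolar distance is used, so the one-sided point-to-line distance is an assumption, as are the function names.

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Distance from x2 to the epipolar line F @ x1 in the second image.

    One-sided point-to-line distance (an assumption of this sketch).
    """
    p1 = np.append(np.asarray(x1, dtype=float), 1.0)
    p2 = np.append(np.asarray(x2, dtype=float), 1.0)
    line = np.asarray(F, dtype=float) @ p1   # epipolar line a*x + b*y + c = 0
    return float(abs(p2 @ line) / np.hypot(line[0], line[1]))

def reprojection_error(P, X, x):
    """Distance between the projection of 3D point X by the 3x4 matrix P and feature x."""
    proj = np.asarray(P, dtype=float) @ np.append(np.asarray(X, dtype=float), 1.0)
    return float(np.linalg.norm(proj[:2] / proj[2] - np.asarray(x, dtype=float)))
```

Here `F` is the fundamental matrix of the image pair and `P` is one of the 3×4 projection matrices provided with the camera poses.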
In this preferred embodiment implemented in an SfM processing, a statistical model can be extracted on-the-fly by collecting frame-by-frame the error data and fitting the models once the data samples have reached a sufficient size. Specifically, an exponential model is utilized to represent the statistics of the error measures, i.e., epipolar distances and reprojection errors.
$$\mathrm{pdf}(x) = \lambda e^{-\lambda x}, \qquad \lambda = 2752.58.$$
Similarly,
$$\mathrm{pdf}(x) = \lambda e^{-\lambda x}, \qquad \lambda = 1056.28.$$
Of course, other clustering techniques can also be used to generate the statistical models and are independent of and do not influence the subsequent steps 11, 12 of the method of this invention.
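As an illustration of fitting the exponential model on the fly, the sketch below estimates the rate λ from collected error samples by maximum likelihood, for which λ is the reciprocal of the sample mean. The source does not state which estimator is used, and the sample-size threshold `min_samples` is a hypothetical parameter.

```python
import numpy as np

def fit_exponential_rate(errors, min_samples=100):
    """Maximum-likelihood rate of an exponential model, or None if too few samples.

    `min_samples` is a hypothetical threshold standing in for the
    "sufficient size" mentioned in the text.
    """
    errors = np.asarray(errors, dtype=float)
    if errors.size < min_samples or errors.mean() <= 0.0:
        return None
    return float(1.0 / errors.mean())

def exponential_pdf(x, lam):
    """pdf(x) = lam * exp(-lam * x) for x >= 0."""
    return lam * np.exp(-lam * np.asarray(x, dtype=float))
```

One model would be fitted to the epipolar distances and another to the reprojection errors, each from its own sample set.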
The acquired features and the preliminary feature correspondences from the input data are used to generate 11 at least one cluster of the features. In a same cluster, each feature is coupled to at least one other feature as preliminary feature correspondences. In other words, the preliminary feature correspondences determine the development of the feature clusters.
In this preferred embodiment, the feature clusters are represented in a form of connected and undirected graphs, in which each feature is defined as a node. In addition, a growing strategy is implemented on the clusters to assemble in a same cluster any feature that is coupled to at least one other feature as preliminary feature correspondence. The growing strategy excludes any consistency check and outlier rejection schemes to allow more relevant features being included and combined into clusters, each of which includes an unknown number of outliers and potentially more than one group of actual corresponding features. This aims at collecting in a single cluster the whole native information provided by the preliminary feature correspondences of the input data, and thus generates bigger clusters.
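The growing strategy amounts to collecting the connected components of the graph whose edges are the preliminary correspondences. A minimal sketch using a union-find structure follows; the function name and the integer feature indexing are assumptions of the example.

```python
def grow_clusters(num_features, matches):
    """Group features into connected components of the match graph (union-find).

    `matches` is an iterable of (a, b) feature-index pairs, i.e. the
    preliminary feature correspondences. No consistency check is applied.
    """
    parent = list(range(num_features))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for f in range(num_features):
        clusters.setdefault(find(f), []).append(f)
    # Features without any match are singletons and do not form a cluster.
    return [members for members in clusters.values() if len(members) > 1]
```

For example, `grow_clusters(6, [(0, 1), (1, 2), (3, 4)])` groups features 0, 1 and 2 into one cluster and features 3 and 4 into another, while feature 5 is left out.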
Subsequent to the generation of the feature clusters, primary feature correspondences of the features are determined 12 respectively for each cluster. As shown in
A consistency measure ωi,j between two features (i.e., two nodes i and j) is determined 121 as the sum of three contributions:
The first contribution is given by the probability that the epipolar distance variable assumes a value greater than the one determined by the features i and j, where the latter is denoted as φi,j. The probability measure is computed by analytical integration of the probability density function provided in the initial input data and as shown in
For the characterization of the other two contributions, a notation Sk is introduced to represent a set of 3D points that can be triangulated from a pair of features including a feature k. When the cluster in which the feature k is included comprises a number of N features, the maximal cardinality of the set Sk would be N−1. The cardinality may be lower if some feature pairs are not admissible for triangulation, i.e., the feature pairs are in a same image of the input data. Accordingly, Si and Sj represent the sets of 3D points triangulated respectively using the features i and j.
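The construction of a set Sk can be sketched with a standard linear (DLT) two-view triangulation. The source does not specify the triangulation method, so the choice of DLT, the function names and the data layout are assumptions of this example.

```python
import numpy as np

def triangulate(P1, x1, P2, x2):
    """Linear (DLT) triangulation of one 3D point from two views."""
    P1, P2 = np.asarray(P1, dtype=float), np.asarray(P2, dtype=float)
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Homogeneous solution: right singular vector of the smallest singular value.
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

def triangulated_set(k, features, cameras):
    """S_k: points triangulated from feature k paired with each feature of another image.

    `features` is a list of (image_id, (x, y)) for one cluster;
    `cameras` maps image_id to a 3x4 projection matrix.
    """
    img_k, x_k = features[k]
    points = []
    for m, (img_m, x_m) in enumerate(features):
        if m == k or img_m == img_k:      # same-image pairs are not admissible
            continue
        points.append(triangulate(cameras[img_k], x_k, cameras[img_m], x_m))
    return points
```

With N features in the cluster, `triangulated_set` returns at most N−1 points, and fewer when some features lie in the same image as feature k.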
To compute the consistency measure as shown in the above formula, the points triangulated using either feature i or j are back-projected towards the other feature, and the one providing the minimum backprojection error is selected. Similarly, the corresponding probability is obtained by analytical integration of the corresponding probability density function provided in the input data and as shown in
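A possible reading of the three contributions is sketched below. For an exponential model, the analytical integration of the density from x to infinity gives the tail probability exp(−λx). The exact formula of the source is not reproduced here, so the way the three terms are combined, the handling of empty sets and all names are assumptions of this sketch.

```python
import math

def tail_probability(x, lam):
    """P(X > x) for an exponential variable: integral of lam*exp(-lam*t) from x to infinity."""
    return math.exp(-lam * x)

def consistency(phi_ij, errors_i_onto_j, errors_j_onto_i, lam_epi, lam_rep):
    """Sum of three contributions for the edge between features i and j.

    phi_ij:          epipolar distance determined by features i and j.
    errors_i_onto_j: reprojection errors of the points in S_i measured at feature j.
    errors_j_onto_i: reprojection errors of the points in S_j measured at feature i.
    An empty set contributes 0 (an assumption; the source does not cover this case).
    """
    w = tail_probability(phi_ij, lam_epi)
    for errors in (errors_i_onto_j, errors_j_onto_i):
        if errors:
            # Only the point with the minimum back-projection error is used.
            w += tail_probability(min(errors), lam_rep)
    return w
```

Under this reading each contribution lies between 0 and 1, so the weight of an edge lies between 0 and 3, with larger values indicating a more coherent match.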
The at least one cluster is then segmented by maximizing 122 an average consensus of the consistency measures of the cluster:
where u is a binary-valued N-dimensional vector representing the cluster segmentation and W is the symmetric N×N real-valued matrix collecting the consistency measures (ωi,j) between the feature pairs in one cluster. As mentioned in [IX], there is no known polynomial-time solution to maximize this function when u is a discrete-valued indicator vector. However, an approximate solution can be found by relaxing the constraint on u, allowing its elements to take any positive real value. The problem then becomes the maximization of the Rayleigh quotient of the matrix W, one solution of which is given by the dominant eigenvector of W, namely the one associated with the maximum eigenvalue [X]. The vector u is then projected onto a final solution v belonging to the binary discrete space by sequentially setting the elements of v to 1 until the consensus r(v) is maximized. The vector v is initialized to 0 and its elements are flipped following the decreasing order of the elements of u.
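The relaxation and the subsequent discretization can be sketched with NumPy as follows. The consensus is assumed here to be the Rayleigh quotient r(u) = uᵀWu / uᵀu, consistent with the description above; the function names are hypothetical and this is an illustration rather than the implementation of the embodiment.

```python
import numpy as np

def consensus(W, v):
    """Average consensus r(v) = v^T W v / v^T v (0 for the empty selection)."""
    count = v @ v
    return float(v @ W @ v / count) if count > 0 else 0.0

def spectral_segment(W):
    """Binary indicator of the feature group with approximately maximal consensus."""
    W = np.asarray(W, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(W)   # W is symmetric; eigenvalues ascend
    u = np.abs(eigvecs[:, -1])             # dominant eigenvector, sign removed
    v = np.zeros(len(u))
    best_v, best_r = v.copy(), 0.0
    for idx in np.argsort(-u):             # visit elements in decreasing order of u
        v[idx] = 1.0
        r = consensus(W, v)
        if r > best_r:
            best_v, best_r = v.copy(), r
    return best_v.astype(int), best_r
```

The loop tries every prefix of the ordering induced by u and keeps the one with the highest consensus, so weakly connected features at the tail of the ordering are left out as outliers.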
It has been shown above that the growing strategy does not guarantee a unique group of corresponding features inside a single cluster (graph). Specifically, consistent groups of features that actually pertain to a reduced number of distinct 3D points are assembled into a same cluster. In other words, the result of the step of determining 12 the primary feature correspondences for each cluster might not be optimized and might exclude an amount of features as outliers from the set of the primary feature correspondences. One solution to cope with this situation is to iterate 13 said determining step and to adjust the result and the corresponding outliers.
In one embodiment, the amount of the outliers is the indicator for such iteration and adjustment. For example, when an initial amount of features are excluded and considered as outliers from the primary feature correspondences after the determining 12 step, the iterating 13 is subsequently performed such that a second amount of the outliers is smaller than the initial amount, i.e., more features are determined as primary feature correspondences and fewer features are excluded as outliers. Accordingly, the iterating 13 can be terminated when the amount of the outliers is smaller than a threshold which can be predetermined by a user or automatically given by an apparatus. Of course, other termination conditions for the iterating 13 can also be applied depending on different demands.
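One way to read the iteration is as a loop that repeatedly segments the features not yet assigned and stops once fewer than a threshold remain. The sketch below follows that interpretation; the loop structure, the extra stop condition when no group is found, and the toy segmenter `strongest_pair` are assumptions introduced for illustration.

```python
import numpy as np

def iterate_segmentation(W, segment, max_outliers):
    """Peel consistent groups off a cluster until few features remain unassigned.

    `segment` maps a consistency matrix to a 0/1 indicator of one group.
    Returns (groups, outliers) as lists of feature indices.
    """
    W = np.asarray(W, dtype=float)
    remaining = list(range(W.shape[0]))
    groups = []
    while len(remaining) >= max_outliers and len(remaining) >= 2:
        sub = W[np.ix_(remaining, remaining)]
        selected = np.flatnonzero(segment(sub))
        if len(selected) < 2:              # no further consistent group: stop
            break
        groups.append([remaining[s] for s in selected])
        chosen = set(selected.tolist())
        remaining = [f for pos, f in enumerate(remaining) if pos not in chosen]
    return groups, remaining

def strongest_pair(sub):
    """Toy stand-in segmenter: select the single best-connected pair, if any."""
    i, j = np.unravel_index(np.argmax(sub), sub.shape)
    v = np.zeros(len(sub), dtype=int)
    if sub[i, j] > 0:
        v[[i, j]] = 1
    return v

W = np.zeros((5, 5))
W[0, 1] = W[1, 0] = 2.5
W[2, 3] = W[3, 2] = 1.5
groups, outliers = iterate_segmentation(W, strongest_pair, max_outliers=2)
```

Any segmenter with the same interface, such as a spectral one, could be passed in place of the toy function.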
From the original 235 frames of the sequence, 7 keyframes and corresponding camera poses are extracted by processing the sequence with a keyframe-based SfM engine proposed by the same inventor in European Patent Application EP13305993.1. The minimum spanning tree providing the optimal spatial linking of the image features is subsequently computed. From the image dataset of the input data as well as the image features and preliminary feature correspondences extracted therefrom, about 1500 clusters of the features are generated and about 500 clusters thereof are successfully segmented and triangulated.
The results obtained from the above example show the capability of the embodiment to re-gather all the image features that consistently correspond to a single 3D point from a highly cluttered set of matches. This makes the method of this invention useful for the refinement of the feature correspondence dataset used in a final Bundle Adjustment of a Structure from Motion architecture or in a multi-view stereo reconstruction system.
In one embodiment, the operation unit 22 is further configured to determine consistency measures between every two features in a cluster and to maximize an average consensus of the consistency measures of the cluster to determine primary feature correspondences. The consistency measure between two features is relevant to an epipolar distance and a triangulation result determined by the two features. Furthermore, the operation unit 22 is also configured to define the cluster as a graph, each feature thereof as a node, and the consistency measure between two features as the weight of an edge connecting two corresponding nodes of the two features, and accordingly to perform spectral segmentation on the graph of the cluster.
Number | Date | Country | Kind |
---|---|---|---|
14306677.7 | Oct 2014 | EP | regional |