Three-dimensional modeling is useful in various applications, including object localization, object recognition, and motion capture. A variety of methods are currently used to model three-dimensional objects. One such method is a visual hull-based method in which silhouette information from two-dimensional images taken from multiple views in three-dimensional space is fused. Although such methods are viable, they typically require camera calibration, which is cumbersome. It would be desirable to be able to model three-dimensional objects from two-dimensional images without having to perform such calibration.
The present disclosure may be better understood with reference to the following figures. Matching reference numerals designate corresponding parts throughout the figures, which are not necessarily drawn to scale.
Disclosed herein are systems and methods for three-dimensional modeling using a purely image-based approach to fusing foreground silhouette information from multiple two-dimensional views. Unlike prior solutions, three-dimensional constructs such as camera calibration are unnecessary.
As described in greater detail below, visual hull intersection is performed in the image plane using planar homographies and foreground likelihood information from a set of arbitrary views of an object. A two-dimensional grid of object occupancy likelihoods is generated representing a cross-sectional slice of the object. Subsequent slices of the object are obtained by extending the process to planes parallel to a reference plane in a direction along the body of the object. Occupancy grids are then stacked on top of each other, creating a three-dimensional data structure that encapsulates the object shape and location. Finally, the object structure is segmented out by minimizing an energy function over the surface of the object in a level sets formulation.
The problem of determining a slice of an object can be stated as finding the region on a hypothetical plane that is occupied by the object. Through homographic warping of silhouette information from multiple views to a reference view, visual hull intersection on a plane can be achieved. If foreground information is available in each view, the process delivers a two-dimensional grid of space occupancies: a representation of a slice of the object cut out by the plane. Starting with homographies between views due to a reference plane in the scene (e.g., a ground plane), homographies of successively parallel planes can be obtained in the framework of plane-to-plane homographies using the vanishing point of the reference direction (the direction not parallel to the reference plane). With those homographies, an arbitrary number of occupancy grids/slices along the body of the object can be obtained, each being a discrete sampling of three-dimensional space of object occupancies.
Discussion of the Modeling Approach
Planar homographies will first be discussed in relation to the accompanying figures.
Computing such a shadow (the projection of a view's silhouette onto plane π) is equivalent to determining the region on plane π that falls inside the visual hull defined by the object silhouette in Ii. The fusion of these shadows projected from the various views therefore amounts to performing visual hull intersection on plane π, depicted by the region 20 in the accompanying figures.
Instead of using binary foreground maps, a more statistical approach can be pursued: the background is modeled in each view to obtain foreground likelihood maps, thereby using the cameras as statistical occupancy sensors (foreground being interpreted as occupancy in space). In the case of non-stationary cameras, object detection is achieved in a plane-plus-parallax framework, assigning high foreground likelihood where there is high motion parallax. A reason to adopt a soft approach is to delay the act of thresholding, preventing premature decisions on pixel labeling, an approach that has proven very useful in visual hull methods because of their susceptibility to segmentation and calibration errors. Assume that Ii is the foreground likelihood map (each pixel value is the likelihood of that pixel being foreground) in view i of n views. Consider a reference plane π in the scene inducing homographies Hiπ between each view i and a reference view; warping each Ii to the reference view with its homography Hiπ produces a set of warped foreground likelihood maps.
Visual hull intersection on π (AND-fusion of the shadow regions) is achieved by multiplying these warped foreground likelihood maps:

$\theta_{ref} = \prod_{i=1}^{n} \hat{I}_i$, [Equation 1]

where $\hat{I}_i$ is the likelihood map Ii warped to the reference view using Hiπ, and θref is the projectively transformed grid of object occupancy likelihoods. Notably, a more elaborate fusion model can be used at the expense of simplicity. For instance, a sensor fusion strategy that explicitly models pixel visibility, sensor reliability, or scene radiance can be transparently incorporated without affecting the underlying approach of fusing at slices in the image plane rather than in three-dimensional space.
Each value in θref identifies the likelihood of a grid location being inside the body of the object, indeed representing a slice of the object cut out by plane π. It should be noted that the choice of reference view is irrelevant, as the slices obtained on all image planes and the scene plane π are projectively equivalent. This computation can be performed at an arbitrary number of planes in the scene, each giving a new slice of the object. Naturally, this does not apply to planes that do not pass through the object's body, since visual hull intersection on these planes will be empty.
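By way of illustration only, the following is a minimal sketch of this fusion step in Python, assuming NumPy and OpenCV are available, that the per-view foreground likelihood maps have already been computed, and that the 3×3 homographies mapping each view to the reference view via the plane under consideration are known; the function and parameter names are illustrative rather than part of this disclosure:

```python
import numpy as np
import cv2


def fuse_on_reference_plane(likelihood_maps, homographies, out_size):
    """Warp each view's foreground likelihood map to the reference view and
    multiply the warps pixel-wise (AND-fusion), yielding an occupancy grid.

    likelihood_maps : list of float32 arrays with values in [0, 1], one per view.
    homographies    : list of 3x3 arrays; homographies[i] maps view i to the
                      reference view via the plane being intersected.
    out_size        : (width, height) of the reference view.
    """
    theta_ref = np.ones((out_size[1], out_size[0]), dtype=np.float32)
    for lik, H in zip(likelihood_maps, homographies):
        # Pixels falling outside a view's warped extent receive zero likelihood.
        warped = cv2.warpPerspective(lik, np.asarray(H, dtype=np.float64), out_size)
        theta_ref *= warped
    return theta_ref
```

A grid location retains a high value only if every warped likelihood map supports occupancy there, which is the AND-fusion described above.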
Starting with a reference plane in the scene (typically the ground plane), visual hull intersection can be performed on successively parallel planes in the up direction along the body of the object. The probabilistic occupancy grids θi obtained in this fashion can be thresholded to obtain object slices, but this creates the problem of finding the optimum threshold at each slice level. Moreover, the slices have a strong dependency on each other, as they are parts of the same object(s), and should as such be treated as a whole. This dependency can be modeled by stacking the slices, creating a three-dimensional data structure ⊕ = [θ1; θ2; . . . ; θn]. ⊕ is not an entity in the three-dimensional world or a collection of voxels; it is, simply put, a logical arrangement of planar slices representing discrete samplings of the continuous occupancy space. Object structure is then segmented out from ⊕, i.e., simultaneously segmented out from all the slices, as a smooth surface that divides the space into object and background. Discussed in the following paragraphs is an image-based approach that uses the homography of a reference plane in a scene to compute the homographies induced between views by planes parallel to the reference plane.
Consider a coordinate system XYZ in space. Let the origin of the coordinate frame lie on the reference plane, with the X and Y axes spanning the plane. The Z axis is the reference direction, which is thus any direction not parallel to the plane. The image coordinate system is the usual xy affine image frame, and a point X in space is projected to the image point x via a 3×4 projection matrix M as:
$x = MX = [\,m_1\ \ m_2\ \ m_3\ \ m_4\,]\,X$, [Equation 2]
where x and X are homogeneous vectors of the form $x = (x, y, w)^T$ and $X = (X, Y, Z, W)^T$, and "=" denotes equality up to scale. The projection matrix M can be parameterized as:
$M = [\,v_X\ \ v_Y\ \ v_Z\ \ \hat{l}\,]$, [Equation 3]
where $v_X$, $v_Y$, and $v_Z$ are the vanishing points for the X, Y, and Z directions, respectively, and $\hat{l}$ is the normalized vanishing line of the reference plane.
Suppose the world coordinate system is translated from the plane π onto the plane π′ along the reference direction (Z) by z units, as shown in the accompanying figures. The new projection matrix M′ can then be parameterized as:
$M' = [\,v_X\ \ v_Y\ \ v_Z\ \ \alpha z\,v_Z + \hat{l}\,]$, [Equation 4]
where α is a scale factor. Columns 1, 2, and 4 of the projection matrices are the three columns of the respective plane-to-image homographies. Therefore, the plane-to-image homographies can be extracted from the projection matrices, ignoring the third column, to give:
$H_\pi = [\,v_X\ \ v_Y\ \ \hat{l}\,], \quad H_{\pi'} = [\,v_X\ \ v_Y\ \ \alpha z\,v_Z + \hat{l}\,]$. [Equation 5]
In general:
$H_\gamma = H_{ref} + [\,0 \mid \gamma\,v_{ref}\,]$, [Equation 6]
where Href is the homography of the reference plane, γ is a scalar multiple encapsulating α and z, 0 is a 3×2 matrix of zeros, and vref is the vanishing point of the reference direction. Using this result, it can be shown that if the homography Hiπ induced by the reference plane π between view i and the reference view is known, then the homography induced between those views by any plane parallel to π can be computed from Hiπ, the vanishing point vref of the reference direction, and the scalar γ that specifies the parallel plane's location (Equation 7).
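For illustration, a minimal sketch of Equation 6 in code, assuming NumPy; the helper name is illustrative rather than part of this disclosure. It simply adds γ·vref to the third column of the reference-plane homography:

```python
import numpy as np


def parallel_plane_homography(H_ref, v_ref, gamma):
    """Equation 6: H_gamma = H_ref + [0 | gamma * v_ref].

    H_ref : 3x3 homography induced by the reference plane.
    v_ref : vanishing point of the reference direction (homogeneous 3-vector).
    gamma : scalar specifying the parallel plane's location along the reference direction.
    """
    offset = np.zeros((3, 3))
    offset[:, 2] = gamma * np.asarray(v_ref, dtype=float)  # [0 | gamma * v_ref]: 3x2 zeros beside a 3x1 column
    return np.asarray(H_ref, dtype=float) + offset
```

As described in the text, this relation, together with the known reference-plane homographies between views, is what allows the warps needed for any parallel plane to be obtained (Equation 7).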
As described above, slices computed along the body of the object can be stacked to create a three-dimensional data structure ⊕ that encapsulates the object structure. To segment out the object, a parameterized surface S(q): [0, 1]² → R³ can be evolved that divides ⊕ between the object and the background. This is achieved by formulating the problem in a variational framework, where the solution is a minimizer of a global cost functional that combines a smoothness prior on slice contours with a data fitness score. The cost functional can be defined as:
where ∇⊕ denotes the gradient of ⊕, and g denotes a strictly decreasing function: g(x) = 1/(1 + x²). The first term at the right-hand side of Equation 8 represents the external energy. Its role is to attract the surface towards the object boundary in ⊕. The second term, called the internal energy, computes the area of the surface. Given the same volume, a smoother surface will have a smaller area. Therefore, this term controls the smoothness of the surface to be determined. When the overall energy is minimized, the surface approaches the object boundary while remaining smooth.
Minimizing Equation 8 is equivalent to computing a geodesic in a Riemannian space:
With the Euler-Lagrange equation deduced, this objective function can be minimized using gradient descent, evolving the surface over an iteration time t as:
$\vec{S}_t = g\big(|\nabla\oplus(S)|\big)\,\kappa\,\vec{N} - \big(\nabla g(|\nabla\oplus(S)|)\cdot\vec{N}\big)\,\vec{N}$, [Equation 10]
where κ is the surface curvature and $\vec{N}$ is the unit normal vector of the surface. Since the objects to be reconstructed may have arbitrary shape and/or topology, the segmentation is implemented using the level set framework. Level set-based methods allow topological changes to occur without any additional computational complexity because an implicit representation of the evolving surface is used. The solution (Equation 10) can be readily cast into the level set framework by embedding the surface S into a three-dimensional level set function ψ of the same size as ⊕, i.e., S = {(x, y, z) | ψ(x, y, z) = 0}. The signed distance transform is used to generate the level set function. This yields a level set update equation equivalent to the surface evolution process in Equation 10:
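Equation 11 itself is not reproduced in this text. Under the standard geodesic active surface conventions, and assuming the normal convention $\vec{N} = -\nabla\psi/|\nabla\psi|$ (an assumption made here for illustration), the level-set counterpart of Equation 10 takes the familiar form

$$\psi_t \;=\; g\big(|\nabla\oplus|\big)\,\kappa\,|\nabla\psi| \;+\; \nabla g\big(|\nabla\oplus|\big)\cdot\nabla\psi, \qquad \kappa = \operatorname{div}\!\left(\frac{\nabla\psi}{|\nabla\psi|}\right),$$

in which every voxel of ψ is updated directly from the volume ⊕ without explicitly tracking the surface.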
Starting with an initial estimate for S and iteratively updating the level set function using Equation 11 leads to a segmentation of the object.
Application of the Modeling Approach
Application of the above-described approach will now be discussed with reference to the accompanying flow diagram.
Once all the desired views have been obtained, the foreground silhouettes of the object in each view are identified, as indicated in block 32 of the flow diagram. When an image of the scene without the object present is available for a given view, the silhouette can be identified by image subtraction, i.e., by subtracting the background image from the image that contains the object.
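By way of illustration only, a minimal sketch of how a foreground likelihood map could be computed from such a background image; the Gaussian noise model and the value of sigma are assumptions made here, not part of this disclosure:

```python
import numpy as np


def foreground_likelihood(image, background, sigma=10.0):
    """Per-pixel foreground likelihood from simple image subtraction.

    image, background : grayscale arrays of the same shape.
    sigma             : assumed standard deviation of the background intensity noise.
    Returns values in [0, 1]; large intensity differences map to likelihoods near 1.
    """
    diff = image.astype(np.float32) - background.astype(np.float32)
    # Probability that the observed difference is not explained by background noise.
    return 1.0 - np.exp(-(diff ** 2) / (2.0 * sigma ** 2))
```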
Image subtraction typically cannot be used, however, in cases in which the images were captured by a single camera in a random flyby of an object, given that it is difficult to obtain the same viewpoint of the scene without the object present. In such a situation, image alignment can be performed to identify the foreground silhouettes. Although consecutive views can be placed in registration with each other by aligning the images with respect to detectable features of the ground plane, such registration results in the image pixels that correspond to the object being misaligned due to plane parallax. This misalignment can be detected by performing a photo-consistency check, i.e., comparing the color values of two consecutive aligned views. Any pixel that has a mismatch from one view to the other (i.e., a color value difference greater than a threshold) is marked as a pixel pertaining to the object. The alignment between such views can be determined by finding the transformation, i.e., the planar homography, between the views. In some embodiments, the homography between any two views can be determined by first identifying features of the ground plane using an appropriate algorithm or program, such as a scale-invariant feature transform (SIFT) algorithm or program. Once the features have been identified, they can be matched across the views and the homographies can be determined in the manner described in the foregoing. By way of example, at least four features are identified to align any two views. In some embodiments, a suitable algorithm or program, such as a random sample consensus (RANSAC) algorithm or program, can be used to ensure that the identified features are in fact contained within the ground plane.
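As an illustration of this alignment step, a minimal sketch using OpenCV's SIFT implementation and RANSAC-based homography fitting; it assumes an OpenCV build in which cv2.SIFT_create is available, and the ratio-test and reprojection thresholds are illustrative values rather than part of this disclosure:

```python
import numpy as np
import cv2


def ground_plane_homography(img_a, img_b, ratio=0.75, ransac_thresh=3.0):
    """Estimate the planar homography mapping img_a to img_b from matched features.

    img_a, img_b : 8-bit grayscale images of consecutive views.
    """
    sift = cv2.SIFT_create()
    kps_a, desc_a = sift.detectAndCompute(img_a, None)
    kps_b, desc_b = sift.detectAndCompute(img_b, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(desc_a, desc_b, k=2)
    good = [p[0] for p in matches if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < 4:  # at least four correspondences are needed for a homography
        raise ValueError("not enough matches to estimate a homography")

    pts_a = np.float32([kps_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    pts_b = np.float32([kps_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # RANSAC keeps the consensus set of matches consistent with a single plane,
    # helping ensure the correspondences used actually lie on the ground plane.
    H, inlier_mask = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, ransac_thresh)
    return H
```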
Once the foreground silhouettes have been identified in each view, visual hull intersection can be performed on multiple planes that cut through the object. Through that process, multiple slices of the object can be estimated, and those slices can then be used to compute a surface that approximates the outer surface of the object. As described above, the visual hull intersection begins with a reference plane. For purposes of this discussion, it is assumed that the reference plane is the ground plane.
With reference to block 34 of the flow diagram, the homographies induced by the reference plane between each view and a selected reference view are determined, for example from matched ground-plane features as described above, and each view and its silhouette are then warped to the reference view using the appropriate homography.
After each of the views, and their silhouettes, has been transformed (i.e., warped to the reference view using the planar homography), the warped silhouettes of each view are fused together to obtain a cross-sectional slice of a visual hull of the object that lies in the reference plane, as indicated in block 38. That is, a first slice of the object (i.e., a portion of the object that is occluded from view) that is present at the ground plane is estimated.
As described above, this process can be replicated to obtain further slices of the object that lie in planes parallel to the reference plane. Given that those other planes are imaginary, and therefore comprise no identifiable features, the transformation used to obtain the first slice cannot be performed to obtain the other slices. However, because the homographies induced by the reference plane and the location of the vanishing point in the up direction are known, the homographies induced by any plane parallel to the reference plane can be estimated. Therefore, each of the views can be warped to the reference view relative to new planes, and the warped silhouettes that result can be fused together to estimate further cross-sectional slices of the visual hull, as indicated in block 40.
As described above, the homographies can be estimated using Equation 7. In that equation, γ is a scalar multiple that specifies the locations of the other planes along the up direction. Notably, the value for γ can be selected by first determining the range of γ that spans the object. This is achieved by incrementing γ in Equation 7 until a point is reached at which there is no shadow overlap, indicating that the current plane lies above the top of the object. Once that range has been determined, the value of γ at the top of the object can be divided by the total number of planes desired to determine the appropriate γ increment between successive planes. For example, if γ is 10 at the top of the object and 100 planes are desired, a γ increment of 0.1 can be used to obtain the homographies induced by the various planes.
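Putting these pieces together, a minimal sketch of the plane sweep; it reuses the fuse_on_reference_plane sketch given earlier, and the callable passed in stands in for the per-view Equation 7 computation described above (how those homographies are formed is not reproduced here):

```python
import numpy as np


def build_occupancy_volume(likelihood_maps, homographies_for_plane, gamma_top, n_planes, out_size):
    """Sweep planes parallel to the reference plane and stack the resulting slices.

    likelihood_maps        : per-view foreground likelihood maps.
    homographies_for_plane : callable gamma -> list of 3x3 homographies (one per view)
                             mapping each view to the reference view via the plane at
                             location gamma (the Equation 7 computation).
    gamma_top              : gamma at which the fused slice becomes empty (top of the
                             object), found by incrementing gamma as described above.
    n_planes               : number of slices desired.
    out_size               : (width, height) of the reference view.
    """
    gammas = np.linspace(0.0, gamma_top, n_planes)
    slices = [fuse_on_reference_plane(likelihood_maps, homographies_for_plane(g), out_size)
              for g in gammas]
    return np.stack(slices, axis=0)  # the stacked data structure referred to as ⊕ above
```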
At this point in the process, multiple slices of the object have been estimated.
Once the slices have been estimated, their precise boundaries are still unknown and, therefore, the precise boundaries of the object are likewise unknown. One way in which the boundaries of the slices could be determined is to establish thresholds for each of the slices to separate image data considered part of the object from image data considered part of the background. In the current embodiment, however, the various slices are first stacked on top of each other along the up direction, as indicated in block 42 of the flow diagram, to form the three-dimensional data structure ⊕ described above, and a surface that separates the object from the background is then computed from that data structure.
As described above, the surface can be computed by minimizing Equation 8, which comprises a first term that identifies portions of the data having high gradient (thereby locating the boundary of the object) and a second term that measures the area of the object surface. By minimizing both terms, the surface is driven toward the object boundary while keeping its area as small as possible. In other words, the surface is optimized to be the tightest surface that separates the three-dimensional structure of the object from the background.
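To make the surface-extraction step concrete, a minimal sketch of a level-set evolution over the stacked volume, following the standard geodesic active surface update discussed above; the initialization (a box just inside the volume), the time step, and the iteration count are assumptions made for illustration, and no claim is made that this reproduces Equations 8-11 exactly:

```python
import numpy as np
from scipy import ndimage


def segment_volume(theta, n_iter=300, dt=0.1, eps=1e-8):
    """Evolve a level-set surface over the stacked occupancy volume theta (slices x rows x cols).

    Returns a boolean mask of the voxels enclosed by the final surface.
    """
    theta = theta.astype(np.float64)

    # Edge-stopping function g(x) = 1 / (1 + x^2) applied to the gradient magnitude of theta.
    grads = np.gradient(theta)
    g = 1.0 / (1.0 + sum(d ** 2 for d in grads))
    dg = np.gradient(g)

    # Initial surface: a box just inside the volume boundary, encoded as a signed
    # distance function (negative inside), per the signed distance transform mentioned above.
    inside = np.zeros_like(theta, dtype=bool)
    inside[1:-1, 1:-1, 1:-1] = True
    psi = ndimage.distance_transform_edt(~inside) - ndimage.distance_transform_edt(inside)

    for _ in range(n_iter):
        dpsi = np.gradient(psi)
        mag = np.sqrt(sum(d ** 2 for d in dpsi)) + eps
        # Mean curvature: kappa = div(grad(psi) / |grad(psi)|).
        kappa = sum(np.gradient(dpsi[i] / mag, axis=i) for i in range(3))
        advection = sum(dg[i] * dpsi[i] for i in range(3))
        psi = psi + dt * (g * kappa * mag + advection)

    return psi <= 0
```

In practice the level-set function is typically re-initialized to a signed distance function periodically and a narrow-band implementation is used for efficiency; those refinements are omitted here for brevity.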
After the object surface has been computed, the three-dimensional locations of points on the surface are known and, as indicated in block 46, the surface can be rendered using a graphics engine.
At this point, a three-dimensional model of the object has been produced, which can be used for various purposes, including object localization, object recognition, and motion capture. It can then be determined whether the colors of the object are desired, as indicated in decision block 48 of the flow diagram.
Example System
The processing device 108 can comprise a central processing unit (CPU) that controls the overall operation of the computer system 106 and one or more graphics processor units (GPUs) for graphics rendering. The memory 110 includes any one of or a combination of volatile memory elements (e.g., RAM) and nonvolatile memory elements (e.g., hard disk, ROM, etc.) that store code that can be executed by the processing device 108.
The user interface 112 comprises the components with which a user interacts with the computer system 106. The user interface 112 can comprise conventional computer interface devices, such as a keyboard, a mouse, and a computer monitor. The one or more I/O devices 114 are adapted to facilitate communications with other devices and may include one or more communication components such as a modulator/demodulator (e.g., modem), wireless (e.g., radio frequency (RF)) transceiver, network card, etc.
The memory 110 (i.e., a computer-readable medium) comprises various programs (i.e., logic) including an operating system 118 and three-dimensional modeling system 120. The operating system 118 controls the execution of other programs and provides scheduling, input-output control, file and data management, memory management, and communication control and related services. The three-dimensional modeling system 120 comprises one or more algorithms and/or programs that are used to model a three-dimensional object from two-dimensional views in the manner described in the foregoing. Furthermore, memory 110 comprises a graphics rendering program 122 used to render surfaces computed using the three-dimensional modeling system 120.
Various code (i.e., logic) has been described in this disclosure. Such code can be stored on any computer-readable medium for use by or in connection with any computer-related system or method. In the context of this document, a “computer-readable medium” is an electronic, magnetic, optical, or other physical device or means that contains or stores code, such as a computer program, for use by or in connection with a computer-related system or method. The code can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
This application claims priority to U.S. provisional application entitled “A Homographic Framework for the Fusion of Multi-View Silhouettes” having Ser. No. 61/026,561, filed Feb. 6, 2008, which is entirely incorporated herein by reference.
This invention was made with Government support under Contract/Grant No.: NBCHCOB0105, awarded by the U.S. Government VACE program. The Government has certain rights in the claimed inventions.