Traditionally, visual hull based approaches have been used to model three-dimensional objects. In such approaches, object silhouettes are obtained from multiple time-synchronized cameras or, if a single camera is used for a fly-by (or a turn table setup), the scene is assumed to be static. Those constraints generally limit the applicability of visual hull based approaches to controlled laboratory conditions. In real-life situations, a sophisticated multiple camera setup may not be practical. If a single camera is used to capture multiple views by going around the object, it is not reasonable to assume that the object will remain static over the course of time it takes to obtain the views of the object, especially if the object is a person, animal, or vehicle on the move. Although there has been some work on using visual hull reconstruction in monocular video sequences of rigidly moving objects to recover shape and motion, these methods involve the estimation of 6 degrees of freedom (DOF) rigid motion of the object between successive frames. To handle non-rigid motion, the use of multiple cameras becomes indispensable.
From the above, it can be appreciated that it would be desirable to have alternative systems and methods for three-dimensionally modeling moving objects.
The present disclosure may be better understood with reference to the following figures. Matching reference numerals designate corresponding parts throughout the figures, which are not necessarily drawn to scale.
Disclosed herein are systems and methods for three-dimensionally modeling, or reconstructing, moving objects, whether the objects are rigidly moving (i.e., the entire object is moving as a whole), non-rigidly moving (i.e., one or more discrete parts of the object are articulating or deforming), or both. The objects are modeled using the concept of motion-blurred scene occupancies, which is a direct analogy of motion-blurred two-dimensional images but in a three-dimensional scene occupancy space. Just as a motion-blurred photograph results from the movement of a scene object, or of the camera capturing the photograph, while the camera sensor accumulates scene information over the exposure time, three-dimensional scene occupancies are mixed with non-occupancies when there is motion, resulting in a motion-blurred occupancy space.
In some embodiments, an image-based fusion step that combines color and silhouette information from multiple views is used to identify temporal occupancy points (TOPs), which are the estimated three-dimensional scene locations of silhouette pixels and contain information about the duration of time the pixels were occupied. Instead of explicitly computing the TOPs in three-dimensional space, the projected locations of the TOPs are identified in each view to account for monocular video and arbitrary camera motion in scenarios where complete camera calibration information may not be available. The result is a set of blurred scene occupancy images in the corresponding views, where the values at each pixel correspond to the fraction of total time duration that the pixel observed an occupied scene location and where greater blur (lesser occupancy value) is interpreted as greater mixing of occupancy with non-occupancy in the total time duration. Motion deblurring is then used to deblur the occupancy images. The deblurred occupancy images correspond to silhouettes of the mean/motion compensated object shape and can be used to obtain a visual hull reconstruction of the object.
Silhouette information has been used in the past to estimate occupancy grids for the purpose of object detection and reconstruction. Due to the inherent nature of visual hull based approaches, if the silhouettes correspond to a non-stationary object obtained at different time steps (e.g., monocular video), grid locations that are not occupied consistently will be carved out. As a result, only an internal body core of the reconstructed object (the consistently occupied scene locations) survives the visual hull intersection. An initial task is therefore to identify occupancy grid locations that are occupied by the scene object and to determine the durations for which the grid locations are occupied. In essence, the scene locations giving rise to the silhouettes in each view are to be estimated.
Obtaining Scene Occupancies
Let {I_t, S_t} be the set of color and corresponding foreground silhouette information generated by a stationary object O in T views obtained at times t=1, . . . , T in a monocular video sequence (e.g., a camera flying around the object).
If, however, object O is non-stationary, as depicted in
When scene calibration information is available, ξ_ij and τ_ij can be obtained by successively projecting r_ij into the image planes and retaining the section that projects to within the maximum number of silhouette images. To refine the localization of the three-dimensional scene point P_ij (corresponding to the silhouette pixel p_ij) along ξ_ij, another construct called the temporal occupancy point (TOP) is used. The temporal occupancy point is obtained by enforcing an appearance/color constancy constraint as described in the next section.
If the views of the object are captured at a rate faster than its motion, then without loss of generality, a non-stationary object O can be considered to be piecewise stationary: O={O1:s
The above-described process is demonstrated on an actual moving object 10 in
Because monocular video sequences are used, complete camera calibration may not be available at each time instant, particularly if the camera motion is arbitrary. For that reason, a purely image-based approach is used. Instead of determining each silhouette pixel's corresponding temporal occupancy point explicitly in three-dimensional space, the projections (images) of the temporal occupancy point are obtained for each view. If the object were stationary and the scene point were visible in every view, a simple stereo-based search algorithm could be used. Given the fundamental matrices between views, the ray through a pixel in one view can be directly imaged in other views using the epipolar constraint. The images of the temporal occupancy point could then be obtained by searching along the epipolar lines (in the object silhouette regions) for a correspondence across views that has minimum color variance. However, when the object is not stationary and the scene point is therefore not guaranteed to be visible from every view, a stereo-based approach is not viable. It is therefore proposed that homographies induced between the views by a pencil of planes be used instead for a point-to-point transformation.
With reference to
The parameter γ determines how far above the reference plane the new plane is. The projection of the temporal bounding edge ξ_ij in the image planes can be obtained by warping p_ij with homographies of successively higher planes (by incrementing the value of γ) and selecting the range of γ for which p_ij warps to within the largest number of silhouette images. The image of p_ij's temporal occupancy point in all the other views is then obtained by finding the value of γ, within the previously determined range, for which p_ij and its homographically warped locations have minimum color variance in the visible images. The upper bound on occupancy duration τ_ij is evaluated as the ratio of the number of views in which ξ_ij projects to within silhouette boundaries to the total number of views. This value is stored for each imaged location of p_ij's temporal occupancy point in every other view.
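By way of illustration only, the following Python sketch shows one way the plane sweep described above could be carried out. Because Equation 1 is not reproduced in this excerpt, the parameterization used in plane_homography() (adding γ times the up-direction vanishing point to the last column of the reference-plane homography) is an assumption, and the helper names, input conventions, and tie-breaking rule are hypothetical.

```python
import numpy as np

def plane_homography(H0, v_up, gamma):
    """Homography induced by a plane gamma units above the reference plane.

    Assumes the form H_gamma = H0 + gamma * [0 | 0 | v_up], where H0 is the
    reference-plane homography from the pixel's view to a target view and v_up is
    the vanishing point (homogeneous 3-vector) of the up direction. This stands in
    for Equation 1, which is not reproduced here.
    """
    H = H0.astype(float).copy()
    H[:, 2] += gamma * v_up
    return H

def warp_point(H, p):
    """Warp pixel p = (x, y) with a 3x3 homography H."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

def locate_top(p, color_p, views, gammas):
    """Sweep a pencil of planes through pixel p to localize its temporal occupancy point.

    `color_p` is the color at p in its own view. `views` is a list of dicts with keys
    'H0' (reference-plane homography to that view), 'v_up' (up vanishing point in that
    view), 'silhouette' (binary mask), and 'image' (color image). Returns the best
    gamma, the warped TOP locations per view, and the occupancy-duration estimate tau.
    """
    best = None
    for gamma in gammas:
        hits, colors, locs = 0, [np.asarray(color_p, float)], {}
        for k, v in enumerate(views):
            x, y = warp_point(plane_homography(v['H0'], v['v_up'], gamma), p)
            xi, yi = int(round(x)), int(round(y))
            h, w = v['silhouette'].shape
            if 0 <= yi < h and 0 <= xi < w and v['silhouette'][yi, xi]:
                hits += 1
                colors.append(v['image'][yi, xi].astype(float))
                locs[k] = (xi, yi)
        if hits == 0:
            continue
        variance = np.var(np.stack(colors), axis=0).sum()
        # Prefer gammas seen by the most silhouettes (the temporal bounding edge),
        # then break ties by color constancy (the temporal occupancy point).
        score = (hits, -variance)
        if best is None or score > best[0]:
            best = (score, gamma, locs, hits / len(views))
    if best is None:
        return None
    _, gamma, locs, tau = best
    return gamma, locs, tau
```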
Building Blurred Occupancy Images
As described above, the image location of a silhouette pixel's temporal occupancy point can be obtained in every other view. The boundary of the object silhouette in each view can be uniformly sampled, and the temporal occupancy points of the sampled boundary pixels can be projected into all the views. The accumulation of the projected temporal occupancy points delivers a corresponding set of images referred to herein as blurred occupancy images: B_t; t=1, . . . , T. Example blurred occupancy images are shown in
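By way of illustration only, the projected temporal occupancy points might be accumulated into blurred occupancy images as sketched below, reusing the hypothetical locate_top() output from the previous sketch. The text does not specify how durations that land on the same pixel are combined; keeping the maximum is an assumption.

```python
import numpy as np

def build_blurred_occupancy_images(image_shape, tops, num_views):
    """Accumulate projected temporal occupancy points into blurred occupancy images.

    `tops` is an iterable of (locs, tau) pairs, e.g., the last two values returned by
    locate_top() for each sampled boundary pixel in each view: `locs` maps a view index
    to the imaged (x, y) location of that pixel's TOP, and `tau` is its occupancy
    duration in [0, 1]. Overlapping durations are combined by taking the maximum (an
    assumption; the text does not specify the rule).
    """
    h, w = image_shape
    B = [np.zeros((h, w), dtype=float) for _ in range(num_views)]
    for locs, tau in tops:
        for view_idx, (x, y) in locs.items():
            B[view_idx][y, x] = max(B[view_idx][y, x], tau)
    return B  # values lie in [0, 1] and can be read directly as an alpha matte
```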
Motion Deblurring
The motion blur in the blurred occupancy images can be modeled as the convolution of a blur kernel with the latent occupancy image plus noise:
B = L ⊗ K + n, [Equation 2]
where B is the blurred occupancy image, L is the latent or unblurred occupancy image, K is the blur kernel, also known as the point spread function (PSF), and n is additive noise. Conventional blind deconvolution approaches focus on the estimation of K to deconvolve B using image intensities or gradients. In traditional images, there is additional complexity that may be induced by the background, which may not undergo the same motion as the object; the PSF has a uniform definition only on the moving object. This, however, is not a factor in the present case, since the information in the blurred occupancy images corresponds only to the motion of the object. Therefore, the foreground object can be segmented as a blurred transparency layer, and the transparency information can be used in a maximum a posteriori (MAP) framework to obtain the blur kernel. By avoiding taking all pixel colors and complex image structures into the computation, this approach has the advantage of simplicity and robustness, but it requires the estimation of the object transparency, or alpha matte. The object occupancy information in the blurred occupancy maps, once normalized to the [0, 1] range, can be directly interpreted as the transparency information or an alpha matte of the foreground object.
The blur filter estimation maximizes the likelihood that the resulting image, when convolved with the resulting PSF, is an instance of the blurred image, assuming Poisson noise statistics. The process deblurs the image and refines the PSF simultaneously, using an iterative process similar to the accelerated, damped Lucy-Richardson algorithm. An initial guess for the PSF can be a simple translational motion kernel. That guess is then fed into the blind deconvolution approach, which iteratively restores the blurred image and refines the PSF to deliver deblurred occupancy maps L_t; t=1, . . . , T, which are used in the final reconstruction.
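By way of illustration only, a plain alternating blind Richardson-Lucy scheme, sketched below in Python, conveys the idea of simultaneously refining the latent occupancy image and the PSF. It is not the accelerated, damped variant referred to above, and the kernel size, iteration counts, and update details are assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def _rl_image_update(L, B, K, eps=1e-8):
    """One Richardson-Lucy multiplicative update of the latent image L."""
    ratio = B / (fftconvolve(L, K, mode='same') + eps)
    return L * fftconvolve(ratio, K[::-1, ::-1], mode='same')

def _rl_kernel_update(K, B, L, eps=1e-8):
    """One multiplicative update of the PSF K, holding the latent image L fixed."""
    ratio = B / (fftconvolve(L, K, mode='same') + eps)
    corr = fftconvolve(ratio, L[::-1, ::-1], mode='same')    # correlate ratio with L
    ch, cw = corr.shape[0] // 2, corr.shape[1] // 2
    kh, kw = K.shape
    win = corr[ch - kh // 2: ch - kh // 2 + kh, cw - kw // 2: cw - kw // 2 + kw]
    K = np.clip(K * win, 0, None)
    return K / (K.sum() + 1e-12)                             # keep the PSF normalized

def blind_deblur(B, kernel_size=15, iters=20):
    """Blind deconvolution of a blurred occupancy image B (values in [0, 1]).

    Alternates Richardson-Lucy updates of the latent occupancy image and of the PSF,
    starting from a horizontal translational-motion kernel as the initial guess.
    """
    L = B.astype(float).copy()
    K = np.zeros((kernel_size, kernel_size))
    K[kernel_size // 2, :] = 1.0 / kernel_size   # initial guess: translational motion
    for _ in range(iters):
        L = np.clip(_rl_image_update(L, B, K), 0, 1)
        K = _rl_kernel_update(K, B, L)
    return L, K
```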
It should be noted that the above-described deblurring approach assumes uniform motion blur. However, that may not always be the case in natural scenes. For instance, due to the difference in motion between the arms and the legs of a walking person, the blur patterns in the occupancies may differ, and hence a different blur kernel may need to be estimated for each section. Because of the challenges this involves, a user may instead specify different crop regions of the blurred occupancy images, each with uniform motion, that can be restored separately.
Final Reconstruction
Once motion-deblurred occupancy maps have been generated, the final step is to perform a probabilistic visual hull intersection. Existing approaches can be used for that purpose. In some embodiments, the approach described in related U.S. patent application Ser. No. 12/366,241 (“the Khan approach”) is used to perform the visual hull intersection, given that it handles arbitrary camera motion without requiring full calibration. In the Khan approach, the three-dimensional structure of objects is modeled as being composed of an infinite number of cross-sectional slices, with the frequency of slice sampling being a variable that determines the granularity of the reconstruction. Using planar homographies induced between views by a reference plane (e.g., the ground plane) in the scene, the occupancy maps L_i (foreground silhouette information) from all the available views are fused into an arbitrarily chosen reference view, performing visual hull intersection in the image plane. This process delivers a two-dimensional grid of object occupancy likelihoods representing a cross-sectional slice of the object. Consider a reference plane π in the scene inducing homographies H_i
where θ_ref is the projectively transformed grid of object occupancy likelihoods, or an object slice. Significantly, using this homographic framework, visual hull intersection is performed in the image plane without going into three-dimensional space.
Subsequent slices of the object are obtained by extending the process to planes parallel to the reference plane in the normal direction. Homographies of those new planes can be obtained using the relationship in Equation 3. The occupancy grids/slices are stacked on top of each other, creating a three-dimensional data structure Θ = [θ_1; θ_2; . . . ; θ_n] that encapsulates the object shape. Θ is not an entity in the three-dimensional world or a collection of voxels; it is, simply put, a logical arrangement of planar slices representing discrete samplings of the continuous occupancy space. The object structure is then segmented out from Θ, i.e., simultaneously segmented out from all the slices, by evolving a smooth surface S, using level sets, that divides Θ between the object and the background.
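By way of illustration only, the slice fusion and stacking just described might be sketched as follows. The product-based fusion rule and the parallel-plane parameterization (the same assumed form as in plane_homography() above) are assumptions, since the fusion equation and Equation 3 are not reproduced in this excerpt; here each H0 is assumed to map a view's image plane onto the reference view via the reference plane.

```python
import cv2
import numpy as np

def fuse_slice(maps, H0s, v_ups, gamma, ref_shape):
    """Fuse occupancy maps into one occupancy-likelihood slice in the reference view.

    Each view is warped into the reference view with the homography induced by the
    plane gamma units above the reference plane, and the warped maps are multiplied,
    so a grid location keeps a high likelihood only if every view supports occupancy
    there (a product implements the intersection; the exact fusion rule is assumed).
    """
    h, w = ref_shape
    slice_ = np.ones((h, w), dtype=np.float32)
    for L, H0, v_up in zip(maps, H0s, v_ups):
        H = H0.astype(float).copy()
        H[:, 2] += gamma * v_up                    # assumed plane parameterization
        slice_ *= cv2.warpPerspective(L.astype(np.float32), H, (w, h))
    return slice_

def build_occupancy_volume(maps, H0s, v_ups, gammas, ref_shape):
    """Stack slices for a range of plane heights into the data structure Theta."""
    return np.stack([fuse_slice(maps, H0s, v_ups, g, ref_shape) for g in gammas])
```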
Application of the above-described approach will now be discussed with reference to the flow diagram of
With reference back to
Image subtraction typically cannot be used, however, in cases in which the images were captured by a single camera in a random flyby of an object, given that it is difficult to obtain the same viewpoint of the scene without the object present. In such a situation, image alignment can be performed to identify the foreground silhouettes. Although consecutive views can be placed in registration with each other by aligning the images with respect to detectable features of the ground plane, such registration leaves the image pixels that correspond to the object misaligned due to plane parallax. This misalignment can be detected by performing a photo-consistency check, i.e., comparing the color values of two consecutive aligned views. Any pixel that has a mismatch from one view to the other (i.e., a color value difference greater than a threshold) is marked as a pixel pertaining to the object.
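By way of illustration only, such a photo-consistency check between two ground-plane-aligned views might look as follows; the color-difference threshold is arbitrary and would be tuned in practice.

```python
import cv2
import numpy as np

def parallax_silhouette(view_a, view_b, H_ba, threshold=30.0):
    """Mark object pixels by a photo-consistency check between two aligned views.

    `H_ba` is the ground-plane homography that registers view_b onto view_a. Ground-plane
    pixels align and remain photo-consistent; pixels belonging to the object are shifted
    by plane parallax and show a large color mismatch, so they are marked as foreground.
    """
    h, w = view_a.shape[:2]
    aligned_b = cv2.warpPerspective(view_b, H_ba, (w, h))
    diff = np.linalg.norm(view_a.astype(float) - aligned_b.astype(float), axis=2)
    return diff > threshold  # boolean silhouette mask in view_a's frame
```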
The alignment between such views can be determined by finding the transformation, i.e., the planar homography, between the views. In some embodiments, the homography between any two views can be determined by first identifying features of the ground plane using an appropriate algorithm or program, such as a scale-invariant feature transform (SIFT) algorithm or program. Once the features have been identified, they can be matched across the views and the homographies can be determined in the manner described above. By way of example, at least four features are identified to align any two views. In some embodiments, a suitable algorithm or program, such as a random sample consensus (RANSAC) algorithm or program, can be used to ensure that the identified features are in fact contained within the ground plane.
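By way of illustration only, a conventional SIFT-plus-RANSAC estimate of the ground-plane homography between two views can be obtained with OpenCV as sketched below; this is one standard way of doing it, not necessarily the exact implementation contemplated above.

```python
import cv2
import numpy as np

def ground_plane_homography(img_a, img_b, ratio=0.75):
    """Estimate the homography mapping img_b onto img_a using SIFT matches and RANSAC.

    RANSAC keeps only matches consistent with a single homography; in a scene whose
    matched features lie mostly on the ground plane, this approximates the ground-plane
    homography. At least four good matches are required to fit a homography.
    """
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY), None)
    kp_b, des_b = sift.detectAndCompute(cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY), None)

    matches = cv2.BFMatcher().knnMatch(des_b, des_a, k=2)       # match b -> a
    good = [m for m, n in matches if m.distance < ratio * n.distance]
    if len(good) < 4:
        raise ValueError("need at least four matches to fit a homography")

    src = np.float32([kp_b[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_a[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H_ba, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H_ba
```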
Once the silhouettes of the object have been identified, the boundary (i.e., edge) of each silhouette is uniformly sampled to identify a plurality of silhouette boundary pixels (p), as indicated in block 24. The number of boundary pixels that are sampled for each silhouette can be selected relative to the results that are desired and the amount of computation that will be required. Generally speaking, however, the greater the number of silhouette boundary pixels that are sampled, the more accurate the reconstruction of the object will be. By way of example, one pixel may be sampled for every 8-pixel neighborhood.
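By way of illustration only, uniform sampling of a silhouette boundary can be done by tracing the silhouette contour and keeping every Nth boundary pixel, as sketched below.

```python
import cv2
import numpy as np

def sample_boundary_pixels(silhouette, step=8):
    """Uniformly sample silhouette boundary pixels, roughly one per `step` contour pixels."""
    contours, _ = cv2.findContours(silhouette.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    boundary = np.vstack([c.reshape(-1, 2) for c in contours])  # (x, y) boundary points
    return boundary[::step]
```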
Referring next to block 26, the temporal bounding edge (ξ) is determined for each silhouette boundary pixel of each view. As described above, the temporal bounding edge is the portion of a ray (extending from an image point (p) to its associated three-dimensional scene point (P)) that is within the silhouette image of a maximum number of views. In some embodiments, the temporal bounding edge for each silhouette boundary pixel can be determined by transforming the pixel to each of the other views using multiple plane homographies as per Equation 1. In such a process, each pixel is warped with the homographies induced by a pencil of planes starting from the ground reference plane and moving to successively higher parallel planes (φ) by incrementing the value of γ. The range of γ for which the boundary pixel homographically warps to within the largest number of silhouette images is then selected, thereby delineating the temporal bounding edge of the silhouette boundary pixel.
Once the temporal bounding edge for each silhouette boundary pixel has been determined, the occupancy duration (τ) for each silhouette boundary pixel can likewise be determined, as indicated in block 28. As described above, the occupancy duration is the ratio of the number of views in which the temporal bounding edge projects to within silhouette boundaries to the total number of views.
Next, with reference to block 30, the location of the temporal occupancy point in each view is determined for each silhouette boundary pixel. As described above, the temporal occupancy point is the point along the temporal bounding edge that most closely estimates the localization of the three-dimensional scene point that gave rise to the silhouette boundary pixel. In some embodiments, the temporal occupancy point is determined by finding the value of γ, in the previously-determined range of γ, for which the silhouette boundary pixel and its homographically warped locations have minimum color variance in the visible images. As mentioned above, if the object is piecewise stationary, it can be assumed that the object is static and a photo-consistency check can be performed to identify the temporal occupancy point. Once the temporal occupancy points have been determined, the occupancy duration values at the temporal occupancy points in each view can then be stored, as indicated in block 32 of
Once the temporal occupancy point has been determined for each silhouette boundary pixel in each view, the temporal occupancy points can be used to generate a set of blurred occupancy images, as indicated in block 34. The set will comprise one blurred occupancy image for each view of the object.
Next, with reference to block 36, motion deblurring is performed on the blurred occupancy images to generate deblurred occupancy maps. In some embodiments, deblurring comprises segmenting the foreground object as a blurred transparency layer and using the transparency information in a MAP framework to obtain the blur kernel. In that process, an initial guess for the PSF is fed into a blind deconvolution approach that iteratively restores the blurred image and refines the PSF to deliver the deblurred occupancy maps.
Once the deblurred occupancy maps have been obtained, visual hull intersection can be performed to generate the object model, or reconstruction. For the present embodiment, it is assumed that visual hull intersection is performed using the procedure described in related U.S. patent application Ser. No. 12/366,241, in which multiple slices of the object are estimated and the slices are used to compute a surface that approximates the outer surface of the object.
With reference to block 38, one of the deblurred occupancy maps is designated as the reference view. Next, each of the other maps is warped to the reference view relative to the reference plane (e.g., ground plane), as indicated in block 40. That is, the various maps are transformed by obtaining the planar homography between each map and the reference view that is induced by the reference plane. Notably, those homographies can be obtained by determining the homographies between consecutive maps and concatenating each of those homographies to produce the homography between each of the maps and the reference view. Such a process may be considered preferable given that it may reduce error that could otherwise occur when homographies are determined between maps that are spaced far apart from each other.
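By way of illustration only, the concatenation of consecutive-view homographies into map-to-reference homographies amounts to a chained matrix product, as sketched below.

```python
import numpy as np

def homographies_to_reference(consecutive_Hs):
    """Chain consecutive-view homographies into view-to-reference homographies.

    `consecutive_Hs[i]` maps view i+1 onto view i, with view 0 taken as the reference.
    The homography from view k to the reference is the product
    H(0<-1) @ H(1<-2) @ ... @ H(k-1<-k). Estimating between consecutive views keeps each
    individual homography well-conditioned, since neighboring views overlap the most.
    """
    H_to_ref = [np.eye(3)]
    for H in consecutive_Hs:
        H_next = H_to_ref[-1] @ H
        H_to_ref.append(H_next / H_next[2, 2])   # renormalize the projective scale
    return H_to_ref
```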
After each of the maps, and their silhouettes, has been transformed (i.e., warped to the reference view using the planar homography), the warped silhouettes of each map are fused together to obtain a cross-sectional slice of a visual hull of the object that lies in the reference plane, as indicated in block 42. That is, a first slice of the object (i.e., a portion of the object that is occluded from view) that is present at the ground plane is estimated.
The above process can be replicated to obtain further slices of the object that lie in planes parallel to the reference plane. Given that those other planes are imaginary, and therefore comprise no identifiable features, the transformation used to obtain the first slice cannot be performed to obtain the other slices. However, because the homographies induced by the reference plane and the location of the vanishing point in the up direction are known, the homographies induced by any plane parallel to the reference plane can be estimated. Therefore, each of the views can be warped to the reference view relative to new planes, and the warped silhouettes that result can be fused together to estimate further cross-sectional slices of the visual hull, as indicated in block 44 of
As described above, the homographies can be estimated using Equation 1 in which γ is a scalar multiple that specifies the locations of other planes along the up direction. Notably, the value for γ can be selected by determining the range for γ that spans the object. This is achieved by incrementing γ in Equation 1 until a point is reached at which there is no shadow overlap, indicating that the current plane is above the top of the object. Once the range has been determined, the value for γ at that point can be divided by the total number of planes that are desired to determine the appropriate value of γ to use. For example, if γ is 10 at the top of the object and 100 planes are desired, γ can be set to 0.1 to obtain the homographies induced by the various planes.
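By way of illustration only, the search for the top of the object described above might be implemented as follows; the plane parameterization and the step size are assumptions.

```python
import cv2
import numpy as np

def find_gamma_top(silhouettes, H0s, v_ups, ref_shape, step=0.1, max_gamma=1000.0):
    """Increment gamma until the warped silhouettes no longer overlap.

    The returned gamma approximates the plane height just above the top of the object;
    dividing it by the desired number of planes gives the gamma increment to use when
    sweeping the slicing planes.
    """
    h, w = ref_shape
    gamma = 0.0
    while gamma < max_gamma:
        overlap = np.ones((h, w), dtype=bool)
        for S, H0, v_up in zip(silhouettes, H0s, v_ups):
            H = H0.astype(float).copy()
            H[:, 2] += gamma * v_up                    # assumed plane parameterization
            overlap &= cv2.warpPerspective(S.astype(np.uint8), H, (w, h)) > 0
        if not overlap.any():                          # no overlap: above the object
            return gamma
        gamma += step
    return max_gamma
```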
At this point in the process, multiple slices of the object have been estimated.
Although the slices have been estimated, their precise boundaries are still unknown and, therefore, the precise boundaries of the object are likewise unknown. One way in which the boundaries of the slices could be determined is to establish thresholds for each of the slices to separate image data considered part of the object from image data considered part of the background. In the current embodiment, however, the various slices are first stacked on top of each other along the up direction, as indicated in block 46 of
As described in related U.S. patent application Ser. No. 12/366,241, the surface can be computed by minimizing an energy function that comprises a first term that identifies portions of the data having high gradient (thereby identifying the boundary of the object) and a second term that measures the surface area of the object surface. By minimizing both terms, the surface is optimized as a surface that moves toward the object boundary and has as small a surface area as possible. In other words, the surface is optimized to be the tightest surface that separates the object from the background in the three-dimensional data.
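Although the exact energy functional of the referenced application is not reproduced here, an energy of the general geodesic active-surface form captures the two terms described above; the edge-stopping function g and the weight λ are assumptions.

```latex
E(S) \;=\; \int_{S} g\!\left(\lvert \nabla \Theta \rvert\right) dA \;+\; \lambda \int_{S} dA,
\qquad g(r) \;=\; \frac{1}{1 + r^{2}},
```

where minimizing the first term draws the surface S toward high-gradient (boundary) regions of Θ, and the second term penalizes surface area so that the tightest such surface is preferred.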
After the object surface has been computed, the three-dimensional locations of points on the surface are known and, as indicated in block 50, the surface can be rendered using a graphics engine.
At this point, a three-dimensional model of the object has been produced, which can be used for various purposes, including object localization, object recognition, and motion capture. It can then be determined whether the colors of the object are desired, as indicated in decision block 52 of
To quantitatively analyze the above-described process, an experiment was conducted in which several monocular sequences of an object were obtained. In each flyby of the camera, the object was kept stationary but the posture (arm position) of the object was incrementally changed between flybys. Because the object was kept stationary, the sequences are referred to herein as rigid sequences. Each rigid sequence consisted of 14 views of the object with a different arm position at a resolution of 480×720 with the object occupying a region of approximately 150×150 pixels.
A monocular sequence of a non-rigidly deforming object was assembled by selecting two views from each rigid sequence in order, thereby creating a set of fourteen views of the object as it changes posture. Reconstruction on this assembled non-rigid, monocular sequence was performed using the occupancy deblurring approach described above and the visualization of the results is shown in
where ν is a voxel in the voxel space, O_test is the three-dimensional reconstruction to be compared, and Q_rig^i is the visual hull reconstruction from the ith rigid sequence. S_i is the similarity score, i.e., the square of the ratio of non-overlapping to overlapping voxels that are part of the reconstructions; the closer S_i is to zero, the greater the similarity. Shown in
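Because the defining equation is not reproduced in this excerpt, the following Python sketch follows only the prose description of the similarity score; the handling of an empty overlap is an assumption.

```python
import numpy as np

def similarity_score(O_test, Q_rig):
    """Similarity between two voxelized reconstructions, per the description above.

    Both inputs are boolean occupancy grids of the same shape. The score is the square
    of the ratio of non-overlapping to overlapping voxels among the voxels belonging to
    either reconstruction; a value of 0 means identical reconstructions.
    """
    O_test, Q_rig = np.asarray(O_test, bool), np.asarray(Q_rig, bool)
    overlap = np.logical_and(O_test, Q_rig).sum()
    non_overlap = np.logical_xor(O_test, Q_rig).sum()
    if overlap == 0:
        return np.inf   # the reconstructions share no voxels at all
    return (non_overlap / overlap) ** 2
```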
The processing device 108 can comprise a central processing unit (CPU) that controls the overall operation of the computer system 106 and one or more graphics processor units (GPUs) for graphics rendering. The memory 110 includes any one of or a combination of volatile memory elements (e.g., RAM) and nonvolatile memory elements (e.g., hard disk, ROM, etc.) that store code that can be executed by the processing device 108.
The user interface 112 comprises the components with which a user interacts with the computer system 106. The user interface 112 can comprise conventional computer interface devices, such as a keyboard, a mouse, and a computer monitor. The one or more I/O devices 114 are adapted to facilitate communications with other devices and may include one or more communication components such as a modulator/demodulator (e.g., modem), wireless (e.g., radio frequency (RF)) transceiver, network card, etc.
The memory 110 (i.e., a computer-readable medium) comprises various programs (i.e., logic) including an operating system 118 and three-dimensional modeling system 120. The operating system 118 controls the execution of other programs and provides scheduling, input-output control, file and data management, memory management, and communication control and related services. The three-dimensional modeling system 120 comprises one or more algorithms and/or programs that are used to model a three-dimensional moving object from two-dimensional views in the manner described in the foregoing. Furthermore, memory 110 comprises a graphics rendering program 122 used to render surfaces computed using the three-dimensional modeling system 120.
Various code (i.e., logic) has been described in this disclosure. Such code can be stored on any computer-readable medium for use by or in connection with any computer-related system or method. In the context of this document, a “computer-readable medium” is an electronic, magnetic, optical, or other physical device or means that contains or stores code, such as a computer program, for use by or in connection with a computer-related system or method. The code can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
This application claims priority to co-pending U.S. non-provisional application entitled “Systems and Methods for Modeling Three-Dimensional Objects from Two-Dimensional Images” and having Ser. No. 12/366,241, filed Feb. 5, 2009, which is entirely incorporated herein by reference.
The disclosed inventions were made with Government support under Contract/Grant No.: NBCHCOB0105, awarded by the U.S. Government VACE program. The Government has certain rights in the claimed inventions.