Stereo and three-dimensional (3D) reconstructions are used by many applications such as object modeling, facial expression studies, and human motion analysis. Typically, multiple high frame rate cameras are used to obtain stereo images. Special hardware and/or sophisticated software is generally used, however, to synchronize such multiple high frame rate cameras.
The present invention is embodied in methods, systems, and apparatus for generating depth maps, 3D structures, and volumetric reconstructions. In accordance with one embodiment, a depth map is generated by obtaining a transformation for a camera having a still image capture mode and a video mode (the transformation providing image translation and scaling between the still image capture mode and the video mode), capturing at least one multi-view still image with the camera, capturing multi-view video with the camera, estimating relative depth values through stereo matching of the still images, and generating a resolved video depth map from the transformation, the at least one multi-view still image, and the multi-view video images. The multi-view still image may be a stereo still image and the multi-view video images may be stereo video. Multiple 3D structures from multiple prism camera apparatus may be combined to generate a volumetric reconstruction (a 3D image scene).
An embodiment of an apparatus for generating a depth map includes a camera having a lens (the camera having a still capture mode and a video capture mode), a prism positioned in front of the lens having a first surface, a second surface, and a third surface, the first surface facing the lens, a first mirror positioned proximate to the second surface of the prism, and a second mirror positioned proximate to the third surface of the prism. The apparatus may include a processor configured to generate a resolved video depth map from a transformation for the camera, at least one multi-view still image from the camera, and multi-view video from the camera. Two or more apparatus may be combined to form a system for generating a volumetric reconstruction.
The invention is best understood from the following detailed description when read in connection with the accompanying drawings, with like elements having the same reference numerals. It is emphasized that, according to common practice, the various features of the drawings are not drawn to scale. On the contrary, the dimensions of the various features are arbitrarily expanded or reduced for clarity. Included in the drawings are the following figures:
Simultaneously, light from the scene being imaged impinges on the second mirror 112b. The second mirror 112b reflects the light toward the third surface 114c of prism 110. The light passes through the third surface 114c and is reflected within the prism 110 by the second surface 114b. The reflected light passes through the first surface 114a toward lens 106, which focuses the light on a second portion 116b of an imaging device (e.g., a charge coupled device (CCD) within camera 102).
As depicted in
In deriving the above, it is assumed that there is no inversion of the image from any of the reflections. This assumption may be violated at large fields of view; in the exemplary setup it holds for φ<60°. Since no lenses other than the camera lens are used, the field of view of each resulting virtual camera should be half that of the real camera.
In
In an exemplary setup, the parameters used were a focal length of 35 mm corresponding to φ=17°, β=49.3°, m=76.2 mm, and x=25.4 mm. Varying the mirror angles provides control over the effective baseline as well as the vergence of the stereo imaging system.
Conventional multi-camera systems use single-view cameras rather than stereo cameras due to issues associated with synchronization and re-calibration whenever the vergence, zoom, etc. of the stereo cameras are changed. Using prism cameras 100 in accordance with the present invention avoids these issues because only a rigid transformation (three-dimensional translation and rotation) corresponding to each prism camera 100 is needed for the processor 402 to combine images/frames from multiple cameras, which can be performed using conventional processors. One of skill in the art would understand how to combine images using conventional procedures from the description herein. A rigid transformation may be used to map points in one 3D coordinate system to another such that the distances between points do not change and the angle between any two straight lines is preserved. An exemplary rigid transformation consists of two parts: a 3×3 rotation matrix R and a 3×1 translation vector T. The mapping (x′,y′,z′) of a point (x,y,z) may be obtained by the following equation:

(x′, y′, z′)ᵀ = R·(x, y, z)ᵀ + T
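For illustration, a rigid transformation of this form can be applied with a few lines of NumPy; the following is a minimal sketch (the function and array names are illustrative, not part of the invention):

```python
import numpy as np

def apply_rigid_transform(points, R, T):
    """Map Nx3 points from one 3D coordinate system to another.

    points: (N, 3) array of (x, y, z) scene points
    R: (3, 3) rotation matrix; T: (3,) translation vector
    Distances and angles between points are preserved.
    """
    # Row-vector form of (x', y', z')^T = R (x, y, z)^T + T
    return points @ R.T + T
```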
For a pair of prism cameras, these transformations can be obtained by capturing images of a scene with both cameras; estimating 3D structures from both prism cameras independently; obtaining correspondences between images from the two cameras; and obtaining the matrix R and the vector T that provide the optimal mapping between the corresponding points.
An optimal estimate of the transformation is obtained using a least squares process. For a given set of points (x1,y1,z1), …, (xn,yn,zn) with correspondences (x′1,y′1,z′1), …, (x′n,y′n,z′n), the transformation is estimated by solving the following least squares problem:

minimize over R, T:  Σi=1..n ‖R·(xi, yi, zi)ᵀ + T − (x′i, y′i, z′i)ᵀ‖²
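One standard closed-form solution to this least squares problem is the SVD-based (Kabsch) method. The sketch below is illustrative and assumes the corresponding points are supplied as NumPy arrays; the invention does not prescribe a particular solver:

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Find R (3x3) and T (3,) minimizing sum ||R*src_i + T - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Uses the SVD-based closed-form (Kabsch) solution.
    """
    src_c = src - src.mean(axis=0)           # center both point sets
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                      # 3x3 cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    # Guard against a reflection (det = -1) in the recovered rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    T = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, T
```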
An illustration of the alignment process is shown in
In an exemplary embodiment, an initial step (not shown) is performed to estimate a homography (H) transformation between low resolution (LR) video frames and high resolution (HR) still images using a known pattern. The transformation accounts for the camera using different portions of the imaging device (CCD array) for still image capture and for video capture, e.g., due to different aspect ratios. In an exemplary embodiment, the H transformation may need to be performed only once for a prism camera 100 because the translation and scale differences between the LR video and the HR still images of a camera are typically fixed once the camera zoom and the prism 110 and mirrors 112 are set. The H transformation may be re-determined whenever the setup changes, e.g., the zoom or the prism/mirror configuration. The prism camera 100 captures multi-view (e.g., stereo) low resolution (LR) video and periodically captures high resolution (HR) still images. At block 504, an HR image is selected for each LR video image that is closest in time to the capture time of the LR video image. At block 506, each stereo pair is rectified. A disparity map 508 is then obtained using stereo matching. The transformation H is then applied to the disparity map at block 511 to transform the disparity map 508 to the HR image size.
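As a hedged sketch of blocks 506 through 511, the following uses OpenCV's semi-global block matcher as one possible stereo matcher (the description does not mandate a specific matcher) and assumes the pair is already rectified and that H has been estimated beforehand; the function and parameter names are illustrative:

```python
import cv2
import numpy as np

def resolved_disparity(lr_left, lr_right, H, hr_shape):
    """Blocks 508-511 (sketch): compute a disparity map from a
    rectified LR stereo pair, then warp it to the HR still-image
    size using the fixed LR-to-HR homography H.

    lr_left, lr_right: rectified LR grayscale (uint8) frames.
    hr_shape: (height, width) of the HR still image.
    """
    # Block 508: disparity via semi-global block matching (one of
    # many possible stereo matchers).
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64,
                                    blockSize=7)
    disparity = matcher.compute(lr_left, lr_right).astype(np.float32) / 16.0
    # Block 511: apply H to carry the disparity map to the HR image size.
    return cv2.warpPerspective(disparity, H, (hr_shape[1], hr_shape[0]))
```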
In an exemplary embodiment, the prism camera is configured to capture the images substantially simultaneously, e.g., one still image for every 30 frames of video. The capability to capture both stills and video may be required for super-resolution. Certain commercial DSLRs (such as the Canon T1i DSLR) have the capability to capture both still frames and video. In such commercial DSLRs, video is taken continuously and the rate at which still images are captured is adjustable. Other commercial cameras can provide the above capability through the same or different means (e.g., wireless remote, wired trigger, or manual control). Such capabilities are usually provided by the camera itself and require only that the processor operate the camera in a mode that captures both still frames and video; the processor performs no specialized task for this, and the triggering process is the same.
At block 510, motion and warping between the selected HR still image and the disparity map 508 are estimated. In an exemplary embodiment, assuming the scene contains rigid objects, per-object motion between the LR images and the selected HR image is estimated and a scale-invariant feature transform (SIFT) is applied at block 510. The motion compensated HR frame and the transformed depth map are then used to up-sample the disparity map at block 512 in a known manner to create the resolved depth map 502.
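A simplified sketch of block 510 follows, using OpenCV's SIFT implementation; it assumes a single global motion model (one homography) in place of the per-object motion described above, which is a simplification for illustration:

```python
import cv2
import numpy as np

def motion_compensate_hr(lr_frame, hr_still):
    """Block 510 (simplified): estimate motion between an LR video
    frame and the selected HR still with SIFT features, assuming a
    single rigid object (one global homography).
    """
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(lr_frame, None)
    kp2, des2 = sift.detectAndCompute(hr_still, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    # Lowe's ratio test to keep distinctive matches only.
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    M, _ = cv2.findHomography(dst, src, cv2.RANSAC, 5.0)
    # Warp the HR still into alignment with the LR frame's viewpoint;
    # the aligned frame then guides up-sampling of the disparity map
    # (block 512).
    h, w = lr_frame.shape[:2]
    return cv2.warpPerspective(hr_still, M, (w, h))
```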
The disparity d at a pixel relates corresponding points in the left image IL and the right image IR such that:

IL(x+d, y) = IR(x, y).
The disparity may be estimated at each pixel using a method such as a combination of known local and global image matching methods. Suitable methods will be understood by one of skill in the art from the description herein. Such methods are disclosed in the following articles: Rohith M V et al., "Learning image structures for optimizing disparity estimation," ACCV 2010: Tenth Asian Conference on Computer Vision, 2010; Rohith M V et al., "Modified region growing for stereo of slant and textureless surfaces," ISVC 2010: 6th International Symposium on Visual Computing, 2010; Rohith M V et al., "Stereo analysis of low textured regions with application towards sea-ice reconstruction," IPCV '09: The 2009 International Conference on Image Processing, Computer Vision, and Pattern Recognition, 2009; and Rohith M V et al., "Towards estimation of dense disparities from stereo images containing large textureless regions," ICPR '08: Proceedings of the 19th International Conference on Pattern Recognition, 2008.
The method optionally matches each pixel in the right image with a corresponding pixel in the left image under the constraint that the correspondences are smooth. The problem may be posed as a global energy minimization problem in which each disparity assignment to each pixel has an associated cost. The cost consists of the matching error |IL(x+d,y)−IR(x,y)| and the gradient of the disparity, ∇d. The disparity map is the assignment that minimizes the following energy function:

E(d) = Σ(x,y) |IL(x+d(x,y), y) − IR(x, y)| + λ Σ(x,y) |∇d(x, y)|,

where λ is a weight balancing the matching and smoothness terms.
This energy minimization problem can be solved using known techniques such as graph cuts, gradient descent, or region growing. Suitable methods will be understood by one of skill in the art from the description herein. Such methods are described in the above-identified articles, the contents of which are incorporated by reference herein in their entirety.
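As one concrete instance of such an energy minimization, the toy sketch below minimizes the 1D analogue of the energy independently along each scanline by dynamic programming; graph cuts generalize the same idea to 2D smoothness, and the parameter values here are arbitrary illustrative choices:

```python
import numpy as np

def scanline_disparity(left, right, max_d=32, lam=4.0):
    """Minimize, per scanline, the 1D analogue of
    E(d) = sum |IL(x+d,y) - IR(x,y)| + lam * sum |grad d|
    by dynamic programming.  left, right: (H, W) float grayscale
    images.  Returns an (H, W) integer disparity map for the right
    image.
    """
    h, w = right.shape
    d_vals = np.arange(max_d + 1)
    smooth = lam * np.abs(d_vals[:, None] - d_vals[None, :])
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(h):
        # cost[x, d] = |IL(x+d, y) - IR(x, y)|; out-of-range
        # disparities get an infinite cost.
        cost = np.full((w, max_d + 1), np.inf)
        for d in d_vals:
            cost[:w - d, d] = np.abs(left[y, d:] - right[y, :w - d])
        # Forward pass: acc[x, d] = best energy of columns 0..x
        # ending with disparity d.
        acc = cost.copy()
        for x in range(1, w):
            acc[x] += (acc[x - 1][None, :] + smooth).min(axis=1)
        # Backward pass: recover the minimizing assignment.
        disp[y, w - 1] = int(np.argmin(acc[w - 1]))
        for x in range(w - 2, -1, -1):
            disp[y, x] = int(np.argmin(
                acc[x] + lam * np.abs(d_vals - disp[y, x + 1])))
    return disp
```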
The 3D structure is obtained at block 618 from the disparity estimate at block 612 through triangulation at block 614 using the stereo parameters at block 616. At block 614, the process of triangulation consists of projecting two rays for each pair of corresponding pixels in the right and left images. Each ray originates at its camera center (the focal point of all rays belonging to that camera) and passes through the chosen pixel. The position in space where the two rays are closest to each other provides an estimate of the scene point from which they originated. This process is repeated for all pixels in the image to obtain the 3D structure of the scene being imaged. For this, an estimate of the stereo parameters is needed.
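A minimal sketch of the midpoint triangulation described above, assuming the camera centers and unit ray directions (derived from the stereo parameters and the pixel coordinates) are already available:

```python
import numpy as np

def triangulate_midpoint(c1, r1, c2, r2):
    """Block 614 (sketch): for one pixel correspondence, find where
    two rays are closest.  c1, c2: camera centers (3,); r1, r2: unit
    ray directions (3,) through the corresponding pixels.  Returns
    the midpoint of the shortest segment between the rays -- the
    estimate of the scene point.
    """
    # Solve for ray parameters s, t minimizing
    # ||(c1 + s*r1) - (c2 + t*r2)||.
    A = np.stack([r1, -r2], axis=1)           # 3x2 system matrix
    s, t = np.linalg.lstsq(A, c2 - c1, rcond=None)[0]
    p1, p2 = c1 + s * r1, c2 + t * r2         # closest point on each ray
    return 0.5 * (p1 + p2)
```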
At block 616, the stereo parameters are estimated. The stereo parameters comprise intrinsic camera parameters (including focal lengths, image centers, and distortion) and extrinsic parameters (comprising baseline and vergence). For each prism camera, the stereo parameters are estimated by capturing calibration images (images of planar objects with a checkerboard pattern placed in varying orientations and positions); detecting corresponding points in the calibration images; and estimating the stereo parameters such that the calibration object is reconstructed as a planar object satisfying the constraints of the correspondences derived from the calibration images. Suitable computer programs for estimating stereo parameters will be understood by one of skill in the art from the description herein. An exemplary computer program for estimating stereo parameters is available at http://www.robotic.dir.de/callab/.
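As an alternative to the cited tool, stereo parameters can be estimated from checkerboard images with OpenCV; the following sketch assumes grayscale image lists and a 9×6 inner-corner board with a 25.4 mm square size, all illustrative choices:

```python
import cv2
import numpy as np

def calibrate_stereo(left_imgs, right_imgs, board=(9, 6), square=25.4):
    """Block 616 (sketch): estimate stereo parameters from paired
    checkerboard calibration images.  left_imgs/right_imgs: lists of
    grayscale views of the same board from the two virtual cameras.
    """
    # 3D coordinates of the board corners in the board's own plane.
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square
    obj_pts, l_pts, r_pts = [], [], []
    for L, R in zip(left_imgs, right_imgs):
        okL, cL = cv2.findChessboardCorners(L, board)
        okR, cR = cv2.findChessboardCorners(R, board)
        if okL and okR:                        # keep views seen by both
            obj_pts.append(objp); l_pts.append(cL); r_pts.append(cR)
    size = left_imgs[0].shape[::-1]
    # Intrinsics (focal lengths, centers, distortion) per view...
    _, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, l_pts, size, None, None)
    _, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, r_pts, size, None, None)
    # ...then the extrinsic R, T between views, from which the
    # baseline and vergence follow.
    _, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
        obj_pts, l_pts, r_pts, K1, d1, K2, d2, size,
        flags=cv2.CALIB_FIX_INTRINSIC)
    return K1, d1, K2, d2, R, T
```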
The estimated stereo parameters are input to the previously-described triangulation process at block 614. At block 618, the 3D structure is recovered following the triangulation step at block 614. The stereo parameters need only be estimated when the physical setup (i.e., placement of mirrors, prism, zoom of lens) of a prism camera changes.
Although the invention is illustrated and described herein with reference to specific embodiments, the invention is not intended to be limited to the details shown. Rather, various modifications may be made in the details within the scope and range of equivalents of the claims and without departing from the invention. For example, although a stereo view imaging system is depicted, it is contemplated that multi-view images comprised of more than two images may be generated and utilized.
This application claims priority to U.S. Provisional Patent Application No. 61/417,570, filed Nov. 29, 2010, the contents of which are incorporated by reference herein in their entirety.
This invention was made with government support under contract number ANT0636726 awarded by the National Science Foundation. The government may have rights in this invention.
Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/US11/62314 | 11/29/2011 | WO | 00 | 9/30/2013
Number | Date | Country
---|---|---
61417570 | Nov 2010 | US