A plethora of mobile devices capable of three dimensional imaging are available. In many cases, the mobile device may be used to obtain a pair of images using a pair of spaced apart imaging devices and, based upon the pair of images, create a three dimensional view of the scene. In some cases, the three dimensional view of the scene is shown on a two dimensional screen of the mobile device, or otherwise shown on a three dimensional screen of the mobile device.
For some applications, an augmented reality application incorporates synthetic objects in the display together with the sensed three dimensional image. For example, the augmented reality application may include a synthetic ball that appears to be supported by a table in the sensed scene. As another example, the application may include a synthetic picture frame that appears to be hanging on a wall of the sensed scene. While the inclusion of synthetic objects in a sensed scene is beneficial to the viewer, the application tends to have difficulty properly positioning and orienting the synthetic objects in the scene.
The foregoing and other objectives, features, and advantages of the invention will be more readily understood upon consideration of the following detailed description of the invention, taken in conjunction with the accompanying drawings.
Referring to
Referring to
Referring also to
The camera may be characterized by a projective transformation of the form

s·[u, v, 1]^T = K·[R | T]·[X, Y, Z, 1]^T,

where [u, v, 1]^T is a projected two dimensional point, where

K = [ fx 0 px ; 0 fy py ; 0 0 1 ]

is an intrinsic matrix of the camera characteristics, with fx and fy being the focal lengths in pixels in the x and y directions and (px, py) being the image center, where

[R | T]

is an extrinsic matrix of the relationship between the camera and the object being sensed, with R being a rotation matrix and T being a translation vector, and where

[X, Y, Z, 1]^T

is a three dimensional point in a homogeneous coordinate system. Preferably, such characterizations are determined once, or otherwise provided once, for a camera and stored for subsequent use.
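A minimal numerical sketch of this projection model is given below using NumPy. The focal lengths, image center, rotation, translation, and sample point are illustrative assumptions rather than values from the description.

```python
# Sketch of the projective transformation s*[u, v, 1]^T = K [R | T] X.
import numpy as np

# Intrinsic matrix K from assumed focal lengths (fx, fy) and image center (px, py).
fx, fy, px, py = 1200.0, 1200.0, 640.0, 360.0
K = np.array([[fx, 0.0, px],
              [0.0, fy, py],
              [0.0, 0.0, 1.0]])

# Extrinsic matrix [R | T]: rotation R and translation T relative to the sensed
# object (identity and zero here for simplicity).
R = np.eye(3)
T = np.zeros((3, 1))
Rt = np.hstack([R, T])                      # 3x4 extrinsic matrix

# Three dimensional point in homogeneous coordinates.
X = np.array([0.1, -0.2, 2.5, 1.0])

# Projection and conversion to the two dimensional image point.
uvw = K @ Rt @ X
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
print(u, v)
```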
In addition, the camera calibration may characterize the distortion of the image, which may then be reduced by a suitable correction. Referring also to
xu = xd + (xd − xc)(K1r^2 + K2r^4 + . . . )

yu = yd + (yd − yc)(K1r^2 + K2r^4 + . . . )
where xu and yu are undistorted coordinates of a point, where xd and yd are corresponding points with distortion, where xc and yc are distortion centers, where Kn is a distortion coefficient for the n-th term, and where r represents the distance from (xd, yd) to (px, py).
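The correction above may be sketched as follows. The distortion center, the coefficients K1 and K2, and the sample point are illustrative assumptions, and the sketch assumes the distortion center coincides with the image center when computing r.

```python
# Sketch of the radial distortion correction, truncated after the K2 term.
import numpy as np

def undistort_point(xd, yd, xc, yc, k1, k2):
    """Map a distorted point (xd, yd) to undistorted coordinates (xu, yu)."""
    # r^2: squared distance from the distorted point to the distortion center,
    # here assumed to coincide with the image center (px, py).
    r2 = (xd - xc) ** 2 + (yd - yc) ** 2
    scale = k1 * r2 + k2 * r2 ** 2          # K1*r^2 + K2*r^4 (higher terms omitted)
    xu = xd + (xd - xc) * scale
    yu = yd + (yd - yc) * scale
    return xu, yu

print(undistort_point(700.0, 400.0, xc=640.0, yc=360.0, k1=1e-7, k2=1e-13))
```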
The process of calibrating a camera may involve obtaining several images of one or more suitable patterns from different viewing angles and distances; the corners or other features of the pattern may then be extracted. For example, the extraction process may be performed by a feature detection process with sub-pixel accuracy. The extraction process may also estimate the three dimensional locations of the feature points by using the aforementioned projection model. The estimated locations may be optimized together with the intrinsic parameters by iterative gradient descent on Jacobian matrices so that re-projection errors are reduced. The Jacobian matrices may be the partial derivatives of the image point coordinates with respect to the intrinsic parameters and camera distortions.
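A hedged sketch of such a calibration flow is shown below using OpenCV's checkerboard detector and calibrateCamera, which internally refines the intrinsics and distortion coefficients by minimizing re-projection error (a Levenberg-Marquardt style stand-in for the iterative refinement described above). The image file names and the 9x6 pattern size are hypothetical.

```python
# Sketch: calibrate a camera from several views of a planar checkerboard.
import glob
import numpy as np
import cv2

pattern_size = (9, 6)                       # inner corners of an assumed checkerboard
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3)

# Known planar layout of the pattern corners (Z = 0), reused for every view.
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
image_size = None
for path in glob.glob("calib_*.png"):       # hypothetical image names
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    image_size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern_size, None)
    if found:
        # Refine corner locations to sub-pixel accuracy, as described above.
        corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
        obj_points.append(objp)
        img_points.append(corners)

# Jointly estimate the intrinsic matrix and distortion coefficients by
# minimizing the re-projection error over all views.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, image_size, None, None)
print("re-projection RMS error:", rms)
```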
Referring again to
A three dimensional triangulation process 530 is performed with the estimated two dimensional disparities and the relative rotation and translation estimated by the camera calibration process. The rotation matrices R1, R2, and translation vectors T1 and T2 are precomputed by the calibration process. The triangulation process estimates the three dimensional depth by least squares fitting to at least four equations from the projective transformation models and then generates the estimated three dimensional coordinate of a point. The estimated point minimizes the mean square re-projection error of the two dimensional pixel pair. In this manner, the offsets between the pixels in the different parts of the image result in three dimensional depth information of the sensed scene.
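A minimal sketch of this triangulation step is shown below. Given the precomputed rotations R1, R2 and translations T1, T2, a three dimensional point is recovered from a pair of corresponding pixels by least squares over four linear equations derived from the two projective transformation models (the classical direct linear transform, used here as a stand-in). All numeric values are illustrative assumptions.

```python
# Sketch: least squares triangulation of a pixel pair from two views.
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """3D point from two 3x4 projection matrices and a corresponding pixel pair."""
    u1, v1 = uv1
    u2, v2 = uv2
    # Two equations per view: u * P[2] - P[0] = 0 and v * P[2] - P[1] = 0.
    A = np.stack([u1 * P1[2] - P1[0],
                  v1 * P1[2] - P1[1],
                  u2 * P2[2] - P2[0],
                  v2 * P2[2] - P2[1]])
    # The homogeneous solution minimizing ||A X|| is the last right singular vector.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

K = np.array([[1200.0, 0.0, 640.0], [0.0, 1200.0, 360.0], [0.0, 0.0, 1.0]])
R1, T1 = np.eye(3), np.zeros((3, 1))
R2, T2 = np.eye(3), np.array([[-0.06], [0.0], [0.0]])   # ~6 cm assumed baseline
P1, P2 = K @ np.hstack([R1, T1]), K @ np.hstack([R2, T2])
print(triangulate(P1, P2, (700.0, 400.0), (671.2, 400.0)))   # ~2.5 m deep point
```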
Referring again to
By way of example, the first step of the bundle adjustment may be to detect feature points in each input image frame. Then the bundle adjustment may use the matched feature points, together with the calibration parameters and initial estimations of the extrinsic parameters, to iteratively refine the extrinsic parameters so that the distance between the image points and the calculated projections is reduced. The bundle adjustment may be characterized as follows:
min over {aj, bi} of Σi Σj vij·d(Q(aj, bi), xij)^2,

in which xij is the projection of a three dimensional point bi on view j, aj and bi parameterize a camera and a three dimensional point, respectively, Q(aj, bi) is the predicted projection of point bi on view j, vij is a binary visibility term that is set to 1 if the projected point is visible on view j and 0 otherwise, and d measures the Euclidean distance between an image point and the projected point.
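A hedged sketch of this objective is shown below. Cameras aj are parameterized by a rotation vector and a translation with the intrinsics K held fixed, points bi are three dimensional coordinates, and SciPy's least_squares refines both so the re-projection distances d(Q(aj, bi), xij) shrink. The intrinsics and the synthetic observations are illustrative assumptions.

```python
# Sketch: bundle adjustment of camera extrinsics and 3D points.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

K = np.array([[1200.0, 0.0, 640.0], [0.0, 1200.0, 360.0], [0.0, 0.0, 1.0]])
n_views, n_points = 2, 6

def project(cam, pts):
    """Q(a_j, b_i): project 3D points with a 6-parameter camera [rotvec, t]."""
    R = Rotation.from_rotvec(cam[:3]).as_matrix()
    p = (R @ pts.T).T + cam[3:6]
    uv = (K @ p.T).T
    return uv[:, :2] / uv[:, 2:3]

def residuals(params, observations, visibility):
    cams = params[:n_views * 6].reshape(n_views, 6)
    pts = params[n_views * 6:].reshape(n_points, 3)
    res = []
    for j in range(n_views):
        # Only visible observations (v_ij = 1) contribute to the cost.
        pred = project(cams[j], pts)
        res.append((pred - observations[j])[visibility[j]].ravel())
    return np.concatenate(res)

# Synthetic ground truth used to build example observations x_ij.
true_pts = np.random.uniform([-1, -1, 4], [1, 1, 6], (n_points, 3))
true_cams = np.array([[0, 0, 0, 0, 0, 0], [0, 0.02, 0, -0.06, 0, 0]], float)
obs = np.stack([project(c, true_pts) for c in true_cams])
vis = np.ones((n_views, n_points), dtype=bool)

# Perturbed initial estimates of the extrinsics and 3D points are refined.
x0 = np.concatenate([(true_cams + 0.01).ravel(), (true_pts + 0.05).ravel()])
fit = least_squares(residuals, x0, args=(obs, vis))
print("final cost:", fit.cost)
```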
A multi-view stereo plane sweeping process 610 may be used to locate corresponding points across different views and calculate the depth of different parts of the image. Referring also to
The cost value may be determined by using a matching window centered at the current pixel; therefore, an implicit smoothness assumption within the matching window is included. For example, two window based matching processes may be used, such as a sum of absolute differences (SAD) and a normalized cross correlation (NCC). However, due to the lack of global and local optimization, the resulting depth map may contain noise caused by occlusion and a lack of texture.
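A simplified sketch of the windowed cost computation is shown below. For a rectified stereo pair, sweeping fronto-parallel depth planes reduces to sweeping integer disparities, and a SAD cost is aggregated over a matching window at each pixel. The images, window size, and disparity range are illustrative assumptions; as noted above, the winner-take-all result still contains noise from occlusion and texture-less regions.

```python
# Sketch: SAD cost volume over disparities for a rectified stereo pair.
import numpy as np
from scipy.ndimage import uniform_filter

def sad_cost_volume(left, right, max_disp, window=7):
    """cost[d, y, x] = windowed SAD between left(x) and right(x - d)."""
    h, w = left.shape
    cost = np.zeros((max_disp, h, w), dtype=np.float32)
    for d in range(max_disp):
        shifted = np.full_like(right, np.nan)
        shifted[:, d:] = right[:, :w - d]            # shift right image by disparity d
        diff = np.abs(left - shifted)
        diff[np.isnan(diff)] = 1e3                   # penalize out-of-bounds samples
        cost[d] = uniform_filter(diff, size=window)  # aggregate SAD over the window
    return cost

def winner_take_all(cost):
    """Pick, per pixel, the disparity with the lowest aggregated cost."""
    return np.argmin(cost, axis=0)

left = np.random.rand(120, 160).astype(np.float32)
right = np.roll(left, -4, axis=1)                    # synthetic pair with disparity 4
disparity = winner_take_all(sad_cost_volume(left, right, max_disp=16))
print("median disparity:", np.median(disparity))
```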
A confidence based depth map fusion 620 may be used to refine the noisy depth map generated from the stereo plane sweeping process 610. Instead of only using the stereo images from the current frame, previously captured image pairs may be used to provide additional information to improve the current depth map. Confidence metrics may be used to evaluate the accuracy of a depth map. Noise in the current depth map may be reduced by combining confident depth estimates from several depth maps.
The confidence measurement implementation may use the cost volumes from stereo matching as input, with the output being a dense confidence map. Depth maps from different views may contradict each other, so visibility constraints may be employed to find supports and conflicts between different depth estimations. To find the supports of a three dimensional point, the system may project the depth maps from the other views into the selected reference view; other three dimensional points on the same ray that are close to the current point support the current estimation. Occlusions happen on the rays of the reference view if a three dimensional point found by the reference view is in front of another point located by other views and the distance between the two points is larger than the support region. Another kind of contradiction, a free space violation, is defined on the rays of the target views. This type of contradiction occurs when the reference view predicts a three dimensional point in front of the point perceived by the target view. A confidence based fusion technique may be used to update the confidence value of a depth estimate by finding its supports and conflicts; the depth value is also updated by taking a weighted average within the support region. A winner-take-all technique is then used to select the best depth estimate by choosing the largest confidence value, which in most cases is the closer position so that occluded objects are not selected.
The depth map fusion may be modified to improve the selection process. The differences include, first, allowing views to submit multiple depth estimates, so that correct depth values that were mistakenly left out are given a second chance. Second, instead of using a fixed number as the support region size, the system may automatically calculate a value that is preferably proportional to the square of the depth. Third, in the last step of the fusion, the process may aggregate the supports for multiple depth estimates instead of only using the one with the largest confidence.
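A hedged, per-pixel sketch of such a fusion is shown below. It assumes the candidate depth maps have already been projected into the reference view, so each pixel simply has several (depth, confidence) candidates, and it treats all non-supporting candidates as conflicts rather than performing the full occlusion and free-space checks. The support region grows with the square of the depth and supports are aggregated per candidate before a winner is selected, mirroring the modifications described above; the constant k_support and the synthetic inputs are assumptions.

```python
# Sketch: confidence based fusion of several depth estimates per pixel.
import numpy as np

def fuse_pixel(depths, confidences, k_support=0.01):
    """Fuse candidate depths (1D arrays) for a single reference-view pixel."""
    best_support, best_depth = -np.inf, 0.0
    for d0 in depths:
        radius = k_support * d0 ** 2                 # support region ~ depth squared
        within = np.abs(depths - d0) <= radius       # candidates supporting d0
        # Simplified conflict handling: every non-supporting candidate counts against d0.
        support = confidences[within].sum() - confidences[~within].sum()
        if support > best_support:
            best_support = support
            # Confidence-weighted average within the support region.
            best_depth = np.average(depths[within], weights=confidences[within])
    return best_depth

def fuse_depth_maps(depth_maps, confidence_maps):
    """depth_maps, confidence_maps: arrays of shape (n_views, H, W) in the reference view."""
    n, h, w = depth_maps.shape
    fused = np.zeros((h, w), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            fused[y, x] = fuse_pixel(depth_maps[:, y, x], confidence_maps[:, y, x])
    return fused

depths = np.random.uniform(2.0, 2.1, (4, 8, 8)).astype(np.float32)
confs = np.random.uniform(0.5, 1.0, (4, 8, 8)).astype(np.float32)
print(fuse_depth_maps(depths, confs).mean())
```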
As a general matter, the stereo matching technique may be based upon multiple image cues. For example, if only a stereo image pair is available, the triangulation techniques may compute the three dimensional structure of the image. In the event that the mobile device is in motion, the plurality of stereo image pairs obtained from different positions may be used to further refine the three dimensional structure of the image. In the case of a plurality of stereo image pairs, the depth fusion technique selects the three dimensional positions with the higher confidence to generate a higher quality three dimensional structure from the images obtained over time.
In some cases, the three dimensional image being characterized is not of sufficient quality, and the mobile device should provide the user with suggestions on how to improve the quality of the image. For example, the value of the confidence measures may be used as a measure for determining whether the mobile device should be moved to a different position in order to attempt to improve the confidence measure. For example, in some cases the imaging device may be too close to the objects or may otherwise be too far away from the objects. When the confidence measure is sufficiently low, the mobile device may provide a visual cue to the user on the display, or otherwise an audio cue to the user from the mobile device, with an indication of a suitable movement that should result in an improved confidence measure of the sensed scene.
Three dimensional objects within a scene are then determined. For example, a planar surface may be determined, a rectangular box may be determined, a curved surface may be determined, etc. The determination of the characteristics of the surface may be used to interact with a virtual object. For example, a planar vertical wall may be used to place a virtual picture frame thereon. For example, a planar horizontal surface may be used to place a bowl thereon. For example, a curved surface may be used to drive a model car across while matching the curve of the surface during its movement.
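One way such a planar surface might be located in the reconstructed point cloud is a RANSAC plane fit, sketched below; the source does not name a particular technique, and the iteration count, inlier threshold, and synthetic cloud are assumptions.

```python
# Sketch: RANSAC detection of a planar surface (e.g. a table top) in a point cloud.
import numpy as np

def fit_plane_ransac(points, n_iters=200, threshold=0.01, seed=0):
    """Return ((normal, d), inlier_count) of the plane n.x = d with the most inliers."""
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = 0, None
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                       # degenerate (collinear) sample
            continue
        normal /= norm
        d = normal @ sample[0]
        inliers = np.abs(points @ normal - d) < threshold
        if inliers.sum() > best_inliers:
            best_inliers, best_plane = int(inliers.sum()), (normal, d)
    return best_plane, best_inliers

# Synthetic cloud: a horizontal table top at height 0.8 m plus background noise.
rng = np.random.default_rng(1)
table = np.column_stack([rng.uniform(-1, 1, 500), rng.uniform(-1, 1, 500),
                         np.full(500, 0.8)])
noise = rng.uniform(-2, 2, (200, 3))
(plane_normal, plane_d), count = fit_plane_ransac(np.vstack([table, noise]))
print(plane_normal, plane_d, count)
```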
Referring to
By modeling the three dimensional characteristics of the sensed scene, the system has a depth map of the different aspects of the sensed scene. For example, the depth map will indicate that a table in the middle of a room is closer to the mobile device than the wall behind the table. By modeling the three dimensional characteristics of the virtual object and positioning the virtual object at a desired position within the three dimensional scene, the system may determine whether the virtual object occludes part of the sensed scene or whether the sensed scene occludes part of the virtual object. In this manner, the virtual object may be more realistically rendered within the scene.
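A minimal sketch of such an occlusion test is shown below: the virtual object is composited into the sensed image only where its rendered depth is closer than the scene's depth map at that pixel; elsewhere the real scene occludes it. The arrays are illustrative placeholders for the rendered object and the sensed scene.

```python
# Sketch: per-pixel z-test compositing of a rendered virtual object over the scene.
import numpy as np

def composite(scene_rgb, scene_depth, object_rgb, object_depth):
    """Keep the object only where it was rendered and lies in front of the scene."""
    visible = np.isfinite(object_depth) & (object_depth < scene_depth)
    out = scene_rgb.copy()
    out[visible] = object_rgb[visible]
    return out

h, w = 90, 160
scene_rgb = np.zeros((h, w, 3), np.float32)
scene_depth = np.full((h, w), 2.5, np.float32)          # e.g. a wall 2.5 m away
object_rgb = np.ones((h, w, 3), np.float32)
object_depth = np.full((h, w), np.inf, np.float32)
object_depth[30:60, 60:100] = 1.0                       # virtual object at 1 m
print(composite(scene_rgb, scene_depth, object_rgb, object_depth).sum())
```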
By modeling the three dimensional characteristics of the sensed scene, such as planar surfaces and curved surfaces, the system may more realistically render the virtual objects within the scene, especially with respect to movement over time. For example, the system may determine that the sensed scene has a curved concave surface. The virtual object may be a model car that is rendered in the scene on the curved surface. Over time, the rendered virtual model car object may be moved along the curved surface so that it would appear that the model car is driving along the curved surface.
With the resulting three dimensional scene determined and the position of one or more virtual objects being suitably determined within the scene, a lighting condition sensing technique 250 may be used to render the lighting on the virtual objects and the scene in a consistent manner. This provides a more realistic view of the rendered scene. In addition, the lighting sources of the scene may be estimated based upon the lighting patterns observed in the sensed images. Based upon the estimated lighting sources, the virtual objects may be suitably rendered, and the portions of the scene that would otherwise be modified, such as by shadows from the virtual objects, may be suitably modified.
The virtual object may likewise be rendered in a manner that is consistent with the stereoscopic imaging device. For example, the system may virtually generate two stereoscopic views of the virtual object(s), each being associated with a respective imaging device. Then, based upon each of the respective imaging devices, the system may render the virtual objects and display the result on the display.
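A brief sketch of generating the two stereoscopic views is shown below: the virtual object's vertices are projected once with the left camera's projection matrix and once with the right camera's, so each display view receives a geometrically consistent rendering. The intrinsics and the 6 cm baseline are assumptions.

```python
# Sketch: project virtual object vertices into the left and right camera views.
import numpy as np

K = np.array([[1200.0, 0.0, 640.0], [0.0, 1200.0, 360.0], [0.0, 0.0, 1.0]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-0.06], [0.0], [0.0]])])

# Virtual object vertices placed in the scene (homogeneous coordinates).
verts = np.array([[0.0, 0.0, 2.0, 1.0],
                  [0.1, 0.0, 2.0, 1.0],
                  [0.0, 0.1, 2.0, 1.0]])

def project_all(P, verts):
    """Project homogeneous vertices with a 3x4 projection matrix."""
    uvw = (P @ verts.T).T
    return uvw[:, :2] / uvw[:, 2:3]

print("left view:", project_all(P_left, verts))
print("right view:", project_all(P_right, verts))
```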
It is noted that the described system does not require markers or other identifying objects, generally referred to as markers, in order to render a three dimensional scene and suitably render virtual objects within the sensed scene.
Light condition sensing refers to estimating the inherent three dimensional light conditions in the images. One embodiment is to separate the reflectance of each surface point from the light sources, based on the fact that the visible color results from the product of the surface normal and the light intensity. Since the position and normal of the surface points are already estimated by the depth sensing step, the spectrum and intensity of the light sources can be solved by linear estimation based on a given reflectance model (such as the Phong shading model).
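A hedged sketch of this linear estimation is shown below, using a diffuse (Lambertian) simplification of the Phong reflectance model: with surface normals already recovered by the depth sensing step, the light direction and intensity follow from linear least squares over the observed intensities. The albedo value and the synthetic data are assumptions.

```python
# Sketch: recover a directional light from surface normals and observed intensities.
import numpy as np

def estimate_directional_light(normals, intensities, albedo=0.8):
    """Solve intensities ~= albedo * normals @ l for the scaled light vector l."""
    l, *_ = np.linalg.lstsq(albedo * normals, intensities, rcond=None)
    strength = np.linalg.norm(l)
    return l / strength, strength            # unit direction and intensity

# Synthetic check: random surface normals lit by a known directional light.
rng = np.random.default_rng(0)
normals = rng.normal(size=(500, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
true_dir = np.array([0.3, 0.5, 0.81])
true_dir /= np.linalg.norm(true_dir)
intensities = 0.8 * np.clip(normals @ (2.0 * true_dir), 0.0, None)

# Use only the lit points, where the clamped diffuse model is linear.
lit = intensities > 0
direction, intensity = estimate_directional_light(normals[lit], intensities[lit])
print(direction, intensity)
```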
Once the light conditions are estimated from the stereo images, the virtual objects are rendered at the user specified three dimensional position and orientation. The known three dimensional geometry of the objects and the light sources inferred from the images are combined to generate a realistic view of the object, based on a reflectance model (such as the Phong shading model). Furthermore, the relative orientation of the object with respect to the first camera can be adjusted to fit the second camera so that the virtual object looks correct from both stereoscopic views. The rendered virtual object can even be partially occluded by the real-world objects.
The terms and expressions which have been employed in the foregoing specification are used therein as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding equivalence of the features shown and described or portions thereof.