This application is entitled to the benefit of, and incorporates by reference essential subject matter disclosed in PCT Application No. PCT/EP2009/005495 filed on Jul. 29, 2009.
1. Technical Field
The present invention relates to a method for determining the pose of a camera with respect to at least one real object. A camera is operated for capturing a 2-dimensional image including at least a part of a real object, and in the determination process, the pose of the camera with respect to the real object is determined using correspondences between 3-dimensional points associated with the real object and corresponding 2-dimensional points of the real object in the 2-dimensional image.
2. Background Information
Augmented Reality systems permit the superposition of computer-generated virtual information with visual impressions of the real environment. To this end, the visual impressions of the real world are mixed with virtual information, e.g. by means of semi-transmissive data glasses or by means of a head-mounted display worn on the head of a user. The blending-in of virtual information or objects can be effected in a context-dependent manner, i.e. matched to and derived from the respective environment viewed. As virtual information, it is basically possible to use any type of data, such as texts, images, etc. The real environment is detected, e.g., with the aid of a camera carried on the head of the user.
When the person using an augmented reality system turns his or her head, tracking of the virtual objects with respect to the changing field of view is necessary. The real environment may be a complex apparatus, and the object detected can be a significant part of the apparatus. During a so-called tracking process, the real object detected during initialization (which may be an object to be observed such as an apparatus, an object provided with a marker to be observed, or a marker placed in the real world for tracking purposes) serves as a reference for computing the position at which the virtual information is to be displayed or blended-in in an image or picture taken up by the camera. For this purpose, it is necessary to determine the pose of the camera with respect to the real object. Since the user (and consequently the camera when it is carried by the user) may change his or her position and orientation, the real object has to be subjected to continuous tracking in order to display the virtual information at the correct position in the display device even when the pose (position and orientation) of the camera changes. The effect achieved thereby is that the information, irrespective of the position and/or orientation of the user, is displayed in the display device in a context-correct manner with respect to reality.
One of the problems in the field of augmented reality is the determination of the head position and the head orientation of the user by means of a camera that is somehow associated with the user's head. Another problem may be determining the position and orientation of the camera inside a mobile phone in order to overlay information on the camera image and show the combination of both on the phone's display. To this end, in some applications the pose of the camera with respect to at least one real object of the captured real environment is estimated using the video flow or image flow of the camera as source of information.
Pose estimation is one of the most basic and most important tasks in Computer Vision and in Augmented Reality. In many applications, it needs to be solved in real-time. However, since it involves a non-linear minimization problem, it requires considerable computation time.
It is therefore an object of the present invention to provide a method for determining the pose of a camera with respect to at least one real object which can be computed by a data processing means with rather low computational effort.
The invention is directed to a method for determining the pose of a camera with respect to at least one real object according to the features of claim 1. Further, the invention is directed to a computer program product as claimed in claim 12.
According to an aspect, the invention concerns a method for determining the pose of a camera with respect to at least one real object, the method comprising the steps of: operating the camera for capturing a 2-dimensional image including at least a part of the real object; determining correspondences between 3-dimensional points associated with the real object and corresponding 2-dimensional points of the real object in the image; determining an initial estimate of a transformation matrix which is indicative of the pose of the camera; and determining the pose of the camera by means of an iterative minimization process in which the transformation matrix is iteratively updated such that a geometric error computed from the correspondences is minimized.
In an embodiment of the invention, the iterative minimization process involves the algorithm of Gauss-Newton or the algorithm of Levenberg-Marquardt.
In a further embodiment of the invention, the iterative minimization process involves a compositional update process in which the respective updated version of the transformation matrix is computed from a multiplication between a matrix built with update parameters of the respective previous transformation matrix and the respective previous transformation matrix.
In a further embodiment, Lie Algebra parameterization is used in the iterative minimization process.
Further embodiments of the invention are set forth in the dependent claims.
Aspects of the invention will be discussed in more detail in the following by way of the figures illustrating embodiments of the invention.
In this way, an appropriate camera may capture a 3-dimensional image and a transformation matrix may be provided which includes information regarding a correspondence between 3-dimensional points associated with the real object and corresponding 3-dimensional points of the real object as included in the 3-dimensional image.
In a tracking process, the aim is to determine the pose of the camera 1 with respect to the real object 3, the pose being computed using the transformation matrix T between the camera coordinate system 11 and the world coordinate system 13 associated with the real object 3. The information thus obtained serves as reference for computing the position at which virtual information is to be displayed or blended-in in an image 4 taken up by the camera. Continuous tracking of the real object 3 is performed when the relative position and/or orientation between the user or the camera 1 and the real object 3 changes.
Given the coordinates of the set of 3D points Pi* (expressed in the world coordinate system 13; for the sake of simplicity, in the following simply referred to as 3D points Pi*) and the coordinates of the corresponding 2D points pi in the image, the task is to determine the pose of the camera 1 with respect to the real object 3.
The coordinates of the 3D points Pi* can, for example, be manually measured on the real object, taken from a 2D technical drawing of the real object, or determined from a 3D CAD model of the object. The 2D point coordinates pi are generally determined using standard image processing techniques such as feature extraction (corner points, edge points, scale-invariant points, etc.). The correspondences between the Pi* and pi are obtained using standard computer vision methods, generally based on geometric properties of the 3D points and on the spatial distribution of the points and their neighborhood. They can also be based on some knowledge of the appearance of the 3D points in the image, using a manually labeled set of images called keyframes (one then has different possible appearances of the object, and of the 3D points lying on it, in images acquired under different viewpoints). In that case, approaches using for example descriptors (e.g. David G. Lowe: "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, Vol. 60, Nr. 2, pages 91-110, 2004) or classifiers (e.g. V. Lepetit and P. Fua: "Keypoint Recognition using Randomized Trees", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, Nr. 9, pages 1465-1479, 2006) could be used to establish the correspondences. For example, correspondences between the 3D points and 2D points are established between the real object and the keyframes (2D⇄3D). In order to establish the correspondences between the real object and the image at hand (image 4), only 2D⇄2D correspondences are then needed. In case a textured 3D CAD model of the object is available, it can be used during the process of determining the correspondences, for example by synthetically generating the keyframe images instead of acquiring them with a camera.
The position of the camera is also called the translation of the camera, and it can be represented using a (3×1) vector t, as known by the skilled person. The orientation of the camera is also called the rotation of the camera, and it can be represented with a (3×3) rotation matrix R, as also known by the skilled person. The rotation matrix R belongs to the so-called Special Orthogonal group SO(3), which is known to the person skilled in the art. The SO(3) group is the group of (3×3) matrices verifying R^T R = Id and det(R) = 1, where Id is the identity matrix (a (3×3) matrix with ones on the main diagonal and zeros elsewhere) and det stands for the determinant of the matrix.
The camera pose in the world coordinate system 13 can be written using a (4×4) matrix

T = [ R        t ]
    [ 0  0  0  1 ]

where R is the (3×3) rotation matrix and t is the (3×1) translation vector.
This matrix T is called the transformation matrix and it is in the Special Euclidean group SE(3). It has 6 degrees of freedom: 3 degrees of freedom for the translation and 3 degrees of freedom for the rotation.
The above is basically known to the skilled person and is described in, e.g., Richard Hartley and Andrew Zisserman: "Multiple View Geometry in Computer Vision", Cambridge University Press, second edition, 2003 (see for example Chapter 3, pages 65-86).
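By way of illustration only, the following minimal Python/NumPy sketch assembles such a (4×4) transformation matrix from R and t and checks the SO(3) properties stated above; the function names and the example values are chosen here for illustration and are not part of the invention:

```python
import numpy as np

def is_rotation_matrix(R, tol=1e-9):
    """Check that R belongs to SO(3): R^T R = Id and det(R) = 1."""
    return (np.allclose(R.T @ R, np.eye(3), atol=tol)
            and np.isclose(np.linalg.det(R), 1.0, atol=tol))

def make_pose(R, t):
    """Assemble the (4x4) transformation matrix T in SE(3) from R and t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Example: rotation of 30 degrees about the z-axis, translation along x.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.5, 0.0, 0.0])
assert is_rotation_matrix(R)
T = make_pose(R, t)  # 6 degrees of freedom: 3 for rotation, 3 for translation
```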
The image projection can be summarized as follows:
At a certain pose T of the camera, a 3D point Pi* = [Xi*, Yi*, Zi*, 1] expressed in the world coordinate system 13 (where Xi* is the coordinate along the x-axis, Yi* is the coordinate along the y-axis and Zi* is the coordinate along the z-axis) is first projected onto the normalized plane as a "normalized" point mi = [xi, yi, 1] = [Xi/Zi, Yi/Zi, 1], where [Xi, Yi, Zi, 1] = T [Xi*, Yi*, Zi*, 1] and where xi, yi are the coordinates of the 2D point in the normalized plane. The normalized plane is the plane orthogonal to the z-axis of the camera coordinate system (corresponding to the optical axis of the camera); the distance between this plane and the optical center of the camera is 1 distance unit. It is only a mathematical concept allowing the expression of the point coordinates as a simple projection of the 3D points without the need of the camera intrinsic parameters (see below). With a Time-Of-Flight camera or a stereo-camera, it is possible to get the 3D coordinates of the transformed points in the coordinate system attached to the camera, i.e. it is possible to measure [Xi, Yi, Zi, 1] directly.
Then mi is projected into the camera image as pi = [ui, vi, 1] = K mi, where the matrix K is the (3×3) camera intrinsic parameter matrix

K = [ fx  s   u0 ]
    [ 0   fy  v0 ]
    [ 0   0   1  ]
The parameters fx and fy are the focal lengths along the x-axis and the y-axis, respectively; s is the skew (this parameter differs from 0 in case the pixel elements of the camera sensor array do not have perpendicular axes); u0 and v0 are the coordinates of the principal point in pixels in the image (the principal point is the intersection of the optical axis with the image plane). For more details, see Richard Hartley and Andrew Zisserman: "Multiple View Geometry in Computer Vision", Cambridge University Press, second edition, 2003 (see for example Chapter 6, pages 153-177).
There is also the phenomenon of camera distortion. Here, it is considered that the camera is calibrated and that the intrinsic parameters are known. The images are also considered to be undistorted.
Therefore, when having an image point pi = [ui, vi, 1] with coordinates expressed in an undistorted image, it is possible to get its corresponding "normalized" point mi as follows: mi = K⁻¹ pi. An illustration of the projection can be seen in the figures.
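By way of illustration, the projection chain just described (world point, transformed by T into the camera frame, projected onto the normalized plane, then mapped to pixels by K, and back to the normalized point via K⁻¹) can be sketched as follows in Python/NumPy; the intrinsic parameter values are example values only, not calibration data:

```python
import numpy as np

def project(T, K, P_star):
    """Project a homogeneous world point P* = [X*, Y*, Z*, 1] into the image.

    First transform into the camera frame, then project onto the normalized
    plane, then apply the intrinsic parameter matrix K.
    """
    P = T @ P_star                                 # [Xi, Yi, Zi, 1], camera frame
    m = np.array([P[0] / P[2], P[1] / P[2], 1.0])  # "normalized" point mi
    p = K @ m                                      # pixel point pi = [ui, vi, 1]
    return p, m

def normalized_point(K, p):
    """Recover the normalized point mi = K^-1 pi from an undistorted pixel point."""
    return np.linalg.solve(K, p)

# Example intrinsic parameter matrix (illustrative values).
fx, fy, s, u0, v0 = 800.0, 800.0, 0.0, 320.0, 240.0
K = np.array([[fx,  s,   u0],
              [0.0, fy,  v0],
              [0.0, 0.0, 1.0]])
```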
From now on, instead of considering corresponding points Pi* ⇄ pi, one will consider corresponding points Pi* ⇄ mi, since having pi allows one to directly get mi using the formula above.
Having a set of corresponding 3D-2D points, the pose of the camera may be solved non-linearly using, for example, the algorithm of Gauss-Newton or using the algorithm of Levenberg-Marquardt, which are basically known to the skilled person and are described, e.g., in P. Gill, W. Murray and M. Wright: “Practical Optimization”, Academic Press, pages 134-137, 1981.
Initial Estimate of the Transformation Matrix T
Both algorithms (Gauss-Newton or Levenberg-Marquardt) require an initial estimate Tl of the transformation matrix T (see the respective initial step in the processes as shown in the flow diagrams of the figures).
The following is an embodiment of how the initial estimate Tl of the transformation matrix T is obtained:
Every correspondence Pi* ⇄ mi gives us a set of equations mi × L Pi* = 0 (× is the cross product operator), where L is the (3×4) matrix projecting the 3D points onto the normalized plane. These equations are linear in the entries of L and can be written as Ci l = 0, where l is the vector containing the entries of L and Ci is the coefficient matrix built from mi and Pi*.
It is then possible, by stacking up all the Ci matrices, to write the system as a linear problem C l = 0, where the unknown is the vector l.
In order to solve this problem, one can perform the SVD decomposition of the matrix as C = Uc Sc Vc^T, where Uc and Vc are two orthogonal matrices and Sc is a diagonal matrix in which the entries of the diagonal correspond to the singular values of the matrix C. The last column of the matrix Vc corresponds to the least-squares solution of the linear problem.
In order to have a better conditioning of the matrix C, one should first normalize the 3D points such that their mean is [0 0 0 1] and their standard deviation is sqrt(3), and one should normalize the 2D points such that their mean is [0 0 1] and their standard deviation is sqrt(2). After solving the system, one should then de-normalize the obtained solution.
The above is basically known to the skilled person and is described in, e.g., Richard Hartley and Andrew Zisserman: "Multiple View Geometry in Computer Vision", Cambridge University Press, second edition, 2003 (see for example Chapter 7, pages 178-194).
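A possible sketch of this normalization step (Python/NumPy, for illustration; the standard deviation is taken here as the RMS distance of the points from their mean, which is one common reading of the normalization described above):

```python
import numpy as np

def normalize_points(X, target_std):
    """Translate points to zero mean and scale them so that their standard
    deviation (RMS distance from the mean) equals target_std: sqrt(3) for
    the 3D points, sqrt(2) for the 2D points.

    X: (N, k) array of inhomogeneous point coordinates.
    Returns the normalized points and the homogeneous similarity transform S,
    which is needed to de-normalize the obtained solution afterwards.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = np.sqrt(((X - mean) ** 2).sum(axis=1).mean())
    scale = target_std / std
    k = X.shape[1]
    S = np.eye(k + 1)
    S[:k, :k] *= scale
    S[:k, k] = -scale * mean
    return scale * (X - mean), S
```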
When having l, it is possible to build the matrix L. However, the matrix L is not guaranteed to correspond to a transformation in SE(3), since its upper-left (3×3) submatrix A is not guaranteed to be a rotation matrix.
One can approximate the matrix A by a rotation matrix Ra as follows: performing the SVD decomposition A = Ua Sa Va^T and replacing the singular values by ones gives Ra = Ua Va^T.
The matrix Ra is a rotation matrix as it verifies Ra Ra^T = Id and det(Ra) = 1. The matrix Tl, built from Ra and from the translation vector t contained in the last column of L, is then a possible solution to the pose.
This pose is thus obtained by first solving a linear system and then forcing the upper-left (3×3) matrix to be a rotation matrix.
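Putting the steps of this linear initialization together, a minimal Python/NumPy sketch could look as follows. This is illustrative only: the point normalization sketched above (and the corresponding de-normalization) is omitted for brevity, and the scale and sign handling of the homogeneous solution is simplified:

```python
import numpy as np

def linear_pose_estimate(Ps, ms):
    """Initial pose estimate Tl from 3D-2D correspondences (DLT-style sketch).

    Ps: (N, 4) homogeneous 3D points Pi* in the world coordinate system.
    ms: (N, 3) corresponding homogeneous "normalized" points mi.
    """
    rows = []
    for P, m in zip(Ps, ms):
        x, y = m[0], m[1]
        # mi x (L Pi*) = 0 yields two independent linear equations per point
        # in the 12 entries of L (stacked row-wise into the vector l).
        rows.append(np.concatenate([np.zeros(4), -P, y * P]))
        rows.append(np.concatenate([P, np.zeros(4), -x * P]))
    C = np.asarray(rows)

    # Least-squares solution of C l = 0: last column of Vc.
    _, _, VcT = np.linalg.svd(C)
    L = VcT[-1].reshape(3, 4)

    A, b = L[:, :3], L[:, 3]
    scale = np.cbrt(np.linalg.det(A))   # fix the arbitrary scale (and sign) of l
    A, b = A / scale, b / scale

    # Approximate A by the closest rotation matrix Ra via SVD.
    Ua, _, VaT = np.linalg.svd(A)
    Ra = Ua @ VaT
    if np.linalg.det(Ra) < 0:           # enforce det(Ra) = +1
        Ra = Ua @ np.diag([1.0, 1.0, -1.0]) @ VaT

    Tl = np.eye(4)
    Tl[:3, :3] = Ra
    Tl[:3, 3] = b
    return Tl
```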
Non-Linear Iterative Minimization Process
Now, one has an initial estimate of the transformation matrix T and one is ready for a non-linear estimation. Both the Gauss-Newton and the Levenberg-Marquardt algorithms can be used; see the flow diagrams in the respective figures.
The standard Gauss-Newton implementation and, similarly, a standard Levenberg-Marquardt implementation can each be seen in the respective flow diagram of the figures.
Both algorithms iteratively update (refine) the pose estimation: T←update (T,d), where d is a (6×1) vector containing the 6 update parameters of the translation and the rotation and the update function depends on the parameterization chosen for d.
The Parameterization
Generally, the translation is parameterized using the natural R3 parameterization; the first 3 entries of d represent the translation. For the rotation, however, one can choose a parameterization based on the Euler angles, on the axis-angle representation, or on the quaternion representation. Respectively, the last 3 entries of d are either the Euler angles or the axis-angle representation of the rotation. The vector d = [d1, d2, d3, d4, d5, d6] then contains 3 entries for the translation [d1, d2, d3] and 3 entries for the rotation [d4, d5, d6].
When the rotation is parameterized using quaternions, the vector d has 7 entries and the last 4 parameters represent the quaternion representation of the rotation. For example, in case the Euler angles are used for parameterizing the rotation, the matrix T has the following form:

T = [ R(d4, d5, d6)   [d1, d2, d3]^T ]
    [ 0   0   0       1              ]

where R(d4, d5, d6) is the rotation matrix composed of the three elementary rotations about the x-, y- and z-axes by the Euler angles d4, d5 and d6.
One can also use the Lie Algebra parameterization. This parameterization is often used in the robotics community. The principles of this parameterization, as described below, are known to the skilled person and are described in the literature as follows:
In "Tracking people with twists and exponential maps", Bregler and Malik, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Santa Barbara, Calif., pages 8-15, 1998, the above-described parameterization was introduced for tracking people.
In "Real-Time Visual Tracking of Complex Structures", Tom Drummond and Roberto Cipolla, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pages 932-946, July 2002, it was used for parameterizing increments in the context of edge-based tracking.
Recently, in "A Lie Algebraic Approach for Consistent Pose Registration for General Euclidean Motion", Motilal Agrawal, International Conference on Computer Vision, pages 1891-1897, 2005, it was used to register local relative pose estimates and to produce a globally consistent trajectory of a robot.
When using this parameterization, the matrix T has the following form:

T = [ R(d4, d5, d6)   t(d1, d2, d3, d4, d5, d6) ]
    [ 0   0   0       1                         ]
where the rotation matrix is written using the Rodrigues formula:
R(d4, d5, d6) = Id + sin(θ) [u]x + (1 − cos(θ)) [u]x²
where θ is the angle of the rotation (θ = ∥[d4, d5, d6]∥) and [u]x is the skew-symmetric matrix of the rotation axis u = [ux, uy, uz] = [d4, d5, d6]/θ.
The translation vector depends on all the parameters of the update vector:

t(d1, d2, d3, d4, d5, d6) = (Id + ((1 − cos(θ))/θ) [u]x + (1 − sin(θ)/θ) [u]x²) [d1, d2, d3]^T
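For illustration, a minimal Python/NumPy sketch of the incremental matrix ΔT(d) built from the two formulas above (the small-angle branch is added here only for numerical robustness and is an implementation detail, not part of the formulas):

```python
import numpy as np

def skew(u):
    """Skew-symmetric matrix [u]x such that [u]x v = u x v."""
    return np.array([[0.0,  -u[2],  u[1]],
                     [u[2],  0.0,  -u[0]],
                     [-u[1], u[0],  0.0]])

def delta_T(d):
    """Incremental matrix dT(d) for d = [d1, ..., d6] via the Lie Algebra se(3).

    Rotation via the Rodrigues formula, translation via the formula above.
    """
    rho, w = np.asarray(d[:3], float), np.asarray(d[3:], float)
    theta = np.linalg.norm(w)
    dT = np.eye(4)
    if theta < 1e-12:                    # small-angle limit: R -> Id, t -> rho
        dT[:3, 3] = rho
        return dT
    ux = skew(w / theta)
    R = np.eye(3) + np.sin(theta) * ux + (1.0 - np.cos(theta)) * (ux @ ux)
    V = (np.eye(3) + ((1.0 - np.cos(theta)) / theta) * ux
         + (1.0 - np.sin(theta) / theta) * (ux @ ux))
    dT[:3, :3] = R
    dT[:3, 3] = V @ rho
    return dT

# The pose can then be updated, e.g. compositionally (see below): T = delta_T(d) @ T
```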
This parameterization has many advantages compared to the standard one (Euler angles for the rotation and R3 for the translation). One of the advantages that concerns this invention is that the Jacobian matrix (the derivative of the update matrix with respect to the update parameters d) has a very simple and, in terms of computational costs, cheap expression.
The Update of the Transformation Matrix T
For every parameterization, one can choose either an additive update or a compositional update. The additive update builds an updated matrix T by adding the old parameters to the update parameters. The compositional update computes the updated matrix T as the multiplication between a matrix ΔT(d) built with the update parameters and the old matrix T. The compositional update can then be written as:
T ← ΔT(d) T
The different update methods are basically known to the skilled person and are described, for example, in "Lucas-Kanade 20 Years On: A Unifying Framework", Simon Baker and Iain Matthews, International Journal of Computer Vision, vol. 56, no. 3, pages 221-255, 2004, where the different possible updates are explained in the context of markerless tracking.
The Iterative Minimization Process
In every iteration, one transforms the reference 3D points Pi* with the estimated transformation matrix T (which in the first iteration is set to Tl, as described above). One gets the equation Pi=T Pi*=[Xi,Yi,Zi,1].
Then, the obtained 3D points Pi are projected into mei = [Xi/Zi, Yi/Zi]. It is important to see that mei depends on the estimate of T.
For every correspondence, one computes [eix, eiy] = [xi, yi] − mei, where [xi, yi] are the first two entries of the "normalized" point mi explained above.
Stacking all the correspondences together results in the geometric error vector, i.e. the vector e = [e1x, e1y, e2x, e2y, . . . , eNx, eNy], where N is the total number of points.
The non-linear estimation iteratively minimizes the squared norm of this error vector.
For that, one also needs to compute the Jacobian matrix, which is the first derivative of mei with respect to the rotation and translation increments. The dimension of the Jacobian matrix is then (2N×6). The rows 2*i and (2*i+1) of the Jacobian matrix can be written as ∂(mei)/∂d.
The Jacobian matrix can be computed numerically using the finite differences method. Alternatively, it can be computed analytically. For both approaches, the parameterization of the incremental matrix plays an important role in the complexity of the Jacobian computation. For example, in the case of a rotation update parameterized using the Euler angles, several trigonometric functions (e.g. cos and sin) need to be computed and recomputed at every iteration. The way of updating the estimate (additive or compositional) also plays an important role in the complexity of the algorithm.
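As an illustration of the finite differences approach, the following Python/NumPy sketch computes the (2N×6) Jacobian of the stacked mei with respect to the update parameters d around d = 0; it assumes an update function such as the compositional update built from the delta_T helper sketched above:

```python
import numpy as np

def project_points(T, Ps):
    """Transform Pi = T Pi* and project onto the normalized plane:
    mei = [Xi/Zi, Yi/Zi]. Ps: (N, 4) homogeneous 3D points Pi*."""
    P = (T @ Ps.T).T
    return P[:, :2] / P[:, 2:3]

def numeric_jacobian(T, Ps, update, eps=1e-6):
    """(2N x 6) Jacobian of the stacked mei with respect to d, by forward
    finite differences around d = 0.

    `update` is the chosen update function, e.g. the compositional update
    lambda T, d: delta_T(d) @ T, with delta_T as sketched above.
    """
    me0 = project_points(T, Ps).ravel()
    J = np.zeros((me0.size, 6))
    for k in range(6):
        d = np.zeros(6)
        d[k] = eps
        J[:, k] = (project_points(update(T, d), Ps).ravel() - me0) / eps
    return J
```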
According to embodiments of the invention, the following considerations are made:
According to aspects of the invention, a proper parameterization of the transformation matrix T may be chosen, wherein multiple kinds of parameterizations are possible. According to an embodiment, one decides to parameterize the matrix ΔT using the Lie Algebra se(3) associated with the Lie group SE(3) (see above).
The Jacobian computation complexity is then very much reduced: one gets a simple, analytic expression of the Jacobian matrix. In addition, when using the compositional update approach, parts of the obtained Jacobian matrix can be pre-computed, which reduces the run-time processing. The effect of such a parameterization on the simplification of a Jacobian computation, in the context of planar markerless tracking, can be seen in "Integration of Euclidean constraints in template based visual tracking of piecewise-planar scenes", Selim Benhimane and Ezio Malis, Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 9-15, 2006, Beijing, China, pages 1218-1223. The cost function in that paper was based on the photometric error between a reference template and the current template; it was not based on the geometric error considered in this invention.
Even so, the non-linear estimation remains very time-consuming, and running it on hardware with very limited computational power, such as on mobile devices, makes it impossible to perform the pose estimation in real-time, especially with a very high number of points. According to the invention, keeping the Jacobian matrix fixed and not updating it during the iterative minimization process (whether using the Gauss-Newton method or the Levenberg-Marquardt method) still provides very good results. The Jacobian matrix computed right after the linear solution is a good approximation that allows reaching the minimum of the cost function with much lower computational power.
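To illustrate this, the following Gauss-Newton sketch (Python/NumPy, reusing the project_points, numeric_jacobian and delta_T helpers sketched above) computes the Jacobian once, right after the linear solution, and keeps it fixed during the iterations; it is a sketch of the idea, not a definitive implementation of the claimed method:

```python
import numpy as np

def gauss_newton_fixed_jacobian(Tl, Ps, ms, n_iter=10):
    """Iteratively minimize the squared norm of the geometric error vector.

    Tl: initial pose estimate from the linear solution.
    Ps: (N, 4) homogeneous 3D points Pi*.
    ms: (N, 2) normalized points [xi, yi] from the image.
    """
    update = lambda T, d: delta_T(d) @ T          # compositional update
    T = Tl.copy()
    # Jacobian computed only once and kept fixed during the minimization.
    J = numeric_jacobian(T, Ps, update)
    step = np.linalg.solve(J.T @ J, J.T)          # precomputed pseudo-inverse
    for _ in range(n_iter):
        e = (ms - project_points(T, Ps)).ravel()  # geometric error vector
        d = step @ e                              # Gauss-Newton update parameters
        T = update(T, d)
        if np.linalg.norm(d) < 1e-10:             # converged
            break
    return T
```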
According to an embodiment of the invention, the different aspects as described above may also be combined in a method for determining the pose of a camera. For example, using the Lie Algebra parameterization during a non-linear estimation based on a compositional update reduces the computational cost of the pose estimation problem. Using a fixed Jacobian matrix, as explained above, reduces the computational cost even further.
The method may be implemented in a computer program adapted to be loaded into the internal memory of a computer. Such a computer may further include, e.g., a standard processor 2 as shown in the respective figure.
While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment(s) disclosed herein as the best mode contemplated for carrying out this invention.
Other Publications

Gabriele Bleser et al.: "Online camera pose estimation in partially known and dynamic scenes", International Symposium on Mixed and Augmented Reality, 2006, pages 56-65.

Selim Benhimane and Ezio Malis: "Integration of Euclidean constraints in template based visual tracking of piecewise-planar scenes", Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006, pages 1218-1223.

Jorge Batista et al.: "Iterative Multistep Explicit Camera Calibration", IEEE Transactions on Robotics and Automation, 1999, New York.

David G. Lowe: "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, Vol. 60, Nr. 2, pages 91-110, 2004.

V. Lepetit and P. Fua: "Keypoint Recognition using Randomized Trees", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, Nr. 9, pages 1465-1479, 2006.

Richard Hartley and Andrew Zisserman: "Multiple View Geometry in Computer Vision", Cambridge University Press, second edition, 2003 (see for example Chapter 3, pages 65-86; Chapter 6, pages 153-177; Chapter 7, pages 178-194).

P. Gill, W. Murray and M. Wright: "Practical Optimization", Academic Press, pages 134-137, 1981.

Bregler and Malik: "Tracking people with twists and exponential maps", IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Santa Barbara, pages 8-15, 1998.

Tom Drummond and Roberto Cipolla: "Real-Time Visual Tracking of Complex Structures", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, pages 932-946, 2002.

Motilal Agrawal: "A Lie Algebraic Approach for Consistent Pose Registration for General Euclidean Motion", International Conference on Computer Vision, pages 1891-1897, 2005.

Simon Baker and Iain Matthews: "Lucas-Kanade 20 Years On: A Unifying Framework", International Journal of Computer Vision, Vol. 56, No. 3, pages 221-255, 2004.