The present invention generally relates to the field of estimating camera orientation relative to a ground surface. More specifically, the present invention relates to techniques of estimating camera orientation automatically by analyzing a scene structure through leveraging properties of orthogonal vanishing points.
Machine vision has gained much attention in commercial and industrial use, such as imaging-based analysis for production and logistics automation. In many machine vision-based applications, camera orientation plays an important role; i.e. it is needed in order to obtain real metric units in three-dimensional (3D) space from measurements on 2D images or video frames. For example, in vehicle guidance, lane departure detection, which detects when the vehicle moves away from lane markers on the ground, requires knowledge of the camera orientation with respect to the ground plane. Camera orientation, in particular its pitch and roll angles, can be made known by a manual calibration procedure after the camera is mounted on the vehicle. However, for a fleet of identical vehicles, such as a fleet of automatic guided vehicles (AGVs) in a factory, such repetitive manual calibration on every AGV is troublesome and error-prone. Moreover, camera orientation often drifts after an extended period of use due to hard braking, sudden accelerations, inadvertent camera movements, etc.
It is possible to estimate camera orientation from a single image. For example, where the ground at infinity is clearly visible, its vanishing line indicates the camera's orientation relative to the ground. However, in many practical circumstances where there is no vertical structure in the captured image, it is impossible to obtain vertical vanishing points to estimate the ground plane. Accordingly, there is a need in the art for a new approach for estimating camera orientation that addresses the shortcomings of estimation approaches that depend on vertical vanishing points.
The present invention provides a method and an apparatus for estimating camera orientation relative to a ground surface. In accordance with various embodiments of the present invention, the method includes the following steps. A first image of a scene before a front-facing camera is captured and recorded. A plurality of line segments are detected from the first image. A first virtual cube having three orthogonal vanishing points in a random or best-guess 3D orientation is superimposed onto the first image. An orthogonal direction classifier classifies the line segments of the first image and groups them into first, second, and third 3D-directional groups by comparing the perpendicular distances between each of the three orthogonal vanishing points of the first virtual cube and each of the detected line segments, and assigning each line segment to the group corresponding to the shortest of the three perpendicular distances. A second virtual cube having three orthogonal vanishing points is superimposed onto the first image, wherein the second virtual cube starts in the initial 3D orientation of the first virtual cube, represented by an initial rotation matrix R0. An optimal orientation of the second virtual cube with respect to the grouped line segments is computed iteratively by changing the 3D orientation of the second virtual cube and computing the perpendicular distances of the three orthogonal vanishing points to the three line segment groups in each iteration, starting with the initial rotation matrix R0; wherein the optimal orientation of the second virtual cube is the one that provides the shortest perpendicular distances. Co-variances of the three orthogonal vanishing points of the second virtual cube at the optimal orientation are computed by the computer processor. Ground orientation is computed from one of the three orthogonal vanishing points of the second virtual cube at the optimal orientation.
The process repeats in subsequent N images, each with a different random or best-guess 3D orientation of the first virtual cube. A most accurate ground plane is determined by selecting the ground orientation having the least estimation error in response to the co-variances.
In one embodiment, the optimal orientation of the second virtual cube in the first image is used to compute a ground plane on a second image following the first image; and the resulting rotation matrix R* representing the optimal orientation of the second virtual cube is used to compute a ground normal vector n of the ground orientation as $n = R^{*}[0,0,1]^{\tau}$.
In accordance with an application of the present invention, a method for guiding a self-driven vehicle having a front-facing camera includes executing the method for estimating camera orientation of the front-facing camera in accordance with the various embodiments of the present invention. Motions of the self-driven vehicle are determined based on the estimated camera orientation.
In accordance with another application of the present invention, a remote processing server for estimating camera orientation of a front-facing camera of a machine-vision enabled autonomous guided vehicle (AGV) is provided. The remote processing server is in data communication with the AGV and configured to receive a video feed captured by the front-facing camera, so as to execute a method for estimating the front-facing camera's orientation in accordance with the various embodiments of the present invention.
The advantages of the present invention include: (1) that in the estimation of the ground plane, any one of the X, Y, and Z 3D plane line segment groups detected and classified can be empty; (2) that the core computation in accordance with the various embodiments of the present invention is properly established on a least square optimization approach, enabling the error uncertainty on camera orientation to be computed; (3) that, although it is in general difficult to solve a quadratic least square minimization problem with six quadratic equality constraints, the present invention provides a solution that circumvents the quadratic equality constraints by rotation; and (4) that camera orientation estimation in machine-vision applications is automated, thereby avoiding repetitive and periodic manual calibration of camera orientation.
Embodiments of the invention are described in more details hereinafter with reference to the drawings, in which:
In the following description, methods and apparatuses for estimating camera orientation relative to a ground plane by leveraging properties of orthogonal vanishing points, and the like, are set forth as preferred examples. It will be apparent to those skilled in the art that modifications, including additions and/or substitutions, may be made without departing from the scope and spirit of the invention. Specific details may be omitted so as not to obscure the invention; however, the disclosure is written to enable one skilled in the art to practice the teachings herein without undue experimentation.
In the present disclosure, 2D and 3D spatial geometry, such as points and lines as perceived by machine vision are represented in projective space coordinates. Definitions for mathematical notations in the present disclosure are listed as follows:
A point p in a two-dimensional projective space $\mathbb{P}^2$ is represented as a three-vector $\vec{p} = (u, v, k)$, and its coordinate in a two-dimensional Euclidean space $\mathbb{R}^2$ is $(u/k,\ v/k)$;
A line l in $\mathbb{P}^2$ is represented as a three-vector $\vec{l} = (a, b, c)$, and its slope and y-intercept in $\mathbb{R}^2$ are respectively $-a/b$ and $-c/b$;
A point p is on a line l in $\mathbb{P}^2$ if and only if $p^{\tau} l = 0$, because $au + bv + ck = 0$ is a line equation;
$a^{\tau}$ represents the transpose of a, and $a^{\tau} b$ represents the dot product between two vectors a and b;
A projective transformation H in $\mathbb{P}^2$ is a 3×3 matrix. It transforms a point in $\mathbb{P}^2$ from p to $p' = Hp$;
If H in $\mathbb{P}^2$ transforms a point from p to $p' = Hp$, it transforms a line from l to $l' = H^{-\tau} l$;
$A^{-\tau}$ represents the transpose of matrix $A^{-1}$, and $A^{-1}$ represents the inverse of matrix A;
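The notation above can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names (`line_through`, `to_euclidean`, `transform_point`, `transform_line`) and the sample matrix `H` are illustrative and do not come from the disclosure.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two homogeneous points (their cross product)."""
    return np.cross(p, q)

def to_euclidean(p):
    """Map a homogeneous point (u, v, k) to its Euclidean coordinate (u/k, v/k)."""
    u, v, k = p
    return np.array([u / k, v / k])

def transform_point(H, p):
    """Points transform as p' = H p."""
    return H @ p

def transform_line(H, l):
    """Lines transform as l' = H^{-T} l, which preserves incidence."""
    return np.linalg.inv(H).T @ l

# Two sample points on the line y = 2x - 1.
p = np.array([2.0, 3.0, 1.0])
q = np.array([4.0, 7.0, 1.0])
l = line_through(p, q)            # (a, b, c)
slope = -l[0] / l[1]              # -a/b
intercept = -l[2] / l[1]          # -c/b

# An arbitrary invertible projective transformation (illustrative values).
H = np.array([[1.0, 0.2, 5.0],
              [-0.1, 1.0, 3.0],
              [0.001, 0.002, 1.0]])
```

After transforming both the point and the line with the same H, the incidence relation still holds, which is why lines use the inverse transpose.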
A point in three-dimensional space $\mathbb{R}^3$ is P=(X, Y, Z). Under a pinhole camera model, an image captured by a pinhole camera is modeled as a point p=KP in the two-dimensional $\mathbb{P}^2$, where K is a projective transformation in $\mathbb{P}^2$.
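K is also known as the camera calibrated (or intrinsic) matrix, and it encodes the camera's focal length f and principal point $(p_x, p_y)$ as
$K = \begin{bmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix}$,
such that a point $P = (X, Y, Z)$ in $\mathbb{R}^3$ is imaged as the point
$p = KP = (fX + p_x Z,\ fY + p_y Z,\ Z)$ in $\mathbb{P}^2$, whose Euclidean coordinate is $(fX/Z + p_x,\ fY/Z + p_y)$;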
A camera calibrated matrix K can be found by some manual calibration procedure.
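As an illustration of the pinhole model, the sketch below projects a 3D point with a calibrated matrix. It assumes the common zero-skew, square-pixel form of K built from f and $(p_x, p_y)$; the function names and numeric values are hypothetical.

```python
import numpy as np

def intrinsic_matrix(f, px, py):
    """Assumed form of K: focal length f, principal point (px, py), no skew."""
    return np.array([[f, 0.0, px],
                     [0.0, f, py],
                     [0.0, 0.0, 1.0]])

def project(K, P):
    """Image a 3D point P = (X, Y, Z) as p = K P, then dehomogenize."""
    p = K @ np.asarray(P, dtype=float)
    return p[:2] / p[2]

K = intrinsic_matrix(f=800.0, px=320.0, py=240.0)
pixel = project(K, [1.0, 2.0, 4.0])   # (f*X/Z + px, f*Y/Z + py)
```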
Referring to
In practical cases, during operation of AGVs 100, certain conditions encountered may result in computational problems that cause the AGVs 100 unable to function. For example, as shown in
Further, as shown in
Referring to the flowchart depicted in
In the step S10, a video file/data stream is produced by the AGV 100's front-facing camera 120 in capturing a real-world scene before it and transmitted to the remote processing server 150 via the wireless communication. The video file/data stream contains a plurality of video frames of continuous images.
In the step S20, one video frame/image is extracted from the video file/data stream by the remote processing server 150. The video frame/image is static and reflects the real-world scene (i.e. the left image in
In the step S30, detection of line segments in the video frame/image is performed by the remote processing server 150, such that line segments are generated on the video frame/image (i.e. the right image in
In step S40, the line segments detected in the step S30 are classified and grouped into three orthogonal directions, for example, the X, Y and Z directions.
In the present disclosure, the X, Y, and Z directions are defined as being orthogonal in 3D space and satisfying $X \cdot Y = 0$, $Y \cdot Z = 0$, and $Z \cdot X = 0$. Further, the points at infinity in 3D space along the X, Y, and Z directions are captured by a camera with K onto a 2D object image at image locations known as vanishing points (VPs), denoted by $v_x$, $v_y$, $v_z$ hereafter. They are also “orthogonal with respect to ω” on an object image such that $v_z^{\tau}\omega v_y = 0$, $v_y^{\tau}\omega v_x = 0$, and $v_x^{\tau}\omega v_z = 0$, where $\omega = K^{-\tau}K^{-1}$ is known as the “image of the absolute conic”, and K is the camera calibrated matrix defined above.
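The orthogonality of the three VPs with respect to ω can be verified numerically. The sketch below is only an illustration: the intrinsic values and the randomly generated orthonormal basis `R` are assumptions, and the VPs are formed as the images of the columns of `R` under K.

```python
import numpy as np

# Illustrative intrinsics (assumed values).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Any orthonormal basis of 3D directions; here taken from a QR factorization.
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))

# VPs are the images of the points at infinity along the three directions.
v_x, v_y, v_z = (K @ R[:, d] for d in range(3))

# Image of the absolute conic: omega = K^{-T} K^{-1}.
K_inv = np.linalg.inv(K)
omega = K_inv.T @ K_inv

pairwise = [v_z @ omega @ v_y, v_y @ omega @ v_x, v_x @ omega @ v_z]
```

Each entry of `pairwise` vanishes up to rounding because $v_i^{\tau}\omega v_j$ reduces to the dot product of two orthogonal 3D directions.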
At the beginning of the line segment classification, it is assumed a virtual cube with an initial orientation in 3D space as shown in
Continuing with step S40, the virtual cube is superimposed onto the video frame/image with line segments detected as shown in
As illustrated in
In accordance with one embodiment, the line segment classification is computed using the following algorithm. Hypothesizing that a line l converges to one of the VPs $v_x$, $v_y$, and $v_z$, computing the Sampson error of the three hypotheses is equivalent to determining $\delta_X$, $\delta_Y$, and $\delta_Z$. Here, the computation of the Sampson error is based on defining the distance $\delta_i$, where i is x, y, or z, in terms of the rotation matrix R as:
$\delta_i = \dfrac{\epsilon_i^2}{J_i \Sigma_g J_i^{\tau}}$,
where
li=(pi, qi, 1)×(ui, vi, 1) which is a line segment having two end points (pi, qi) and (ui, vi);
K=3×3 camera calibrated matrix;
R=3×3 rotation matrix;
where the denominator $J_i \Sigma_g J_i^{\tau}$ can be understood as the pixel error of $l_i$, meaning the pixel noise level at both ends of the object line segment.
Then, with $\epsilon_i = l_i^{\tau} K R P_i$ serving as the scalar residual error, $\Sigma_g$ being the 4×4 co-variance matrix of (p, q, u, v), and $J_i$ being the Jacobian of $\epsilon_i$ with respect to g=(p, q, u, v), the Sampson error is computed and $\delta_X$, $\delta_Y$, and $\delta_Z$ for each line segment are obtained. Using the Sampson error computation in the aforementioned first illustration as shown in
Step S40 can also be described as a qualitative computation with input parameters including an initial rotation matrix R, line segments li, camera calibrated matrix K, pixel noise co-variance Σg, an acceptance level α, and a rejection level β, where α=0.05 and β=0.95≥α for example. The objective is to classify line segments li into X, Y, or Z group.
The step S40 qualitative computation comprises the following steps:
Step I: for each li, intermediate expressions as follows are computed:
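Step II: costs are sorted such that $\delta_i^{D_1} \le \delta_i^{D_2} \le \delta_i^{D_3}$, where $D_1$, $D_2$, $D_3$ is a permutation of the X, Y, and Z directions;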
Step III: li is classified into D1 directional group if:
δiD
where $F_n$ is the cumulative chi-squared ($\chi^2$) distribution function with n degrees of freedom. The determination condition in step III serves as a hysteresis window that rejects line segments likely to be determined as pointing in multiple directions.
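The classification in step S40 can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the pixel noise co-variance is taken as isotropic ($\Sigma_g = \sigma^2 I$), the direction $P_i$ is taken as the basis vector of the hypothesized axis, and the exact acceptance test of Step III is replaced by two hypothetical cost thresholds, `accept_max` and `reject_min` (both defaulting to 3.84, roughly the 95% point of a one-degree-of-freedom chi-squared distribution). The function names are also hypothetical.

```python
import numpy as np

def sampson_cost(seg, vp, sigma=1.0):
    """Cost eps^2 / (J Sigma_g J^T) for segment (p, q, u, v) and vanishing point vp."""
    p, q, u, v = seg
    a = np.array([p, q, 1.0])
    b = np.array([u, v, 1.0])
    l = np.cross(a, b)                      # line through both end points
    eps = l @ vp                            # scalar residual l^T K R P
    # Jacobian of eps w.r.t. (p, q, u, v), using eps = a.(b x vp) = b.(vp x a).
    J = np.concatenate([np.cross(b, vp)[:2], np.cross(vp, a)[:2]])
    denom = sigma ** 2 * (J @ J)            # assumes Sigma_g = sigma^2 * I
    return eps ** 2 / denom if denom > 0 else np.inf

def classify_segments(segments, R, K, sigma=1.0, accept_max=3.84, reject_min=3.84):
    """Return 0, 1, 2 for the X, Y, Z group, or -1 for an ambiguous segment."""
    vps = [K @ R[:, d] for d in range(3)]   # VPs of the virtual cube
    labels = []
    for seg in segments:
        costs = np.array([sampson_cost(seg, vp, sigma) for vp in vps])
        order = np.argsort(costs)
        best, second = order[0], order[1]
        # Hypothetical hysteresis rule: the best hypothesis must fit well and
        # the runner-up must fit badly, otherwise the segment is rejected.
        if costs[best] <= accept_max and costs[second] >= reject_min:
            labels.append(int(best))
        else:
            labels.append(-1)
    return labels

K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
segments = [(100.0, 100.0, 210.0, 170.0),   # points toward the principal point
            (10.0, 50.0, 200.0, 50.0),      # horizontal
            (400.0, 20.0, 400.0, 300.0)]    # vertical
labels = classify_segments(segments, np.eye(3), K)
```

With the identity orientation, the Z vanishing point coincides with the principal point, so the first segment falls in the Z group while the horizontal and vertical segments fall in the X and Y groups.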
Furthermore, the initial random orientation of the virtual cube may not be within a range of approximately correct orientations, and if the initial orientation is entirely incorrect, the classification may fail entirely. As such, multiple trial-and-error runs with multiple random initial orientations and rotation matrices R are needed. In one embodiment, the initial orientation and rotation matrix R of the trial-and-error run that yields the lowest number of outliers are selected, and the line segment classification and grouping generated in that run are taken as correct.
After the classification and grouping, the video frame/image having the properly classified and grouped line segments in X, Y, Z is obtained, as shown in the right image of
Referring to
In sub-step P20, distances δi between VPs vx, vy ,vz and every line segment in their respective groups are measured. For example, referring to
In the present step, a linearized rotation matrix technique is further applied to approximate an infinitesimal rotational perturbation at the rotation matrix R as $R' = (I - [\phi]_{\times})R$, where I is the 3×3 identity matrix and ϕ is a three-vector Euler-angle. Such a technique achieves low-complexity linear matrix computation. Note that for any three-vector $a = (a_1, a_2, a_3)$, it has:
$[a]_{\times} = \begin{bmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{bmatrix}$
Then, since the scalar residual error $\epsilon_i = l_i^{\tau} K R P_i$ depends on the rotation matrix R, R′ can be substituted into $\epsilon_i$ such that the total Sampson error is expressed in terms of ϕ, which yields the following expression.
In sub-step P20, the total Sampson error is computed as equivalent to computing the distance δi between vx, vy, vz and every line segment in their respective group. A least square estimation (LSE) for the three orthogonal VPs vx, vy, vz is expressed as
and the output for the expression may include the optimal orthogonal VPs $v_x^*$, $v_y^*$, $v_z^*$, given jointly as:
$v_z^* = K R^* [0,0,1]^{\tau}$;
$v_y^* = K R^* [0,1,0]^{\tau}$; and
$v_x^* = K R^* [1,0,0]^{\tau}$.
In sub-step P30, an optimal three-vector Euler-angle ϕ*, which is also referred to as the best rotation, is computed from the total Sampson error. Since the total Sampson error J(ϕ) is a function of ϕ, the minimum of J(ϕ) occurs at ∂J(ϕ)/∂ϕ=0 (i.e. at an orientation where the rate of change of the total Sampson error is zero). As such, by solving the equation ∂J(ϕ)/∂ϕ=0, the optimal three-vector Euler-angle ϕ* (i.e. the best or optimal orientation) is determined.
In sub-step P40, the optimal three-vector Euler-angle ϕ* is converted to a 3×3 rotation matrix by $e^{[\phi]_{\times}}$, which is also referred to as a rotation matrix R″, where $e^{A}$ is the matrix exponential of A. The rotation matrix R″ represents a new orientation. It should be noted that the yielded rotation matrix R″ is different from the aforementioned rotation matrices R and R′, and thus it is labelled by a different symbol.
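The conversion of a three-vector into a rotation matrix through the matrix exponential of its skew-symmetric form can be sketched with Rodrigues' formula, which evaluates that exponential in closed form. The function names `skew` and `rot_exp` are illustrative.

```python
import numpy as np

def skew(a):
    """Skew-symmetric matrix [a]_x such that skew(a) @ b equals np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def rot_exp(phi):
    """Matrix exponential of [phi]_x via Rodrigues' formula."""
    phi = np.asarray(phi, dtype=float)
    theta = np.linalg.norm(phi)
    S = skew(phi)
    if theta < 1e-12:
        return np.eye(3) + S            # first-order term suffices near zero
    return (np.eye(3)
            + (np.sin(theta) / theta) * S
            + ((1.0 - np.cos(theta)) / theta ** 2) * (S @ S))

# A quarter turn about the Z axis maps the X axis onto the Y axis.
R_quarter = rot_exp([0.0, 0.0, np.pi / 2])
```

The result is always a proper rotation (orthonormal with unit determinant), which is what allows the orthogonality constraints on the VPs to be carried by R alone.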
In sub-step P50, the norm of the three-vector Euler-angle ϕ* (i.e. ∥ϕ*∥) is checked to determine whether it is very close to 0. If it is very close to 0, the computation proceeds to sub-step P60. Otherwise, the computation reiterates from sub-step P20, with the yielded rotation matrix R″ serving as input instead of the initial rotation matrix R generated in sub-step P10; a further 3×3 rotation matrix is then obtained by executing sub-steps P20-P40 again.
In sub-step P60, co-variances of the yielded rotation matrix R″ are also computed, which is equivalent to uncertainty in the yielded rotation matrix R″ due to error of li.
In sub-step P70, the VP $v_z^*$ is computed by $v_z^* = K R^* [0,0,1]^{\tau}$. That is, the yielded rotation matrix R″ serves as input for the computation of the VP $v_z^*$. In one embodiment, if iterative executions are performed (i.e. when the execution of sub-step P50 results in reiterating from sub-step P20 repeatedly), the 3×3 rotation matrix R eventually obtained to compute the ground orientation is expected to have the least total error trace($\Sigma_\phi$).
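Selecting among several estimates by the trace of their co-variance, and deriving the ground normal $n = R^{*}[0,0,1]^{\tau}$ given earlier in this disclosure, can be sketched as below. The candidate list, its values, and the function names are hypothetical and serve only as an illustration.

```python
import numpy as np

def ground_normal(R_opt):
    """Ground normal n = R* [0, 0, 1]^T, i.e. the third column of R*."""
    return R_opt @ np.array([0.0, 0.0, 1.0])

def select_best_estimate(estimates):
    """Pick the (R*, Sigma_phi) pair whose co-variance has the smallest trace."""
    return min(estimates, key=lambda est: np.trace(est[1]))

# Two made-up candidates: identity, and a 0.1 rad rotation about the X axis.
c, s = np.cos(0.1), np.sin(0.1)
R_tilt = np.array([[1.0, 0.0, 0.0],
                   [0.0, c, -s],
                   [0.0, s, c]])
estimates = [(np.eye(3), np.diag([4e-4, 5e-4, 3e-4])),
             (R_tilt, np.diag([1e-4, 2e-4, 1e-4]))]

R_best, cov_best = select_best_estimate(estimates)
n = ground_normal(R_best)
```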
Step S50 can also be described as a qualitative computation with input parameters including an initial rotation matrix R0, three groups of line segments $l_i$ going through respectively $v_z$, $v_y$, $v_x$, a camera calibrated matrix K, and pixel noise co-variance $\Sigma_g$. The objective is to find R* such that $\sum_i \epsilon_i^2 / (J_i \Sigma_g J_i^{\tau})$ is minimized and to find the co-variance $\Sigma_\phi$ of R* in terms of Euler angles linearized at R0.
The step S50 qualitative computation comprises the following steps:
Step I: a parameter R is initialized by using an initial rotation matrix R0 as input (R←R0);
Step II: intermediate expressions as follows are computed:
where A+ is pseudo inverse of A;
Step III: the parameter R is updated by using $R_{\phi}R$ as input ($R \leftarrow R_{\phi}R$);
Step IV: determine whether to proceed to the next step of the computation: if ∥ϕ∥ is very close to 0, it is determined to proceed to the next step; otherwise, it returns to the step II; and
Step V: a final-determination parameter R* is set by using the parameter R (R*←R), and co-variance Σϕ of R* is computed, in which Σϕ=A+.
In one embodiment, if ∥ϕ∥ is not very close to 0 or exceeds a preset threshold value, step I to step V are repeated by taking R* as input to set R in step I (R←R*) and reiterate the execution of the computation until ∥ϕ∥ is very close to 0.
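The iterative refinement of step S50 can be sketched as a Gauss-Newton loop. This is an illustration under stated assumptions rather than the disclosed intermediate expressions: the pixel noise is taken as isotropic ($\Sigma_g = \sigma^2 I$), the perturbation is applied as $R \leftarrow e^{[\phi]_{\times}}R$ (a sign convention chosen for this sketch), the weights $J_i \Sigma_g J_i^{\tau}$ are held fixed within each iteration, and all function and parameter names are hypothetical.

```python
import numpy as np

def skew(a):
    return np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])

def rot_exp(phi):
    """Matrix exponential of [phi]_x (Rodrigues' formula)."""
    theta = np.linalg.norm(phi)
    S = skew(phi)
    if theta < 1e-12:
        return np.eye(3) + S
    return np.eye(3) + np.sin(theta) / theta * S + (1 - np.cos(theta)) / theta ** 2 * (S @ S)

def refine_rotation(R0, groups, K, sigma=1.0, tol=1e-9, max_iter=50):
    """groups maps an axis index d (0, 1, 2) to an array of segments (p, q, u, v).

    Returns the refined rotation and the pseudo-inverse of the normal matrix,
    used here as the co-variance of the rotation parameters.
    """
    R = np.array(R0, dtype=float)
    A = np.zeros((3, 3))
    for _ in range(max_iter):
        A = np.zeros((3, 3))
        rhs = np.zeros(3)
        for d, segs in groups.items():
            w = R[:, d]                      # R P_d: current 3D direction
            vp = K @ w                       # its vanishing point
            for p, q, u, v in segs:
                a = np.array([p, q, 1.0])
                b = np.array([u, v, 1.0])
                l = np.cross(a, b)
                eps = l @ vp                 # residual l^T K R P
                J = np.concatenate([np.cross(b, vp)[:2], np.cross(vp, a)[:2]])
                s2 = sigma ** 2 * (J @ J)    # J Sigma_g J^T with Sigma_g = sigma^2 I
                g = np.cross(w, K.T @ l)     # d(eps)/d(phi) at phi = 0
                A += np.outer(g, g) / s2
                rhs += g * eps / s2
        phi = -np.linalg.pinv(A) @ rhs       # least square step
        R = rot_exp(phi) @ R
        if np.linalg.norm(phi) < tol:        # stop when the update is negligible
            break
    return R, np.linalg.pinv(A)
```

A typical call passes the three classified groups and a rough initial orientation, for example `refine_rotation(np.eye(3), groups, K)`. An empty group can simply be omitted from `groups`; the pseudo-inverse keeps the step defined in that case.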
Referring to
By the above processes, as the ground orientation is estimated from orthogonal VPs, camera orientation is correspondingly obtained. In the illustrative example above, none of the X, Y, and Z groups is empty. Nonetheless, embodiments of the present invention allow any one of the three groups to be empty (e.g. both the X and Y groups contain at least one detected and classified line segment but the Z group is empty), because the computation in step S50 needs only two of the three VPs to be matched with the orientation of the virtual cube. Nevertheless, step S50 takes advantage of having all three groups non-empty: it matches all three VPs to the three groups, which further reduces the estimation error.
Referring to
where ∥a∥ is the norm of vector a.
Further, assuming three group of line segments lx
To enforce the orthogonality among $v_x^*$, $v_y^*$, $v_z^*$ and avoid the trivial zero solutions, the minimization above is constrained by: $v_x^{*\tau}\omega v_y^* = 0$, $v_y^{*\tau}\omega v_z^* = 0$, $v_z^{*\tau}\omega v_x^* = 0$, $\|v_x^*\| = 1$, $\|v_y^*\| = 1$, $\|v_z^*\| = 1$. In a case where i.e. group lz
On the other hand, it is difficult to solve the quadratic least square minimization problem with six quadratic equality constraints such as those presented in above equation (1). In this regard, the present invention provides a solution, presented below, to circumvent the quadratic equality constraints by rotation.
Basis directions in $\mathbb{R}^3$ are defined as X=[1,0,0], Y=[0,1,0] and Z=[0,0,1], and their VPs are imaged by a camera calibrated matrix K as:
$q_x = KX,\quad q_y = KY,\quad q_z = KZ \qquad (2)$.
Since vx*, vy*, vz* and qx, qy, qz are on the plane at infinity, there exists a projective transformation H* such that
$v_d^* = H^* q_d$ for d being x, y and z (3).
A simple check reveals that $v_x^*$, $v_y^*$, $v_z^*$ are orthogonal on $\omega = K^{-\tau}K^{-1}$. Because H* transforms points on the plane at infinity, the “infinite homography” property is applied herein, such that:
$H^* = K R^* K^{-1}$, where R* is a rotation matrix (4).
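Equations (2) to (4) can be checked numerically: applying the infinite homography $H^* = K R^* K^{-1}$ to the imaged basis directions gives the same points as imaging the rotated directions directly. The intrinsic values and the sample rotation below are assumptions for illustration only.

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Sample rotation R*: 0.2 rad about the Y axis (illustrative).
c, s = np.cos(0.2), np.sin(0.2)
R_star = np.array([[c, 0.0, s],
                   [0.0, 1.0, 0.0],
                   [-s, 0.0, c]])

basis = np.eye(3)                              # rows are X, Y, Z
q = [K @ basis[d] for d in range(3)]           # equation (2): q_d = K d
H_star = K @ R_star @ np.linalg.inv(K)         # equation (4)
v_star = [H_star @ q_d for q_d in q]           # equation (3): v_d* = H* q_d
v_direct = [K @ R_star @ basis[d] for d in range(3)]
```

The two lists agree because $K^{-1}K$ cancels, which is why the VPs can be parameterized by a single rotation matrix.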
Accordingly, by substituting equations (2), (3), and (4) into the equation (1), the least square intersection points problem for three line-segment groups lx
As such, the six quadratic equality constraints are eliminated and replaced by a single rotation matrix constraint. Furthermore, this explains why the distance $\delta_i$ is defined in terms of $l_i^{\tau} K R P_i$, a formulation that is effective for estimating camera orientation.
Therefore, by leveraging the property of orthogonal VPs, camera orientation is automatically estimated by analyzing scene structure. The core algorithm is properly established on a least square optimization approach, enabling the error uncertainty on camera orientation to be computed.
The control circuitry 330 is electrically coupled with the front-facing camera 320, the CPU 340, and the GPU 350. The control circuitry 330 is configured to transmit a video file recorded by the front-facing camera 320 to the CPU 340 and the GPU 350 for estimating camera orientation. The configuration of the present embodiment enables the AGV 300 to be a street vehicle or a commercial or domestic robot.
Although the above description of the present invention involved only ground-based AGVs, an ordinarily skilled person in the art can readily adapt and apply the various embodiments of the present invention in other machine vision applications in e.g. aerial and marine-based drones without undue experimentation or deviation from the spirit of the present invention.
The electronic embodiments disclosed herein may be implemented using computing devices, computer processors, or electronic circuitries including but not limited to application specific integrated circuits (ASIC), field programmable gate arrays (FPGA), and other programmable logic devices configured or programmed according to the teachings of the present disclosure.
Computer instructions or software codes running in the computing devices, computer processors, or programmable logic devices can readily be prepared by practitioners skilled in the software or electronic art based on the teachings of the present disclosure.
All or portions of the electronic embodiments may be executed in one or more computing devices including server computers, personal computers, laptop computers, mobile computing devices such as smartphones and tablet computers.
The electronic embodiments include computer storage media having computer instructions or software codes stored therein which can be used to program computers or microprocessors to perform any of the processes of the present invention. The storage media can include, but are not limited to, floppy disks, optical discs, Blu-ray Disc, DVD, CD-ROMs, and magneto-optical disks, ROMs, RAMs, flash memory devices, or any type of media or devices suitable for storing instructions, codes, and/or data.
Various embodiments of the present invention also may be implemented in distributed computing environments and/or Cloud computing environments, wherein the whole or portions of machine instructions are executed in distributed fashion by one or more processing devices interconnected by a communication network, such as an intranet, Wide Area Network (WAN), Local Area Network (LAN), the Internet, and other forms of data transmission medium.
The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art.
The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated.