The present invention relates to a device and a process for carrying out three-dimensional localization and pose estimation of an object using images of the object captured by a plurality of cameras; and a computer-readable storage medium storing the program thereof.
The stereo method is a technique for reconstructing a three-dimensional environment using images captured by a plurality of cameras at different viewpoints. In recent years, image recognition techniques have become more widely used in the factory automation field. Among these techniques, the stereo method can measure the three-dimensional shape, size, localization, and pose of a target object with high accuracy, a capability that cannot be achieved by other image processing techniques. With this advantage, the stereo method is widely applicable in the industrial field, for example, for manipulation of robots for bin-picking of randomly placed parts. Moreover, the stereo method can be performed at low cost, requiring only conventional image information acquired from different viewpoints and no special hardware. For these reasons, there is high expectation for practical utilization of the localization and pose estimation technique based on the stereo method.
On the other hand, the stereo method has a long-standing problem called "occlusion", which occurs due to the positional difference between the cameras. Occlusion here refers more specifically to the "self-occlusion phenomenon", in which a part of an edge of the target object is hidden by the object itself, so that the part can be captured by one camera but not by another. When three-dimensional reconstruction is performed on an image set having occlusion, a stereo correspondence error occurs in the defective part, causing recognition of the target object to fail, or yielding an incorrect localization and pose due to the resulting false three-dimensional reconstructed structure. This problem has been a drawback of the stereo method.
In contrast, on the same right lateral face, the edge on the far side of the right lateral face can be seen by the second camera, but cannot be seen by the first camera (it does not appear in the first camera image). More specifically, the combination of the first camera image and the second camera image has occlusion.
When three-dimensional reconstruction is performed on such an image set having occlusion, a stereo correspondence error occurs in the defective part, causing recognition of the target object to fail, or yielding an incorrect localization and pose due to the resulting false three-dimensional reconstructed structure. Therefore, occlusion is a major hurdle to factory utilization of the stereo method.
As one solution to this problem, there is a known method that uses an extra camera to verify the three-dimensional reconstructed structures obtained by a conventional binocular camera so as to eliminate correspondence errors. This process eliminates all information other than the information observable by all cameras.
However, in the method of eliminating information other than the information observable by all cameras, depending on the geometric positioning of the cameras and the target object at the time of image capture, the verification of the right lateral face of a three-dimensional reconstructed structure created from the first camera image and the third camera image may be performed using the second camera image; that is, only the camera image used for verification may contain occlusion. In this case, even though the result of the stereo correspondence is correct, the result is not approved (in other words, a correct correspondence may be eliminated as a false correspondence). This disadvantage has been considered a problem to be solved. It could be mitigated by combining the results of multiple stereo processes in which the roles of the basic image, the reference image, and the verification image are switched. Even so, the determination of false correspondence remains impossible in principle using a trinocular camera.
Further, the process of determining stereo correspondence by referring to the luminance matching condition of the region (surface) containing the edge also has a problem in that the luminance difference between the respective surfaces of the object greatly depends on the degree of exposure and other conditions (the material and surface treatment of the target object, lighting position, camera performance, etc.). In actual factory environments, overexposure or the like may unavoidably occur due to various factors, in which case such luminance-based determination becomes unreliable.
Although a great deal of research has been conducted on improving the accuracy of three-dimensional reconstruction, eliminating false stereo correspondence data, and detecting occlusion by using at least three cameras, little research has been directed at a method for measuring the localization and pose of an object without the influence of a part of the three-dimensional reconstructed structure generated by false stereo correspondence.
In order to solve the foregoing problems, an object of the present invention is to provide a device and method capable of measuring the three-dimensional localization and pose of a target object without the influence of false stereo correspondence data that may be contained in a portion of image data captured by at least three cameras; and a computer-readable storage medium storing the program thereof.
The object of the present invention is attained by the following means.
Specifically, a three-dimensional localization and pose estimation device according to the present invention comprises:
an input unit for receiving three or more items of image data obtained by capturing images of an object by imaging units at different viewpoints; and
an arithmetic unit,
wherein:
the arithmetic unit performs:
1) finding a three-dimensional reconstruction point set and a feature set for each of multiple pairs of two different images selected from the three or more items of image data,
2) calculating a total three-dimensional reconstruction point set and a total feature set by totaling the three-dimensional reconstruction point sets and the feature sets of the multiple pairs,
3) matching a model feature set regarding model data of the object with the total feature set, thereby determining, among the total three-dimensional reconstruction point set, points corresponding to model points of the object;
the three-dimensional reconstruction point set contains three-dimensional position information of segments obtained by dividing a boundary of the object in the image data; and
the feature set contains three-dimensional information regarding vertices of the segments.
A second three-dimensional localization and pose estimation device according to the present invention is arranged such that, based on the first three-dimensional localization and pose estimation device,
the segments are approximated by straight lines, arcs, or a combination of straight lines and arcs;
the three-dimensional information regarding the vertices comprises three-dimensional position coordinates and two types of three-dimensional tangent vectors of the vertices;
in Step (3), the process of matching a model feature set regarding model data of the object with the total feature set is a process of finding a transformation matrix for three-dimensional coordinate transformation, thereby matching a part of the model feature set with a part of the total feature set; and
in Step (3), the process of determining, among the total three-dimensional reconstruction point set, points that correspond to model points of the object is a process for evaluating a concordance of a result of three-dimensional coordinate transformation of the model points using the transformation matrix with the points of the total three-dimensional reconstruction point set.
A process for measuring three-dimensional localization and pose according to the present invention comprises the steps of:
1) obtaining three or more items of image data by capturing images of an object by imaging units at different viewpoints;
2) finding a three-dimensional reconstruction point set and a feature set for each of multiple pairs of two different images selected from the three or more items of image data;
3) calculating a total three-dimensional reconstruction point set and a total feature set by totaling the three-dimensional reconstruction point sets and the feature sets of the multiple pairs;
4) matching a model feature set regarding model data of the object with the total feature set, thereby determining, among the total three-dimensional reconstruction point set, points corresponding to model points of the object;
wherein:
the three-dimensional reconstruction point set contains three-dimensional position information of segments obtained by dividing a boundary of the object in the image data; and
the feature set contains three-dimensional information regarding vertices of the segments.
A computer-readable storage medium according to the present invention stores a program for causing a computer to execute the functions of:
1) obtaining three or more items of image data by capturing images of an object by imaging units at different viewpoints;
2) finding a three-dimensional reconstruction point set and a feature set for each of multiple pairs of two different images selected from the three or more items of image data;
3) calculating a total three-dimensional reconstruction point set and a total feature set by totaling the three-dimensional reconstruction point sets and the feature sets of the multiple pairs;
4) matching a model feature set regarding model data of the object with the total feature set, thereby determining, among the total three-dimensional reconstruction point set, points corresponding to model points of the object;
wherein:
the three-dimensional reconstruction point set contains three-dimensional position information of segments obtained by dividing a boundary of the object in the image data; and
the feature set contains three-dimensional information regarding vertices of the segments.
The present invention enables accurate localization and pose estimation without the influence of a three-dimensional reconstructed structure generated by false stereo correspondence that may occur in a portion of image data due to occlusion or the like. In the conventional method that supplementarily uses an additional camera image for verification, there are cases where a correct combination of stereo correspondence is regarded as a false correspondence because of the information of the verification camera image. In the present invention, however, all of the three-dimensional reconstructed structures captured by the different combinations of the multiple cameras are handled equally, so the reconstruction result does not depend on the combination of the cameras. Therefore, localization and pose recognition can be performed more accurately regardless of the geometric positioning of the cameras and the target object.
(a) through (f): Diagrams showing multiple results of projection of a 3D reconstruction data point group on a model coordinate system according to a localization- and pose-estimation result obtained by a conventional process.
(a) through (f): Diagrams showing multiple results of transformation of a 3D reconstruction data point group on a model coordinate system according to a localization- and pose-estimation result obtained by a process of the present invention.
An embodiment of the present invention is described below in reference to the attached drawings.
The CPU 1 reads out a predetermined program from the recording unit 2, loads it into the memory 3, and executes predetermined data processing using a predetermined work area in the memory 3. The CPU 1 records, as required, results of ongoing processing and final results of completed processing in the recording unit 2. The CPU 1 accepts instructions and data input from the operation unit 5 via the interface unit 4, and executes the required task. Further, as required, the CPU 1 displays predetermined information on the display unit 6 via the interface unit 4. For example, the CPU 1 displays on the display unit 6 a graphical user interface image indicating acceptance of input via the operation unit 5. The CPU 1 acquires information regarding the state of the user's operation of the operation unit 5, and executes the required task; for example, it records the input data in the recording unit 2. The present device may be constituted of a computer. In this case, a computer keyboard, mouse, etc. may be used as the operation unit 5, and a CRT display, liquid crystal display, etc. may be used as the display unit 6.
The first to third imaging units C1 to C3 are disposed in predetermined positions at a predetermined interval. The first to third imaging units C1 to C3 capture images of a target object T, and send the resulting image data to the main body unit. The main body unit records the image data sent from the imaging units via the interface unit 4 in a manner to distinguish the respective data items from each other, for example, by giving them different file names according to the imaging unit. When the output signals from the first to third imaging units C1 to C3 are analog signals, the main body unit comprises an AD (analog-digital) conversion unit (not shown) that samples the input analog signals at predetermined time intervals into digital data. When the output signals from the first to third imaging units C1 to C3 are digital data, the AD conversion unit is not necessary. The first to third imaging units C1 to C3 are at least capable of capturing still pictures, and optionally capable of capturing moving pictures. Examples of the first to third imaging units include digital cameras, and digital or analog video cameras.
An operation sequence of the present device is described below with reference to the flow chart.
In Step S1, the initial settings are made. The initial settings are required to enable the processes in Step S2 and later steps. In the initial settings, for example, the control protocol and data transmission path for the first to third imaging units C1 to C3 are established to enable control of the first to third imaging units C1 to C3.
In Step S2, images of the target object T are captured by the first to third imaging units C1 to C3. The captured images are sent to the main body unit and recorded in the recording unit 2 under predetermined file names. In this manner, three items of two-dimensional image data captured at different localizations and in different directions are stored in the recording unit 2. In this embodiment, the three items of two-dimensional image data obtained by the first to third imaging units C1 to C3 are represented by Im1 to Im3.
In Step S3, two of the three image data items Im1 to Im3 stored in the recording unit 2 in Step S2 are specified as paired images. More specifically, one of the pairs (Im1, Im2), (Im2, Im3), and (Im3, Im1) is specified.
In Step S4, using the two items of two-dimensional image data specified as paired images in Step S3, a three-dimensional reconstructed structure (a set of three-dimensional reconstruction points) is calculated by stereo correspondence. Here, the correspondence is found not point by point (pixel by pixel) but in more comprehensive units, i.e., "segments". This reduces the search space considerably compared with point-based reconstruction. For the detailed processing method, the conventional method disclosed in Non-patent Literature 1 above can be referenced. The following explains only the operation directly related to the present invention.
The reconstruction is performed by sequentially subjecting the paired images to (a) edge detection, (b) segment generation, and (c) three-dimensional reconstruction by evaluating segment connectivity and correspondence between the images. Hereinafter, the set of three-dimensional reconstruction points obtained for the paired images in Step S4 is represented by Fi. Because Step S4 is repeated for all pairs as described later, the index i distinguishes the pair. In this embodiment, i is 1, 2, or 3, since two items are selected out of the three images.
Any known image-processing method can be used for edge detection in each image. For example, the edge strength and direction at each point of the image are found by a first-order differential operator, and a closed edge (also referred to as a boundary) surrounding a region is obtained by non-maximum suppression, thresholding, and edge extension.
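By way of illustration only, the following Python sketch (NumPy and SciPy assumed; function names are hypothetical) computes edge strength and direction with a Sobel-type first-order differential operator and applies a simple threshold; non-maximum suppression and edge extension are omitted here.

```python
import numpy as np
from scipy import ndimage

def edge_strength_and_direction(image):
    # Sobel-type first-order differential operator: horizontal and
    # vertical derivatives give the per-pixel edge strength and direction.
    img = image.astype(float)
    gx = ndimage.sobel(img, axis=1)   # derivative along x
    gy = ndimage.sobel(img, axis=0)   # derivative along y
    return np.hypot(gx, gy), np.arctan2(gy, gx)

def binary_edges(strength, thresh):
    # Simple thresholding; a full pipeline would continue with
    # non-maximum suppression and edge extension to obtain closed
    # boundaries surrounding regions.
    return strength >= thresh
```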
Segments are generated using the two edge images obtained above. A "segment" is obtained by dividing an edge into a plurality of straight-line components. At first, the boundary is tentatively divided under a predetermined condition, and the segments are approximated by straight lines according to the method of least squares. If any segment has a significant approximation error, it is divided at the point most distant from the straight line connecting its two ends (the point with the largest perpendicular distance to that line). This process is repeated to determine the points at which to divide the boundary (divisional points), thereby generating segments for each of the two images, together with the straight lines approximating them.
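The recursive division described above may be sketched as follows; this is a minimal illustration assuming the boundary is given as an ordered array of two-dimensional points, and the tolerance value is an assumed parameter, not one specified herein.

```python
import numpy as np

def split_into_segments(points, tol=2.0):
    # points: (N, 2) ordered boundary coordinates.
    # Returns index pairs (start, end) delimiting the segments.
    def max_deviation(i, j):
        p, q = points[i], points[j]
        v = q - p
        norm = np.hypot(v[0], v[1])
        inner = points[i + 1:j]
        if norm == 0.0 or len(inner) == 0:
            return 0.0, i
        # Perpendicular distance of each interior point to the chord p-q.
        d = np.abs(v[0] * (inner[:, 1] - p[1]) -
                   v[1] * (inner[:, 0] - p[0])) / norm
        k = int(np.argmax(d))
        return float(d[k]), i + 1 + k

    segments, stack = [], [(0, len(points) - 1)]
    while stack:
        i, j = stack.pop()
        dev, k = max_deviation(i, j)
        if dev > tol and j - i > 1:
            # Divide at the point most distant from the chord.
            stack.extend([(i, k), (k, j)])
        else:
            segments.append((i, j))
    return sorted(segments)
```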
The processing result is recorded in the recording unit as a boundary representation (structural data). More specifically, each image is represented by a set of multiple regions. Each region R is represented by a list of an external boundary B of the region and a boundary H with respect to the inner hole of the region. The boundaries B and H are represented by a list of segments S. Each region is defined by values representing a circumscribed rectangle that surrounds the region, and a luminance. Each segment is oriented so that the region containing the segment is seen on the right side. Each segment is defined by values representing coordinates of the start point and the end point, and an equation of the straight line that approximates the segment. Such data construction is performed for the two images. The following correspondence process is performed on the data structure thus constructed.
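The boundary representation described above might be held in data structures along the following lines (a sketch only; all type and field names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Segment:
    start: Tuple[float, float]           # coordinates of the start point
    end: Tuple[float, float]             # coordinates of the end point
    line: Tuple[float, float, float]     # a, b, c of the approximating line ax + by + c = 0
    # Orientation convention: the region containing the segment lies on its right.

@dataclass
class Boundary:
    segments: List[Segment] = field(default_factory=list)

@dataclass
class Region:
    outer: Boundary                      # external boundary B
    holes: List[Boundary] = field(default_factory=list)   # inner-hole boundaries H
    bbox: Tuple[float, float, float, float] = (0, 0, 0, 0) # circumscribed rectangle
    luminance: float = 0.0               # representative luminance of the region

@dataclass
class ImageDescription:
    regions: List[Region] = field(default_factory=list)    # a set of multiple regions
```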
Next, corresponding segments are found between the two images. Although the segments represent images of the same object, it is not easy to determine their correspondences because of variable lighting conditions, occlusion, noise, etc. Therefore, correspondences are first found roughly on a region basis. For a pair of regions to be regarded as corresponding, the difference between the luminances of the regions must be equal to or less than a certain value (for example, a level of 25 on a 256-level luminance scale), and the regions must contain points satisfying the epipolar condition. However, since this is not a sufficient condition, multiple corresponding regions may be found for a single region. More specifically, this process finds all potential pairs having corresponding boundaries, so as to reduce the search space for finding correspondences on a segment basis. This is a kind of coarse-to-fine analysis.
Among the segments roughly assumed to compose the same boundary, potential corresponding segment pairs are found and summarized in a list. Here, for a pair of segments to be regarded as corresponding, the segments must have corresponding portions satisfying the epipolar condition, their upward or downward orientations (each segment is oriented so that the region containing it is seen on the right side) must match, and the difference between their orientation angles must fall within a certain value (e.g., 45°).
Thereafter, for each of the potential segment pairs, the degree of similarity, represented by the values C and D, is found. C, a positive factor, denotes the length of the shorter of the two corresponding segments. D, a negative factor, denotes the change in parallax from the start point to the end point of the corresponding segments. The potential segment pairs found at this stage contain multiple correspondences in which a single segment corresponds to multiple segments along the same y axis (vertical direction). As explained below, false correspondences are eliminated according to the similarity degree and the connecting condition of the segments.
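The pairing conditions and the similarity values C and D described above may be illustrated as follows. This sketch assumes each segment object carries its y-range, up/down orientation flag, orientation angle, length, and an x = f(y) line parameterization; all attribute names are hypothetical.

```python
def is_candidate_pair(s1, s2, angle_tol=45.0):
    # Necessary (not sufficient) conditions for a corresponding pair:
    # overlapping y-range (epipolar condition on rectified images),
    # matching up/down orientation (region seen on the right side),
    # and orientation angles differing by no more than angle_tol degrees.
    if min(s1.y_max, s2.y_max) <= max(s1.y_min, s2.y_min):
        return False
    if s1.upward != s2.upward:
        return False
    diff = abs(s1.angle - s2.angle) % 360.0
    return min(diff, 360.0 - diff) <= angle_tol

def similarity(s1, s2):
    # C (positive factor): length of the shorter segment.
    # D (negative factor): change in parallax between the start and the
    # end of the corresponding portion, using the x = f(y) line equations.
    y0 = max(s1.y_min, s2.y_min)
    y1 = min(s1.y_max, s2.y_max)
    C = min(s1.length, s2.length)
    D = (s1.f(y1) - s2.f(y1)) - (s1.f(y0) - s2.f(y0))
    return C, D
```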
Next, for each of the two images, a list of connected segments is created. For two segments to be regarded as connected, the difference between the luminances of the regions containing them must be equal to or less than a certain value (for example, a level of 25), and the distance from the end point of one segment to the start point of the other must be less than a certain value (for example, 3 pixels). Basically, if one segment of a corresponding pair is part of a continuous chain, the other segment must be as well. Accordingly, using the connection list and the correspondence list, a path representing a string of corresponding continuous segments connected to and from a given segment is found, in the following manner.
Further, it may even be possible to determine the connection for pairs that are not in direct connection. For example, when a single segment corresponds to two segments, a line component spanning the largest distance between the outer ends of the two segments is temporarily used as a substitute for the two. Still further, in some cases, two continuous segments connected via a point A correspond to two discontinuous segments. In this case, the two discontinuous segments are extended; if the distance between the two points at which they intersect the horizontal line through the point A is small, the two extended line components (each having the intersection point as one end) are temporarily adopted as two corresponding segments. However, to avoid generating an unnecessarily large number of temporarily assumed segments, the similarity degree between a temporarily assumed segment and a true segment must satisfy C > |D|. New temporarily assumed segments are added by the above operation, and the operation is repeated until no further segments to be added to the path are found.
Next, assuming that the paths are projected backwards on a three-dimensional space, the segments composing the same plane are grouped. This serves not only as the plane restraint condition for finding correct segment pairs, but also as a procedure to obtain an output of the boundary on a three-dimensional plane. To confirm that the segments compose the same plane, the following plane restraint theorem is referenced.
Plane restraint theorem: For the standard camera model, with respect to an arbitrary shape on a plane, a projection image on one camera and a projection image on another camera are affine-transformable.
The theorem denotes that a set of segments that exist on the same plane is affine-transformable between stereo images even for segments on an image obtained by perspective projection, thereby enabling validation of flatness of segments on an image without directly projecting segments backwards. The grouping of the segments using the plane restraint theorem is performed as follows.
First, an arbitrary pair of two corresponding continuous segments is selected from the paths of corresponding pairs, so as to form a minimum pair group.
Then, a segment continuous to each of the segments in the two images is found. Assuming that all terminal points of the three segments thus found exist on the same plane, an affine transformation matrix between the two triples of continuous segments (three segments in each image) is found according to the method of least squares. To confirm that the three segments exist on a plane, it is verified that the point obtained by affine transformation of a terminal point in one image is concordant with the corresponding terminal point in the other image. In the present specification, concordance of two points indicates a state in which the distance between the two points is equal to or less than a predetermined value. Therefore, if the distance is equal to or less than a predetermined value (e.g., 3 pixels), it is determined that the three segments exist on the same plane.
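The plane restraint validation may be sketched as follows, assuming the corresponding terminal points are given as NumPy arrays; the 3-pixel tolerance follows the example in the text, and the function names are hypothetical.

```python
import numpy as np

def fit_affine(src, dst):
    # Least-squares 2-D affine transform mapping src -> dst.
    # src, dst: (N, 2) arrays of corresponding terminal points (N >= 3).
    A = np.hstack([src, np.ones((len(src), 1))])   # rows [x, y, 1]
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)    # (3, 2) parameter matrix
    return M

def satisfies_plane_restraint(src, dst, new_src, new_dst, tol=3.0):
    # Fit an affine transform on the terminal points gathered so far and
    # verify that a newly added terminal point is mapped to within tol
    # pixels of its counterpart (the plane restraint theorem).
    M = fit_affine(src, dst)
    pred = np.hstack([new_src, [1.0]]) @ M
    return np.linalg.norm(pred - new_dst) <= tol
```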
When the above method finds that the three segments exist on the same plane, a segment continuous to each of the right and left segments is found again. In this manner, an affine transformation matrix is found for the four corresponding segments, and it is verified whether the corresponding terminal points satisfy the obtained transformation matrix. As long as the plane restraint condition is satisfied, the validation is repeated for successive continuous segments.
As a result of the above process, pairs of segment groups that constitute a plane are found. However, in some cases, multiple pair groups may be obtained for a single segment pair (multiple continuous segments that constitute the plane). Therefore, the degree of shape similarity is calculated for each pair group so that each segment pair is allotted the single pair group with the maximum similarity degree. The similarity degree G of a pair group is the total of the similarity degrees C and D of the segments contained in the pair group, where the negative factor D enters with a minus sign, i.e., −D is added. Multiple correspondences indicate that there are one or more false-matching pairs. In a false-matching pair, the segment pair has a small correspondence (C is small), a large difference in parallax (|D| is large), and a small number of continuous segments; hence, the similarity degree G of the pair group containing the pair becomes small. Therefore, the pair group having the maximum similarity degree G is selected sequentially, and competing pair groups are eliminated. In this manner, it is possible to specify the corresponding segment pairs between the two images.
With the above process, the coordinates of the segments in three-dimensional space can be found from the differences in parallax of the corresponding segment pairs between the two images. Since the differences in parallax are calculated from the functions of the segments, the results are obtained with sub-pixel accuracy. Further, the differences in parallax along the segments do not fluctuate. For example, assuming that the equations of two corresponding segments j in the two images are x = fj(y) and x = gj(y), the difference in parallax d between the two segments can be found by d = fj(y) − gj(y). In practice, the three-dimensional segments are expressed by equations of straight lines.
Using the information of the obtained corresponding segments and their difference in parallax d, and taking the positions of the two cameras (imaging units) into account, a three-dimensional reconstruction point set Fi is found. A detailed explanation of the calculation for finding three-dimensional coordinates from two corresponding points on two images and their difference in parallax is omitted here, because known methods are adoptable both in the case where the optical axes of the two cameras are parallel and in the case where they are disposed at an angle of convergence.
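For reference, a minimal sketch of the triangulation for the simplest case of parallel optical axes is given below. Rectified images are assumed; the focal length and baseline are assumed calibration values, and the standard relations Z = f·b/d, X = x·Z/f, Y = y·Z/f are used with image coordinates measured from the principal point.

```python
import numpy as np

def reconstruct_segment_points(f_left, f_right, y_values, focal, baseline):
    # f_left, f_right: x = f(y) line parameterizations of the two segments.
    # y_values: image rows at which to reconstruct (e.g. segment endpoints).
    # focal: focal length in pixels; baseline: camera separation.
    # Returns (N, 3) points expressed in the left-camera frame.
    y = np.asarray(y_values, dtype=float)
    x_l = f_left(y)
    d = x_l - f_right(y)                 # disparity d = f_j(y) - g_j(y)
    Z = focal * baseline / d             # depth from disparity
    X = x_l * Z / focal
    Y = y * Z / focal
    return np.column_stack([X, Y, Z])

# Example: two straight segments x = a*y + b with constant disparity 20 px.
left  = lambda y: 0.10 * y + 120.0
right = lambda y: 0.10 * y + 100.0
pts = reconstruct_segment_points(left, right, [50.0, 150.0],
                                 focal=800.0, baseline=0.25)
```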
The result obtained above is recorded in the recording unit 2 in the form of a predetermined data structure. The data structure is composed of a set of groups G* expressing three-dimensional planes. Each group G* contains information of a list of three-dimensional segments S* constituting the boundary. Each group G* has a normal direction of the plane, and each segment has three-dimensional coordinates of the start and end points, and an equation of a straight line.
In Step S5, calculation of features is performed with respect to the image data specified as paired images in Step S3. Here, a set of "vertices", which is a feature required for model matching, is found. A "vertex" refers to the intersection of two virtual straight lines defined by the straight lines allotted to spatially adjacent three-dimensional segments. More specifically, with respect to the three-dimensional reconstruction point set Fi, the intersection of two adjacent tangent lines is found using the tangent lines at the terminal points of the straight lines allotted to two adjacent segments (in this example, where straight lines approximate the segments, the tangent lines are the approximating straight lines themselves). The obtained intersections are defined as vertices. A set of the vertices is expressed as Vi. Further, the angle between the two tangent vectors (hereinafter referred to as the included angle) is found.
More specifically, the features are the three-dimensional position coordinates of the vertex, the included angle at the vertex, and the two tangent vector components. To find these features, the method disclosed in Non-patent Literature 2 above may be used.
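A sketch of the vertex feature computation is given below. Since two lines in three-dimensional space rarely intersect exactly, the midpoint of their shortest connecting segment is used here as the vertex position; this is an illustrative choice under stated assumptions, not a prescription of the present invention.

```python
import numpy as np

def vertex_feature(p1, t1, p2, t2):
    # p1, p2: points on the two adjacent segment lines.
    # t1, t2: tangent (direction) vectors of the lines.
    # Returns (vertex_position, included_angle_deg, t1, t2),
    # or None when the tangents are nearly parallel (no usable vertex).
    t1 = t1 / np.linalg.norm(t1)
    t2 = t2 / np.linalg.norm(t2)
    # Solve for parameters s, u minimizing |p1 + s*t1 - (p2 + u*t2)|.
    w = p1 - p2
    a, b, c = t1 @ t1, t1 @ t2, t2 @ t2
    d, e = t1 @ w, t2 @ w
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None
    s = (b * e - c * d) / denom
    u = (a * e - b * d) / denom
    vertex = 0.5 * ((p1 + s * t1) + (p2 + u * t2))
    cosang = np.clip(t1 @ t2, -1.0, 1.0)
    angle = np.degrees(np.arccos(cosang))   # the included angle
    return vertex, angle, t1, t2
```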
In Step S6, a judgment is carried out as to whether the process is completed for all three pairs of image data, each being a different combination of the two-dimensional image data Im1 to Im3. If any pair remains unprocessed, the sequence goes back to Step S3 to repeat Steps S3 to S5. If the process is completed for all pairs, an entire three-dimensional reconstruction point set Fa (Fa = F1 + F2 + F3), which is the total of all three-dimensional reconstruction point sets Fi, and an entire vertex set Va (Va = V1 + V2 + V3), which is the total of all vertex sets Vi, are found. Then, the sequence proceeds to Step S7. As required, Fi and Vi (i = 1, 2, 3), Fa, and Va are stored in the recording unit 2.
Step S7 performs matching with model data. Here, it is assumed that, with respect to the target object T, the model point set Ft and the model vertex set Vt, corresponding to the entire three-dimensional reconstruction point set Fa and the entire vertex set Va, are generated from its three-dimensional shape data, and that the generated data is stored in the recording unit 2.
The target object T used in the present invention is an industrial product whose three-dimensional shape is determined in the designing process before actual manufacture; therefore, it is usually possible to obtain the original three-dimensional shape data (such as CAD data), which may be used to generate the model point set Ft and the model vertex set Vt. If the original data cannot be obtained, the above process may be performed on stereo images of the target object T captured under desirable imaging conditions (desirable lighting, imaging position, resolution, etc.), thereby generating the model point set Ft and the model vertex set Vt.
With respect to the entire vertex set Va and the model vertex set Vt generated from the image data of the target object T, 4×4 (4 rows and 4 columns) coordinate transformation matrices Tj are found for all combinations (denoted by candidate number j) of vertices having similar included angle values, to create a solution candidate group Ca (Ca = ΣCj). This process is called "initial matching". Then, using each transformation matrix Tj as an initial value, "fine adjustment" is performed according to the Iterative Closest Point (ICP) algorithm using the model point group and the entire three-dimensional reconstruction point set Fa, thereby updating each coordinate transformation matrix Tj. The final coordinate transformation matrix Tj and the matching level Mj between the model points and the data points are stored in the recording unit 2 as information of each candidate.
For the detailed method, the method disclosed in the above Non-patent Literature 2 can be referenced. The following explains only the operation directly related to the present invention.
The transformation from a three-dimensional coordinate vector a=[x y z]t to a three-dimensional coordinate vector a′=[x′ y′ z′]t (t denotes transposition) is expressed as a′=Ra+P using a 3×3 three-dimensional coordinate rotation matrix R and a 3×1 translation vector P. Therefore, the relative localization/pose of the target object T may be defined by a 4×4 coordinate transformation matrix T for moving a model to match it with a corresponding three-dimensional structure of the captured image data.
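A minimal sketch of assembling and applying such a matrix T from R and P (NumPy assumed; helper names hypothetical):

```python
import numpy as np

def make_transform(R, P):
    # Assemble the 4x4 coordinate transformation matrix T from a 3x3
    # rotation matrix R and a 3x1 translation vector P, so that applying
    # T to a homogeneous point realizes a' = R a + P.
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.asarray(P).ravel()
    return T

def apply_transform(T, a):
    # Transform a 3-D point a by T using homogeneous coordinates.
    ah = np.append(np.asarray(a, dtype=float), 1.0)
    return (T @ ah)[:3]
```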
First, the initial matching is performed. The initial matching is a process of comparing the model vertex set Vt with the entire vertex set Va of the captured image data, thereby finding transformation matrices T. However, since information on the correct vertex correspondence between the model set and the measured set cannot be obtained in advance, all likely combinations are provisionally adopted as candidates.
First, the model vertex VM is assumed to move so as to match the measurement data vertex VD. The translation vector P of the matrix T is determined from the relationship between the three-dimensional position coordinates of the vertices VM and VD. The rotation matrix R is determined from the directions of the two three-dimensional vectors constituting each vertex. If a pair shows a large difference in the angle θ formed by the two vectors constituting the vertex, the correspondence is likely to be incorrect; therefore, such a pair is excluded from the candidates. More specifically, with respect to VM(i) (i = 1, . . . , m) and VD(j) (j = 1, . . . , n), the matrices Tij (corresponding to the aforementioned coordinate transformation matrices Tj) are found for all combinations A(i,j) satisfying |θM(i) − θD(j)| < θth, which are regarded as correspondence candidates. Here, m and n respectively denote the numbers of vertices in the model vertex set VM and the measurement data vertex set VD. The threshold θth may be determined empirically, for example.
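The candidate generation of the initial matching may be sketched as follows. The rotation here is obtained by a least-squares fit (Kabsch/SVD) aligning the two tangent vectors of the model vertex with those of the data vertex; this is one possible realization, and the vertex tuple layout and the threshold value are assumptions for illustration.

```python
import numpy as np

def rotation_between_vertex_frames(m_t1, m_t2, d_t1, d_t2):
    # Rotation aligning the two tangent vectors of a model vertex with
    # those of a data vertex (least-squares fit via SVD, Kabsch method).
    A = np.column_stack([m_t1, m_t2])    # 3x2 model directions
    B = np.column_stack([d_t1, d_t2])    # 3x2 data directions
    U, _, Vt = np.linalg.svd(B @ A.T)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ S @ Vt                    # proper rotation, det = +1

def initial_matching(model_vertices, data_vertices, theta_th=15.0):
    # Enumerate candidate transforms T_ij for all vertex combinations
    # whose included angles differ by less than theta_th (degrees).
    # Vertices are assumed to be tuples (position, angle, t1, t2).
    candidates = []
    for i, (pm, am, mt1, mt2) in enumerate(model_vertices):
        for j, (pd, ad, dt1, dt2) in enumerate(data_vertices):
            if abs(am - ad) >= theta_th:
                continue                 # likely an incorrect correspondence
            R = rotation_between_vertex_frames(mt1, mt2, dt1, dt2)
            T = np.eye(4)
            T[:3, :3] = R
            T[:3, 3] = pd - R @ pm       # move model vertex onto data vertex
            candidates.append(((i, j), T))
    return candidates
```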
Next, fine adjustment is performed. The fine adjustment is a process of finding the correspondence between the model points and the data points of the entire three-dimensional reconstruction point set Fa, thereby simultaneously judging the adequacy of A(i,j) and reducing the errors contained in the matrix Tij(0). The process repeats a sequence of: transferring the model points using the coordinate transformation matrix Tij(0) found by the initial matching; searching for the image data points (points in the entire three-dimensional reconstruction point set Fa) corresponding to the model points; and updating the coordinate transformation matrix by least squares. The details follow known methods (for example, see the section "3.2 fine adjustment" in Non-patent Literature 2 above).
Since the initial matching uses only a local geometric feature, the corresponding-point search may not be sufficiently accurate except for the model points in the vicinity of the vertices used for calculating Tij. Therefore, the fine adjustment is preferably performed in the following two stages.
Initial fine adjustment: correspondence errors are roughly adjusted using only model points on the segments constituting the vertices used for initial matching.
Main fine adjustment: the accuracy is increased by using all model points.
Using the final coordinate transformation matrix Tj (Tij) thus obtained, the points on the model are transformed, and the number Mj of points (matched points) for which the distance from the transformed model point to the corresponding image data point is equal to or less than a predetermined value is found for each candidate. The obtained coordinate transformation matrix Tj and the number of matched points Mj are stored in the recording unit 2.
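A compact sketch of the fine adjustment loop and the matched-point count Mj is given below (SciPy's cKDTree is used for the nearest-neighbor search; the iteration count and tolerances are assumed values, not those of the embodiment). Selecting the candidate with the largest returned M then corresponds to the judgment in Step S8 described next.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_refine(model_pts, data_pts, T0, iters=30, match_tol=2.0):
    # Fine adjustment by the ICP algorithm, starting from the initial
    # transform T0 obtained in initial matching. Returns the refined 4x4
    # transform T and the number M of matched model points (those whose
    # nearest data point lies within match_tol after transformation).
    tree = cKDTree(data_pts)
    T = T0.copy()
    for _ in range(iters):
        moved = model_pts @ T[:3, :3].T + T[:3, 3]
        dist, idx = tree.query(moved)
        near = dist < 5.0 * match_tol          # discard gross outliers
        if near.sum() < 3:
            break
        src, dst = model_pts[near], data_pts[idx[near]]
        # Least-squares rigid update (Kabsch): center, SVD, recompose.
        cs, cd = src.mean(axis=0), dst.mean(axis=0)
        U, _, Vt = np.linalg.svd((dst - cd).T @ (src - cs))
        S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
        R = U @ S @ Vt
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = cd - R @ cs
    moved = model_pts @ T[:3, :3].T + T[:3, 3]
    M = int((tree.query(moved)[0] <= match_tol).sum())
    return T, M
```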
Step S8 judges the matching results. The matched point number Mj is found for all candidates, and the candidates are ranked by Mj in descending order. The coordinate transformation matrix Tj of the top candidate (with the greatest Mj) is adopted as the solution representing the localization and pose of the target. More specifically, a coordinate transformation matrix Tj for transforming each segment of the model is thereby determined.
As described, even in a condition where the stereo correspondence partly generates a false result due to occlusion or the like, the above method enables accurate localization and pose estimation without influence of a three-dimensional reconstructed structure generated by such false stereo correspondence.
In the method in which an additional camera image is used as an auxiliary image for verification, the reconstruction result varies depending on which camera is used for verification. Therefore, in some cases, a combination having correct stereo correspondence is regarded as an unmatched combination due to the information of the verification camera image. According to the present invention, however, all of the three-dimensional reconstructed structures obtained from the different camera pairs among the three cameras are treated equally. Therefore, the reconstruction result is prevented from varying with the combination of cameras, i.e., with the geometric positions of the cameras and the target object, thereby enabling more accurate localization and pose estimation.
As described above, the present invention adopts a method of assuming a candidate group of local optimum solutions (and in the vicinity thereof) by performing matching of features. More specifically, according to a comparison between respective included angle values, which are the features, of the model and the captured image data, it is likely that a combination having similar values is near the local optimum solution of the multimodal function. Therefore, the present invention finds a candidate group of the initial estimate value (transformation matrix) in the vicinity of the local optimum solution, finds a local optimum solution by the ICP for each candidate (Step S7), and determines a solution having the greatest matched point number among the solution group, thereby finding a global optimum solution (Step S8).
The above embodiment does not limit the present invention. More specifically, the present invention is not limited to the disclosures of the above embodiment and may be altered in many ways.
Further, in the above embodiment, the segments are approximated by straight lines; however, the segments may be approximated by straight lines or arcs. In this case, the arcs (for example, the radius of the arc, and the directional vectors or the normal vectors from the center of the arc to the two terminal points) as well as the vertices can be used as features. Further, the segments may be approximated by a combination of straight lines and arcs (including a combination of multiple arcs). In this case, only the arcs at the two terminal points of a segment may be used as a feature of the segment, in addition to the vertices.
When the segments are approximated by arcs (including the case where the segments are approximated by a combination of straight lines and arcs), the calculation of the vertices in Step S5 is performed using the tangent lines at the ends of the arcs. The tangent lines of an arc can be found from the directional vectors from the center of the arc toward the two terminal points. Further, in Step S7, in addition to the process regarding the vertices, a process for finding correspondence candidates for combinations of arcs of the model and the obtained image data is also performed. A translation vector P can be determined from the three-dimensional coordinates of the two terminal points of the arc, and a rotation matrix R can be determined from the directional vectors and the normal vector from the center of the arc toward the two terminal points. It is preferable to exclude from the candidates any combination of arcs having a great difference in radius. The total of the correspondence candidates obtained by using the vertices and the arcs, i.e., A(i,j) and Tij(0), is regarded as the final result of the initial matching.
Further, the above embodiment is carried out by a software program using a computer as the main unit; however, the present invention is not limited to this. For example, a single hardware device or multiple hardware devices (for example, a dedicated semiconductor chip (ASIC) and its peripheral circuitry) may be used to execute a part or the entirety of the functions. When multiple hardware devices are used, for example, they may comprise a three-dimensional reconstruction calculation unit for obtaining the three-dimensional reconstructed structures from paired image data by stereo correspondence and for finding the features required for model matching; a localization- and pose-matching adjustment unit for estimating localization and pose according to the similarity of the features of the captured image data and a model; and a matching result judgment unit for ranking the candidates in order of matched point number.
Examples of the present invention are described below to further clarify the effectiveness of the present invention.
In First Example, to more easily understand the condition of false stereo correspondence, the measurement was performed using a model having a simple shape.
The three cameras were arranged such that the second camera was disposed to the right of the first camera with a base length of 25 cm, and the third camera was disposed 6 cm above the midpoint of the first and second cameras.
Images of the target object were captured by the three cameras thus arranged and subjected to the process described above.
The following describes a problem in a conventional matching process using the same captured images.
As shown above, the present invention is capable of accurate measurement even for an image set whose localization and pose could not be accurately measured by the conventional method due to false stereo correspondence.
In First Example, a model having a simple shape was used to show the condition of false stereo correspondence more clearly. For comparison, another experiment was performed using an object having a more complicated shape; the result of this experiment is explained below as Second Example. In Second Example, the measurement was performed using an L-shaped object as the target object, a shape closer to a real industrial component yet simple enough to be drawn in a diagram. The L-shaped object had two L-shaped faces and six rectangular faces.
Before presenting the results of the process according to the present invention, the following presents results of localization and pose estimation according to a conventional process using binocular stereo paired images (G1, G2).
In the case of 3D data, direct plotting of the model points will not clearly show the three-dimensional relative positions of the points. Therefore, in (a) through (f), the 3D reconstruction data point group is projected on the model coordinate system according to the estimated localization and pose.
Measurement Results of Process According to the Present Invention
The three items of captured image data were then processed according to the process of the present invention.
In order to show various conditions of stereo correspondence, the present example has shown the results regarding the captured images.
Number | Date | Country | Kind |
---|---|---|---|
2010-067275 | Mar 2010 | JP | national |
2011-024715 | Feb 2011 | JP | national |