This application is based upon and claims the benefit of priority from Japanese patent application No. 2023-128099, filed on Aug. 4, 2023, the disclosure of which is incorporated herein in its entirety by reference.
The present disclosure relates to an information processing apparatus, an information processing method, and a computer-readable recording medium for estimating a camera position.
A technique is known for estimating three dimensional information of a camera from an image, in order to perform robot self-position estimation and three dimensional analysis on a subject. According to the technique, if internal parameters (e.g., focal length, lens distortion, etc.) of a camera and three dimensional coordinates of objects that constitute the shooting scene are both known, a rotation matrix and a translation vector of the camera can be estimated. In the following description, the rotation matrix and the translation vector of a camera are collectively referred to as a camera position.
One method of estimating a camera position from an image is to solve a PnP (Perspective-n-point) problem. In order to solve the PnP problem, a 3D point cloud of the scene is acquired in advance using a 3D (3 dimensional) scanner. Next, feature points are calculated from an image obtained by shooting the scene with a camera, and corresponding 3D points (2D-3D corresponding points) are determined. Finally, the camera position is estimated by solving the PnP problem using the plurality of determined 2D-3D corresponding points. Patent Document 1 (International Publication No. WO/2012/157342) describes a method of obtaining a global optimum solution of a PnP problem.
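As a concrete illustration of this pipeline, the following is a minimal sketch using OpenCV. The intrinsics, the synthetic scene, and the use of solvePnPRansac are assumptions made for the example and are not part of the above description.

```python
# Minimal sketch: estimate a camera position (rotation matrix and translation
# vector) by solving the PnP problem from 2D-3D corresponding points.
import numpy as np
import cv2

# Known internal parameters (focal length, principal point); lens distortion assumed zero.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)

# 3D points of the scene (e.g., from a 3D scanner); at least four non-degenerate points are needed.
points_3d = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 5.0],
                      [0.0, 1.0, 6.0], [1.0, 1.0, 7.0],
                      [0.5, 0.5, 5.5], [1.5, 0.2, 6.5]], dtype=np.float64)

# Ground-truth pose used here only to synthesize consistent 2D observations.
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.2, -0.1, 0.3])
points_2d, _ = cv2.projectPoints(points_3d, rvec_gt, tvec_gt, K, dist)
points_2d = points_2d.reshape(-1, 2)

# Solve the PnP problem from the 2D-3D corresponding points; the RANSAC variant tolerates outliers.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(points_3d, points_2d, K, dist)
R, _ = cv2.Rodrigues(rvec)                      # rotation matrix (3x3)
print("rotation matrix:\n", R)
print("translation vector:", tvec.ravel())      # (R, t) together form the camera position
```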
With a conventional camera position estimation method, it is difficult to stably and accurately estimate a camera position when the illumination condition changes or when a large error is included in the 3D points. Even with the method described in Patent Document 1, the camera position may not be accurately estimated under such conditions.
An example object of the present disclosure is to accurately calculate a camera position.
In order to achieve the example object described above, an information processing apparatus according to an example aspect includes:
Also, in order to achieve the example object described above, an information processing method that is performed by an information processing apparatus according to an example aspect includes:
Furthermore, in order to achieve the example object described above, a computer-readable recording medium according to an example aspect includes a program recorded on the computer-readable recording medium, the program including instructions that cause the computer to carry out:
According to the present disclosure, the camera position can be accurately calculated.
Hereinafter, an example embodiment will be described with reference to the drawings. Note that, in the drawings described below, the elements that have the same or corresponding functions are given the same reference numerals and description thereof may not be repeated.
The configuration of an information processing apparatus will first be described with reference to the drawings.
The information processing apparatus shown in the drawings includes an image generating unit 11 and a camera position correcting unit 12.
The image generating unit 11 generates, based on a camera position of a query image obtained by shooting a scene with the camera from the camera position and three dimensional information regarding the scene, a projection image and a depth image corresponding to shooting from the camera position.
The camera position is information representing a rotation matrix and a translation vector of the camera estimated using internal parameters (e.g., focal length, lens distortion, etc.) of the camera and three dimensional coordinates of objects constituting the scene.
The query image is an image obtained by shooting a scene with the camera, and is an input image to be used for estimating the camera position. The input image may be a color image represented by RGB values, or may also be a gray scale image represented only by luminance values.
The three dimensional information is implicit function three dimensional information, for example. The implicit function three dimensional information is information representing three dimensions using a nonlinear implicit function such as NeRF (Neural Radiance Fields) or SRN (Scene Representation Network), for example. Also, the three dimensional information has the same color representation (RGB or gray scale) as the query image, in addition to the three dimensional shape information regarding the scene.
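For illustration only, the following is a minimal sketch of an implicit-function scene representation in the spirit of NeRF: a small MLP maps a 3D point to an RGB color and a volume density. The network size, the absence of positional encoding, and the omission of the view direction are simplifications assumed for this example, not the configuration used in the disclosure.

```python
import torch
import torch.nn as nn

class ImplicitScene(nn.Module):
    """Toy implicit scene: 3D coordinate -> (RGB color, volume density)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())     # color in [0, 1]
        self.sigma_head = nn.Sequential(nn.Linear(hidden, 1), nn.Softplus())  # density >= 0

    def forward(self, xyz):                      # xyz: (N, 3) points in scene coordinates
        h = self.backbone(xyz)
        return self.rgb_head(h), self.sigma_head(h)

scene = ImplicitScene()
rgb, sigma = scene(torch.rand(1024, 3))          # query 1024 random points
print(rgb.shape, sigma.shape)                    # torch.Size([1024, 3]) torch.Size([1024, 1])
```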
The camera position correcting unit 12 corrects the camera position based on the depth image and a correspondence relationship between the query image and the projection image.
As described above, in the example embodiment, the camera position is corrected using the corrected depth image, and therefore the camera position can be accurately calculated.
Next, the configuration of the information processing apparatus 10 of the example embodiment will be described more specifically with reference to the drawings.
As shown in the drawings, the information processing apparatus 10 is connected, via a network, to a camera 20, a storage device 30, an input device 40, and an output device 50.
The network is an ordinary communication network constructed using a communication line such as the Internet, a LAN (Local Area Network), a dedicated line, a telephone line, an intranet, a mobile communication network, Bluetooth (registered trademark), or WiFi (Wireless Fidelity) (registered trademark), for example.
The information processing apparatus 10 is a CPU (Central Processing Unit), a programmable device such as an FPGA (Field-Programmable Gate Array), a GPU (Graphics Processing Unit), a circuit on which at least one of these devices is mounted, or an information processing apparatus such as a server computer, a personal computer, or a mobile terminal.
The camera 20 outputs images captured in time series to the information processing apparatus 10. A monocular camera (e.g., wide angle camera, fish-eye camera, omnidirectional camera, etc.), a compound eye camera (e.g., stereo camera, multi-camera, etc.), an RGB-D camera (e.g., depth camera, ToF camera, etc.), and the like are conceivable as the camera 20. Note that the camera 20 may also be provided in the information processing apparatus 10.
The storage device 30 is a database, a server computer, a circuit including a memory, or the like. In the example described here, the storage device 30 stores the query image, the camera position initial value, and the three dimensional information regarding the scene.
The input device 40 includes devices (user interfaces) such as a touch panel, a mouse, and a keyboard, for example. Note that the input device 40 may also be provided in the information processing apparatus 10.
The output device 50 acquires later-described output information that has been converted into a format that can be output, and outputs images, audio and the like generated based on this output information. The output device 50 is an image display device that uses liquid crystal, organic EL (ElectroLuminescence), or a CRT (Cathode Ray Tube). Furthermore, the image display device may include an audio output device such as a speaker, and the like. Note that the output device 50 may also be a printing device such as a printer. Note that the output device 50 may also be provided in the information processing apparatus 10.
The information processing apparatus will be described in detail.
The information processing apparatus 10 includes an image generating unit 11, a camera position correcting unit 12, a depth image correcting unit 13, a detecting unit 14, and an output information generating unit 15.
The image generating unit 11 generates, using a camera position initial value indicating an initial state of the camera position of a query image and three dimensional information regarding a scene captured by the camera, a projection image and a depth image by projecting the three dimensional information on an image plane based on the camera position.
The camera position initial value may be a random rotation matrix and a random translation vector, or the origin of a three dimensional coordinate system of the scene.
Also, a known image retrieval method such as VLAD (Vector of Locally Aggregated Descriptors) or BoW (Bag of Words) may be used in order to automatically obtain a camera position initial value that is as accurate as possible.
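The random initialization mentioned above can be realized, for example, as in the following sketch. Generating the rotation via QR decomposition is an assumption made for the example; any method that yields a valid rotation matrix may be used.

```python
import numpy as np

def random_rotation(rng):
    """Return a random proper rotation matrix (orthogonal, determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))          # make the factorization unique
    if np.linalg.det(q) < 0:          # ensure a proper rotation (det = +1)
        q[:, 0] *= -1.0
    return q

rng = np.random.default_rng(0)
R_init = random_rotation(rng)                 # rotation matrix initial value
t_init = rng.normal(size=3)                   # translation vector initial value
print(np.allclose(R_init @ R_init.T, np.eye(3)), np.linalg.det(R_init))
```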
The projection image is an image obtained by projecting three dimensional information on an image plane of the camera based on the input camera position. Note that the projection image has the same color representation as the query image.
The depth image is an image in which each pixel has a depth value in a coordinate system whose origin is at the camera. Note that the depth value may be used as it is, or an inverse of the depth value may also be used.
Specifically, the image generating unit 11 first acquires, from the storage device 30, a camera position initial value of a query image for estimating the camera position, and three dimensional information regarding a scene captured by the camera. Next, the image generating unit 11 generates, using the camera position initial value and the three dimensional information, a projection image and a depth image by projecting the three dimensional information on the image plane based on the camera position.
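As an illustration of this projection step, the following minimal sketch renders a projection image and a depth image from a colored point cloud using a simple z-buffer. The point-cloud representation and the helper name render_projection_and_depth are assumptions made for the example; the three dimensional information may instead be an implicit (NeRF-like) representation rendered by other means.

```python
import numpy as np

def render_projection_and_depth(points, colors, R, t, K, height, width):
    """Project a colored 3D point cloud onto the image plane of a camera (R, t, K)."""
    proj_img = np.zeros((height, width, 3), dtype=np.float32)
    depth_img = np.full((height, width), np.inf, dtype=np.float32)

    cam = points @ R.T + t                   # scene coordinates -> camera coordinates
    in_front = cam[:, 2] > 1e-6
    cam, colors = cam[in_front], colors[in_front]

    uvw = cam @ K.T                          # perspective projection
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)

    for ui, vi, zi, ci in zip(u[valid], v[valid], cam[valid, 2], colors[valid]):
        if zi < depth_img[vi, ui]:           # z-buffer: keep the nearest point per pixel
            depth_img[vi, ui] = zi
            proj_img[vi, ui] = ci
    return proj_img, depth_img

# Toy scene: random colored points in front of the camera.
rng = np.random.default_rng(0)
pts = rng.uniform([-1, -1, 3], [1, 1, 6], size=(5000, 3))
cols = rng.uniform(0, 1, size=(5000, 3))
K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
proj, depth = render_projection_and_depth(pts, cols, np.eye(3), np.zeros(3), K, 240, 320)
```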
The depth image correcting unit 13 corrects an error in the depth image by performing depth image correction processing using the projection image and the depth image.
Specifically, the depth image correcting unit 13 first acquires the projection image and the depth image. Next, the depth image correcting unit 13 corrects an error included in the depth image by performing the depth image correction processing using the projection image and the depth image.
The error is an inaccurate depth value occurring in the generated depth image, and may also be referred to as a so-called artifact.
The depth image correction processing is processing for correcting a depth image based on a projection image. Specifically, the depth image correction processing is processing for, using a neural network, for example, correcting a depth image using a depth image correction model for correcting an error included in the depth image.
Note that the depth image correction processing is not limited to the processing using a neural network, and may also be processing in which a machine learning method such as a support vector machine or a random forest is used. Also, the processing for correcting a depth image is not limited to a machine learning method, and may also be processing for performing correction according to a correction rule.
The depth image correction model is a machine learning model that receives a generated projection image and depth image as inputs, and outputs the projection image and depth image from both of which noise has been removed, or the depth image from which noise has been removed.
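A minimal sketch of such a depth image correction model is given below: a small convolutional network receives the projection image (3 channels) and the depth image (1 channel) and predicts a residual correction for the depth image. The architecture (a plain CNN with a residual output) is an assumption for illustration, not the network defined in the disclosure.

```python
import torch
import torch.nn as nn

class DepthCorrectionNet(nn.Module):
    """Toy depth image correction model: (projection image, depth image) -> corrected depth image."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, proj_img, depth_img):
        # proj_img: (B, 3, H, W), depth_img: (B, 1, H, W)
        x = torch.cat([proj_img, depth_img], dim=1)
        return depth_img + self.net(x)            # corrected depth = input + predicted residual

model = DepthCorrectionNet()
corrected = model(torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64))
print(corrected.shape)                            # torch.Size([2, 1, 64, 64])
```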
Training of the depth image correction model will be described with reference to the drawings.
In this example, a true value pair of a projection image and a depth image is first generated by computer graphics, and noise is added to each image of the pair.
If the sensor noise distribution of the camera is known, the noise is a random value based on that distribution; if the sensor noise distribution of the camera is not known, a random value based on a Gaussian distribution is used instead.
Next, training data is obtained by generating a plurality of such true value pairs of a projection image and a depth image, to each of which noise is added. Next, supervised learning is executed by inputting the plurality of pieces of generated training data to the depth image correction model, which removes the noise from the projection image and the depth image.
Note that computer graphics need not be used. For example, a plurality of projection images and depth images may be generated using NeRF models trained on various scenes, and correct answer pairs of a projection image and a depth image corresponding to the respective scenes may be generated from them.
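Under these assumptions, the supervised learning step could look like the following minimal sketch. The random tensors standing in for computer-graphics renderings, the stand-in model, the L1 loss, and the learning rate are all illustrative choices, not values given in the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in correction model (see the earlier DepthCorrectionNet sketch): it maps the
# concatenated projection image and depth image (4 channels) to a corrected depth image.
model = nn.Sequential(
    nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
noise_std = 0.05                                  # used when the sensor noise model is unknown

for step in range(100):
    # Placeholder true value pair; in practice these are computer-graphics or NeRF renderings.
    proj_gt = torch.rand(8, 3, 64, 64)
    depth_gt = torch.rand(8, 1, 64, 64)

    # Add noise drawn from a Gaussian distribution (or from the known sensor noise model).
    proj_noisy = proj_gt + noise_std * torch.randn_like(proj_gt)
    depth_noisy = depth_gt + noise_std * torch.randn_like(depth_gt)

    depth_pred = model(torch.cat([proj_noisy, depth_noisy], dim=1))
    loss = F.l1_loss(depth_pred, depth_gt)        # supervise against the true depth image

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```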
The detecting unit 14 detects first image feature points included in the query image and second image feature points included in the projection image, and detects first corresponding points (2D-2D corresponding points: first corresponding point group) corresponding to both the first image feature points (first image feature point group) and the second image feature points (second image feature point group).
The 2D-2D corresponding point is information representing a matching pair of a first image feature point and a second image feature point. A method robust to illumination change, such as SIFT (Scale Invariant Feature Transform) or SuperPoint, is used to acquire the 2D-2D corresponding points.
Note that the 2D-2D corresponding points may also be manually designated by a user. A method that removes erroneous corresponding points, such as RANSAC (Random Sample Consensus), may also be used in the acquisition.
Specifically, the detecting unit 14 first acquires a query image and a projection image. Next, the detecting unit 14 detects first image feature points included in the query image and second image feature points included in the projection image. Next, the detecting unit 14 detects 2D-2D corresponding points (first corresponding points) corresponding to both the detected first image feature points and second image feature points.
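A minimal sketch of this detection step with OpenCV is shown below. The file names are placeholders, and the ratio test and the fundamental-matrix-based RANSAC filtering are illustrative choices rather than the method fixed by the disclosure.

```python
import cv2
import numpy as np

query = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)        # placeholder paths
proj = cv2.imread("projection.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(query, None)                # first image feature points
kp2, des2 = sift.detectAndCompute(proj, None)                 # second image feature points

matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]   # Lowe's ratio test

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# RANSAC removes outlier matches; the surviving pairs are the 2D-2D corresponding points.
F_mat, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
inlier = mask.ravel().astype(bool) if mask is not None else np.ones(len(good), dtype=bool)
corr_2d2d = list(zip(pts1[inlier], pts2[inlier]))             # (query pixel, projection pixel) pairs
```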
The camera position correcting unit 12 first acquires the corrected depth image and the 2D-2D corresponding points (first corresponding points). Next, the camera position correcting unit 12 calculates a correction value for correcting the camera position by performing camera position correction processing using the corrected depth image and the 2D-2D corresponding points.
In the camera position correction processing, first, 2D-3D corresponding points (second corresponding points: second corresponding point group) between the query image and the projection image are acquired using the corrected depth image and the 2D-2D corresponding points (first corresponding points). Specifically, the pixels of the projection image are respectively associated with the depth values of the corresponding pixels of the corrected depth image. In this manner, 2D-3D corresponding points between the query image and the projection image are obtained.
Next, the correction value is calculated by performing processing for solving a PnP (Perspective-n-point) problem using the 2D-3D corresponding points (second corresponding points). Specifically, the camera position of the query image in a camera coordinate system in which the projection image is at the origin is calculated using a known PnP problem solving method. Here, the calculated camera position is equivalent to the difference (correction value) in the camera position from the projection image to the query image. Therefore, when the correction value (difference) is small, the difference in the rotation matrix approaches a unit matrix, and the difference in the translation vector approaches a zero vector.
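The back-projection and PnP steps just described could be realized as in the following sketch. The helper name, the use of solvePnPRansac, and measuring the correction by the rotation angle and the translation norm are assumptions made for illustration.

```python
import numpy as np
import cv2

def correction_from_correspondences(corr_2d2d, depth_corrected, K):
    """Lift 2D-2D corresponding points to 2D-3D points via the corrected depth image,
    then solve the PnP problem to obtain the camera position correction."""
    K_inv = np.linalg.inv(K)
    pts_2d, pts_3d = [], []
    for (uq, vq), (up, vp) in corr_2d2d:                    # (query pixel, projection pixel)
        z = depth_corrected[int(round(vp)), int(round(up))]
        if not np.isfinite(z) or z <= 0:
            continue
        ray = K_inv @ np.array([up, vp, 1.0])
        pts_3d.append(ray * z)                              # 3D point in the projection camera frame
        pts_2d.append([uq, vq])                             # matching query-image pixel
    pts_3d = np.asarray(pts_3d, dtype=np.float64)
    pts_2d = np.asarray(pts_2d, dtype=np.float64)

    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
    R_delta, _ = cv2.Rodrigues(rvec)                        # difference in rotation
    angle = np.linalg.norm(rvec)                            # radians; 0 when R_delta is the unit matrix
    shift = np.linalg.norm(tvec)                            # 0 when the translation difference vanishes
    return R_delta, tvec.ravel(), angle, shift
```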
Next, if the correction value is larger than a preset threshold value, the camera position correcting unit 12 corrects the camera position using the correction value. Specifically, the camera position correcting unit 12 adds the correction value to a camera position initial value or the current camera position.
Next, the camera position correcting unit 12 outputs the corrected camera position to the image generating unit 11. In contrast, if the correction value is the threshold value or less, the camera position corrected up to that point is adopted. That is, the processing is repeated until the correction value becomes the threshold value or less.
The output information generating unit 15 generates output information for performing display on the output device 50 by combining at least one of the query image, the projection image, the depth image, and a three dimensional spatial representation of the camera position.
Thereafter, the output information generating unit 15 outputs the output information to the output device 50. Note that the output information generating unit 15 may not be provided in the information processing apparatus 10.
Next, operations of the information processing apparatus in the example embodiment will be described with reference to the drawings.
As shown in the drawings, first, the image generating unit 11 generates, using the camera position initial value of the query image and the three dimensional information regarding the scene captured by the camera, a projection image and a depth image by projecting the three dimensional information on the image plane based on the camera position (step A1).
Next, the depth image correcting unit 13 corrects an error in the depth image by performing depth image correction processing using the projection image and the depth image (step A2).
Next, the detecting unit 14 detects first image feature points included in the query image and second image feature points included in the projection image, and detects 2D-2D corresponding points (first corresponding points) corresponding to both the first image feature points and the second image feature points (step A3). Note that the processing order of step A2 and step A3 described above may be reversed.
Next, the camera position correcting unit 12 acquires 2D-3D corresponding points (second corresponding points) between the query image and the projection image using the corrected depth image and the first corresponding points (step A4).
Next, the camera position correcting unit 12 calculates a correction value by performing processing for solving a PnP problem using the 2D-3D corresponding points (step A5).
Next, the camera position correcting unit 12 determines whether or not the correction value is larger than a threshold value (step A6). If the correction value is larger than the threshold value (step A6: Yes), the camera position correcting unit 12 corrects the camera position using the correction value (step A7).
Specifically, in step A7, the camera position correcting unit 12 adds the correction value to the camera position initial value or the current camera position. Then, the processing from step A1 is executed again using the corrected camera position.
In contrast, if the correction value is the threshold value or less (step A6: No), the camera position corrected up to that point is adopted, and the iteration is ended. Thereafter, the output information generating unit 15 generates output information, and outputs the generated output information to the output device 50 (step A8).
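Putting steps A1 to A7 together, the overall loop could be organized as in the following skeleton. Here correct_depth, match_features, query_img, the thresholds, and the way the correction is composed with the current camera position (composition rather than literal addition) are hypothetical names and illustrative choices, while render_projection_and_depth and correction_from_correspondences refer to the earlier sketches.

```python
# Skeleton of the iterative camera position estimation loop (steps A1 to A7).
ANGLE_THRESH = 1e-3      # radians; illustrative threshold
SHIFT_THRESH = 1e-3      # scene units; illustrative threshold
MAX_ITERS = 20

R, t = R_init, t_init                                        # camera position initial value
for _ in range(MAX_ITERS):
    proj_img, depth_img = render_projection_and_depth(pts, cols, R, t, K, 240, 320)   # A1
    depth_corrected = correct_depth(proj_img, depth_img)     # A2: hypothetical correction-model wrapper
    corr_2d2d = match_features(query_img, proj_img)          # A3: hypothetical feature-matching wrapper
    R_delta, t_delta, angle, shift = correction_from_correspondences(                 # A4, A5
        corr_2d2d, depth_corrected, K)
    if angle <= ANGLE_THRESH and shift <= SHIFT_THRESH:      # A6: correction small enough
        break                                                # adopt the current (R, t)
    R, t = R_delta @ R, R_delta @ t + t_delta                # A7: apply the correction and repeat
```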
As described above, according to the example embodiment, the camera position is corrected using the corrected depth image and the 2D-2D corresponding point, and therefore the camera position can be accurately calculated. Also, it is suitable for an application in which the camera position is estimated from an image.
Modification 1 will be described with reference to the drawings.
The depth image correction model of Modification 1 is a machine learning model that receives a generated projection image and depth image as inputs, and outputs the projection image and depth image from both of which noise has been removed, or the depth image from which noise has been removed.
In Modification 1, the depth image correction model is trained using, as training data, projection images and depth images obtained by actual shooting with an RGB-D sensor.
According to Modification 1, the true value of the depth image need not be generated in advance, and therefore real data obtained by shooting with an RGB-D sensor can be used for learning, in addition to simulation by computer graphics.
The program according to the example embodiment may be a program that causes a computer to execute steps A1 to A8 shown in the drawings.
Also, the program according to the embodiment may be executed by a computer system constructed by a plurality of computers. In this case, for example, each computer may function as any of the image generating unit 11, the camera position correcting unit 12, the depth image correcting unit 13, the detecting unit 14, and the output information generating unit 15.
Here, a computer that realizes the information processing apparatus by executing the program according to the example embodiment and Modification 1 will be described with reference to the drawings.
As shown in the drawings, the computer 110 includes a CPU 111, a main memory 112, a storage device 113, an input interface 114, a display controller 115, a data reader/writer 116, and a communications interface 117, which are connected so as to be capable of data communication with one another.
The CPU 111 loads the program (code) according to this example embodiment, which is stored in the storage device 113, into the main memory 112, and executes it in a predetermined order, thereby performing various operations. The main memory 112 is typically a volatile storage device such as a DRAM (Dynamic Random Access Memory).
Also, the program according to the example embodiment and Modification 1 is provided in a state of being stored in a computer-readable recording medium 120. Note that the program according to the example embodiment and Modification 1 may be distributed over the Internet, to which the computer is connected through the communications interface 117.
Also, other than a hard disk drive, a semiconductor storage device such as a flash memory can be given as a specific example of the storage device 113. The input interface 114 mediates data transmission between the CPU 111 and an input device 118, which may be a keyboard or mouse. The display controller 115 is connected to a display device 119, and controls display on the display device 119.
The data reader/writer 116 mediates data transmission between the CPU 111 and the recording medium 120, and executes reading of a program from the recording medium 120 and writing of processing results in the computer 110 to the recording medium 120. The communications interface 117 mediates data transmission between the CPU 111 and other computers.
Also, general-purpose semiconductor storage devices such as CF (Compact Flash (registered trademark)) and SD (Secure Digital), a magnetic recording medium such as a Flexible Disk, or an optical recording medium such as a CD-ROM (Compact Disk Read-Only Memory) can be given as specific examples of the recording medium 120.
The information processing apparatus 10 according to the example embodiment and Modification 1 can also be achieved using hardware corresponding to the respective components, instead of a computer in which a program is installed. Furthermore, a part of the information processing apparatus 10 may be realized by a program, and the remaining part may be realized by hardware. In the example embodiment and Modification 1, the computer is not limited to the computer shown in the drawings.
The following supplementary notes are further disclosed in relation to the above-described example embodiment. Part or all of the above-described example embodiment can be expressed as, but is not limited to, (Supplementary note 1) to (Supplementary note 21) described below.
(Supplementary note 1)
An information processing apparatus comprising:
The information processing apparatus according to supplementary note 1, further comprising
The information processing apparatus according to supplementary note 2,
The information processing apparatus according to supplementary note 3,
The information processing apparatus according to supplementary note 1, further comprising
The information processing apparatus according to supplementary note 5,
The information processing apparatus according to any one of supplementary notes 1 to 4,
An information processing method that is performed by an information processing apparatus, the method comprising:
The information processing method according to supplementary note 8,
The information processing method according to supplementary note 9,
The information processing method according to supplementary note 10,
The information processing method according to supplementary note 8,
The information processing method according to supplementary note 12,
The information processing method according to any one of supplementary notes 8 to 11, wherein the three dimensional information is information representing the scene in three dimensions using a nonlinear implicit function.
(Supplementary note 15)
A computer-readable recording medium that includes a program including instructions recorded thereon, the instructions causing a computer to carry out:
The computer readable recording medium according to supplementary note 15,
The computer readable recording medium according to supplementary note 16,
The computer readable recording medium according to supplementary note 17,
The computer readable recording medium according to supplementary note 15,
The computer readable recording medium according to supplementary note 19,
The computer readable recording medium according to any one of supplementary notes 15 to 18,
Although the invention has been described with reference to the example embodiment and Modification 1, the invention is not limited to the example embodiment and Modification 1 described above. Various changes that can be understood by a person skilled in the art can be made to the configuration and details of the invention within the scope of the invention.
According to the description above, the camera position can be accurately calculated. In addition, the present disclosure is useful in fields where camera position calculation is required.
While the invention has been particularly shown and described with reference to exemplary example embodiments thereof, the invention is not limited to these example embodiments. It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the claims.
Number | Date | Country | Kind |
---|---|---|---|
2023-128099 | Aug 2023 | JP | national |