The present invention relates to an image processing technique to generate data corresponding to a virtual viewpoint from multi-viewpoint images.
There is a technique called NeRF (Neural Radiance Fields) that generates an image corresponding to an arbitrary virtual viewpoint by taking, as an input, multi-viewpoint images whose camera parameters are already known (“Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. 3 Aug. 2020.”, U.S. Pat. No. 11,308,659). NeRF is a neural network that outputs a volume density σ and an emitted radiance from a five-dimensional input variable {arbitrary spatial position coordinates (x, y, z) and direction (θ, φ)}. In the learning of this neural network, the pixel values of the multi-viewpoint images are taken as correct pixel values and the difference from the pixel values of the rendering results is taken as a loss. Consequently, in the learning process, drawing (rendering), loss calculation, and error backpropagation are performed as many times as the number of images included in the multi-viewpoint images, and these pieces of processing are further repeated; therefore, much time is necessary for the learning. For example, it is known that at least 12 hours are necessary to learn a scene in which 1K images from 100 viewpoints are taken as correct images.
For this problem of learning requiring much time, a method called VaxNeRF has been proposed, which attempts to increase the speed by utilizing the visual hull method (see “Naruya Kondo, Yuya Ikeda, Andrea Tagliasacchi, Yutaka Matsuo, Yoichi Ochiai, Shixiang Shane Gu. VaxNeRF: Revisiting the Classic for Voxel-Accelerated Neural Radiance Field. Nov. 25, 2021”). The visual hull method is a method of obtaining a three-dimensional shape of an object extracted from multi-viewpoint images by back-projecting the silhouette of the object onto a three-dimensional space, configuring a pyramid-like body (including a body whose base is not a square) from each viewpoint, and finding the intersecting portions of the pyramid-like bodies. The visual hull method has the feature that, on the assumption that the extracted silhouette is correct, it is guaranteed that no object exists outside the obtained three-dimensional shape. By utilizing this feature, VaxNeRF suppresses the amount of calculation at the time of learning and increases the speed of learning by limiting the sampling points at the time of rendering to within the three-dimensional shape obtained by the visual hull method, as well as by not utilizing the pixels located outside the silhouette for learning.
For example, in a case where the target is a scene in which many objects are sparsely distributed over a wide image capturing area, such as a game of soccer, a huge amount of time is still necessary for learning even with the method of VaxNeRF.
The present invention has been made in view of the above-described problem and an object is to perform learning for generating an image or the like corresponding to a virtual viewpoint from multi-viewpoint images at a higher speed.
The image processing apparatus according to the present invention is an image processing apparatus including: one or more memories storing instructions; and one or more processors executing the instructions to perform: obtaining a plurality of captured images obtained by a plurality of imaging devices; generating rough shape data representing a three-dimensional shape of an object based on the obtained plurality of captured images; setting a learning area for each object based on the generated rough shape data; and learning a three-dimensional field in accordance with the captured image for obtaining virtual viewpoint data or a three-dimensional shape of an object corresponding to a virtual viewpoint from the plurality of captured images by taking the learning area set for each object as a target.
Further features of the present invention will become apparent from the following description of exemplary embodiments with reference to the attached drawings.
Hereinafter, with reference to the attached drawings, the present invention is explained in detail in accordance with preferred embodiments. Configurations shown in the following embodiments are merely exemplary and the present invention is not limited to the configurations shown schematically.
In the following, the present embodiments are explained with reference to the drawings. The following embodiments do not necessarily limit the present invention. Further, not all combinations of the features explained in the present embodiments are necessarily indispensable to the solution of the present invention. Furthermore, as the way of thinking common to each embodiment, in the present invention, a rough three-dimensional shape (rough shape) of an object is utilized as in VaxNeRF. Then, a three-dimensional field (what this “field” represents differs depending on the learning contents; in the following, it is described as “three-dimensional field” in the present specification) within the learning-target image capturing space is defined independently for each object and learning is performed. Due to this, the amount of information each three-dimensional field has is reduced and the convergence of learning for each three-dimensional field is expedited, and thereby high-speed learning is implemented.
In the present embodiment, one or a plurality of objects is captured from the periphery by using eight cameras installed in a studio. Further, it is assumed that camera parameters, such as the intrinsic parameters, extrinsic parameters, and distortion parameters of the cameras, are stored in the storage unit 204. The intrinsic parameters represent the coordinates of the image center and the lens focal length, and the extrinsic parameters represent the position and orientation of the camera. The camera parameters do not need to be common to all the cameras; for example, the viewing angle may differ from camera to camera.
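For reference, the sketch below illustrates, under stated assumptions, how intrinsic and extrinsic parameters such as those above are typically combined to project a world point onto a camera image; the matrix shapes, the function name, and the omission of lens distortion are illustrative assumptions and do not represent the actual data format handled by the apparatus.

```python
import numpy as np

def project_point(X_world, K, R, t):
    """Project a 3D world point onto the image plane of one camera.

    K: 3x3 intrinsic matrix (image center and focal length).
    R, t: extrinsics, i.e. rotation (3x3) and translation (3,) from world to camera coordinates.
    Returns pixel coordinates (u, v) and the depth in the camera frame.
    """
    X_cam = R @ X_world + t          # world -> camera coordinates
    x = K @ X_cam                    # camera -> homogeneous image coordinates
    u, v = x[0] / x[2], x[1] / x[2]  # perspective division
    return u, v, X_cam[2]            # depth = z in the camera frame
```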
Following the above, the operation of the image processing apparatus 102 according to the present embodiment is explained.
At S401, the image input unit 301 receives and obtains multi-viewpoint images from the plurality of the cameras 101 via the input I/F 206. Alternatively, it may also be possible for the image input unit 301 to read and obtain the data of multi-viewpoint images stored in the storage unit 204. The obtained multi-viewpoint images are stored in the RAM 202.
At S402, the rough shape generation unit 302 generates rough shape data representing a rough three-dimensional shape of an object captured in the obtained multi-viewpoint images. In the present embodiment, as the rough shape data, the shape of the object is derived by the visual hull method and three-dimensional shape data represented as a voxel set is generated. In the visual hull method, first, for each captured image configuring the multi-viewpoint images, foreground/background separation is performed by using an image (background image) in the state where no object is captured, and an image (silhouette image) representing the silhouette of the object is obtained. The background image is prepared by, for example, capturing the image capturing area 106 in advance in the state where no object exists. Next, based on the camera parameters of each of the plurality of the cameras 101, each voxel included in the voxel set corresponding to the image capturing area 106 is projected onto each silhouette image of the object obtained from the multi-viewpoint images. Then, the voxel set including only the voxels that are projected within the silhouette in all the silhouette images is taken as the rough shape of the object.
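The following is a minimal sketch of this voxel-carving step, assuming the silhouette images are given as boolean masks and the camera parameters as (K, R, t) tuples; the array layout and function names are assumptions for illustration only, not the apparatus's actual implementation.

```python
import numpy as np

def visual_hull(voxel_centers, silhouettes, cameras):
    """Keep only the voxels whose projection falls inside the silhouette in every view.

    voxel_centers: (N, 3) world coordinates of candidate voxel centers.
    silhouettes:   list of (H, W) boolean masks, one per camera (True = object silhouette).
    cameras:       list of (K, R, t) tuples for the corresponding viewpoints.
    Returns a boolean array marking the voxels retained as the rough shape.
    """
    keep = np.ones(len(voxel_centers), dtype=bool)
    for mask, (K, R, t) in zip(silhouettes, cameras):
        X_cam = voxel_centers @ R.T + t                  # world -> camera coordinates
        x = X_cam @ K.T                                  # project with the intrinsics
        u = np.round(x[:, 0] / x[:, 2]).astype(int)
        v = np.round(x[:, 1] / x[:, 2]).astype(int)
        h, w = mask.shape
        inside = (X_cam[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(voxel_centers), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]]         # projected within the silhouette?
        keep &= hit                                      # must hold for every silhouette image
    return keep
```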
At S403, based on the rough shape data of each object generated at S402, the learning area setting unit 303 sets, for each object, a rectangular parallelepiped circumscribing the rough shape of the object as the target three-dimensional area (learning area) for which learning is performed. The shape of this learning area only needs to be a solid body containing the rough shape of the object and may be, for example, a sphere or an ellipsoid. Alternatively, it may also be possible to set, as the learning area, a three-dimensional area having a rougher shape, which includes a set of voxels larger than the voxels that are the elements configuring the rough shape data.
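A minimal sketch of setting such a rectangular-parallelepiped learning area as an axis-aligned bounding box of one object's rough-shape voxels follows; the optional margin parameter is a hypothetical addition for illustration, not something the embodiment specifies.

```python
import numpy as np

def learning_area_bbox(voxel_centers, voxel_size, margin=0.0):
    """Axis-aligned rectangular parallelepiped circumscribing one object's rough shape.

    voxel_centers: (N, 3) centers of the voxels belonging to this object.
    Returns (min_corner, max_corner) of the learning area, optionally padded by a margin.
    """
    half = voxel_size / 2.0
    min_corner = voxel_centers.min(axis=0) - half - margin
    max_corner = voxel_centers.max(axis=0) + half + margin
    return min_corner, max_corner
```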
At S404, the three-dimensional field updating unit 304 secures, within the RAM 202 as the three-dimensional field storage unit 305, a memory area corresponding to the learning area set for each object at S403. Here, each following step is explained on the assumption that the “three-dimensional field” is the emitted radiance field in NeRF (a vector field that associates a volume density (≈ occupancy) and an emitted radiance (≈ color) with each coordinate in space). In a case where the three-dimensional field is the emitted radiance field of NeRF, the value representing the volume density (in the following, simply described as “density”) at an arbitrary position within space and the value representing the anisotropic color that differs for each direction are stored in the secured memory area.
At S405, the rendering unit 306 draws an image (performs volume rendering for an image) corresponding to each image capturing viewpoint having the same viewing angle as that of each captured image based on the camera parameters corresponding to each captured image configuring the multi-viewpoint images and the emitted radiance field stored in the three-dimensional field storage unit 305. Specifically, the rendering unit 306 performs processing to find a pixel value C (r) of each pixel, which corresponds to a ray r in a case of being viewed from the same viewpoint as that of the captured image, by using, for example, formula (1) below.
In formula (1) described above, “i” indicates the index of the sampling point, “σi” indicates the density, “ci” indicates the color, and “δi” indicates the distance to the next sampling point. Here, the index i is assigned to the sampling points along the ray r in order from the sampling point closest to the camera (that is, closest to the front). By formula (1) described above, the pixel values (RGB values) are determined by a weighted sum in which the weight is heavier for a sampling point whose density is higher and which is closer to the camera. In the three-dimensional field storage unit 305 immediately after the processing is started, there is no learned three-dimensional field (here, the emitted radiance field of NeRF); consequently, a pitch-black image in which nothing is captured is obtained as the rendering result. In this manner, the rendered image corresponding to each captured image is obtained. The obtained rendered image is output to the three-dimensional field updating unit 304.
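Formula (1) itself is not reproduced in this text. The sketch below assumes it is the standard NeRF volume-rendering quadrature, C(r) = Σ_i T_i (1 − exp(−σ_i δ_i)) c_i with T_i = exp(−Σ_{j<i} σ_j δ_j), which is consistent with the variables and the front-weighted behavior described above; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite the sampling points along one ray r into a pixel value C(r).

    sigmas: (N,) volume densities at the sampling points, ordered front to back.
    colors: (N, 3) RGB emitted radiance at the sampling points.
    deltas: (N,) distances to the next sampling point.
    """
    tau = sigmas * deltas
    alpha = 1.0 - np.exp(-tau)                                 # opacity of each interval
    trans = np.exp(-np.concatenate(([0.0], np.cumsum(tau)[:-1])))  # T_i: light surviving to point i
    weights = trans * alpha                                    # heavier for dense, front samples
    return (weights[:, None] * colors).sum(axis=0)             # weighted sum -> RGB pixel value
```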
At S406, the three-dimensional field updating unit 304 performs processing to update the emitted radiance field so that the color difference between the captured image and the rendered image in a correspondence relationship with each other becomes small, by finding that color difference while taking the learning area of interest among the learning areas set at S403 as a target. This processing corresponds to the error calculation and the error backpropagation in deep learning. In the present embodiment, the color difference between the two images is calculated for each pair of pixels in a correspondence relationship and its value is defined as the squared Euclidean distance of the colors (RGB).
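A minimal sketch of the per-pixel color error described above (the squared Euclidean distance of the RGB values) follows; how the resulting loss is backpropagated into the field depends on the field's representation and is omitted here.

```python
import numpy as np

def color_loss(rendered, captured):
    """Per-pixel squared Euclidean distance between rendered and captured RGB values.

    rendered, captured: (H, W, 3) float arrays of images in a correspondence relationship.
    Returns the per-pixel error map and its total, used as the loss to be reduced.
    """
    diff = rendered - captured
    per_pixel = (diff ** 2).sum(axis=-1)   # squared Euclidean distance of the colors
    return per_pixel, per_pixel.sum()
```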
At S407, whether the updating processing of the emitted radiance field is completed by taking all the learning areas set at S403 as a target is determined. In a case where there is a learning area for which the updating processing of the emitted radiance field has not been completed yet, the processing returns to S405, and the next learning area of interest is determined and the same processing is performed. On the other hand, in a case where the updating processing of the emitted radiance field is completed by taking all the learning areas as a target, the processing moves to S408.
At S408, whether the updating has converged sufficiently for all the emitted radiance fields is determined. In order to determine whether the updating has converged, for example, the total of the errors found for each pixel at all the viewpoints is found, and in a case where the reduction width from the total value found the previous time becomes lower than a threshold value (for example, 0.1%), it is determined that the updating has converged. Further, it may also be possible to determine convergence in a case where the error found for each pixel becomes smaller than a predetermined threshold value, or to determine convergence at the point in time at which the counted number of times of the updating processing of the emitted radiance field reaches a predetermined number of times. Further, for example, it may also be possible to take part of the plurality of cameras as cameras for evaluation not used for the updating of the emitted radiance field and to determine convergence in a case where the error with respect to the captured images of the cameras for evaluation begins to increase (that is, overlearning occurs). Furthermore, it may also be possible to determine convergence by combining the above. In a case where the updating has converged, this processing is terminated; in a case where the updating has not converged yet, the processing returns to S405 and the same processing is repeated.
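A sketch of one possible convergence test combining the 0.1% reduction-width criterion and an upper bound on the number of updates mentioned above; the function name and the history format are assumptions.

```python
def has_converged(total_errors, rel_threshold=0.001, max_updates=None):
    """Convergence test based on the relative reduction of the total per-pixel error.

    total_errors: history of the error totaled over all viewpoints, one entry per update.
    rel_threshold: 0.001 corresponds to the 0.1% reduction width mentioned in the text.
    max_updates: optional cap on the counted number of updating iterations.
    """
    if max_updates is not None and len(total_errors) >= max_updates:
        return True
    if len(total_errors) < 2:
        return False                       # not enough history to measure a reduction
    prev, curr = total_errors[-2], total_errors[-1]
    return (prev - curr) / prev < rel_threshold
```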
The above is the flow of the operation in the image processing apparatus 102 according to the present embodiment.
<Difference from Prior Art>
Here, the difference in the method between the present embodiment and both NeRF and VaxNeRF is explained by taking a case (see
As above, according to the present embodiment, the learning area is set for each object based on the rough shape of the object and the emitted radiance field defined independently for each object is updated. Due to this, it is possible to reduce the amount of information each emitted radiance field has, and therefore, convergence is expedited and it is possible to perform learning at a higher speed.
Next, as a second embodiment, a method is explained that performs learning at a higher speed by using Plenoxels (Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, Angjoo Kanazawa. Plenoxels: Radiance Fields without Neural Networks. 9 Dec. 2021) as the representation format of the emitted radiance field defined independently for each object. Plenoxels is a method of representing the emitted radiance field by direct parameters and optimizing those parameters without using a neural network. Consequently, it is made possible to control the volume density and the color value for an arbitrary position within space more directly. In the following, points different from those of the first embodiment are mainly explained.
Before explaining the present embodiment, the outline of Plenoxels is explained. In Plenoxels, first, space is divided into rough voxel grids.
In formula (2) described above, “Lrecon” is the term that reduces the difference from the pixel values of the captured image and “LTV” is the term that reduces the difference in value between neighboring parameters. For the optimization of the parameters by the objective function, for example, the RMSProp method, the steepest descent method, the Adam method, the SGD method, or the like is used.
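Formula (2) is likewise not reproduced in this text. The sketch below assumes the usual Plenoxels-style objective L = Lrecon + λTV · LTV, where Lrecon penalizes the difference from the captured pixel values and LTV penalizes differences between neighboring voxel parameters; the weight λTV, the grid layout, and the function name are assumptions.

```python
import numpy as np

def plenoxels_objective(rendered, captured, grid, lambda_tv=1e-4):
    """Objective combining a reconstruction term and a total-variation term.

    rendered, captured: (R, 3) rendered RGB values for a batch of rays and their ground truth.
    grid: (X, Y, Z, C) voxel grid of parameters (density and SH color coefficients).
    """
    # Lrecon: difference from the pixel values of the captured image
    l_recon = np.mean(np.sum((rendered - captured) ** 2, axis=-1))
    # LTV: squared differences between neighboring voxels along each spatial axis
    l_tv = (np.mean((grid[1:, :, :] - grid[:-1, :, :]) ** 2)
            + np.mean((grid[:, 1:, :] - grid[:, :-1, :]) ** 2)
            + np.mean((grid[:, :, 1:] - grid[:, :, :-1]) ** 2))
    return l_recon + lambda_tv * l_tv
```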
Then, the emitted radiance field by the rough voxel grid shown in
In the present embodiment, the process of the above-described optimization in Plenoxels is simplified by using the visual hull method. Here, the estimation of the “rough emitted radiance field” in Plenoxels plays the same role as the obtaining of the rough shape explained in the first embodiment. That is, it is possible to define the object shape by finer voxels in accordance with the rough shape obtained by the rough shape generation unit 302. Further, by finding in advance the initial values of the parameters to be allocated to each fine voxel, the optimization is caused to converge at a higher speed. Further, in the present embodiment, a method is also explained which further reduces the number of pixels to be utilized for learning by utilizing the visibility determination (shielding determination) of an object. This way of thinking of further reducing the number of pixels to be utilized for learning by utilizing the visibility determination can also be applied to the first embodiment.
Following the above, the operation of the image processing apparatus 102 according to the present embodiment is explained.
S1201 and S1202 are the same as S401 and S402 in the flow in
At S1203, based on the rough shape obtained at S1202, the learning area setting unit 303 sets a three-dimensional area including rough voxels as the learning area in which updating of the emitted radiance field is performed.
At S1204, the three-dimensional field updating unit 304 secures a memory area corresponding to the learning area set for each object at S1203. Specifically, a memory area for storing the parameters representing the emitted radiance field for the fine voxels obtained by dividing the rough voxels is secured within the RAM 202 as the three-dimensional field storage unit 305.
At S1205, the three-dimensional field updating unit 304 performs processing to calculate the initial value of the emitted radiance field defined for each object. Details of this initial value calculation processing will be described later.
At S1206, the rendering unit 306 performs the same processing as that at S405 in the flow in
At S1207, the rendering unit 306 finds the difference in the pixel value between the captured image and the rendered image in a correspondence relationship with each other by taking the learning area of interest as a target and performs processing to update the emitted radiance field so that the difference in the pixel value becomes small. At this time, at S1206 that is performed first immediately after the processing starts, the initial values generated at S1205 are used. These initial values are set based only on the pixels determined to be visible by the visibility determination, to be described later.
S1208 and S1209 are the same as S407 and S408 in the flow in
The above is the flow of the operation in the image processing apparatus 102 according to the present embodiment.
Following the above, with reference to the flowchart in
At S1301, the volume density of the emitted radiance field is initialized.
Specifically, to each voxel configuring the fine voxel grid, σ=1 is allocated in a case where the voxel is located inside the rough shape, σ=0 in a case where it is located outside, and σ=0.5 in a case where it is located on the surface.
At S1302, the surface voxels in the rough shape obtained at S1202 are extracted. This extraction of the surface voxels is implemented by, for example, selecting the voxels that are located inside the rough shape and adjacent to a voxel located outside the rough shape.
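A minimal sketch covering S1301 and S1302 together: classifying the fine voxels as interior, surface, or outside from the rough shape and allocating the initial densities σ = 1 / 0.5 / 0 described above. The use of SciPy's binary erosion for the adjacency test is an implementation assumption.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def init_density(inside):
    """Initialize the volume density of the fine voxel grid from the rough shape.

    inside: (X, Y, Z) boolean array, True for voxels located inside the rough shape.
    Surface voxels are inside voxels adjacent to an outside voxel (S1302); they receive
    sigma = 0.5, interior voxels sigma = 1.0, and outside voxels sigma = 0.0 (S1301).
    """
    interior = binary_erosion(inside)        # inside voxels with no outside neighbor
    surface = inside & ~interior             # inside voxels touching the outside
    sigma = np.zeros(inside.shape, dtype=np.float32)
    sigma[interior] = 1.0
    sigma[surface] = 0.5
    return sigma, surface
```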
At S1303, the surface voxels of all the objects extracted at S1302 are projected onto the image plane by using the camera parameters of all the image capturing viewpoints of the multi-viewpoint images, and a depth map is generated for every image capturing viewpoint. The generated depth maps are stored in the RAM 202.
At S1304, for each surface voxel, processing is performed which determines the visibility from each viewpoint by projecting the center coordinates of the surface voxel onto the image capturing viewpoint and comparing the depth value (d′) obtained by the projection with the depth value (d) in the depth map. Specifically, in a case where d′ ≤ d + m, it is determined that the surface voxel is visible (not shielded). Here, “m” is a constant slightly larger than the size of a voxel in the fine voxel grid; for example, “m” takes a constant of 1 to 2 mm.
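A sketch of this visibility (shielding) test d′ ≤ d + m for one surface voxel at one viewpoint; the margin value and its unit (metres) are assumptions for illustration.

```python
import numpy as np

def is_visible(voxel_center, K, R, t, depth_map, margin=0.0015):
    """Visibility (shielding) test for one surface voxel at one image capturing viewpoint.

    The voxel center is projected with the viewpoint's camera parameters; the projected
    depth d' is compared with the depth d stored in that viewpoint's depth map.
    margin corresponds to the constant m (assumed here to be in metres, e.g. 1 to 2 mm).
    """
    X_cam = R @ voxel_center + t                 # world -> camera coordinates
    if X_cam[2] <= 0:
        return False                             # behind the camera
    x = K @ X_cam
    u, v = int(round(x[0] / x[2])), int(round(x[1] / x[2]))
    h, w = depth_map.shape
    if not (0 <= u < w and 0 <= v < h):
        return False                             # projects outside the image
    d_proj = X_cam[2]                            # d': depth of the projected voxel center
    return d_proj <= depth_map[v, u] + margin    # d' <= d + m  ->  visible (not shielded)
```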
At S1305, the color information on the emitted radiance field is initialized for each surface voxel based on the pixel values of the captured images corresponding to the viewpoints determined to have visibility. In Plenoxels, the color that differs for each direction is represented by the parameters of the spherical harmonic function for each component of RGB. Consequently, as the initialization processing, processing is performed which, for example, sets the base component (the average of the values in all the directions) of the spherical harmonic function to the average value of the pixel values of the captured images corresponding to the viewpoints determined to have visibility and sets the components representing the change in color for each direction to 0.
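A sketch of this color initialization for one surface voxel: the base (DC) spherical-harmonic component is set from the average of the pixel values at the visible viewpoints and the directional components are set to 0. The number of SH coefficients and whether a normalization factor is folded into the base coefficient depend on the SH convention and are assumptions here.

```python
import numpy as np

def init_surface_color(visible_pixel_values, n_sh_coeffs=9):
    """Initialize the spherical-harmonic color parameters of one surface voxel.

    visible_pixel_values: (V, 3) RGB values of the captured-image pixels at the
    viewpoints determined to be visible for this voxel.
    Returns an (n_sh_coeffs, 3) array per RGB component: the base (DC) component
    carries the average color and all direction-dependent components start at 0.
    """
    sh = np.zeros((n_sh_coeffs, 3), dtype=np.float32)
    # Depending on the SH convention, a factor of 1 / Y_00 may need to be folded in here.
    sh[0] = visible_pixel_values.mean(axis=0)
    return sh
```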
As above, according to the present embodiment, Plenoxels, in which the emitted radiance field is handled by direct parameters, is taken as a basis, and the updating of each emitted radiance field is performed after its initial value is calculated; therefore, it is possible to further expedite the convergence of the optimization. Further, by additionally performing the visibility determination, it is possible to further reduce the amount of calculation and to considerably increase the speed of learning.
In the first and second embodiments, explanation is given by taking, as an example of the three-dimensional field, the emitted radiance field, which associates a volume density and a color that differs for each direction with each coordinate in space, but the three-dimensional field is not limited to this. For example, the color information associated with a coordinate within space may be an isotropic color independent of the direction (a color field). Further, the three-dimensional field is not limited to the emitted radiance field. For example, the three-dimensional field may be an Occupancy Field representing the volume density. Further, the three-dimensional field may be a field represented by the bidirectional reflectance distribution function (BRDF) representing the distribution characteristics of reflected light for incident light. Furthermore, the three-dimensional field may be a field representing the Light Visibility of ambient light. Like the emitted radiance field explained in the first and second embodiments, these fields can be learned with the multi-viewpoint images as an input and, by inputting the camera parameters of a virtual viewpoint to each field after learning, the corresponding virtual viewpoint data is obtained.
Further, the three-dimensional field may be a floating-point field (Signed Distance Field) in which the inside of an object is negative and the outside is positive, or a binary field (Surface Field) in which the inside of an object is 0 and the outside is 1. Furthermore, the three-dimensional field may be a field of the directions of the normals to the object surface (Normal Field). In the case of these fields, learning is performed by taking, in addition to the multi-viewpoint images, the depth maps corresponding to the multi-viewpoint images as an input, and based on each field after learning, the corresponding virtual viewpoint data is obtained.
For example, Surface Field described above is utilized in Pixel NeRF or Double Field. Further, BRDF, Light Visibility, and Normal Field are utilized in NeRFactor. Furthermore, Signed Distance Field is utilized in NeuS. It is possible to apply the method of the present invention also to these fields.
Up to this point, the method of generating virtual viewpoint data based on virtual viewpoint parameters has been explained; the learning of the three-dimensional field is also effective for obtaining three-dimensional shape data. For example, it is possible to obtain mesh data by extracting the Occupancy Field or the Signed Distance Field as voxels and applying the Marching Cubes method to the extracted voxels.
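A minimal sketch of this mesh extraction using scikit-image's Marching Cubes implementation; the iso-surface level (for example, 0.5 for an occupancy field or 0.0 for a signed distance field) and the voxel spacing are parameters the caller must choose.

```python
import numpy as np
from skimage import measure

def field_to_mesh(field, level, voxel_size=1.0):
    """Extract a triangle mesh from a sampled Occupancy Field or Signed Distance Field.

    field: (X, Y, Z) array of field values sampled on a voxel grid after learning.
    level: iso-surface value (e.g. 0.5 for occupancy, 0.0 for signed distance).
    """
    verts, faces, normals, _ = measure.marching_cubes(field, level=level,
                                                      spacing=(voxel_size,) * 3)
    return verts, faces, normals
```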
Embodiment(s) of the present invention can also be realized by a computer of a system or apparatus that reads out and executes computer executable instructions (e.g., one or more programs) recorded on a storage medium (which may also be referred to more fully as a ‘non-transitory computer-readable storage medium’) to perform the functions of one or more of the above-described embodiment(s) and/or that includes one or more circuits (e.g., application specific integrated circuit (ASIC)) for performing the functions of one or more of the above-described embodiment(s), and by a method performed by the computer of the system or apparatus by, for example, reading out and executing the computer executable instructions from the storage medium to perform the functions of one or more of the above-described embodiment(s) and/or controlling the one or more circuits to perform the functions of one or more of the above-described embodiment(s). The computer may comprise one or more processors (e.g., central processing unit (CPU), micro processing unit (MPU)) and may include a network of separate computers or separate processors to read out and execute the computer executable instructions. The computer executable instructions may be provided to the computer, for example, from a network or the storage medium. The storage medium may include, for example, one or more of a hard disk, a random-access memory (RAM), a read only memory (ROM), a storage of distributed computing systems, an optical disk (such as a compact disc (CD), digital versatile disc (DVD), or Blu-ray Disc (BD)™), a flash memory device, a memory card, and the like.
According to the present invention, it is made possible to perform learning at a higher speed for generating an image or the like corresponding to a virtual viewpoint from multi-viewpoint images.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.
This application claims the benefit of Japanese Patent Application No. 2023-078633, filed May 11, 2023 which is hereby incorporated by reference wherein in its entirety.