This application is based upon and claims priority to Chinese Patent Application No. CN201910179121X, filed on Mar. 11, 2019, the entire contents of which are incorporated herein by reference.
The present invention relates to the technical field of computer vision and pattern recognition, and particularly to a 3D reconstruction method based on deep learning.
Vision-based 3D reconstruction refers to the computational process and techniques for recovering the 3D information (shape, texture, etc.) of an object from images obtained by a vision sensor. Reconstructing an accurate 3D model of an object from images is essential for many applications, such as cultural relic restoration and robotic grasping and obstacle avoidance. At present, traditional 3D reconstruction methods have several limitations: accurately calibrated cameras and high-quality visual imaging elements are needed; the reconstruction process includes image preprocessing, point cloud registration, data fusion and other steps, which easily accumulates errors and reduces reconstruction accuracy; and it is difficult to reconstruct the shape of parts of the perceived object that are occluded or whose information is lost. These defects lead to low-quality reconstruction results, so traditional methods cannot be widely used in practice. High-precision vision-based 3D reconstruction therefore remains a great challenge.
In recent years, the rapid development of deep learning and the availability of large numbers of 3D CAD models have brought new ideas to traditional 3D reconstruction. The more common deep-learning-based methods use deep generative models, such as generative adversarial networks (GAN), autoencoders (AE) and variational autoencoders (VAE), to reconstruct 3D shapes from a single view. The main framework of these methods comprises two stages, encoding and decoding: in the encoding stage the input data is encoded into latent features, and in the decoding stage the features are decoded to generate a complete 3D shape. GAN-based reconstruction uses random noise as input and ensures reconstruction accuracy through the adversarial competition between the discriminator and the generator; however, because random noise cannot reflect prior information about the object being reconstructed, the reconstruction result is not specific. AE-based reconstruction takes only the minimal reconstruction loss of the generator as the optimization goal and does not consider the adversarial loss of a discriminator, so the reconstruction results are limited by the known input information and can hardly extrapolate the unknown part. Naturally, by combining the prior information retained by AE methods with the discriminative ability of GAN methods, the decoder of the AE can be set as the generator of the GAN, which overcomes the defects of both methods at the same time. However, although this fused GAN-AE approach improves reconstruction accuracy, it cannot completely recover occluded and missing areas and generates noise, which is more obvious in cross-category reconstruction.
The technical problem addressed by the present invention is to overcome the deficiencies of the prior art and to provide a 3D reconstruction method based on deep learning which needs no manually designed complex feature algorithms, avoids complex camera calibration and elaborate process design, and has the ability to expand “what you know” and rebuild “what you don't know” by learning “what you see”, thereby making up for the inherent defect of traditional reconstruction methods that “what you know is what you see”. The method can not only highly preserve the input depth information but also accurately predict the missing part of the object, achieving high-precision 3D reconstruction. The technical solution of the present invention is a 3D reconstruction method based on deep learning which includes the following steps:
(1) reconstructing the complete 3D shape of the target with a 3D GAN whose input latent vector is constrained;
(2) learning the intermediate feature representation between the 3D real object and the reconstructed object with a 3D deep convolutional AE, so as to obtain the target latent variables used in step (1);
(3) transforming the floating-point voxel values predicted in step (1) into binary values with a spatially local pattern classifier to complete high-precision reconstruction.
The invention uses a deep neural network to extract high-performance features, avoiding the error accumulation of multi-step hand-crafted pipelines. By learning the latent information of the 3D shape, the input image is constrained so that the missing part can be accurately predicted; the predicted 3D shape is constrained by depth-projection consistency so that the input information is highly preserved; and spatially local pattern classification is used to binarize the predicted 3D shape, achieving high-precision 3D reconstruction. Therefore, the method needs no manually designed complex feature algorithms, avoids complex camera calibration and elaborate process design, and has the ability to expand “what you know” and rebuild “what you don't know” by learning “what you see”, making up for the inherent defect of traditional reconstruction methods that “what you know is what you see”. It can not only highly preserve the input depth information, but also accurately predict the missing part of the object to achieve high-precision 3D reconstruction.
Preferably, said step (1) includes the following steps:
(1.1) Reconstruction of 3D GAN and Realization of discriminant constraints
(1.2) Realization of consistency constraints of potential features
(1.3) Realization of consistency constraint of depth projection
Preferably, said step (1.1) uses the improved Wasserstein GAN for training. For the generator, the 3D generator loss Lg is defined as formula (1):
Lg = η(−βyt log(yp) − (1−β)(1−yt)log(1−yp)) − (1−η)E[D(yp|x)]   (1)
where x, yt and yp respectively denote the 3D voxel volume converted from the input depth image, the ground-truth value and the 3D object value generated by the network. In the experiments, β is set to 0.85 and η is set to 5.
For the discriminator, 3D GAN optimizes parameters by narrowing the Wasserstein distance between the real pair and the fake pair. The discriminator loss Ld is defined as:
Ld = E[D(yp|x)] − E[D(yt|x)] + λE[(∥∇ŷD(ŷ|x)∥2 − 1)2]   (2)
where ŷ = εx + (1−ε)yp with ε ~ U[0,1], and λ controls the tradeoff between optimizing the gradient penalty and the original objective.
Preferably, in said step (1.2), the latent vector of the input image is constrained by the latent feature vector information of the learned 3D real object, guiding the model to generate the target 3D shape so that the missing part can be accurately predicted. The latent feature loss Ll is defined as:
Ll = E(Zt) − E(Zp)   (3)
where Zt is the latent vector encoded from the 3D ground-truth object, Zp is the latent vector encoded from the input depth image, and E(·) denotes the expectation.
Preferably, in said step (1.3), a projection constraint is applied between the predicted 3D shape and the input depth image: the depth values after projection must be consistent with the input depth values, so as to improve the fidelity of the input information and allow the model to fine-tune the generated 3D shape. The loss function Lproject is given by formula (4):
where yp(x,y,z) represents the value of the predicted 3D shape yp at position (x,y,z), yp(x,y,z)∈{0,1}, and dx,y is the depth value of the input image x at position (x,y).
Preferably, said step (2) uses a 3D deep convolutional AE with skip connections, in which each feature layer of the encoder is connected to the corresponding layer of the decoder. Preferably, in said step (2), the network structure includes an encoder and a decoder: the encoder has four 3D convolution layers, each with a bank of 4×4×4 filters of 1×1×1 strides, followed by a leaky ReLU activation function and a max-pooling layer; these are followed by two fully connected layers, the second of which yields the learned latent vector; the decoder consists of four symmetric deconvolution layers, each concatenating the corresponding feature layer of the encoder, followed by ReLU activations except for the last layer, which uses a sigmoid function. The whole calculation process is: 64³(1)→32³(64)→16³(128)→8³(256)→4³(512)→32768→5000→32768→4³(512)→8³(256)→16³(128)→32³(64)→64³(1).
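As an illustration, the following is a minimal PyTorch sketch of such a skip-connected 3D convolutional AE. The channel sizes and the 5000-dimensional latent vector follow the description above; the padding scheme and the transposed-convolution hyper-parameters are assumptions made so that the 64³→4³→64³ sizes work out, and all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class DepthConvAE3D(nn.Module):
    """Sketch of the skip-connected 3D convolutional AE described above."""

    def __init__(self):
        super().__init__()
        chans = [1, 64, 128, 256, 512]
        # Four conv blocks: 4x4x4 filters, stride 1, leaky ReLU, then max pooling.
        self.encoders = nn.ModuleList(
            nn.Sequential(
                nn.Conv3d(chans[i], chans[i + 1], kernel_size=4,
                          stride=1, padding="same"),
                nn.LeakyReLU(0.2),
                nn.MaxPool3d(2),                   # halves each spatial dimension
            )
            for i in range(4)
        )
        self.fc_enc = nn.Linear(512 * 4 ** 3, 5000)  # 32768 -> 5000 (latent)
        self.fc_dec = nn.Linear(5000, 512 * 4 ** 3)  # 5000 -> 32768
        # Decoder input channels double because encoder features are concatenated.
        self.decoders = nn.ModuleList([
            nn.ConvTranspose3d(512 + 512, 256, 4, stride=2, padding=1),
            nn.ConvTranspose3d(256 + 256, 128, 4, stride=2, padding=1),
            nn.ConvTranspose3d(128 + 128, 64, 4, stride=2, padding=1),
            nn.ConvTranspose3d(64 + 64, 1, 4, stride=2, padding=1),
        ])

    def forward(self, x):                          # x: (B, 1, 64, 64, 64)
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)                        # features at 32^3 ... 4^3
        z = self.fc_enc(x.flatten(1))              # latent vector, shape (B, 5000)
        x = self.fc_dec(z).view(-1, 512, 4, 4, 4)
        for i, dec in enumerate(self.decoders):
            x = dec(torch.cat([x, skips[3 - i]], dim=1))
            x = torch.sigmoid(x) if i == 3 else torch.relu(x)
        return z, x                                # latent and 64^3 reconstruction
```

The encoder features saved in `skips` are concatenated with decoder features at matching resolutions, which is what lets the decoder preserve fine input detail.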
Preferably, in said step (2), network parameters are optimized by making the predicted 3D shape as close as possible to the real 3D shape; the objective function Lt is given by formula (5):
Lt = −αyt log(y′t) − (1−α)(1−yt)log(1−y′t)   (5)
where yt is the ground-truth value of each voxel and y′t is the predicted value of each voxel. Cross-entropy is used to measure the quality of reconstruction. Because most voxel values of each object are zero, the weight α is applied to the false-positive and false-negative samples to balance them. In the experiments α is set to 0.85.
Preferably, in said step (3), nonlinear binary reconstruction is applied to the voxel set output by the generator using an extreme learning machine (ELM) classifier.
Preferably, in said step (3), the network has three layers: an input layer, a hidden layer and an output layer. The input is the feature of each voxel mesh of the object: the neighborhood values around each voxel are extracted as feature values, forming a 7-dimensional feature vector. The number of hidden-layer nodes, determined by multiple experiments, is about 11. The output judges whether the label of each voxel is 0 or 1.
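For illustration, the following is a minimal sketch of extracting such 7-dimensional per-voxel features. It assumes the seven values are the voxel's own value plus its six face neighbours (the text does not specify the exact neighbourhood), and zero padding at the grid boundary is likewise an assumption:

```python
import numpy as np

def voxel_features(vol):
    """Build 7-d per-voxel features: the voxel itself plus its 6 face neighbours."""
    p = np.pad(vol, 1)                            # zero-pad to (66, 66, 66) for 64^3
    c = p[1:-1, 1:-1, 1:-1]                       # the original volume
    feats = np.stack([
        c,                                        # the voxel's own value
        p[:-2, 1:-1, 1:-1], p[2:, 1:-1, 1:-1],    # -x / +x neighbours
        p[1:-1, :-2, 1:-1], p[1:-1, 2:, 1:-1],    # -y / +y neighbours
        p[1:-1, 1:-1, :-2], p[1:-1, 1:-1, 2:],    # -z / +z neighbours
    ], axis=-1)
    return feats.reshape(-1, 7)                   # one row per voxel
```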
If the activation function is infinitely differentiable over any real interval, the ELM network can approximate any nonlinear function, and the classifier loss function Lc is given by formula (6):
Lc = yfvoxel − ytvoxel   (6)
where yfvoxel is the value of each voxel mesh after binary reconstruction, and ytvoxel is the value of each voxel mesh of the real object.
The invention is described in more detail below.
The 3DGAN-LFPC provided by the present invention consists of three components: 1) 3D GAN: the latent vector constrained from the input image is used to reconstruct the complete 3D shape of the target; 2) 3D deep convolutional AE: learns the intermediate feature representation between the 3D real object and the reconstructed object, providing the target latent variables for step (1); 3) spatially local pattern classifier: the floating-point voxel values predicted in step (1) are transformed into binary values using an extreme learning machine (ELM) to complete high-precision reconstruction. A general flowchart of training and testing 3DGAN-LFPC is shown in the accompanying drawings.
The network structure of this part mainly includes a 3D generator and a 3D discriminator, as shown in the 3D GAN part of the accompanying drawings.
The loss function of this part comprises the 3D generator loss Lg, the 3D discriminator loss Ld, the latent feature loss Ll and the depth projection loss Lproject. The details of each part are as follows.
i. Reconstruction of 3D GAN and Realization of Discriminant Constraints
To address the vanishing gradients and convergence difficulties of original GAN training, the invention adopts the improved Wasserstein GAN for training. For the generator, the invention combines the reconstruction loss of the AE and the adversarial loss of the GAN as the objective function Lg:
Lg = η(−βyt log(yp) − (1−β)(1−yt)log(1−yp)) − (1−η)E[D(yp|x)]   (1)
where x, yt and yp respectively denote the 3D voxel volume converted from the input depth image, the ground-truth value and the 3D object value generated by the network. In the experiments, β is set to 0.85 and η is set to 5.
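As a minimal sketch, formula (1) can be transcribed into PyTorch as follows. The small constant `eps` added inside the logarithms for numerical stability and the tensor layout (a batch of 64³ voxel grids) are assumptions, and the function name is hypothetical:

```python
import torch

def generator_loss(y_p, y_t, d_fake, beta=0.85, eta=5.0, eps=1e-8):
    """Formula (1): weighted cross-entropy reconstruction plus adversarial term.

    y_p    : predicted occupancy probabilities, shape (B, 64, 64, 64)
    y_t    : ground-truth binary voxels, same shape
    d_fake : discriminator scores D(y_p | x) for the generated shapes, shape (B,)
    """
    recon = -(beta * y_t * torch.log(y_p + eps)
              + (1.0 - beta) * (1.0 - y_t) * torch.log(1.0 - y_p + eps)).mean()
    return eta * recon - (1.0 - eta) * d_fake.mean()   # second term is E[D(y_p | x)]
```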
For the discriminator, 3D GAN optimizes parameters by narrowing the Wasserstein distance between the real pair and the fake pair. The discriminator loss Ld is defined as:
Ld = E[D(yp|x)] − E[D(yt|x)] + λE[(∥∇ŷD(ŷ|x)∥2 − 1)2]   (2)
where ŷ = εx + (1−ε)yp with ε ~ U[0,1], and λ controls the tradeoff between optimizing the gradient penalty and the original objective.
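A minimal PyTorch sketch of formula (2) follows. The interpolated sample uses the text's definition ŷ = εx + (1−ε)yp; the calling convention D(shape, condition) and the penalty weight λ = 10 (a common WGAN-GP choice, not stated in the text) are assumptions:

```python
import torch

def discriminator_loss(D, x, y_t, y_p, lam=10.0):
    """Formula (2): WGAN critic loss with gradient penalty."""
    # Interpolate between the input voxels x and the generated shape y_p.
    eps = torch.rand(y_p.size(0), 1, 1, 1, device=y_p.device)
    y_hat = (eps * x + (1.0 - eps) * y_p).detach().requires_grad_(True)
    # Gradient penalty: push the critic's gradient norm at y_hat toward 1.
    grad = torch.autograd.grad(D(y_hat, x).sum(), y_hat, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    # Wasserstein estimate: fake pair minus real pair, plus the penalty.
    return D(y_p, x).mean() - D(y_t, x).mean() + lam * penalty
```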
ii. Realization of Consistency Constraints of Potential Features
In an unconditional generative model, we cannot control the network to generate the required target model. For 3D reconstruction, the result is obtained by decoding latent feature vectors, and its accuracy depends on the quality of the learned latent vector. In fact, a good latent vector should not only be able to reconstruct 3D objects but also be predictable from 2D images. Therefore, the invention innovatively uses the latent feature vector information of the learned 3D real object to constrain the latent vector of the input image, guiding the model to generate the target 3D shape so that the missing part can be accurately predicted. Its loss function Ll is defined as:
Ll = E(Zt) − E(Zp)   (3)
where Zt is the latent vector encoded from the 3D ground-truth object, Zp is the latent vector encoded from the input depth image, and E(·) denotes the expectation.
iii. Realization of Consistency Constraint in Depth Projection
The predicted 3D shape should be consistent with the 2D view, which is helpful for training deep-learning-based 3D reconstruction. Therefore, a projection constraint is applied between the predicted 3D shape and the input depth image: the depth values after projection must be consistent with the input depth values, so as to improve the fidelity of the input information and allow the model to fine-tune the generated 3D shape. Its loss function Lproject is given by formula (4):
where yp(x,y,z) represents the value of the predicted 3D shape yp at position (x,y,z), yp(x,y,z)∈{0,1}, and dx,y is the depth value of the input image x at position (x,y).
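Since formula (4) itself is not reproduced in the text, the following PyTorch sketch only illustrates one plausible form of such a projection-consistency term, in which the first occupied voxel along the z axis is taken as the projected depth and compared with the input depth map; this exact form is an assumption:

```python
import torch

def projection_loss(y_p, d_in, tau=0.5):
    """Plausible depth-projection consistency term (formula (4) is not given).

    y_p  : predicted occupancies, shape (B, X, Y, Z)
    d_in : input depth maps in voxel units, shape (B, X, Y)
    """
    B, X, Y, Z = y_p.shape
    occ = (y_p > tau).float()     # hard occupancy; a soft relaxation would be
                                  # needed for this term to pass gradients
    z = torch.arange(Z, device=y_p.device).float().expand(B, X, Y, Z)
    # Empty columns fall back to depth Z (i.e. "no surface hit").
    depth = torch.where(occ > 0, z, torch.full_like(z, float(Z)))
    projected = depth.min(dim=-1).values          # first occupied voxel along z
    return (projected - d_in).abs().mean()
```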
The network structure of this part mainly includes an encoder and a decoder, as shown in the 3D deep convolutional AE part of the accompanying drawings.
Network parameters are optimized by making the predicted 3D shape as close as possible to the real 3D shape; the objective function Lt is given by formula (5):
Lt = −αyt log(y′t) − (1−α)(1−yt)log(1−y′t)   (5)
where yt is the ground-truth value of each voxel and y′t is the predicted value of each voxel. Cross-entropy is used to measure the quality of reconstruction. Because most voxel values of each object are zero, the weight α is applied to the false-positive and false-negative samples to balance them. In the experiments α is set to 0.85.
The network of this part has three layers: an input layer, a hidden layer and an output layer, as shown in the binary reconstruction part of the accompanying drawings.
If the activation function is infinitely differentiable over any real interval, the ELM network can approximate any nonlinear function, and the classifier loss function Lc is given by formula (6):
Lc = yfvoxel − ytvoxel   (6)
where yfvoxel is the value of each voxel mesh after binary reconstruction, and ytvoxel is the value of each voxel mesh of the real object.
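For illustration, a basic ELM of the kind described above can be trained in closed form: the input-to-hidden weights are random and never updated, and the hidden-to-output weights are solved by least squares. The sigmoid activation, the 0.5 decision threshold and the function names are assumptions; only the 7-dimensional features and the roughly 11 hidden nodes follow the text:

```python
import numpy as np

def train_elm(feats, labels, n_hidden=11, seed=0):
    """Closed-form ELM training for the spatially local pattern classifier.

    feats  : (N, 7) per-voxel feature vectors
    labels : (N,) binary ground-truth voxel values
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((feats.shape[1], n_hidden))  # random, never trained
    b = rng.standard_normal(n_hidden)
    h = 1.0 / (1.0 + np.exp(-(feats @ w + b)))           # hidden activations
    beta = np.linalg.pinv(h) @ labels                    # least-squares solution
    return w, b, beta

def elm_predict(feats, w, b, beta):
    h = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    return (h @ beta > 0.5).astype(np.uint8)             # binary voxel labels
```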
In conclusion, the 3DGAN-LFPC proposed in the invention includes the 3D GAN generator loss (see formula (1)), the discriminator loss (see formula (2)), the latent feature loss (see formula (3)), the depth projection loss (see formula (4)), the AE reconstruction loss (see formula (5)) and the voxel classification loss (see formula (6)). The Adam algorithm is adopted for model optimization, and the optimization sequence is (4), (3), (2), (1), (5), (6).
The invention uses the publicly available ModelNet database to generate training and test data sets. The specific operation is as follows: for each CAD model, the invention creates a virtual depth camera and scans the model from 125 different angles, uniformly sampling 5 viewing angles in each of the pitch, yaw and roll directions. In this way, the depth image and the corresponding complete 3D shape are obtained, and the depth image and the 3D shape are then voxelized into 64×64×64 3D grids using the virtual camera parameters. Each voxel grid is represented as a binary tensor: 1 for occupied voxels, 0 for unoccupied voxels.
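As an illustration of the voxelization step only, the following sketch converts a depth image into a binary 64³ occupancy grid. It assumes depth values already normalised to voxel units and omits the virtual camera intrinsics that the full pipeline would use; all names are hypothetical:

```python
import numpy as np

def depth_to_voxels(depth, res=64):
    """Mark one occupied voxel per valid depth pixel in a res^3 binary grid."""
    grid = np.zeros((res, res, res), dtype=np.uint8)
    h, w = depth.shape
    for u in range(h):
        for v in range(w):
            d = depth[u, v]
            if 0 <= d < res:                 # skip invalid / background pixels
                x = int(u * res / h)         # rescale pixel coords to the grid
                y = int(v * res / w)
                grid[x, y, int(d)] = 1       # 1 = occupied, 0 = empty
    return grid
```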
The invention uses the Intersection over Union (IoU) of 3D voxels to evaluate the performance of 3D reconstruction. IoU represents the similarity between the predicted 3D voxel grid and the real voxel grid, and is defined as follows:
IoU = Σi,j,k [I((yf)ijk)·I(yijk)] / Σi,j,k [I((yf)ijk + yijk)]
where I(·) is the indicator function, (i,j,k) is the 3D index of a voxel grid, (yf)ijk is the predicted value at voxel (i,j,k), and yijk is the real value at voxel (i,j,k). The IoU of a 3D shape lies in [0,1]; the closer the IoU is to 1, the better the reconstruction.
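For two binary voxel grids, this reduces to a straightforward computation; a minimal sketch (function name hypothetical):

```python
import numpy as np

def voxel_iou(y_f, y_t):
    """IoU between a predicted binary voxel grid y_f and the ground truth y_t."""
    y_f, y_t = y_f.astype(bool), y_t.astype(bool)
    inter = np.logical_and(y_f, y_t).sum()     # voxels occupied in both grids
    union = np.logical_or(y_f, y_t).sum()      # voxels occupied in either grid
    return inter / union if union else 1.0     # two empty grids are identical
```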
The invention has conducted relevant experiments, comparing the proposed 3DGAN-LFPC with classical reconstruction methods, including the Poisson surface reconstruction method and the method based on the 3D-RecGAN framework proposed by Yang et al. At the same time, in order to verify the effectiveness and contribution of each constraint proposed in the invention, two simplified versions of the 3DGAN-LFPC model, 3DGAN-LFC (with only the latent feature consistency constraint) and 3DGAN-PC (with only the depth projection consistency constraint), are used for comparison experiments.
(2) Cross-category results. To further investigate generality, the network is trained on one category but tested on two different categories. Specifically, in group 1 the network is trained on chairs and tested on stools and toilets; in group 2 it is trained on stools and tested on chairs and toilets; in group 3 it is trained on toilets and tested on chairs and stools. The comparative results are shown in Table 2.
In general, the 3DGAN-LFPC proposed by the invention is superior to both traditional reconstruction methods and deep-learning-based reconstruction methods; that is, it can recover the 3D object structure with higher accuracy from a single depth image. During training, 3DGAN-LFPC optimizes the latent feature vector of the input image by learning from the generation network of the 3D real object, which provides a direction for the shape reconstruction of the model. Moreover, 3DGAN-LFPC uses the latent feature vector optimized by the autoencoder to replace the random input of the GAN, which improves the performance of the model. In addition, 3DGAN-LFPC applies the depth-projection consistency constraint to the predicted 3D shape, avoiding the generation of uncorrelated noise and better capturing the details of the object surface. Finally, 3DGAN-LFPC improves the reconstruction quality by using nonlinear voxel binarization. In short, the model of the invention can make better use of the prior knowledge of the object, that is, it can expand the “seeing” by “learning”, better reconstruct the occluded and missing areas of the target object, and learn the variability and correlation of geometric features between different object categories.
The above contents are only the preferable embodiments of the present invention, and do not limit the present invention in any manner. Any improvements, amendments and alternative changes made to the above embodiments according to the technical spirit of the present invention shall fall within the claimed scope of the present invention.
| Number | Date | Country | Kind |
|---|---|---|---|
| 201910179121.X | Mar 2019 | CN | national |