The present invention relates to a method and a device for determining a high-resolution depth map of a scene. The present invention also relates to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out such a method.
RGB-D sensors have become very popular for 3D reconstruction, in view of their low cost and ease of use. They deliver a colored point cloud in a single shot, but the resulting shape often misses thin geometric structures. This is due to noise, quantization and, more importantly, the coarse resolution of the depth map. However, super-resolution of a solitary depth map without additional constraint is an ill-posed problem. In comparison, the quality and resolution of the companion RGB image are substantially better. For instance, a device may deliver 1280×1024 px² RGB images, but only up to 640×480 px² depth maps. Therefore, it seems natural to rely on color to refine depth. Yet, retrieving geometry from a single color image is another ill-posed problem, called shape-from-shading. Besides, combining it with depth cues requires the RGB and depth images to have the same resolution. The resolution of the depth map thus remains a limiting factor in single-shot RGB-D sensing.
The objective of the present invention is to provide a method and a device for determining a high-resolution depth map of a scene, wherein the method and the device overcome one or more of the above-mentioned problems of the prior art.
A first aspect of the invention provides a method for determining a high-resolution depth map of a scene, the method comprising:
Therein, low-resolution refers to a spatial resolution that is lower than the high-resolution.
Initializing an estimated reflectance map, an estimated lighting vector and an estimated depth map may refer to creating these variables and assigning them an initial value. The initial value may be predetermined (e.g. a predetermined constant) or it may be determined based on another known parameter. For example, the estimated depth map may be initialized with values from the obtained (measured) low-resolution depth-map.
Simultaneously updating a number of variables preferably means that, in each iteration, each of the variables (here: the estimated reflectance map, the estimated lighting vector and the estimated depth map) is updated, wherein the update of at least one of the variables depends on another one of the variables that has already been updated in the same iteration.
Determining the high-resolution depth map based on the iteratively updated estimated depth-map may comprise that the high-resolution depth map is determined as the estimated depth map of a final iteration, e.g. when the iteration has converged and an update rate is lower than a predetermined threshold. In other embodiments, determining the high-resolution depth map may involve further processing steps that are based on the iteratively updated estimated depth-map.
Embodiments of the method of the first aspect can jointly refine and up-sample the depth map using shape-from-shading. In other words, the ill-posedness of single depth image super-resolution may be fought using shape-from-shading, and vice versa.
In a first implementation of the method according to the first aspect, the low-resolution depth map and the high-resolution image are obtained using an RGB-D camera. This has the advantage that all required input information can be obtained from one camera device.
In a second implementation of the method according to the first aspect as such or according to the first implementation of the first aspect, a Potts prior is used for initializing and/or updating the estimated reflectance map. Experiments have shown that the reflectance of many objects matches the reflectance assumption of the Potts prior. Thus, superior results can be achieved.
In a third implementation of the method according to the first aspect as such or according to any of the preceding implementations of the first aspect, the iterative updates are determined based on an optimization of a cost function. In other words, in an iterative procedure, an estimated reflectance map, an estimated lighting vector and an estimated depth map are determined that minimize (or maximize) the cost function.
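In a fourth implementation of the method according to the first aspect as such or according to any of the preceding implementations of the first aspect, the cost function is given by
$$\|(l\cdot m_{z,\nabla z})\,\rho - I\|^2_{\ell^2(\Omega_{HR})} + \mu\,\|Kz - z_0\|^2_{\ell^2(\Omega_{LR})} + \nu\,\|\mathrm{d}A_{z,\nabla z}\|_{\ell^1(\Omega_{HR})} + \lambda\,\|\nabla\rho\|_{\ell^0(\Omega_{HR})}$$
wherein $\rho:\Omega_{HR}\to\mathbb{R}^c$ is the reflectance map, $l\in\mathbb{R}^d$ is the lighting vector, $z:\Omega_{HR}\to\mathbb{R}$ is the depth map, $I:\Omega_{HR}\to\mathbb{R}^c$ is the high-resolution image, μ, ν and λ are predetermined weights, $m_{z,\nabla z}$ is an $\Omega_{HR}\to\mathbb{R}^d$ vector field, $\mathrm{d}A_{z,\nabla z}$ is an $\Omega_{HR}\to\mathbb{R}$ scalar field that maps each pixel to the area of the corresponding surface element, $z_0$ is the low-resolution depth map, and $K$ is a linear operator that comprises down-sampling.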
The linear operator K may also involve warping and/or blurring in addition to down-sampling. For example, the linear operator K may be formed as a product of a down-sampling operator, a blurring operator and a warping operator.
In other embodiments, the operator K may be non-linear.
In a fifth implementation of the method according to the fourth implementation of the first aspect, the weights μ, ν and λ are determined as
In a sixth implementation of the method according to the first aspect as such or according to any of the preceding implementations of the first aspect, the iteratively updating the estimated reflectance map, the estimated lighting vector, and the estimated depth-map comprises iteratively updating an auxiliary variable, wherein the auxiliary variable comprises the depth map and a gradient of the depth map.
Introducing this auxiliary variable has the advantage that the cost function can be separated into a linear part and a non-linear part, which simplifies the numerical computation.
In a seventh implementation of the method according to the first aspect as such or according to any of the preceding implementations of the first aspect, the iteratively updating the estimated reflectance map, the estimated lighting vector, and the estimated depth-map comprises determining
wherein ρ(k+1) is the updated estimated reflectance map, l(k+1) is the updated light vector, θ(k+1) is the updated auxiliary variable and z(k+1) is the updated estimated depth map, and ΩHR is the high-resolution domain, u is a Lagrange multiplier, κ is a step size, and wherein mθ is a vector field.
In an eighth implementation of the method according to the first aspect as such or according to any of the preceding implementations of the first aspect, the vector field $m_\theta$ is an $\Omega_{HR}\to\mathbb{R}^d$ vector field defined as
wherein $f>0$ is a focal length, $\theta=(z,\nabla z)$ and $p:\Omega_{HR}\to\mathbb{R}^2$ is a field of pixel coordinates with respect to a principal point.
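For illustration only, the following Python sketch shows one possible form of such a vector field for d=4, assuming a perspective camera and a first-order spherical harmonics lighting model: the unit surface normal, computed from the depth map and its gradient, stacked with a constant 1. The function names, the finite-difference gradient and the principal point parameters `cx`, `cy` are assumptions made for this example and are not taken from the text above.

```python
import numpy as np

def shading_vector_field(z, f, cx, cy):
    """Return an H x W x 4 field: unit surface normal stacked with a constant 1."""
    h, w = z.shape
    # Pixel coordinates p relative to the (assumed) principal point (cx, cy).
    x, y = np.meshgrid(np.arange(w) - cx, np.arange(h) - cy)
    # Finite-difference depth gradient; np.gradient returns (d/d_row, d/d_col).
    zy, zx = np.gradient(z)
    # Un-normalised normal of a perspective surface: [f * grad z, -z - p . grad z].
    n = np.stack([f * zx, f * zy, -z - (x * zx + y * zy)], axis=-1)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)
    return np.concatenate([n, np.ones((h, w, 1))], axis=-1)

def render(rho, l, m):
    """Lambertian image (l . m) * rho for a reflectance map rho of shape H x W x C."""
    shading = m @ l  # per-pixel scalar shading, shape H x W
    return shading[..., None] * rho
```

For a fronto-parallel plane lit frontally with l = [0, 0, −1, 0], the shading is 1 everywhere and the rendered image equals the reflectance map, which gives a quick sanity check of the sign conventions used in this sketch.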
In a ninth implementation of the method according to the first aspect as such or according to any of the preceding implementations of the first aspect, the method further comprises an initial step of segmenting one or more objects from the high-resolution image.
In a tenth implementation of the method according to the ninth implementation of the first aspect, the method is performed for each of the segmented one or more objects.
A second aspect of the invention refers to a device for determining a high-resolution depth map of a scene based on a low-resolution depth map of the scene and a high-resolution image of the scene, the device comprising:
The device of the second aspect may be configured to carry out the method of the first aspect or one of the implementations of the first aspect.
A third aspect of the invention refers to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out the method of the first aspect.
To illustrate the technical features of embodiments of the present invention more clearly, the accompanying drawings provided for describing the embodiments are introduced briefly in the following. The accompanying drawings in the following description merely show some embodiments of the present invention; modifications of these embodiments are possible without departing from the scope of the present invention as defined in the claims.
The method comprises a first step 110 of obtaining a low-resolution depth map and a high-resolution image of a scene. For example, the low-resolution depth map and the high-resolution image can be acquired with an RGB-D camera. The high-resolution image has a higher spatial resolution than the low-resolution depth map. The fields of view of the high-resolution image and the low-resolution depth map do not need to be identical. Preferably, they are at least partially overlapping, e.g. at least 50% or at least 25% overlapping.
The method comprises a second step 120 of initializing an estimated reflectance map, an estimated lighting vector and an estimated depth map, wherein the estimated depth map is in high-resolution. The initializing step may consist simply in the creation of the variables in a program, and initial values may be assigned.
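As a hedged illustration of the initializing step, the sketch below creates the three variables and assigns initial values: the reflectance map is set to the image, the lighting vector to a frontal direction, and the depth map to a linear interpolation of the low-resolution depth map. The helper names and the integer up-sampling factor are assumptions made for this example; any smoothing of the interpolated depth map is omitted for brevity.

```python
import numpy as np

def upsample_linear(z_lr, factor):
    """Separable linear interpolation of a 2-D array by an integer factor."""
    h, w = z_lr.shape
    # Positions of the high-resolution samples, expressed in low-resolution pixel units.
    rows = np.linspace(0, h - 1, h * factor)
    cols = np.linspace(0, w - 1, w * factor)
    # Interpolate along the columns first, then along the rows.
    tmp = np.stack([np.interp(cols, np.arange(w), r) for r in z_lr])
    return np.stack([np.interp(rows, np.arange(h), c) for c in tmp.T], axis=1)

def initialize(z_lr, image_hr, factor):
    """Create initial reflectance, lighting and high-resolution depth estimates."""
    rho0 = image_hr.astype(float).copy()   # reflectance initialised with the image
    l0 = np.array([0.0, 0.0, -1.0, 0.0])   # frontal lighting, no ambient term
    z0_hr = upsample_linear(z_lr, factor)  # depth initialised from the measurement
    return rho0, l0, z0_hr
```

Other initial values, such as predetermined constants, would fit the same interface.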
The method comprises a third step 130 of iteratively simultaneously updating the estimated reflectance map, the estimated lighting vector, and the estimated depth-map, wherein updating the estimated depth map is partially based on the high-resolution image. Therein, simultaneously updating means that, in each iteration, each of these variables is updated, wherein the update of at least one of the variables depends on another one of the variables that has already been updated in the same iteration.
The method comprises a final step 140 of determining the high-resolution depth map based on the iteratively updated estimated depth-map.
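The control flow of steps 130 and 140 can be sketched as follows. This is a minimal outline under stated assumptions, not the claimed implementation: `update` is a hypothetical callable that performs one sweep over the reflectance, lighting and depth estimates, and the stopping rule (relative change between successive depth estimates below a threshold) is one possible convergence criterion.

```python
import numpy as np

def refine_depth(rho, l, z, update, tol=1e-4, max_iter=20):
    """Iterate `update` until successive depth estimates differ by less than `tol`.

    `update` is a hypothetical callable (rho, l, z) -> (rho, l, z) standing in for
    one sweep over the reflectance, lighting and depth sub-problems.
    """
    iterations = 0
    for _ in range(max_iter):
        rho, l, z_new = update(rho, l, z)
        iterations += 1
        # Relative residual between two successive depth estimates.
        residual = np.linalg.norm(z_new - z) / max(np.linalg.norm(z), 1e-12)
        z = z_new
        if residual < tol:
            break
    # The depth estimate of the final iteration is returned as the result.
    return rho, l, z, iterations
```

Passing the update as a callable keeps the loop independent of how the individual sub-problems are solved.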
The device 200 comprises an initialization unit 210, an iterative update unit 220 and a determination unit 230. All three units may be realized on the same physical unit, e.g. on a processor with connected memory. In particular, the three units may be realized as three software modules running on the same processor.
The initialization unit 210 is configured to initialize an estimated reflectance map, an estimated lighting vector and an estimated depth map, wherein the estimated reflectance map and the estimated depth map are in high-resolution.
The iterative update unit 220 is configured to iteratively simultaneously update the estimated reflectance map, the estimated lighting vector, and the estimated depth-map, wherein updating the estimated depth map is partially based on the high-resolution image.
The determination unit 230 is configured to determine the high-resolution depth map based on the iteratively updated estimated depth-map.
In the following, a specific embodiment shall be explained in more detail.
A depth map can be defined as a function which associates with each 2D point of the image plane the third component of its conjugate 3D point, relative to a camera coordinate system. Depth sensors provide out-of-the-box samples of the depth map over a discrete low-resolution rectangular 2D grid $\Omega_{LR}\subset\mathbb{R}^2$. We denote by $z_0:\Omega_{LR}\to\mathbb{R},\ p\mapsto z_0(p)$ such a mapping between a pixel $p$ and the measured depth value $z_0(p)$. Due to hardware constraints, the depth observations $z_0$ are limited by the resolution of the sensor (i.e., the number of pixels in $\Omega_{LR}$). The single depth image super-resolution problem consists in estimating a high-resolution depth map $z:\Omega_{HR}\to\mathbb{R}$ over a larger domain $\Omega_{HR}\supset\Omega_{LR}$, which coincides with the low-resolution observations $z_0$ over $\Omega_{LR}$ once it is downsampled. This can be formally written as
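$$z_0 = Kz + \eta_z. \qquad (1)$$
In equation (1), $K$ is a linear operator that maps a depth map over $\Omega_{HR}$ to a depth map over $\Omega_{LR}$ (e.g. by a combination of warping, blurring and down-sampling), and $\eta_z$ is a noise term.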
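To make the observation model (1) concrete, the following sketch applies a very simple linear operator, namely block averaging (a box blur followed by decimation), and adds Gaussian noise. The choice of block averaging, the function names and the parameter `sigma_z` are assumptions made for this example; warping is not modelled.

```python
import numpy as np

def apply_K(z_hr, factor):
    """Linear down-sampling: average each factor x factor block of pixels."""
    h, w = z_hr.shape
    if h % factor or w % factor:
        raise ValueError("image size must be a multiple of the factor")
    blocks = z_hr.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def simulate_observation(z_hr, factor, sigma_z, rng):
    """Low-resolution observation: down-sampled depth plus zero-mean Gaussian noise."""
    z_lr = apply_K(z_hr, factor)
    return z_lr + sigma_z * rng.standard_normal(z_lr.shape)
```

Because `apply_K` is linear, it could equally be stored as a sparse matrix acting on the vectorised depth map, which is convenient when the operator appears inside a least-squares problem.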
Shape-from-shading aims at inferring shape from a single gray-level or color image of a scene. It comprises inverting an image formation model relating the image irradiance I to the scene radiance R, which depends on the surface shape (represented here by the depth map z), the incident lighting l and the surface reflectance ρ:
I=R(z|l,ρ)+ηI (2)
Therein ηI is the realisation of a stochastic process standing for noise, quantisation and outliers.
In the context of RGB-D sensing, the high-frequency information necessary to achieve detail-preserving depth super-resolution could be provided by the photometric data. Similarly, the low-frequency information necessary to disambiguate shape-from-shading could be conveyed by the geometric data. It is thus possible to achieve joint depth map refinement and super-resolution in a single shot, without resorting to additional data (new viewing angles or illumination conditions, learnt dictionary, etc.).
We formulate shading-based depth super-resolution as the joint solving of (1) (super-resolution) and (2) (shape-from-shading) in terms of the high-resolution depth map $z:\Omega_{HR}\to\mathbb{R}$, given a low-resolution depth map $z_0:\Omega_{LR}\to\mathbb{R}$ and a high-resolution RGB image $I:\Omega_{HR}\to\mathbb{R}^3$. We aim at recovering not only a high-resolution depth map which is consistent both with the low-resolution depth measurements and with the high-resolution color data, but also the hidden parameters of the image formation model (2), i.e., the reflectance $\rho$ and the lighting $l$. This can be achieved by maximizing the posterior distribution of these unknowns given the input data which, according to Bayes' rule, is given by
$$\mathcal{P}(z,\rho,l\mid z_0,I)=\frac{\mathcal{P}(z_0,I\mid z,\rho,l)\,\mathcal{P}(z,\rho,l)}{\mathcal{P}(z_0,I)}, \qquad (4)$$
where the numerator is the product of the likelihood with the prior, and the denominator is the evidence, which can be discarded since it plays no role in maximum a posteriori (MAP) estimation. In order to make the independence assumptions as transparent as possible and to motivate the final energy we aim at minimizing, we derive a variational model from the posterior distribution (4).
Let us start with the first term in the numerator of (4) i.e., the likelihood. By construction of RGB-D sensors, depth and color observations are independent, hence
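$$\mathcal{P}(z_0,I\mid z,\rho,l)=\mathcal{P}(z_0\mid z,\rho,l)\,\mathcal{P}(I\mid z,\rho,l).$$
We further assume that the depth observations are independent from the surface reflectance and from the lighting, hence $\mathcal{P}(z_0\mid z,\rho,l)=\mathcal{P}(z_0\mid z)$ and thus:
$$\mathcal{P}(z_0,I\mid z,\rho,l)=\mathcal{P}(z_0\mid z)\,\mathcal{P}(I\mid z,\rho,l). \qquad (5)$$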
Assuming homoscedastic, zero-mean Gaussian noise $\eta_z$ with variance $\sigma_z^2$ in (1), the first factor in (5) reads
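Next, we discuss the second factor in (5), by making Equation (2) explicit. In general, the irradiance in channel $\star\in\{R,G,B\}$ reads
$$I_\star=\int_\lambda\int_\omega c_\star(\lambda)\,\rho(\lambda)\,\phi(\lambda,\omega)\,\max\{0,\,s(\omega)\cdot n_z\}\,\mathrm{d}\omega\,\mathrm{d}\lambda+\eta_I, \qquad (7)$$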
where integration is carried out over all wavelengths λ (ρ is the spectral reflectance of the surface and c★ is the transmission spectrum of the camera in channel ★) and all incident lighting directions ω (s(ω) is the unit-length vector pointing towards the light source located in direction ω, and ϕ(⋅, ω) is the spectrum of this source), and nz is the unit-length surface normal (which depends on the underlying depth map z). Assuming achromatic lighting i.e., ϕ(⋅, ω):=ϕ(ω), and using a first-order spherical harmonics approximation of the inner integral, we obtain
with $l\in\mathbb{R}^4$ the achromatic “light vector”, $\rho:\Omega_{HR}\to\mathbb{R}^3$ the albedo (Lambertian reflectance) map, relative to the camera transmission spectra $\{c_\star\}_{\star\in\{R,G,B\}}$, and $n_z:\Omega_{HR}\to\mathbb{S}^2\subset\mathbb{R}^3$ the field of unit-length surface normals. Assuming perspective projection with focal length $f>0$ and $p:\Omega_{HR}\to\mathbb{R}^2$ the field of pixel coordinates with respect to the principal point, the normal field is given by
Assuming that the image noise is homoscedastic, zero-mean and Gaussian with covariance matrix $\mathrm{Diag}(\sigma_I^2,\sigma_I^2,\sigma_I^2)$, we obtain
where, according to (8) and (9), $m_{z,\nabla z}$ is an $\Omega_{HR}\to\mathbb{R}^4$ vector field defined as
We now consider the second factor in the numerator of (4) i.e., the prior distribution. We assume that depth, reflectance and lighting are independent (independence of reflectance from depth and lighting follows from the Lambertian assumption, and independence of lighting from depth follows from the distant-light assumption required to derive the spherical harmonics model (8)). This implies
$$\mathcal{P}(z,\rho,l)=\mathcal{P}(z)\,\mathcal{P}(\rho)\,\mathcal{P}(l). \qquad (12)$$
Since lighting has already been modelled as a low-frequency phenomenon in order to make the image formation model (8) explicit, we do not need to introduce any other prior on $l$, and thus we use an improper prior
$$\mathcal{P}(l)=\text{constant}. \qquad (13)$$
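Regarding the depth map $z$, we opt for a minimal surface prior. Remark that
is an $\Omega_{HR}\to\mathbb{R}$ scalar field which maps each pixel to the area of the corresponding surface element. Thus $\|\mathrm{d}A_{z,\nabla z}\|_{\ell^1(\Omega_{HR})}$ is the total area of the surface, which motivates the minimal surface prior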
with α>0 a free parameter controlling smoothness. According to the Retinex theory, the reflectance ρ can be assumed piecewise constant. This yields a Potts prior
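with $\beta>0$ a scale parameter, and $\|\cdot\|_{\ell^0(\Omega_{HR})}$ the $\ell^0$ norm over $\Omega_{HR}$, which counts the pixels at which the reflectance gradient is non-zero,
where $|\cdot|_2$ is the Euclidean norm in $\mathbb{R}^6$.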
Replacing the maximisation of the posterior distribution (4) by the minimisation of its negative logarithm, combining Equations (4)-(6), (10), (12)-(16), and neglecting the additive constants, we end up with the variational model
with the following definitions of the weights:
We now describe an algorithm for effectively solving the variational problem (18), which is both non-smooth and nonconvex. In order to tackle the nonlinear dependency upon the depth and its gradient arising from shape-from-shading and minimal surface regularisation, we introduce an auxiliary variable θ:=(z, ∇z), then rewrite (18) as a constrained optimisation problem:
We then use a multi-block variant of ADMM to solve (20). Given the current estimates (ρ(k), l(k), θ(k), z(k)) at iteration (k), the variables are updated according to the following sweep:
where $u$ and $\kappa$ are a Lagrange multiplier and a step size, respectively. In our implementation, $\kappa$ is determined automatically using the varying penalty procedure. To solve the albedo sub-problem (21), we resort to primal-dual iterations. The lighting update (22) is solved using the pseudo-inverse. The $\theta$-update (23) comes down to a series of independent nonlinear optimisation problems (there is no coupling between neighbouring pixels, thanks to the ADMM strategy), which we solve using an implementation of the L-BFGS method, using the Moreau envelope of the $\ell^1$ norm to ensure differentiability. The depth update (24) requires solving a large sparse linear least-squares problem, which we tackle using conjugate gradient on the normal equations.
Although the overall optimisation problem (18) is nonconvex, recent works have demonstrated that, under mild assumptions on the cost function and with a small enough step size $\kappa$, nonconvex ADMM converges to a critical point. In practice, we found the proposed ADMM scheme to be stable and always observed convergence.
In our experiments, we use as initial guess: $\rho^{(0)}=I$, $l^{(0)}=[0,0,-1,0]^\top$, $z^{(0)}$ a smoothed (using bilinear filtering) version of a linear interpolation of the low-resolution input $z_0$, $\theta^{(0)}=(z^{(0)},\nabla z^{(0)})$, $u^{(0)}\equiv 0$ and $\kappa^{(0)}=10^{-4}$. In all our experiments, 10 to 20 global iterations $(k)$ were sufficient to reach convergence, which is evaluated through the relative residual between two successive depth estimates $z^{(k+1)}$ and $z^{(k)}$. On a recent laptop computer with an i7 processor, such a process requires around one minute (the code is implemented in Matlab, except for the albedo update, which is implemented in CUDA).
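As an illustration of a lighting update solved with a pseudo-inverse, the sketch below assumes that, with the reflectance and the shading vector field held fixed, the lighting vector minimises the sum of squared differences between the rendered and the observed image. Under that assumption, every pixel and channel contributes one linear equation in the lighting vector. The function name and array layout are choices made for this example.

```python
import numpy as np

def update_lighting(rho, m, image):
    """Least-squares lighting estimate for fixed reflectance and geometry.

    rho, image: H x W x C arrays; m: H x W x D array (D = 4 for first-order
    spherical harmonics). Returns the D-vector l minimising the squared
    difference between (l . m) * rho and the image.
    """
    # One equation per pixel and channel: rho_c(p) * m(p)^T l = I_c(p).
    A = (rho[..., :, None] * m[..., None, :]).reshape(-1, m.shape[-1])
    b = image.reshape(-1)
    return np.linalg.pinv(A) @ b
```

For full-resolution images, solving the same system with `np.linalg.lstsq` or via the small D x D normal equations would avoid forming the pseudo-inverse of a very tall matrix.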
We evaluated our variational approach to joint depth super-resolution and shape-from-shading against challenging synthetic and real-world datasets.
We first discuss the choice of the parameters involved in the variational problem (18). Although their optimal values can be deduced from the data statistics (see (19)), it can be difficult to estimate such statistics in practice and thus we rather consider μ, ν and λ as tuneable hyper-parameters. The formulae in (19) remain however insightful regarding the way these parameters should be tuned.
To select an appropriate set of parameters, we consider a synthetic dataset (the publicly available “Joyful Yell” 3D-shape) which we render under first-order spherical harmonics lighting ($l=[0,0,-1,0.2]^\top$) with three different reflectance maps. Additive zero-mean Gaussian noise with standard deviation 1% that of the original images is added to the high-resolution (640×480 px²) images. Ground-truth high-resolution and input low-resolution (320×240 px²) depth maps are rendered from the 3D-model. Non-uniform zero-mean Gaussian noise with standard deviation $10^{-3}$ times the squared original depth value (consistent with real-world measurements) is then added to the low-resolution depth map.
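The two noise models described above could be simulated as in the following sketch. Two interpretations are assumed here and may differ from the actual experimental setup: the image noise level is taken as 1% of the standard deviation of the noise-free image, and the depth noise is applied independently per pixel. The function names are hypothetical.

```python
import numpy as np

def add_image_noise(image, rng, relative_std=0.01):
    """Additive zero-mean Gaussian noise scaled by the image's standard deviation."""
    return image + relative_std * image.std() * rng.standard_normal(image.shape)

def add_depth_noise(z_lr, rng, scale=1e-3):
    """Zero-mean Gaussian noise with per-pixel standard deviation scale * z**2."""
    return z_lr + scale * z_lr ** 2 * rng.standard_normal(z_lr.shape)
```

The quadratic dependence on depth means that distant surfaces receive much stronger perturbations than nearby ones.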
Quantitative evaluation is carried out by evaluating the root mean squared error (RMSE) between the estimated depth and albedo maps and the ground-truth ones.
Initially, we chose
ν=2 and λ=1. Then, we evaluated the impact of varying each parameter, keeping the others fixed to these values found empirically. The impact of the parameters μ, ν and λ on the accuracy of the albedo and depth estimates are shown in
To highlight the benefit of joint shape-from-shading and super-resolution over shading-based depth refinement using the down-sampled image, we also show competing results. For a fair comparison, this time we use a scaling factor of 4 for all methods, i.e., the depth maps are rendered at 120×160 px². To evaluate the recovery of thin structures, we provide the mean angular error with respect to surface normals. The learning-based method obviously cannot hallucinate surface details since it does not use the color image. The image-based method does a much better job, but it is clearly outperformed by shading-based super-resolution.
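The two error measures used in this evaluation, the root mean squared error and the mean angular error between normal fields, can be computed as in the sketch below. The function names and the choice of degrees as the angular unit are assumptions made for this example.

```python
import numpy as np

def rmse(estimate, reference):
    """Root mean squared error between two arrays of identical shape."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mean_angular_error_deg(n_est, n_ref):
    """Mean angle, in degrees, between two normal fields of shape (..., 3)."""
    n_est = n_est / np.linalg.norm(n_est, axis=-1, keepdims=True)
    n_ref = n_ref / np.linalg.norm(n_ref, axis=-1, keepdims=True)
    # Clip guards against rounding errors that push the cosine outside [-1, 1].
    cos = np.clip(np.sum(n_est * n_ref, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```

The angular error is insensitive to a global depth offset and therefore complements the RMSE when judging how well fine surface detail is recovered.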
For real-world experiments, we use the Asus Xtion Pro Live sensor, which delivers 1280×1024 px² RGB and 640×480 px² depth images at 30 fps. Data are acquired in an indoor office with ambient lighting, and objects are manually segmented from the background before processing.
Combining depth super-resolution and shape-from-shading apparently resolves the low-frequency and high-frequency ambiguities arising in either of the inverse problems. Over-segmentation of reflectance may happen, but this does not seem to impact depth recovery. Whenever color gets saturated or too low, then minimal surface drives super-resolution, which adds robustness. Visual inspection confirms the superiority of the presented method.
Handling cases with smoothly-varying reflectance may require using, instead of the Potts prior, another prior for the reflectance, or actively controlling lighting. This has already been achieved in RGB-D sensing.
The foregoing descriptions are only implementation manners of the present invention; the scope of the present invention is not limited thereto. Variations or replacements can easily be made by a person skilled in the art. Therefore, the protection scope of the present invention should be subject to the protection scope of the attached claims.
Number | Date | Country | Kind |
---|---|---|---|
18171058.3 | May 2018 | EP | regional |