The present invention relates to the field of computer vision, and in particular, to non-line-of-sight imaging via neural transient field.
Non-line-of-sight (NLOS) imaging employs time-resolved measurements for recovering hidden scenes beyond the direct line of sight from a sensor. As an emerging computational imaging technique, NLOS imaging has been found to have broad applications in computer vision and computer graphics, ranging from recovering 3D shape of hidden objects to tracking hidden moving objects.
Most existing NLOS setups direct an ultra-fast pulsed laser beam towards a relay wall in the line of sight, where the wall diffuses the laser into spherical wavefronts propagating towards the hidden scene. As a wavefront hits the scene and bounces back onto the wall, a time-of-flight (ToF) detector with picosecond resolution (such as a streak camera or the more affordable single-photon avalanche diode (SPAD)) can be used to record the arrival time and the number of the returning photons. SPAD sensors operating in time-correlated single photon counting (TCSPC) mode produce transients in the form of histograms of photon counts versus time bins, where a single transient pixel corresponds to a specific pair of illumination and detection spots on the wall. The measured transients contain rich geometric information about the hidden scene, potentially useful for scene recovery. The process corresponds to a typical inverse imaging problem that generally incurs a high computational cost, since the transients are high-dimensional signals.
To solve the computation problem, the pioneering back-projection (BP) technique and its variations assume smooth objects so that scene recovery can be modeled as deconvolution. Alternatively, light-cone transform (LCT) based techniques collocate the illumination and sensing spots on the relay wall so that the forward imaging model can be simplified to a 3D convolution, where advanced signal processing techniques such as Wiener filtering can be used to further reduce noise. Assuming that the scene is nearly diffuse, analysis-by-synthesis algorithms can improve reconstruction. Fermat path based techniques can handle highly specular objects by simultaneously recovering the position and normal of Fermat points on the surface.
Existing NLOS methods can be categorized into confocal settings and non-confocal settings. For example, Kirmani et al. designed and implemented the first prototype non-confocal NLOS system and derived a linear time-invariant model amenable to multi-path light transport analysis. See A. Kirmani, T. Hutchison, J. Davis, and R. Raskar, “Looking around the corner using ultrafast transient imaging,” International Journal of Computer Vision, vol. 95, no. 1, pp. 13-28, 2011; A. Kirmani, J. Davis, and R. Raskar, “Looking around the corner using transient imaging,” in 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 159-166. In practice, varying both the laser beam and the measuring spots can yield a high-dimensional transient field analogous to the light field. Many efforts have focused on imposing priors and constraints to accelerate data processing. For example, Velten et al. proposed a back-projection technique with ellipsoidal constraints, where the observing point and the laser projection point on the wall correspond to the foci of a set of ellipsoids, each of which corresponds to a specific transient. See A. Velten, T. Willwacher, O. Gupta, A. Veeraraghavan, M. G. Bawendi, and R. Raskar, “Recovering three-dimensional shape around a corner using ultrafast time-of-flight imaging,” Nature Communications, vol. 3, no. 1, p. 745, 2012. The hidden scene can then be reconstructed by intersecting the ellipsoids. To further improve the reconstruction quality and speed, filtering techniques such as sharpening and thresholding have been applied. Alternatively, the scene can be directly modeled using parametric surfaces, with the parameters then optimized over the observations. For example, Ahn et al. model parameter fitting as a linear least-squares problem using a convolutional Gram operator. See B. Ahn, A. Dave, A. Veeraraghavan, I. Gkioulekas, and A. C. Sankaranarayanan, “Convolutional approximations to the general non-line-of-sight imaging operator,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7889-7899. It is also possible to adopt wave optics for NLOS imaging by characterizing the problem in terms of specific properties of a temporally evolving wave field in the Fourier domain.
To reduce the dimensions of data, several recent approaches have adopted a confocal setting where the laser and the detector (e.g., a SPAD) collocate, e.g., via a beam splitter.
Existing NLOS methods can also be categorized based on the form of the reconstruction results. The two most widely adopted forms are volume density and points/surfaces. Methods for recovering the volume density generally discretize the scene into voxels and then compute the density, either by using intersections of wavefronts under ellipsoidal and spherical constraints, or by modeling the imaging process as convolution and recovering the volume via specially designed deconvolution filters. Methods for recovering points/surfaces have relied on light transport physics for optimizing the shape and reflectance of the hidden scene. Such methods are generally mathematically tractable but computationally expensive, since higher-order geometry such as the surface normal needs to be integrated into the optimization process. Reconstruction results are either sparse, where only discontinuities in the transient are used, or rely heavily on the quality of the basis shape.
Existing methods have obtained relatively good NLOS reconstruction results; however, nonlinear phenomena in imaging, such as self-occlusion and non-Lambertian surface reflectance, are often not taken into account, resulting in failure to recover fine details of the NLOS scene. This issue is addressed by the novel volumetric NLOS imaging framework neural transient field (NeTF) provided in the present invention, which models the transient field via deep networks. In NeTF, the volumetric transient field is formulated under spherical coordinates, and a trained multi-layer perceptron (MLP) is devised to predict per-voxel density and view-dependent albedo. Different from the prior art, the trained MLP provides a continuous 5D representation of the hidden scene without discretizing the NLOS volume or optimizing surface parameters, and can handle view-dependent albedo of non-Lambertian surface reflectance and strong self-occlusions, under both confocal and non-confocal setups.
It is to be noted that the above information disclosed in this Background section is only for facilitating the understanding of the background of this invention, and may contain information that is not known to a person of ordinary skill in the art.
In view of the limitations of existing technologies described above, the present invention provides a computer-implemented method for imaging a non-line-of-sight scene to address the aforementioned limitations. Additional features and advantages of this invention will become apparent from the following detailed descriptions.
One aspect of the present invention is directed to a computer-implemented method for imaging a non-line-of-sight (NLOS) scene. The method may include: encoding, by a computing system, a neural transient field onto a Multi-Layer Perceptron (MLP), wherein the neural transient field represents the NLOS scene as a continuous 5D function of transients; feeding a plurality of transient pixels captured by a time-resolved detector from a plurality of detection spots on a relay wall to the MLP; outputting a volume density and a surface reflectance along a direction by the MLP in accordance with the plurality of transient pixels; and reconstructing the NLOS scene in accordance with the volume density and the surface reflectance.
In some embodiments, each transient pixel may be parameterized using spherical coordinates with respect to a detection spot on the relay wall. The method may further include: transforming the spherical coordinates of the transient pixels into corresponding Cartesian coordinates.
In some embodiments, the method may further include: employing a positional encoding (PE) technique to map each transient pixel to a multi-dimensional Fourier domain.
In some embodiments, the dimension of the Fourier domain may be in a range of 4 to 10.
In some embodiments, the MLP may comprise nine 256-channel layers and one 128-channel layer.
In some embodiments, the method may further include: outputting a feature vector.
In some embodiments, the method may further include: reconstructing the NLOS scene in accordance with

τ(x′, y′, t) = Γ0 ∬_{Ωct/2} σ(r, θ, ϕ; x′, y′) ρ(r, θ, ϕ; x′, y′) (sin θ/r²) dθ dϕ,

wherein τ(x′, y′, t) represents the transient at a time instant t; Γ0 = Aar0²EP/π corresponds to a constant term of particle cross-section A, particle radius a, initial energy EP, and patch radius r0; the integration domain Ωct/2 is a hemisphere centered at the detection spot P(x′, y′) on the relay wall with a radius of r = ct/2; θ and ϕ are angles of elevation and azimuth in the direction from P(x′, y′) to the transient pixel; σ(r, θ, ϕ; x′, y′) represents the density; and ρ(r, θ, ϕ; x′, y′) models the surface reflectance.
In some embodiments, the method may further include: reconstructing the NLOS scene in accordance with
In some embodiments, the method may further include: reconstructing the NLOS scene in accordance with

τ(P, P′, t) = ∭_E EP′ σ(μ, ν, φ) ρ(μ, ν, φ) δ(r1 + r2 − ct) J dμ dν dφ,

wherein τ(P, P′, t) represents the transient at the time instant t; P and P′ represent an illumination spot and the detection spot on the relay wall, respectively, with respect to an NLOS scene point Q; r1 and r2 correspond to the distances from P to Q and from Q to P′, respectively; the length of the optical path γ: P→Q→P′ equals r1 + r2 = ct; the focal length is γ = |OP − OP′| (OP and OP′ denoting position vectors); EP′ represents the energy received at P′ from Q; (μ, ν, φ) are ellipsoidal coordinates; E is the semi-ellipsoid with foci P and P′; and J represents the Jacobian from the Cartesian coordinates.
In some embodiments, the method may further include: reconstructing the NLOS scene in accordance with

τ(P, P′, t) = ∬ σ(μ, ν, φ) ρ(μ, ν, φ) (J/(r1² r2²)) dν dφ,

wherein μ = arccosh(ct/2γ).
In some embodiments, the method may further include: predicting the plurality of transient pixels based on the volume density and the surface reflectance.
In some embodiments, the method may further include: calculating a loss function between the estimated transient pixels and the captured transient pixels.
In some embodiments, the method may further include: capturing a plurality of new transient pixels by the time-resolved detector from a plurality of new detection spots on the relay wall in accordance with the loss function as a probability density function (PDF); feeding the plurality of new transient pixels captured to the MLP; outputting the volume density and the surface reflectance along the direction by the MLP in accordance with the plurality of new transient pixels; and reconstructing the NLOS scene in accordance with the volume density and the surface reflectance.
In some embodiments, a decrease rate of the loss function may satisfy (Li − Li+1)/Li < 10⁻⁴, wherein Li represents the loss function at the ith training iteration.
In some embodiments, the method may further include: selecting a plurality of first transient pixels from the plurality of transient pixels and sampling the first transient pixels by the MLP; predicting the plurality of first transient pixels based on the volume density and the surface reflectance; calculating the loss function between the estimated first transient pixels and the captured first transient pixels; and selecting a plurality of second transient pixels from the plurality of transient pixels and sampling the plurality of second transient pixels in accordance with the loss function as the probability density function (PDF).
In some embodiments, the method may further include: outputting the volume density and the surface reflectance along the direction by the MLP in accordance with the plurality of second transient pixels; and reconstructing the NLOS scene in accordance with the volume density and the surface reflectance.
In some embodiments, the selecting a plurality of second transient pixels from the plurality of transient pixels may further include: employing a Markov chain Monte Carlo (MCMC) algorithm in accordance with

τf(x′, y′, t) = (Γ0/Nf) Σi,j σ(r, θf,ij, ϕf,ij; x′, y′) ρ(r, θf,ij, ϕf,ij; x′, y′) sin θf,ij/(r² K(θf,ij, ϕf,ij)),

wherein τf(x′, y′, t) is the transient based on the second transient pixels; Nf is the number of the second transient pixels; K(θf,ij, ϕf,ij) is the probability density function; σ(r, θf,ij, ϕf,ij; x′, y′) represents the density; and ρ(r, θf,ij, ϕf,ij; x′, y′) models the surface reflectance.
In some embodiments, the method may further include: reconstructing the NLOS scene using the first transient pixels and the second transient pixels in accordance with:

τ(x′, y′, t) = τc(x′, y′, t) + τf(x′, y′, t),

wherein τc(x′, y′, t) is the transient based on the first transient pixels.
The foregoing general description and the following detailed description are merely examples and explanations and do not limit the present invention.
The accompanying drawings, which are incorporated in and constitute a part of the description, illustrate embodiments consistent with this invention and, together with the description, serve to explain the disclosed principles. It is apparent that these drawings present only some embodiments of this invention and those of ordinary skill in the art may obtain drawings of other embodiments from them without exerting any creative effort.
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, these exemplary embodiments can be implemented in many forms and should not be construed as being limited to those set forth herein. Rather, these embodiments are presented to provide a full and thorough understanding of this invention and to fully convey the concepts of the exemplary embodiments to others skilled in the art.
In addition, the described features, structures, and characteristics may be combined in any suitable manner in one or more embodiments. In the following detailed description, many specific details are set forth to provide a more thorough understanding of this invention. However, those skilled in the art will recognize that the various embodiments can be practiced without one or more of the specific details or with other methods, components, materials, or the like. In some instances, well-known structures, materials, or operations are not shown or not described in detail to avoid obscuring aspects of the embodiments.
The present invention is inspired by the recent multi-view reconstruction framework Neural Radiance Field (NeRF), which aims to recover the density and color at every point along every ray, implicitly providing a volumetric reconstruction. Different from existing multi-view stereo (MVS) techniques, NeRF adopts a volume rendering model and sets out to optimize the volume density that best matches the observations using an MLP. Additionally, NeRF can be modified to tackle photometric stereo (PS) problems, where the cameras are fixed but the lighting conditions vary.
It is observed that the non-confocal NLOS imaging process greatly resembles MVS/PS. In particular, fixing the laser beam and measuring the transient at different spots on the wall resembles MVS, while fixing the measuring spot and varying the laser beam resembles PS. Additionally, the confocal setting is very similar to a NeRF setting in which the lighting and the cameras move consistently. Therefore, a similar deep learning technique is provided in the present invention for scene recovery, and this reconstruction scheme is called Neural Transient Field, or NeTF.
Both NeTF and NeRF use MLP as an optimizer. However, there are several major differences between NeRF and NeTF.
With respect to the volume rendering model, neural radiance field L(x, y, z, θ, ϕ) is used in the NeRF framework as scene representation, where (x, y, z) corresponds to a point on a ray and (θ, ϕ) corresponds to the direction of the ray. The outputs of the trained network are the density σ at every position (x, y, z) and the view-dependent color c=(r, g, b) along a direction (θ, ϕ). The density can be further used for scene reconstruction and the color can be used for image-based rendering.
In the present invention, instead of sampling along a single camera ray, a hemisphere of rays is sampled in NeTF, since light propagates as a spherical wave from the relay wall towards the hidden scene. Referring to the accompanying drawings, the neural transient field is defined as:

LNLOS(x′, y′, r, θ, ϕ) = (σ, ρ),
where P(x′, y′) is a detection spot on the wall that serves as the origin of the hemisphere, and Q(r, θ, ϕ) is a scene point parameterized using the spherical coordinate (r, θ, ϕ) with respect to P(x′, y′). Similar to NeRF, a fully connected neural network (i.e., an MLP) is designed to estimate LNLOS. Different from NeRF, the outputs of LNLOS in NeTF are volume density σ and surface reflectance (albedo ρ), rather than color along the direction (θ, ϕ).
Scanning different spots on the relay wall in NLOS imaging can result in inconsistent spherical coordinates and pose challenges in network training and inference. Therefore, the spherical coordinates (x′, y′, r, θ, ϕ) are first transformed to their corresponding Cartesian coordinates (x, y, z, θ, ϕ) via a transform R as:

(x, y, z) = R(x′, y′, r, θ, ϕ) = (x′ + r sin θ cos ϕ, y′ + r sin θ sin ϕ, r cos θ),

where the relay wall is assumed to lie in the z = 0 plane. The transform R ensures that the position of a 3D voxel is consistent when scanned from different detection spots. All subsequent MLP training is conducted under the Cartesian coordinates for density and view-dependent albedo inference.
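For illustration, a minimal sketch of this transform follows, assuming the relay wall lies in the z = 0 plane and θ is measured from the wall normal, consistent with the formulation above; the function name is illustrative rather than part of the claimed method:

```python
import numpy as np

def spherical_to_cartesian(xp, yp, r, theta, phi):
    """Transform R: map a sample (r, theta, phi) anchored at detection spot
    P(x', y') on the relay wall (z = 0 plane) to global Cartesian coordinates.
    theta: elevation measured from the wall normal (z-axis); phi: azimuth."""
    x = xp + r * np.sin(theta) * np.cos(phi)
    y = yp + r * np.sin(theta) * np.sin(phi)
    z = r * np.cos(theta)
    return x, y, z
```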
Like NeRF, a key benefit of NeTF is that there is no need to discretize the scene into a fixed-resolution volume representation. Instead, the deep network representation can provide scene reconstructions at an arbitrary resolution and recover fine details largely missing in the prior art.
In the present invention, the NLOS reconstruction problem is reformulated as a forward model under the NeTF representation. Under the confocal setting, the illumination and detection collocate at the same spot P(x′, y′) on a relay wall and produce a spherical wave anchored at the spot. The transient τiso(x′, y′, t) recorded at each spot P(x′, y′) is the summation of photons that are reflected back at a specific time instant t from the NLOS scene in the 3D half-space Ω as:

τiso(x′, y′, t) = ∭_Ω (g(x, y, z) ρiso(x, y, z)/r⁴) δ(r − tc/2) dx dy dz,    (Eqn. 4)

where c is the speed of light, r is the distance between the wall and the NLOS scene with r = √((x′−x)² + (y′−y)² + z²) = tc/2, 1/r⁴ is the light fall-off term, and ρiso(x, y, z) is the albedo of an NLOS point (x, y, z being the spatial coordinates of the point). The function g models the time-independent effects, including the surface normal, bidirectional reflectance distribution functions (BRDFs), occlusion patterns, etc. The Dirac delta function relates the time of flight t to the distance r.
The function g makes the imaging process nonlinear. To solve this problem, existing linear approximation schemes adopt g=1, assuming that the NLOS scene scatters isotropically and that no occlusions occur within the NLOS scene. Such assumptions, however, restrict NLOS scenes to be Lambertian and convex. In contrast, NeTF, by adopting a deep network to model the imaging process, can tackle non-linearity without imposing explicit constraints on g.
In the NLOS setting in the present invention, photons travel along spherical wavefronts. When reaching either the relay wall or the hidden surface, the photons are reflected and then continue to propagate along a hemisphere. Since the scattering equation serves as the foundation for volume rendering under NeRF, in order to specify how much an NLOS point in the hemisphere contributes to the transient through photon propagation, a photon version of the scattering equation is derived in NeTF.
The attenuation coefficient along the path from P to Q can be computed as exp(−A∫₀^r σ(s)ds), where A is the cross-section area of a particle and σ is the volume density.
When considering the reflection at Q and assuming that the cross-section is thin enough (e.g., dr = 2a), the radiant energy at Q, attenuated due to absorption and reflection with respect to the reflectance ρ, can be defined as:
On the returning path from Q to P, the wavefront-forming hemisphere is centered at Q with radius r. The spot P with area πr0² receives the photons returning to the relay wall, and the energy received at P with respect to the solid angle dΩ can be defined as:
By integrating Eqn. 8 with respect to the solid angle dΩ, the energy received at the detection spot P(x′, y′) in the hemisphere at a time instant t can be defined as:

τ(x′, y′, t) = Γ0 ∬_{Ωct/2} exp(−2A∫₀^{ct/2} σ(s; x′, y′)ds) σ(r, θ, ϕ; x′, y′) ρ(r, θ, ϕ; x′, y′) (sin θ/r²) dθ dϕ.    (Eqn. 9)
Eqn. 9 serves as the forward imaging model in NeTF, which essentially maps an NLOS point Q(r, θ, ϕ) to a transient τ detected at a spot P(x′, y′) on a diffuse surface at a time instant t. For clarity, σ(r, θ, ϕ; x′, y′) is abbreviated as σ(r, θ, ϕ), and ρ(r, θ, ϕ; x′, y′) as ρ(r, θ, ϕ). Therefore, Eqn. 9 can be rewritten as:

τ(x′, y′, t) = Γ0 ∬_{Ωct/2} exp(−2A∫₀^{ct/2} σ(s)ds) σ(r, θ, ϕ) ρ(r, θ, ϕ) (sin θ/r²) dθ dϕ,    (Eqn. 11)

where the constant Γ0 = Aar0²EP/π is determined by the particle cross-section A, particle radius a, initial energy EP, and patch radius r0. The integration domain Ωct/2 is a hemisphere centered at P(x′, y′) on the relay wall, with a radius of r = ct/2. θ and ϕ are the elevation and azimuth angles in the viewing direction from P(x′, y′) to an NLOS point, equivalent to those in the direction of reflection from the NLOS scene. ρ(r, θ, ϕ; x′, y′) models the view-varying BRDFs of the NLOS scene. The term exp(−2A∫₀^{ct/2} σ(s)ds) accounts for attenuation along the two-way optical path between the relay wall and the NLOS scene.
The forward plenoptic transient field model in Eqn. 11 is computationally expensive if the MLP is used for training. Assuming that the NLOS scene is fully opaque and does not exhibit self-occlusion, the formulation can be further simplified as:

τ(x′, y′, t) = Γ0 ∬_{Ωct/2} σ(r, θ, ϕ) ρ(r, θ, ϕ) (sin θ/r²) dθ dϕ.    (Eqn. 12)
Such a simplified formulation can reduce computations and is used in the examples illustrated in the accompanying drawings.
Although both NeRF and NeTF derive the forward model based on volume rendering, NeRF models a ray propagating along a line (i.e., with a cylinder between two points), while NeTF models spherical wavefront propagation (i.e., with a cone model that accounts for attenuation). In addition, the volume rendering model used in NeRF only considers one-way accumulation, i.e., how light travels through light-emitting particles towards the camera sensor. In contrast, NeTF in the present invention adopts a two-way propagation model, i.e., how light illuminates the scene and how the scene illuminates the wall.
The forward model provided in the present invention is differentiable. Therefore, the continuous integral Eqn. 12 can be numerically computed using quadrature as:

τ(x′, y′, t) = Γ0 Σi Σj σ(r, θij, ϕij) ρ(r, θij, ϕij) (sin θij/r²) ΔθΔϕ,
Q(r, θij, ϕij) stands for scene points uniformly sampled along the hemispherical wavefronts. These points are transformed into the corresponding Cartesian coordinates and serve as the inputs to the MLP. The outputs of the network are the density and reflectance at each point. All outputs are then summed according to the quadrature above to yield the transients predicted by the neural transient field.
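A non-authoritative sketch of this quadrature follows; positional encoding is omitted for brevity, the MLP is assumed to map raw 5D samples to per-point (σ, ρ) tensors, and the constant Γ0 is dropped:

```python
import torch

def predict_transient(mlp, xp, yp, r, n=32):
    """Quadrature over the hemisphere of radius r = c*t/2 centered at
    P(x', y'): sum sigma * rho * sin(theta) / r^2 over an n x n grid of
    (theta, phi), weighted by dtheta * dphi (Eqn. 12)."""
    theta = torch.linspace(0.0, torch.pi / 2, n)        # elevation
    phi = torch.linspace(0.0, 2 * torch.pi, n)          # azimuth
    th, ph = torch.meshgrid(theta, phi, indexing="ij")
    # Transform R: spherical samples to Cartesian, wall in the z = 0 plane.
    x = xp + r * torch.sin(th) * torch.cos(ph)
    y = yp + r * torch.sin(th) * torch.sin(ph)
    z = r * torch.cos(th)
    inp = torch.stack([x, y, z, th, ph], dim=-1).reshape(-1, 5)
    sigma, rho = mlp(inp)                               # per-point outputs
    sigma, rho = sigma.reshape(-1), rho.reshape(-1)
    vals = sigma * rho * torch.sin(th).reshape(-1) / r**2
    dtheta, dphi = (torch.pi / 2) / n, (2 * torch.pi) / n
    return vals.sum() * dtheta * dphi                   # Gamma_0 omitted
```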
NeTF is further optimized by minimizing the following l2-norm loss function, which measures the difference between the predicted transients τ(x′, y′, t) and the measured transients τm(x′, y′, t):

L = Σ(x′, y′, t) ‖τ(x′, y′, t) − τm(x′, y′, t)‖²,
The use of MLP allows the minimization of arbitrary losses as long as they are differentiable with respect to the predicted τ(x′, y′, t), although l2-norm is most commonly adopted as in NeRF.
The NeTF forward model can represent the plenoptic transient field using an MLP. However, the data acquired in NeTF is quite different from that in NeRF. In NeRF, a dense set of high-resolution images is generally required to produce satisfactory density estimation and view interpolation. Under such a dense viewpoint setting, the problem of occlusions is less significant, as there is a sufficient number of views capturing the occluded points to ensure reliable reconstruction. In NeTF, however, the SPAD only captures a sparse set of spots on the wall, and an occluded point may be captured from only a very small number of viewpoints (spots). Consequently, occlusion can lead to strong reconstruction artifacts if not handled properly. Such sampling bias resembles the long-tailed classification problem in machine learning, where one solution is to resample the dataset to achieve a more balanced distribution by over-sampling the minority classes. Therefore, a two-stage training strategy along with a hierarchical sampling technique is developed to address this sampling bias.
With respect to the two-stage training, the loss calculated in the first stage of the training process is used to guide resampling for the second stage. In particular, the training is first conducted using all samples to obtain an initial reconstruction, and the loss between the predicted transients and the measured transients is calculated at every detection spot on the relay wall. It is observed that spots corresponding to a high loss imply undersampling. Therefore, the calculated loss is normalized to form a probability density function (PDF) whose integral over the whole domain equals 1. Next, the detection spots are resampled using this PDF, where a higher loss corresponds to a higher PDF value and thus indicates that denser sampling is required. A new training dataset is built using this sampling scheme, and the network is subsequently retrained to refine the reconstruction. These two stages may be iterated until convergence. Specifically, during the training process, the loss decreases progressively more slowly; when the decrease is slow enough, the network is deemed to have converged. For example, denoting Li as the loss at the ith training iteration, the network is deemed converged once (Li − Li+1)/Li < 10⁻⁴ is satisfied. The two-stage training process provides a viable solution to tackling imbalanced sampling for a more accurate reconstruction.
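A minimal sketch of the loss-guided resampling and the convergence test described above, assuming per-spot losses are collected in a NumPy array (function names are illustrative):

```python
import numpy as np

def resample_detection_spots(per_spot_loss, num_samples, rng=None):
    """Normalize per-detection-spot losses into a discrete PDF and draw a
    new set of spot indices (with replacement), so that undersampled spots
    (high loss) are visited more densely in the second training stage."""
    rng = rng or np.random.default_rng()
    pdf = per_spot_loss / per_spot_loss.sum()   # sum (integral) equals 1
    return rng.choice(per_spot_loss.size, size=num_samples, p=pdf)

def has_converged(loss_prev, loss_curr, tol=1e-4):
    """Stop iterating the two stages once (L_i - L_{i+1}) / L_i < tol."""
    return (loss_prev - loss_curr) / loss_prev < tol
```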
Denser samples can produce a reconstruction of higher quality, but they may lead to a much higher computational overhead. For example, uniformly sampling L hemispherical wavefronts at each detection spot and N² scene points on each wavefront results in a training process with a computational complexity of O(N²L). The parameter N can be tuned and usually has a value ranging from 8 to 64; a larger N leads to better imaging resolution but also more memory consumption. It is observed that under the confocal setting, spherical wavefronts only intersect with a very small portion of the NLOS scene. These wavefronts tend to converge at specific patches that contribute greatly to the final integral, while the contributions from the other portions are negligible. Thus, hierarchical sampling is adopted in NeTF, but the hierarchical sampling scheme in NeTF is different from that in NeRF. As discussed above, NeRF calculates the integral along a ray, i.e., using 1D sampling, while NeTF calculates it on a hemisphere, i.e., using 2D sampling. Therefore, a coarse-to-fine sampling scheme is developed.
Specifically, Nc² uniform scene points are first sampled in the hemisphere and the coarse network is evaluated to obtain the estimated PDF K(θ, ϕ). Then the Metropolis-Hastings algorithm with a conditional Gaussian proposal distribution is employed for the state transitions of a Markov chain, producing a fine sampling of Nf scene points with the PDF K(θf,ij, ϕf,ij) along the hemispherical wavefronts that intersect with the NLOS scene. Generally speaking, fine points gather around the coarse points with high volume density. After the coarse sampling is done, the density of the object at each sampling point is known from the neural network. Sampling points with very low density can be deemed irrelevant to the NLOS volume because they are not on the object, while sampling points with high density are deemed relevant to the NLOS volume.
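The following sketch illustrates Metropolis-Hastings sampling on the hemisphere with a conditional Gaussian proposal, targeting the coarse-stage PDF estimate K(θ, ϕ); the initial state, step size, and function names are assumptions:

```python
import numpy as np

def mh_hemisphere_samples(K, n_fine, step=0.1, seed=0):
    """Metropolis-Hastings sampling of (theta, phi) on the hemisphere with a
    conditional Gaussian proposal centered at the current state; K(theta, phi)
    is the coarse-stage PDF estimate. Out-of-domain proposals are rejected."""
    rng = np.random.default_rng(seed)
    theta, phi = np.pi / 4, np.pi                    # arbitrary initial state
    samples = []
    while len(samples) < n_fine:
        t_prop = theta + step * rng.standard_normal()      # Gaussian proposal
        p_prop = (phi + step * rng.standard_normal()) % (2 * np.pi)
        if 0.0 <= t_prop <= np.pi / 2:               # stay on the hemisphere
            # Accept with probability min(1, K(proposal) / K(current)).
            if rng.random() < min(1.0, K(t_prop, p_prop)
                                  / max(K(theta, phi), 1e-12)):
                theta, phi = t_prop, p_prop
        samples.append((theta, phi))                 # record the current state
    return np.asarray(samples)
```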
Finally, the coarse and fine samples are combined as Nc² + Nf to improve the reconstruction quality of the fine network. Specifically, the energy received at the detection spot P(x′, y′) in the hemisphere at a time instant t can be redefined as:

τ(x′, y′, t) = τc(x′, y′, t) + τf(x′, y′, t),

where τc(x′, y′, t) is the integral with the coarse samples Nc², and τf(x′, y′, t) is estimated with the fine samples Nf from MCMC as:

τf(x′, y′, t) = (Γ0/Nf) Σi,j σ(r, θf,ij, ϕf,ij) ρ(r, θf,ij, ϕf,ij) sin θf,ij/(r² K(θf,ij, ϕf,ij)).
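Under the importance-sampling reading of the formula above, τf may be sketched as a weighted Monte-Carlo average over the Nf fine samples; per-sample arrays and the function name are illustrative assumptions:

```python
import numpy as np

def fine_transient(sigma, rho, theta_f, K_vals, r, gamma0=1.0):
    """Importance-sampled estimate of tau_f: the mean over Nf MCMC samples of
    sigma * rho * sin(theta) / r^2, each weighted by 1 / K(theta, phi)."""
    weights = np.sin(theta_f) / (r**2 * K_vals)   # 1/K importance weights
    return gamma0 * np.mean(sigma * rho * weights)
```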
The hierarchical sampling in NeTF is intrinsically differentiable. Previous volume-based methods can theoretically apply such a hierarchical sampling technique to refine their reconstruction. However, these methods in practice use an explicit volumetric representation with a fixed resolution, making resampling on the hemisphere intractable.
The NeTF implementation and experimental validations are further described below.
The NeTF is trained using an MLP.
First, the spatial coordinates (x, y, z) and the viewing direction (θ, ϕ) are normalized to the range [−1, 1]. Next, the positional encoding (PE) technique is applied, and each input is mapped from 1 dimension onto a 10-dimensional Fourier domain to represent high-frequency variation in geometry and reflectance. Other dimensions may be used, with a preferred range between 4 and 10. Third, the coordinates (x, y, z) are processed by the MLP as inputs with eight 256-channel layers, and a 256-dimensional feature vector is output. For the reconstruction of a complex object, the channel size should be large enough to represent the object, preferably larger than 128. The coordinates (x, y, z) are also concatenated to the fourth layer as a skip connection. Finally, this feature vector is passed to an additional 256-channel layer to produce σ. Simultaneously, the feature vector is concatenated with the direction (θ, ϕ) and passed to the 128-channel layer to produce the reflectance ρ.
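A sketch of the described encoding and network in PyTorch is given below; the dyadic Fourier frequencies, activation functions, and exact head layout are assumptions not specified in the text:

```python
import torch
import torch.nn as nn

def positional_encoding(x, n_freq=5):
    """Map each input scalar onto a 2*n_freq-dimensional Fourier domain
    (sin/cos at dyadic frequencies); n_freq=5 gives the 10-dimensional
    mapping per input described in the text."""
    feats = [torch.sin((2.0 ** k) * torch.pi * x) for k in range(n_freq)]
    feats += [torch.cos((2.0 ** k) * torch.pi * x) for k in range(n_freq)]
    return torch.cat(feats, dim=-1)

class NeTFMLP(nn.Module):
    """Eight 256-channel layers over the encoded position with a skip
    connection at the fourth layer, an extra 256-channel layer producing
    the density sigma, and a 128-channel layer producing the reflectance
    rho from the feature vector and the encoded direction."""
    def __init__(self, pos_dim=30, dir_dim=20, width=256):
        super().__init__()
        dims = [pos_dim] + [width] * 8
        self.layers = nn.ModuleList(
            nn.Linear(dims[i] + (pos_dim if i == 4 else 0), dims[i + 1])
            for i in range(8))
        self.sigma_branch = nn.Sequential(          # additional 256-ch layer
            nn.Linear(width, width), nn.ReLU(), nn.Linear(width, 1))
        self.rho_branch = nn.Sequential(            # 128-ch direction head
            nn.Linear(width + dir_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, pos_enc, dir_enc):
        h = pos_enc
        for i, layer in enumerate(self.layers):
            if i == 4:                              # skip connection
                h = torch.cat([h, pos_enc], dim=-1)
            h = torch.relu(layer(h))
        sigma = self.sigma_branch(h)                # volume density
        rho = self.rho_branch(torch.cat([h, dir_enc], dim=-1))
        return sigma, rho
```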
Under the NLOS setting, a batch size of 1 to 4 transients is considered, and 32×32 or 64×64 samples are employed for both the uniform sampling Nc² and the MCMC sampling Nf on the hemisphere. The Adam optimizer is adopted with hyperparameters β1 = 0.9 and ε = 1e−7. In the experiments, the learning rate begins at 1e−3 and decays exponentially to 1e−4 through the optimization. The training time of NeTF shares certain similarities with NeRF. In NeRF, the training cost depends on how densely each ray is sampled. In NeTF, the training cost depends on two factors, namely how densely the radius of the hemisphere is sampled (i.e., the number of layers) and how densely each layer/hemisphere is sampled. For the Bunny scene, on a single GeForce RTX 3090 GPU, the training takes 10 hours using 200 layers with 32×32 samples on each layer (5 epochs, batch size 4). The training time quadruples with the same number of layers but 64×64 samples per layer.
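A hypothetical optimizer setup matching the stated hyperparameters, reusing the NeTFMLP sketch above; β2 and the number of steps are assumptions, and the decay factor is chosen so the learning rate reaches 1e−4 at the end of training:

```python
import torch

model = NeTFMLP()
# Adam with beta1 = 0.9 and eps = 1e-7 as stated; beta2 left at its
# PyTorch default (an assumption, as it is not given in the text).
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-7)
total_steps = 100_000                    # illustrative; not from the text
gamma = (1e-4 / 1e-3) ** (1.0 / total_steps)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
# Inside the training loop: optimizer.step(); scheduler.step()
```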
The NeTF approach has been validated on two public NLOS datasets: the simulated ZNLOS dataset and the real Stanford dataset. ZNLOS consists of multi-bounce transients of synthetic objects that are 0.5 m away from the relay wall. The transients have a temporal resolution of 512 time bins with a width of 10 ps and a spatial resolution of 256×256 pixels. The Stanford dataset captures transients measured in real scenes that are 1.0 m away from the relay wall. The transients in this dataset have a temporal resolution of 512 time bins with a width of 32 ps and a spatial resolution of 512×512 or 64×64 pixels. Quantitative and qualitative comparisons have been conducted between NeTF and state-of-the-art (SOTA) methods.
For the ZNLOS dataset, experiments have been conducted on several simulated hidden objects, including Bunny, Lucy, and Indonesian, with a spatial resolution of 256×256 pixels corresponding to an area of size 1 m×1 m on the relay wall. All three models are diffuse. Bunny is not placed on a floor, while Lucy and Indonesian are. For the Stanford dataset, experiments have been conducted on three real hidden objects made of different materials, including a diffuse Statue, a glossy Dragon, and a metal Bike. Their spatial resolution is 512×512 spots but is down-sampled to 256×256.
To test NeTF under the non-confocal setting, experiments have been conducted on two additional objects from ZNLOS, i.e., the letter Z and the Bunny, with their transients simulated under non-confocal setups.
To further test the robustness of NeTF versus SOTA methods on occlusions, experiments have been conducted on a semi-occluded scene from ZNLOS using Eqn. 11.
Table 1 and Table 2 show that NeTF achieves accuracy comparable to the state-of-the-art (SOTA) in terms of Mean Absolute Error (MAE), demonstrating the feasibility and efficacy of deep neural networks for NLOS imaging under both confocal and non-confocal settings.
Table 1 compares the reconstruction error of NeTF and SOTA methods on three confocal NLOS datasets, measured by MAE. Under the MAE metric, the benefit of using NeTF does not seem significant. However, MAE does not fully reflect the reconstruction quality. For example, Phasor Field produces the highest MAE on Indonesian, indicating the lowest reconstruction quality, yet it manages to recover many fine details largely missing in F-K and DLCT, as shown in the accompanying drawings.
Table 2 compares the reconstruction error of NeTF and SOTA methods on two non-confocal NLOS datasets, measured by MAE. As noted previously, a low MAE does not sufficiently reflect reconstruction quality. For example, for the letter Z scene, NeTF performs slightly worse than FBP with respect to MAE but better preserves the silhouettes, as shown in the accompanying drawings.
LCT can be formulated as a simplified NeTF model. First, the forward model Eqn. 12 is rewritten as a triple integral with the Dirac delta function that correlates the time of flight t with the distance r:

τ(x′, y′, t) = Γ0 ∭_Ω σ(r, θ, ϕ) ρ(r, θ, ϕ) (sin θ/r²) δ(r − ct/2) dr dθ dϕ,    (Eqn. 17)
where the integral domain Ω is defined under the spherical coordinates. Eqn. 17 is consistent with the light-cone transform (LCT) model, and can be rewritten under the Cartesian coordinates, where dx dy dz = r² sin θ dr dθ dϕ, as:

τ(x′, y′, t) = Γ0 ∭_Ω (σ(x, y, z) ρ(x, y, z, θ, ϕ)/r⁴) δ(r − ct/2) dx dy dz.    (Eqn. 18)
If the diffuse and isotropic albedo is defined as ρiso(x, y, z) = σ(x, y, z)ρ(x, y, z, θ, ϕ), Eqn. 18 degenerates to the LCT model (which equals Eqn. 4 with g = 1):

τiso(x′, y′, t) = Γ ∭_Ω (ρiso(x, y, z)/r⁴) δ(r − ct/2) dx dy dz,

where Γ = Aar0²EP/π. Under the non-confocal setting, exp(−A∫γ σ(s)ds) corresponds to the attenuation coefficient along the optical path γ: P→Q→P′ with length r1 + r2 = ct.
To compute the complete transient received at P′ from P, it should be noted that P′ is radiated by all points lying on a semi-ellipsoid E with foci P and P′, a semi-major axis of length α = ct/2, focal length γ = |OP − OP′| (OP and OP′ denoting position vectors), and eccentricity e = γ/α. For simplicity, the coordinate system can be set up so that P and P′ are symmetric about the origin O with PP′ parallel to the y-axis. Thus the transient can be computed as:
Since Eqn. 19 is integrated on the semi-ellipsoid E but under spherical coordinates centered at P, E needs to be rewritten under ellipsoidal coordinates with foci P and P′. Specifically, with θ measured from the direction of PP′, the ellipsoid is represented in terms of r1 and θ as:

r1 = (α² − γ²)/(α − γ cos θ),    (Eqn. 20)

which follows from the law of cosines r2² = r1² + 4γ² − 4r1γ cos θ together with r1 + r2 = 2α.
Then Eqn. 20 is transformed to:
Next, the spherical coordinates (r1, θ, ϕ) are transformed to ellipsoidal coordinates (μ, ν, φ) as:

x = γ sinh μ sin ν cos φ, y = γ cosh μ cos ν, z = γ sinh μ sin ν sin φ,    (Eqn. 23)

so that r1 = γ(cosh μ − cos ν), r2 = γ(cosh μ + cos ν), and r1 + r2 = 2γ cosh μ.
The Jacobian J from the Cartesian to the ellipsoidal coordinates is:

J = γ³ sinh μ sin ν (sinh² μ + sin² ν).    (Eqn. 24)
Spherical coordinates can be mapped to ellipsoidal coordinates via J as:

r1² sin θ dr1 dθ dϕ = dx dy dz = J dμ dν dφ.    (Eqn. 25)
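The transform and Jacobian can be sanity-checked numerically; the sketch below assumes foci at (0, ±γ, 0), which makes r1 + r2 = 2γ cosh μ consistent with μ = arccosh(ct/2γ):

```python
import numpy as np

def ellipsoidal_to_cartesian(mu, nu, phi, gamma):
    """Prolate-ellipsoidal coordinates with foci assumed at (0, +/-gamma, 0),
    i.e., PP' parallel to the y-axis and symmetric about the origin O."""
    x = gamma * np.sinh(mu) * np.sin(nu) * np.cos(phi)
    y = gamma * np.cosh(mu) * np.cos(nu)
    z = gamma * np.sinh(mu) * np.sin(nu) * np.sin(phi)
    return np.array([x, y, z])

def jacobian(mu, nu, gamma):
    """J = gamma^3 sinh(mu) sin(nu) (sinh^2(mu) + sin^2(nu)) (Eqn. 24)."""
    return gamma**3 * np.sinh(mu) * np.sin(nu) * (np.sinh(mu)**2 + np.sin(nu)**2)

# Numerical check of the focal-distance identity r1 + r2 = 2*gamma*cosh(mu).
mu, nu, phi, gamma = 0.7, 1.1, 0.3, 0.5
q = ellipsoidal_to_cartesian(mu, nu, phi, gamma)
p1, p2 = np.array([0, gamma, 0]), np.array([0, -gamma, 0])
assert np.isclose(np.linalg.norm(q - p1) + np.linalg.norm(q - p2),
                  2 * gamma * np.cosh(mu))
```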
Substituting Eqn. 25 into Eqn. 22, the transient under the ellipsoidal coordinate system can be rewritten as:

τ(P, P′, t) = ∭_E EP′ σ(μ, ν, φ) ρ(μ, ν, φ) δ(2γ cosh μ − ct) J dμ dν dφ.    (Eqn. 26)
Notice that with a fixed t, the corresponding μ for a non-zero δ can be obtained, so that the triple integral can be simplified to a double integral over only ν and φ. In addition, if the attenuation term in EP′ is further discarded, the transient can be further simplified to:

τ(P, P′, t) = Γ ∬ σ(μ, ν, φ) ρ(μ, ν, φ) (J/(r1² r2²)) dν dφ,    (Eqn. 27)

where μ = arccosh(ct/2γ). A downside of discarding the attenuation is that occlusions are ignored.
A novel neural modeling framework, Neural Transient Field (NeTF), is provided for non-line-of-sight (NLOS) imaging. Similar to the recent Neural Radiance Field (NeRF), which uses a multi-layer perceptron (MLP) to represent the 5D radiance function, NeTF recovers the 5D transient function in both spatial location and direction. Different from NeRF, the training data in NeTF is parameterized on spherical wavefronts rather than along lines (rays) as in NeRF. Therefore, the NLOS process is formulated under spherical coordinates, analogous to volume rendering under Cartesian coordinates. Another unique characteristic of NeTF is the use of Markov chain Monte Carlo (MCMC) to account for sparse and unbalanced sampling. MCMC enables more reliable volume density estimation and produces more accurate shape estimation by recovering missing details caused by occlusions and non-uniform albedo. Experiments on both synthetic and real data demonstrate the benefits of NeTF over existing techniques in both robustness and accuracy.
This application is the national phase entry of International Application No. PCT/CN2021/104609, filed on Jul. 5, 2021, the entire contents of which are incorporated herein by reference.