The technology discussed below relates to scene reconstruction.
Accurate recovery of motion from a sequence of images is a core problem in computer vision with numerous applications (e.g., in robotics, augmented reality, user interfaces, and autonomous navigation). However, traditional scene reconstruction or motion estimation techniques suffer from excessive blur in the presence of high-speed motion and/or strong noise in low-light conditions. As the demand for computer vision tasks continues to increase, research and development continue to advance motion estimation and scene reconstruction technologies to meet the growing demand for computer vision.
The following presents a simplified summary of one or more aspects of the present disclosure, in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated features of the disclosure, and is intended neither to identify key or critical elements of all aspects of the disclosure nor to delineate the scope of any or all aspects of the disclosure. Its sole purpose is to present some concepts of one or more aspects of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In one example, a method, a system, and/or an apparatus for scene reconstruction is disclosed. The method, the system, and/or the apparatus includes: obtaining a set of frames for a scene; grouping the set of frames into a first initial group and a second initial group; determining a first homography across a first merged frame of the first initial group and a second merged frame of the second initial group; warping the set of frames according to the first homography; resampling the set of frames to generate a first new group and a second new group; determining a second homography across a third merged frame of the first new group and a fourth merged frame of the second new group; warping the set of frames according to the second homography; and providing a reconstructed image based on the warped set of frames.
In another example, a method, a system, and/or an apparatus for scene reconstruction is disclosed. The method, the system, and/or the apparatus includes: obtaining a set of frames for a scene; grouping the set of frames into a first initial group and a second initial group; warping and merging a plurality of first frames in the first initial group and a plurality of second frames in the second initial group to generate a first merged frame and a second merged frame, respectively; determining a first homography across the first merged frame and the second merged frame; warping the set of frames according to the first homography; resampling the set of warped frames to generate a first new group and a second new group; merging a plurality of third frames in the first new group and a plurality of fourth frames in the second new group to generate a third merged frame and a fourth merged frame, respectively; determining a second homography across the third merged frame and the fourth merged frame; warping the set of warped frames according to the second homography; and providing a reconstructed image based on the warped set of frames.
These and other aspects of the disclosure will become more fully understood upon a review of the drawings and the detailed description, which follows. Other aspects, features, and embodiments of the present disclosure will become apparent to those skilled in the art, upon reviewing the following description of specific, example embodiments of the present disclosure in conjunction with the accompanying figures. While features of the present disclosure may be discussed relative to certain embodiments and figures below, all embodiments of the present disclosure can include one or more of the advantageous features discussed herein. In other words, while one or more embodiments may be discussed as having certain advantageous features, one or more of such features may also be used in accordance with the various embodiments of the disclosure discussed herein. Similarly, while example embodiments may be discussed below as device, system, or method embodiments, it should be understood that such example embodiments can be implemented in various devices, systems, and methods.
The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations and is not intended to represent the only configurations in which the subject matter described herein may be practiced. The detailed description includes specific details to provide a thorough understanding of various embodiments of the present disclosure. However, it will be apparent to those skilled in the art that the various features, concepts and embodiments described herein may be implemented and practiced without these specific details. In some instances, well-known structures and components are shown in block diagram form to avoid obscuring such concepts.
In some examples, the imaging device 104 can include a single-photon camera, a single-photon avalanche diode, a high-speed camera, or any suitable camera, which is capable of high-speed imaging. In some examples, the single-photon camera is a sensor technology with ultra-high sensitivity down to individual photons. In addition to its extreme sensitivity, the single-photon camera based on single-photon avalanche diodes (SPADs) can also record photon-arrival timestamps with extremely high (sub-nanosecond) time resolution. Moreover, the SPAD-based single-photon camera is compatible with complementary metal-oxide semiconductor (CMOS) photolithography processes which can facilitate fabrication of kilo-to-mega-pixel resolution SPAD arrays. Due to these characteristics, the SPAD-based single-photon camera can be used in 3D imaging, passive low-light imaging, HDR imaging, non-line-of-sight (NLOS) imaging, fluorescence lifetime imaging (FLIM) microscopy, and diffuse optical tomography.
Unlike a conventional camera pixel that outputs a single intensity value integrated over micro-to-millisecond timescales, a SPAD pixel generates an electrical pulse for each photon detection event. A time-to-digital conversion circuit converts each pulse into a timestamp recording the time-of-arrival of each photon. Under normal illumination conditions, a SPAD pixel can generate millions of photon timestamps per second. The photon timestamps are often captured with respect to a periodic synchronization signal generated by a pulsed laser source. To make this large volume of timestamp data more manageable, the SPAD-based single-photon camera can build a timing histogram on-chip instead of transferring the raw photon timestamps to the host computer. The histogram can record the number of photons as a function of the time delay with respect to the synchronization pulse.
In some examples, the computing device 110 can include a processor 112. In some embodiments, the processor 112 can be any suitable hardware processor or combination of processors, such as a central processing unit (CPU), a graphics processing unit (GPU), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), a microcontroller (MCU), etc.
In further examples, the computing device 110 can further include a memory 114. The memory 114 can include any suitable storage device(s) that can be used to store suitable data (e.g., the stream of frames 102, a reconstructed image 106, etc.) and instructions that can be used, for example, by the processor 112 to obtain a set of frames for a scene; group the set of frames into a first initial group and a second initial group; warp and merge a plurality of first frames in the first initial group and a plurality of second frames in the second initial group to generate a first merged frame and a second merged frame, respectively; determine a first homography across the first merged frame and the second merged frame; warp the set of frames according to the first homography; resample the set of warped frames to generate a first new group and a second new group; merge a plurality of third frames in the first new group and a plurality of fourth frames in the second new group to generate a third merged frame and a fourth merged frame, respectively; determine a second homography across the third merged frame and the fourth merged frame; warp the set of warped frames according to the second homography; provide a reconstructed image based on the warped set of frames; select a first middle frame in the first initial group to warp the plurality of first frames based on the first middle frame; select a second middle frame in the second initial group to warp the plurality of second frames based on the second middle frame; merge a plurality of first warped frames of the first initial group to generate the first merged frame; merge a plurality of second warped frames of the second initial group to generate the second merged frame; warp the first merged frame and the second merged frame to be aligned together before determining the second homography across the first merged frame and the second merged frame; interpolate the first homography; and warp the set of frames based on the interpolated first homography. The memory 114 can include any suitable volatile memory, non-volatile memory, storage, or any suitable combination thereof. For example, memory 114 can include random access memory (RAM), read-only memory (ROM), electronically-erasable programmable read-only memory (EEPROM), one or more flash drives, one or more hard disks, one or more solid state drives, one or more optical drives, etc. In some embodiments, the processor 112 can execute at least a portion of process 300 described below in connection with
In further examples, the computing device 110 can further include a communications system 116. The communications system 116 can include any suitable hardware, firmware, and/or software for communicating information over the communication network 120 and/or any other suitable communication networks. For example, the communications system 116 can include one or more transceivers, one or more communication chips and/or chip sets, etc. In a more particular example, communications system 116 can include hardware, firmware and/or software that can be used to establish a Wi-Fi connection, a Bluetooth connection, a cellular connection, an Ethernet connection, etc.
In further examples, the computing device 110 can receive or transmit information (e.g., a stream of frames 102, a reconstructed image 106, etc.) from or to any other suitable system over the communication network 120. In some examples, the communication network 120 can be any suitable communication network or combination of communication networks. For example, the communication network 120 can include a Wi-Fi network (which can include one or more wireless routers, one or more switches, etc.), a peer-to-peer network (e.g., a Bluetooth network), a cellular network (e.g., a 3G network, a 4G network, a 5G network, etc., complying with any suitable standard, such as CDMA, GSM, LTE, LTE Advanced, NR, etc.), a wired network, etc. In some embodiments, communication network 120 can be a local area network, a wide area network, a public network (e.g., the Internet), a private or semi-private network (e.g., a corporate or university intranet), any other suitable type of network, or any suitable combination of networks. Communications links shown in
In further examples, the computing device 110 can further include one or more inputs 118 and/or a display 120. In some embodiments, the input(s) 118 can include any suitable input devices (e.g., a keyboard, a mouse, a touchscreen, a microphone, etc.). In further embodiments, the display 120 can include any suitable display devices, such as a computer monitor, a touchscreen, a television, an infotainment screen, etc. to display the reconstructed image 106, or any suitable information.
In some examples, the disclosed scene reconstruction technique and system associated with
Referring to
Although single-photon cameras can capture scene information at high sensitivity and speed, each individual captured frame can be binary valued: a pixel is on if at least one photon is detected during the exposure time and off otherwise. This binary imaging model presents unique challenges. Traditional image registration techniques rely on feature-based matching, or direct optimization using differences between pixel intensities, both of which rely on image gradients to converge to a solution. Individual binary images suffer from severe noise and quantization (only having 1-bit worth of information per pixel), and are inherently non-differentiable, making it challenging, if not impossible, to apply conventional image registration and motion estimation techniques directly on binary frames. Aggregating sequences of binary frames over time increases signal (
The disclosed scene reconstruction technique and system is capable of estimating rapid motion from a sequence of high-speed binary frames captured using an image device (e.g., a single-photon camera, a high-speed camera, a CMOS camera, etc.). In some examples, these binary frames can be aggregated in post-processing in a motion-aware manner so that more signal and bit-depth are collected, while simultaneously minimizing motion blur. As seen in
At step 402, the process 400 obtains a set of frames for a scene. In some examples, the set of frames can be a stream of video frames. In other examples, the set of frames can be multiple image frames (e.g., captured by a camera). Further, the set of frames can be captured by an imaging device 104. For example, a user can move the imaging device 104 to capture a scene via the set of frames. In some examples, the imaging device can include a single-photon camera. However, it should be appreciated that the imaging device is not limited to a single-photon camera. For example, the imaging device can include a high-speed camera, which can capture images with frame rates in excess of 250 fps. In some examples, the single-photon camera is capable of high-speed imaging (e.g., 500 fps or higher) in low-light conditions. In further examples, the single-photon camera can be based on a single-photon avalanche diode (SPAD) sensor. In some examples, a frame of the set of frames can include binary values. For example, a pixel in the frame is on (value 1) if at least one photon is detected during the exposure time and off (value 0) otherwise. Thus, each pixel in the frame can include a binary value (i.e., 1-bit information). In further examples, the set of frames can be considered a three-dimensional photon cube (i.e., x and y spatial dimensions and an extra photon arrival time dimension).
In some examples, the image formation model for the SPAD sensor may enable high-speed photon-level sensing, which can emulate virtual exposures whose signal-to-noise ratio (SNR) is limited only by the fundamental limits of photon noise. For a static scene with a radiant flux (photons/second) of ϕ, during an exposure time T, the probability of observing k incident photons on a SPAD camera pixel follows a Poisson distribution:
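The Poisson expression itself is not reproduced in this text. A standard form consistent with the description above (with ϕ the flux and T the exposure time) is:

P(k) = \frac{(\phi T)^{k} \, e^{-\phi T}}{k!}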
After each photon detection, the SPAD pixel enters a dead time during which the pixel circuitry resets. During this dead time, the SPAD may not detect additional photons. The SPAD pixel output during this exposure t is binary-valued and follows a Bernoulli distribution given by:
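The Bernoulli expression is likewise not reproduced here. Assuming an exposure time t per binary frame, a standard form consistent with this description is:

P(B = 1) = 1 - e^{-\phi t}, \qquad P(B = 0) = e^{-\phi t}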
In some examples, sources of noise such as dark counts and non-ideal quantum efficiency can be absorbed into the value of ϕ.
In some examples, given n binary observations Bi of a scene, a virtual exposure can be captured. In some examples, the virtual exposure can indicate aggregating the photon information (e.g., the set of frames). The virtual exposure can use the following maximum likelihood estimator:
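The estimator is not reproduced in this text. A common maximum likelihood estimator for ϕ under the binary model above, and presumably the form referenced later as Eq. (3), is:

\hat{\phi} = -\frac{1}{t} \ln\!\left(1 - \frac{1}{n} \sum_{i=1}^{n} B_i\right)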
Different virtual exposures can be emulated by varying the starting index i and the number n of binary frames. The granularity and flexibility of these virtual exposures is limited only by the frame rate of the SPAD array, which reaches up to ˜100 kfps, enabling robust motion estimation at extremely fine time scales. Furthermore, SPAD arrays have negligible read noise and quantization noise, leading to significantly higher SNR as compared to conventional images captured over the same exposures.
While the embodiments presented above are applicable to a wide range of motion models, the embodiments described herein can be used with image homographies, which are a global motion model. In some examples, a modular technique can be used for homography estimation from photon cube data (i.e., the set of frames), which is capable of localizing high-speed motion even in ultra-low light settings. As an example application, a panorama image may be reconstructed from a photon cube (i.e., the set of frames) by using the homographies to warp binary frames onto a common reference frame. Given a temporal sequence of n binary frames {B_i}_{i=1}^{n}, image homographies can be computed and iteratively refined. The resulting reconstruction can be made through the following steps:
At step 404, the process 400 groups the set of frames into multiple initial groups including a first initial group and a second initial group. For example, the process 400 may sample the set of frames (e.g., the sequence of binary frames) across the photon cube to be merged together. In some examples, the set of frames (e.g., the entire sequence of binary frames) is re-sampled and grouped into subsets that are later aligned and merged. In some examples, a frame in the middle of each group or subset (e.g., a midpoint sampling) can be used as the grouping strategy. For example, given a group size of m, during the first iteration, the n binary frames can be split into [n/m] non-overlapping groups. A single frame within each group can be chosen to be the reference frame whose warp is later estimated in the “Locate” phase (i.e., step 408). At step 404, for the initial iteration, the center frame of each group can be chosen to be the reference frame.
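A minimal sketch of this midpoint grouping strategy is shown below (in Python; the function name, return format, and use of NumPy are illustrative assumptions, not part of the disclosure):

import numpy as np

def initial_groups(n_frames: int, m: int):
    # Split the frame indices into non-overlapping groups of m frames and
    # pick the center frame of each group as its reference frame.
    starts = np.arange(0, n_frames - m + 1, m)       # first index of each group
    groups = [np.arange(s, s + m) for s in starts]   # frame indices per group
    refs = [int(g[len(g) // 2]) for g in groups]     # center (reference) frame index
    return groups, refs

# Example: 100,000 binary frames grouped with m = 500
groups, refs = initial_groups(100_000, 500)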
In some examples, with this stratified resampling approach enabled by the virtual exposures, motion can be compensated at the level of individual photon arrivals to create high-fidelity aggregate frames, which in turn can be used to further refine the motion estimates. Virtual exposures are created by re-sampling the photon-cube post-capture, allowing arbitrary, fluid, and even overlapping exposures, enabling us to resolve higher speed motion.
An abstract example of the temporal resampling is illustrated in
This stratified re-sampling approach can deal with the motion blur and noise tradeoff. The number of frames (m) per group controls this tradeoff: a larger value of m helps counteract Poisson noise but also causes motion blur. In some examples, if SPAD binary frames are available at ˜100 kHz, setting m≈250-750 can achieve high-quality results across different motion speeds and light levels, with higher values better suited for extremely low light, and lower values for fast motion. See supplementary material for details on the asymptotic behavior of this grouping policy, and the impact of the choice of the reference frame for each group.
At step 406, the process 400 merges multiple first frames in a first group and multiple second frames in a second group to generate a first merged frame and a second merged frame, respectively. In some examples, the first group and the second group can be included in the multiple initial groups. In some examples, the process 400 can warp the first merged frame and the second merged frame to be aligned together before determining the homography across the first merged frame and the second merged frame or before merging the multiple first and second frames. In further examples, the frames within each group can be warped and merged. For example, the process 400 can select a first middle frame in the first group and a second middle frame in the second group. In some examples, the first middle frame is a frame at the center of the first group when multiple frames are placed in time sequence in the first group. In some examples, when there are N frames (e.g., frame 1, frame 2, . . . frame N) in the first group, the first middle frame is a frame placed at N/2 or N/2+1. For example, when there are 500 frames in the first group, the first middle frame can be frame 250 or frame 251. In other examples, when there are 501 frames in the first group, the first middle frame can be frame 251. The second middle frame is similar to the first middle frame. In some examples, the process 400 can warp the multiple first frames in the first group and the multiple second frames based on the first middle frame and the second middle frame. Thus, the objects in the multiple first and second frames can be aligned to the object in the first and second middle frames, respectively. The warp operation is applied locally within each group. By applying these warps locally with respect to the group's center frame (instead of a global reference frame), the frames within each group can be warped by small amounts.
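One plausible sketch of this per-group warp-and-merge step is shown below (using OpenCV; the helper name, the convention that each per-frame homography maps that frame into the global reference, and the float averaging are assumptions rather than the exact implementation):

import numpy as np
import cv2

def merge_group(frames, homographies, ref_idx):
    # Warp every binary frame in the group onto the group's center (reference)
    # frame and average the results. Warping by H_ref^{-1} @ H_i keeps the
    # per-frame warps small, as described above.
    h, w = frames[0].shape
    H_ref_inv = np.linalg.inv(homographies[ref_idx])
    acc = np.zeros((h, w), np.float32)
    for frame, H_i in zip(frames, homographies):
        H_rel = H_ref_inv @ H_i                       # warp relative to group center
        acc += cv2.warpPerspective(frame.astype(np.float32), H_rel, (w, h))
    return acc / len(frames)                          # average photon detections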
In some examples, the process 400 can merge the multiple first frames and the multiple second frames to generate the first merged frame and the second merged frame, respectively. For example, the warped frames are then merged using Eq. (3). In some examples, the warped frames can be tone-mapped to sRGB. For example, the three sRGB components can have the same values to show a grayscale image in the merged frame. In other examples, the three sRGB components can have different values to show color. In further examples, the tone-mapping can be applied independently of whether or not the image is in color. The tone-mapping can be a way to adjust the image intensities (whether color or grayscale) so that images look pleasing to human eyes. In some examples, the tone-mapping can be a post-processing step that can be performed on-chip.
At step 408, the process 400 can determine a homography across the first merged frame and the second merged frame. In some examples, the process 400 can estimate a homography (e.g., warp) for each pair of merged frames. For example, for two merged frames, the process 400 can estimate one homography relating the two merged frames. When the process 400 produces N merged frames, the process 400 can estimate N−1 homographies. In some examples, the pairwise warps between merged frames are estimated using an off-the-shelf method. Any drift introduced in this step is corrected during subsequent iterations. In some examples, the homography can include a 3×3 matrix, and the 3×3 matrix can be defined by:
where p1, p2, p3, p4, p5, p6, p7, and p8 are homography parameters determined by the first merged frame and the second merged frame. In some examples, the process 400 can establish a warp between the merged frames. Once the relative warp between merged frames is established, the process 400 can decide on a global reference frame onto which all frames are projected. The global frame can be a center frame of the set of frames or any suitable frame in the set of frames. Generally, the choice of the global frame can be a matter of picking which frame will not get warped (or, equivalently, will be warped by the identity warp).
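The 3×3 matrix referenced above is not reproduced in this text. One common eight-parameter form consistent with the description, assuming the bottom-right entry is normalized to 1, is:

H = \begin{bmatrix} p_1 & p_2 & p_3 \\ p_4 & p_5 & p_6 \\ p_7 & p_8 & 1 \end{bmatrix}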
At step 410, the process 400 can interpolate the homography to generate a warp to warp the set of frames. In some examples, the process 400 can interpolate the estimated homography matrices across time to get the fine-scale warps later used to warp individual binary frames. For example, to interpolate the homography, the process 400 can interpolate the homography parameters (e.g., p1, p2, p3, p4, p5, p6, p7, p8) of the 3×3 matrix. In some examples, the process 400 can use a geodesic interpolation to interpolate homographies. In practice, the process 400 can use an extended Lucas-Kanade formulation to interpolate homographies to avoid computing matrix inverses and to be numerically more stable. The resulting interpolated homographies are able to resolve extremely high-speed motion at the granularity of individual binary frames (˜100 kHz), thus significantly mitigating the noise-blur tradeoff. In some examples, the process 400 can interpolate between the homographies, which are obtained from step 408 (i.e., the locating step) and are obtained from pairwise merged frames. In such examples, interpolating between the homographies, which are obtained from each pair of merged frames, can include estimating a homography for each frame of the set of frames. After interpolation, the process 400 can produce a unique homography per binary frame to warp all binary frames. These warped binary frames can then be grouped together and re-merged. Since each binary frame was warped, the new merged frames (created from binary frames which have been warped with the latest and best estimate of their true homography) can be less blurry and noisy, thus allowing even better subsequent localization. This process is repeated, as shown in step 412 below, until a reconstructed image of sufficient quality (e.g., improved, satisfactory, or above a reconstruction quality threshold) is produced.
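As an illustration of the interpolation step, the sketch below shows geodesic interpolation between two homographies via the matrix logarithm and exponential (the extended Lucas-Kanade formulation mentioned above is not shown; the function name and the SciPy-based approach are assumptions):

import numpy as np
from scipy.linalg import expm, logm

def interpolate_homographies(H_a, H_b, alphas):
    # Interpolate along the geodesic H_a * exp(alpha * log(H_a^{-1} @ H_b)),
    # yielding one intermediate homography per value of alpha in [0, 1].
    L = logm(np.linalg.inv(H_a) @ H_b)
    return [H_a @ np.real(expm(a * L)) for a in alphas]

# Example: one homography per binary frame between two merged-frame estimates
per_frame_H = interpolate_homographies(np.eye(3), np.eye(3), np.linspace(0, 1, 500))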
At step 412, the process 400 can determine to repeat the sub-process of the resampling of step 414, merging of step 406, the determining of the homography of step 408, and the interpolating of step 410. If the process 400 determines to repeat the sub-process, the process 400 can perform steps 414, 406, 408, and 410. If the process 400 does not determine to repeat the sub-process, the process 400 can perform step 416.
At step 414, when the process 400 determines to repeat, the process 400 can resample the set of frames to generate multiple new groups. The multiple new groups include a first new group corresponding to the first initial group and a second new group corresponding to the second initial group. In some examples, in subsequent iterations, the binary frame sequence (i.e., the set of frames) can be re-sampled to create new groups including m frames that are chosen such that they are centered between the previous iteration's groups. This introduces overlapping groups and ensures a progressively denser sampling of the motion trajectory. For example, in the initial grouping of step 404, the process 400 can group n frames into multiple groups where each group includes m frames. The initial groups can be n/m non-overlapping groups. In the second iteration, the process 400 can resample the set of frames (e.g., n frames) to generate new groups. Each new group can include m frames, which are centered between two consecutive groups from the previous iteration. In other examples, frames of a new group are centered around the midpoint of a group from before the iteration. In some examples, all binary frames can be warped with interpolated homographies before each merge operation at step 406. In some examples, the group size m can stay constant. In some examples, two groups can overlap each other. For example, if groups in the first iteration are of size m and start every m frames (i.e., [0, m), [m, 2m), [2m, 3m), etc.), groups in the second iteration can overlap by m/2 (i.e., [0, m), [m/2, m+m/2), [m, 2m), [m+m/2, 2m+m/2), etc.). After the resampling of the set of frames, the process 400 can further merge multiple third frames in the first new group and multiple fourth frames in the second new group to generate a third merged frame and a fourth merged frame, respectively, determine a second homography across the third merged frame and the fourth merged frame, and interpolate the second homography to generate a warp to warp the set of frames. In some examples, the process 400 can repeat steps 414 (the resampling step), 406 (the merging step), 408 (the homography determining step), and 410 (the interpolating step).
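A minimal sketch of this overlapping resampling schedule is shown below (one plausible schedule consistent with the example above; the function name and the choice to start every grid at index zero are assumptions):

def resampled_groups(n_frames: int, m: int, iteration: int):
    # The first iteration uses non-overlapping groups starting every m frames;
    # later iterations start groups every m/2 frames so that each new group is
    # centered between two groups of the previous iteration, e.g.
    # [0, m), [m/2, m + m/2), [m, 2m), [m + m/2, 2m + m/2), ...
    step = m if iteration == 0 else m // 2
    starts = range(0, n_frames - m + 1, step)
    return [list(range(s, s + m)) for s in starts]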
At step 416, when the process 400 determines not to repeat, the process 400 can provide a reconstructed image based on the warp of the set of frames. In some examples as shown in
The disclosed technique was demonstrated in simulation and through real-world experiments using a SPAD hardware prototype.
Simulation Details: A SPAD array capturing a panoramic scene was simulated by starting with high-resolution panoramic images downloaded from the internet. Camera trajectories were created across the scene such that the SPAD's field of view (FOV) sees only a small portion of the panorama at a time. At each time instant of the trajectory, a binary frame from the FOV of the ground truth image was simulated by first undoing the sRGB tone mapping to obtain linear intensity estimates, and then applying Eq. (2) to simulate the binary photon stream. RGB images were simulated by averaging the ground truth linear intensities over a certain exposure and adding Gaussian noise.
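A rough sketch of this binary-frame simulation is shown below (the approximate gamma of 2.2 used to undo the sRGB tone curve and the exposure_scale factor are simplifying assumptions, not details from the disclosure):

import numpy as np

def simulate_binary_frame(srgb_patch, exposure_scale=1.0, rng=None):
    # Undo the sRGB tone curve to get linear flux estimates, then sample each
    # pixel from the Bernoulli model P(B = 1) = 1 - exp(-phi * t).
    rng = np.random.default_rng() if rng is None else rng
    linear = np.clip(srgb_patch.astype(np.float32) / 255.0, 0.0, 1.0) ** 2.2
    p_detect = 1.0 - np.exp(-linear * exposure_scale)   # per-pixel detection probability
    return (rng.random(linear.shape) < p_detect).astype(np.uint8)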
Hardware Prototype: For experiments, the SwissSPAD was used to capture binary frames (
Implementation Details: The implementation took on the order of ten minutes, per iteration, to process 100k frames. While factors such as resolution and window size (m) affect runtime, the implementation is throttled by the underlying registration algorithm which recomputes features at every level. Further optimizations and feature caching can improve runtime.
Fast Motion Recovery:
Low Light Robustness:
Globally Consistent Matching: An issue when global motion is estimated piece by piece is that of drift: any error in the pairwise registration process accumulates over time. This phenomenon is clearly visible in the RGB panorama in
Super-Resolution and Efficient Registration: Due to dense temporal sampling, and the resulting fine-grained homography estimates, the example method enables super-resolution in the reconstructed panoramas. This is achieved by applying a scaling transform to the estimated homographies before the merging step. This scaling transform stretches the grid of pixels into a larger grid, resulting in super-resolution. Further, to save on compute and memory costs, this scaling factor can be gradually introduced across iterations. For example, if the goal is to super-resolve by a scale of 4×, the estimated warps can be scaled by a factor of two over two iterations. It is also possible to use scaling factors that are smaller than one in the initial iterations of the pipeline. This can be done to create large-scale panoramas, such as the one in
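One way to express the scaling transform described above is sketched below (the conjugation S · H · S⁻¹ is a standard way to re-express a homography on a rescaled pixel grid; the function name is an assumption):

import numpy as np

def scale_homography(H, s):
    # Re-express an estimated homography on a pixel grid scaled by s
    # (s > 1 super-resolves; s < 1 shrinks, e.g., for large panoramas).
    S = np.diag([s, s, 1.0])
    return S @ H @ np.linalg.inv(S)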
High Dynamic Range: Single photon cameras have recently been demonstrated to have high dynamic range (HDR) capabilities. By performing high-accuracy homography estimation and registration, the example method can merge a large number of binary measurements from a given scene point, thus achieving HDR.
Extension to High-Speed Cameras: The stratified re-sampling approach can be extended to other high-speed imaging modalities that allow fast sampling. For example, a conventional high-speed camera can be used for the stratified re-sampling approach.
Edge Effects and Pre-warping: When an off-the-shelf homography estimation algorithm is used over a virtual exposure of duration t, to which time instant within the exposure duration does the estimated localization correspond? This is not an issue for small camera movements, but with fast motion, this ambiguity has a compounding effect. A sensible assumption is that the base model estimates the average location over an exposure time, or perhaps the location at the center of the exposure. This observation indicates that i) if some motion has already been compensated when creating the aggregate frame, the new estimate will be relative to it, and ii) localization can be performed with respect to any time instant during the exposure by warping the photon data, before aggregation, such that the time instant of interest is warped by the identity warp instead of the current motion estimate.
Without accounting for the former, any motion estimate would rapidly drift away. In practice, it is beneficial to compensate for this relative offset before aggregation and localization as it helps constrain the size of the aggregate frames and can lead to better matches. The latter enables the localization of off-center time slices of a virtual exposure, enabling precise localization at the boundaries of a captured sequence, which, as seen in
Comparison with One-Shot Motion Compensation Methods: Quanta burst photography (QBP) is a recently proposed algorithm that uses a more general optical flow-based motion model that locally warps and registers groups of binary frames. A comparison with our method is shown in
Using Conventional High-Speed Cameras: The iterative stratified motion estimation and alignment method presented above is not restricted to single-photon cameras and can be applied to images captured by a high-speed camera as well. The only assumption is that individual frames obtained from the high-speed camera contain minimal motion blur and have a high enough SNR to allow frame-to-frame feature matching.
We demonstrate this using a commercially available high-speed camera (e.g., Photron Infinicam shown in
To get sufficient signal to overcome these limitations we increased the aperture to allow the camera to capture 4× more light.
In the foregoing specification, implementations of the disclosure have been described with reference to specific example implementations thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of implementations of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.
This invention was made with government support under 1943149 and 2107060 awarded by the National Science Foundation. The government has certain rights in the invention.