Detectors that can detect the arrival time of an individual photon, such as single-photon avalanche diodes (SPADs), can facilitate active vision applications in which a light source is used to interrogate a scene. For example, such single-photon detectors have been proposed for use with fluorescence lifetime-imaging microscopy (FLIM), non-line-of-sight (NLOS) imaging, transient imaging, LiDAR systems, and other depth imaging systems. The combination of high sensitivity and high timing resolution has the potential to improve performance of such systems in demanding imaging scenarios, such as in systems having a limited power budget. For example, single-photon detectors can play a role in realizing effective long-range LiDAR for automotive applications (e.g., as sensors for autonomous vehicles) in which a power budget is limited and/or in which a signal strength of the light source is limited due to safety concerns.
Three-dimensional imaging systems (e.g., cameras) based on SPAD technology are becoming increasingly popular for a wide range of applications that require high-resolution and low-power depth sensing, ranging from autonomous vehicles to consumer smartphones. Kilopixel to megapixel resolution SPAD pixel arrays that are being developed will have the capability of capturing the time-of-arrival of billions of individual photons per frame with extremely high (picosecond) time resolution. However, this extreme sensitivity and high speed come at a cost, as the raw timestamp data causes a severe bottleneck between the image sensor and an image signal processor (ISP) that processes this data (e.g., as described below in connection with
One approach that has been developed to avoid transferring individual photon timestamps is to build a histogram in each pixel. This results in a 3D histogram tensor that is transferred off-sensor for processing. Although this may be practical at low spatio-temporal resolutions (e.g., 64×32 pixels with 16 time bins), higher resolutions require more in-sensor memory. Additionally, the data rates of this histogram tensor representation also scale rapidly with the spatio-temporal resolution and maximum depth range (or number of time bins) of the sensor. For example, a megapixel SPAD-based 3D camera operating at 30 frames per second (fps) that outputs a histogram tensor with a thousand 8-bit bins per pixel would require a data transfer rate of 240 gigabits per second (Gbps).
In such systems, the first photon detection times in each laser cycle can be collected and used to generate a histogram of the time-of-arrival of the photons that represents the distribution of detections. For example, as described below in connection with
As described above, SPAD-based systems can generate very large amounts of data. For example, consider a megapixel SPAD-based 3D camera. For short range indoor applications (e.g., up to tens of meters), a millimeter depth resolution would be desirable. For longer range outdoor applications (e.g., hundreds of meters), centimeter level depth resolution would be desirable. Assuming state-of-the-art sub-bin processing techniques, this corresponds to histograms with thousands of bins per pixel, which would require reading out thousands of values per pixel in order to generate a depth for each pixel. Additionally, the rate at which such histograms can be generated can vary from tens of fps for low speed applications (e.g., land surveying) to hundreds of fps for high speed applications (e.g., an automotive application where objects may be moving at high speeds). Even a conservative estimate of a 30 fps megapixel camera leads to a large data rate (e.g., 10^6 pixels/frame×1000 bins/pixel×2 bytes/bin×30 fps=60 GB/sec).
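These figures follow directly from multiplying the quantities given above; as a quick check in Python:

```python
# Back-of-the-envelope check of the data rates quoted above.
pixels = 10**6     # megapixel sensor
bins = 1000        # histogram bins per pixel
fps = 30           # frames per second

# 1000 8-bit bins per pixel, in bits per second (240 Gbps example):
print(pixels * bins * 8 * fps / 1e9, "Gbps")    # -> 240.0

# 1000 2-byte bins per pixel, in bytes per second (60 GB/sec example):
print(pixels * bins * 2 * fps / 1e9, "GB/s")    # -> 60.0
```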
Coarse in-pixel histogramming has been proposed to reduce data rates in SPAD-based 3D cameras. Despite the low time resolution in coarse histograms, it is possible to achieve relatively high depth resolution by using wide pulses, pulse dithering, or with coarse-to-fine histogram architectures. However, as described below, coarse histogramming is a sub-optimal strategy.
Accordingly, systems, methods, and media for single photon depth imaging with improved efficiency using learned compressive representations are desirable.
In accordance with some embodiments of the disclosed subject matter, systems, methods, and media for single photon depth imaging with improved efficiency using learned compressive representations are provided.
In some aspects, the present disclosure can provide a system for determining a depth in a scene. The system can include a light source and an array including a plurality of detectors. The plurality of detectors can detect arrival of individual photons. At least one processor can be programmed to detect a photon arrival based on a signal from a detector of the plurality of detectors. The detector of the plurality of detectors can have a position p′. The processor can be programmed to determine a time bin i associated with the photon arrival. The time bin can be in a range from 1 to Nt where Nt is a total number of time bins. The processor can be programmed to update a compressed histogram with K stored values representing bins of the compressed histogram based on K values in a code word calculated based on the time bin i and the position p′ and K coding tensors. Each coding tensor of the K coding tensors can be different than each other coding tensor. The processor can be programmed to perform an imaging task based on the K values of the compressed histogram.
In some aspects, the present disclosure can provide a method for determining a depth in a scene. A photon arrival can be detected based on a signal from a detector of a plurality of detectors. The detector of the plurality of detectors can have a position p′. A time bin i can be determined. The time bin i can be associated with the photon arrival. The time bin can be in a range from 1 to Nt where Nt is a total number of time bins. A compressed histogram can be updated. The compressed histogram can include K stored values representing bins of the compressed histogram based on K values in a code word calculated based on the time bin i and the position p′ and K coding tensors. Each coding tensor of the K coding tensors can be different than each other coding tensor. An imaging task can be performed based on the K values of the compressed histogram.
In some aspects, the present disclosure can provide a system for generating compressed single-photon histograms. The system can include a light source and an array including a plurality of detectors. The plurality of detectors can detect arrival of individual photons. At least one processor can be programmed to detect a photon arrival based on a signal from a detector of the plurality of detectors. The detector of the plurality of detectors can have a position p′. The at least one processor can be programmed to determine a time bin i associated with the photon arrival. The time bin can be in a range from 1 to Nt where Nt is a total number of time bins. The at least one processor can be programmed to update a compressed histogram including K stored values representing bins of the compressed histogram based on K values in a code word calculated based on the time bin i and the position p′ and K coding tensors. Each coding tensor of the K coding tensors can be different than each other. The at least one processor can be programmed to output the compressed histogram to another processor.
Various objects, features, and advantages of the disclosed subject matter can be more fully appreciated with reference to the following detailed description of the disclosed subject matter when considered in connection with the following drawings, in which like reference numerals identify like elements.
In accordance with various embodiments, mechanisms (which can, for example, include systems, methods, and media) for single photon depth imaging with improved efficiency using learned compressive representations are provided.
In some embodiments, mechanisms described herein can be used to generate compressive histograms that can improve the efficiency of single photon depth imaging systems, for example, by reducing the per-pixel output data rate at a particular depth resolution and frame rate.
Single-photon cameras (SPCs) are an emerging sensor technology with ultra-high sensitivity down to individual photons. In addition to their extreme sensitivity, SPCs based on single-photon avalanche diodes (SPADs) can also record photon-arrival timestamps with extremely high (sub-nanosecond) time resolution. Moreover, SPAD-based SPCs are compatible with complementary metal-oxide semiconductor (CMOS) photolithography processes, which can facilitate fabrication of kilo-to-mega-pixel resolution SPAD arrays at relatively low cost. Due to these characteristics, SPAD-based SPCs are gaining popularity in various imaging applications including 3D imaging, passive low-light imaging, HDR imaging, non-line-of-sight (NLOS) imaging, fluorescence lifetime imaging (FLIM) microscopy, and diffuse optical tomography.
Unlike a conventional camera pixel that outputs a single intensity value integrated over micro-to-millisecond timescales, a SPAD pixel generates an electrical pulse for each photon detection event. A time-to-digital conversion circuit converts each pulse into a timestamp recording the time-of-arrival of each photon. Under normal illumination conditions, a SPAD pixel can generate millions of photon timestamps per second. The photon timestamps are often captured with respect to a periodic synchronization signal generated by a pulsed laser source. To make this large volume of timestamp data more manageable, SPAD-based SPCs can build a timing histogram on-chip instead of transferring the raw photon timestamps to the host computer. The histogram can record the number of photons as a function of the time delay with respect to the synchronization pulse.
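For illustration, the following minimal Python sketch shows how such a timing histogram can be accumulated from photon timestamps; the repetition period, bin count, and timestamps are illustrative assumptions rather than values prescribed by this disclosure:

```python
import numpy as np

# Minimal sketch of per-pixel timing-histogram formation (assumed values).
tau = 100e-9                     # pulse repetition period (seconds)
num_bins = 1000                  # histogram bins
delta = tau / num_bins           # bin width

# Stand-in for photon time-of-arrival data relative to each sync pulse.
timestamps = np.random.uniform(0.0, tau, size=50_000)

histogram = np.zeros(num_bins, dtype=np.int64)
for t in timestamps:
    i = min(int(t // delta), num_bins - 1)   # time bin for this photon
    histogram[i] += 1                        # count photons per delay bin
```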
In some embodiments, mechanisms described herein can be used to implement bandwidth-efficient acquisition strategies using coding tensors, which can be used to encode a block of a 3D histogram tensor into a single compressive histogram. In some embodiments, rather than capturing the full timing histogram in each pixel, compressive histograms can be constructed by mapping the time bins of the histograms for multiple pixels onto a compressive histogram through an encoding process.
As described below, in some embodiments, mechanisms described herein can utilize a family of compressive encoders that can be represented as a series of simple matrix operations. Such compressive encoders can be implemented efficiently using operations equivalent to multiply-add operations that can be computed on-the-fly (e.g., as each photon arrives), without the need to store large arrays of photon timestamps on-chip. In some embodiments, using mechanisms described herein can decouple the dependence of output data rate on the desired depth resolution. For example, while a full histogram requires more time bins to achieve higher depth resolution, compressive histograms generated using mechanisms described herein can represent a higher depth resolution using a similar (e.g., almost the same) or lower number of data points.
As described below in connection with
For a fixed compression level, as noise increases (e.g., as signal to background ratio (SBR) decreases) reconstruction quality degrades. In results described below, degradation in image quality first results in the loss of fine details and subsequently the loss of coarser higher-level scene information. By increasing the size of coding tensors, reconstruction quality can be recovered up to a certain degree (e.g., as described below in connection with
As described herein, coding tensors with Mt=Nt can develop a depth range bias. For example, these coding tensors can learn to zero out photons coming from distances that are less common in the dataset since they are usually background/noise photons. Interestingly, learned coding tensors with Mt<Nt avoid this bias and generalize to depths that are less common in the training set.
While SPAD-based 3D cameras with large in-pixel memory could potentially store per-pixel histograms and reduce data rates by computing depths in-pixel (sometimes referred to herein as a peak compression oracle), results described below show that compressive histograms can provide similar reconstruction quality and outperform this technique at low SBR without requiring the storage of the full histogram tensor in-sensor.
Note that although promising empirical results are shown for the coding tensor representations described herein, an optimal set of coding tensors depends on the hardware specifications (e.g., in-pixel memory, system bandwidth) and scene-dependent parameters (e.g., SBR, geometry, albedo). Additional lightweight coding tensors can be implemented using mechanisms described herein that rely on other factorization techniques and weight quantization.
Note that histogram tensors are used in various active single-photon imaging modalities in addition to depth sensing, such as fluorescence lifetime microscopy (FLIM), non-line-of-sight imaging, and diffuse optical tomography. In some embodiments, mechanisms described herein can be used to find compressive representations suitable for these additional applications.
As described below, while coarse histogramming can be considered a form of compressive histogram, coarse histogramming is sub-optimal compared to other compressive histogramming strategies. Other data reduction strategies, such as motion-driven operation or multi-photon triggering have been proposed to reduce the amount of data generated by SPADs. Additionally, in the context of scanning-based systems, adaptive sampling techniques have been proposed to reduce sampling rates and consequently data transfers. In some embodiments, such techniques can be used in a complementary manner with mechanisms described herein to further reduce data rates.
Recently, Fourier-domain histograms (FDHs) were proposed for fast non-line-of-sight (NLOS) imaging and for single-photon 3D imaging. FDHs can be generated using mechanisms described herein as one type of compressive histogram that can achieve significant compression over regular histogramming. However, strategies described below can be used to implement coding tensors that are more efficient than FDHs for 3D imaging, and that are also more robust to the diffuse indirect reflections commonly found in flash illumination systems.
In some embodiments, mechanisms described herein can use a set of coding tensors (e.g., K tensors that are each Mt×Mr×Mc, described below in connection with
Compressive histograms, sometimes referred to as sketches, are an emerging framework for online in-sensor compression of SPAD timestamp data. A coarse histogram is one common compressive histogram approach to reduce data rates and in-pixel memory. Despite their practical hardware implementation, coarse histograms achieve sub-optimal depth accuracy compared to compressive histograms based on Fourier and Gray codes. One limitation of these approaches is that the compressive representation only exploits the temporal information of the incident timestamp, and disregards the spatial redundancy. As described herein, mechanisms described herein can generalize the compressive histogram framework to utilize spatio-temporal information of each timestamp. Additionally, instead of relying on hand-designed coded projections, mechanisms described herein can be used to learn coding tensors as a layer (e.g., a first layer) of a convolutional neural network (CNN).
In some embodiments, multiple neighboring single-photon detectors (e.g., SPAD pixels) can utilize a single shared memory where all timestamps are aggregated into a coarse histogram (e.g., a 4×4 block of pixels, a 3×3 block of pixels, etc.). Such techniques can discard local spatial information (e.g., precise pixel location) of the detected photon timestamps. Compressive histograms implemented using mechanisms described herein are well-suited for such a shared memory implementation, as a compressive histogram can be shared among multiple SPAD pixels and can preserve spatial information through the coded projection. For example, when a timestamp is detected at a pixel, K values are identified from the coding tensor, and those K values are aggregated to the compressive histogram. The K values from the coding tensor can encode spatial and temporal information, which helps preserve the spatial information. If the timestamp is only aggregated on a regular histogram, the spatial information is discarded.
Pixel processor arrays (PPAs) are an emerging sensing technology that embeds processing electronics inside each pixel. This sensing paradigm begins processing the image at the focal plane array, which allows it to reduce the sensor data rates by transmitting only the relevant information, and consequently, can increase sensor throughput which can facilitate computer vision at 3,000 fps. PPAs have also become building blocks of novel computational imaging systems optimized end-to-end for HDR imaging, motion deblurring, video compressive sensing, and light field imaging. In some embodiments, compressive histograms implemented using mechanisms described herein can utilize in-pixel processing techniques. For example, mechanisms described herein can optimize sensor parameters (e.g., values of coding tensors used to generate compressive histograms) and a processing algorithm (e.g., a CNN) to compress data generated by an array of single-photon detectors.
In such systems, the first photon detection times in each laser cycle can be collected and used to generate a histogram of the time-of-arrival of the photons that represents the distribution of detections. For example,
In a SPAD-based 3D camera, the SPAD-based camera can include a SPAD sensor and a pulsed laser that illuminates the scene. The photon flux signal arriving at pixel, p, can be expressed as:
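Φp(t) = ap h(t−2dp/c) + Φpbkg    (1)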
where ap is the amplitude of the returning signal accounting for laser power, reflectivity, and light fall-off; h(t) is the system's impulse response function (IRF), which accounts for the pulse waveform and sensor IRF; dp is the distance to the point imaged by pixel p; c is the speed of light; and Φpbkg is the constant photon flux due to background illumination (e.g., sunlight, indoor lighting, etc.). This model assumes direct-only reflections, which is generally a valid approximation, in particular for scanning-based ToF 3D imaging systems.
Time-correlated single photon counting (TCSPC)-based SPAD cameras can measure Φp(t) by building a per-pixel timing histogram (e.g., as shown in
A pulse repetition period, τ, can determine the maximum timestamp value and the length of a histogram vector Φp=(Φi,p) for i=0, . . . , Nt−1.
Additionally, it can be assumed that pile-up distortions are minimized through various SPAD data acquisition techniques. For example, it can be assumed that the SPAD sensor is being operated in asynchronous mode or is capable of multi-event timestamp collection, which can mitigate pile-up distortions, and can guarantee that Φi,p is an appropriate approximation of Φp(t). As shown in
This process can generate an Nt×Nr×Nc 3D histogram tensor, H=(Φi,p) for p=(0,0), . . . , (Nr−1, Nc−1).
In some embodiments, mechanisms described herein can be used to generate compressive representations of 3D histogram tensors. In order to reduce the data rates output by the SPAD camera, the compact representation can be built in-pixel or inside the focal plane array (FPA) (e.g., as shown in
In some embodiments, mechanisms described herein can be used to implement a family of compressive representations for 3D histogram tensors that can be computed in an online fashion with limited memory and compute. Compressive representations implemented in accordance with mechanisms described herein can be based on the linear spatio-temporal projection of each photon timestamp, which can be expressed as a simple matrix operation. Instead of constructing per-pixel timestamp histograms, a compressive encoding implemented in accordance with mechanisms described herein can map its spatio-temporal information into a compressive histogram. To exploit local spatio-temporal correlations, a single compressive histogram can be built for a local 3D histogram block (e.g., as described below in connection with
In some embodiments, mechanisms described herein can be used to integrate a compression framework with data-driven single-photon data processing techniques (e.g., using convolutional neural networks (CNNs)), which can facilitate end-to-end optimization of the compressive encoding and a single-photon data (e.g., SPAD data) processing CNN.
As described below in connection with
A natural approach to compress H that exploits its local correlations due to smooth depths and photon flux is to build a compressive representation of a local 3D histogram block as illustrated in
A histogram Hb can be defined as a bth histogram block of H with dimensions Mt×Mr×Mc, where Mt≤Nt, Mr≤Nr, and Mc≤Nc. It can be observed that Hb can be expressed as the sum of J one-hot encoding tensors, each representing one photon detection within Hb (e.g., as shown in
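tb,j,l,p = 1 if (l,p) = (Tj, p′), and tb,j,l,p = 0 otherwise,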
where Tj is the timestamp value, and p′ is the pixel where the timestamp was detected. Using this representation, Hb can be represented as follows:
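Hb = Σ_{j=0}^{J−1} tb,j    (3)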
In some embodiments, Hb can be compressed in an online fashion through the linear projection of each timestamp tensor. For example, the projection can be expressed as an inner product with K pre-designed coding tensors, Ck, with dimensions Mt×Mr×Mc (e.g., as shown in
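Ŷb,k = Σ_{j=0}^{J−1} Σ_{l,p′} (Ck·tb,j)_{l,p′} = Σ_{j=0}^{J−1} Ck,l,p′    (4)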
where · denotes element-wise multiplication, and l and p′ are indices where tb,j,l,p′=1. Using this representation, Ŷb=(Ŷb,k) for k=0, . . . , K−1 can be defined as a compressive histogram of Hb. A special case of EQ. (4) is described in U.S. patent application Ser. No. 17/834,884, filed Jun. 7, 2022, which is hereby incorporated by reference herein in its entirety, where C compresses histograms associated with individual pixels, and disregards spatial information (e.g., Mt=Nt, Mr=1, Mc=1). Note that each timestamp in EQ. (4) can be processed efficiently on-the-fly after each photon detection through a simple lookup operation. Additionally, individual histogram blocks and/or timestamps do not need to be explicitly stored or transferred off-sensor.
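For illustration, the following minimal Python sketch shows this per-photon update; the dimensions and names are illustrative assumptions rather than a prescribed implementation:

```python
import numpy as np

# Minimal sketch of the on-the-fly update in EQ. (4). Because each photon
# contributes a one-hot tensor, the inner product with each coding tensor
# reduces to a lookup: the K values C[k, i, r, c] are simply added to the
# compressive histogram.
Mt, Mr, Mc, K = 64, 4, 4, 8          # assumed block dimensions
C = np.random.randn(K, Mt, Mr, Mc)   # K coding tensors for one block
Y = np.zeros(K)                      # compressive histogram for the block

def on_photon_detected(i, r, c):
    """Update the block's compressive histogram for a photon in time bin i
    detected at in-block pixel position (r, c)."""
    Y[:] += C[:, i, r, c]            # K lookup-and-add operations

on_photon_detected(i=12, r=1, c=3)   # example detection event
```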
Compressive histograms, when implemented as in EQ. (4), can introduce an in-sensor memory overhead because, in addition to storing Ŷb, C needs to be stored in-sensor for efficient lookup operations. In some examples, C can be shared among multiple pixels; for example, if a block has a spatial dimension of 2×2, C can be shared among those 2×2 pixels, and a given pixel would only have to store the row of C that is associated with it. In some embodiments, a practical compressive single-photon camera implemented using mechanisms described herein can rely on parameter-efficient coding tensors that mitigate such overhead. Note that two strategies to design practical coding tensors were evaluated, and results are described below in connection with
Certain coding tensor implementations may be impractical, as they may incur memory overhead that renders them unsuitable. Consider a set of coding tensors that operate on the full histogram tensor (e.g., Hb=H). In this case, the number of elements in C exceeds the number of elements of H. Consequently, although the data rates are reduced in this scenario since K<(Nt·Nr·Nc), the in-sensor memory required exceeds the size of the histogram tensor. To mitigate this issue, two complementary strategies to implement lightweight coding tensors are described below: local block-based and separable.
Local Block-based Coding Tensors: As the size of the histogram block Hb represented by a compressive histogram is reduced, the size of the coding tensors also decreases. Therefore, compressing local histogram blocks not only offers benefits due to local spatio-temporal redundancies, but also can be beneficial because these local coding tensors have fewer parameters than larger coding tensors. For example, local block-based coding tensors are used in temporal compressive histograms described in U.S. patent application Ser. No. 17/834,884, which has been incorporated by reference herein, where Hb is a per-pixel histogram.

Separable Coding Tensors: Another approach to implementing lightweight coding tensors that can be used is to make them separable along the temporal and spatial dimensions. This approach can also be used in parameter-efficient CNN models that use separable depth-wise convolutional layers to reduce model size. A separable coding tensor can be represented as the outer product of two smaller tensors:
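Ck = Cktemporal ⊗ Ckspatial    (5)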
where Cktemporal is an Mt×1×1 tensor, and Ckspatial is a 1×Mr×Mc tensor. In some examples, each k has a different temporal component; that is, for every k, there can be one Cktemporal and one Ckspatial. This implementation can also be beneficial due to differences between the temporal and spatial correlations encountered in histogram blocks. In addition to local correlations present in both dimensions, the temporal dimension often exhibits long-range correlations due to the background illumination offset (Φpbkg) in every histogram bin. Accordingly, EQ. (5) can be used to represent such correlations by encoding the temporal and spatial information independently.
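For illustration, a minimal Python sketch of constructing a separable coding tensor per EQ. (5), with assumed dimensions:

```python
import numpy as np

# Minimal sketch of a separable coding tensor (assumed dimensions).
Mt, Mr, Mc = 1024, 4, 4
C_temporal = np.random.randn(Mt, 1, 1)   # Mt x 1 x 1 temporal component
C_spatial = np.random.randn(1, Mr, Mc)   # 1 x Mr x Mc spatial component

# Broadcasting realizes the outer product, yielding an Mt x Mr x Mc tensor.
C_k = C_temporal * C_spatial

# The factorization stores Mt + Mr*Mc values instead of Mt*Mr*Mc.
print(Mt + Mr * Mc, "parameters vs.", Mt * Mr * Mc)   # 1040 vs. 16384
```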
One assumption made in the memory overhead analysis described herein is that a compressive SPAD-based 3D camera only needs to store a single C that is shared across the full sensor, which can be implemented in two general ways. One approach can distribute C across the local memory of all pixels and then allow communication across pixels (e.g., as in pixel processor arrays (PPAs)). Another approach can store C in a global memory that can be accessed by all pixels which can be facilitated using any suitable techniques, such as in 3D-stacked SPAD cameras. Additionally, some of the coding tensor implementations described herein have as few as 640 parameters. In such an example, even if C is stored for every 4×4 group of pixels, the in-sensor memory can still be reduced by 20× compared to storing a 1024 bin per-pixel histogram.
In some embodiments, mechanisms described herein can build a compressive histogram for each histogram block Hb. Using a block size that is less than the size of H, multiple compressive histograms can be used to encode the complete histogram tensor H. In this way, the coding tensors can be viewed as a set of 3D convolutional filters, which can be implemented as a layer of a CNN (e.g., a first layer of a 3D CNN). For simplicity, it is assumed that histogram blocks do not overlap, with the stride of the convolutional filters being equal to their dimensions. However, in some examples, the systems and methods described herein may also be used when the histogram blocks overlap by accounting for the overlap using the compression ratio expression.
Note that the compressed histogram tensor representation is not directly compatible with 3D CNNs that have been designed for SPAD-based 3D imaging (e.g., as described in Peng et al., “Photon-efficient 3d imaging with a non-local neural network,” in European Conference on Computer Vision, pp. 225-241 (2020) and Lindell et al., “Single-photon 3d imaging with deep sensor fusion,” ACM Trans. Graph, 37 (4): 113-1 (2018)). In some embodiments, each compressive histogram can be lifted back to the original 3D domain through an unfiltered backprojection when used in connection with such a pre-trained CNN. For example, the following relationship can be used to decode the compressive histograms:
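Ĥb = Σ_{k=0}^{K−1} Ŷb,k Ck    (6)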
where Ĥb is the decoded compressive histogram for block b, which is a weighted linear combination of the coding tensors. The decoded histogram blocks can be concatenated and given as input to a processing 3D CNN.
In some embodiments, a compressive histogram layer can include an encoding/compression portion implemented on a single-photon detection chip, and a decoding/decompression portion that can be implemented off the single-photon detection chip (e.g., on a processor that receives the compressed histograms). This layer can be appended to the beginning of any CNN that has been designed to process 3D histogram tensors in any suitable application (e.g., depth estimation, FLIM, NLOS, etc.). Additionally, in some embodiments, at least a portion of the coding tensors can be jointly optimized with the downstream CNN in an end-to-end manner. In some examples, when the CNN is a three-dimensional CNN that was trained to process a three-dimensional histogram tensor, the decoding step in
In some embodiments, light source 602 can be any suitable light source that can be configured to emit modulated light (e.g., as a stream of pulses) toward a scene 618 illuminated by an ambient light source 620 in accordance with a signal received from signal generator 616. For example, light source 602 can include one or more laser diodes, one or more lasers, one or more light emitting diodes, and/or any other suitable light source. In some embodiments, light source 602 can emit light at any suitable wavelength. For example, light source 602 can emit ultraviolet light, visible light, near-infrared light, infrared light, etc. In a more particular example, light source 602 can be a coherent light source that emits light in the green portion of the visible spectrum (e.g., centered at 532 nm). In another more particular example, light source 602 can be a coherent light source that emits light in the infrared portion of the spectrum (e.g., centered at a wavelength in the near-infrared such as 1060 nm or 1064 nm).
In some embodiments, image sensor 604 can be an image sensor that is implemented at least in part using one or more SPAD detectors (sometimes referred to as a Geiger-mode avalanche diode) and/or one or more other detectors that are configured to detect the arrival time of individual photons. In some embodiments, one or more elements of image sensor 604 can be configured to generate data indicative of the arrival time of photons from the scene via optics 606. For example, in some embodiments, image sensor 604 can be a single SPAD detector. As another example, image sensor 604 can be an array of multiple SPAD detectors. As yet another example, image sensor 604 can be a hybrid array including one or more SPAD detectors and one or more conventional light detectors (e.g., CMOS-based pixels). As still another example, image sensor 604 can be multiple image sensors, such as a first image sensor that includes one or more SPAD detectors that is used to generate depth information and a second image sensor that includes one or more conventional pixels that is used to generate ambient brightness information and/or image data. In such an example, optical components can be included in optics 606 (e.g., multiple lenses, a beam splitter, etc.) to direct a portion of incoming light toward the SPAD-based image sensor and another portion toward the conventional image sensor that is used for light metering.
In some embodiments, image sensor 604 can include on-chip processing circuitry 622 that can be used to generate compressive histograms (e.g., using memory and logic implemented on the image sensor chip), which can be output to processor 608, which can facilitate a reduction in the volume of data transferred from image sensor 604. For example, single-photon detectors of image sensor 604 can be associated with circuitry that implements at least a portion of process 700, described below. As a particular example, single-photon detectors of image sensor 604 can be associated with circuitry that is configured to determine which bin of a full resolution histogram (e.g., which column and row of a block the detector is located in) is associated with a time at which a photon is detected.
As another more particular example, a single-photon detector or a group of single-photon detectors of image sensor 604 can be associated with accumulators that are configured to update and store values for bins of the compressive histogram K associated with the single-photon detector(s) based on values of the code in a coding tensor associated with a time at which a photon is detected and the detector at which the photon was detected. In some embodiments, the accumulators can be implemented using any suitable technique or combination of techniques. For example, for a fully binary coding tensor (e.g., in which each element represents a 1 or a −1), the accumulators can be configured to increment or decrement by 1 from a current value (e.g., using a register configured to store a two's complement representation of an integer). As another example, for a coding tensor configured to use floating point values (e.g., Gray-based Fourier, Truncated Fourier, etc.), the accumulators can be configured to add a (positive or negative) multi-bit value (e.g., a fixed-point number, a floating point number, an integer, etc.). As a more particular example, for a coding tensor configured to use fixed-point values, the accumulators can be configured to add a (positive or negative) multi-bit fixed-point value (e.g., an 8-bit value, a 10-bit value, etc.). In some embodiments, a coding tensor can be configured to store values in a range of [−1,1] using fixed-point values that each represent a value in the range (e.g., using only positive binary values, using a two's complement representation, etc.). In such an example, the value from the coding tensor can be converted into a floating point or fixed-point representation prior to being added to the accumulator, or values stored in an accumulator can be converted to a floating point or fixed-point representation prior to being used to calculate a depth value. In a more particular example, values in a coding tensor can be represented using a representation of a particular bit depth (e.g., using 8 bits, using 10 bits, using 12 bits, using two bytes, etc.), which can create a quantized representation of the value in the coding tensor (e.g., in an 8-bit quantized representation 0000 0000 can represent −1, 1111 1111 can represent 1, and 0000 0001 can represent −0.9921875; in an 8-bit two's complement quantized representation 0000 0000 can represent 0, 1000 0000 can represent −1, 0111 1111 can represent 1, and 1000 0001 can represent −0.9921875). In such an example, the values in the coding tensor can be represented using the closest value available in the quantized representation. In some embodiments, the accumulators can be implemented using various different hardware implementations.
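For illustration, a minimal Python sketch of the unsigned 8-bit mapping from the example above; this is one possible convention (step 2/256), and actual hardware implementations can differ:

```python
# Unsigned 8-bit quantized mapping of coding-tensor values in [-1, 1]
# (step 2/256: code 0 -> -1, code 1 -> -0.9921875, and code 255 ->
# 0.9921875, treated as approximately 1).
def dequantize(code):
    return -1.0 + code * (2.0 / 256)

def quantize(value):
    code = round((value + 1.0) * 256 / 2.0)
    return max(0, min(255, code))    # clamp to the 8-bit range

assert quantize(-1.0) == 0
print(dequantize(1))                 # -0.9921875
```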
As yet another more particular example, single-photon detectors of image sensor 604 can be associated with components (e.g., memory, logic, etc.) configured to store a representation of the coding tensors C. In some embodiments, a single representation of coding tensors C can be stored in on-chip memory, and can be accessed by circuitry associated with multiple single-photon detectors. For example, a single representation of coding tensors C can be stored in global memory (e.g., memory implemented on image sensor 604), and circuitry associated with each single-photon detector can be configured to retrieve coding tensors C from the global memory (e.g., in connection with each frame). As another example, a representation of coding tensors C can be stored in multiple local memories (e.g., associated with one or more single-photon detectors), and circuitry associated with each single-photon detector can be configured to retrieve coding tensors C from the local memory (e.g., in connection with each frame). Such local memory can be shared, for example, among a spatially local neighborhood of single-photon detectors of an array (e.g., among any suitable number of neighboring single-photon detectors, from several to hundreds, thousands, etc.). In such examples, image sensor 604 can be configured to use the representation(s) of the coding tensors C to update a compressive histogram associated with a single-photon detector or block of single-photon detectors responsive to detection of a photon.
In some embodiments, different coding tensors C can be used in connection with different time periods (e.g., coding matrices can be changed for different frames) and/or for different areas of the image sensor. For example, during a first time period a first set of coding tensors C1 with a first compression ratio can be used, and during another time period another set of coding tensors C2 with a different compression ratio can be used. In such an example, the coding tensors C can be adjusted based on environmental conditions. For example, as the amount of ambient light increases, coding tensors C with a lower compression ratio can be used to reduce noise by decreasing compression. As described below in connection with
In some embodiments, the on-chip processing circuitry can be implemented using any suitable fabrication techniques. For example, 3D-stacking CMOS techniques can be used to implement circuit components configured to generate a compressive histogram for each single-photon detector.
In some embodiments, system 600 can include additional optics. For example, although optics 606 is shown as a single lens and attenuation element, it can be implemented as a compound lens or combination of lenses. Note that although the mechanisms described herein are generally described as using SPAD-based detectors, this is merely an example of a single photon detector that is configured to record the arrival time of a photon with a time resolution on the order of picoseconds, and other components can be used in place of SPAD detectors. For example, a photomultiplier tube in Geiger mode can be used to detect single photon arrivals.
In some embodiments, optics 606 can include optics for focusing light received from scene 618, one or more narrow bandpass filters centered around the wavelength of light emitted by light source 602, any other suitable optics, and/or any suitable combination thereof. In some embodiments, a single filter can be used for the entire area of image sensor 604 and/or multiple filters can be used that are each associated with a smaller area of image sensor 604 (e.g., with individual pixels or groups of pixels). Additionally, in some embodiments, optics 606 can include one or more optical components configured to attenuate the input flux (e.g., a neutral density filter, a diaphragm, etc.).
In some embodiments, system 600 can communicate with a remote device over a network using communication system(s) 614 and a communication link. Additionally, or alternatively, system 600 can be included as part of another device, such as a smartphone, a tablet computer, a laptop computer, an autonomous vehicle, a robot, etc. Parts of system 600 can be shared with a device within which system 600 is integrated. For example, if system 600 is integrated with an autonomous vehicle, processor 608 can be a processor of the autonomous vehicle and can be used to control operation of system 600.
In some embodiments, system 600 can communicate with any other suitable device, where the other device can be one of a general purpose device such as a computer or a special purpose device such as a client, a server, etc. Any of these general or special purpose devices can include any suitable components such as a hardware processor (which can be a microprocessor, digital signal processor, a controller, etc.), memory, communication interfaces, display controllers, input devices, etc. For example, the other device can be implemented as a digital camera, security camera, outdoor monitoring system, a smartphone, a wearable computer, a tablet computer, a personal data assistant (PDA), a personal computer, a laptop computer, a multimedia terminal, a game console, a peripheral for a game console or any of the above devices, a special purpose device, etc.
Communications by communication system 614 via a communication link can be carried out using any suitable computer network, or any suitable combination of networks, including the Internet, an intranet, a wide-area network (WAN), a local-area network (LAN), a wireless network, a digital subscriber line (DSL) network, a frame relay network, an asynchronous transfer mode (ATM) network, and/or a virtual private network (VPN). In further examples, data may also be transferred from one or more sensor chips to a processing chip/device using a universal serial bus (USB), a peripheral component interconnect express (PCIe) bus, and/or low-voltage differential signaling (LVDS). The communications link can include any communication links suitable for communicating data between system 600 and another device, such as a network link, a dial-up link, a wireless link, a hard-wired link, any other suitable communication link, or any suitable combination of such links.
It should also be noted that data received through the communication link or any other communication link(s) can be received from any suitable source. In some embodiments, processor 608 can send and receive data through the communication link or any other communication link(s) using, for example, a transmitter, receiver, transmitter/receiver, transceiver, or any other suitable communication device.
At 702, process 700 can include modifying a convolutional neural network (CNN) to add a first layer that includes multiple coding tensors (e.g., coding tensors C) that can be used to encode timestamps corresponding to detected photons into compressed histograms. For example, a layer that includes K convolutional filters, as described above in connection with
Additionally, at 702, in some embodiments, process 700 can include modifying the CNN to add a decoding layer that can be used to decode the compressed histograms into an uncompressed histogram with the same dimension as an uncompressed 3D histogram tensor that would be generated from raw timestamp information.
In some embodiments, 702 can be omitted. For example, in an implementation in which a downstream CNN is not used to analyze the data, the first layer and decoding layer can be implemented without modifying a pre-existing CNN architecture.
At 704, process 700 can include training the modified CNN using training and test datasets. For example, in some embodiments, the CNN modified at 702 can be a pre-trained CNN (e.g., trained to perform depth estimation from 3D histogram tensor data), and at 704 the first layer can be trained, without further training the rest of the CNN. As another example, in some embodiments, the CNN modified at 702 can be a pre-trained CNN (e.g., trained to perform depth estimation from 3D histogram tensor data), and at 704 the first layer can be trained along with further training of other layers of the CNN. As yet another example, in some embodiments, the CNN modified at 702 can be an untrained or partially trained CNN, and at 704 the first layer can be trained along with training of other layers of the CNN.
In some embodiments, weights for portions of the first layer can be trained (e.g., a spatial portion of a separable coding tensor) and weights for another portion of the first layer (e.g., a temporal portion of the separable coding tensor) can be predetermined.
In some embodiments, 704 can be omitted. For example, in an implementation in which coding tensors are not learned, the first layer and decoding layer can be implemented without training.
At 706, process 700 can detect an arrival of a photon at a single-photon detector at time t using any suitable technique or combination of techniques. For example, process 700 can detect the arrival of a photon based on activation of a SPAD, which can cause a timestamp corresponding to the time t at which the photon was detected to be generated (e.g., by a time-to-digital converter). The single-photon detector at which the photon was detected can be associated with a position within an array of single-photon detectors, which can correspond to a particular position within a particular block Hb of a 3D histogram tensor H.
At 708, process 700 can determine a time bin i of a histogram (e.g., a full-resolution histogram having a number of time bins corresponding to a full 3D histogram tensor H) for the photon detected at time t using any suitable technique or combination of techniques. For example, process 700 can determine a difference between a time when a light source emitted a pulse, and time t when the photon was detected, and can determine which bin of the histogram corresponds to the time difference.
Additionally, at 708, process 700 can determine a position p′ of the detector at which the photon was detected using any suitable technique or combination of techniques. For example, p′ can be a position of the pixel where the photon was detected, which can be obtained using logic in the SPAD array.
At 710, process 700 can generate a code word (e.g., of length K) representing time bin i for position p′ using coding tensors C, and the relative position of p′ within block Hb. For example, as described above in connection with
At 712, process 700 can update the values of K bins of a compressive histogram for a block b that includes position p′. For example, process 700 can update values in a memory storing K bins of the compressive histogram for block b. As another example, process 700 can update values of K accumulators used to store the values of the compressive histogram being constructed for block b. For example, each value in the code word can be added to a corresponding bin of the K bins or accumulator of the K accumulators. In some embodiments, accumulator updates can be performed using various techniques, and the techniques used can depend on the implementation of the coding tensor (e.g., using integers, fixed-point numbers, floating point numbers, etc.).
At 714, process 700 can determine whether a frame has elapsed (e.g., whether a time period corresponding to a single depth measurement has elapsed). For example, after a time associated with a frame (e.g., 33 milliseconds at 30 fps, 10 milliseconds at 100 fps) has elapsed from a previous readout (e.g., based on a reset signal), process 700 can determine that a frame has elapsed.
If process 700 determines that a frame has not elapsed (“NO” at 714), process 700 can return to 706 and can detect a next photon. Note that a photon may not be detected for each light source emission, and process 700 can move from 706 to 714 if a detection period τ has elapsed without a photon detection. In some embodiments, process 700 can move from 706 to 714 for each detector during each detection period τ.
Otherwise, if process 700 determines that a frame has elapsed (“YES” at 714), process 700 can move to 716. At 716, process 700 can output values from the K bins of the compressive histogram associated with each block. For example, after a frame has elapsed, process 700 can output the values of the K bins to processor 608.
At 718, process 700 can decode and/or decompress the compressed histogram values for each block (e.g., using techniques described above in connection with
At 720, process 700 can provide the decompressed histogram values as input to a layer of the modified CNN. For example, process 700 can provide a 3D tensor histogram H decoded from the compressed histograms for each block to a subsequent layer of the CNN.
At 722, process 700 can receive an output from the CNN indicative of one or more properties of the scene based on the information included in the decompressed histogram. For example, process 700 can receive depth values as output from the CNN. As another example, process 700 can receive lifetime values (e.g., in FLIM imaging) as output from the CNN.
In some embodiments, 720 and/or 722 can be omitted, for example, in an implementation in which a CNN is not used to generate depth or other values (e.g., where depth values or other values are calculated directly from the histogram values).
At 724, process 700 can generate values based on the decompressed histogram and/or outputs from the CNN received at 722. For example, process 700 can use any suitable technique or combination of techniques to determine a depth value from the decompressed histogram (such as techniques described in connection with 814 of FIG. 8 in U.S. patent application Ser. No. 17/834,884, which has been incorporated by reference herein). As another example, process 700 can use the outputs from CNN to calculate values. In some examples, process 700 can perform an imaging task based on the values. For example, process 700 can estimate a depth value based on the values (e.g., in SPAD LiDAR three-dimensional imaging, Fluorescence lifetime imaging, Non-line-of-sight (NLOS) imaging, computer vision imaging, or any other suitable imaging task).
In some embodiments, one or more portions of process 700 can be repeated for each frame and/or for each block of single-photon detectors (e.g., each block of SPAD outputs) of the image sensor.
In some embodiments, depth values generated using process 700 can be used to generate a depth image (e.g., such as depth images shown in
Datasets used for model training and testing are described below, as well as implementation details for the compressive histogram layer and the 3D CNN used for the experiments.
For training, a synthetic SPAD measurement dataset including different scenes at a wide range of illumination settings was generated. A synthetic data generation pipeline similar to those used in previous learning-based SPAD-based 3D imaging works was used (e.g., as described in Peng et al., “Photon-efficient 3d imaging with a non-local neural network”; Lindell et al., “Single-photon 3d imaging with deep sensor fusion”; and Sun et al., “Spadnet: deep rgb-spad sensor fusion assisted by monocular depth estimation,” Optics Express, 28 (10): 14948-14962 (2020)). Using EQ. (2), SPAD measurements can be simulated given an RGB-D image, the pulse waveform (h(t)), and the average number of detected signal and background photons per pixel. Additional details of the simulation pipeline used are described below.
Simulated Training Dataset: RGB-D images from the NYU v2 dataset were used to generate a simulated training dataset. The simulated histograms have Nt=1024 bins and a Δ=80 picosecond (ps) bin size (corresponding to a 12.3 meter (m) depth range). The pulse waveform used has a full-width half maximum (FWHM) of 400 ps. For each scene, the average numbers of signal and background photons detected per pixel were randomly set to [2, 5, 10] and [2, 10, 50], respectively. With appropriate normalization, the models generalize to other photon levels despite being trained on this photon-starved dataset. A total of 16,628 histogram tensors with dimensions 1024×64×64 were simulated and split into a training and a validation set with 13,851 and 2,777 examples, respectively.
Simulated Test Dataset: For testing, 8 RGB-D images from the Middlebury stereo dataset were used. The simulated histograms have Nt=1024 bins and a Δ=100 ps bin size (corresponding to a 15.3 m depth range). The pulse waveform used is a Gaussian pulse with an FWHM of 318 ps (with a σ=135 ps). In some examples, σ is the standard deviation of the Gaussian pulse; the width of the Gaussian pulse can be described using either the FWHM or the standard deviation. A total of 128 test histogram tensors were generated by simulating each scene with the following average number of detected signal/background photons: 2/2, 2/5, 2/50, 5/2, 5/10, 5/50, 10/2, 10/10, 10/50, 10/200, 10/500, 10/1000, 50/50, 50/200, 50/500, and 50/1000.
Real-world Experimental Data: An evaluation of the generalization of the models described herein to real-world experimental data captured in Lindell et al., “Single-photon 3d imaging with deep sensor fusion,” is described below.
To simplify training, the input to all models was a 3D histogram tensor, though techniques for generating compressive histograms can be used directly on streams of photon timestamps (EQ. (4)).
Compressive Histogram Layer: The encoder was implemented as a 3D convolution with a stride equal to the filter size, with learned filters that are the coding tensors, Ck. A constraint was applied to Ck to be zero-mean along the time dimension. The unfiltered backprojection decoder was implemented as a 3D transposed convolution with a stride equal to its filter size. To help the CNN model generalize to different photon count levels, zero-normalization was applied along the channel dimension (along K) to the inputs (Ŷb) and the weights (C) of the transposed convolution, sometimes referred to as layer-norm. This normalization is also commonly used in depth decoding algorithms.
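For illustration, a minimal PyTorch sketch of such an encoder/decoder pair; the dimensions and the exact form of the normalization are illustrative assumptions, not the specific implementation used in the experiments:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Encoder: a 3D convolution whose stride equals its kernel size
# (non-overlapping blocks) and whose K filters are the coding tensors Ck.
# Decoder: the matching transposed convolution (unfiltered backprojection)
# that reuses the encoder weights.
K, Mt, Mr, Mc = 8, 1024, 4, 4                 # assumed dimensions
encoder = nn.Conv3d(1, K, kernel_size=(Mt, Mr, Mc),
                    stride=(Mt, Mr, Mc), bias=False)

def compress_and_decode(hist_tensor):
    # Constrain the coding tensors to be zero-mean along the time dimension.
    C = encoder.weight - encoder.weight.mean(dim=2, keepdim=True)
    Y = F.conv3d(hist_tensor, C, stride=(Mt, Mr, Mc))
    # Zero-normalize the compressive histograms and the weights along K
    # (layer-norm style) before the transposed convolution.
    Yn = (Y - Y.mean(dim=1, keepdim=True)) / (Y.std(dim=1, keepdim=True) + 1e-6)
    Cn = (C - C.mean(dim=0, keepdim=True)) / (C.std(dim=0, keepdim=True) + 1e-6)
    return F.conv_transpose3d(Yn, Cn, stride=(Mt, Mr, Mc))

x = torch.rand(1, 1, Mt, 8, 8)        # one 1024x8x8 histogram tensor
decoded = compress_and_decode(x)      # same shape as x
```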
Depth Estimation 3D CNN: To estimate depths from the decoded histogram tensor, the 3D deep boosting CNN model described in Peng et al., “Photon-efficient 3d imaging with a non-local neural network,” was used for single-photon 3D imaging, without the non-local block. Similar to Peng et al., “Photon-efficient 3d imaging with a non-local neural network,” and Lindell et al., “Single-photon 3d imaging with deep sensor fusion,” the pixel-wise KL-divergence between the output histogram tensor and a normalized true histogram tensor was used as the objective, and depths were estimated using a softargmax. In some examples, the models including the encoding and decoding layers can be trained without using pre-trained models. In other examples, a pre-trained 3D CNN can be used by fine-tuning the model while the encoding layer is trained. In some embodiments, the decoding layer may not be trained, as the decoding layer can depend on the weights of the encoding layer. In other examples, only a subset of the weights of the encoding layer (i.e., parts of the coding tensors) can be trained. For example, some weights can be selected and initialized to predetermined codes (e.g., Fourier codes), and the other weights can be learned.
Training: At each training iteration, patches of size 1024×32×32 were randomly sampled. All models were trained using the ADAM optimizer with β1=0.9, β2=0.999, a batch size of 4, and a learning rate of 0.001 that decays by 0.9 after every epoch. All models were trained for 30 epochs with periodic checkpoints, and for a given model the checkpoint that achieved the lowest root mean squared error (RMSE) on the validation set was chosen.
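For illustration, a minimal PyTorch sketch of this training configuration; the model and data below are placeholders (in practice the model would be the modified 3D CNN and each batch would hold four 1024×32×32 patches with the KL-divergence objective):

```python
import torch

model = torch.nn.Linear(16, 16)                      # placeholder model
batches = [torch.rand(4, 16) for _ in range(8)]      # placeholder batches

optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(30):
    for batch in batches:
        optimizer.zero_grad()
        loss = model(batch).pow(2).mean()            # placeholder objective
        loss.backward()
        optimizer.step()
    scheduler.step()        # decay the learning rate by 0.9 after each epoch
```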
The performance at various compression levels for different coding tensor designs jointly optimized with the depth estimation CNN described above is described below. A coding tensor design was determined by the dimensions of Ck (Mt×Mr×Mc), the size of the compressive histograms (K), and whether Ck is separable.
Comparisons are against the following baselines:
Temporal Truncated Fourier: A compressive histogram that uses coding tensors with dimensions 1024×1×1 and whose weights are set using the first K/2 frequencies of the Fourier matrix (e.g., as described in Gutierrez-Barragan et al., “Compressive single-photon 3d cameras,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17854-17864 (2022), and Sheehan et al., “A sketching framework for reduced data transfer in photon counting lidar,” IEEE Transactions on Computational Imaging, 7:989-1004 (2021)).
Temporal Coarse Histogram: Here C is a box downsampling operator along the temporal dimension, which produces a coarse histogram with K bins.
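For illustration, such box downsampling can be sketched in a few lines of Python (a minimal sketch, assuming Nt is divisible by K):

```python
import numpy as np

# Coarse-histogram baseline: box downsampling of a full Nt-bin histogram
# into K coarse bins.
Nt, K = 1024, 16
full_hist = np.random.poisson(2.0, size=Nt)          # stand-in histogram
coarse_hist = full_hist.reshape(K, Nt // K).sum(axis=1)
```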
No Compression Oracle: In this baseline, the ideal scenario is assumed where the histogram tensor is transferred off-sensor and processed with the depth estimation 3D CNN. Similar to Peng et al., “Photon-efficient 3d imaging with a non-local neural network,” this model was trained with an initial learning rate of 1e-4 and total variation regularization.
Peak Compression Oracle: This baseline implements an ideal SPAD camera with sufficient in-sensor memory to store a histogram tensor and sufficient computation power to compute per-pixel depths through an argmax along the temporal axis. To process the noisy 2D depth images with the 3D CNN, a 3D grid was generated where all elements are 0 except for one element per spatial location whose index is proportional to the depth value. This model was trained like the no compression oracle.
Similar to mechanisms described herein, all compressive histogram baselines described here were implemented as a compressive histogram layer, with fixed weights, whose outputs were processed by the depth estimation 3D CNN.
Evaluation Metrics: The 3D imaging performance of each model was summarized using two metrics: (1) the mean absolute depth error (MAE), and (2) the percent of pixels with absolute depth errors lower than 10 mm. To understand the performance under these metrics, the test set was divided into different SBR ranges and the metrics for each range were reported individually. The overall dataset performance was also visualized as scatter plots (e.g., as shown in
As shown in
Comparisons of depth reconstructions of compressive histograms with coding tensors that were optimized (using mechanisms described herein) against coding tensors that were fixed and not optimized throughout training are shown in
In
Each scatter plot point in
In
Each scatter plot point in
The effect of reducing the size of C at 128× compression is shown in
Dataset Bias in Learned Coding Tensors with Mt=Nt: The training dataset depth bias that is embedded in some coding tensor designs is analyzed below, along with its effect on generalization to scenes with depths that appear less often in the dataset.
Depth Range Bias in Learned Coding Tensors:
Quantitative Performance Analysis on Depths Between 7-10m:
Qualitative Performance Analysis on Depths Between 7-10m:
In general, dataset bias can be analyzed in any learning-based model. It was found that this bias can lead to learned coding tensors that only work well for a subset of depths. One way to resolve this problem is to augment the dataset to include examples that span the full depth range. An even simpler approach, however, is to use a coding tensor design with a smaller block size, making the tensor convolutional. In the disclosure, it was shown that learned coding tensors that are robust to this dataset bias continue to provide the same performance benefits that were observed.
Supplemental Analysis of the Coding Tensor Design Space: In this section, additional results related to the ablation study on the coding tensor design space are presented. These results include the effect of the spatial block dimensions, the effect of the size of C, and the performance difference between coding tensors whose temporal dimension is learned and coding tensors whose temporal dimension is initialized and fixed to truncated Fourier codes.
Why does 256×1×1 fail at 128× compression? As observed in
Overall, coding tensors that exploit spatial correlations are more robust to low SBR settings. However, at high SBR, operating on a large spatial neighborhood can make it harder to resolve fine scene details. Moreover, increasing the spatial block dimension further increases the number of parameters of the coding tensor which, as discussed in the Detailed Description, is less practical. In this analysis, it was found that coding tensors with spatial blocks of 2×2 and 4×4 achieve a good balance: robustness to noise at low SBR, while being able to reconstruct fine scene details at high SBR.
The number of parameters can be reduced by either making the coding tensors separable or making their temporal dimension smaller. At low compression levels (32× compression), coding tensors with as few as 2,560 parameters perform comparably to larger coding tensors with 100× more parameters. At higher compression levels (128× compression), the size of the coding tensors starts to have a more pronounced effect on depth image quality. At higher SBR levels (i.e., SBR >0.1), the larger coding tensors are better able to recover fine scene structures such as the spikes in
Learned vs. Fourier-based Temporal Compressive Representations: In this section, the performance of separable coding tensors whose Ck^temporal is either learned or fixed to a truncated Fourier coding tensor during training is compared.
Temporal Gray Fourier: In addition to comparing with the Truncated Fourier coding tensors, performance was also compared to another Fourier-based coding tensor design, which is referred to herein as Gray Fourier. A Gray Fourier compressive histogram can use coding tensors with dimensions 1024×1×1. The coding tensors form a Fourier matrix in which the frequency of the sinusoidal signal doubles every two rows, as illustrated in
Fourier+Learned C: In this separable coding tensor design, the temporal coding tensors (Ck^temporal) were fixed to truncated Fourier coding tensors, and the spatial coding tensors (Ck^spatial) were learned. The temporal coding tensors in this design can be represented with a small number of parameters that does not scale with K, as discussed above; hence, the in-sensor memory overhead they introduce is smaller than that of a fully learned coding tensor.
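The following sketch illustrates why a separable design is parameter-efficient: each Ck is the outer product of a temporal code and a spatial code, so only Mt+Mr·Mc parameters per tensor are stored rather than Mt·Mr·Mc. The dimensions and variable names are illustrative assumptions.

```python
# Illustrative sketch of a separable coding tensor: each C_k factors into a
# temporal code and a spatial code (dimensions below are assumptions).
import torch

K, Mt, Mr, Mc = 64, 1024, 4, 4
c_temporal = torch.randn(K, Mt)       # e.g., fixed truncated Fourier codes
c_spatial = torch.randn(K, Mr, Mc)    # learned spatial codes

# Materialize the full coding tensors from the separable factors:
C = torch.einsum('kt,krc->ktrc', c_temporal, c_spatial)  # (K, Mt, Mr, Mc)

# Parameter count: K*(Mt + Mr*Mc) = 66,560 stored values instead of
# K*Mt*Mr*Mc = 1,048,576 for a fully learned coding tensor (~16x fewer).
```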
Results:
As described in previous sections, coding tensors that exploit spatial information (e.g., 256×4×4 or 256×2×2) can be expected to provide higher quality reconstructions, especially at lower SBR levels. At 64× compression (
Fourier-based temporal coding tensors have a memory-efficient implementation that does not scale with K. Using the flexible spatio-temporal compressive histogram framework described in the Detailed Description, coding tensors whose temporal dimension is fixed to Fourier codes and whose spatial coding tensors are learned can be implemented. This results in a practical compressive histogram model that can be implemented in existing SPAD pixels while providing robust performance across SBR and compression levels. Fully learned coding tensors can still provide some improvements in the most challenging situations (high compression and low SBR); however, they require additional in-sensor memory. Nonetheless, this additional in-sensor memory may be negligible in implementations where a large number of SPAD pixels share the same copy of the coding tensors.
Evaluation on Real-world Data: To evaluate the generalization of the proposed models, raw histogram tensor data captured with a SPAD-based 3D camera prototype was downloaded. The dataset was captured with a line scanning system including a co-located picosecond laser and a 1D LinoSPAD array with 256 SPAD pixels. The histogram tensors have Nt=1536 time bins, a spatial resolution of 256×256, and a bin size Δ=26 ps. The raw histogram tensors were downsampled to be 1024×128×128 to make the time domain compatible with the learned coding tensors that use Mt=1024 and also to avoid out-of-memory errors.
All models were able to produce plausible depth reconstructions, suggesting good generalization to real-world data. However, all compressive histogram models display small artifacts throughout the image that could be due to high noise levels or generalization problems. These artifacts seem to be avoided by the oracle baselines (no compression and peak compression) by over-smoothing the images. This over-smoothing is due to the total variation regularizer that was used for the oracle baselines but not for the compressive histogram models, and that was found to produce better oracle models on the synthetic datasets. These results therefore suggest that a spatial regularizer can be used to improve the compressive histogram models' generalization on real-world data. Nonetheless, despite these minor artifacts, the depth reconstructions suggest good generalization by all models to these challenging scenarios.
Comparison with Coarse Histogramming: Although the depth images for the coarse histogramming coding tensor shown in
Comparison with Fourier-based C: It was observed that the models that use a Gray Fourier C at 128× compression produce blurrier depth reconstructions than the learned C models. This was observed in the lamp scene, where the wires merge into a single blob, and in the staircase scene, where the stair edges are blurred. On the other hand, the models with a learned C produce sharper depth reconstructions at the same compression level, despite being trained in the exact same manner.
Simulating SPAD Measurements: In this section, a detailed description of how SPAD measurements were simulated for the synthetic datasets used to generate results described herein and in the Detailed Description is provided.
Given an RGB-D image, a pulse waveform h(t), and the mean number of detected signal (Φ_mean^sig) and background (Φ_mean^bkg) photons per pixel, the photon detection parameters for EQ. (2) were set as follows. First, the amplitude of the illumination signal arriving at each pixel (a_p in EQ. (1)) was calculated using the reflectance at that pixel and accounting for the intensity radial fall-off due to distance. The NYUv2 training set reflectance was estimated using intrinsic image decomposition on the blue channel of the RGB image, and the Middlebury testing set reflectance was estimated using the mean of the RGB channels. Intrinsic image decomposition can lead to more accurate reflectance estimates for non-Lambertian surfaces. Then, given a_p, h(t), the per-pixel depths, and the average number of signal photons per pixel, the average number of signal photons arriving at each time bin can be scaled such that (1/N_p)·Σ_p Σ_i Φ_{i,p}^sig = Φ_mean^sig, where N_p is the number of pixels. Similarly, the per-pixel background illumination (Φ^bkg) can be emulated using the RGB channel mean and scaling it such that it matches the desired mean number of background photons per pixel. Finally, dark counts can be added to the per-pixel background illumination component. This can be done on the training set using a calibration dark count image obtained from the hardware prototype. It was observed that models trained with this dark count component generalize well to histogram tensors without the dark counts, as shown in the test results.
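A simplified per-pixel version of this simulation is sketched below; since the exact forms of EQ. (1) and EQ. (2) are not reproduced here, the pulse shifting, uniform background handling, and Poisson sampling are stated assumptions, and pile-up effects are ignored.

```python
# Simplified per-pixel sketch of the simulation (EQ. (1)/(2) details omitted;
# pulse shifting, uniform background, and Poisson sampling are assumptions).
import numpy as np

def simulate_histogram(true_bin, pulse, a_p, phi_bkg, n_cycles=1000):
    # pulse: normalized waveform h(t) sampled at Nt time bins.
    Nt = pulse.shape[0]
    signal = a_p * np.roll(pulse, true_bin)   # pulse shifted to the true depth bin
    flux = signal + phi_bkg / Nt              # add uniform background + dark counts
    # Photon detections per bin are Poisson-distributed around the mean flux
    # (pile-up effects are ignored in this low-flux sketch).
    return np.random.poisson(flux * n_cycles)
```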
Training and Implementation Details: In this section, further training and implementation details are described. All models used to generate results described herein and in the Detailed Description were implemented in PyTorch. The input to all the models was a 3D histogram tensor. Recall that, due to the linearity of compressive histograms, encoding the histogram tensor is equivalent to encoding each individual photon timestamp and summing the results. Hence, models deployed with a compressive histogram layer can also take as input a stream of photon timestamps and build the compressive histogram.
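The linearity property noted above can be verified directly; the following sketch (with assumed dimensions and a 1D temporal coding matrix for brevity) shows that encoding a histogram is equivalent to encoding each photon timestamp and summing.

```python
# Verifying the linearity property: encoding a histogram equals encoding each
# photon timestamp and summing (assumed 1D temporal coding matrix for brevity).
import numpy as np

Nt, K = 1024, 8
C = np.random.randn(K, Nt)                         # temporal coding matrix
timestamps = np.random.randint(0, Nt, size=500)    # stream of photon arrivals

h = np.bincount(timestamps, minlength=Nt)          # per-pixel histogram
y_from_histogram = C @ h                           # encode the full histogram
y_streaming = C[:, timestamps].sum(axis=1)         # encode photons one at a time
assert np.allclose(y_from_histogram, y_streaming)
```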
Compressive Histogram Layer: The compressive histogram layer was implemented as a single-layer encoder and decoder. The encoder was a 3D convolution with a stride equal to the filter size, and the coding tensors, Ck, were the learned filters. All coding tensors were constrained to be zero-mean along the time dimension. This constraint makes the expected encoded value of background photons that are distributed uniformly along the time dimension equal to 0. The outputs of the encoder were the compressive histograms Ŷb. The decoder was an unfiltered backprojection implemented as a 3D transposed convolution with a stride equal to its filter size. To help the CNN model generalize to different photon count levels, zero-normalization was applied along the channel dimension (K) to the inputs (Ŷb) and the weights (C) of the transposed convolution as follows:

ZN(v) = (v − mean(v)) / ∥v − mean(v)∥₂

where the mean and L2 norm are computed over the channel dimension. This normalization is also known as layer normalization.
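A minimal PyTorch sketch of this encoder/decoder pair is shown below, assuming non-overlapping blocks and a single shared set of coding tensors; the class name, dimensions, and the epsilon in the normalization are illustrative choices, not the disclosure's exact implementation.

```python
# A minimal PyTorch sketch of the compressive histogram layer: a strided 3D
# convolution encoder (coding tensors C_k as filters) and an unfiltered
# backprojection decoder (3D transposed convolution) with zero-normalization.
import torch
import torch.nn as nn
import torch.nn.functional as F

def zero_norm(x, dim, eps=1e-8):
    # Subtract the mean and divide by the L2 norm along `dim` (here, K).
    x = x - x.mean(dim=dim, keepdim=True)
    return x / (x.norm(dim=dim, keepdim=True) + eps)

class CompressiveHistogramLayer(nn.Module):
    def __init__(self, K=64, Mt=256, Mr=4, Mc=4):
        super().__init__()
        # Encoder: stride equals filter size, so histogram blocks H_b do not overlap.
        self.encoder = nn.Conv3d(1, K, kernel_size=(Mt, Mr, Mc),
                                 stride=(Mt, Mr, Mc), bias=False)

    def forward(self, H):
        # H: (batch, 1, Nt, Nr, Nc) histogram tensor.
        W = self.encoder.weight
        # Constrain coding tensors to be zero-mean along the time dimension so
        # uniformly distributed background photons encode to 0 in expectation.
        W = W - W.mean(dim=2, keepdim=True)
        Yb = F.conv3d(H, W, stride=self.encoder.stride)   # compressive histograms
        # Zero-normalize the inputs and the weights along the channel dimension K.
        Hhat = F.conv_transpose3d(zero_norm(Yb, dim=1), zero_norm(W, dim=0),
                                  stride=self.encoder.stride)
        return Hhat  # decoded histogram tensor, passed to the depth estimation CNN
```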
Depth Estimation 3D CNN Model: To estimate depths from the decoded histogram tensor, the 3D deep boosting CNN model described in Peng et al., "Photon-efficient 3d imaging with a non-local neural network," in European Conference on Computer Vision, pp. 225-241 (2020), was used for single-photon 3D imaging. Different from Peng et al., the implementation used to generate results described herein does not include a non-local block after the feature extraction stage. The output of the model was a denoised histogram tensor, Hout, from which depths were estimated using a softargmax function along the time dimension. The pixel-wise Kullback-Leibler (KL) divergence between the denoised histogram tensor and a normalized ground truth histogram tensor, Hgt, was used as an objective function. This loss can be written for each pixel, p, as:

KL(Hgt,p ∥ Hout,p) = Σ_i Hgt,p(i) · log(Hgt,p(i) / Hout,p(i))

where Hout,p is normalized along the time dimension.
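The softargmax depth estimate and the per-pixel KL objective can be sketched as follows; the softmax-based normalization of Hout is an assumption consistent with the use of a softargmax, and the function names are illustrative.

```python
# Illustrative sketch of depth estimation via softargmax and the per-pixel
# KL-divergence objective; softmax normalization of H_out is an assumption.
import torch
import torch.nn.functional as F

def softargmax_depth(H_out, bin_size):
    # H_out: (batch, Nt, Nr, Nc) denoised histogram tensor.
    Nt = H_out.shape[1]
    probs = F.softmax(H_out, dim=1)  # normalize along the time dimension
    bins = torch.arange(Nt, dtype=H_out.dtype, device=H_out.device)
    expected_bin = (probs * bins.view(1, Nt, 1, 1)).sum(dim=1)
    return expected_bin * bin_size   # differentiable time-of-arrival estimate

def kl_loss(H_out, H_gt, eps=1e-8):
    # Pixel-wise KL divergence between the normalized ground truth histogram
    # tensor and the (softmax-normalized) output histogram tensor.
    log_q = F.log_softmax(H_out, dim=1)
    p = H_gt / (H_gt.sum(dim=1, keepdim=True) + eps)
    return (p * (torch.log(p + eps) - log_q)).sum(dim=1).mean()
```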
Training: At each training iteration, patches of size 1024×32×32 were randomly sampled from the training set. All models were trained using the ADAM optimizer with default parameters (β1=0.9, β2=0.999), a batch size of 4, and an initial learning rate of 0.001 that decays by 0.9 after every epoch. All models were trained for 30 epochs with checkpoints every half epoch, and for a given model the checkpoint that achieved the lowest root mean squared error (RMSE) on the validation set was chosen.
Analysis of the Memory Overhead of Coding Tensors: Compressive histograms have the potential to greatly reduce off-sensor data transmissions and the amount of in-sensor memory required compared to a conventional histogram tensor representation. However, the general compression framework described in the Detailed Description can include the in-sensor storage of the K coding tensors (C=(Ck)k=0K-1) that are used for compression. This means that a large C may introduce a significant amount of in-sensor memory overhead, making these designs for C less practical. In this section, a quantitative analysis of this memory overhead is described for different coding tensor designs.
Recall that H and C can be Nt×Nr×Nc and K×Mt×Mr×Mc tensors, respectively. Let N=Nt·Nr·Nc be the total number of elements in the histogram tensor. Moreover, let M=Mt·Mr·Mc be the size of a single coding tensor, which is also the size of the histogram block, Hb, that is being compressed. For the remainder of this analysis, the following is assumed:
1. That all histogram blocks Hb that are compressed are non-overlapping. This means that the total number of compressive histograms that are transferred off-sensor is B=N/M.
2. That only a single C is stored inside the sensor. This C will be shared among all SPAD pixels.
3. That the elements of C and H are represented using the same number of bits.
Table 1 provides the expected compression ratios for off-sensor data transmission and in-sensor storage. These two compression ratios differ due to the memory overhead incurred by compressive histograms when the coding tensors must be stored in-sensor. It is clear that a compressive histogram for a histogram block whose size equals the size of the histogram tensor (i.e., M=N) would actually require more in-sensor memory than the histogram tensor, since storing the K coding tensors alone requires K·N elements, making this compressive histogram design impractical.
Table 1: Data Transmission and In-sensor Storage Requirements. This table shows the off-sensor data transmission and in-sensor storage requirements for a histogram tensor of size N and a set of B compressive histograms that use K coding tensors of size M for compression. The compression ratio column shows the amount of compression that can be achieved for off-sensor data transmission and in-sensor storage.
Compression Ratios for Full Coding Tensors:
As the number of compressive histograms used to represent the histogram tensor is reduced, the size of the coding tensors increases and, consequently, lower in-sensor compression is achieved. Since the coding tensors do not need to be transferred off-sensor, the data rate compression ratio continues to increase as the number of compressive histograms is reduced, because the overall size of the compressive representation decreases when K is fixed. Overall, a good balance between reducing in-sensor memory and data transmission seems to be achieved when using 10,000-100,000 compressive histograms to represent a histogram tensor with 1e9 elements (e.g., a 1 megapixel SPAD array with 1000 bins per pixel). In this case, the size of a single coding tensor (M) should range between 10,000-100,000 for 64≤K≤512. Some of the coding tensors with K≥64 that were evaluated in this paper approximately match this size range, e.g., M=16,384 for 1024×4×4 or for 256×8×8.
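The trade-off described above can be reproduced with a short calculation; the helper below (illustrative, using the notation of this section) computes both compression ratios for the 1e9-element example, for full and separable coding tensors.

```python
# Illustrative helper, using the notation of this section, that computes the
# off-sensor data-rate and in-sensor storage compression ratios for a design.
def compression_ratios(N, M, K, c_params=None):
    B = N // M                       # number of non-overlapping histogram blocks
    data_ratio = N / (B * K)         # off-sensor: B compressive histograms of size K
    if c_params is None:
        c_params = K * M             # full coding tensors: K tensors of size M
    storage_ratio = N / (B * K + c_params)  # in-sensor: histograms + stored C
    return data_ratio, storage_ratio

# Full 1024x4x4 coding tensors (M = 16,384) with K = 64 for N = 1e9 elements:
print(compression_ratios(N=10**9, M=1024 * 4 * 4, K=64))      # ~256x, ~202x
# Separable variant: K * (Mt + Mr*Mc) = 64 * 1,040 parameters stored for C:
print(compression_ratios(N=10**9, M=1024 * 4 * 4, K=64,
                         c_params=64 * (1024 + 4 * 4)))       # ~256x, ~252x
```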
Compression Ratios for Separable Coding Tensors:
A separable coding tensor can be ~16× smaller than its full counterpart, which is consistent with the separable coding tensors described herein. For instance, a 256×4×4 Ck is ~16× smaller if its temporal and spatial dimensions are separable (256+16=272 parameters instead of 256·16=4,096). In this scenario, a good balance between in-sensor storage and data transmission compression can be achieved when using 1,000-100,000 compressive histograms to represent a histogram tensor with 1e9 elements. In this case, the size of a single separable coding tensor should range between 625-62,500 for 64≤K≤512. Some of the separable coding tensors with K≥64 that were evaluated in this paper approximately match this size range: M=272 (256×4×4) and M=1040 (1024×4×4).
Parameter-efficient coding tensors can reduce the in-sensor memory overhead that compressive histograms introduce. In this section, it was shown that the local block-based separable coding tensor designs explored in this paper are able to reduce the memory overhead for histogram tensors of size N≥1e7. Additional lightweight C designs can rely on other factorization techniques, such as low-rank approximations. In some examples, weight quantization can be an effective technique for further compressing C. Finally, designs with parameters that can be computed on the fly, such as Fourier-based (Sec. A2) or Gray codes, are practical when multiple C need to be stored across the SPAD array. Ultimately, a practical coding tensor representation can be determined by the hardware constraints of a given SPAD camera.
Implementation examples are described in the following numbered claims.
In some embodiments, any suitable computer readable media can be used for storing instructions for performing the functions and/or processes described herein. For example, in some embodiments, computer readable media can be transitory or non-transitory. For example, non-transitory computer readable media can include media such as magnetic media (such as hard disks, floppy disks, etc.), optical media (such as compact discs, digital video discs, Blu-ray discs, etc.), semiconductor media (such as RAM, Flash memory, electrically programmable read only memory (EPROM), electrically erasable programmable read only memory (EEPROM), etc.), any suitable media that is not fleeting or devoid of any semblance of permanence during transmission, and/or any suitable tangible media. As another example, transitory computer readable media can include signals on networks, in wires, conductors, optical fibers, circuits, or any suitable media that is fleeting and devoid of any semblance of permanence during transmission, and/or any suitable intangible media.
It should be noted that, as used herein, the term mechanism can encompass hardware, software, firmware, or any suitable combination thereof.
It should be understood that the above described steps of the process of
Although the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention can be made without departing from the spirit and scope of the invention, which is limited only by the claims that follow. Features of the disclosed embodiments can be combined and rearranged in various ways.
The present application is based on, claims priority to, and incorporates herein by reference in its entirety for all purposes, U.S. Provisional Patent Application Ser. No. 63/516,137, filed Jul. 27, 2023.
This invention was made with government support under 1846884, 1943149 and 2138471 awarded by the National Science Foundation and under DE-NA0003921 awarded by the US Department of Energy. The government has certain rights in the invention.