1. Field of the Invention
This invention relates generally to multiframe image reconstruction techniques and, more particularly, to the adaptive acquisition and/or display of image frames using multi-focal displays.
2. Description of the Related Art
Real-world scenes contain an extremely wide range of focal depths, radiance and color, and thus it is difficult to design a camera capable of imaging a wide range of scenes with high quality. To increase the versatility of their imaging systems, most cameras have adjustable optical settings, such as the focus, exposure, and aperture. In most such systems, the camera includes some form of automatic adjustment of these settings depending on the object scene, such as auto-focus (AF), automatic gain (AG), and auto-exposure (AE) algorithms. These automatic algorithms typically use image data to perform adjustment. The camera will capture multiple images under different acquisition settings until it finds the optimal settings for a single image. The adjustment process often consumes significant power to adjust the focus and aperture settings. Finding efficient algorithms for automatically adjusting the camera settings is thus important for minimizing power consumption as well as improving performance for the user.
Traditional settings adjustment algorithms rely on multiple tests in order to find the best settings for acquiring a single image. A large class of alternate image processing algorithms, known as multiframe reconstruction algorithms, combine a set of multiple images to synthesize a single image of higher quality. Such multiframe algorithms operate on a set of images where each image contains different information about the scene. The reconstruction algorithm combines these multiple sources of information, typically based on information about the source of the image variations (shifts, defocus, exposure level, etc.), to form a single reconstructed image. Typically, the set of images is captured using predetermined acquisition settings. In other words, the acquisition settings do not depend on image content. The traditional problem addressed by multiframe reconstruction is then, given the set of already acquired images, to synthesize the best quality reconstructed image from the set of available images.
The choice of acquired images, however, can significantly affect the quality of the final reconstructed image. Multiframe reconstruction combines different information from different images into the single reconstructed image. However, if no image in the set has collected certain information, then that information cannot be represented in the reconstructed image. More generally, some visual information is more important than other information when constructing an image of a particular scene.
Multi-focal displays (MFDs) are one device that can implement multiframe reconstruction. MFDs typically use rapid temporal and focal modulation of a series of 2-dimensional images to render 3-dimensional (3D) scenes that occupy a certain 3D volume. This series of images is typically focused at parallel planes positioned at different, discrete distances from the viewer. The number of focal planes directly affects the viewer's eye accommodation and the 3D perception quality of a displayed scene. If a given 3D scene is continuous in depth, too few planes may make the MFD rendering look piecewise with discontinuities between planes or result in contrast loss. More planes are typically better in terms of perceptual quality, but can be more expensive to implement and often may not be achievable because of practical display limitations including bandwidth and focal modulation speed.
Therefore, an important consideration for MFDs is the focal plane configuration, including the number of focal planes and the location of the focal planes (that is, distances from the viewer). Multi-focal displays typically use focal plane configurations where the number and location of focal planes are fixed. Often, the focal planes are uniformly spaced. This one-size-fits-all approach does not take into account differences in the scenes to be displayed, and the result can be a loss of spatial resolution and perceptual accuracy.
Therefore, there is a need for multiframe reconstruction techniques that actively select which images should be acquired, in addition to combining the acquired images into a reconstructed image. There is a need for better approaches to determining focal plane configurations for multi-focal displays.
In one aspect, the present disclosure overcomes the limitations of the prior art in multiframe imaging by automatically selecting which images to acquire based at least in part on the content of previously acquired images and also on reconstruction of the object on a multi-focal display. In one aspect, at least two images of an object are acquired at different acquisition settings. For at least one of the images, the acquisition setting for the image is determined based at least in part on content of previously acquired images and also at least in part on reconstruction of the object on a multi-focal display. The object is then rendered on a multi-focal display from the acquired set of images.
Other aspects of the invention include components, devices, systems, improvements, variations, methods, processes, applications, computer readable mediums, and other technologies related to any of the above.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The invention has other advantages and features which will be more readily apparent from the following detailed description of the invention and the appended claims, when taken in conjunction with the accompanying drawings, in which:
The figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
Outline
III.A. MSE Estimate
III.B. Determining Acquisition Setting based on RMSE
III.C. Determining Acquisition Setting based on RMSE and Energy Constraints
III.D. Objects with Depth
III.E. Simulation Results
IV.A. Depth Blending
IV.B. Problem Formulation
IV.C. Solution Example 1
IV.D. Solution Example 2
In one aspect, the present disclosure overcomes the limitations of the prior art in multiframe imaging by automatically selecting which images to acquire based at least in part on the content of previously acquired images. In one approach, a set of at least three images of an object is acquired at different acquisition settings. For at least one of the images in the set, the acquisition setting for the image is determined based at least in part on the content of one or more previously acquired images. In one approach, the acquisition parameters for the (K+1)st image are (optimally) adjusted based on the information in the previously acquired K images, where “optimally” refers to the final image quality of the multiframe reconstructed image formed from all K+1 images. Multiframe reconstruction is applied to the set of acquired images to synthesize a reconstructed image of the object.
In a common implementation, image acquisition begins with the acquisition of at least two initial images at acquisition settings that do not depend on content of previously acquired images. Then, for every image acquired after the initial images, the acquisition setting for the image is determined based at least in part on content of previously acquired images. The acquisition setting for later images can be determined in a number of different ways. For example, it can be determined without regard to whether any additional images will be acquired afterwards. Alternately, it can be determined assuming that at least one additional image will be acquired afterwards. In yet another alternative, it can be determined assuming that a total of K images will be acquired.
In another aspect, the acquisition setting can be based on increasing a performance of the multiframe reconstruction, given the previously acquired images. One approach measures performance based on maximum likelihood estimation, including for example using the Cramer-Rao performance bound. The acquisition setting can also be based on increasing the information captured by the image, compared to the information already captured by previously acquired images.
In yet another aspect, the acquisition setting is based on reducing change in the acquisition setting relative to the immediately previously acquired image, for example to conserve energy and/or reduce the time lag between acquisitions. The cost or merit function could also include power, energy, or time constraints associated with changing the acquisition settings. Thus, for instance, if camera battery power is of significant concern, the merit function can penalize large lens motions which require significant power consumption.
Examples of parameters that may be determined as part of the acquisition setting include aperture, focus, exposure, spatial shift, and zoom.
Yet another aspect of the present disclosure overcomes the limitations of the prior art by selecting the locations of the focal planes for a multi-focal display, based on an analysis of the scene to be rendered by the multi-focal display. In one example, a distortion metric is defined that measures a distortion between an ideal rendering of a three-dimensional scene versus the rendering by a limited number of focal planes in the multi-focal display. The locations of the focal planes are selected by optimizing the distortion metric. One distortion metric is based on differences between the location of a point in the ideal rendering versus the location of the closest focal planes of the multi-focal display. Another distortion metric is based on differences in the defocus blurring for the ideal rendering versus the rendering by the multi-focal display.
Yet another aspect combines the adaptive acquisition with the focal plane optimization for multi-focal displays.
However, this is not the case in
From an information point of view, the adaptive acquisition module 170 preferably selects images so that the set of images 120, as a whole, contains as much visual information as possible about the scene. Accordingly, which next image adds the most new information to the set will depend in part on what information has already been collected by previously acquired images and also in part on what information is thought to be missing or poorly represented based on analysis of the previously acquired images. While each individual image may itself be of poor quality, as a collection, the set of images preferably contains a significant amount of information about the scene. This differentiates the adaptive multiframe approach from the conventional single-frame approaches, such as autoexposure and autofocus, which find the best settings for a single captured image.
The following sections develop some of the underlying principles for a specific adaptive approach based on a combination of the Cramer-Rao (CR) Bound and the asymptotic properties of Maximum-Likelihood estimation. Some examples are presented based on the dynamic optimization of focus and aperture settings.
Multiframe image reconstruction is usually based on a model of the imaging system as a function of the acquisition setting parameters. This section presents a particular model that is chosen to illustrate the underlying principles. The invention is not limited to this particular model. Other models and underlying assumptions can also be used.
In this example, the captured image is modeled using the linear model
$$y_k = H(\phi_k)\,s + n(\phi_k) \tag{1}$$
where yk is the kth captured image, H is the sampled optical point spread function, s is the unknown ideally sampled image, and n is the noise inherent to the imaging system. The vector φk represents the acquisition setting for the kth frame. The collection of the acquisition settings for all frames will be referred to as Φ. For simplicity, the following example considers two acquisition setting parameters: the aperture diameter A and the back focal distance b, with a description of how this may be extended to include the exposure time T as well. However, the adaptive approach is not limited to these parameters. Examples of other acquisition setting parameters include the field of view, camera angle (i.e., where the camera is pointed), magnification, wavelength, polarization, and various aspects of illumination including brightness and spatial variation.
The ideal image s is the image formed by an ideal pinhole camera without the effects of diffraction. In other words, it is an image taken from a theoretically infinite depth-of-field camera without noise or diffraction. At first, for simplicity, consider only planar objects which are perpendicular to the camera at an unknown distance z from the front of the camera. Later, this will be extended to scenes having more realistic spatially-varying depths. Also for purposes of illustration, assume the following about the point spread function (PSF) defining the blurring matrix H. First, assume that the PSF is spatially invariant. Such an assumption is reasonable for expensive optical lens systems or for narrow fields of view. This spatial invariance property allows one to conveniently characterize the blurring in the frequency domain using the optical transfer function (OTF) H(w,v) where w,v are the spatial frequencies in the horizontal and vertical directions. In other words, the matrix H is diagonalized by the FFT operator, producing a diagonal matrix whose elements along the diagonal are the system's OTF. Second, assume that the lens system's OTF is dominated by the defocus aberration. The defocus aberration induces optical transfer functions H(w,v,δ) where δ captures the amount of defocus in the optical system. The defocus is proportional to
$$\delta = A\left(\frac{1}{f} - \frac{1}{b} - \frac{1}{z}\right) \tag{2}$$
where ƒ is the focal length of the camera, b is the back focal distance, z is the object distance, and A is the diameter of the aperture. This equation comes from the lens-maker's equation combined with a geometric characterization of the PSF width. The amount of defocus is a nonlinear function of z and b, and a linear function of A. To simplify the estimation problem, transform it into that of estimating the distance in diopters or inverse meters ζ=1/z and build a corresponding inverse focal function β=1/ƒ−1/b. Using this reformulation, Eq. 2 can be rewritten as
δ=A(β−ζ). (3)
For a given estimate of the inverse depth ζ or inverse focal setting β, the transformation can be inverted to obtain the actual depth estimate z or back focal distance b. One advantage of this formulation is that the units of ζ and β can be normalized into the range [0,1]. Performance will generally be reported on this normalized scale.
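As a minimal sketch of the reformulation in Eq. 3, the following Python helpers convert between physical and inverse quantities and evaluate the defocus δ. The function names are illustrative assumptions, consistent length units (for example, millimeters) are assumed throughout, and the normalization of ζ and β into [0,1] is not shown.

```python
def inverse_depth(z):
    """Inverse object distance, zeta = 1/z."""
    return 1.0 / z


def inverse_focal(f, b):
    """Inverse focal function, beta = 1/f - 1/b."""
    return 1.0 / f - 1.0 / b


def defocus(A, beta, zeta):
    """Defocus of Eq. 3: delta = A * (beta - zeta)."""
    return A * (beta - zeta)


def depth_from_zeta(zeta):
    """Invert zeta = 1/z to recover the object distance z."""
    return 1.0 / zeta


def back_focal_from_beta(f, beta):
    """Invert beta = 1/f - 1/b to recover the back focal distance b."""
    return 1.0 / (1.0 / f - beta)
```

In this form the object is in focus exactly when β equals ζ, and the defocus scales linearly with the aperture diameter A.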
Also assume that the total additive noise n includes two types of noise components. The first is a thermal read noise associated with the sampling circuitry. This noise is independent of the image and has a noise power σ_r². The second is a signal-dependent noise related to shot noise. This noise has power which is linearly related to the signal power. Assume that this noise is a function of the average signal value μ_s=(Σ_m s_m)/M, where m indexes the pixels and M is the total number of pixels. This noise power is given by σ_s²=μ_s σ_0², where σ_0² is a baseline power. Notice that as the signal strength increases, this second type of noise can dominate the noise in the captured image. This model suggests that the SNR of the camera improves linearly for weak signals where the read noise dominates, and as the square root of the signal energy for stronger signals.
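The two SNR regimes described above can be illustrated with a short Python sketch. It assumes that the read noise and the signal-dependent noise are independent so that their powers add, and it defines SNR as the ratio of the mean signal to the noise standard deviation; both are assumptions made here for illustration, as are the function names.

```python
import math


def noise_power(mu_s, sigma_r2, sigma_02):
    """Total noise power: read noise plus signal-dependent (shot-like) noise."""
    return sigma_r2 + mu_s * sigma_02


def snr(mu_s, sigma_r2, sigma_02):
    """Amplitude SNR: mean signal divided by the noise standard deviation."""
    return mu_s / math.sqrt(noise_power(mu_s, sigma_r2, sigma_02))


# Weak signal: read noise dominates, so doubling the signal doubles the SNR.
weak_ratio = snr(2e-6, 1.0, 1.0) / snr(1e-6, 1.0, 1.0)

# Strong signal: shot noise dominates, so quadrupling the signal doubles the SNR.
strong_ratio = snr(4e6, 1.0, 1.0) / snr(1e6, 1.0, 1.0)
```

Both ratios come out close to 2, which matches the linear and square-root behavior stated in the text.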
In many imaging systems, the strength of the signal depends on the number of photons captured in each pixel well. The number of photons captured by the detector is a quadratic function of the aperture diameter A and a linear function of the exposure time T. If the signal is normalized into a preset range (say [0, 1]), then the noise power for the normalized signal is given by
The SNR of the captured image is a function of both the exposure time and the aperture setting. In real systems, the pixels of a sensor can hold only a finite number of photons, so the aperture settings and exposure settings preferably are selected to ensure that the signal is just strong enough to saturate the detector for maximum dynamic range and SNR of the individual frames. The exposure could be varied such that certain image regions are saturated to improve the dynamic range in the dark regions.
In the following first example, assume that the exposure time T is fixed but the aperture setting A is adjustable. Given this model, there is an inherent tradeoff between contrast and SNR as a function of the aperture setting A. For example, suppose that an object is located near the camera while the back focal length is set to focus at infinity. By increasing the aperture, one can improve the SNR at the expense of increasing the amount of focus blur.
The forward model of Eq. 1 can be used to construct a statistically optimal multiframe estimation algorithm based on the Maximum-Likelihood (ML) principle. Express the ML cost function in the frequency domain as
where yk(w,v) and s(w,v) are the frequency domain expressions for the kth captured image and the ideal source image, respectively. This is the squared error between the observed kth image yk and the ideal image s filtered by the OTF using the kth acquisition setting φk. When computing the ML cost function, consider only spatial frequency values up to the Nyquist sampling frequency defined by the pixel pitch, and ignore the effects of aliasing artifacts.
Because the unknown image is linearly related to the observed images, the ML estimate for the unknown image, if the inverse depth ζ is known, is given by the multiframe Wiener solution
where Ps(w,v) is the power spectral density of the ideal source image s(w,v). Substituting this estimate of the high-resolution image back into the cost function yields the following nonlinear cost function as a function of the unknown inverse distance ζ:
Now minimize this cost function using standard gradient descent to estimate the unknown inverse distance ζ. The value of ζ that minimizes the cost function is used as the current estimate for ζ. To perform gradient descent, calculate the analytic derivatives of this cost function with respect to the unknown depth parameter. In general, this search may be performed very quickly as the cost function is one dimensional. Other descent algorithms could be used as well.
One advantage of this multiframe approach is the ability to reproduce a sharp, in-focus image from a set of out-of-focus images if the set of defocused MTFs has non-overlapping zero-crossings. For example, the OTF for an optical system having a square pupil with only defocus aberration can be approximated as a separable MTF taking the form
$$H(\rho,\delta) = \Lambda(\rho)\,\mathrm{sinc}\big(\delta\rho(1-|\rho|)\big), \quad \rho \in [-1,1] \tag{8}$$
where ρ is either the horizontal or the vertical spatial frequency coordinate, normalized by the Nyquist sampling rate (ρ=1). These frequency coordinates are a function of the F/# and the wavelength. The function Λ(x) is defined as Λ(x)=max{1−|x|,0} and defines the diffraction-limit MTF envelope. The defocus MTF for such a system produces zero crossings where δρ(1−|ρ|) is close to nonzero integer values. In between these spatial frequency regions, the phase is inverted, but contrast is preserved. Multiframe reconstruction can take multiple such defocused images and extract the contrast if none of the zero crossings overlap.
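The depth estimation described in this section can be sketched on a one-dimensional synthetic signal. This is a rough illustration under stated assumptions, not the reconstruction algorithm itself: the profile cost below is a regularized least-squares form chosen here for illustration, a grid search stands in for the gradient descent mentioned above, the source is white with a flat power spectral density, and `kappa` is a hypothetical constant of proportionality between A(β−ζ) and the defocus δ.

```python
import numpy as np


def otf(rho, delta):
    """Separable defocus OTF on normalized frequencies rho in [-1, 1]."""
    envelope = np.maximum(1.0 - np.abs(rho), 0.0)
    return envelope * np.sinc(delta * rho * (1.0 - np.abs(rho)))


def profile_cost(zeta, Y, settings, rho, noise_var, psd, kappa):
    """Cost for a candidate inverse depth after substituting the Wiener estimate."""
    H = [otf(rho, kappa * A * (beta - zeta)) for A, beta in settings]
    num = sum(h * y for h, y in zip(H, Y)) / noise_var  # the OTF is real here
    den = sum(h ** 2 for h in H) / noise_var + 1.0 / psd
    s_hat = num / den  # multiframe Wiener-style image estimate
    resid = sum(np.sum(np.abs(y - h * s_hat) ** 2) for h, y in zip(H, Y))
    return resid / noise_var + np.sum(np.abs(s_hat) ** 2) / psd


def estimate_zeta(Y, settings, rho, noise_var, psd, kappa, grid):
    """One-dimensional search over candidate inverse depths."""
    costs = [profile_cost(z, Y, settings, rho, noise_var, psd, kappa) for z in grid]
    return float(grid[int(np.argmin(costs))])


# Synthetic two-frame experiment (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n = 128
rho = 2.0 * np.fft.fftfreq(n)                 # normalized so Nyquist is |rho| = 1
S = np.fft.fft(rng.standard_normal(n), norm="ortho")
settings = [(1.0, 0.1), (1.0, 0.6)]           # (aperture A, inverse focal setting beta)
true_zeta, kappa, noise_var = 0.3, 20.0, 1e-6
Y = [otf(rho, kappa * A * (beta - true_zeta)) * S
     + np.fft.fft(np.sqrt(noise_var) * rng.standard_normal(n), norm="ortho")
     for A, beta in settings]
zeta_hat = estimate_zeta(Y, settings, rho, noise_var, 1.0, kappa,
                         np.linspace(0.0, 1.0, 101))
```

Note that `np.sinc` is the normalized sinc, sin(πx)/(πx), so the modeled OTF vanishes where δρ(1−|ρ|) is a nonzero integer, for example at ρ=0.5 with δ=4. Because the two frames have different defocus values, their zero crossings fall at different frequencies and the combined estimate retains contrast that each single frame lacks.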
The previous section described one example of multiframe reconstruction as a depth estimation problem. Continuing this example, this section describes a dynamic framework for selecting the acquisition setting based on previously acquired images. In the following example, the criterion for the image acquisition is based on predictions of mean-square-error (MSE) performance after multiframe reconstruction. Given that this example implements the ML algorithm, a predictor of performance is the Cramer-Rao (CR) performance bound. The CR bound not only provides a fundamental bound on MSE performance, but also provides a reasonable prediction of MSE performance for ML estimators. The ability to predict MSE performance is based on the asymptotic optimality of the ML estimator. As SNR approaches infinity, or the number of observed frames increases, the ML estimator will asymptotically approach the CR bound. Furthermore, the error distribution on the estimates will also become Gaussian.
The CR bound is defined as the inverse of the Fisher information matrix (FIM). The Fisher information matrix (FIM) for the multiframe reconstruction problem is given by
The matrix Hk is shorthand notation representing the kth frame blur matrix H(φk,ζ). The term σk² is the noise power associated with the kth frame, which is a function of the acquisition settings. The matrix Gk is defined as the derivative of the blur matrix with respect to the inverse object distance ζ, that is, Gk ≡ ∂H(φk,ζ)/∂ζ. This derivative filter is essentially a band-pass filter over the spatial frequencies sensitive to perturbations in the inverse focal distance. Note that the information related to image reconstruction is independent of the object signal.
To compute the CR bound, apply the block matrix inversion lemma on the partitioned FIM to obtain bounds on the MSE of the form
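$$M_\zeta(\zeta, s, \Phi) \ge \left(J_{\zeta\zeta} - J_{\zeta s} J_{ss}^{-1} J_{s\zeta}\right)^{-1} \tag{13}$$
$$M_s(\zeta, s, \Phi) \ge \mathrm{Tr}\left[J_{ss}^{-1}\right] + M_\zeta \left(J_{\zeta s} J_{ss}^{-2} J_{s\zeta}\right) \tag{14}$$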
In this representation, the MSE performance bound (either Mζ or Ms) is a function of the image signal s, the inverse depth ζ, and the set of acquisition settings Φ. Consider the image reconstruction MSE performance predicted by Eq. 14. The predicted MSE in Eq. 14 comprises two terms. The first term is the MSE bound if the depth were known a priori. The second term describes the loss in MSE performance when the inverse depth ζ is estimated from the data. Eq. 14 will become the merit function in this example adaptive frame capture optimization. As with the multiframe reconstruction, these terms can be computed efficiently in the frequency domain.
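The block-inversion bounds of Eqs. 13 and 14 can be checked numerically against a direct inverse of the full Fisher information matrix. The sketch below uses a small random symmetric positive definite matrix as a stand-in for the FIM and assumes, for illustration only, that the scalar ζ occupies the first row and column. It uses dense linear algebra rather than the frequency-domain computation mentioned above.

```python
import numpy as np


def cr_bounds(J):
    """Bounds from a partitioned FIM ordered as [zeta, s_1, ..., s_M]."""
    J_zz = J[0, 0]
    J_zs = J[0, 1:]
    J_ss_inv = np.linalg.inv(J[1:, 1:])
    v = J_ss_inv @ J_zs                 # J_ss^{-1} J_{s zeta}; J is symmetric
    M_zeta = 1.0 / (J_zz - J_zs @ v)    # Eq. 13
    # Eq. 14: v @ v equals J_{zeta s} J_ss^{-2} J_{s zeta} because J_ss is symmetric.
    M_s = np.trace(J_ss_inv) + M_zeta * (v @ v)
    return M_zeta, M_s


# Random symmetric positive definite stand-in for the FIM (illustrative only).
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
J = B @ B.T + 6.0 * np.eye(6)

M_zeta, M_s = cr_bounds(J)
J_inv = np.linalg.inv(J)
```

Here `M_zeta` matches the (ζ,ζ) entry of the full inverse and `M_s` matches the trace of its image block, so the second term of Eq. 14 can be read as the extra image error incurred because ζ must be estimated.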
Generally speaking, the information content decreases and the RMSE increases, as the aperture is reduced. This behavior is expected as optical systems should become less sensitive to defocus with slower F/#. Also, the information is maximal and RMSE is minimal when the object distance is halfway between the captured frames ζ=(β1+β2)/2. The amount of information does not, however, monotonically increase with focus separation.
If no information is known a priori, providing an initial estimate of both the image s and the inverse depth ζ requires at least two different frames (in this example, taken at different focal settings b and/or apertures A). These initial frames can also be used to approximate the average signal strength μs. The acquisition settings for these initial frames can be determined in a number of ways. For example, the initial acquisition settings can be optimized based on statistical priors placed on the unknown inverse depth ζ and the image signal s. For the following example, however, assume that the initial frames are captured by perturbing the initial aperture and focal setting.
After obtaining the k≧2 initial frames, apply the multiframe reconstruction algorithm to the image set to obtain an estimate of the image ŝk and the inverse depth ζ̂k. For example, Eq. 6 can be applied to estimate the image ŝk, and minimization of the cost function of Eq. 7 can be used to estimate the inverse depth ζ̂k. In this notation, the subscript k signifies the estimate of the image and the inverse depth for a set with k images. Use the asymptotic properties of the CR bound to construct a posterior distribution on the depth location. Since the depth estimation error becomes approximately Gaussian asymptotically, suppose that the distribution of the estimate ζ̂k for a given inverse depth ζ is also Gaussian
$$p(\hat{\zeta}_k \mid \zeta) \sim \mathcal{N}\big(\zeta,\, M_\zeta(\Phi_k)\big). \tag{15}$$
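In other words, optimistically suppose that the variance achieves the CR bound. Then construct a posterior distribution on ζ given the estimate ζ̂k according to Bayes' rule
$$p(\zeta \mid \hat{\zeta}_k) = \frac{p(\hat{\zeta}_k \mid \zeta)\,p(\zeta)}{\int p(\hat{\zeta}_k \mid \zeta')\,p(\zeta')\,d\zeta'} \tag{16}$$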
where p(ζ) is some prior on the inverse depth. For sake of example, assume that this is a flat prior. Then compute the posterior distribution via integration. This one-dimensional integration is numerically tractable.
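A minimal numerical sketch of this posterior computation follows. It assumes a Gaussian likelihood centered on each candidate inverse depth, a flat prior by default, and a variance `m_zeta` that is held constant over ζ for simplicity (in general the CR bound may vary with ζ). The function and variable names are illustrative.

```python
import numpy as np


def depth_posterior(zeta_grid, zeta_hat, m_zeta, prior=None):
    """Posterior over inverse depth on a uniform grid, normalized by summation."""
    if prior is None:
        prior = np.ones_like(zeta_grid)          # flat prior
    likelihood = np.exp(-0.5 * (zeta_hat - zeta_grid) ** 2 / m_zeta)
    unnormalized = likelihood * prior
    dz = zeta_grid[1] - zeta_grid[0]
    return unnormalized / (np.sum(unnormalized) * dz)


zeta_grid = np.linspace(0.0, 1.0, 201)           # normalized inverse depth
posterior = depth_posterior(zeta_grid, zeta_hat=0.4, m_zeta=0.05 ** 2)
```

The normalizing integral is one-dimensional, so a simple sum over the grid is adequate here.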
Now construct a cost function that will maximize imaging performance given the initial estimates of the object and inverse depth. One example cost function is
$$C_s(\phi_{k+1}) = \mathrm{Tr}\int_{\zeta} M_s(\hat{s}_k, \hat{\zeta}_k, \Phi_{k+1})\, p(\zeta \mid \hat{\zeta}_k)\, d\zeta. \tag{17}$$
This cost function reflects the expected reconstruction MSE over the distance posterior distribution. In this way, the confidence in the depth estimate ζ̂k is balanced with the reconstruction MSE penalty. Now minimize the cost function with respect to φk+1 to estimate the acquisition setting for capture of the (k+1)st image.
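The selection step can be sketched as a search over candidate settings, each scored by its expected predicted MSE under the depth posterior. In the sketch below, `toy_mse` is a hypothetical stand-in for the CR-bound predictor of Eq. 14 (blur grows with A|β−ζ|, noise falls with A²), and the predicted MSE is assumed to be evaluated at each candidate inverse depth on the grid. The candidate lists and all numeric values are illustrative.

```python
import numpy as np


def expected_cost(setting, zeta_grid, posterior, predicted_mse):
    """Average the predicted MSE over the posterior on a uniform grid."""
    dz = zeta_grid[1] - zeta_grid[0]
    mse = np.array([predicted_mse(z, setting) for z in zeta_grid])
    return float(np.sum(mse * posterior) * dz)


def choose_next_setting(candidates, zeta_grid, posterior, predicted_mse):
    """Return the candidate setting with the smallest expected cost."""
    return min(candidates,
               key=lambda c: expected_cost(c, zeta_grid, posterior, predicted_mse))


def toy_mse(zeta, setting):
    """Hypothetical MSE predictor, not the CR bound itself."""
    A, beta = setting
    return (A * (beta - zeta)) ** 2 + 0.01 / A ** 2


zeta_grid = np.linspace(0.0, 1.0, 201)
posterior = np.exp(-0.5 * ((zeta_grid - 0.4) / 0.05) ** 2)
posterior /= posterior.sum() * (zeta_grid[1] - zeta_grid[0])

candidates = [(A, beta) for A in (0.25, 0.5, 1.0)
              for beta in np.linspace(0.0, 1.0, 11)]
best = choose_next_setting(candidates, zeta_grid, posterior, toy_mse)
```

With a narrow posterior, this toy predictor favors focusing near the estimated depth with the aperture fully open. A broader posterior raises the blur term for large apertures, which is the balance between depth confidence and reconstruction MSE that the cost function is meant to capture.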
Based on the two initial images, the adaptive acquisition module estimates the object depth ζ̂2 and determines a suggested acquisition setting for focus β3 and aperture A3 for the next image to be acquired, based on minimizing the cost function of Eq. 17. In each of the figures, curve 320 graphs the suggested focal setting β3 as a function of the estimated object depth ζ̂2 and curve 310 graphs the suggested aperture A3 as a function of the estimated object depth ζ̂2. In all these figures, the inverse depth ζ, aperture A, and inverse focal setting β are all normalized to the range [0, 1].
As a point of reference, the dashed line 330 shows the focal setting β3 for an overly optimistic autofocus algorithm. For curve 330, the estimate ζ̂2 is trusted completely. The back focus is chosen to focus exactly on the estimated depth and the aperture is set to a full aperture.
Note that in this example, there is a certain symmetry to the optimized acquisition setting with respect to the location of the initial frames. When the initial image pairs are closely spaced (e.g., β=[0.45, 0.55]), the adaptive acquisition module decides that there is insufficient information to reliably estimate depth and encourages sampling away from the current frames. The focal setting for the third image is chosen far from the previous estimates regardless of the depth estimate. The algorithm chooses a location either much closer or much farther from the current sampled locations depending on ζ̂2. If the depth estimate is near the previously acquired frames, then the adaptive acquisition module assumes that the previous frames will be sufficient for reconstruction and encourages sampling a new depth space while increasing the SNR by opening the aperture.
At the other extreme, when the frames are widely separated (e.g., β=[0.15, 0.85]), the adaptive acquisition module trusts the estimates in between the two frames and chooses β3=ζ̂2 approximately but shrinks the aperture to account for estimated uncertainty. As the depth estimates approach the previously sampled depth locations, the algorithm encourages sampling a new depth plane to acquire more information and opens the aperture to improve SNR. This optimization algorithm produces a nonlinear, yet explainable, acquisition setting for the third frame.
This example illustrates the relationship of the signal texture on adaptation of the acquisition setting. Signal texture is important to estimating depth from a pair of frames. In this example, the performance is computed using an image signal with a power spectral density given by
As γ increases, the signal becomes smoother, reducing the amount of texture, which is needed for estimating the depth.
In many applications, considerations other than maximizing reconstruction performance can also be important. For example, energy conservation and extending battery life is important for consumer digital cameras and other portable devices. Accordingly, consider an example cost function that combines a predictor of performance as well as a cost function associated with changing the aperture and focal settings (e.g., since changing focus or aperture size may require mechanical movement that drains a battery). This example cost function has the form
C(Φk+1)=Cs(ŝk,ζ̂k+1)+E(Φk+1) (18)
The first term accounts for the RMSE performance and the second term E(Φk+1) captures the penalty on changing the acquisition setting. This penalty function combines the cost associated with the energy required to change the acquisition setting as well as those reflecting the time lag required to change the acquisition setting. In a simplified model, the cost function might take the form
$$E(\Phi_{k+1}) = c_A\,|A_{k+1}-A_k|^{\alpha_A} + c_b\,|b_{k+1}-b_k|^{\alpha_b}$$
where cA, cb, αA and αb are constants. In the simulations presented below, αA=αb=2. Since moving a lens system requires much more energy and time than changing the aperture setting, a relative weighting of cA/cb=50 was used. The actual coefficients should be tuned for the particular SNR values associated with the imaging system in order to combine the different dimensions of MSE and energy.
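A short sketch of such a penalty follows. It assumes the simplified model is a sum of two weighted power-law terms, one in the aperture change and one in the back focal distance change, as the constants cA, cb, αA and αb suggest. The default weights encode only the stated ratio cA/cb=50 and exponents of 2; their absolute scale and the function names are assumptions.

```python
def change_penalty(A_next, b_next, A_prev, b_prev,
                   c_A=50.0, c_b=1.0, alpha_A=2.0, alpha_b=2.0):
    """Penalty on changing the aperture A and the back focal distance b."""
    return (c_A * abs(A_next - A_prev) ** alpha_A
            + c_b * abs(b_next - b_prev) ** alpha_b)


def total_cost(performance_cost, A_next, b_next, A_prev, b_prev):
    """Combined merit: predicted reconstruction cost plus the change penalty."""
    return performance_cost + change_penalty(A_next, b_next, A_prev, b_prev)


# Example: open the aperture from 0.6 to 1.0 and move b from 12.0 mm to 12.006 mm.
penalty = change_penalty(1.0, 12.006, 0.6, 12.0)
```

Keeping the setting unchanged gives zero penalty, so the combined cost only departs from the previous setting when the predicted gain in reconstruction performance outweighs the cost of moving.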
After acquiring a new image yk using the adapted acquisition setting, multiframe reconstruction can be applied to the larger set of images. The previous estimate of the depth can be used as the initial starting point for optimizing the cost function of Eq. 19. This process repeats until sufficient image quality is achieved, the maximum number of exposures has been acquired, or some total energy consumption has been reached.
The description above assumed that the object was planar and located at a single depth. This was assumed for purposes of clarity and is not a limitation. In more complicated scenes having variable depths, the adaptation of acquisition setting can consider different depths for different field locations. In other words, the depth can be modeled as a function of the spatial location z(x1, x2). The object can be modeled as a spatially-varying or multi-depth object. In some cases, each row of the PSF matrix H(z(x1, x2)) may change.
One alternative is to apply the algorithm described above to different tiles over the image field where the depth is assumed to be constant within the tile. In this case, the cost function will use a weighted sum of the predicted MSE computed via Eq. 18 over the set of tiles. Another approach uses only the maximum MSE over the tiles in a greedy approach to minimizing global MSE.
Estimating depth is important to the example described above. In the above example, it is estimated by minimizing the cost function of Eq. 7. However, depth can be estimated using different techniques, for example, using filter banks. In one approach, the images are filtered by a bank of bandpass filters. The energy at the outputs of the filters is used to estimate the depth. This can even be done on a per-pixel basis. The filter outputs can then be combined on a weighted basis according to the depth estimate for that pixel. Depth segmentation can be added to improve accuracy and reduce complexity. One advantage of the filter bank approach is that it is not as computationally intensive as the approaches described above.
In one approach, instead of building a model of the image as a function of defocus, and hence depth, a model of the filtered image is constructed as a function of defocus. Assume that the OTF of the system is mostly rotationally symmetric. Now use a bank of rotationally-symmetric bandpass filters. Such filters capture the image spectral content within a rotationally symmetric region in frequency space. Denote the set of filters used as Fj(ρ), j=1, …, P, where j identifies the filter band pass radial frequency. For simplicity, consider a set of bandpass filters in which the center frequency of the jth bandpass filter is given by ρj=j/(P+1). The output of these filters is equivalent to projecting the two-dimensional image spectrum onto a one-dimensional subspace defined by the rotationally symmetric filters. In doing this, the computational complexity of the nonlinear depth estimation process can be greatly reduced by lowering the dimensionality of the data.
In this example, estimate the inverse depth for the ith pixel using a nonlinear cost function of the form
where gj(ζ,φk) is the output function for the jth filter as a function of inverse depth ζ; and ckji is the measured filter output for the ith pixel, jth filter, using the acquisition settings for the kth acquired image. The value of σjk is the noise associated with the jth filter with the kth acquisition settings. N² represents the size of the image. This is defined as
The term Psj is the expected filter output statistical prior defined by
The terms σk² and Ps(w, v) are as previously defined.
The filter-based depth estimation is based on modelling the filter output as a function of the filter set. This involves a calibration process to model the filter output gain functions. One choice for modelling the filter output is a Gaussian function, with mean as a function of inverse focus setting and the variance as a function of aperture setting and the focus setting according to:
In this formulation, the b terms are tuning parameters for this particular gain function chosen at calibration time. When calibrating, use the ground-truth inverse depth as input and estimate the parameter settings for each filter. Other functional forms of the filter output (23) can be used. The ideal filter output model represents the filter output as a function of inverse depth for a wide range of signals.
The specific adaptive acquisition strategy described above was simulated based on the imaging system described in Table 1. The simulated test image is a traditional spoked target pattern. The image grayscale values are normalized such that the maximum grayscale value is one. This provides a general SNR at full aperture of 26 dB. The image is 120×120 pixels in size.
The object is assumed to be a planar object at a depth of z=2 m from the front of the camera. The initial camera acquisition settings are A0=0.6, d0=12.00 mm and A1=1.0, d1=12.006 mm. These back focal distances correspond to a camera focused at infinity for the first frame and at 24 meters from the front of the camera for the second frame. The multiframe reconstruction algorithm of Eq. 7 yields a poor initial depth estimate of ẑ0=3.73 m. The reconstructed image using this poor depth estimate is itself quite poor.
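The correspondence between back focal distance and focus distance quoted above can be checked with the thin-lens equation 1/f=1/z+1/d. The sketch below is only a consistency check under assumptions: a focal length of 12 mm is inferred from the statement that d0=12.00 mm focuses at infinity, an ideal thin lens is assumed, and object distance is measured from the lens. The function names are hypothetical.

```python
def focus_distance_from_back_focal(back_focal_mm, focal_length_mm=12.0):
    """Object distance in meters that is in focus for a given back focal distance."""
    if back_focal_mm <= focal_length_mm:
        # Sensor at the focal plane: focused at infinity.
        return float("inf")
    # Thin lens: 1/f = 1/z + 1/d  =>  z = f*d / (d - f).
    z_mm = focal_length_mm * back_focal_mm / (back_focal_mm - focal_length_mm)
    return z_mm / 1000.0

def back_focal_from_focus_distance(focus_distance_m, focal_length_mm=12.0):
    """Back focal distance in millimeters that focuses at the given object distance."""
    z_mm = focus_distance_m * 1000.0
    # Thin lens: d = f*z / (z - f).
    return focal_length_mm * z_mm / (z_mm - focal_length_mm)

first_frame_m = focus_distance_from_back_focal(12.00)    # infinity
second_frame_m = focus_distance_from_back_focal(12.006)  # about 24 m
object_plane_mm = back_focal_from_focus_distance(2.0)    # setting that would focus at z = 2 m
```

Under these assumptions, d1=12.006 mm gives a focus distance of about 24.01 m, in agreement with the 24 meters stated for the second frame.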
The acquisition settings were determined using the cost function of Eq. 18 with a strong penalty on changing the acquisition settings. Consequently, after k=4, the adaptive acquisition module chooses not to incur the penalty of changing the acquisition setting further even though the acquired image y5 is obviously still out of focus. The reconstructed image, however, shows reasonable quality. After acquiring the third frame, the algorithm correctly estimates the depth at ẑ2=1.99 m. This estimate improves with continued iteration. The dynamically determined acquisition settings for this first experiment are shown in Table 2. The acquisition settings stop changing after k=4 as the energy penalty required to improve the performance prevents the algorithm from further change. At k=5, the camera is focused at a depth plane corresponding to 4.8 m from the camera.
Optional pre-processing module 1130 receives data representing the 3D scene to be rendered and adapts it to rendering requirements. For example, pre-processing module 1130 may perform functions such as magnifying, cropping and sharpening. Focal plane placement module 1140 analyzes the content of the 3D scene and selects the locations of the focal planes based on the content analysis. The selection can also be based on rendering requirements. Scene separation module 1150 separates the 3D scene into the constituent 2D images to be rendered. This typically involves depth blending, as will be described below. The content of each 2D image will depend on the focal plane locations. Rendering engine 1160 then renders the 2D images onto the display, in coordination with adjustment of the optical element 1120 to effect the different focal planes. Additional post-processing can also be performed. For example, smoothing constraints (temporal and/or spatial) may be applied, or occlusion edges may be processed to further improve perceived quality.
In
MFD technology can represent a 3D scene by a series of 2D images at different focal planes due to a concept known as depth blending. By illuminating two adjacent focal planes simultaneously, a focus cue may be rendered at any axial distance between the planes. Since the two focal planes lie along a line of sight, the luminance provided by each of the adjacent focal planes determines where the cue will be highest (where the eye perceives the highest visual quality, or where the area under the modulation transfer function (MTF) observed by the eye is highest).
A simple form of luminance weighting used for depth blending is a linear interpolation of the luminance values observed by each pixel for the adjacent focal planes, which we will use as an example, although other types of depth blending can also be used. Let wn and wƒ respectively denote the luminance weights given to the near and far focal planes. These values, which sum to 1 to retain the correct luminance perceived by the eye, are computed as follows:
wƒ=(z−zn)/(zƒ−zn), wn=1−wƒ,
where zn and zƒ are the locations of the near and far focal planes and z is the actual location of the object in the 3D scene, which is between zn and zƒ. In this linear formulation, if z=zn (object point at the near focal plane), then wƒ=0 and wn=1, meaning that all of the luminance is allocated to the near focal plane. Conversely, if z=zƒ (object at the far focal plane), then wƒ=1 and wn=0, and all of the luminance is allocated to the far focal plane. For an intermediate position such as z=(zn+zƒ)/2, then wƒ=½ and wn=½ so luminance is split between the far and near focal planes. In this way, a virtual object can be rendered at any position z between zn and zƒ by splitting its luminance between the two images rendered at focal planes zn and zƒ.
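The linear weighting can be written out in a few lines of Python. This is a minimal sketch that assumes the interpolation wƒ=(z−zn)/(zƒ−zn), wn=1−wƒ, which is the linear form consistent with the three cases described above. The function name linear_blend_weights is hypothetical.

```python
def linear_blend_weights(z, z_near, z_far):
    """Return (w_near, w_far) for an object at z between two adjacent focal planes.

    All three positions must be expressed in the same units.
    """
    if not (min(z_near, z_far) <= z <= max(z_near, z_far)):
        raise ValueError("z must lie between the near and far focal planes")
    # Linear interpolation: the far plane's share grows as z approaches z_far.
    w_far = (z - z_near) / (z_far - z_near)
    # The weights sum to 1 so that the total perceived luminance is preserved.
    w_near = 1.0 - w_far
    return w_near, w_far

# Usage: an object midway between planes at 1.0 and 3.0 splits its luminance evenly.
w_near, w_far = linear_blend_weights(2.0, 1.0, 3.0)
```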
We first formulate the problem of placement of focal planes based on a given objective function, and then show two examples of different objective functions. The objective function typically is a type of distortion metric that measures a distortion between an ideal rendering of the 3D scene versus the rendering by the MFD.
Let (x,y,z) denote the two transverse dimensions and the axial dimension of the 3D space rendered by the MFD. In practice, what we are typically given are the following quantities:
To estimate the best positions of focal planes, we formulate the following optimization problem:
q*=argmin_q D(S,q),
where the objective function D(S, q) denotes a distortion error metric for representing a 3D scene S on M focal planes positioned at q=(q1, q2, . . . , qM). This can in general be any metric that measures the error relative to a perfect rendering.
Alternately, we can pose the optimization problem such that it finds a solution for focal plane placement that maximizes the quality of the 3D scene rendering Q(S, q):
q*=argmax_q Q(S,q).
In the following, we show two specific examples of automatic focal plane placement. In the first example, we use an error metric D(S,q) and minimize it to obtain q. In the second example, we use a quality metric Q(S,q) that can be used for focal plane placement. Other distortion metric functions, including other error or quality metrics, can be used as well.
The first example of an objective function can be derived by considering the problem of focal plane placement as a clustering problem. Given the z-coordinates of all 3D data points in a scene, that is, given z1, z2, . . . , zN, we can use the K-means algorithm to find the best placement of M focal planes. In this case, our optimization problem becomes:
Solving this problem using the K-means algorithm gives a placement of focal planes such that the focal planes used to represent 3D data are close to the actual location of the data. Hence, in most cases this optimization problem will give a solution different from the conventional strategy of uniform focal plane spacing. Note that in the optimization above, instead of distance z in meters, we can also use distance in diopters (inverse meters) or other measures of optical power, in order to account for the decreasing sensitivity of depth perception with increasing distance.
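The clustering formulation can be sketched with a one-dimensional Lloyd iteration, the usual way of running K-means. The following Python code is an illustrative sketch, not the implementation used for the results reported here: the function name kmeans_focal_planes, the uniform initialization, and the handling of empty clusters are assumptions. The depths may be given in meters or in diopters.

```python
import numpy as np

def kmeans_focal_planes(depths, num_planes, num_iters=100):
    """Place num_planes focal planes at the K-means centroids of scalar depths."""
    depths = np.asarray(depths, dtype=float)
    # Start from uniform spacing over the depth range (the conventional placement).
    planes = np.linspace(depths.min(), depths.max(), num_planes)
    for _ in range(num_iters):
        # Assignment step: index of the nearest focal plane for every data point.
        labels = np.argmin(np.abs(depths[:, None] - planes[None, :]), axis=1)
        new_planes = planes.copy()
        for m in range(num_planes):
            members = depths[labels == m]
            if members.size:  # an empty cluster keeps its previous position
                new_planes[m] = members.mean()
        converged = np.allclose(new_planes, planes)
        planes = new_planes
        if converged:
            break
    return np.sort(planes)

# Usage: scene points grouped near 0.5 D and 2.0 D, represented with two focal planes.
scene_depths = [0.4, 0.5, 0.6, 1.9, 2.0, 2.1]
planes = kmeans_focal_planes(scene_depths, 2)
```

For this toy scene the two planes move from the uniform starting positions to the centers of the two depth groups.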
Spatial frequencies of the content also impact accommodative response when depth blending is used. For low-frequency stimuli (for example, 4 cycles per degree or cpd), linear depth blending can drive accommodation relatively accurately between planes. But for high-frequency stimuli (for example, 21 cpd) and broadband stimuli (for example, 0-30 cpd), accommodation is almost always at or near a focal plane no matter how the luminance weights wƒ, wn are distributed. Therefore, a weighted K-means algorithm can be used to take this spatial frequency dependency into account. For example, if the spatial frequency or spatial gradient value near a point is higher than a threshold, it can be assigned a large weight, otherwise it can be assigned a small weight. Denote
Table 4 below shows the focal plane positions using uniform focal plane spacing, using K-means focal plane spacing and using weighted K-means focal plane spacing.
These focal plane locations are also shown by the arrows above the graph in
K-means is used just as an example. Other clustering techniques can be applied, for example clustering based on Gaussian Mixture Models (GMM) or support vector machines (SVM).
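The weighted K-means variant mentioned above can be sketched in the same way. In the following Python code, points whose spatial gradient exceeds a threshold receive a large weight and all other points a small weight, and each focal plane moves to the weighted mean of its cluster. The specific weight values, the threshold, and the function name are hypothetical choices made for illustration only.

```python
import numpy as np

def weighted_kmeans_focal_planes(depths, gradients, num_planes, threshold,
                                 high_weight=10.0, low_weight=1.0, num_iters=100):
    """K-means on scalar depths where high-gradient points pull the planes harder."""
    depths = np.asarray(depths, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    # Binary weighting rule: large weight above the threshold, small weight otherwise.
    weights = np.where(gradients > threshold, high_weight, low_weight)
    planes = np.linspace(depths.min(), depths.max(), num_planes)
    for _ in range(num_iters):
        labels = np.argmin(np.abs(depths[:, None] - planes[None, :]), axis=1)
        new_planes = planes.copy()
        for m in range(num_planes):
            mask = labels == m
            if mask.any():
                # Update step: weighted centroid instead of the plain mean.
                new_planes[m] = np.average(depths[mask], weights=weights[mask])
        converged = np.allclose(new_planes, planes)
        planes = new_planes
        if converged:
            break
    return np.sort(planes)

# Usage: one plane for two points; the high-gradient point at 2.0 dominates.
single_plane = weighted_kmeans_focal_planes([1.0, 2.0], [0.0, 5.0], 1, threshold=1.0)
```

With weights of 1 and 10, the single plane lands at 21/11, much closer to the high-frequency content than the unweighted midpoint of 1.5.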
When a given 3D scene with continuous depth values is displayed on a multi-focal display with a finite number of focal planes, human eyes will perceive it with a certain amount of defocus compared to an ideal continuous 3D rendering. We describe here a model of that defocus, which we then use within our objective function for focal plane placement. Namely, our objective function will place the focal planes such that it maximizes the quality of the 3D scene rendering by minimizing the defocus.
Optical defocus is typically modeled through Fourier optics theory, in a continuous waveform domain. Therefore, assume that a given 3D scene is a set of samples from a continuous 3D function ƒ(x,y,z), where we have that In=ƒ(xn,yn,zn) for n=1, 2, . . . , N given points in our 3D scene. We first provide a Fourier derivation of a human eye's sensitivity to defocus and then use the derived theory to define a quality metric for a given 3D scene.
Let primed coordinates (x′,y′) denote the retinal coordinates. When the eye accommodates to a distance ze, a 2D retinal image g(x′,y′) may be expressed as a convolution of the 3D object with the 3D blur kernel h(x, y, z) evaluated at a distance ze−z, followed by integration along the axial dimension:
g(x′,y′,ze)=∫∫∫ƒ(x,y,z)h(x−x′,y−y′,ze−z)dxdydz. (32)
Note that in the case of in-focus plane-to-plane imaging (ze−z=0), the convolution kernel h reduces to the eye's impulse response. This configuration yields maximum contrast, where contrast is defined in the conventional way in the spatial frequency domain. Deviations from that in-focus imaging result in a reduction in contrast. The severity of the lost contrast depends on the amount of defocus.
To quantify the effects of defocus, we turn to the pupil function of the eye's optical system. For a rotationally-symmetric optical system with focal length F and circular pupil of diameter A, the lens transmittance through the exit pupil is modeled as:
where the pupil function P is given by
In our system, the pupil diameter A may vary between approximately 2 mm and 8 mm based on lighting conditions. Though the eye is, in general, not rotationally symmetric, we approximate it as such to simplify formulation in this example.
In the presence of aberrations, the wavefront passing through the pupil is conventionally represented by the generalized pupil function G(x,y)=P (x,y)exp(iΦ(x,y)), where the aberration function Φ is a polynomial according to Seidel or Zernike aberration theory. The defocus aberration is commonly measured by the coefficient w20 of Φ. Defocus distortion can alternatively be modeled by including a distortion term θz in the pupil function and defining the pupil function of a system defocused by distance θz in axial dimension as
Pθz(x,y)=exp(πi(θz/λ)(x²+y²))P(x,y), (34)
where θz=1/z+1/zr−1/F, with zr being the distance between the pupil and the retina. The relationship between θz and the conventional defocus aberration coefficient w20 is given by θz=2w20/A². Using this formulation, we can formulate the defocus transfer function, which is the optical transfer function of the defocused system, as the auto-correlation of the pupil function of the defocused system as follows:
Now we replace the defocus distortion distance θz with 1/ze−1/z and define the normalized defocus transfer function (DTF) of the eye as
Optical aberrations of the eye and/or the MFD system can be modeled into the DTF as well.
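The construction described above, namely a defocused pupil function followed by its auto-correlation, can be evaluated numerically. The Python sketch below is an illustration under assumptions: it uses a circular pupil with the quadratic phase term exp(πi(θz/λ)(x²+y²)), a 4 mm pupil diameter and a 550 nm wavelength chosen only as example values, and a discrete auto-correlation computed with FFTs. The result is indexed by lag in pupil coordinates, and the scaling from lag to spatial frequency is omitted.

```python
import numpy as np

def defocus_transfer_function(theta_z, pupil_diameter=4e-3, wavelength=550e-9, grid=256):
    """Normalized auto-correlation of a defocused circular pupil.

    theta_z is the defocus in diopters (1/m); lengths are in meters.
    Returns a complex (grid x grid) array with zero lag at [grid//2, grid//2].
    """
    # Window twice as wide as the pupil so the circular auto-correlation does not wrap.
    coords = np.linspace(-pupil_diameter, pupil_diameter, grid, endpoint=False)
    x, y = np.meshgrid(coords, coords)
    r2 = x**2 + y**2
    pupil = (r2 <= (pupil_diameter / 2) ** 2).astype(float)
    # Defocused pupil: quadratic phase with strength theta_z / wavelength.
    defocused = pupil * np.exp(1j * np.pi * (theta_z / wavelength) * r2)
    # Auto-correlation via the Wiener-Khinchin relation: IFFT of |FFT|^2.
    power = np.abs(np.fft.fft2(defocused)) ** 2
    autocorr = np.fft.fftshift(np.fft.ifft2(power))
    # Normalize so that the zero-lag (zero-frequency) value equals 1.
    return autocorr / autocorr[grid // 2, grid // 2]

# Usage: compare the area under the in-focus and the 0.5 D defocused transfer functions.
dtf_in_focus = defocus_transfer_function(0.0, grid=128)
dtf_defocused = defocus_transfer_function(0.5, grid=128)
area_in_focus = np.abs(dtf_in_focus).sum()
area_defocused = np.abs(dtf_defocused).sum()
```

The area under the magnitude of the defocused curve is smaller than the in-focus area, which is the loss of contrast that the defocus model is meant to capture.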
The image as formed on the retina is described by the multiplication of the defocus transfer function and the Fourier transform ƒ̂(u,v,z) of the function ƒ(x,y,z) describing the object displayed at distance z from the eye:
ĝ(u,v,z,ze)=Ĥ(u,v,z,ze)ƒ̂(u,v,z). (37)
In an MFD system, we can typically display only a small number of focal planes fast enough to be perceived as simultaneously displayed by the human eye. For the case that two objects are being displayed at two focal planes located at distances q1 and q2 away from the eye, the eye integrates the two objects as imaged through the eye's optical system. That is, it integrates over the light emitted from the two objects after passing through the eye's optical system described by the defocus transfer function. We derive this image formation at the retina plane by the following formula:
ĝr(u,v,q1,q2,ze)=Ĥ(u,v,q1,ze)ƒ̂(u,v,z)+Ĥ(u,v,q2,ze)ƒ̂(u,v,z). (38)
If linear depth blending is applied to the input scene ƒ(x,y,z), using coefficients w1 and w2, then the Fourier transform of the perceived image on the retina is described by
ĝr(u,v,q1,q2,ze)=w1Ĥ(u,v,q1,ze)ƒ̂(u,v,z)+w2Ĥ(u,v,q2,ze)ƒ̂(u,v,z). (39)
Using this observation, we define the depth-blended defocus transfer function of the entire system as
Ĥblend(u,v,(q1,q2),ze)=w1Ĥ(u,v,q1,ze)+w2Ĥ(u,v,q2,ze). (40)
We can also generalize this blending function using all display planes q1, . . . , qM to derive an effective or blended transfer function for the multi-focal display as:
Ĥblend(u,v,q,ze)=w1Ĥ(u,v,q1,ze)+ . . . +wMĤ(u,v,qM,ze)
for q=(q1, . . . , qM).
Depth blending drives the accommodation of the eye to a focal plane with a Ĥblend(u,v,q,ze) closest to the ideal DTF curve. We can see from
The eye will accommodate to a distance that maximizes the area under the DTF. However, since that distance depends on the spatial frequency, we further assume that the eye will accommodate to the distance that maximizes a certain quality metric QDM(S, q) based on this defocus measure (area under the DTF). Since this distance varies with each patch, we seek a solution that incorporates all of the patches into a single metric.
In one approach, we partition the displayed image ƒ(x,y,z) into Np patches ƒ(x,y,zi), i=1, . . . , Np, where zi is a scalar representing the ith patch's mean object distance. Overlapping patches may be used. We may compute each patch's Fourier transform and multiply it with the depth-fused DTF to find the information transferred from a stimulus to the eye according to a placement of focal planes located at q={q1, q2, . . . , qM} and a local stimulus located at distance zi to compute the scalar value βi for each patch:
βi(zi,q)=∫u
where [u0, u1] and [v0, v1] denote the frequency interval of interest. Other metrics describing the object's information content, such as measures of contrast, entropy, or other transformative metrics could be used to define βi(zi,q) as well.
If we store the metrics from all of the patches into a vector β we can alter the focal plane placement for up to M focal planes. We seek to solve the following optimization problem to find q*, the optimal set of dioptric distances to place the available focal planes:
which can be relaxed or adjusted if not solvable in realistic time.
The resulting entries of q* signify where best to place the set of M focal planes. For example, optimizing 2 focal planes to represent 3 objects clustered about dioptric distances of 1/z1=0.6D, 1/z2=1.5D, and 1/z3=2.0D might result in the optimal focal plane placement of 1/q1=1.1D, 1/q2=1.8D.
The solution for q could begin with an initial guess of uniform focal plane spacing based on the available focal planes. For example, a 6-plane system seeking a workspace between 0 and 3 diopters could start with {0, 0.6, 1.2, 1.8, 2.4, 3.0}D. As the optimization algorithm proceeds through iterations k, the entries of q would change until |QDMk(S,q)−QDMk+1(S,q)|≤ε, where ε is a tolerance parameter telling the algorithm when to stop. Extra specifications could be incorporated into the optimization algorithm to constrain the feasible solution set, as well.
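The iteration described above, with a uniform initial guess and a stop when successive values of the metric differ by no more than ε, can be sketched as follows. The optimizer is not specified here, so the coordinate search with a fixed step used below is an assumption, as are the function names. The quality function toy_quality is a hypothetical stand-in (negative squared dioptric distance from each object to its nearest plane) and is not the defocus metric QDM.

```python
import numpy as np

def optimize_focal_planes(quality, num_planes=6, workspace=(0.0, 3.0),
                          step=0.05, tol=1e-6, max_iters=200):
    """Maximize quality(q) over focal plane positions q (in diopters)."""
    # Uniform initial guess, e.g. {0, 0.6, 1.2, 1.8, 2.4, 3.0} D for 6 planes on 0-3 D.
    q = np.linspace(workspace[0], workspace[1], num_planes)
    best = quality(q)
    for _ in range(max_iters):
        previous = best
        # One sweep: try moving each plane one step in either direction.
        for m in range(num_planes):
            for delta in (-step, step):
                trial = q.copy()
                trial[m] = np.clip(trial[m] + delta, workspace[0], workspace[1])
                value = quality(trial)
                if value > best:
                    q, best = trial, value
        # Stop when the metric changes by no more than the tolerance between sweeps.
        if abs(best - previous) <= tol:
            break
    return np.sort(q), best

# Hypothetical stand-in metric: objects at 0.6 D, 1.5 D and 2.0 D.
object_diopters = np.array([0.6, 1.5, 2.0])

def toy_quality(q):
    distances = (object_diopters[:, None] - np.asarray(q)[None, :]) ** 2
    return -np.sum(np.min(distances, axis=1))

q_opt, q_value = optimize_focal_planes(toy_quality, num_planes=2)
```

Because a move is accepted only when it increases the metric, the returned placement is never worse than the uniform starting point, although a local search of this kind can stop at a local optimum.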
Finally, note that the metric QDM(S, q) quantifies the quality of the rendering of a given 3D scene, with respect to defocus. Therefore, in addition to focal plane placement, this metric can be also used for rendering quality assessment in MFDs.
The eye's accommodation was varied in increments of 0.1D between these two focal planes. The accommodation is between −0.3 and +0.3D, where +0D corresponds to the dioptric midpoint of the focal planes at q1 and q2.
That is, the top left square is an image of a 9 cpd image where the eye accommodates to −0.3D. For the top middle square, the eye accommodates to −0.2D, and so on. The bottom middle and bottom right squares are not used, so they are left blank.
Although the detailed description contains many specifics, these should not be construed as limiting the scope of the invention but merely as illustrating different examples and aspects of the invention. It should be appreciated that the scope of the invention includes other embodiments not discussed in detail above. For example, acquisition parameters other than focus and aperture can be used. Exposure time T is one example. Other examples include wavelength filtering, polarization filtering, illumination control, and camera orientation. The adaptive techniques described above can be used to also determine the acquisition setting for these parameters. As another example, the initial set of images in the examples above was acquired based on predetermined acquisition settings. In alternate embodiments, these acquisition settings may also be optimized, for example based on signal and/or depth prior information. As another variation, different optimization techniques based on the CR bound might be used. For example, rather than using a local search technique, optimization could be based on a maximum ΔΦ search range and computing optimal settings via exhaustive search. Functions other than the CR bound or ML estimation could also be used. Fast filter approximations can also be used to solve the multiframe reconstruction and/or the depth estimation algorithms.
As another example, acquisition settings may be determined based on acquiring multiple next frames rather than just a single next frame. In the examples above, an initial set of two images was acquired. Based on this two-frame set, the acquisition settings for a third frame were then determined, but without taking into account the possibility that a fourth or fifth frame might also be acquired. In an alternate approach, the acquisition settings are determined with the goal of increasing overall performance over several next frames, or for the entire final set of images. Thus, after the first two frames, the third frame may be selected based on also acquiring a fourth frame, or assuming that there will be a total of six frames (i.e., three more frames after the third frame).
As another example,
In another aspect, in addition to selecting the locations of the renderable volumes, the multi-focal display also selects the number of renderable volumes. In the original example with six focal planes, the multi-focal display might determine the number M of focal planes where M can be up to six. Fewer than the maximum number may be selected for various reasons, for example to reduce power consumption.
In yet another aspect,
In yet another aspect, the adaptive image acquisition may be combined with the multi-focal display. In one approach, the selection of the set of images or of the next image takes into account that the reconstruction from the acquired set of images will occur on a multi-focal display. That is, the multiframe reconstruction accounts for the constraints and characteristics of reconstruction by a multi-focal display: given a certain multi-focal display, determine the best set of images to acquire for that display. The converse approach can also be formulated: given a certain set of acquired images, determine the best set of focal settings for the multi-focal display.
If there is flexibility in both the image acquisition and the multi-focal display, then a hybrid approach can be adopted. For example, optimization may alternate between the two cases. First, optimize the image acquisition given a certain multi-focal display. Then optimize the multi-focal display given the image acquisition. Continue to alternate between the two until both are optimized.
In yet another approach, the image acquisition and multi-focal display may be linked to each other. For example, the multi-focal display may display using N focal locations and the image acquisition may be assumed to acquire N images at the same focal locations. Then the two optimizations may be combined using this constraint. In one approach, a weighted merit function M is derived:
M=w1Cs+w2Q, (44)
where Cs is the cost function of Eq. 17, Q is the quality metric of Eq. 29, and w1 and w2 define the relative weights of the two terms.
Various other modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus of the present invention disclosed herein without departing from the spirit and scope of the invention as defined in the appended claims. Therefore, the scope of the invention should be determined by the appended claims and their legal equivalents.
This application is a continuation-in-part of U.S. patent application Ser. No. 14/551,998, “Adaptive Image Acquisition For Multiframe Reconstruction,” filed Nov. 24, 2014; which is a continuation of U.S. patent application Ser. No. 12/079,555, “Adaptive Image Acquisition For Multiframe Reconstruction,” filed Mar. 26, 2008. This application is also a continuation-in-part of U.S. patent application Ser. No. 14/642,095, “Content-Adaptive Multi-Focal Display,” filed Mar. 9, 2015; which claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application Ser. No. 62/084,264, “Content-Adaptive Multi-Focal Display,” filed Nov. 25, 2014. This application also claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application Ser. No. 62/180,955, “Adaptive Image Acquisition and Display Using Multi-Focal Display,” filed Jun. 17, 2015. The subject matter of all of the foregoing is incorporated herein by reference in their entirety.