This invention relates generally to rendering virtual images, and more particularly to modeling and estimating errors produced by rendering virtual images.
In three-dimensional video (3DV), videos include texture images acquired by cameras at different configurations, and associated depth images. The per-pixel depths in the depth images enable synthesis of virtual images for selected viewpoints via depth-image-based rendering (DIBR), see MPEG Video and Requirements group, “Call for proposals on 3D video coding technology,” Tech. Rep., MPEG, 2011 MPEG N12036, and Tanimoto et al., “View synthesis algorithm in view synthesis reference software 2.0 (VSRS2.0),” Tech. Rep., MPEG, 2009, MPEG M16090.
Depths are typically acquired by a ranging device, such as a time-of-flight sensor. Alternatively, the depths can be estimated from the texture images using triangulation techniques.
In many 3DV applications, it is imperative that the quality of the virtual images for synthesized views is comparable to the images in the acquired video. However, the rendering quality typically depends on several factors, and complicated interactions between the factors.
In particular, texture and depth images often contain errors. Herein, errors, which degrade the quality, are generally characterized as noise. Noise includes any data that do not conform with the acquired video of the scene. The errors can be texture and depth errors.
The errors can be due to imperfect sensing or lossy compression. It is not clear how these errors interact and affect the rendering quality. Unlike the texture errors, which cause distortions in luminance and chrominance levels, the depth errors cause position errors during the synthesis, and their effect is more subtle.
For example, the impact of the depth errors can vary with the contents of the texture images. Simple texture images tend to be more resilient to depth errors, while complex texture images are not. The impact of the depth errors also depends on the camera configuration, because the configuration affects the magnitudes of the position errors. Along the rendering pipeline, the depth errors are also transformed by different operations, which complicates an understanding of their effects.
An accurate analytical model to estimate the rendering quality is very valuable for the design of 3DV systems and methods. As an example, the model can help understand under what conditions reducing the depth error would substantially improve the synthesis output. Then, 3DV encoders can use this information to determine when to allocate more bits to encode the depth images.
As another example, the model can be used to estimate how much improvement can be achieved by reconfiguring the cameras, e.g., closer to each other, given other factors such as the errors in the texture images.
One model is based on an analysis of the rendering quality of image-based rendering (IBR), and uses Taylor series expansion to derive an upper bound of the mean absolute error (MAE) of the view synthesis.
An autoregressive model estimates the synthesis distortion at the block level and is effective for rate-distortion optimized mode selection. A distortion model as a function of the position of the viewpoint is also known for bit allocation.
The embodiments of the invention provide an analytical model and method for estimating a rendering quality, in virtual images for virtual viewpoints, in a 3D video (3DV). The model relates errors to the rendering quality, taking into account texture image characteristics, texture image quality, camera configuration, i.e., real viewpoints, and the rendering process.
Specifically, we derive position errors from depth errors, and a probability distribution of the position errors is used to determine a power spectral density (PSD) of the rendering errors.
The model can accurately estimate the synthesis noise up to a constant offset. Thus, the model can be used to evaluate a change in rendering quality for systems and methods of different designs.
We analyze how depth errors relate to the rendering quality, taking into account texture image characteristics, texture image quality, camera configuration and the rendering process. In particular, depth errors are used to determine the position errors, and the probability distribution of the position errors is in turn used to estimate the synthesis noise power at the image level.
We use the power spectral density (PSD) to analyze the impact of the depth errors in terms of the mean squared error (MSE). This relates to prior art, which used the PSD only to analyze the effects of motion vector inaccuracy and disparity inaccuracy.
However, while previous work applied PSD to analyze the efficiency of the motion and disparity compensated predictors in predictive coding, we use the PSD to quantify the noise power in virtual images produced by a rendering pipeline.
Although we focus on texture and depth errors due to predictive coding, we make no assumption on how information was distorted to produce the errors. We focus on the transformation and interaction of the texture and depth errors in the synthesis pipeline.
View Synthesis Pipeline Model
First, pixels are copied 101 from Xl at position (m′,n) to position (m,n) to produce an intermediate left image Ul. If the cameras are arranged linearly, then the horizontal disparity is

m − m′ = f·bl·[(Dl(m′,n)/255)·(1/znear − 1/zfar) + 1/zfar],    (1)

where f is the focal length, bl is the (baseline) distance between the left and virtual camera centers, znear and zfar are the nearest and farthest depths, and 255 is the number of possible depth values (2⁸−1). Likewise, pixels are copied from Xr at position (m″,n) to position (m,n) to produce an intermediate right image Ur with horizontal disparity m−m″.
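As an illustration, a minimal sketch of this warping step is given below, assuming the linear disparity relation of Eqn. (1), 8-bit depth maps, and row/column (n, m) array indexing; the function names are illustrative, and occlusion handling (z-buffering) and sub-pixel precision are omitted.

```python
import numpy as np

def disparity_from_depth(depth, f, baseline, z_near, z_far):
    """Horizontal disparity m - m' of Eqn. (1) for an 8-bit depth map."""
    inv_z = (depth.astype(np.float64) / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return f * baseline * inv_z

def warp_left_to_virtual(texture_left, depth_left, f, baseline, z_near, z_far):
    """Copy each pixel X_l(m', n) to position (m, n) in the intermediate image U_l."""
    rows, cols = depth_left.shape
    disparity = np.rint(disparity_from_depth(depth_left, f, baseline, z_near, z_far)).astype(int)
    u_left = np.zeros_like(texture_left)
    hole = np.ones((rows, cols), dtype=bool)      # true where no source pixel lands
    for n in range(rows):
        for m_src in range(cols):
            m_dst = m_src + disparity[n, m_src]   # m = m' + (m - m')
            if 0 <= m_dst < cols:
                u_left[n, m_dst] = texture_left[n, m_src]
                hole[n, m_dst] = False
    return u_left, hole
```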
Then, Ul and Ur are merged 102 to generate the virtual image U using a linear combination
U(m,n) = αUl(m,n) + (1−α)Ur(m,n),    (2)
where the weight α is determined by the distances between the position of the virtual camera and the positions of the left and right (real) reference cameras.
Some virtual pixel locations in Ul(m,n), Ur(m,n), or both can be missing due to position rounding errors, disocclusions, or because they fall outside the field-of-view of the reference cameras. Nevertheless, if the distances between the reference and virtual cameras are small, then the number of missing pixels is usually small and does not cause a significant discrepancy in the model. Other nonlinear blending techniques can be used; however, linear blending is a good approximation to more complex blending techniques.
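A sketch of the merging step of Eqn. (2) follows; the rule for deriving α from the baselines and the fallback for missing pixels are assumptions chosen for illustration, not prescribed by the method.

```python
import numpy as np

def merge_views(u_left, u_right, hole_left, hole_right, b_left, b_right):
    """Blend the intermediate images into the virtual image U (Eqn. (2))."""
    # Assumed convention: the reference camera closer to the virtual viewpoint
    # receives the larger weight.
    alpha = b_right / (b_left + b_right)
    u = alpha * u_left.astype(np.float64) + (1.0 - alpha) * u_right.astype(np.float64)
    # Where only one reference view contributes, fall back to that view alone.
    only_right = hole_left & ~hole_right
    only_left = ~hole_left & hole_right
    u[only_right] = u_right[only_right]
    u[only_left] = u_left[only_left]
    return u
```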
Noise Analysis
In practice, the texture and depth images are lossily encoded, and the reconstructed versions (denoted with a circumflex), X̂l, X̂r, D̂l and D̂r, are processed by the synthesis pipeline to produce the left and right intermediate images Wl and Wr, which are then merged to generate the virtual image W.
The quality of the virtual image is usually measured, as in MPEG 3DV, between the image rendered from the acquired texture and depth images and the image rendered from the reconstructed texture and depth images, i.e., between U and W. The synthesis noise V = U − W in the virtual image is due to encoding errors in the texture and depth images.
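For concreteness, the noise power E[V²] of the synthesis noise V = U − W can be measured empirically as below; the helper names are illustrative.

```python
import numpy as np

def synthesis_noise_power(u, w):
    """Empirical E[V^2] for the synthesis noise V = U - W."""
    v = u.astype(np.float64) - w.astype(np.float64)
    return np.mean(v ** 2)

def psnr_from_mse(mse, peak=255.0):
    """PSNR in dB corresponding to a given MSE."""
    return 10.0 * np.log10(peak ** 2 / mse)
```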
To facilitate the analysis, the estimation is decoupled into two steps: an intermediate virtual image Y is rendered from the reconstructed texture images using the acquired depth images, so that the noise due to texture encoding alone is N = U − Y. The additional distortion due to errors in the depth images is Z = Y − W. Note that V = N + Z. If N and Z are uncorrelated and E[N] = 0, then E[NZ] = 0, and

E[V²] = E[N²] + E[Z²].    (4)
Eqn. (4) indicates that the synthesis noise powers due to texture image encoding (E[N²]) and depth image encoding (E[Z²]) can be estimated independently. This simplifies the estimation of each component, and the total noise power can be approximated by summing the two components.
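A quick numerical illustration of the decomposition in Eqn. (4), using synthetic zero-mean, uncorrelated noise fields; this is only a sanity check of the additivity assumption, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.normal(0.0, 2.0, size=(480, 640))   # noise due to texture encoding
z = rng.normal(0.0, 3.0, size=(480, 640))   # noise due to depth encoding
v = n + z                                   # total synthesis noise, V = N + Z

# For uncorrelated, zero-mean components the powers add (Eqn. (4)).
print(np.mean(v ** 2))                      # approx. 4 + 9 = 13
print(np.mean(n ** 2) + np.mean(z ** 2))
```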
In the following, we describe the estimation of the two components of the noise power in Eqn. (4), i.e., texture noise and depth noise, in greater detail.
Estimating the Noise Power Due to Texture Encoding
The noise caused by lossy encoding of the texture images is N = U − Y, where Y is the virtual image rendered from the reconstructed texture images X̂l and X̂r using the acquired depth images. For the left view,

Ul(m,n) = Xl(m′,n), and    (6)

Yl(m,n) = X̂l(m′,n),    (7)

where the same acquired depth image Dl determines the disparity m − m′ in both equations. Therefore,
N(m,n) = α(Xl(m′,n) − X̂l(m′,n)) + (1−α)(Xr(m″,n) − X̂r(m″,n)).    (8)
In Eqn. (6), the pixel in Xl at location (m′,n) is copied to the intermediate image Ul at location (m,n). Likewise, in Eqn. (7), the pixel in X̂l at location (m′,n) is copied to the intermediate image Yl at location (m,n).
Importantly, the pixels in Xl and X̂l that are involved in determining N(m,n) are spatially collocated at (m′,n); the same holds for the right camera. Because we decouple the estimation into two steps, the same acquired depth information is used in both Eqns. (6) and (7) to determine the disparity. The collocation of the pixels involved in determining N(m,n) simplifies the estimation:
E[N²] = α²E[(Xl−X̂l)²] + (1−α)²E[(Xr−X̂r)²] + 2α(1−α)·ρN·σ(Xl−X̂l)·σ(Xr−X̂r),    (9)
where Xl−X̂l and Xr−X̂r are the texture encoding noises for the left and right texture images, σ(·) denotes the standard deviation, and ρN is the correlation coefficient between Xl−X̂l and Xr−X̂r. The correlation coefficient ρN tends to be small, and depends on the quality of the encoding of the texture images.
In particular, if the texture images are encoded at a low quality, then there is considerable structural information remaining in Xl−X̂l and Xr−X̂r, and the two error images are more correlated.
We train a model to estimate the correlation coefficient ρN, parameterized by the average of E[(Xl−X̂l)²] and E[(Xr−X̂r)²]. The same model is used for all video sequences and encoding conditions.
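A direct transcription of Eqn. (9) as a sketch; here ρN is simply an input, standing in for the trained correlation model described above.

```python
import numpy as np

def texture_noise_power(x_l, xhat_l, x_r, xhat_r, alpha, rho_n):
    """Estimate E[N^2] from the left/right texture encoding errors (Eqn. (9))."""
    e_l = x_l.astype(np.float64) - xhat_l.astype(np.float64)
    e_r = x_r.astype(np.float64) - xhat_r.astype(np.float64)
    mse_l, mse_r = np.mean(e_l ** 2), np.mean(e_r ** 2)
    cross = 2.0 * alpha * (1.0 - alpha) * rho_n * np.std(e_l) * np.std(e_r)
    return alpha ** 2 * mse_l + (1.0 - alpha) ** 2 * mse_r + cross
```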
Estimating the Noise Power Due to Depth Encoding
We now describe the noise caused by errors in the depth images. We have
Z(m,n)=Y(m,n)−W(m,n), (10)
Y(m,n)=αYl(m,n)+(1−α)Yr(m,n), and (11)
W(m,n)=αWl(m,n)+(1−α)Wr(m,n). (12)
Substituting Eqns. (11) and (12) into Eqn. (10), and defining Zl = Yl − Wl and Zr = Yr − Wr, we have
Z(m,n)=αZl(m,n)+(1−α)Zr(m,n), and (13)
E[Z²] = α²E[Zl²] + (1−α)²E[Zr²] + 2α(1−α)·ρZ·σ(Zl)·σ(Zr),    (14)

where ρZ is the correlation coefficient between Zl and Zr, and σ(Zl) and σ(Zr) are their standard deviations.
Eqn. (14) indicates that the noise power due to the depth errors can be estimated from the left and right error components Zl and Zr, respectively. To estimate E[Zl²] (and likewise E[Zr²]), note that Wl can be modeled as a version of Yl in which each pixel is displaced horizontally by a position error Δml caused by the depth error, so that

Zl(m,n) = Yl(m,n) − Yl(m+Δml,n).    (16)
For a given position error Δml, Eqn. (16) gives the PSD Φ of Zl as

ΦZl(ω1,ω2) = 2(1 − cos(Δml·ω1))·ΦYl(ω1,ω2),    (17)

where ω1 and ω2 are the horizontal and vertical spatial frequencies obtained with a fast Fourier transform (FFT).
Because the horizontal position error Δml is random, we take the expectation in Eqn. (17) with respect to the probability distribution p(Δml) of Δml:

ΦZl(ω1,ω2) = E[2(1 − cos(Δml·ω1))]·ΦYl(ω1,ω2)    (18)

= 2(1 − Re{P(ω1)})·ΦYl(ω1,ω2),    (19)

where P(ω1) is the Fourier transform of p(Δml), and Re{·} denotes the real part.
Eqn. (19) can be derived by noting that

cos(Δml·ω1) = (e^(jΔml·ω1) + e^(−jΔml·ω1))/2,

so that E[cos(Δml·ω1)] = Re{E[e^(−jΔml·ω1)]} = Re{P(ω1)}.
If we approximate the PSD ΦYl of the intermediate image Yl by the PSD ΦX̂l of the reconstructed texture image X̂l, then

ΦZl(ω1,ω2) ≈ 2(1 − Re{P(ω1)})·ΦX̂l(ω1,ω2).    (20)
Eqn. (20) indicates that the PSD of the error due to lossy encoding of the (left) depth image is the product of the PSD of the texture image and the frequency envelope 2(1 − Re{P(ω1)}), which depends on the distribution p(Δml). The distribution p(Δml) for the left camera depends on the depth error and the camera set-up, and can be obtained from Dl and D̂l by binning Δml; similarly for the right camera. Specifically,
ΔDl(m,n) = Dl(m,n) − D̂l(m,n), and    (21)

Δml(m,n) = kl·ΔDl(m,n),    (22)
where kl is a spatially invariant constant that depends only on the camera configuration (see Eqn. (1)).
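A sketch of Eqns. (21) and (22) and the binning step; the closed form for kl assumes the linear disparity relation of Eqn. (1), and rounding the position errors to integer pixels and clipping them to a maximum shift are implementation choices, not part of the specification.

```python
import numpy as np

def position_error_pmf(d_l, dhat_l, f, b_l, z_near, z_far, max_shift=16):
    """Bin the horizontal position errors Delta m_l into a pmf p(Delta m_l)."""
    k_l = (f * b_l / 255.0) * (1.0 / z_near - 1.0 / z_far)          # slope of Eqn. (1)
    delta_d = d_l.astype(np.float64) - dhat_l.astype(np.float64)    # Eqn. (21)
    delta_m = np.rint(k_l * delta_d).astype(int)                    # Eqn. (22), rounded
    delta_m = np.clip(delta_m, -max_shift, max_shift)
    shifts = np.arange(-max_shift, max_shift + 1)
    counts = np.array([(delta_m == s).sum() for s in shifts], dtype=np.float64)
    return shifts, counts / counts.sum()
```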
We integrate ΦZl(ω1,ω2) over all frequencies (ω1,ω2) to obtain the noise power E[Zl²], and similarly integrate ΦZr to obtain E[Zr²].
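The integration can be sketched as follows: the characteristic function P(ω1) of p(Δml), the frequency envelope 2(1 − Re{P(ω1)}), the texture PSD from the FFT, and their product summed over the frequency grid. The periodogram normalization is an assumption chosen so that the mean of the PSD equals the empirical power of the (mean-removed) image.

```python
import numpy as np

def depth_noise_power_left(xhat_l, shifts, pmf):
    """Estimate E[Z_l^2] by integrating Phi_Zl of Eqn. (20) over all frequencies."""
    rows, cols = xhat_l.shape
    # Periodogram of the reconstructed texture, normalized so that its mean
    # equals the power of the mean-removed image.
    x = xhat_l.astype(np.float64) - xhat_l.mean()
    psd_x = np.abs(np.fft.fft2(x)) ** 2 / (rows * cols)

    # Characteristic function P(omega_1) of p(Delta m_l) on the horizontal
    # FFT grid, and the frequency envelope 2(1 - Re{P(omega_1)}).
    omega1 = 2.0 * np.pi * np.fft.fftfreq(cols)
    p_omega1 = (pmf[None, :] * np.exp(-1j * np.outer(omega1, shifts))).sum(axis=1)
    envelope = 2.0 * (1.0 - p_omega1.real)

    # Eqn. (20): Phi_Zl = envelope * Phi_Xhat_l; the mean over the grid plays
    # the role of the integral over (omega_1, omega_2).
    return float(np.mean(psd_x * envelope[np.newaxis, :]))
```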
Probability Density Function and Frequency Envelope
The frequency envelope 2(1 − Re{P(ω1)}) is determined by the probability density of the position errors: when p(Δml) is concentrated near zero, Re{P(ω1)} remains close to one and the envelope is small, whereas a broader distribution of position errors drives the envelope toward two over a wider range of frequencies and increases the noise power.
Model Summary
We summarize the modeling process, which estimates the noise power in the virtual image from Xl, Xr, X̂l, X̂r, Dl, Dr, D̂l, D̂r analytically.
First, the mean squared errors (MSEs) between the acquired texture images Xl, Xr and the reconstructed texture images X̂l, X̂r are determined and used in Eqn. (9) to determine E[N²]. The FFT of the reconstructed texture image X̂l is used to determine ΦX̂l, and the distribution p(Δml) of the position errors is obtained from Dl and D̂l according to Eqns. (21) and (22). The depth noise power for the left depth image E[Zl²] can then be estimated by integrating ΦZl of Eqn. (20) over all frequencies; E[Zr²] is estimated in the same manner, and E[Z²] is obtained from Eqn. (14).
In addition, the correlation coefficient ρZ between Zl and Zr depends on the variances of the horizontal position errors Δml and Δmr. In particular, the correlation decreases as the variances of the position errors increase. We train a model to estimate the correlation coefficient ρZ, parameterized by the average of the variances of the horizontal position errors Δml and Δmr.
The same model is used for all sequences and conditions. Finally, E[N²] and E[Z²] are summed to estimate the noise power in the virtual image, following Eqn. (4).
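Putting the pieces together, a sketch of the overall estimate following Eqn. (4); texture_noise_power, position_error_pmf and depth_noise_power_left refer to the illustrative helpers above, CameraSetup is an assumed container for the camera parameters, and ρN, ρZ stand in for the trained correlation models.

```python
from dataclasses import dataclass

@dataclass
class CameraSetup:
    f: float        # focal length
    b_l: float      # baseline to the left reference camera
    b_r: float      # baseline to the right reference camera
    z_near: float   # nearest depth
    z_far: float    # farthest depth

def estimate_virtual_view_noise(x_l, x_r, xhat_l, xhat_r,
                                d_l, d_r, dhat_l, dhat_r,
                                alpha, cam, rho_n, rho_z):
    """Estimate E[V^2] = E[N^2] + E[Z^2] for the virtual image (Eqn. (4))."""
    e_n = texture_noise_power(x_l, xhat_l, x_r, xhat_r, alpha, rho_n)      # Eqn. (9)

    s_l, p_l = position_error_pmf(d_l, dhat_l, cam.f, cam.b_l, cam.z_near, cam.z_far)
    s_r, p_r = position_error_pmf(d_r, dhat_r, cam.f, cam.b_r, cam.z_near, cam.z_far)
    e_zl = depth_noise_power_left(xhat_l, s_l, p_l)                        # Eqn. (20), left
    e_zr = depth_noise_power_left(xhat_r, s_r, p_r)                        # same form, right
    # Eqn. (14), treating Z_l and Z_r as approximately zero mean.
    e_z = (alpha ** 2 * e_zl + (1.0 - alpha) ** 2 * e_zr
           + 2.0 * alpha * (1.0 - alpha) * rho_z * (e_zl * e_zr) ** 0.5)

    return e_n + e_z                                                       # Eqn. (4)
```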
Note that approximations of the PSDs ΦX̂l and ΦX̂r can be used instead of determining them exactly with the FFT of the reconstructed texture images.
The depth errors can also be modeled as Gaussian or Laplacian distributed random variables with variances E[(Dl−D̂l)²] and E[(Dr−D̂r)²], and p(Δml) and p(Δmr) can then be derived according to Eqn. (22).
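For example, under a zero-mean Gaussian model for Δml with standard deviation σ = kl·σ(ΔDl), the characteristic function is real, P(ω1) = exp(−σ²ω1²/2), so the frequency envelope of Eqn. (20) has the closed form sketched below.

```python
import numpy as np

def gaussian_envelope(omega1, sigma):
    """Frequency envelope 2(1 - Re{P(omega_1)}) for Gaussian position errors."""
    return 2.0 * (1.0 - np.exp(-0.5 * (sigma * omega1) ** 2))
```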
Although the model may require a constant adjustment to be accurate, the constant adjustment is the same for all encoding conditions, but different for different sequences. We believe that this is because the distribution of the depth errors is not entirely random in transform encoding. In particular, the errors tend to occur more frequently along edges in the depth images. When the depth edge errors coincide with strong texture edges, the resulting rendering errors can introduce a bias in the overall synthesis noise. Such a bias tends to be sequence specific, as it depends on how often depth edges are collocated with strong texture edges.
However, our model is accurate enough to evaluate a change in quality under different encoding conditions and situations, which is sufficient for many practical applications.
Quality Estimation Method
It is assumed that each image includes (stereoscopic) left and right images, and the processes shown and described operate similarly on the left and right images.
The steps can be performed in a processor connected to memory and input/output interfaces as known in the art. In a typical application, the processor can be an encoder and/or decoder (codec), so that the quality of the virtual image can be evaluated during the encoding and decoding processes.
The embodiments of the invention provide an analytical model to estimate a rendering quality in a 3D video. The model relates errors in depth images to the rendering quality, taking into account texture image characteristics, texture image quality, camera configuration, and the rendering process.
The estimation of the power of the synthesis noise is decoupled into two steps. One step focuses on the error due to texture encoding, and the other step focuses on the error due to depth encoding.
According to the embodiments, the PSD of the rendering errors due to the depth encoding is the product of the PSD of texture data and a frequency envelope depending on the probability distribution of position errors. The model can accurately estimate the synthesis noise up to a constant offset. Thus, the model can be used to predict a change in rendering quality for different rendering methods and systems.
In contrast with the prior art, the PSD is used here to estimate a mean squared error (MSE).
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
Other Publications

Ha Thai Nguyen et al., “Error Analysis for Image-Based Rendering with Depth Information,” IEEE Transactions on Image Processing, vol. 18, no. 4, pp. 703-716, Apr. 2009.

Woo-Shik Kim et al., “Depth Map Coding with Distortion Estimation of Rendered View,” Signal and Image Processing Institute, University of Southern California, Los Angeles, CA; Thomson Corporate Research, Princeton, NJ, Jan. 18, 2010.