This invention relates generally to efficient representations of depth videos, and more particularly, to coding depth videos accurately for the purpose of synthesizing virtual images for novel views.
Three-dimensional (3D) video applications, such as 3D-TV and free-viewpoint TV (FTV), require depth information to generate virtual images. Virtual images can be used for free-viewpoint navigation of a scene, or for various other display processing purposes.
One problem in synthesizing virtual images is errors in the depth information. This is a particular problem around edges, and it can cause annoying artifacts in the synthesized images, see Merkle et al., "The Effect of Depth Compression on Multiview Rendering Quality," 3DTV Conference: The True Vision—Capture, Transmission and Display of 3D Video, 28-30 May 2008, pp. 245-248.
The embodiments of this invention provide a multi-layered coding scheme for depth images and videos. The method guarantees that the maximum error for each reconstructed pixel is not greater than an error limit. The maximum error can vary with each coding layer to enable a successive refinement of pixel values in the image. Within each coding layer, the error limit can also be adapted to account for local image characteristics such as edges that correspond to depth discontinuities.
Virtual View Synthesis
Our virtual image synthesis uses camera parameters and depth information for a scene to determine texture values for pixels in a synthesized image from pixels in images of adjacent views (adjacent images).
Typically, two adjacent images are used to synthesize a virtual image for an arbitrary viewpoint between the adjacent images.
Every pixel in the two adjacent images is projected to a corresponding pixel in a plane of the virtual image. We use a pinhole camera model to project the pixel at location (x, y) in the adjacent image c into world coordinates [u, v, w] using
[u, v, w]T=Rc·Ac−1·[x, y, 1]T·d[c, x, y]+Tc, (1)
where d is the depth with respect to the optical center of the camera for image c, A, R and T are the camera parameters (the intrinsic matrix, rotation, and translation, respectively), and the superscript T denotes the transpose operator.
We map the world coordinates to target coordinates [x′, y′, z′] of the virtual image, according to:
Xv=[x′, y′, z′]T=Av·Rv−1·{[u, v, w]T−Tv}. (2)
After normalizing by z′, a pixel in the virtual image is obtained as [x′/z′, y′/z′] corresponding to the pixel [x, y] in the adjacent image.
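For illustration only, the following Python sketch (using NumPy) applies Equations (1) and (2) to warp a single pixel from an adjacent image into the virtual image, followed by the normalization by z′. The camera parameters in the example at the end are hypothetical placeholders (identity intrinsics and rotations, a virtual camera displaced along the x axis), not values from the invention.

import numpy as np

def warp_pixel(x, y, d, A_c, R_c, T_c, A_v, R_v, T_v):
    # Equation (1): back-project pixel (x, y) with depth d into world coordinates.
    p = np.array([x, y, 1.0])
    world = R_c @ np.linalg.inv(A_c) @ p * d + T_c
    # Equation (2): map the world point into the virtual camera.
    Xv = A_v @ np.linalg.inv(R_v) @ (world - T_v)
    # Normalize by z' to obtain the pixel location in the virtual image.
    return Xv[0] / Xv[2], Xv[1] / Xv[2]

# Example with hypothetical camera parameters.
A = np.eye(3)
x_v, y_v = warp_pixel(100, 50, 2.0,
                      A, np.eye(3), np.zeros(3),
                      A, np.eye(3), np.array([5.0, 0.0, 0.0]))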
For texture mapping, we copy the depth and the corresponding texture I[x, y] from the current adjacent image c into the corresponding location [x′/z′, y′/z′] in the virtual image depth and texture buffers. Depth and texture buffers are maintained for each adjacent image to generate the synthesized image.
Due to quantization of the projected location in the virtual buffers, the values for some pixels in the virtual image buffers are missing or undefined. To render the virtual image, we scan through each location in the two virtual image depth buffers and apply the following procedure.
If both depths are zero, then there is no texture information. This causes a hole in the synthesized image.
If one depth is non-zero, then use the texture value corresponding to the non-zero depth.
If both depths are non-zero, then we take a weighted sum of the corresponding texture values. To improve the quality of the final rendered image, filtering and in-painting can be applied. We prefer a 3×3 median filter to recover undefined areas in the synthesized image.
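The per-pixel merging rule described above can be sketched as follows in Python. The sketch assumes the two warped depth buffers (depth1, depth2) and texture buffers (tex1, tex2) are NumPy arrays of the same size, uses equal blending weights by default, and uses SciPy's 3x3 median filter for the undefined areas; the actual weights and in-painting in a given implementation may differ.

import numpy as np
from scipy.ndimage import median_filter

def render_virtual(depth1, tex1, depth2, tex2, w1=0.5, w2=0.5):
    out = np.zeros_like(tex1, dtype=np.float64)
    both = (depth1 > 0) & (depth2 > 0)
    only1 = (depth1 > 0) & (depth2 == 0)
    only2 = (depth2 > 0) & (depth1 == 0)
    out[both] = w1 * tex1[both] + w2 * tex2[both]   # both depths non-zero: weighted sum
    out[only1] = tex1[only1]                        # only the first view has texture
    out[only2] = tex2[only2]                        # only the second view has texture
    # Pixels where both depths are zero remain holes; a 3x3 median filter
    # recovers small undefined areas in the synthesized image.
    return median_filter(out, size=3)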
A direct transformation from a current camera to a virtual camera can be obtained by combining Equations (1) and (2):
Xv=[x′, y′, z′]T=M1·d·Xc+M2, (3)
where M1=Av·Rv−1·Rc·Ac−1, and M2=Av·Rv−1·{Tc−Tv}.
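Because M1 and M2 depend only on the camera parameters, they can be computed once and reused for every pixel. A minimal sketch of this precomputation and of Equation (3), with the camera parameter arguments again standing for hypothetical NumPy matrices:

import numpy as np

def direct_transform(A_c, R_c, T_c, A_v, R_v, T_v):
    # Precompute M1 and M2 of Equation (3).
    M1 = A_v @ np.linalg.inv(R_v) @ R_c @ np.linalg.inv(A_c)
    M2 = A_v @ np.linalg.inv(R_v) @ (T_c - T_v)
    return M1, M2

def warp_pixel_direct(x, y, d, M1, M2):
    # Equation (3): Xv = M1 * d * Xc + M2, followed by normalization by z'.
    Xv = M1 @ (d * np.array([x, y, 1.0])) + M2
    return Xv[0] / Xv[2], Xv[1] / Xv[2]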
Analysis of Depth Error for Virtual View Synthesis
If there is a depth-coding error Δd, then the corresponding error in the location in the virtual camera ΔXv is
ΔXv=M1·Xc·Δd. (4)
Both Xv and Xv+ΔXv are normalized to determine the corresponding coordinates in the virtual camera. After the normalization, the texture-mapping error is the difference between the two normalized pixel locations, as expressed in Equation (5).
Using conventional coding schemes, larger depth coding errors can occur along object boundaries. The texture-mapping errors are also larger around the same boundaries.
Equation (5) indicates that the texture-mapping error depends on the depth coding error and other parameters, such as camera configurations and the coordinate of the point to be mapped.
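The effect of a given depth error on the mapped pixel position can be gauged numerically from Equations (3) and (4), as in the sketch below. The matrices M1 and M2 are assumed to come from Equation (3), and the depth error delta_d is a hypothetical value chosen for illustration.

import numpy as np

def mapping_error(x, y, d, delta_d, M1, M2):
    # Map the pixel with the true depth d and with the erroneous depth d + delta_d,
    # then compare the normalized positions; the difference is the texture-mapping error.
    Xc = np.array([x, y, 1.0])
    Xv = M1 @ (d * Xc) + M2
    Xv_err = M1 @ ((d + delta_d) * Xc) + M2   # equals Xv + M1 * Xc * delta_d, per Equation (4)
    return (Xv_err[0] / Xv_err[2] - Xv[0] / Xv[2],
            Xv_err[1] / Xv_err[2] - Xv[1] / Xv[2])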
If the camera parameters and depth information are sufficiently accurate, then a strict control on the depth is beneficial because the depth represents the geometrical distance in the scene. This is especially true near depth edges, which typically determine the boundary of an object.
In a multi-view video, a depth image is estimated for each view. A pixel in the depth image represents a distance to a 3D point in the scene. The distance must be accurate because a quality of virtual image synthesis is highly dependent on the depth. Therefore, it is crucial to balance the quality of the depth image and an associated bandwidth requirement.
Computer System and Method Overview
Therefore, the embodiments of the invention provide a multi-layered coding scheme for depth images and videos. The method guarantees that the maximum error for each reconstructed pixel is limited. The maximum error varies with each coding layer allowing a successive refinement of pixel values in the image. Within each coding layer, the error limit can also be adapted to account for local image characteristics such as edges that correspond to depth discontinuities.
System Overview
For encoding, an input depth image (or video) I 101 is encoded as a base layer bitstream L0 102 and a set of one or more enhancement layer bitstreams L1-Ln 103. The enhancement layer bitstreams are arranged in a low-to-high order. The number of enhancement layer bitstreams depends on the bandwidth requirement for transmitting the depth image bitstream. For example, a low bandwidth can only support a small number of enhancement layer bitstreams. As the bandwidth increases, so can the number of enhancement layer bitstreams.
The base layer L0 can be encoded with a lossy encoder 110. For images, this can be a conventional encoding scheme, such as JPEG or JPEG 2000, which exploits spatial redundancy. For videos, the lossy encoder can be any conventional video encoding scheme, such as MPEG-2 or H.264/AVC, which employs motion-compensated prediction to exploit temporal redundancy.
Then, a difference between the input and the base layer reconstructed image is obtained and provided as input to the first-level L-∞ layer bitstream encoder to produce a first layer bitstream. Then, a difference between the input and the first layer reconstructed image, i.e., the sum of the base layer reconstructed image and the first-layer residual reconstruction, is obtained and provided as input to the second-level L-∞ layer bitstream encoder 111 to produce a second layer bitstream. This process continues for N layers until the N-th layer bitstream is produced.
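The layered structure described above can be summarized by the following sketch. The function names lossy_encode, lossy_decode and linf_encode_layer are placeholders for any conventional base layer codec and for an L-∞ enhancement layer encoder with error limit delta[i]; they are not actual encoder interfaces.

def encode_multilayer(img, n_layers, delta,
                      lossy_encode, lossy_decode, linf_encode_layer):
    # Produce a base layer bitstream and n_layers enhancement layer bitstreams;
    # each layer encodes the residual of the previous reconstruction subject to delta[i].
    base_bits = lossy_encode(img)             # base layer L0
    rec = lossy_decode(base_bits)             # base layer reconstruction
    bitstreams = [base_bits]
    for i in range(1, n_layers + 1):
        residual = img - rec                  # i-th layer residual
        layer_bits, layer_rec = linf_encode_layer(residual, delta[i])
        rec = rec + layer_rec                 # i-th layer reconstruction
        bitstreams.append(layer_bits)
    return bitstreams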
The multi-layer decoding process inverts the encoding operation: the base layer bitstream is decoded first, and each enhancement layer bitstream is then decoded in turn, with each decoded layer refining the previous reconstruction.
The number of layers in the set of enhancement layer bitstreams is usually fixed for a given video or application, i.e., it does not vary over time. However, it can vary with the available bandwidth as described above. A larger number of layer bitstreams provides greater flexibility in scaling the rate for coding the depth, while ensuring a minimum level of quality for the pixels in the depth image. Fewer layers are desirable to minimize the overhead that is typical of most scalable coding schemes. Our studies indicate that 2-3 layers are suitable for depth image coding.
This invention describes several embodiments of the method, which vary in the way that enhancement layer bitstream encoding and decoding is performed.
Enhancement Layer Bitstream with Inferred Side Information
Embodiments of an enhancement layer bitstream encoder 210 and decoder 202 are described below.
For the reconstruction 205, the encoder determines 210 a significance value for each pixel in the i-th layer residual 211, which is the difference between the input image and the (i−1)-th reconstructed image, based on an uncertainty interval. The uncertainty interval defines an upper and lower bound for the current pixel value in order to limit errors.
A residual value is significant if it falls outside the uncertainty interval. The uncertainty interval 220 indicates a maximum allowable error for the pixel to be decoded, which can vary for the different layers 221, as specified by a layer identifier. The error limits 222 can also vary for different parts of the image. For example, edge pixels can have a lower error limit than non-edge pixels.
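The per-pixel significance test can be expressed as in the sketch below, where residual and edge_map are NumPy arrays and delta_edge and delta_flat are the layer-specific error limits for edge and non-edge pixels; the specific limit values shown in the comment are illustrative only.

import numpy as np

def significance_map(residual, edge_map, delta_edge, delta_flat):
    # A residual is significant when it falls outside the uncertainty interval
    # [-limit, +limit]; edge pixels use a tighter (lower) limit.
    limit = np.where(edge_map, delta_edge, delta_flat)
    return np.abs(residual) > limit

# Illustrative limits for one enhancement layer: +/-1 at edges, +/-4 elsewhere.
# sig = significance_map(residual, edge_map, 1, 4)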
An edge map 223 is used to determine the uncertainty interval for each pixel of the image in the current layer. In this particular embodiment of the invention, the edge map is inferred only from reconstructed data available at the decoder in the form of a context model. In this way, no additional side information is needed by the decoder to determine the uncertainty interval. The reconstructed data that can be used includes the (i−1)-th layer reconstructed image and i-th layer residual.
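Because the edge map is inferred only from data already available at the decoder, both sides can form the same map without extra bits. As an assumed illustration only (the invention describes the inference in the form of a context model), one simple way to derive such a map is to threshold the local gradient of the (i−1)-th reconstruction; the threshold value here is arbitrary.

import numpy as np

def infer_edge_map(prev_rec, threshold=8):
    # Infer depth edges from the (i-1)-th layer reconstruction only, so the
    # decoder can reproduce the same edge map without side information.
    gy, gx = np.gradient(prev_rec.astype(np.float64))
    return np.hypot(gx, gy) > threshold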
In order to guarantee every pixel in the reconstruction is within the uncertainty interval, a new reconstruction pixel value within the uncertainty interval is assigned to the significant pixel. In “A Wavelet-Based Two-Stage Near-Lossless Coder with L-inf-Error Scalability,” SPIE Conference on Visual Communications and Image Processing, 2006, Yea and Pearlman describe means for assigning the new reconstruction pixel value for a significant pixel. An alternative reconstruction process that is capable of more efficient coding is described below.
The process of assigning a new reconstruction value requires the coding of a sign bit in addition to the significance bit. Depending on the sign bit, a certain value is added to or subtracted from the current pixel value. Hence, for a significant pixel, both the significance bit (value=1) and the sign bit are entropy encoded 230.
For a non-significant pixel, there is no need to assign a new reconstruction value as the value already lies within the uncertainty interval. Therefore, only the significance bit (value=0) needs to be entropy encoded for a non-significant pixel.
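For a single pixel, the bits produced in a layer and the corresponding reconstruction adjustment can be sketched as follows. The refinement step size passed as 'step' is an assumption for illustration; Yea and Pearlman describe one assignment rule, and an alternative reconstruction process is described later in this document.

def code_pixel(residual, limit, step):
    # Return the bits to entropy-code for one pixel and the adjustment applied
    # to its reconstruction; 'step' is an illustrative refinement step.
    if residual > limit:           # significant, positive sign
        return (1, 0), +step
    if residual < -limit:          # significant, negative sign
        return (1, 1), -step
    return (0,), 0                 # non-significant: significance bit only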
In order to efficiently compress the significance and sign bits, the context model 240 is maintained during entropy encoding to produce the i-th layer bitstream. The use of a context model converts the entropy encoding process into a conditional entropy coding process, which reduces the output coding rate by utilizing the statistics of the data being coded.
In this embodiment, the context model is maintained based on the statistics of significance bits in a given coding layer. In a preferred embodiment, the statistics of causal neighbors of a current pixel, i.e., data associated with neighboring pixels that have already been encoded or decoded, are considered. The context model also considers whether the current pixel is an edge pixel or a non-edge pixel.
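One possible context index, shown only as an assumed illustration of such conditioning, combines the significance bits of already-coded causal neighbors with the edge flag of the current pixel; sig and edge_map are assumed to be 2D arrays indexed by pixel position.

def context_index(sig, edge_map, i, j):
    # Form a conditioning context from causal neighbors (left, above, above-left)
    # that the decoder has already reconstructed, plus the current edge flag.
    left = sig[i, j - 1] if j > 0 else 0
    above = sig[i - 1, j] if i > 0 else 0
    above_left = sig[i - 1, j - 1] if i > 0 and j > 0 else 0
    neighbor_count = int(left) + int(above) + int(above_left)   # 0..3
    return 2 * neighbor_count + int(edge_map[i, j])             # context 0..7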
Enhancement Layer Bitstream with Explicit Side Information
Another embodiment of the enhancement layer bitstream encoder 301 and decoder 302 transmits the side information explicitly. In this embodiment, the side information used to determine the uncertainty interval for each pixel, such as the edge map, is encoded and signaled to the decoder rather than being inferred from previously reconstructed data.
Another embodiment of the enhancement layer bitstream encoder 401 and decoder 402 likewise uses explicitly signaled side information.
Coding Procedure
In the following, methods for determining significance and performing reconstruction are described.
The input image to the n-th layer bitstream encoder is img(i, j), and the output reconstructed by the n-th layer bitstream decoder at pixel (i, j) is rec(i, j).
A difference image is
diff(i, j)=img(i, j)−rec(i, j).
The reconstruction rec(i,j) is initially set to zero for every pixel (i, j).
A region of 2^Lv by 2^Lv pixels in img(·, ·) is QT(i, j, Lv), with its upper-left corner at coordinate (i·2^Lv, j·2^Lv). We call this a quadtree at (i, j) at level Lv. Assume the input image to the n-th layer bitstream encoder is partitioned into a succession of non-overlapping quadtrees at level Lv following the raster-scan order, i.e., left-to-right and top-to-bottom.
A List of Insignificant Sets (LIS) initially contains every QT(i, j, Lv) as an element. The maximum uncertainty for the n-th layer bitstream encoder is δ(n). The quadtree is said to be significant against the uncertainty level δ(n) when max{|diff(x, y)|} > δ(n), where (x, y) ranges over the pixels in QT(i, j, Lv), and max is a function that returns the maximum value.
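A sketch of the block extraction and significance test defined above, assuming the difference image diff is a NumPy array:

import numpy as np

def quadtree_block(diff, i, j, lv):
    # The 2**lv x 2**lv block of the difference image whose upper-left
    # corner is at (i * 2**lv, j * 2**lv).
    s = 2 ** lv
    return diff[i * s:(i + 1) * s, j * s:(j + 1) * s]

def is_significant(diff, i, j, lv, delta_n):
    # QT(i, j, Lv) is significant when the maximum absolute difference
    # inside the block exceeds the uncertainty level delta(n).
    return np.max(np.abs(quadtree_block(diff, i, j, lv))) > delta_n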
The n-th layer bitstream encoding is performed in two phases, a first Significance-Phase and a second Refinement-Phase.
Significance-Phase
The Significance-Phase operates on the difference image diff(i, j) = img(i, j) − rec(i, j): each quadtree in the LIS is tested for significance against δ(n), and the resulting significance and sign bits are entropy encoded as described above.
Refinement-Phase
The Refinement-Phase refines the pixels in the lists of significant pixels LSP(k) (k = 1, 2, ..., n) until the maximum uncertainty becomes less than or equal to the uncertainty level δ(n).
The Refinement-Phase operates as follows. The maximum uncertainty interval (Gap) of a pixel in LSP(k) is found as Gap = min{⌊(δ(k−1) − δ(k))/2⌋, δ(n−1)}. The difference diff(i, j) = img(i, j) − rec(i, j) is then refined until it lies within the uncertainty level δ(n).
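The principle of the refinement can be sketched generically as an interval bisection, in which each coded bit tells the decoder which half of the current uncertainty interval contains the true value. This sketch assumes integer pixel values and δ(n) ≥ 1 (near-lossless operation); it illustrates the idea of successively shrinking the uncertainty and is not the exact Gap computation given above.

def refine_pixel(value, rec, uncertainty, delta_n):
    # Each bit roughly halves the half-width 'uncertainty' of the interval
    # around rec that is known to contain 'value', until it is <= delta_n.
    bits = []
    while uncertainty > delta_n:
        uncertainty = (uncertainty + 1) // 2   # new half-width after one bit
        if value >= rec:
            bits.append(1)
            rec += uncertainty                 # move reconstruction toward the value
        else:
            bits.append(0)
            rec -= uncertainty
    return bits, rec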
Multi-Resolution Depth Coding
The multi-layered coding scheme can also be applied in a multi-resolution manner, in which the input depth image I 101 is coded at more than one spatial resolution.
The embodiments of the invention provide a multi-layered coding method for depth images that complements edge-aware techniques, such as those based on piecewise-linear functions (platelets). The method guarantees a near-lossless bound on the depths near edges by adding extra enhancement layer bitstreams to improve the visual quality of synthesized images. The method can incorporate any lossy coder for the base layer bitstream and can be extended to videos. This is a notable advantage over platelet-like techniques, which are not applicable to videos.
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
This Application is related to U.S. application Ser. No. 12/405,864, “Depth Reconstruction Filter for Depth Coding Videos,” filed by Yea et al., on Mar. 17, 2009.