This invention relates generally to image processing and compression, and more particularly to up-sampling and reconstruction filters applied to depth images.
Depth Images
Depth images represent distances from a camera to a 3D scene. Efficient encoding of depth images is important for 3D video and free viewpoint television (FTV). FTV enables a user to interactively control the view and generate new virtual images of a dynamic scene from arbitrary viewpoints.
Most conventional image-based rendering (IBR) methods use the depth images, in combination with stereo or multi-view videos, to enable 3D and FTV. The multiview video coding (MVC) extension of the H.264/AVC standard supports inter-view prediction for improved coding efficiency of the multi-view images and videos. However, MVC does not specify any particular encoding for the depth images.
There is prior art that describes formats comprising multi-view images and videos with corresponding depth images. The compression of these formats could be achieved with future extensions to AVC and HEVC (High Efficiency Video Coding), an emerging standard for the next generation of video compression. In such a framework, the texture and depth can be compressed jointly. A scene is acquired with multiple cameras, and for each view, the corresponding depth image is obtained. With the use of multiple views, the depths, and the scene geometry, a higher quality can be obtained for a synthesized virtual view, generated with depth-image based rendering (DIBR) procedures.
There is a substantial redundancy between the texture images and the corresponding depth images, because both the texture and depth images depict the same objects in the 3D scene. Nevertheless, depth images usually have less entropy than texture images. Texture and depth image redundancies can also be determined between views.
Unlike conventional images, depth images are spatially monotonous except at depth discontinuities. Thus, decoding errors tend to be concentrated near depth discontinuities, and failure to preserve the depth discontinuities significantly compromises the quality of virtual images.
Encoding a reduced resolution depth image can reduce the bit rate substantially, but the loss of resolution also degrades the quality of the depth images, especially in high frequency regions, such as at depth discontinuities. Artifacts in the virtual images are visually annoying. Conventional down/up samplers either use a low-pass filter or an interpolation filter to minimize the quality degradation. That is, the conventional filters combine the depths of several pixels covered by the filter in some way for each filtered pixel. That filtering “smears” or blurs depth discontinuities because the filtering depends on multiple depths.
Prior art methods have been developed to overcome the limitations of conventional down/up-sampling techniques by explicitly attempting to maintain edge quality, see for example U.S. patent application Ser. No. 12/405,884, “Method for Up-Sampling Depth Images,” filed by Yea, et al., on Mar. 17, 2009. Such methods rely only on the down-sampled depth image data itself to recover the high resolution depth image.
Depth images can be obtained by range cameras. The images obtained from range cameras can have a lower resolution than the corresponding texture images, and an up-sampling procedure is necessary for the synthesis of virtual views from the scene geometry.
Because the depth video and image rendering results are sensitive to variations in space and time, especially at depth discontinuities, the conventional depth reconstruction methods are insufficient, especially for virtual image synthesis.
The embodiments of the invention provide a method for interpolating and filtering a low resolution depth image to construct a high resolution depth image using information associated with depth discontinuities, e.g., edges. Each depth image includes an array of pixels at locations (x, y), and each pixel has an associated depth.
First, the low resolution depth image is up-sampled. Missing depths are interpolated by duplicating nearest-neighboring depths.
Next, a moving window is applied to the pixels in the up-sampled depth image. The window covers a set of pixels centered at each pixel.
The pixels covered by each window are selected according to their position relative to a depth discontinuity, and only pixels that are on the same side of the discontinuity as the center pixel are used for the filtering. The discontinuity information can be from the corresponding texture image, explicitly sent from an encoder, implicitly obtained through analysis of the low resolution depth image, or from a high resolution side view depth image, after warping.
A single representative depth from the set of selected pixels in the window is assigned to the center pixel to generate the high resolution depth image.
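The following is a minimal sketch of this selection and assignment, assuming numpy arrays and a per-pixel label map that encodes on which side of a depth discontinuity each pixel lies; the function name, the 7×7 window implied by half=3, and the use of the median as the representative depth are illustrative assumptions, not a definitive implementation.

import numpy as np

def edge_aware_depth_filter(depth, labels, half=3):
    # depth: up-sampled depth image (H x W); labels: per-pixel region labels
    # identifying on which side of a depth discontinuity each pixel lies.
    # Only pixels sharing the center pixel's label contribute to the filter.
    H, W = depth.shape
    out = depth.copy()
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - half), min(H, y + half + 1)
            x0, x1 = max(0, x - half), min(W, x + half + 1)
            win_d = depth[y0:y1, x0:x1]
            win_l = labels[y0:y1, x0:x1]
            same_side = win_d[win_l == labels[y, x]]
            # assign a single representative depth (here, the median)
            out[y, x] = np.median(same_side)
    return out

Because only depths on the same side of the discontinuity are combined, the representative value does not blend depths across the edge, which is the property the conventional low-pass and interpolation filters lack.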
As shown in
The embodiments of the invention concentrate on filtering of the depth images and generating high resolution depth images from the low resolution depth images and depth discontinuity information, e.g., edges, extracted from the texture images.
Alternatively, the edge information can be obtained from other sources, e.g., by using warped depth images from other views, such as a high resolution side view depth image, after warping, or by explicitly sending the edge information from an encoder. The high resolution depth images can be used for virtual image synthesis for either display purpose or view synthesis prediction.
In
The decoder outputs reconstructed texture images 105 and reconstructed depth images 104, which are used as input to a view synthesis module 113 to produce a synthesized virtual texture image 106.
Four embodiments are described below.
For some embodiments, the depth images can have a resolution lower than the resolution of the texture image. One embodiment down-samples the input depth image before encoding to improve encoding efficiency.
The input includes one or more texture images 201, and corresponding depth images 202. The texture images 201 are encoded 210, passed through a channel 213 and decoded 215.
Before the depth encoding 212, the high resolution depth image 202 is down-sampled 211 to reduce the resolution of the depth image. The input depth image can already be a low resolution depth image. Nevertheless, the depth image still needs to be up-sampled for view synthesis.
The low resolution depth image is coded 212 and passes through the channel 213 to a depth decoder 214. Because the decoded depth image 204 has a lower resolution, an up-sampling and reconstruction filter 217 is applied.
In this embodiment, besides the decoded low resolution depth image, the up-sampling and reconstruction filter 217 uses edge information (generally, depth discontinuities), which is extracted 216 from the decoded texture image 203 and the decoded low resolution depth image 204. The details of the edge information extraction 216 are described below.
The reconstructed depth images 205 and texture images 203 can then be used for virtual image synthesis 113, as known in the art.
In both embodiments 1 and 2, the reconstruction filtering is performed after the decoding.
As shown in
A modified H.264/AVC codec includes one encoder and decoder pair for the multi-view texture and another for the multi-view depth. The depth encoder and decoder use a depth up-sampling reconstruction filter according to embodiments of our invention and described herein.
Input to the encoder includes the multi-view texture input video and the corresponding sequence of multi-view depth images. Output includes encoded bitstreams. For each frame of the input video of a selected view, there is a corresponding depth image.
Input to the decoder includes the multi-view texture bitstream and the corresponding multi-view depth bitstreams. Output includes the decoded multi-view texture in full resolution and the depth images in low resolution, as well as the reconstructed multi-view depth in high resolution. For each frame of the decoded video of a selected view, there is a corresponding depth image.
The current texture image of a base view (or equivalently, the current low resolution depth image of a base view), which is the first view to be encoded, is predicted either by motion estimation (ME) followed by motion compensation prediction (MCP), or by intra-prediction according to a selector. A difference between the current texture (or depth image) and the predicted texture (or depth image) is transformed, quantized, and entropy encoded to produce a bitstream. For the depth images, the input is assumed here to be already at low resolution; otherwise, a pre-processing block for depth down-sampling is necessary.
The output of the quantizer is inverse quantized and inverse transformed. The inverse transform is followed by a deblocking filter, producing the reconstructed texture (or depth image) in low resolution, which is stored in a frame buffer structure to be used as a reference image by subsequent frames of the input texture (or depth) video.
For virtual view synthesis, the full resolution texture and depth images are necessary to perform the warping operation of texture from the base view to the target view. The up-sampling reconstruction filter produces the reconstructed depth image in high resolution, and can be realized outside the decoding loop.
For the coding of the subsequent views, a similar process is realized, except that texture from the base view (or any other already encoded view) can be added to the frame buffer structure to perform inter-view prediction. If a side view is used as a reference, the motion vectors act as disparity vectors between views, and this disparity-compensated frame can be selected as a prediction for encoding the auxiliary view.
As shown in
In the coding depicted in
The high resolution texture image of an auxiliary view can be predicted either by motion compensation prediction (MCP), by intra-prediction, or from a warped frame using view synthesis prediction (VSP), according to a selector. To implement the view synthesis prediction, the full resolution depth image is used, and the up-sampling and reconstruction filter 227 is placed in-loop.
Assuming the in-loop structure described above, in this embodiment, the edge information of the high resolution depth images from a side view, which is already encoded, is warped and used by the up-sampling and reconstruction filter.
With this embodiment, no explicit transmission of edge information for the current view or edge detection is necessary. The edge information from the side view can be warped by using DIBR techniques.
In an alternative implementation, the depth image of a side view can be warped to the target position using DIBR techniques, and the edges are then detected from the warped depth image. The edge information obtained in either way is then utilized in the depth up-sampling and reconstruction.
Down/Up Sampling
Above, we described embodiments that use depth up-sampling and reconstruction filtering based on edge information.
Now, we describe known techniques that can be used for depth down-sampling and up-sampling according to embodiments of the invention.
For down-sampling a 2D image, a representative depth among the depths of the pixels in a window is selected. We select the median depth
img_down(x, y)=median[img((x−1)·d+1:x·d, (y−1)·d+1:y·d)],
where d represents the down-sampling factor, and img((x−1)·d+1:x·d, (y−1)·d+1:y·d) denotes the d×d block of pixels in the full resolution image that maps to location (x, y) in the down-sampled image.
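A possible realization of this median down-sampling, assuming a numpy array whose dimensions are cropped to a multiple of the down-sampling factor d (boundary handling is omitted for brevity, and the function name is ours):

import numpy as np

def downsample_depth(img, d):
    # select the median depth of each non-overlapping d x d window
    H, W = (img.shape[0] // d) * d, (img.shape[1] // d) * d   # crop to a multiple of d
    blocks = img[:H, :W].reshape(H // d, d, W // d, d)
    return np.median(blocks, axis=(1, 3))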
For up-sampling a 2D image, pixels at the dropped positions are interpolated. A straightforward technique for pixel interpolation is simply repeating the nearest neighboring pixel. However, other techniques may also be used, such as linear or bicubic interpolation. Notice that such techniques can introduce artifacts in the reconstructed image.
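For example, nearest-neighbor interpolation by pixel repetition can be sketched as follows (an illustrative numpy helper, not part of the invention):

import numpy as np

def upsample_nearest(img_down, d):
    # repeat each depth d times horizontally and vertically (nearest neighbor)
    return np.repeat(np.repeat(img_down, d, axis=0), d, axis=1)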
Edge-Aware Filtering
Edge-aware filtering assists the up-sampling and reconstruction of depths at a higher resolution, which can be used in the four example embodiments described above.
Our filtering selects a single representative depth within a sliding window to recover missing or distorted depths, considering the edge information provided either indirectly from the corresponding texture, or from a warped view, or even explicitly sent by the encoder.
The low resolution depth image is interpolated with nearest neighboring values 716, and the image is processed in overlapping blocks of size 6×6, where only the values of the middle 2×2 block are modified.
For each 6×6 block, if there is at least one pixel marked for post-filtering 711, then edge-aware region-based median filtering is performed; otherwise the block is copied to the output. The filtering procedure includes color-based edge magnitude estimation 715 using the texture 702, followed by a watershed segmentation procedure 712.
The regions generated by the segmentation procedure are merged 713 into two disjoint regions. For each region, the median depth of the region substitutes the depth values of the region, generating a constant-valued region, and the center values are filtered by the region-based median filter 714, resulting in the high resolution filtered depth image 703, whose depths are in accordance with the obtained edges. Next, we describe the important blocks in the process.
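The block-wise sweep can be sketched as follows, assuming a numpy depth image with even dimensions, a post-filtering mask, and a callable that performs the region-based median filtering of one 6×6 block (for example, a closure over the segmentation and merging results); the stride of 2 and the treatment of the image borders are illustrative assumptions.

import numpy as np

def blockwise_filter(depth_up, post_mask, filter_block):
    # sweep the interpolated depth image in overlapping 6x6 blocks; only the
    # central 2x2 values of each block are replaced by the filtered result
    out = depth_up.copy()
    H, W = depth_up.shape            # dimensions assumed to be even
    for y in range(0, H - 4, 2):
        for x in range(0, W - 4, 2):
            if post_mask[y:y + 6, x:x + 6].any():
                filtered = filter_block(depth_up[y:y + 6, x:x + 6])
                out[y + 2:y + 4, x + 2:x + 4] = filtered[2:4, 2:4]
    return out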
Detection of Edge Discontinuity
Depth differences 812 between two intermediate images produced by the dilation and erosion have high values near edges. Therefore, a threshold 813 can determine the areas of the image where the edge is located. The mask is then up-sampled 814 to produce a depth mask 802, which indicates whether a block of the interpolated decoded high resolution depth image should be post-processed, or not.
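A possible sketch of this detection step, using scipy.ndimage gray-scale morphology on the decoded low resolution depth image; the 3×3 structuring element and the threshold value are illustrative assumptions.

import numpy as np
from scipy import ndimage

def detect_edge_mask(depth_low, d, thresh=8):
    # the difference between gray-scale dilation and erosion is large near edges
    dilated = ndimage.grey_dilation(depth_low, size=(3, 3))
    eroded = ndimage.grey_erosion(depth_low, size=(3, 3))
    edge_mask = (dilated.astype(np.int32) - eroded) > thresh
    # up-sample the low resolution mask to produce the full resolution depth mask
    return np.repeat(np.repeat(edge_mask, d, axis=0), d, axis=1)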
Dilation and Erosion
Morphological dilation and erosion are well known terms in the art of image processing. The state of any pixel in the output image is determined by applying rules to the corresponding pixel, and its neighbors in the input image.
For the dilation rule, the depth of the output pixel is the maximum depth of all the pixels in the neighborhood of the input pixel. Dilation generally increases the sizes of objects, filling in holes and broken areas, and connecting areas that are separated by small spaces. In gray-scale images, dilation increases the brightness of objects by taking the neighborhood maximum. With binary images, dilation connects areas that are separated by distance smaller than a structuring element, and adds pixels to the perimeter of each image object.
Erosion
For the erosion rule, the depth of the output pixel is the minimum depth of all the pixels in the neighborhood. Erosion generally decreases the sizes of objects and removes small anomalies by subtracting objects with a radius smaller than the structuring element. In gray-scale images, erosion reduces the brightness, and therefore the size, of bright objects on a dark background by taking the neighborhood minimum.
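The two rules can be illustrated directly as a neighborhood maximum and minimum over a square structuring element (a pure numpy sketch; the k×k window size and the edge-replication padding are assumptions):

import numpy as np

def dilate_erode(img, k=3):
    # dilation: neighborhood maximum; erosion: neighborhood minimum,
    # both over a k x k square structuring element
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return windows.max(axis=(2, 3)), windows.min(axis=(2, 3))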
Color-Edge Magnitude
Edge information extracted from the color (texture) images can be more reliable than edges detected directly in the decoded low resolution depth image. We extract the edge magnitude from each color channel by first applying a smoothing Gaussian filter, and then a differential filter to the smoothed input. The maximum magnitude of the three channels is retained. The resulting edge magnitude is used to determine the boundaries of objects, using watershed segmentation.
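A sketch of this edge magnitude estimation, assuming an RGB numpy image, using scipy.ndimage for the Gaussian smoothing and a Sobel operator as the differential filter; the choice of Sobel and the sigma value are assumptions for illustration.

import numpy as np
from scipy import ndimage

def color_edge_magnitude(rgb, sigma=1.0):
    # per channel: Gaussian smoothing followed by a differential (Sobel) filter;
    # retain the maximum magnitude over the three color channels
    mags = []
    for c in range(3):
        smooth = ndimage.gaussian_filter(rgb[..., c].astype(np.float64), sigma)
        gx = ndimage.sobel(smooth, axis=1)
        gy = ndimage.sobel(smooth, axis=0)
        mags.append(np.hypot(gx, gy))
    return np.maximum.reduce(mags)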
Watershed Segmentation
The watershed segmentation procedure considers the edge magnitude input image as a terrain, and uses a geophysical model of rain falling in the terrain to segment the image. The concept of the watershed transform is based on the idea that a raindrop falling on a surface follows the path of steepest descent to a minimum. A catchment basin is the set of points on the surface that lead to the same minimum, and borders between catchment basins are the divisions between regions, also known as watershed lines.
A known issue with the watershed transform is over-segmentation. Therefore, the watershed transform is usually followed by a clustering or merging operation. In our case, the transform is applied on a block-by-block basis, where blocks of size 6×6 that contain an edge pixel are selected for segmentation.
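A minimal sketch of the per-block segmentation, assuming the scikit-image watershed implementation with markers placed at regional minima of the edge magnitude; the marker selection shown here is one possible choice, not necessarily the one used by the encoder or decoder.

from scipy import ndimage
from skimage.segmentation import watershed

def segment_block(edge_mag):
    # treat the edge magnitude as a terrain: markers are placed at its regional
    # minima, and the watershed transform assigns each pixel to a catchment basin
    minima = edge_mag == ndimage.grey_erosion(edge_mag, size=(3, 3))
    markers, _ = ndimage.label(minima)
    return watershed(edge_mag, markers)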
Region Clustering
Because the watershed transform usually generates more regions than necessary, we apply a clustering procedure that is based on the average color information in each region. For each region, the average value of all the color pixels in the region is determined. For all neighboring regions, we determine the average color value of the union of the two regions using a weighted sum of their respective color values, with their areas as weighting factors.
Then, the cost of uniting two regions is given by the difference between the actual color and the color resultant from the union, weighted also by the area of each region.
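The union color and the merging cost can be expressed as follows, assuming per-region mean colors and pixel-count areas; the absolute color difference used as the distance measure is an illustrative assumption.

import numpy as np

def merge_cost(mean_a, area_a, mean_b, area_b):
    # average color of the union: area-weighted sum of the two region means
    union_mean = (area_a * mean_a + area_b * mean_b) / (area_a + area_b)
    # cost: color change caused by the union, weighted by the area of each region
    return (area_a * np.sum(np.abs(mean_a - union_mean))
            + area_b * np.sum(np.abs(mean_b - union_mean)))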
For example, in
Region-Based Median Filtering
In
The watershed segmentation (
For each region, the median of the depth values is determined. The pixels in the central 2×2 block are assigned the median value of the region to which they belong.
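A sketch of this assignment for one 6×6 block, assuming a depth block and the corresponding region label block produced by the segmentation and clustering steps; consistent with the block-wise sweep described above, the caller keeps only the central 2×2 values of the returned block.

import numpy as np

def region_median_block(depth_block, label_block):
    # replace the depths of each region in the 6x6 block with the region median;
    # the caller retains only the central 2x2 values of the returned block
    out = depth_block.astype(np.float64).copy()
    for lab in np.unique(label_block):
        mask = label_block == lab
        out[mask] = np.median(depth_block[mask])
    return out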
Our depth up-sampling and reconstruction filter includes an edge-aware region-based median filter. The filter is non-linear, and takes into consideration characteristics of depth images to reduce coding errors, as well as edge information to recover the depth information that is lost in the down-sampling and coding procedure. By using the edge information, the up-sampled reconstructed depth image has a higher quality, and generates synthetic views with higher quality.
When edge-aware depth up-sampling is used as an in-loop filter and combined with view synthesis prediction, the coding efficiency is improved because a higher quality synthetic reference can be generated using our depth up-sampling technique.
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.