The present invention relates to video compression, and more particularly to overcomplete wavelet video coding using adaptive motion compensated temporal filtering.
Current video coding algorithms are mainly based on hybrid coding schemes with motion compensated predictive coding. In such hybrid schemes, temporal redundancy is reduced using motion compensation, and spatial redundancy is reduced by transform coding the residue of the motion compensation. These hybrid coding schemes, however, are prone to error propagation and lack flexibility in terms of providing a truly scalable bitstream, i.e., the ability to decompress to different quality, resolution, and frame-rate layers from the same compressed bitstream.
In contrast, 3D sub-band/wavelet coding can provide a very flexible, scalable bitstream and higher error resilience. Wavelet-based scalable video coding schemes permit great flexibility in terms of the different scalability types allowed. Hence, they are especially useful for video transmission over heterogeneous wireless and wired networks, to various devices with different capabilities.
Currently, there are two wavelet-based video coding schemes: overcomplete wavelet and interframe wavelet. In overcomplete wavelet (OW) video coding, the spatial wavelet transform for each frame is performed first, followed by exploitation of interframe redundancy, either by predicting the wavelet coefficient values or by defining temporal contexts in entropy coding. In interframe wavelet video coding, wavelet filtering is performed along the temporal axis first, followed by a 2D spatial wavelet transform.
Present interframe wavelet video coding schemes use motion compensated temporal filtering (MCTF) to reduce the temporal redundancy. MCTF is performed in the temporal direction of motion before spatial decomposition is performed. Such video coding schemes are referred to herein as spatial domain MCTF (SDMCTF). However, SDMCTF video coding schemes are inherently limited by the quality of the matches provided by the motion estimation algorithm. For example, some interframe wavelet-coded sequences appear slightly blurred, because imperfect motion estimation causes movement of frame details into the temporal high frequency sub-bands, and from there into the spatial high frequency sub-bands. These artifacts lead to degraded visual performance for unquantized and spatially scaled sequences. Further tests have indicated that decreasing the number of temporal decomposition levels can reduce these artifacts.
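To make the temporal filtering concrete, the following minimal Python sketch performs one Haar lifting step along a motion trajectory. The `warp` callable stands in for whatever motion compensation scheme is used and is an assumption of the sketch, not a description of any particular SDMCTF implementation:

```python
import numpy as np

def haar_mctf_pair(frame_a, frame_b, warp):
    """One level of Haar motion compensated temporal filtering (sketch).

    frame_a, frame_b -- two consecutive frames as 2-D arrays
    warp             -- callable that aligns frame_a to frame_b along the
                        estimated motion field (assumed to be given)
    Returns the temporal low (average) and high (detail) frames.
    """
    aligned = warp(frame_a)                    # motion compensated reference
    high = (frame_b - aligned) / np.sqrt(2.0)  # temporal high band: residual
    low = (frame_b + aligned) / np.sqrt(2.0)   # temporal low band: average
    return low, high

# Example: with zero motion, warp is the identity and the high band is
# nearly empty for two almost-identical frames.
a = np.random.rand(8, 8)
b = a + 0.01
low, high = haar_mctf_pair(a, b, lambda f: f)
```

When the motion match is poor, frame detail leaks into `high`, which is exactly the blurring mechanism described above.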
In present OW video coding schemes, wavelet filtering is used to spatially decompose each of the video frames into multiple sub-bands, and temporal correlation for each sub-band is removed using motion estimation.
There have been many attempts to predict the wavelet coefficients by motion compensation in the wavelet domain. However, motion compensation in the wavelet domain is highly dependent on the alignment of the signal with the discrete grid chosen for the analysis. Very large differences exist between the wavelet coefficients of the original image and those of the one-pixel-shifted image. This shift-variance occurs frequently around image edges, so motion compensation of the wavelet coefficients can be difficult.
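The shift-variance is easy to reproduce. The short numpy experiment below (a sketch using an illustrative one-level Haar transform) shows that an edge aligned with the decimation grid produces no high-band energy at all, while the same edge shifted by one pixel does:

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar analysis of an even-length 1-D signal."""
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return low, high

edge = np.zeros(16)
edge[8:] = 1.0                                # step edge aligned with the grid
shifted = np.concatenate(([0.0], edge[:-1]))  # the same edge, one pixel later

print(haar_dwt(edge)[1])     # high band: all zeros
print(haar_dwt(shifted)[1])  # high band: a large coefficient appears
```

The two high bands bear no resemblance to each other, which is why direct motion compensation on critically sampled coefficients performs poorly.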
Existing OW video coding schemes overcome the inefficiency of motion estimation in the wavelet domain by also utilizing the odd-phase wavelet coefficients in the prediction. A convenient method of obtaining the odd-phase coefficients is to perform low-band shifting. Since the decoded previous frame is also available at the decoder, prediction from the overcomplete expansion does not require any additional overhead. Moreover, the computational complexity of searching for both the optimal phase and the motion vectors in the wavelet domain is comparable to that of conventional motion estimation in the spatial domain with fractional-pel accuracy.
However, due to the motion estimation/compensation, the conventional OW framework suffers from drift, which results in performance loss in SNR scalability. Furthermore, only a limited range of temporal scalability can be achieved using B frames.
Accordingly, a wavelet-based video-coding scheme with improved SNR and temporal scalability is needed.
The present invention is directed to a method and device for coding video. According to a first aspect of the present invention, a video signal is spatially decomposed into at least two signals of different frequency sub-bands. An individualized motion compensated temporal filtering scheme is applied to each sub-band signal. Texture coding is then applied to each of the motion compensated temporally filtered sub-band signals. According to a second aspect of the invention, a signal including at least two encoded, motion compensated temporally filtered, different frequency sub-band signals of a video signal is decoded. Inverse motion compensated temporal filtering is independently applied to each of the at least two decoded sub-band signals. The at least two sub-band signals are spatially recomposed, and the video signal is reconstructed from at least one of the at least two spatially recomposed sub-band signals.
The present invention is a fully scalable three-dimensional (3-D) overcomplete wavelet video coding scheme that utilizes a novel inband motion compensated temporal filtering (IBMCTF) method. The IBMCTF method of the present invention overcomes the drawbacks of previous IBMCTF coding methods, and demonstrates coding efficiency comparable to or better than conventional interframe wavelet coding methods that utilize spatial domain motion compensated temporal filtering.
The video encoder 100 further includes a partitioning unit 120a, 120b, 120c for each sub-band generated by the wavelet transform unit 110. Each partitioning unit 120a, 120b, 120c divides the wavelet coefficients of its associated sub-band into groups of frames (GOFs) for encoding as a group.
The video encoder 100 also includes a motion compensated temporal filtering (MCTF) unit 130a, 130b, 130c for each sub-band, each of which contains a motion estimator 131a, 131b, 131c and a temporal filter 132a, 132b, 132c. Each MCTF unit 130a, 130b, 130c separately removes temporal correlation or redundancy from the GOFs of its sub-band using a motion compensated temporal filtering process. In accordance with the present invention, the use of a discrete MCTF unit for each sub-band allows the motion compensated temporal filtering process to be tailored for each sub-band independently of the other sub-bands. In addition, the temporal filtering process selected for a particular sub-band may be based on different criteria.
The encoder additionally includes a texture encoder 140a, 140b, 140c for each sub-band that allows the residual signal and motion information (motion vectors) generated by the MCTF units 130a, 130b, 130c for each sub-band to be independently texture coded using any optimized texture coding process. The texture coded residual signals and motion information are then combined into a single bitstream by a multiplexer 150. In another embodiment, texture coding is a global transform of a full-size residual frame, applied after all of the residual signals and motion information generated by the MCTF units 130a, 130b, 130c for each sub-band are combined to generate the full-size residual frame.
As one of ordinary skill in the art will appreciate, the critically sampled wavelet decomposition in known IBMCTF methods is only periodically shift-invariant. Therefore, performing motion estimation and compensation in the wavelet domain is inefficient and may incur a coding penalty. To address this problem, each motion compensated temporal filtering unit 130a, 130b, 130c utilizes an adaptive higher order interpolation filter 200, as shown in
The interleaving process, performed by the interleaving unit 220, combines the different phase information provided by the overcomplete wavelet coefficients to generate an extended reference frame. Accordingly, there is no need to encode the phase information separately as in previous IBMCTF based video coding methods. Due to the interleaving process of the present invention, the phase information is coded inherently as part of the higher accuracy motion vectors.
From the extended reference frame, the interpolation unit 230 generates fractional-pel samples, such as ½-, ¼-, ⅛-, or 1/16-pel positions, which are used by the motion estimator 131a, 131b, 131c for motion estimation. Interpolation may be implemented with a conventional one-dimensional interpolation filter. To maximize the performance of the motion estimation and MCTF, independently optimized interpolation filters with different tap lengths can be used for each sub-band.
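As an illustration of the interpolation step, the sketch below generates half-pel positions from a reference array using a simple separable bilinear filter. A real deployment would use the longer, per-sub-band optimized taps mentioned above; the function name and the bilinear choice are illustrative assumptions:

```python
import numpy as np

def half_pel_interpolate(frame):
    """Upsample a 2-D reference by 2 in each axis with bilinear filtering,
    exposing half-pel positions between the integer-pel samples."""
    h, w = frame.shape
    up = np.zeros((2 * h - 1, 2 * w - 1), dtype=float)
    up[0::2, 0::2] = frame                                   # integer pels
    up[0::2, 1::2] = (frame[:, :-1] + frame[:, 1:]) / 2.0    # horizontal half-pels
    up[1::2, 0::2] = (frame[:-1, :] + frame[1:, :]) / 2.0    # vertical half-pels
    up[1::2, 1::2] = (frame[:-1, :-1] + frame[:-1, 1:] +
                      frame[1:, :-1] + frame[1:, 1:]) / 4.0  # diagonal half-pels
    return up
```

Repeating the same step on the output yields quarter-pel positions, and so on for the finer accuracies.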
The IBMCTF based 3-D overcomplete wavelet video coding method of the present invention provides improved spatial scalability performance as compared with known spatial domain motion compensated temporal filtering (SDMCTF) based video coding methods. This is because the temporal filtering is performed per sub-band (resolution) and hence, loss of information from the finer resolution sub-bands does not incur any drift in the temporal direction.
As mentioned earlier, the use of a discrete MCTF unit 130a, 130b, 130c for each sub-band allows different temporal filtering techniques to be used at the various resolutions. For example, in one embodiment, a bi-directional temporal filtering technique can be used for low resolution sub-bands, while a forward temporal filtering technique can be used for higher resolution sub-bands. The temporal filtering technique can be selected to minimize a distortion or complexity measure (e.g., the low resolution sub-bands have fewer pixels, and hence bi-directional and multiple reference temporal filtering can be employed, while for the high resolution sub-bands, which have a larger number of pixels, only forward estimation is performed). Such a flexible choice of temporal filtering options moves the invention away from the strict 1D+2D decomposition scheme performed by MCTF, toward a more general 3-D decomposition scheme with spatial size reduction throughout the temporal levels, where the higher spatial frequency sub-bands are omitted from longer-term temporal filtering.
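A complexity-driven selection of the per-sub-band filtering mode can be as simple as the following sketch; the pixel-count threshold is purely an illustrative assumption:

```python
def choose_temporal_filter(subband_height, subband_width,
                           pixel_budget=352 * 288 // 4):
    """Pick a temporal filtering mode for a sub-band from a crude
    complexity proxy (its pixel count); the budget is an assumption."""
    if subband_height * subband_width <= pixel_budget:
        return "bidirectional"   # cheap enough for multi-reference filtering
    return "forward"             # larger sub-bands: forward estimation only
```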
The use of a discrete partitioning unit 120a, 120b, 120c for each sub-band allows the GOFs to be adaptively determined per sub-band. For instance, the LL sub-bands might use a very large GOF, while the H sub-bands can use shorter GOFs. The GOF sizes can be varied based on the sequence characteristics, complexity, or resiliency requirements. As mentioned earlier, the decomposition scheme for conventional MCTF, as shown in
The number of temporal decomposition levels for the various sub-bands can be determined based on content, to reduce a specific distortion metric, or simply based on the desired temporal scalability at each resolution. For instance, if 30, 15 and 7.5 Hz frame-rates are desired at CIF (352×288) resolution, and only 30 and 15 Hz at SD (704×576) resolution, then three levels of temporal decomposition are used for the LL spatial sub-band, while only two levels of temporal decomposition are applied to the LH, HL, and HH sub-bands.
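The frame rates follow directly from the dyadic temporal structure, since discarding one temporal level halves the decodable frame rate. The snippet below simply makes that arithmetic explicit (counting conventions vary; here the full rate is taken as the first of the levels named in the example above):

```python
cif_levels, sd_levels = 3, 2
cif_rates = [30.0 / 2 ** k for k in range(cif_levels)]  # [30.0, 15.0, 7.5]
sd_rates = [30.0 / 2 ** k for k in range(sd_levels)]    # [30.0, 15.0]
```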
As also mentioned earlier, the use of a discrete texture coding unit 140a, 140b, 140c for each sub-band allows adaptive texture coding of the various spatial sub-bands. For example, wavelet or DCT-based texture coding schemes may be used. If DCT-based texture coding is used, intra-coded blocks can advantageously be inserted anywhere within the GOF to deal efficiently with covering and uncovering situations. Also, “adaptive intra-refresh” concepts from MPEG-4/H.26L can easily be employed to provide improved resiliency, and different refresh rates can be used for the various sub-bands to obtain different resiliencies. This is especially beneficial since the lower resolution sub-bands can be used to conceal the higher resolution sub-bands, and hence their resiliency is more important.
Another advantage of the present invention relates to the complexity scalability of the decoder. If there are many decoders with different computational power and displays, the same scalable bitstream can be used to support all of those decoders through SNR/spatial/temporal scalability. For example, the scalable bitstream generated by the encoder of the invention can be decoded by a low complexity decoder that decodes only the low resolution spatial and temporal decomposition levels, which incurs only a small computational burden. Similarly, the same scalable bitstream can be decoded by a decoder having sophisticated decoding power that decodes the whole bitstream to achieve the full spatial and temporal resolution.
A first texture decoder 420 texture decodes the wavelet coefficients, according to the inverse of the texture coding technique performed on the encoding side, into their separate sub-bands 1, 2, . . . and N. The wavelet coefficients of a sub-band produced by the first texture decoder 420 correspond to each GOF of that sub-band. A motion vector decoder 430 decodes the motion information for each sub-band according to the inverse of the motion vector coding technique performed on the encoding side. Using the decoded motion vectors and residual texture information, inverse MCTF is applied by MCTF units 440a, 440b, 440c to each sub-band independently, and an inverse wavelet transform unit 450 spatially recomposes each sub-band to reconstruct the low, medium, and high level images. The low-band-shifting block reads the recomposed sub-band images to assemble a full size image, and the low-band-shifted wavelet decomposition is then applied to provide the extended reference frames for the inverse MCTF units 440a, 440b, 440c. Depending on the display resolution, a video reconstruction unit (not shown) may use one of the sub-bands to generate a low resolution video, two sub-bands to generate a medium resolution video, or all of the sub-bands to generate a high resolution, full quality video.
The various processes utilized in the video scheme of the present invention will now be described in greater detail below.
The decimation process performed in a wavelet transform generates wavelet coefficients that are no longer shift-invariant. Hence, translational motion in the spatial domain cannot be accurately estimated from the wavelet coefficients, which in turn produces a significant loss in coding efficiency. The low-band-shifting (LBS) algorithms utilized in the present invention provide a method for overcoming the shift-variant property of the wavelet transform. At the first level, the original and shifted signals are decomposed into low-sub-band and high-sub-band signals. Subsequently, the low-sub-band signal is further decomposed in the same way as at the first level.
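A one-dimensional rendering of the idea is sketched below. `np.roll` supplies the one-sample shift with periodic boundary handling, which is an assumption of the sketch rather than part of the algorithm as described:

```python
import numpy as np

def haar_analysis(x):
    """One-level Haar analysis of an even-length 1-D signal."""
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return low, high

def low_band_shift(x, levels):
    """Sketch of 1-D low-band shifting: at every level, both the current
    low band and its one-sample shift are decomposed, so every polyphase
    version of each sub-band is retained."""
    phases, per_level_highs = [x], []
    for _ in range(levels):
        next_phases, highs = [], []
        for p in phases:
            for shifted in (p, np.roll(p, 1)):   # phase 0 and phase 1
                low, high = haar_analysis(shifted)
                next_phases.append(low)
                highs.append(high)
        per_level_highs.append(highs)            # all phases of this level
        phases = next_phases                     # recurse on all low bands
    return per_level_highs, phases               # highs per level, lowest band
```

Each level doubles the number of retained phases, which is exactly the overcomplete expansion the critically sampled transform throws away.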
The novel interleaving scheme of the present invention stores the overcomplete wavelet coefficients differently from that depicted in
The interleaving scheme can be used recursively at each decomposition level and can be directly extended for 2-D signals.
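In one dimension, the interleaving amounts to slotting the even-phase and odd-phase coefficients of a sub-band into alternating positions of one extended array, as in the sketch below (the exact storage layout is an assumption for illustration):

```python
import numpy as np

def interleave_phases(coeff_even, coeff_odd):
    """Merge the two polyphase coefficient sets of a one-level 1-D
    decomposition into a single extended reference array."""
    out = np.empty(coeff_even.size + coeff_odd.size, dtype=coeff_even.dtype)
    out[0::2] = coeff_even   # coefficients of the unshifted signal
    out[1::2] = coeff_odd    # coefficients of the one-sample-shifted signal
    return out
```

Because every phase now sits on a common full-resolution grid, an integer displacement in the extended array selects a phase and a coefficient position at once, which is why no separate phase information needs to be coded.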
As is well known in the art, in a wavelet decomposition, every coefficient at a given scale, with the exception of those in the highest frequency sub-bands, can be related to a set of coefficients of the same orientation at finer scales. In many wavelet coders, this relationship is exploited by representing the coefficients as a data structure called a wavelet tree. In the LBS algorithm, the coefficients of each wavelet tree rooted in the lowest sub-band are rearranged to form a wavelet block, as shown in
In the spatial domain, block-based motion estimation usually divides an image into small blocks and then finds the block of the reference frame that minimizes the mean absolute difference (MAD) with each block of the current frame. The motion estimation of the LBS algorithm finds the motion vector (dx, dy) that generates the minimum MAD between the current wavelet block and the reference wavelet block. As an example, if an input image is decomposed up to the third level (i.e., the input image is decomposed into a total of ten sub-bands), and the displacement vector is (dx, dy), then the MAD of the k-th wavelet block in
where $x_{i,k} = x_{0,k}/2^i$ and $y_{i,k} = y_{0,k}/2^i$; and $(x_{0,k}, y_{0,k})$ denotes the initial position of the $k$-th wavelet block in the spatial domain, as shown in
However, in the IBMCTF method of the present invention, the interleaving process enables the MAD calculation to be performed in the same manner as in SDMCTF video coding schemes, even at sub-pixel accuracy. More specifically, the MAD for the displacement vector $(dx, dy)$ in the IBMCTF method of the present invention is computed as follows:
where, for example, $\mathrm{LBS\_HL}^{(i)}_{\mathrm{ref}}(x, y)$ denotes the extended HL sub-band of the reference frame at level $i$, generated using the interleaving process of the present invention. Note that even if $(dx, dy)$ takes non-integer values, the same interpolation technique used for SDMCTF can readily be applied to each extended sub-band to compute the MAD for the non-integer displacement. Therefore, the IBMCTF video coding scheme of the present invention provides more efficient, and indeed optimal, sub-pixel motion estimation compared to existing IBMCTF coding schemes. Also, the IBMCTF video coding scheme of the present invention, with its wavelet block structure, does not incur any motion vector overhead, because the number of motion vectors to be coded is the same as in SDMCTF. Since the motion estimation is closely aligned with the residual coding, a more sophisticated motion estimation criterion (such as the entropy of the residual signal) may be used to improve the coding performance.
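A minimal full-search sketch of the MAD criterion is given below for a single 2-D array; it applies equally to a spatial frame or to an extended sub-band formed by the interleaving process. The block size, search radius, and function name are illustrative assumptions:

```python
import numpy as np

def best_displacement(cur, ref, x0, y0, size=8, radius=4):
    """Exhaustively search for the (dx, dy) minimizing the mean absolute
    difference between a current block and a reference block."""
    block = cur[y0:y0 + size, x0:x0 + size]
    best_mad, best = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yr, xr = y0 + dy, x0 + dx
            # keep the candidate block fully inside the reference array
            if 0 <= yr <= ref.shape[0] - size and 0 <= xr <= ref.shape[1] - size:
                cand = np.mean(np.abs(block - ref[yr:yr + size, xr:xr + size]))
                if cand < best_mad:
                    best_mad, best = cand, (dx, dy)
    return best, best_mad
```

Run on an extended sub-band, an integer $(dx, dy)$ found this way implicitly carries the phase choice, per the interleaving discussion above.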
To verify that motion estimation and motion compensation in the overcomplete wavelet domain in accordance with the present invention yields lower residual energy in the wavelet domain, we use a one-level temporal decomposition and compute the MAD for both IBMCTF and SDMCTF. Note that in interframe wavelet coding, the MAD is computed in the spatial domain, but what actually needs to be minimized is the residual energy in the wavelet domain.
The input/output devices 502, processor 503 and memory 504 may communicate over a communication medium 505. The communication medium 505 may represent, e.g., a bus, a communication network, one or more internal connections of a circuit, circuit card or other device, as well as portions and combinations of these and other communication media. Input video data from the source(s) 501 is processed in accordance with one or more software programs stored in memory 504 and executed by processor 503 in order to generate output video/images supplied to a display device 506.
In a preferred embodiment, the coding and decoding principles of the present invention may be implemented by computer readable code executed by the system. The code may be stored in the memory 504 or read/downloaded from a memory medium such as a CD-ROM or floppy disk. In other embodiments, hardware circuitry may be used in place of, or in combination with, software instructions to implement the invention. For example, the functional elements shown in
While the present invention has been described above in terms of specific embodiments, it is to be understood that the invention is not intended to be confined or limited to the embodiments disclosed herein. For example, other transforms besides DCT can be employed, including but not limited to wavelets or matching-pursuits. These and all other such modifications and changes are considered to be within the scope of the appended claims.
Number | Date | Country | Kind
--- | --- | --- | ---
60/418,961 | Oct 2002 | US | national
60/483,796 | Jun 2003 | US | national
This application claims the benefit under 35 USC 119(e) of U.S. provisional application Ser. No. 60/418,961, filed on Oct. 16, 2002, which is incorporated herein by reference.
Filing Document | Filing Date | Country | Kind | 371c Date
--- | --- | --- | --- | ---
PCT/IB03/04452 | 10/8/2003 | WO | | 4/13/2005