The present invention relates generally to the field of video coding. More particularly, the present invention relates to spatial scalability in scalable video coding (SVC).
This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Digital video includes ordered sequences of images produced at a constant rate (for example, 15 or 30 images/second). The resulting amount of raw video data is therefore extremely large. Consequently, video compression is particularly necessary to efficiently code the video data prior to storage or transmission. The compression process is a reversible conversion of video data into a compact format that can be represented with fewer bits.
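The scale of the raw data can be made concrete with a short back-of-the-envelope calculation. The resolution, frame rate, and sampling format below are illustrative assumptions, not values taken from this disclosure:

```python
# Raw (uncompressed) data rate for an example video sequence.
# Assumptions: CIF resolution (352x288), 30 frames/second, and
# YUV 4:2:0 sampling, which averages 12 bits per pixel
# (8 bits of luma plus two chroma planes subsampled by 4).
width, height, fps = 352, 288, 30
bits_per_pixel = 12

raw_bits_per_second = width * height * bits_per_pixel * fps
print(f"Raw rate: {raw_bits_per_second / 1e6:.1f} Mbit/s")
```

Even at this modest resolution the raw rate exceeds 36 Mbit/s, which is why compression is required before storage or transmission.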
Video coding commonly exploits the spatial and temporal redundancies inherent in the video sequences for intra and interframe coding. During interframe coding, the encoder attempts to reduce the temporal redundancies between consecutive video frames by predicting the current frame based on its neighboring frames. In intraprediction, the spatial redundancies are reduced by predicting blocks that constitute a frame from their neighboring blocks. After prediction, a residual frame, which is the difference between the predicted and the original frame, is produced alongside some supporting parameters. This residual frame is often compressed prior to transmission, where a transformation, such as the Discrete Cosine Transform (DCT), is applied, followed by variable length coding methods such as Huffman coding.
To allow for more flexibility and adaptation to a variety of applications and transmission bandwidth, scalable video coding extends the basic (single-layer) video coding to multi-layer video coding. Essentially, a base layer is coded together with different enhancement layers at different spatial, temporal and quality resolutions. In addition to inter and intra frame prediction techniques, scalable video coding develops interlayer prediction mechanisms that exploit the redundancies among layers and reuse information from the lower layers.
For the purpose of reusing the information from the reconstructed lower spatial resolution base layer in the higher spatial resolution enhancement layer, an up-sampling of the base layer picture is required. The up-sampling process involves interpolating the pixel values using a finite impulse response filter to generate the higher resolution picture. The quality of the interpolated picture, and therefore the fidelity of the prediction, is clearly influenced by the choice of the up-sampling filter.
The Scalable Video Coding project of the Joint Video Team (JVT) of ITU-T VCEG and ISO/IEC MPEG is a scalable extension of H.264/AVC that is currently under development. The corresponding reference encoder is described in ISO/IEC JTC1/SC29/WG11, “Draft of Joint Scalable Video Model JSVM-4 Annex G”, JVT document JVT-Q201, Poznan, July 2005, incorporated herein by reference in its entirety. In the current JSVM, the up-sampling of base layer frames is carried out using the advanced video coding (AVC) filter. Additionally, new optimal filters have been proposed as alternatives to the AVC filter. Such filters are discussed, for example, in Andrew Segall, “Adaptive Study of Up-sampling/Down-sampling for Spatial Scalability”, JVT-Q083, Nice, France, October 2005 (incorporated herein by reference). Each of these competing filters yields relatively good performance at certain bit rates while under-performing at others.
In the current JSVM software, the AVC filter with filter taps [0, 0, 1, 0, -5, 0, 20, 32, 20, 0, -5, 0, 1, 0, 0]/32 is utilized to up-sample the base layer frames. An optimal filter with filter taps that vary according to the base layer QP (for example, when QP_base=20, the taps are given by [0, 3, 3, -8, -8, 21, 42, 21, -8, -8, 3, 3, 0]/32) has previously been proposed as an alternative to the AVC filter in order to further enhance the quality of the interpolated picture. The enhancement achieved by the alternative filter is, however, limited to low bit rate cases. Moreover, a decline in performance is observed at high bit rates.
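For dyadic (2x) up-sampling, the AVC filter above effectively copies the base-layer samples at even positions (the lone 32/32 tap) and interpolates the odd, half-sample positions with the embedded 6-tap kernel [1, -5, 20, 20, -5, 1]/32. The following 1-D sketch illustrates that polyphase structure; the border handling by sample clamping is a simplifying assumption of this example, not the edge rule defined by the standard, and no clipping to the valid pixel range is performed:

```python
# 1-D dyadic up-sampling sketch using the AVC half-sample kernel.
HALF_PEL = [1, -5, 20, 20, -5, 1]  # taps sum to 32

def upsample_2x(samples):
    n = len(samples)
    out = []
    for i in range(n):
        out.append(samples[i])  # even phase: base-layer sample passes through
        # Odd phase: 6-tap interpolation centered between samples[i], samples[i+1].
        acc = 0
        for k, tap in enumerate(HALF_PEL):
            j = min(max(i + k - 2, 0), n - 1)  # clamp indices at the borders
            acc += tap * samples[j]
        out.append((acc + 16) >> 5)  # divide by 32 with rounding
    return out

print(upsample_2x([10, 20, 30, 40]))  # -> [10, 14, 20, 25, 30, 36, 40, 41]
```

Because the taps sum to 32, a constant signal is reproduced exactly, which is a quick sanity check on any candidate up-sampling filter.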
The present invention enhances the existing base layer image up-sampling system for usage in scalable video coding. The present invention involves the use of a filter switching mechanism to take advantage of the best performance of each of the filters in a collaborative manner. The switching process of the present invention can be generalized to more filter choices and potentially relieve the computational complexity due to the added freedom and flexibility of filter choices. In the event that the base layer quantization parameter (QP) (QP_base) is fixed, the present invention can be implemented using QP-based switching, rate-distortion-based switching, or filter training based switching. If the base layer QP (QP_base) at the decoder side is not exactly known, then the switching process can be implemented based upon QP thresholds either at a sequence level or at a frame level.
From a performance point of view, the present invention enables the encoder to combine the advantages of the several alternative filters in a collaborative fashion. This performance advantage is illustrated in the accompanying drawings.
Additionally, because the usage of a single filter irrespective of the data rate may mandate a larger number of filter taps to achieve decent performance (as in the case of the optimal filters), the computational complexity of the up-sampling operation can be reduced by using a switching filter mechanism that employs filters with fewer taps. The invention can be implemented directly in software using any common programming language, e.g., C/C++ or assembly language. The present invention can also be implemented in hardware and used in consumer devices.
These and other advantages and features of the invention, together with the organization and manner of operation thereof, will become apparent from the following detailed description when taken in conjunction with the accompanying drawings, wherein like elements have like numerals throughout the several drawings described below.
The invention enhances the existing base layer image up-sampling mechanism for usage in scalable video coding. The present invention involves the use of a filter switching mechanism to take advantage of the best performance of each of the filters in a collaborative manner. The switching process of the present invention can be generalized to more filter choices and potentially relieve the computational complexity due to the added freedom and flexibility of filter choices.
To understand the nature of the present invention, it is helpful to consider a lower spatial resolution layer (referred to herein as a spatial base layer), possibly alongside its associated fine grain scalability (FGS) layers. In up-sampling the base layer resolution to obtain the higher spatial resolution (up-sampling QCIF resolution to obtain CIF resolution, for example), the present invention provides for different up-sampling filter switching mechanisms. Some of these mechanisms target the case where the effective QP, at which the lower spatial resolution layer is up-sampled at the decoder side, is not exactly known. Others are utilized in the case where this effective QP is exactly known.
In SVC, spatial scalability requires the up-sampling of a lower spatial layer resolution so that its signal can be exploited to predict the upper spatial layer. As discussed above, a single filter is currently used irrespective of the quality level (bit rate) at which the coding is taking place. However, two filters may have different performance strengths at different bit rates. In order to take advantage of the best performance of the candidate filters, the present invention uses a process that switches between different up-sampling filters.
For describing the present invention in detail, the case of a lower spatial layer (base layer), possibly in conjunction with its different FGS layers, is discussed as follows. The up-sampling can take place either at a fixed lower spatial layer QP, for example when the lower spatial layer does not have FGS layers, or at an arbitrary lower spatial layer QP. The following two basic scenarios, with a known base layer QP and with an unknown base layer QP, cover the implementation of the switched up-sampling process.
Rate-Distortion-Based Switching: Basically, for each enhancement layer frame to be coded, the encoder up-samples the corresponding reconstructed base layer frame using each of the up-sampling filter candidates. The resulting up-sampled frames are individually utilized to code the enhancement layer frame. Subsequently, a rate distortion cost associated with each of the up-sampling filters is calculated. The filter yielding the least rate-distortion cost (and hence its corresponding enhancement layer coded bit stream) is chosen as the best (i.e., final) candidate. The index of the filter of choice is coded into the bit stream. Such a coding may be performed on a per-frame basis, per-macroblock, or other periodic basis. In some cases, signaling may be conditioned on temporally varying characteristics of the video sequence, such as the spectral composition, on spatially varying characteristics, such as spectral differences between one macroblock and an adjacent macroblock, or on other information previously coded into the bit stream, such as the base layer QP value. Such a conditioning may involve selecting a context for entropy coding of the filter index. It may also involve not coding the filter index in some circumstances, for example when the spectral characteristics of one macroblock are similar to the spectral characteristics of a neighboring macroblock for which the filter index is known.
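The selection step above can be sketched as a Lagrangian cost minimization, picking the filter with the least cost J = D + lambda * R. The distortion and rate figures below are hypothetical stand-ins for the outcome of actual encoder passes with each candidate filter:

```python
# Sketch of rate-distortion-based filter selection: each candidate
# up-sampling filter is tried, the distortion D and rate R of the
# resulting enhancement-layer coding are measured, and the filter
# minimizing J = D + lambda * R is chosen.
def select_filter(candidates, lam):
    # candidates: list of (filter_index, distortion, rate_in_bits)
    return min(candidates, key=lambda c: c[1] + lam * c[2])[0]

trials = [
    (0, 140.0, 1200),  # e.g. the AVC filter: cheaper but more distortion
    (1, 120.0, 1500),  # e.g. a trained filter: better quality, more bits
]
print(select_filter(trials, lam=0.05))  # -> 1 (quality dominates)
print(select_filter(trials, lam=0.1))   # -> 0 (rate dominates)
```

The index returned is what would be coded into the bit stream on a per-frame, per-macroblock, or other periodic basis, as described above.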
QP-Based Switching. While the previous switching relies on the final coding process outcome corresponding to each of the up-sampling filters to choose the best candidate for a particular enhancement-layer frame, the QP-based switching system selects the best filter among the candidates according to QP thresholds. Essentially, one or more pre-defined constant QP thresholds for QP_base and QP_enhance are set, creating a QP grid of the type shown in the accompanying drawings. Each cell of the grid is associated with one of the candidate filters, so that a given pair of QP_base and QP_enhance values maps directly to an up-sampling filter choice.
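The threshold-grid lookup can be sketched as follows. The threshold values and the grid-to-filter assignment below are hypothetical examples, not values specified by this disclosure:

```python
import bisect

# Sketch of QP-based switching: thresholds on QP_base and QP_enhance
# partition the (QP_base, QP_enhance) plane into a grid, and each grid
# cell is mapped to one up-sampling filter index.
BASE_THRESHOLDS = [26, 34]   # splits QP_base into 3 regions
ENH_THRESHOLDS = [28]        # splits QP_enhance into 2 regions
FILTER_GRID = [              # FILTER_GRID[base_region][enh_region]
    [1, 1],                  # low QP_base: trained filter (index 1)
    [1, 0],
    [0, 0],                  # high QP_base: AVC filter (index 0)
]

def pick_filter(qp_base, qp_enhance):
    b = bisect.bisect_right(BASE_THRESHOLDS, qp_base)
    e = bisect.bisect_right(ENH_THRESHOLDS, qp_enhance)
    return FILTER_GRID[b][e]

print(pick_filter(20, 24), pick_filter(40, 30))  # -> 1 0
```

Because the lookup depends only on the QP pair, no per-frame filter index needs to be coded once the thresholds are known to both sides.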
Filter Training Based Switching. In filter training-based switching, the encoder calculates a set of optimal filter coefficients, for example (but not limited to) by optimizing an error signal between the original enhancement resolution frame and the up-sampled frame. The training may be performed independently for a pair of base layer and enhancement layer QP values, or pairs of QP values may be grouped into “classes” with training performed independently for each “class”. While training is generally expected to be performed on a per-frame basis, it may also be performed over other intervals, such as a group of frames or a collection of frames with like type (for example, a set of I-frames or P-frames). The resulting filter taps are then coded into the bit stream. This may be done on a sequence basis, frame basis, or other periodic interval. It may also be triggered by fields in a slice header (such as the slice type), or conditionally coded based upon information previously coded into the bit stream.
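A minimal version of such training can be written in closed form for a 2-tap filter by solving the least-squares normal equations. Real training would use more taps and the codec's actual phase structure; the 2-tap size, the clean synthetic data, and the variable names here are assumptions of this sketch:

```python
# Sketch of filter training: find the 2-tap filter h = (h0, h1) that
# minimizes the squared error between target enhancement-layer samples
# t[i] and the prediction h0*x[i] + h1*x[i+1] formed from up-sampled
# base-layer samples x, by solving the 2x2 normal equations A h = b.
def train_two_tap(x, t):
    a00 = a01 = a11 = b0 = b1 = 0.0
    for i in range(len(t)):
        x0, x1 = x[i], x[i + 1]
        a00 += x0 * x0; a01 += x0 * x1; a11 += x1 * x1
        b0 += x0 * t[i]; b1 += x1 * t[i]
    det = a00 * a11 - a01 * a01
    return ((a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det)

# Synthetic data: each target lies exactly halfway between its two
# neighbors, so training should recover the averaging filter (0.5, 0.5).
x = [10.0, 20.0, 40.0, 30.0, 50.0]
t = [(x[i] + x[i + 1]) / 2 for i in range(4)]
h = train_two_tap(x, t)
print(h)
```

The resulting taps are what would be coded into the bit stream on a sequence, frame, or other periodic basis, as described above.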
When the FGS layer at which the decoder will be decoding the bit stream is not known, the switching mechanism discussed above is modified. A QP-based switching between different filter choices is utilized in two variations—QP-based switching at a sequence level and QP-based switching at a frame level.
For the QP-based switching method at a sequence level, the encoder signals a set of threshold values for QP_base and QP_enhance at a sequence level. As in the case of a known base layer QP, a QP grid is formed based on these threshold values. This QP grid is used to map a given pair of QP_base and QP_enhance values to one up-sampling filter choice. Unlike the known base layer QP scenario, the encoder and decoder may be using different up-sampling filters if the FGS layer of the lower resolution spatial layer at which the up-sampling is carried out differs between the two sides of the codec.
In the QP-based switching method at a frame level, because the enhancement layer QP (QP_enhance) is known to both the encoder and the decoder, the encoder signals a set of thresholds for QP_base only, on a frame basis. Accordingly, the decoder sets regions for QP_base only and maps these regions to a vector of up-sampling filters. Depending upon where the effective QP (at which the decoder will be up-sampling the lower spatial layer resolution) falls among these regions, the decoder selects an up-sampling filter.
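The decoder-side mapping for the frame-level variant reduces to a one-dimensional region lookup on QP_base. The threshold values and the filter labels below are hypothetical signalled values, chosen only for illustration:

```python
import bisect

# Sketch of frame-level QP-based switching at the decoder: thresholds
# on QP_base (signalled per frame) define regions, and each region is
# mapped to one entry of a vector of up-sampling filters. Note there is
# one more filter than there are thresholds.
def decode_filter_choice(qp_base_thresholds, filters, effective_qp):
    region = bisect.bisect_right(qp_base_thresholds, effective_qp)
    return filters[region]

thresholds = [24, 32]  # assumed to be signalled in the frame header
filters = ["trained", "trained_short", "avc"]
print(decode_filter_choice(thresholds, filters, 36))  # -> avc
```

Only the QP_base thresholds need to be signalled here, since QP_enhance is already known to both sides.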
From a performance point of view, the present invention enables the encoder to combine the advantages of the several alternative filters in a collaborative fashion. The present invention can achieve the collective performance gains of the participating filters with the proper switching decisions. As a simple example, FIG. 3 illustrates the performance of the present invention for the football sequence (at 15 fps) using the rate-distortion-based switching between the AVC filter and an optimal filter. The base layer resolution is QCIF (176×144), whereas the enhancement layer resolution is CIF (352×288). Additionally, because the usage of a single filter, irrespective of the data rate, may mandate a larger number of filter taps to achieve decent performance (as in the case of the optimal filters), the computational complexity of the up-sampling operation can be reduced by using a switching filter mechanism that employs filters with fewer taps.
For exemplification, the system 10 shown in the accompanying drawings comprises multiple communication devices that can communicate through a network, including a mobile telephone network 11 and the Internet 28.
The exemplary communication devices of the system 10 may include, but are not limited to, a mobile telephone 12, a combination PDA and mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, and a notebook computer 22. The communication devices may be stationary or mobile as when carried by an individual who is moving. The communication devices may also be located in a mode of transportation including, but not limited to, an automobile, a truck, a taxi, a bus, a boat, an airplane, a bicycle, a motorcycle, etc. Some or all of the communication devices may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the Internet 28. The system 10 may include additional communication devices and communication devices of different types.
The communication devices may communicate using various transmission technologies including, but not limited to, Code Division Multiple Access (CDMA), Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System (UMTS), Time Division Multiple Access (TDMA), Frequency Division Multiple Access (FDMA), Transmission Control Protocol/Internet Protocol (TCP/IP), Short Messaging Service (SMS), Multimedia Messaging Service (MMS), e-mail, Instant Messaging Service (IMS), Bluetooth, IEEE 802.11, etc. A communication device may communicate using various media including, but not limited to, radio, infrared, laser, cable connection, and the like.
The present invention is described in the general context of method steps, which may be implemented in one embodiment by a program product including computer-executable instructions, such as program code, executed by computers in networked environments. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.
Software and web implementations of the present invention could be accomplished with standard programming techniques with rule based logic and other logic to accomplish the various database searching steps, correlation steps, comparison steps and decision steps. It should also be noted that the words “component” and “module,” as used herein and in the claims, are intended to encompass implementations using one or more lines of software code, and/or hardware implementations, and/or equipment for receiving manual inputs.
In terms of encoding and decoding, it should be understood that, although text and examples contained herein may specifically describe an encoding process, one skilled in the art would readily understand that the same concepts and principles also apply to the corresponding decoding process and vice versa. Additionally, it should be noted that a bitstream to be decoded can be received from a remote device located within virtually any type of network. Additionally, the bitstream can be received from local hardware or software.
The foregoing description of embodiments of the present invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the present invention to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the present invention. The embodiments were chosen and described in order to explain the principles of the present invention and its practical application to enable one skilled in the art to utilize the present invention in various embodiments and with various modifications as are suited to the particular use contemplated.
Related U.S. Provisional Application Data: Application No. 60/757,819, filed January 2006 (US).