This invention relates generally to the field of video transcoding, and more particularly to the transcoding of scalable multi-layer videos to single layer videos.
Compression enables storing, transmitting, and processing of videos with fewer storage, network, and processor resources. The most widely used video compression standards include MPEG-1 for storage and retrieval of moving pictures, MPEG-2 for digital television, and H.263 for video conferencing, see ISO/IEC 11172-2:1993, “Information Technology—Coding of Moving Pictures and Associated Audio for Digital Storage Media up to about 1.5 Mbit/s—Part 2: Video,” D. LeGall, “MPEG: A Video Compression Standard for Multimedia Applications,” Communications of the ACM, Vol. 34, No. 4, pp. 46-58, 1991, ISO/IEC 13818-2:1996, “Information Technology—Generic Coding of Moving Pictures and Associated Audio Information—Part 2: Video,” 1994, ITU-T SG XV, DRAFT H.263, “Video Coding for Low Bitrate Communication,” 1996, and ITU-T SG XVI, DRAFT13 H.263+Q15-A-60 rev.0, “Video Coding for Low Bitrate Communication,” 1997.
These standards are relatively low-level specifications that deal primarily with spatial compression of images or frames, and the spatial and temporal compression of a sequence of frames. As a common feature, these standards perform compression on a per frame basis. These standards achieve high compression ratios for a wide range of applications.
Video coding standards such as MPEG-4, intended for multimedia applications, provide several new coding tools, including tools that improve coding efficiency and tools that support object-based coding and error resilience, see ISO/IEC 14496-2:1999, “Information technology—coding of audio/visual objects, Part 2: Visual.”
One of the main problems in delivering video content over networks is adapting the content to meet particular constraints imposed by users, networks and devices. Users require playback with minimal variation in perceived quality. However, dynamic network conditions often make this difficult to achieve.
Fine granular scalable (FGS) coding has been adopted by the MPEG-4 standard. The tools that support FGS coding are specified in an amendment of the MPEG-4 standard, ISO/IEC 14496-2:1999/FDAM4, “Information technology—coding of audio/visual objects, Part 2: Visual.” An overview of FGS coding is given by Li in “Overview of Fine Granularity Scalability in MPEG-4 video standard,” IEEE Trans. Circuits and Systems for Video Technology, March 2001.
FGS coding is a radical departure from traditional scalable coding. With traditional scalable coding, the video is coded into a base layer bitstream and possibly several enhancement layer bitstreams, where the granularity is only as fine as the number of enhancement layer bitstreams that are formed. The resulting rate-distortion (R-D) curve resembles a step-like function.
In contrast, FGS coding provides an enhancement layer bitstream that is continuously scalable. Providing a continuous scalable enhancement layer bitstream is accomplished by a bit-plane coding method that uses discrete cosine transform (DCT) coefficients. Bit-plane coding allows the enhancement layer bitstream to be truncated at any point. In that way, the quality of the reconstructed video is proportional to the number of decoded bits of the enhancement layer bitstream.
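The truncation property can be illustrated with a minimal Python sketch. It assumes integer-valued coefficients in sign-magnitude form and ignores the variable length coding that is actually applied to each plane. The function names are hypothetical. Keeping more of the most significant planes never increases the reconstruction error.

```python
def to_bit_planes(coeffs):
    """Split integer coefficients into signs and MSB-first bit-planes of |coeff|."""
    mags = [abs(c) for c in coeffs]
    signs = [-1 if c < 0 else 1 for c in coeffs]
    n_planes = max(mags).bit_length()  # 0 if every coefficient is zero
    planes = [[(m >> p) & 1 for m in mags] for p in range(n_planes - 1, -1, -1)]
    return signs, planes

def reconstruct(signs, planes, kept):
    """Rebuild coefficients from only the first `kept` (most significant) planes."""
    n_planes = len(planes)
    mags = [0] * len(signs)
    for i, plane in enumerate(planes[:kept]):
        weight = 1 << (n_planes - 1 - i)
        mags = [m + bit * weight for m, bit in zip(mags, plane)]
    return [s * m for s, m in zip(signs, mags)]

coeffs = [13, -6, 3, 0, -1]
signs, planes = to_bit_planes(coeffs)
for kept in range(len(planes) + 1):
    approx = reconstruct(signs, planes, kept)
    error = sum(abs(a - c) for a, c in zip(approx, coeffs))
    print(kept, approx, error)  # errors for kept = 0..4: 23, 15, 7, 3, 0
```

Each extra plane halves the weight of the bits that remain undecoded. This is why a truncation point can be placed anywhere without breaking decodability.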
An enhancement layer bitstream 105 is generated by subtracting reconstructed frames of the base layer bitstream 103 from the input video. This yields an FGS residual signal in the spatial domain. Enhancement layer encoding is then applied to the residual signal. The enhancement encoder 101 includes a DCT 190, followed by bit-plane shifting 192, a maximum operation 194, and bit-plane variable length coding (VLC) 196 to produce the enhancement layer bitstream 105.
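The first stages of the enhancement encoder are sketched below in Python. The sketch makes several simplifying assumptions. A 1-D orthonormal DCT-II on eight samples stands in for the 8x8 block DCT 190. Rounding to integers stands in for whatever precision the encoder uses. Bit-plane shifting 192 and VLC 196 are omitted. The function names are hypothetical. The maximum operation 194 is modeled as finding how many bit-planes the largest magnitude needs.

```python
import math

def dct_1d(x):
    """Orthonormal 1-D DCT-II; a stand-in for the 2-D 8x8 block DCT (190)."""
    n = len(x)
    return [
        (math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
        * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]

def fgs_residual(original, base_recon):
    """Spatial-domain FGS residual: input samples minus base layer reconstruction."""
    return [o - b for o, b in zip(original, base_recon)]

def planes_needed(coeffs):
    """'Maximum operation' (194): planes needed for the largest rounded magnitude."""
    return max(abs(round(c)) for c in coeffs).bit_length()

residual = fgs_residual([104, 102, 100, 98, 97, 99, 101, 103], [100] * 8)
coeffs = dct_1d(residual)
print([round(c) for c in coeffs], planes_needed(coeffs))
```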
A selective enhancement method to control the bit-plane shifting in the enhancement layer bitstream of the FGS coded video bitstream was described in U.S. Pat. No. 6,263,022, “System and method for fine granular scalable video with selective quality enhancement,” issued on Jul. 17, 2001 to Chen, et al. There, the quantization parameter used for coding the base layer video also determined the corresponding shifting factor. The bit-planes associated with macroblocks that were deemed more visually important were shifted higher.
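The effect of selective enhancement on bit-plane order can be seen in a minimal Python sketch. The shift factors are chosen by hand here, whereas Chen et al. derive them from the base layer quantization parameter. The function names are hypothetical. Left-shifting a macroblock's magnitudes moves its most significant bit into an earlier plane, so that macroblock is refined first when the stream is truncated. The decoder must undo the shift, which means the shift factor has to be known to it.

```python
def shift_macroblocks(blocks, shifts):
    """Selective enhancement: left-shift each macroblock's coefficient magnitudes
    by its shift factor so its bits move into more significant bit-planes."""
    return [[v << s for v in block] for block, s in zip(blocks, shifts)]

def first_plane(value, n_planes):
    """Index of the bit-plane (0 = sent first) holding value's MSB; None for zero."""
    return None if value == 0 else n_planes - value.bit_length()

blocks = [[5, 2], [5, 2]]                    # two macroblocks with identical magnitudes
shifted = shift_macroblocks(blocks, [0, 1])  # second one deemed more visually important
n_planes = max(max(b) for b in shifted).bit_length()
print(shifted, n_planes,
      first_plane(shifted[0][0], n_planes), first_plane(shifted[1][0], n_planes))
# [[5, 2], [10, 4]] 4 1 0
```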
Note that the base layer bitstream is coded at a predetermined minimum bit rate. The enhancement layer bitstream covers the range of rates and distortions from this minimum up to near-lossless reconstruction. Also, once the enhancement layer bitstream has been generated, it can be stored and reused many times. Depending on, e.g., current network conditions, an appropriate number of bits can be allocated to each frame and transmitted over the network. Note, however, that there is no quantization parameter to adjust in this scheme.
The MPEG-4 standard does not specify how rate allocation, or equivalently, the truncation of bits on a per frame basis, is to be done. The standard only specifies how the scalable bitstream is decoded. Additionally, traditional methods for modeling rate-distortion characteristics, e.g., methods based on quantization parameters, do not apply to the bit-plane coding scheme used by FGS coding. As a result, the perceived quality of the reconstructed video can vary noticeably.
Because differential sensitivity is key to human visual perception, it is important to minimize the variation in perceived quality rather than the overall distortion. Optimal rate allocation can be achieved by minimizing a cost according to an exponential R-D model. This leads to constant quality among the decoded frames, see Wang, et al., “A new rate allocation scheme for progressive fine granular scalable coding,” Proc. International Symposium on Circuits and Systems, 2001. However, prior art rate allocation methods typically rely on exhaustive searches that are not suitable for real-time applications, and they do not work for low bit-rate signals. In U.S. patent application Ser. No. 09/961,987, “Transcoder for Scalable Multi-Layer Constant Quality Video Bitstreams,” filed on Sep. 24, 2001 by Zhang, et al., an FGS-based transcoder is described that extracts R-D labeling points and provides an output bitstream with constant quality.
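The constant-quality idea can be made concrete with a small Python sketch. It assumes a per-frame exponential model D_i(R) = a_i · 2^(-b_i · R) with known, hypothetical parameters (a_i, b_i). It then solves for one common distortion level by bisection rather than by exhaustive search. This is an illustration of the principle only, not the method of Wang et al. or Zhang et al.

```python
import math

def rates_for_distortion(params, d):
    """Bits each frame needs to reach distortion d under D_i(R) = a_i * 2**(-b_i * R)."""
    return [max(0.0, math.log2(a / d) / b) for a, b in params]

def constant_quality_allocation(params, total_bits, iters=60):
    """Bisect on a common distortion level d so the per-frame rates sum to total_bits."""
    lo = 1e-12                      # tiny distortion: needs a very large number of bits
    hi = max(a for a, _ in params)  # d = max a_i: every frame needs zero bits
    for _ in range(iters):
        mid = math.sqrt(lo * hi)    # geometric midpoint, since d spans many decades
        if sum(rates_for_distortion(params, mid)) > total_bits:
            lo = mid                # over budget: accept more distortion
        else:
            hi = mid                # within budget: try for less distortion
    return hi, rates_for_distortion(params, hi)

# Two frames with hypothetical model parameters (a_i, b_i) and a 400-bit budget.
d, rates = constant_quality_allocation([(1000.0, 0.01), (4000.0, 0.01)], total_bits=400)
print(round(d, 3), [round(r, 3) for r in rates])  # 500.0 [100.0, 300.0]
```

The harder-to-code second frame receives three times the bits of the first, so both frames end up at the same distortion.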
The FGS coding method and rate allocation techniques described above are useful for transmission over dynamic channels. The key assumption made is that the receiving device has a decoder that can process both base and enhancement layer bitstreams. In practice, this may not always be true, especially for today's low-power mobile devices, such as cellular telephones and personal digital assistants (PDAs).
The only existing way to overcome this problem is to transmit only the base layer bitstream to the receiving device. The main drawback of that approach is that the base layer bitstream is usually coded at a minimum constant bit rate, which results in very low quality decoded video. Therefore, if additional bandwidth is available, both the connection and the device capabilities are under-utilized. To convert the FGS coded video to a single layer bitstream with higher quality than the base layer bitstream alone, some other means of transcoding is required.
As shown in
A prior art method for transcoding and scaling a bitstream has been described by Sun et al., in “Architectures for MPEG compressed bitstream scaling,” IEEE Transactions on Circuits and Systems for Video Technology, April 1996. There, four methods of rate reduction, with varying complexity and architecture, were described.
To obtain the correction DCT coefficients 430, the re-quantized DCT coefficients are inverse quantized 470 and subtracted 480 from the original partially decoded DCT coefficients. That difference is transformed to the spatial domain via an inverse DCT (IDCT) 490 and stored in a frame memory 495. The motion vectors associated with each incoming block are then used to retrieve the corresponding difference blocks by motion compensation 496. The corresponding blocks are then transformed via the DCT 430 to yield the correction component. A derivation of the method shown in
Assuncao et al. also described an alternative method for the same task. In that method, a motion compensation (MC) loop operating in the frequency domain was used for drift compensation. Approximate matrices were derived for fast computation of the MC blocks in the frequency domain, and a Lagrangian optimization was used to calculate the best quantizer scales for transcoding. That alternative method removed the need for the IDCT/DCT components.
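Why the correction loop of Sun et al. described above is needed can be shown with a scalar Python sketch. The sketch tracks a single DCT coefficient over successive P-frames. It assumes zero motion, so motion compensation of the stored difference is the identity, and it reduces the frame memory to one accumulated value. The names are hypothetical. Without correction, the requantization errors accumulate as drift. With correction, the drift stays within half a quantizer step.

```python
def requantize(residuals, step, correct_drift=True):
    """Requantize one DCT coefficient's prediction residual over successive P-frames.

    Returns the drift after each frame, i.e. the difference between what a decoder of
    the input stream and a decoder of the requantized stream reconstruct. Zero motion
    is assumed, so motion compensation of the stored error is the identity."""
    drift = 0.0
    history = []
    for r in residuals:
        target = r + drift if correct_drift else r   # add the correction term
        out = round(target / step) * step            # requantize, then inverse quantize
        drift += r - out                             # decoders diverge by r - out
        history.append(drift)
    return history

open_loop = requantize([3.0] * 10, step=8, correct_drift=False)
closed_loop = requantize([3.0] * 10, step=8)
print(open_loop[-1], max(abs(d) for d in closed_loop))  # 30.0 4.0
```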
The prior art clearly teaches methods of transcoding compressed single layer bitstreams for bit-rate reduction. However, the prior art does not teach a method of transcoding a multi-layer bitstream to a single layer bitstream. In particular, there is a need for efficient transcoding methods that convert a multi-layer bitstream consisting of a base layer and FGS enhancement layers to a single layer bitstream with higher quality than the base layer bitstream alone.
A method transcodes a compressed multi-layer video bitstream that includes a base layer bitstream and an enhancement layer bitstream. The base and enhancement layers are first partially decoded, and then the partially decoded signals are combined with a motion compensated signal yielding a combined signal.
The combined signal is quantized into an output signal according to a quantization parameter, and the output signal is variable length encoded as a single layer bitstream. In a preprocessing step, the enhancement layer can be truncated according to rate control constraints, and the same constraints can also be used during quantization.
a is a high-level block diagram of a prior art multi-layer bitstream to single layer bitstream transcoder;
b is a low-level block diagram of a prior art multi-layer to single layer bitstream transcoder;
Introduction
The present invention provides a system and method for transcoding a compressed multi-layer video bitstream, such as an FGS coded video bitstream, to a single layer bitstream. The transcoding according to the invention enables devices that only support single layer decoding to receive a higher quality video than produced by only decoding a base layer bitstream of the multi-layer bitstream.
Transcoding Architecture
If $X_{n-1}$ represents the reconstructed reference frame in the base layer decoder that is stored in frame store 510 at frame $n$, then the prediction residual of the next P-frame $X_n$ is defined as

$$\Delta X_n = X_n - \mathrm{MC}(X_{n-1}), \tag{1}$$

where $\mathrm{MC}(\cdot)$ denotes a motion compensation process.
In the DCT domain, this prediction residual can also be expressed in terms of signals present in the base and enhancement layer decoders,

$$\mathrm{DCT}(\Delta X_n) = B^* + E^*, \tag{2}$$

where $B^*$ and $E^*$ are the DCT coefficients reconstructed from the base layer bitstream and the enhancement layer bitstream, respectively.
In the base layer encoder, $R^*$ denotes the DCT coefficients reconstructed from the output bitstream, and $Y_{n-1}$ denotes the reconstructed reference frame in the base layer encoder that is stored in frame store 508. The DCT coefficients corresponding to $R^*$ are given by

$$R^* = \mathrm{DCT}\big(X_n - \mathrm{MC}(Y_{n-1})\big) + \Delta, \tag{3}$$
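where $\Delta$ denotes the quantization error induced by quantization 535 and inverse quantization 511. Substituting Equations (1) and (2) into (3) yields

$$R^* = B^* + E^* + C^* + \Delta, \tag{4}$$

where

$$C^* = \mathrm{DCT}\big(\mathrm{MC}(X_{n-1}) - \mathrm{MC}(Y_{n-1})\big). \tag{5}$$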
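Written out, the substitution that leads from Equation (3) to Equations (4) and (5) only adds and subtracts $\mathrm{MC}(X_{n-1})$ inside the DCT. It then uses the linearity of the DCT together with Equations (1) and (2):

```latex
\begin{aligned}
X_n - \mathrm{MC}(Y_{n-1})
  &= \big[X_n - \mathrm{MC}(X_{n-1})\big] + \big[\mathrm{MC}(X_{n-1}) - \mathrm{MC}(Y_{n-1})\big] \\
  &= \Delta X_n + \big[\mathrm{MC}(X_{n-1}) - \mathrm{MC}(Y_{n-1})\big], \\
R^* &= \mathrm{DCT}(\Delta X_n) + \mathrm{DCT}\big(\mathrm{MC}(X_{n-1}) - \mathrm{MC}(Y_{n-1})\big) + \Delta \\
    &= B^* + E^* + C^* + \Delta .
\end{aligned}
```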
Thus, the intermediate architecture shown in
To simplify this intermediate architecture further, we assume that $\mathrm{MC}(\cdot)$ is a linear operation and expand the definitions of $X_{n-1}$ and $Y_{n-1}$ according to
In order to transform transcoder 500 into a transcoder 600 according to the invention, as shown in
$$R^* = B^* + E^* + \mathrm{DCT}\Big(\mathrm{MC}\big(\mathrm{IDCT}(B^*_{n-1} - R^*_{n-1})\big)\Big). \tag{7}$$
Equation (7) equivalently represents the architecture illustrated in
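One frame-by-frame reading of Equation (7) is sketched below in Python, under the following assumptions:

- Coefficients are 1-D lists rather than 8x8 blocks.
- A circular shift by an integer motion vector stands in for the linear $\mathrm{MC}(\cdot)$.
- A uniform quantizer with a fixed step stands in for quantizer 535 and inverse quantizer 511.
- The right-hand side of Equation (7) is the quantizer input.
- $R^*_{n-1}$ is the inverse-quantized output of the previous frame.

The function names are hypothetical.

```python
import math

def dct(x):
    """Orthonormal 1-D DCT-II (stand-in for the 8x8 block DCT)."""
    n = len(x)
    return [(math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
            * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
            for k in range(n)]

def idct(c):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    return [sum((math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)) * c[k]
                * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for k in range(n))
            for i in range(n)]

def mc(block, mv):
    """Hypothetical linear MC: circular shift by an integer motion vector."""
    mv %= len(block)
    return block[-mv:] + block[:-mv] if mv else list(block)

def transcode(frames, step):
    """frames: per P-frame (B*, E*, mv), coefficients as equal-length lists.
    Returns the output quantizer levels, applying Eq. (7) before quantization."""
    n = len(frames[0][0])
    prev_b, prev_r = [0.0] * n, [0.0] * n
    levels_out = []
    for b, e, mv in frames:
        # Frame memory content IDCT(B*_{n-1} - R*_{n-1}), motion compensated, back to DCT.
        corr = dct(mc(idct([pb - pr for pb, pr in zip(prev_b, prev_r)]), mv))
        r_in = [bi + ei + ci for bi, ei, ci in zip(b, e, corr)]  # Eq. (7)
        levels = [round(v / step) for v in r_in]                 # quantizer 535
        levels_out.append(levels)
        prev_b = list(b)
        prev_r = [lv * step for lv in levels]                    # inverse quantizer 511
    return levels_out

frames = [([16, 0, 0, 0], [4, 0, 0, 0], 0), ([8, 0, 0, 0], [0, 0, 0, 0], 1)]
print(transcode(frames, step=4))  # [[5, 0, 0, 0], [1, 0, 0, 0]]
```

In the second frame, the correction term subtracts the enhancement already delivered in the first frame's reference ($B^*_{n-1} - R^*_{n-1} = -4$ at DC). Only one IDCT, one MC, and one DCT are needed per frame, with no full decode and re-encode.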
Rate Controlled Encoding
To perform rate control as shown in
The transcoder shown in
The same inputs to the rate control process are used. However, rather than only selecting an output quantization parameter to control the bit rate produced by the quantizer 535, the rate control is also jointly responsible for selecting the truncation point of the enhancement layer bitstream on a picture-by-picture basis using a switch 700. Truncating bit-plane data is similar to quantization or smoothing, but requires significantly less processing.
With the truncation operation, the enhancement layer signal $E^*$ is modified to a lower quality signal. The new signal is denoted by $\tilde{E}$, and the transcoder shown in
$$R^* = B^* + \tilde{E} + \mathrm{DCT}\Big(\mathrm{MC}\big(\mathrm{IDCT}(B^*_{n-1} - R^*_{n-1})\big)\Big). \tag{8}$$
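A minimal Python sketch shows how the switch 700 might turn a per-picture bit budget into a truncation point, and hence into $\tilde{E}$. It assumes that the coded size of each enhancement bit-plane is known, e.g., from the stored FGS bitstream. How the rate control chooses the budget jointly with the quantization parameter is not modeled. The function name is hypothetical. Because an FGS bitstream can be cut anywhere, the cut may fall inside a plane.

```python
def truncate_to_budget(plane_sizes, budget_bits):
    """Pick the FGS truncation point for one picture.

    plane_sizes: coded size in bits of each enhancement bit-plane, MSB plane first.
    Returns (number of complete planes kept, bits kept from the next plane)."""
    used = 0
    for i, size in enumerate(plane_sizes):
        if used + size > budget_bits:
            return i, budget_bits - used   # cut inside plane i
        used += size
    return len(plane_sizes), 0             # whole enhancement layer fits

print(truncate_to_budget([120, 400, 900, 1600], 1000))  # (2, 480)
```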
In most encoding and transcoding systems, the input signals are pre-processed to yield a source that is better suited to the quantization blocks. For example, the spectrum of the source can be lowpass filtered to reduce the information that must be removed during the quantization process. Removing that information by quantization alone, however, results in unpleasant blocking artifacts. Thus, a way of removing such information becomes important to achieve robust and efficient encoding and transcoding systems.
In many commercial encoders, the pre-processing is adjusted manually to achieve the best visual quality. For an FGS-based transcoder such as ours, it is beneficial to select the important information by bit-plane, because the information is already organized in order of priority. Thus, the rate control process can select 700 the number of bit-planes that yields a source with lower entropy for compression. The number of bit-planes is selected such that the quality is slightly higher than the target simple profile quality. In this way, the transcoder achieves seamless pre-processing, which is not possible with prior art transcoding systems.
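A minimal Python sketch of this selection is given below. It assumes that a quality estimate, e.g., PSNR in dB, is available for each possible number of kept bit-planes and does not decrease as planes are added. The 0.5 dB margin standing in for "slightly higher" and the function name are hypothetical.

```python
from bisect import bisect_left

def select_bit_planes(quality_by_planes, target_quality, margin=0.5):
    """Smallest number of bit-planes whose estimated quality reaches the target plus a margin.

    quality_by_planes[k]: estimated quality (e.g. PSNR in dB) with k planes kept,
    assumed non-decreasing in k. Falls back to all planes if the goal is unreachable."""
    goal = target_quality + margin
    k = bisect_left(quality_by_planes, goal)
    return min(k, len(quality_by_planes) - 1)

# Hypothetical PSNR estimates for 0..4 enhancement bit-planes on top of the base layer.
estimates = [28.0, 30.5, 32.0, 34.5, 37.0]
print(select_bit_planes(estimates, target_quality=31.2))  # 2
```

Choosing the smallest sufficient plane count discards the least significant planes, which carry the highest-entropy detail, before the signal reaches the quantizer.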
Application System
The system includes a camera 820 to acquire video content 801. The content is FGS encoded 830 and archived 840 as FGS coded bitstreams to enable scalable video delivery 850, via a network 870, to devices 860 that support the FGS profile. Terminal capabilities or constraints 802 are communicated to a server 880 to control format conversion, while network conditions 803 are used as input to the rate control at the server 880.
For devices that support the FGS profile, the server 880 truncates 850 the enhancement layer bitstream 805 to adapt to a variable transmission rate. For devices 861 that support only the simple profile, however, the FGS bitstream is transcoded 845 into a simple profile bitstream 806.
Because the FGS enhancement layer bitstream can be truncated at any location, the reconstructed video quality is proportional to the number of bits actually decoded. Thus, because the enhancement layer is used, the supplied quality is higher than that of traditional single layer bitstream transcoding methods.
The described FGS-to-simple transcoding according to the invention has substantially the same performance as the cascaded transcoding methods but with much less processing.
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
Number | Name | Date | Kind |
---|---|---|---|
5623312 | Yan et al. | Apr 1997 | A |
6043838 | Chen | Mar 2000 | A |
6263022 | Chen et al. | Jul 2001 | B1 |
6519285 | Yamaguchi et al. | Feb 2003 | B2 |
6785334 | van der Schaar et al. | Aug 2004 | B2 |
6920179 | Anand et al. | Jul 2005 | B1 |
20020034248 | Chen | Mar 2002 | A1 |
20030058931 | Zhang et al. | Mar 2003 | A1 |
20030058936 | Peng et al. | Mar 2003 | A1 |
20030081673 | Peng et al. | May 2003 | A1 |
20030118097 | Chen et al. | Jun 2003 | A1 |
20030142744 | Wu et al. | Jul 2003 | A1 |
20040071083 | Li et al. | Apr 2004 | A1 |
20040264791 | Jiang et al. | Dec 2004 | A1 |
Number | Date | Country |
---|---|---|
20030202579 A1 | Oct 2003 | US |