The invention relates generally to transcoding of compressed videos, and more specifically, to transcoding of compressed videos based on different transformation kernels.
MPEG-2 is a video coding standard developed by the Moving Picture Experts Group (MPEG) of ISO/IEC. It is currently the most widely used video coding standard. Its applications include digital television broadcasting, direct satellite broadcasting, DVD, and video surveillance. The transform used in MPEG-2, as well as in a variety of other video coding standards, is the discrete cosine transform (DCT). Therefore, an MPEG-encoded video uses DCT coefficients.
Advanced video coding according to the H.264/AVC standard is intended to significantly improve compression efficiency over earlier standards, including MPEG-2. This standard is expected to have a broad range of applications, including efficient video storage, video conferencing, and video broadcasting over a digital subscriber line (DSL). The AVC standard uses a low-complexity integer transform, hereinafter referred to as HT. Therefore, an encoded AVC video uses HT coefficients.
With the deployment of H.264/AVC, e.g., for mobile broadcasts, there is a need to convert videos in the MPEG-2 format to videos in the H.264/AVC format. This would enable more efficient network transmission and storage. In addition, there is also a need to convert H.264/AVC videos to MPEG-2 videos so that legacy MPEG-2 devices can process videos encoded according to the newer H.264/AVC format.
A transcoder simply decodes an encoded input video in an input format to reconstruct the image pixels of the original video, and then reencodes the decoded video in an output format. This is referred to as pixel-domain transcoding. With this pixel-domain transcoding, the transform coefficients must be mapped from the source format to the destination format.
The 8×8 block of pixels 102 is divided evenly into four 4×4 blocks (x1, x2, x3, x4) 103. Each of the four blocks 103 is passed to a corresponding HT 120 to generate four 4×4 blocks 104 of transform coefficients Y1, Y2, Y3 and Y4. The four blocks of transform coefficients are combined to form a single 8×8 block (Y) 105. This is repeated for all blocks of the video.
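As an illustration of this prior art pixel-domain path, the following minimal Python sketch assumes that the 8×8 pixel block is obtained by an orthonormal 8×8 IDCT of a DCT block (x = T8^T × X × T8, as in equation (5) below) and that HT is the H.264/AVC 4×4 forward core transform with matrix H. The function names are hypothetical, and plain lists are used so that the sketch has no dependencies.

```python
import math

# Assumed HT kernel: the H.264/AVC 4x4 forward core transform matrix.
H = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]


def dct_matrix(n):
    """Orthonormal n-point DCT-II kernel; row k holds the k-th basis vector."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / n)
             * math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n)]
            for k in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def pixel_domain_dct_to_ht(X):
    """Prior-art path: 8x8 IDCT, then a 4x4 HT on each quadrant x1..x4."""
    T8 = dct_matrix(8)
    x = matmul(matmul(transpose(T8), X), T8)            # reconstruct the pixels
    Y = [[0.0] * 8 for _ in range(8)]
    for r0 in (0, 4):                                   # quadrant row offset
        for c0 in (0, 4):                               # quadrant column offset
            block = [row[c0:c0 + 4] for row in x[r0:r0 + 4]]
            yb = matmul(matmul(H, block), transpose(H))  # Yi = H xi H^T
            for i in range(4):
                for j in range(4):
                    Y[r0 + i][c0 + j] = yb[i][j]
    return Y
```

For example, a DCT block whose only nonzero coefficient is a DC value of 8 reconstructs to a flat block of ones, so each 4×4 quadrant of the result carries an HT DC value of 16 (up to floating-point rounding) and all other coefficients are zero.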
It is desirable to perform the transcoding entirely in the compressed, or transform, domain so that reconstructing the image pixels is avoided. Transform-domain transcoding can be more efficient than the prior art pixel-domain transcoding because complete decoding and reencoding are not required.
Transform-domain transcoding requires conversion between input and output transform coefficients of the input and output video formats. This conversion is trivial when the input and output formats are identical because both formats are based on the same transformation kernel.
However, up to now, transform-domain transcoding between different input and output formats with different transformation kernels has not been possible because a method that directly converts transform coefficients that are based on different transformation kernels does not exist.
Therefore, there exists a need to provide a direct conversion between the transform coefficients of videos having different transformation kernels.
The invention transcodes an input video based on a first transformation kernel to an output video based on a second transformation kernel. The first and second transformation kernels are different, and the transcoding is performed entirely in a transform-domain. Coefficients of a single transform kernel matrix are determined. Then, input coefficients of the input video are converted to output coefficients of the output video using only the single transform kernel matrix.
The input video can be based on DCT coefficients, and the output video can be based on HT coefficients. Alternatively, the input video can be based on HT coefficients, and the output video can be based on DCT coefficients. In addition, the output video can have a reduced spatial resolution relative to the input video.
Our invention provides a method and system for transcoding an input video format based on a first transformation kernel to an output video format based on a second transformation kernel, where the first and second transformation kernels are different and the transcoding is performed entirely in the transform domain. Such a transcoding can be applied to the transcoding between MPEG-2 and H.264/AVC formats.
We describe a method for direct DCT-to-HT conversion, a method for direct HT-to-DCT conversion, as well as a method for direct DCT-to-HT conversion with down sampling to a lower resolution. In addition, fast algorithms and integer approximations to compute these various conversions are described.
We describe several transcoding systems that employ each of these conversions.
DCT-to-HT Conversion
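The direct DCT-to-HT conversion maps an 8×8 block X of DCT coefficients directly to an 8×8 block Y of HT coefficients. It can be represented by a transform kernel matrix S, which is an 8×8 matrix:
$$Y = S \times X \times S^{T} \tag{1}$$
where $S^{T}$ is the transpose of S. This transform is referred to as the S-transform, and is described in further detail below.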
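The notation used in the derivation is as follows: X is an 8×8 block of DCT coefficients; x is the 8×8 block of pixels obtained from X by the IDCT, divided into four 4×4 blocks x1, x2, x3 and x4; T8 is the 8×8 DCT transform kernel matrix; and H is the 4×4 HT transform kernel matrix.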
The derivation of the S-transform is described below.
The HT transforms of x1, x2, x3, and x4 are Y1, Y2, Y3, and Y4, i.e.,
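$$Y_1 = H \times x_1 \times H^{T} \tag{3.1}$$
$$Y_2 = H \times x_2 \times H^{T} \tag{3.2}$$
$$Y_3 = H \times x_3 \times H^{T} \tag{3.3}$$
$$Y_4 = H \times x_4 \times H^{T}. \tag{3.4}$$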
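If
$$Y = \begin{bmatrix} Y_1 & Y_2 \\ Y_3 & Y_4 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 & x_2 \\ x_3 & x_4 \end{bmatrix}, \quad HH = \begin{bmatrix} H & 0 \\ 0 & H \end{bmatrix},$$
where 0 denotes the 4×4 zero matrix, then we can rewrite equations (3.1) through (3.4) as a single equation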
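$$Y = HH \times x \times HH^{T}, \tag{4}$$
where x is the IDCT of X, i.e.,
$$x = T_8^{T} \times X \times T_8. \tag{5}$$
It then follows that
$$Y = HH \times T_8^{T} \times X \times T_8 \times HH^{T}. \tag{6}$$
Comparing equation (6) with equation (1), we have
$$S = HH \times T_8^{T}. \tag{7}$$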
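Equation (7) can be checked numerically. The following sketch assumes an orthonormal 8-point DCT kernel for T8 (rows are basis vectors, so that x = T8^T × X × T8) and the H.264/AVC 4×4 forward core transform matrix for H; both are assumptions, and the helper names are hypothetical. It builds HH as a block-diagonal matrix, evaluates S, and applies equation (1).

```python
import math

H = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]  # assumed HT kernel


def dct_matrix(n):
    """Orthonormal n-point DCT-II kernel; row k holds the k-th basis vector."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / n)
             * math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n)]
            for k in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def block_diag2(m):
    """Return [[m, 0], [0, m]] for a square matrix m (HH in equation (4))."""
    n = len(m)
    out = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            out[i][j] = out[i + n][j + n] = m[i][j]
    return out


T8 = dct_matrix(8)
HH = block_diag2(H)
S = matmul(HH, transpose(T8))          # equation (7): S = HH x T8^T


def s_transform(X):
    """Direct DCT-to-HT conversion of an 8x8 DCT block, equation (1)."""
    return matmul(matmul(S, X), transpose(S))
```

Under these assumptions, the first row of S evaluates to approximately a, b, 0, −c, 0, d, 0, −e, using the values listed under Fast DCT-to-HT Conversion, and s_transform gives the same result as the pixel-domain path up to floating-point rounding.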
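The direct DCT-to-HT transform is given by equation (1), and its transform kernel matrix S, rounded off to four decimal places, is:
$$S = \begin{bmatrix}
1.4142 & 1.2815 & 0 & -0.4500 & 0 & 0.3007 & 0 & -0.2549 \\
0 & 0.9236 & 2.2304 & 1.7799 & 0 & -0.8638 & -0.1585 & 0.4824 \\
0 & -0.1056 & 0 & 0.7259 & 1.4142 & 1.0864 & 0 & -0.5308 \\
0 & 0.1169 & 0.1585 & -0.0922 & 0 & 1.0379 & 2.2304 & 1.9750 \\
1.4142 & -1.2815 & 0 & 0.4500 & 0 & -0.3007 & 0 & 0.2549 \\
0 & 0.9236 & -2.2304 & 1.7799 & 0 & -0.8638 & 0.1585 & 0.4824 \\
0 & 0.1056 & 0 & -0.7259 & 1.4142 & -1.0864 & 0 & 0.5308 \\
0 & 0.1169 & -0.1585 & -0.0922 & 0 & 1.0379 & -2.2304 & 1.9750
\end{bmatrix}$$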
HT-to-DCT Conversion
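The direct HT-to-DCT conversion maps an 8×8 block YY of HT coefficients, consisting of four 4×4 HT blocks, directly to an 8×8 block XX of DCT coefficients:
$$XX = R \times YY \times R^{T} \tag{8}$$
This transform is referred to as the R-transform in this invention.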
The R-transform is not the inverse of the S-transform, i.e., the matrix R is not equal to the matrix $S^{-1}$, the inverse of S. The reason is that the transform kernel matrix of the inverse HT is not the inverse of the HT transform kernel matrix H, but rather a scaled version of $H^{-1}$ that facilitates integer implementation. Therefore, we use the name R-transform instead of inverse S-transform to maintain this distinction.
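The following are some additional notations: YY is an 8×8 block of HT coefficients consisting of four 4×4 HT blocks; xx is the 8×8 block of pixels reconstructed from YY by the inverse HT; and XX is the corresponding 8×8 block of DCT coefficients.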
The derivation of the R-transform is described below.
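Let $\tilde{H}_{inv}$ be the inverse-HT transform kernel matrix, i.e.,
$$\tilde{H}_{inv} = \begin{bmatrix} 1 & 1 & 1 & 1/2 \\ 1 & 1/2 & -1 & -1 \\ 1 & -1/2 & -1 & 1 \\ 1 & -1 & 1 & -1/2 \end{bmatrix}$$
Then, with
$$HH_{inv} = \begin{bmatrix} \tilde{H}_{inv} & 0 \\ 0 & \tilde{H}_{inv} \end{bmatrix},$$
it follows that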
$$xx = HH_{inv} \times YY \times HH_{inv}^{T}. \tag{11}$$
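The “scale” operation between the inverse HT and the DCT can be approximated by a divide operation. Therefore, we have
$$XX = \frac{T_8 \times xx \times T_8^{T}}{64} = \frac{T_8 \times HH_{inv}}{8} \times YY \times \left(\frac{T_8 \times HH_{inv}}{8}\right)^{T}. \tag{12}$$
By comparing equation (12) with equation (8), we obtain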
$$R = (T_8 \times HH_{inv}) / 8. \tag{13}$$
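Equation (13) can be evaluated the same way. This sketch assumes an orthonormal 8-point DCT kernel for T8 and the H.264/AVC inverse core transform matrix for the inverse-HT kernel (columns [1, 1, 1, 1], [1, 1/2, −1/2, −1], [1, −1, −1, 1], [1/2, −1, 1, −1/2]); both choices and all helper names are assumptions.

```python
import math

# Assumed inverse-HT kernel: the H.264/AVC 4x4 inverse core transform matrix.
H_INV = [[1, 1, 1, 0.5],
         [1, 0.5, -1, -1],
         [1, -0.5, -1, 1],
         [1, -1, 1, -0.5]]


def dct_matrix(n):
    """Orthonormal n-point DCT-II kernel; row k holds the k-th basis vector."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / n)
             * math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n)]
            for k in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def block_diag2(m):
    """Return [[m, 0], [0, m]] for a square matrix m (HHinv in equation (11))."""
    n = len(m)
    out = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            out[i][j] = out[i + n][j + n] = m[i][j]
    return out


T8 = dct_matrix(8)
HH_INV = block_diag2(H_INV)
R = [[v / 8.0 for v in row] for row in matmul(T8, HH_INV)]   # equation (13)


def r_transform(YY):
    """Direct HT-to-DCT conversion of an 8x8 HT block, equation (8)."""
    return matmul(matmul(R, YY), transpose(R))
```

With these assumptions, R reproduces to four decimals the constants listed under Fast HT-to-DCT Conversion; for example, R[0][0] ≈ 0.1768 and R[2][1] ≈ 0.1394 (0-based indices).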
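The direct HT-to-DCT transform is given by equation (8), and its transform kernel matrix R, rounded off to four decimal places, is:
$$R = \begin{bmatrix}
0.1768 & 0 & 0 & 0 & 0.1768 & 0 & 0 & 0 \\
0.1602 & 0.0577 & -0.0132 & 0.0073 & -0.1602 & 0.0577 & 0.0132 & 0.0073 \\
0 & 0.1394 & 0 & 0.0099 & 0 & -0.1394 & 0 & -0.0099 \\
-0.0562 & 0.1112 & 0.0907 & -0.0058 & 0.0562 & 0.1112 & -0.0907 & -0.0058 \\
0 & 0 & 0.1768 & 0 & 0 & 0 & 0.1768 & 0 \\
0.0376 & -0.0540 & 0.1358 & 0.0649 & -0.0376 & -0.0540 & -0.1358 & 0.0649 \\
0 & -0.0099 & 0 & 0.1394 & 0 & 0.0099 & 0 & -0.1394 \\
-0.0319 & 0.0301 & -0.0663 & 0.1234 & 0.0319 & 0.0301 & 0.0663 & 0.1234
\end{bmatrix}$$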
Fast DCT-to-HT Conversion
The sparseness and symmetry in S can be exploited to perform fast computation of the S-transform. Let values a, . . . , s be
a=1.4142, b=1.2815, c=0.45, d=0.3007, e=0.2549, f=0.9236, g=2.2304, h=1.7799, i=0.8638, j=0.1585, k=0.4824, l=0.1056, m=0.7259, n=1.0864, o=0.5308, p=0.1169, q=0.0922, r=1.0379, s=1.975.
As suggested by equation (1), the 2D S-transform is a separable transform. Therefore, it can be computed through 1D transforms, i.e., column transforms followed by row transforms. Hence, we describe only the computation of the 1D transform.
Let z be an 8-point column vector, and let Z be its 1D S-transform, i.e., Z = S × z. The following steps determine Z efficiently from z.
m1=a×z[1]
m2=b×z[2]−c×z[4]+d×z[6]−e×z[8]
m3=g×z[3]−j×z[7]
m4=f×z[2]+h×z[4]−i×z[6]+k×z[8]
m5=a×z[5]
m6=−l×z[2]+m×z[4]+n×z[6]−o×z[8]
m7=j×z[3]+g×z[7]
m8=p×z[2]−q×z[4]+r×z[6]+s×z[8]
Z[1]=m1+m2
Z[2]=m3+m4
Z[3]=m5+m6
Z[4]=m7+m8
Z[5]=m1−m2
Z[6]=m4−m3
Z[7]=m5−m6
Z[8]=m8−m7
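The steps above translate directly into code. The following Python sketch implements them with 0-based indices (z[0] here is z[1] in the text) and the constants a through s given above, and applies them separably (columns, then rows) to obtain the 2D S-transform. The reference function exact_s_matrix is included only for checking; it assumes an orthonormal 8-point DCT and the H.264/AVC forward core transform matrix for H. All function names are hypothetical.

```python
import math


def fast_s_1d(z):
    """1D S-transform Z = S z using the steps above (22 multiplies, 22 adds)."""
    a, b, c, d, e = 1.4142, 1.2815, 0.45, 0.3007, 0.2549
    f, g, h, i, j = 0.9236, 2.2304, 1.7799, 0.8638, 0.1585
    k, l, m, n, o = 0.4824, 0.1056, 0.7259, 1.0864, 0.5308
    p, q, r, s = 0.1169, 0.0922, 1.0379, 1.975
    m1 = a * z[0]
    m2 = b * z[1] - c * z[3] + d * z[5] - e * z[7]
    m3 = g * z[2] - j * z[6]
    m4 = f * z[1] + h * z[3] - i * z[5] + k * z[7]
    m5 = a * z[4]
    m6 = -l * z[1] + m * z[3] + n * z[5] - o * z[7]
    m7 = j * z[2] + g * z[6]
    m8 = p * z[1] - q * z[3] + r * z[5] + s * z[7]
    return [m1 + m2, m3 + m4, m5 + m6, m7 + m8,
            m1 - m2, m4 - m3, m5 - m6, m8 - m7]


def fast_s_2d(X):
    """Separable 2D S-transform S X S^T: column transforms, then row transforms."""
    cols = [fast_s_1d([X[row][col] for row in range(8)]) for col in range(8)]
    sx = [[cols[col][row] for col in range(8)] for row in range(8)]  # S X
    return [fast_s_1d(sx[row]) for row in range(8)]                  # (S X) S^T


def exact_s_matrix():
    """Reference S = HH x T8^T (equation (7)), for checking only."""
    H = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
    t8 = [[math.sqrt((1.0 if kk == 0 else 2.0) / 8) * math.cos((2 * nn + 1) * kk * math.pi / 16)
           for nn in range(8)] for kk in range(8)]
    # Row `row` of HH holds a copy of H in the block selected by row // 4.
    return [[sum(H[row % 4][t] * t8[col][4 * (row // 4) + t] for t in range(4))
             for col in range(8)] for row in range(8)]
```

Because the constants are rounded to four decimal places, fast_s_1d agrees with the exact matrix product only to within a small tolerance that grows with the magnitude of the input.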
This method needs twenty-two multiplications and twenty-two additions. Because the 2-D S-transform consists of eight column transforms and eight row transforms, it needs 352 (16×22) multiplications and 352 (16×22) additions, for a total of 704 operations.
The pixel-domain implementation, as illustrated in
Thus, the fast S-transform according to the invention saves about 30% of the operations when compared to the prior art pixel-domain implementation. In addition, the S-transform can be implemented in just two stages, whereas the prior art pixel-domain processing using the reference IDCT requires six stages.
Fast HT-to-DCT Conversion
Similar to the case of S-transform, let
aa=0.1768, bb=0.1602, cc=0.0562, dd=0.0376, ee=0.0319, ff=0.0577, gg=0.1394, hh=0.1112, ii=0.0540, jj=0.0099, kk=0.0301, ll=0.0132, mm=0.0907, nn=0.1358, oo=0.0663, pp=0.0073, qq=0.0058, rr=0.0649, ss=0.1234.
As can be seen from equation (8), the 2D R-transform is also separable. It can be computed through 1D transforms, i.e., column transforms followed by row transforms. Hence, we show only the computation of the 1D transform. Let ZZ be an 8-point column vector, and let zz be its 1D R-transform, i.e., zz = R × ZZ. The following steps determine zz from ZZ.
m1=ZZ[1]+ZZ[5]
m2=ZZ[1]−ZZ[5]
m3=ZZ[2]−ZZ[6]
m4=ZZ[2]+ZZ[6]
m5=ZZ[3]+ZZ[7]
m6=ZZ[3]−ZZ[7]
m7=ZZ[4]−ZZ[8]
m8=ZZ[4]+ZZ[8]
zz[1]=aa×m1
zz[2]=bb×m2+ff×m4−ll×m6+pp×m8
zz[3]=gg×m3+jj×m7
zz[4]=−cc×m2+hh×m4+mm×m6−qq×m8
zz[5]=aa×m5
zz[6]=dd×m2−ii×m4+nn×m6+rr×m8
zz[7]=−jj×m3+gg×m7
zz[8]=−ee×m2+kk×m4−oo×m6+ss×m8
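A minimal Python sketch of these steps follows, with 0-based indices (ZZ[0] here is ZZ[1] in the text). The sketch uses jj = +0.0099, which is the sign obtained when R is evaluated from equation (13) under the assumptions stated in the comments. The reference function exact_r_matrix, used only for checking, assumes an orthonormal 8-point DCT and the H.264/AVC inverse core transform matrix for the inverse-HT kernel. The function names are hypothetical.

```python
import math


def fast_r_1d(ZZ):
    """1D R-transform zz = R ZZ using the steps above; indices are 0-based."""
    aa, bb, cc, dd, ee = 0.1768, 0.1602, 0.0562, 0.0376, 0.0319
    ff, gg, hh, ii, jj = 0.0577, 0.1394, 0.1112, 0.0540, 0.0099  # jj taken as +0.0099
    kk, ll, mm, nn, oo = 0.0301, 0.0132, 0.0907, 0.1358, 0.0663
    pp, qq, rr, ss = 0.0073, 0.0058, 0.0649, 0.1234
    # Butterfly stage: sums and differences of samples four positions apart.
    m1, m2 = ZZ[0] + ZZ[4], ZZ[0] - ZZ[4]
    m3, m4 = ZZ[1] - ZZ[5], ZZ[1] + ZZ[5]
    m5, m6 = ZZ[2] + ZZ[6], ZZ[2] - ZZ[6]
    m7, m8 = ZZ[3] - ZZ[7], ZZ[3] + ZZ[7]
    # Multiplication stage.
    return [aa * m1,
            bb * m2 + ff * m4 - ll * m6 + pp * m8,
            gg * m3 + jj * m7,
            -cc * m2 + hh * m4 + mm * m6 - qq * m8,
            aa * m5,
            dd * m2 - ii * m4 + nn * m6 + rr * m8,
            -jj * m3 + gg * m7,
            -ee * m2 + kk * m4 - oo * m6 + ss * m8]


def exact_r_matrix():
    """Reference R = (T8 x HHinv) / 8 (equation (13)), for checking only."""
    h_inv = [[1, 1, 1, 0.5], [1, 0.5, -1, -1], [1, -0.5, -1, 1], [1, -1, 1, -0.5]]
    t8 = [[math.sqrt((1.0 if k == 0 else 2.0) / 8) * math.cos((2 * n + 1) * k * math.pi / 16)
           for n in range(8)] for k in range(8)]
    # Column `col` of HHinv holds a copy of h_inv in the block selected by col // 4.
    return [[sum(t8[row][4 * (col // 4) + t] * h_inv[t][col % 4] for t in range(4)) / 8.0
             for col in range(8)] for row in range(8)]
```

Like the S-transform flow, this uses twenty-two multiplications and twenty-two additions per 1D transform, with the additions performed first as a butterfly stage.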
Integer Approximation of Fast DCT-to-HT Conversion
Floating-point operations are generally more expensive to implement than integer operations. Therefore, we also provide an integer approximation for the S-transform.
We multiply S by an integer that is a power of two, and use the integer transform kernel matrix to perform the operation using integer-arithmetic. Then, the resulting coefficients are scaled down by shifting. In video transcoding applications, the shifting operations can be absorbed during quantization. Therefore, no additional computations are required to use integer arithmetic.
The larger the integer we select, the more accuracy we can achieve. In many applications, the integer is limited by the word length of the microprocessor on which the transcoding is performed. We describe how to choose the integer such that the computation can be performed using 32-bit arithmetic, which is within the capability of most microprocessors.
For the case of the DCT-to-HT conversion, the DCT coefficients lie in the range [−2048, 2047]. This is a dynamic range of 4096, which needs 12 bits to represent. The gain of the 2D S-transform is at most 42, which needs log2(42) ≈ 5.4 bits. Therefore, 17.4 bits are needed to represent the final S-transform results. Because the scaling factor multiplies the kernel matrix in both the column transforms and the row transforms, the results grow by the square of the scaling factor. To be able to use 32-bit arithmetic, the scaling factor is therefore made smaller than $\sqrt{2^{32-17.4}} = 2^{7.3} \approx 158$. The maximum integer satisfying this condition while being a power of two is 128.
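The bit-budget reasoning above can be written as a small helper. This is a sketch with a hypothetical function name; it uses the exact log2 of the gain rather than the rounded bit counts in the text, which leads to the same power of two.

```python
import math


def max_pow2_scale(input_bits, gain, word_bits=32):
    """Largest power-of-two scale that keeps a 2D integer transform within word_bits.

    The scale multiplies the kernel in both the column and the row passes, so the
    output grows by scale**2; hence scale must be below sqrt(2**(word_bits - out_bits)).
    """
    out_bits = input_bits + math.log2(gain)    # bits needed for the unscaled result
    limit = math.sqrt(2.0 ** (word_bits - out_bits))
    scale = 1
    while scale * 2 < limit:                   # stay strictly below the limit
        scale *= 2
    return scale


print(max_pow2_scale(12, 42))   # S-transform case: 128
```

With the gains quoted later in this description, max_pow2_scale(12, 0.3416) returns 1024 and max_pow2_scale(12, 11.42) returns 256, matching the R-transform and Sd-transform cases.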
Therefore, the integer transform kernel matrix is
Comparing SI with S, we notice that the number of zero elements and the symmetry remain the same. Therefore, the method and flow-graph derived for the S-transform are also applicable to the integer approximation, as long as the values a through s are replaced with the corresponding elements of the matrix SI, instead of S.
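A minimal integer-arithmetic sketch of this approximation follows. Because the elements of SI are not listed here, the sketch assumes SI = round(128 × S) elementwise, with S evaluated from equation (7) using an orthonormal 8-point DCT and the H.264/AVC forward core transform matrix for H (all assumptions). It computes SI × X × SI^T with integers and removes the 128² = 2^14 gain with a rounding right shift, which in a transcoder would be folded into quantization as described above. The names are hypothetical, and for clarity the sketch uses direct matrix products rather than the fast flow.

```python
import math

H = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]  # assumed HT kernel
SCALE_SHIFT = 7                                   # scaling factor 128 = 2**7


def s_matrix():
    """Floating-point S = HH x T8^T, equation (7), with an orthonormal DCT."""
    t8 = [[math.sqrt((1.0 if k == 0 else 2.0) / 8) * math.cos((2 * n + 1) * k * math.pi / 16)
           for n in range(8)] for k in range(8)]
    return [[sum(H[row % 4][t] * t8[col][4 * (row // 4) + t] for t in range(4))
             for col in range(8)] for row in range(8)]


S = s_matrix()
SI = [[round(v * (1 << SCALE_SHIFT)) for v in row] for row in S]   # assumed SI


def int_s_transform(X):
    """Integer 2D S-transform: (SI X SI^T) >> 14, rounding to nearest (ties up)."""
    tmp = [[sum(SI[r][t] * X[t][c] for t in range(8)) for c in range(8)] for r in range(8)]
    full = [[sum(tmp[r][t] * SI[c][t] for t in range(8)) for c in range(8)] for r in range(8)]
    shift = 2 * SCALE_SHIFT                       # one factor of 128 per pass
    return [[(v + (1 << (shift - 1))) >> shift for v in row] for row in full]
```

Rounding preserves the zero pattern of S in SI, which is why the same fast flow still applies with the integer constants.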
Integer Approximation of Fast HT-to-DCT Conversion
We also provide the integer approximation of the method for the R-transform. We multiply R by an integer that is a power of two, and use the resulting integer transform kernel matrix to perform the operation using integer arithmetic. Then, the resulting coefficients are scaled down through shifting.
For the case of HT-to-DCT conversion, the HT coefficients have a 12-bit dynamic range. The gain of the 2D R-transform is at most 0.3416, which actually reduces the dynamic range to about 11 bits. To be able to use 32-bit arithmetic, the scaling factor must be smaller than $\sqrt{2^{32-11}} = 2^{10.5} \approx 1448$. The maximum integer satisfying this condition while being a power of two is 1024.
Therefore, the integer transform kernel matrix is
Comparing RI with R, we notice that the number of zero elements and the symmetry remain the same. Therefore, the method and flow-graph derived for R-transform are also applicable to the integer approximation, as long as the values aa through ss are replaced with the corresponding elements of the matrix RI, instead of R.
DCT-to-HT Down Sampling Conversion
For MPEG-2 to H.264/AVC transcoding with spatial resolution reduction, the DCT-to-HT coefficient conversion with down sampling is useful.
$$Y_d = S_d \times X_1 \times S_d^{T} \tag{14}$$
This transform is referred to as the Sd-transform, and is described in further detail below.
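Some notations used in the derivation are as follows: X1 is a 4×4 block of DCT coefficients; x1 is the 4×4 block of pixels obtained from X1 by the IDCT; T4 is the 4×4 DCT transform kernel matrix; H is the 4×4 HT transform kernel matrix; and Yd is the 4×4 block of HT coefficients of x1.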
The derivation of the Sd-transform is provided below.
The inverse DCT of X1 is x1, i.e.,
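$$x_1 = T_4^{T} \times X_1 \times T_4. \tag{15}$$
The HT transform of x1 is Yd, i.e.,
$$Y_d = H \times x_1 \times H^{T},$$
and substituting equation (15) gives
$$Y_d = H \times T_4^{T} \times X_1 \times T_4 \times H^{T}.$$
Comparing this result with equation (14), we have
$$S_d = H \times T_4^{T}. \tag{16}$$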
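The down sampling DCT-to-HT transform is given by equation (14), and its transform kernel matrix Sd, rounded off to four decimal places, is:
$$S_d = \begin{bmatrix} \alpha & 0 & 0 & 0 \\ 0 & \beta & 0 & -\gamma \\ 0 & 0 & \alpha & 0 \\ 0 & \gamma & 0 & \beta \end{bmatrix}$$
where α=2, β=3.1543, and γ=0.2242.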
Following the same principle as for the S-transform, we derive the fast method from the sparseness and symmetry of the transform kernel matrix Sd.
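A minimal Python sketch of the Sd-transform follows. It evaluates equation (16) under the assumption of an orthonormal 4-point DCT for T4 and the H.264/AVC forward core transform matrix for H, and it exploits the zero pattern of Sd so that each 1D transform of a 4-point vector needs six multiplications and two additions. The 2D transform is applied separably to a 4×4 DCT block X1. All names are hypothetical.

```python
import math

H = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]  # assumed HT kernel
T4 = [[math.sqrt((1.0 if k == 0 else 2.0) / 4) * math.cos((2 * n + 1) * k * math.pi / 8)
       for n in range(4)] for k in range(4)]                      # orthonormal 4-point DCT
SD = [[sum(H[r][t] * T4[c][t] for t in range(4)) for c in range(4)]
      for r in range(4)]                                          # equation (16)

ALPHA, BETA, GAMMA = 2.0, 3.1543, 0.2242


def fast_sd_1d(z):
    """1D Sd-transform using the zero pattern and symmetry of Sd (0-based indices)."""
    return [ALPHA * z[0],
            BETA * z[1] - GAMMA * z[3],
            ALPHA * z[2],
            GAMMA * z[1] + BETA * z[3]]


def fast_sd_2d(X1):
    """Separable 2D Sd-transform of a 4x4 DCT block: columns, then rows."""
    cols = [fast_sd_1d([X1[r][c] for r in range(4)]) for c in range(4)]
    sdx = [[cols[c][r] for c in range(4)] for r in range(4)]      # Sd X1
    return [fast_sd_1d(sdx[r]) for r in range(4)]                 # (Sd X1) Sd^T
```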
The DCT coefficients have a 12-bit dynamic range. The gain of the 2D Sd-transform is at most 11.42, which increases the dynamic range to 15.52 bits. To be able to use 32-bit arithmetic, the scaling factor must be smaller than $\sqrt{2^{32-15.52}} = 2^{8.24} \approx 302$. The maximum integer satisfying this condition while being a power of two is 256.
Therefore, the integer transform kernel matrix for 32-bit arithmetic is given as follows:
The method for Sd-transform is also applicable to the integer approximation, as long as the values α through γ are replaced with the corresponding elements of the matrix SId, instead of Sd.
Transcoding
FIGS. 10A-C show how the transforms described in this invention are used for transcoding intra-frames.
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
This application is related to U.S. patent application Ser. No. ______, “Selecting Macroblock Coding Modes for Video Encoding,” co-filed herewith by Xin et al., on Jun. 1, 2004, and incorporated herein by reference.