Digital video such as DirecTV and DVD applications has been growing in popularity. Digitizing a video signal generates huge amounts of data. Frames of pixels are generated many times per second, and each frame has many pixels. Each pixel has a plurality of bits which define its luminance (brightness) and two different sets of bits which define its color.
A digital video signal is often represented in a YCbCr format, which follows the human visual perception model. Y is the luminance (or luma) information, and Cb and Cr are the chrominance (or chroma) information. The human eye is most sensitive to the luminance information, as that is where the detail of edges is found; the chrominance information is less important. For this reason, the Cb and Cr channels are often subsampled by a factor of 2 in the horizontal and vertical dimensions in order to reduce the amount of data in the representation. Such a format is referred to as YCbCr 4:2:0.
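For a sense of the savings from 4:2:0 subsampling, the following sketch (hypothetical helper functions, for illustration only) computes the per-frame byte counts for 8-bit samples with and without chroma subsampling.

```c
#include <stdio.h>
#include <stddef.h>

/* Bytes per frame with full-resolution chroma (4:4:4), 8 bits per sample. */
static size_t ycbcr444_bytes(size_t w, size_t h) { return 3 * w * h; }

/* Bytes per frame with 4:2:0 subsampling: full-resolution Y plane plus
   Cb and Cr planes subsampled by 2 horizontally and vertically. */
static size_t ycbcr420_bytes(size_t w, size_t h)
{
    size_t luma   = w * h;
    size_t chroma = (w / 2) * (h / 2);
    return luma + 2 * chroma;
}

int main(void)
{
    /* For a 1920x1080 frame: 6,220,800 bytes vs 3,110,400 bytes. */
    printf("4:4:4: %zu bytes, 4:2:0: %zu bytes\n",
           ycbcr444_bytes(1920, 1080), ycbcr420_bytes(1920, 1080));
    return 0;
}
```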
The huge amount of data involved in representing a video signal cannot be transmitted or stored practically because of the sheer volume and limitations on channel bandwidth and media storage capacity; compression is therefore necessary. Because video has high spatial and temporal redundancy (the first relating to the fact that neighboring pixels within a frame are similar, and the second relating to the fact that two subsequent frames are similar), removing such redundancy is the basis of modern video compression approaches. Compression, generally speaking, tries to predict a frame from the previous frames, exploiting temporal redundancy, and tries to predict parts of a frame from other parts of the same frame, exploiting spatial redundancy. Only the difference information is transmitted or stored. MPEG2 and MPEG4 are examples of compression standards which are familiar today.
In the last few years, High Definition (HD) television formats have been gaining popularity. HD complicates the data volume problem because HD formats use even more pixels than the standard NTSC signals most people are familiar with.
The H.264 Advanced Video Codec (AVC) is the most recent standard in video compression. This standard was developed by the Joint Video Team of ITU-T and MPEG groups. It offers significantly better compression rate and quality compared to MPEG2/MPEG4. The development of this standard has occurred simultaneously with the proliferation of HD content. The H.264 standard is very computationally intensive. This computational intensity and the large frame size of HD format signals pose great challenges for real-time implementation of the H.264 codec.
To date some attempts have been made in the prior art to implement H.264 codecs on general purpose sequential processors. For example, Nokia, Apple Computer and Ateme have all attempted implementations of the H.264 standard in software on general purpose sequential computers or embedded systems using Digital Signal Processors. Currently, none of these systems is capable of performing real time H.264 encoding in full HD resolutions.
Parallel general purpose architectures such as Digital Signal Processors (DSPs) have been considered in the prior art for speeding up computationally-intensive components of the H.264 codec. For example, DSPs were used for the motion estimation and deblocking processes in papers by H. Li et al., Accelerated Motion Estimation of H.264 on Imagine Stream Processor, Proceedings of ICIAR, p. 367-374 (2005) and J. Sankaran, Loop Deblocking of Block Coded Video in a Very Long Instruction Word Processor, U.S. Patent Application Publication 20050117653, (June 2005 Texas Instruments). DSPs are well adapted for performing one dimensional filtering, but they lack the capability of processing two-dimensional data as required in digital video processing and coding applications.
There also exist in the prior art hardware implementations custom tailored for H.264 codecs, including chips by Broadcom, Conexant, Texas Instruments and Sigma Designs. Special architectures were proposed for some computationally-intensive components of the H.264 codec. Some examples follow.
1) Intra-prediction schemes are taught by Drezner, D, Advanced Video Coding Intra Prediction Scheme, U.S. Patent Application 20050276326 (December 2005 Broadcom), and Dottani et al., Intra 4×4 Modes 3, 7 and 8 Availability Determination Intra Estimation and Compensation, U.S. Pat. No. 7,010,044 (March 2006 LSI Logic);
2) Inverse transform and prediction in a pipelined architecture is taught in Luczak et al., A Flexible Architecture for Image Reconstruction in H.264/AVC Decoders, Proceedings ECCTD (2005). This paper presents a pipelined architecture to do image reconstruction using bit serial algorithms on a pipeline using an intra 4×4 predictor architecture, adder grid and plane predictor and a 1-D inverse transformation engine of
3) Video data structures are taught by Linzer et al., 2-D Luma and Chroma DMA Optimized for 4 Memory Banks, U.S. Pat. No. 7,015,918 (March 2006 LSI Logic).
4) Basic operations such as scan conversion are taught by Mimar, Fast and Flexible Scan Conversion and Matrix Transpose in SIMD Processor, U.S. Pat. No. 6,963,341 (November 2005).
For the in-loop deblocking filter in the H.264 standard, several special architectures were proposed:
1) V. Venkatraman et al., Architecture for Deblocking Filter in H.264, Proceedings Picture Coding Symposium (2004), proposed a hardware accelerator which is optimized for H.264 deblocking computations and requires a general purpose processor and additional components to implement the entire codec.
2) A pipelined deblocking filter is taught by Kim, Y.-K. et al., Pipeline Deblocking Filter, U.S. Patent Application Publication 20060115002 (June 2006 Samsung Electronics).
3) Parallel processing of the deblocking filter is taught by Dang, P. P., Method and Apparatus for Parallel Processing of In-Loop Deblocking Filter for H.264 Video Compression Standard, U.S. Patent Application Publication 20060078052 (December 2005).
4) J. Li, Deblocking Filter Process with Local Buffers, U.S. Patent Application 20060029288 (February 2006) teaches a memory buffer architecture for a deblocking filter.
Several companies are mass-producing custom chips capable of decoding H.264/AVC video. Chips capable of real-time decoding at high-definition picture resolutions include these:
Such chips will allow widespread deployment of low-cost devices capable of playing H.264/AVC video at standard-definition and high-definition television resolutions.
Many other hardware implementations are deployed in various markets, ranging from inexpensive consumer electronics to real-time FPGA-based encoders for broadcast. A few of the more familiar hardware product offerings for H.264/AVC include these:
There still exists, however, a need for a highly parallel architecture, and for processes that use the data independence of macroblocks in video frames on highly parallel computer architectures adapted to efficiently perform operations on two-dimensional signals expressed in the form of 4×4 matrices of integers, which can be used both for H.264 compression and for other compression standards such as MPEG2/MPEG4.
The present invention is a method and apparatus to perform deblocking filtering on any parallel processing platform to speed it up. The general notion here is to speed up the deblocking process by dividing the problem up into sub-problems which are data independent of each other such that each sub-problem can be solved on a separate computational path in any parallel processing architecture.
The genus of the invention is defined by the following characteristics which all species within the genus will share:
1) simultaneous deblocking of vertical luma edges during at least some of a plurality of iterations on at least some of a plurality of computational units of a parallel processing architecture computer, and simultaneous deblocking of both vertical and horizontal luma edges during at least some of a plurality of iterations on at least some of a plurality of computational units of a parallel processing architecture computer;
2) the order of deblocking of both horizontal and vertical edges is determined by both raster scan order and data dependency;
3) if there are enough computational units available such that some are idle during some iterations, then idle computational units are used to deblock vertical and/or horizontal chroma channel edges simultaneously with deblocking of vertical and/or horizontal luma edges or simultaneous deblocking of multiple vertical chroma edges alone during at least some of a plurality of iterations on at least some of a plurality of computational units of a parallel processing architecture computer and simultaneous deblocking of multiple horizontal chroma edges alone during at least some of a plurality of iterations on at least some of a plurality of computational units of a parallel processing architecture computer, wherein the order of deblocking of chroma vertical and horizontal edges is determined by raster scan order and data dependency, and wherein whether or not simultaneous deblocking of luma and chroma edges occurs on some of said plurality of edges depends upon the number of computational units available;
4) simultaneous filtering of several lines of pixels in the blocks for deblocking of each edge.
In the preferred class of embodiments the luma and chroma edges are divided into six sets. The vertical luma edges form the first set of edges. The horizontal luma edges form the second set of edges. The vertical Cb chroma edges form the third set of edges, the horizontal Cb chroma edges form the fourth set of edges, the vertical Cr chroma edges form the fifth set of edges, and the horizontal Cr chroma edges form the sixth set of edges. The processing of each of these sets of edges is carried out on a plurality of computational units referred to herein as clusters, in a set of iterations determined by the data dependency between a set of edges and other sets of edges. The processing is carried out such that the first set of edges is deblocked by a first set of clusters in a first set of iterations, and so on for the rest of the sets of edges, mutatis mutandis. During this processing, the sets of clusters and sets of iterations may be partially or completely overlapping or completely disjoint depending upon the number of clusters available. Overlap of sets of iterations implies simultaneous processing of parts or entire sets of edges. Overlap of sets of clusters implies that processing of different parts of sets of edges is allocated to the same computational units.
Digital video is a type of video recording system that works by using a digital, rather than analog, representation of the video signal. This generic term is not to be confused with DV, which is a specific type of digital video. Digital video is most often recorded on tape, then distributed on optical discs, usually DVDs.
Video compression refers to making a digital video signal use less data, without noticeably reducing the quality of the picture. In broadcast engineering, digital television (DVB, ATSC and ISDB) is made practical by video compression. TV stations can broadcast not only HDTV, but multiple virtual channels on the same physical channel as well. It also conserves precious bandwidth on the radio spectrum. Nearly all digital video broadcast today uses the MPEG-2 standard video compression format, although H.264/MPEG-4 AVC and VC-1 are emerging contenders in that domain.
MPEG-2 is the designation for a group of coding and compression standards for Audio and Video (AV), agreed upon by MPEG (Moving Picture Experts Group), and published as the ISO/IEC 13818 international standard. MPEG-2 is typically used to encode audio and video for broadcast signals, including direct broadcast satellite (DirecTV or Dish Network) and Cable TV. MPEG-2, with some modifications, is also the coding format used by standard commercial DVD movies.
H.264, MPEG-4 Part 10, or AVC, for Advanced Video Coding, is a digital video codec standard which is noted for achieving very high compression ratios. A video codec is a device or software module that enables video compression or decompression for digital video. The compression usually employs lossy data compression. In daily life, digital video codecs are found in DVD (MPEG-2), VCD (MPEG-1), in emerging satellite and terrestrial broadcast systems, and on the Internet.
The H.264 standard was written by the ITU-T Video Coding Experts Group (VCEG) together with the ISO/IEC Moving Picture Experts Group (MPEG) as the product of a collective partnership effort known as the Joint Video Team (JVT). The ITU-T H.264 standard and the ISO/IEC MPEG-4 Part 10 standard (formally, ISO/IEC 14496-10) are technically identical. The final drafting work on the first version of the standard was completed in May of 2003.
The need for video compression stems from the fact that digital video always requires high data rates—the better the picture, the more data is needed. This means powerful hardware, and high bandwidth when video is transmitted. However, much of the data in video is either redundant or easily predicted—for example, successive frames in a movie rarely change much from one to the next—this makes data compression work well with video. Such compression is referred to as lossy, because the video that can be recovered after such a process is not identical to the original one.
In computer science and information theory, data compression or source coding is defined as the process of encoding information using fewer bits (or other information-bearing units) than a raw (prior to coding) representation would use. The forward process of creating such a representation is termed encoding, the backward process of recovering the information is termed decoding. The entire scheme comprising an encoder and decoder is called a codec, for coder/decoder.
If the original data can be recovered precisely by the decoder, such a compression is termed lossless. Video compression can usually make video data far smaller while permitting a little loss in quality. For example, DVDs use the MPEG-2 compression standard that makes the movie 15 to 30 times smaller, while the quality degradation is not significant.
Video is basically a three-dimensional array of color pixels. Two dimensions serve as spatial (horizontal and vertical) directions of the moving pictures, and one dimension represents the time domain. A frame is a set of all pixels that correspond to a single point in time. Basically, a frame can be thought of as an instantaneous still picture.
Video data is often spatially and temporally redundant. This redundancy is the basis of modern video compression methods. One of the most powerful techniques for compressing video is inter-frame prediction. In the MPEG and H.264 video compression lexicon, this is called P mode compression. Each frame is divided into blocks of pixels, and for each block, the most similar block is found in an adjacent reference frame by a process called motion estimation. Due to temporal redundancy, the blocks will be very similar; therefore, one can transmit only the difference between them. The difference, called a residual macroblock, undergoes the process of transform coding and quantization, similarly to JPEG. Since inter-frame prediction relies on previous frames, if part of the encoded data is lost, successive frames cannot be reconstructed. Also, prediction errors tend to accumulate, especially if the video content changes abruptly (e.g. at scene cuts). To avoid this problem, I frames are used in MPEG compression. I frames are basically treated as JPEG compressed pictures.
Compression of residual macroblocks and the blocks in I frames is based on the discrete cosine transform (DCT), whose main aim is spatial redundancy reduction. The discrete cosine transform (DCT) is a Fourier-type transform similar to the discrete Fourier transform (DFT), but using only real numbers. It is equivalent to a DFT of roughly twice the length, operating on real data with even symmetry (since the Fourier transform of a real and even function is real and even), where in some variants the input and/or output data are shifted by half a sample. (There are eight standard variants, of which four are common.) The most common variant of discrete cosine transform is the type-II DCT, which is often called simply “the DCT”; its inverse, the type-III DCT, is correspondingly often called simply “the inverse DCT” or “the IDCT”.
Two related transforms are the discrete sine transform (DST), which is equivalent to a DFT of real and odd functions, and the modified discrete cosine transform (MDCT), which is based on a DCT of overlapping data.
The H.264 video compression standard requires that a modified integer Discrete Cosine Transform be used, i.e., a particular implementation of the transform using integer arithmetic, and that is what is used in the preferred embodiments of H.264 video codec implementations according to the teachings of the invention doing compression. However, the term "Discrete Cosine Transform," if used in the claims, should be interpreted to cover the DCT and all its variants that work on integers.
Further compression is achieved by quantization. In digital signal processing, quantization is the process of approximating a continuous or very wide range of values (or a very large set of possible discrete values) by a relatively small set of discrete symbols or integer values. Basically, it is truncation of bits and keeping only a selected number of the most significant bits. As such, it causes losses. The number of bits kept is programmable in most embodiments but can be fixed in some embodiments.
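A minimal sketch of this bit-truncation view of quantization (illustrative only; the actual quantizer in a codec such as H.264 uses scaling and rounding rules defined by the standard, not a plain shift):

```c
#include <stdint.h>

/* Keep only the most significant bits of a coefficient by discarding the
   'shift' least significant bits.  Assumes an arithmetic right shift for
   negative values, as on most platforms. */
static int16_t quantize_by_truncation(int16_t coeff, unsigned shift)
{
    return (int16_t)(coeff >> shift);
}

/* The decoder can only restore the value to within one quantization step,
   which is the source of the loss. */
static int16_t dequantize(int16_t level, unsigned shift)
{
    return (int16_t)(level * (1 << shift));
}
```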
The quantization can either be scalar quantization or vector quantization; however, nearly all practical designs use scalar quantization because of its greater simplicity. Quantization plays a major part in lossy data compression. In many cases, quantization can be viewed as the fundamental element that distinguishes lossy data compression from lossless data compression, and the use of quantization is nearly always motivated by the need to reduce the amount of data needed to represent a signal.
A typical digital video codec design starts with conversion of camera-input video from RGB color format to YCbCr color format, and often also chroma subsampling to produce a 4:2:0 (or sometimes 4:2:2 in the case of interlaced video) sampling grid pattern. The conversion to YCbCr provides two benefits: first, it improves compressibility by providing decorrelation of the color signals; and second, it separates the luma signal, which is perceptually much more important, from the chroma signal, which is less perceptually important and which can be represented at lower resolution.
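A minimal sketch of one common conversion (the full-range BT.601-derived formulation used in JFIF; the exact coefficients and value ranges used by a given system may differ) is shown below.

```c
#include <stdint.h>

static uint8_t clamp_u8(double v)
{
    if (v < 0.0)   return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/* Convert one RGB pixel to YCbCr, 8 bits per sample, full range. */
static void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                         uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    *y  = clamp_u8( 0.299    * r + 0.587    * g + 0.114    * b);
    *cb = clamp_u8(-0.168736 * r - 0.331264 * g + 0.5      * b + 128.0);
    *cr = clamp_u8( 0.5      * r - 0.418688 * g - 0.081312 * b + 128.0);
}
```

Because the Cb and Cr outputs vary slowly across natural images, they can then be subsampled to the 4:2:0 grid described above with little visible loss.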
Many different video codec designs exist in the prior art. Of these, the most significant recent development is video codecs technically aligned with the MPEG-4 Part 10 standard (which is technically aligned with the ITU-T's H.264 and often also referred to as AVC). This emerging new standard is the current state of the art of ITU-T and MPEG standardized compression technology, and is rapidly gaining adoption into a wide variety of applications. It contains a number of significant advances in compression capability, and it has recently been adopted into a number of company products, including for example the PlayStation Portable, iPod, the Nero Digital product suite, Mac OS X v10.4, as well as HD DVD/Blu-ray Disc.
H.264 encoding and decoding are very computationally intensive, so it is advantageous to be able to perform them on a parallel processing architecture to speed the process up and enable real time encoding and decoding of digital video signals even if they are High Definition format. To do H.264 encoding and decoding on a parallel processing computing platform (any parallel processing platform with any number of parallel computing channels will suffice to practice the invention), it is necessary to break the encoding and decoding problems down into parts that can be computed simultaneously and which are data independent, i.e., no dependencies between data which would prevent parallel processing.
In the main profile of the H.264 codec, compression is usually performed on video in the YCbCr 4:2:0 format with 8 bits per channel representation. The luminance component of the frame is divided into 16×16 pixel blocks called luma macroblocks and the chrominance Cb and Cr channels are divided into 8×8 Cb and Cr blocks of pixels, collectively referred to as chroma macroblocks.
Referring to
Like the previous MPEG standards, the H.264 codec exploits temporal redundancy. H.264 introduced the following main novelties:
1) macroblock-based prediction: each macroblock is treated as a stand-alone unit, and the choice between I and P modes is made at the macroblock level rather than at the level of the entire frame, such that a single frame can contain both I and P blocks. Macroblocks can be grouped into slices.
2) an additional level of spatial redundancy utilization was added by means of intra-prediction. The main idea is to predict the macroblock pixels from neighboring macroblocks within the same frame, and apply transform coding to the difference between the actual and the predicted values.
3) P macroblocks, even within the same frame, can use different reference frames.
The residual macroblock is encoded in encoder 30 and the encoded data on line 32 is transmitted to a decoder elsewhere or to some media for storage. Encoder 30 does a Discrete Cosine Transform (DCT) on the error image data to convert the function defined by the error image samples into the frequency domain. The integer luminance difference numbers of the error image define a function in the time domain (because the pixels are raster scanned sequentially) which can be transformed to the frequency domain for greater compression efficiency and fewer artifacts. The DCT transform outputs integer coefficients that define the amplitude of each of a plurality of different frequency components which, when added together, would reconstitute the original time domain function. Each coefficient is quantized, i.e., only some number of the most significant bits of each coefficient are kept and the rest are discarded. This causes losses in the original picture quality, but makes the transmitted signal more compact without significant visual impairment of the reconstructed picture. For the coefficients of the higher frequency components, more aggressive quantization can be performed (fewer bits kept) because the human eye is less sensitive to the higher frequencies. More bits are kept for the DC (zero frequency) and lower frequency components because of the eye's higher sensitivity.
All the circuitry inside box 34 is the encoder, but the predicted frame on line 22 is generated by a decoder 36 within the encoder.
In inter-prediction mode, each P-block (or each subdivision thereof) has a motion vector which points, using a Cartesian x,y coordinate set, to the same size block of pixels in a previous frame whose luminance values are the closest to the pixel luminance values of the macroblock. The differences between the luminance values of the macroblock being encoded and the reference block luminance values are encoded as a macroblock of error values which are integers ranging from −255 to +255. The data transmitted for the compressed macroblock is these error values and the motion vector. The motion vector points to the set of pixels in the reference frame which will be the predicted pixel values in the block being reconstructed in the decoder. This P-block encoding is the form of compression that is used most because it uses the fewest bits.
The differences between the luma values of the block being encoded and the reference pixels are then encoded using DCT and quantization. In the preferred embodiment, the macroblock of error values is divided into four 4×4 blocks of error numbers. Each error number is an integer ranging from −255 to +255 and is represented by the number of bits needed to encode that range. Chroma encoding is slightly different because the chroma macroblocks are only half the resolution of the luma macroblocks.
The DCT, and in particular the DCT-II, is often used in signal and image processing, especially for lossy data compression, because it has a strong “energy compaction” property: most of the signal information tends to be concentrated in a few low-frequency components of the DCT. This allows compression by quantization because more bits of the less significant high frequency components can be removed and more bits of the more significant low frequency components can be kept. For example, suppose 16 bits are output for every frequency component coefficient. For the less significant higher frequency components, only two bits might be kept, whereas for the most significant component, the DC component, all 16 bits might be kept. Typically, quantization is done by using a quantization mask which is used to multiply the output matrix of the DCT transform. The quantization mask does scaling so that more bits of the lower frequency components will be retained.
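A minimal sketch of such frequency-dependent quantization, with an illustrative step-size mask (these weights are hypothetical and not the normative H.264 scaling values); each coefficient is divided by a step size that grows with spatial frequency, which is equivalent to multiplying by a reciprocal scaling mask:

```c
#include <stdint.h>

/* Illustrative 4x4 step-size mask: coarser quantization toward the
   high-frequency (bottom-right) coefficients. */
static const int16_t qstep[4][4] = {
    {  4,  6,  8, 12 },
    {  6,  8, 12, 16 },
    {  8, 12, 16, 24 },
    { 12, 16, 24, 32 },
};

/* Quantize a 4x4 block of transform coefficients in place; the (0,0) DC
   coefficient keeps the most precision, the highest frequencies the least. */
static void quantize_block(int16_t coeff[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            coeff[i][j] = (int16_t)(coeff[i][j] / qstep[i][j]);
}
```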
The discrete cosine transform is defined mathematically as follows.
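For reference, the most common variant, the type-II DCT of a length-N sequence $x_0, \ldots, x_{N-1}$, is usually written (up to a normalization convention that varies between texts) as

$$X_k = \sum_{n=0}^{N-1} x_n \cos\!\left[\frac{\pi}{N}\left(n + \tfrac{1}{2}\right)k\right], \qquad k = 0, \ldots, N-1.$$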
As an example of a DCT transform, a DCT is used in JPEG image compression, MJPEG, MPEG, and DV video compression. In these compression schemes, the two-dimensional DCT-II of N×N blocks is computed and the results are quantized and entropy coded. In this example, N is typically 8 so an 8×8 block of error numbers is the input to the transform, and the DCT-II formula is applied to each row and column of the block. The result is an 8×8 transform coefficient array in which the (0,0) element is the DC (zero-frequency) component and entries with increasing vertical and horizontal index values represent higher vertical and horizontal spatial frequencies. The DC component contains the most information so in more aggressive quantization, the bits required to express the higher frequency coefficients can be discarded.
In H.264, the macroblock is divided into 16 4×4 blocks, each of which is transformed using a 4×4 DCT. In some intra prediction modes, a second level of transform coding is applied to DC coefficients of the macroblocks, in order to reduce the remaining redundancy. The 16 DC coefficients are arranged into a 4×4 matrix, which is transformed using the Hadamard transform.
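A minimal sketch of these two 4×4 transforms, assuming the standard H.264 integer core transform matrix and the 4×4 Hadamard matrix; the post-scaling that the standard folds into quantization is deliberately omitted, so the outputs here are unscaled.

```c
#include <stdint.h>

/* H.264 4x4 forward core transform matrix (integer approximation of the DCT). */
static const int Cf[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 },
};

/* 4x4 Hadamard matrix, applied to the 16 luma DC coefficients of a macroblock. */
static const int Hd[4][4] = {
    { 1,  1,  1,  1 },
    { 1,  1, -1, -1 },
    { 1, -1, -1,  1 },
    { 1, -1,  1, -1 },
};

/* out = A * X * A^T, computed with plain integer arithmetic. */
static void transform_4x4(const int A[4][4], int16_t X[4][4], int32_t out[4][4])
{
    int32_t tmp[4][4];
    for (int i = 0; i < 4; ++i)          /* tmp = A * X */
        for (int j = 0; j < 4; ++j) {
            int32_t s = 0;
            for (int k = 0; k < 4; ++k)
                s += A[i][k] * X[k][j];
            tmp[i][j] = s;
        }
    for (int i = 0; i < 4; ++i)          /* out = tmp * A^T */
        for (int j = 0; j < 4; ++j) {
            int32_t s = 0;
            for (int k = 0; k < 4; ++k)
                s += tmp[i][k] * A[j][k];
            out[i][j] = s;
        }
}
```

For a 4×4 residual block, calling transform_4x4 with A set to Cf computes the core transform; calling it with A set to Hd on the 4×4 matrix of luma DC coefficients computes the Hadamard stage described above.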
Also, only luminance values will be discussed unless otherwise indicated although the same ideas apply to the chroma pixels as well.
Referring to
The predicted frame macroblock is generated either as an I-block by intraframe prediction circuit 66 or motion compensation circuit 68.
The resulting per pixel error in luminance results in a stream of integers on line 70 to a transformation, scaling and quantization circuit 72. There a Discrete Cosine Transform is performed on the error numbers and scaling and quantization is done to compress the resulting frequency domain coefficients output by the DCT. The resulting compressed luminance data is output on line 74.
A coder control block 76 controls the transformation process and the scaling and quantization and outputs control data on line 78 which is transmitted with the quantized error image transform coefficients. The control data includes which mode was used for prediction (I or P), how strong the quantization is, settings for the deblocking filter, etc.
For each macroblock either intra-frame prediction (which generates an I-block macroblock) is used or interframe prediction (which generates a P-block macroblock) is used to generate the macroblock. A control signal on line 80 controls which type of predicted macroblock is supplied to summer 62.
To generate a predicted macroblock, a reference frame is used. The reference frame is the just previous frame and is generated by an H.264 decoder within the encoder. The H.264 decoder is the circuitry within block 82. Circuit 84 dequantizes the compressed data on line 74, and does inverse scaling and an inverse DCT transformation.
The resulting pixel luminance reconstructed error numbers on line 86 are summed in summer 88 with the predicted pixel values in the predicted macroblock on line 64. The resulting reconstructed macroblocks are processed in deblocking filter 90 which outputs the reconstructed pixels of a video frame shown at 92. Video frame 92 is basically the previous frame to the frame being encoded and serves as the reference frame for use by motion estimation circuit 94 which generates motion vectors on line 96.
The motion estimation circuit 94 compares each macroblock of the incoming video on line 61 to the macroblocks in the reference frame 92 and generates a motion vector which is a vector to the coordinates of the origin of a macroblock in the reference frame whose pixels are the closest in luminance values to the pixels of the macroblock to which the motion vector pertains. This motion vector per macroblock on line 96 is used by the motion compensation circuit 68 to generate a P-block mode predicted macroblock whose pixels have the same luminance values as the pixels in the macroblock of the reference frame to which the motion vector points.
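To make the block-matching idea concrete, the following sketch performs a brute-force full search minimizing the sum of absolute differences (SAD) over a small window; all names are hypothetical, and real encoders use far faster search strategies and also consider sub-pixel positions.

```c
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* SAD between the 16x16 block at (bx,by) in the current frame and the block at
   (bx+dx, by+dy) in the reference frame.  'stride' is the frame width in pixels. */
static int sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride,
                     int bx, int by, int dx, int dy)
{
    int sad = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x)
            sad += abs(cur[(by + y) * stride + bx + x] -
                       ref[(by + dy + y) * stride + bx + dx + x]);
    return sad;
}

/* Full search in a +/-range window; returns the best (dx,dy) motion vector.
   Bounds checking against the frame edges is omitted for brevity. */
static void motion_estimate(const uint8_t *cur, const uint8_t *ref, int stride,
                            int bx, int by, int range, int *best_dx, int *best_dy)
{
    int best = INT_MAX;
    for (int dy = -range; dy <= range; ++dy)
        for (int dx = -range; dx <= range; ++dx) {
            int sad = sad_16x16(cur, ref, stride, bx, by, dx, dy);
            if (sad < best) { best = sad; *best_dx = dx; *best_dy = dy; }
        }
}
```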
The intraframe prediction circuit 66 just uses the values of neighboring pixels to the macroblock to be encoded to predict the luminance values of the pixels in the I-block mode predicted macroblock output on line 64.
A particularly computationally intensive part of the H.264 codec is the deblocking filter, also referred to as the in-loop filter, whose main purpose is the reduction of artifacts (referred to as the blocking effect) resulting from transform-domain quantization, often visible in the decoded video and disturbing to the viewer. In the H.264 encoder, the deblocking filter also improves the accuracy of inter-prediction, since the reference blocks are taken after the deblocking filter is applied.
In state-of-the-art implementations of the H.264 decoder, the deblocking filter can take up to 30% of the computational complexity. The H.264 standard defines a specific deblocking filter, which is an adaptive process acting like a low pass filter to smooth out abrupt edges; it does more smoothing if the edges between 4×4 blocks of pixels are more abrupt. The deblocking smoothes the edges between macroblocks so that they become less noticeable in the reconstructed image. In MPEG2 and MPEG4, the deblocking filter is not part of the standard codec, but can be applied as a post-processing operation on the decoded video.
The H.264 standard introduced the deblocking filter as part of the codec loop after the prediction. In the decoder of
Because the DCT transform in the H.264 standard is done on 4×4 blocks, the boundaries between 4×4 blocks inside a macroblock and between neighbor macroblocks may be visible (edge refers to a boundary between two blocks). In order to reduce this effect, a filter must be applied on the 16 vertical edges and 16 horizontal edges for the luma component and on 4 vertical and 4 horizontal edges for each of the chroma components.
When we say edge filtering, we refer to changing the pixels in the blocks on the left and the right of the edge. For a vertical edge, each of the 4 lines of 4 pixels in the 4×4 block on the left of the edge and each of the 4 lines in the block on the right of the edge must undergo filtering. Each filtering operation affects up to three pixels on either side of the edge. The amount of filtering applied to each edge is governed by a boundary strength value ranging from 0 to 4, which depends on the current quantization parameter and the coding modes of the neighboring blocks. This setting applies to the entire edge (i.e., to four rows or columns belonging to the same edge). Two 4×4 matrices with boundary strengths for vertical and horizontal edges are computed for this purpose. The actual amount of filtering also depends on the gradient of intensities across the edge, and is decided for each row or column of pixels crossing the edge.
In the H.264 standard, in order to account for the need to do filtering of different strengths, two different filters may be applied to a line of pixels. These filters are referred to as the long filter (involving the weighted sum of six pixels, three on each side of the edge) and the short filter (involving the weighted sum of four pixels, two on each side of the edge). The decision of which filter to use is made separately for each line in the block. Each line can be filtered with the long filter, the short filter, or not filtered at all.
The H.264 standard does not prescribe any parallelization of the deblocking filter. It only requires that the vertical luma and chroma edges be deblocked prior to the horizontal ones. How to parallelize the calculation while respecting this order is an implementation detail left up to the designer, and it is essential to achieving the advantages the invention achieves.
In several prior art implementations of the deblocking filter for the H.264 codec, this data dependency was used to some extent in order to improve the computational efficiency of the deblocking filter. Here, we refer to the following prior art:
1. Y.-W. Huang et al., Architecture design for deblocking filter in H.264/JVT/AVC, Proceedings of the IEEE International Conference on Multimedia and Expo (2003), proposed an architecture in which two adjacent blocks are stored in a 4×8 pixel array, from which the lines of pixels are fed into a one-dimensional filter, which performs the processing of the pixels. The processing is first applied to the vertical luma edges in raster scan order, then to the horizontal luma edges, in raster scan order. Afterwards, the chroma edges are filtered. The order in which the luma vertical and horizontal edges are deblocked is as shown in
2. V. Venkatraman et al., Architecture for Deblocking Filter in H.264, Proceedings Picture Coding Symposium (2004) showed a pipeline architecture, which is an improvement to the architecture of Huang et al., in which two one-dimensional filters are operated in parallel, processing vertical edges in raster scan order and simultaneously, with a delay of two iterations, horizontal edges, in a pipelined manner in the order shown in
3. Another pipelined deblocking filter is taught by Y.-K. Kim et al., Pipeline Deblocking Filter, U.S. Patent Application Publication 20060115002 (June 2006 Samsung Electronics). The vertical and horizontal edges are filtered in the order presented in
4. A multi-stage pipeline architecture is taught by P. P. Dang, Method and Apparatus for Parallel Processing of In-Loop Deblocking Filter for H.264 Video Compression Standard, U.S. Patent Application Publication 20060078052 (December 2005). In this approach, sequential filtering of luma and chroma edges takes 30 iterations. The filtering order is presented in
The invention claimed herein is a method and apparatus to do deblocking filtering on any parallel processing platform, making the best use of the data dependencies.
We identify three levels of parallelization in the deblocking filter process:
All luma and chroma horizontal and vertical edges are filtered in 8 iterations in the order shown in
A possible order of edge processing according to our invention, making the best use of the data dependencies, is shown in
A schematic filter unit for vertical edge filtering is depicted in
The selection of which result to choose for each line of pixels is defined in the H.264 standard and depends both on the boundary strength and the pixel values in the line. For example, for boundary strength equal to 4, the long filter is selected for the first line of four pixels p13 . . . p10 in
The operation of the long filter 600 and the short filter 602 and the selection 604 thereof can be represented as a sequence of tensor operations on 4×4 matrices (where by tensor we imply a matrix or a vector, and tensor operations refer to operations performed on a matrix, its columns, rows, or elements), and carried out on appropriate processing units implemented either in hardware or in software. This approach is used in the preferred embodiment, but computational units with tensor operation capability are not required in all embodiments.
A one-dimensional filter, used in prior art implementation of the deblocking filter, processes the lines in the 4×4 blocks sequentially in four iterations. For example, referring to the notation in
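To make the contrast concrete, the sketch below applies the short filter of the standard (the p0 update only, using the formula given later in this description) to a vertical edge, first one line at a time as in the prior-art one-dimensional filter, and then as a single column-wise update over all four lines, which is the form that maps naturally onto a tensor-capable computational unit. The storage convention (column 3 of P and column 0 of Q adjacent to the edge) is an assumption made for this sketch.

```c
#include <stdint.h>

/* P and Q are the 4x4 blocks to the left and right of a vertical edge;
   column 3 of P holds p0 and column 0 of Q holds q0 for each line. */

/* Prior-art style: one line of pixels per iteration (four iterations per edge). */
static void short_filter_sequential(int16_t P[4][4], const int16_t Q[4][4])
{
    for (int r = 0; r < 4; ++r) {
        int16_t p1 = P[r][2], p0 = P[r][3], q1 = Q[r][1];
        P[r][3] = (int16_t)((2 * p1 + p0 + q1 + 2) >> 2);
    }
}

/* Tensor style: the same update expressed on whole columns, so a SIMD/tensor
   unit can execute it for all four lines of the edge in a single step. */
static void short_filter_columnwise(int16_t P[4][4], const int16_t Q[4][4])
{
    int16_t p1[4], p0[4], q1[4];
    for (int r = 0; r < 4; ++r) { p1[r] = P[r][2]; p0[r] = P[r][3]; q1[r] = Q[r][1]; }
    for (int r = 0; r < 4; ++r)                  /* conceptually one vector operation */
        P[r][3] = (int16_t)((2 * p1[r] + p0[r] + q1[r] + 2) >> 2);
}
```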
The filtering unit 502 used for horizontal edge filtering can be obtained from the vertical edge filtering unit 500 by means of a transposition operation 700, as shown in
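A minimal sketch of that reuse: transposing the two 4×4 blocks turns a horizontal edge into a vertical one, so the same vertical-edge filter routine can be applied and the results transposed back; the function names here are hypothetical.

```c
#include <stdint.h>

/* In-place transpose of a 4x4 block. */
static void transpose_4x4(int16_t M[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = i + 1; j < 4; ++j) {
            int16_t t = M[i][j];
            M[i][j] = M[j][i];
            M[j][i] = t;
        }
}

/* Hypothetical reuse of a vertical-edge filter routine for a horizontal edge:
   P is the block above the edge and Q the block below it. */
static void filter_horizontal_edge(int16_t P[4][4], int16_t Q[4][4],
                                   void (*filter_vertical_edge)(int16_t[4][4], int16_t[4][4]))
{
    transpose_4x4(P);
    transpose_4x4(Q);
    filter_vertical_edge(P, Q);   /* operate as if the edge were vertical */
    transpose_4x4(P);
    transpose_4x4(Q);
}
```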
The preferred platform upon which to practice the parallel deblocking filter process is a massively-parallel computer with multiple independent computation units capable of performing tensor operations on 4×4 matrix data. An example of a parallel computer having these capabilities is the AVIOR (Advanced VIdeo ORiented) architecture, shown in
In this architecture, the basic computational unit is a cluster 820 (depicted in
A plurality of clusters form a group 810, depicted in
The entire architecture consists of a plurality of groups. In the configuration shown in
In the embodiment presented here, we employ only one group for the deblocking filter, while the other groups are free to carry out other processing needed in the H.264 codec or perform deblocking of other video streams in a multiple stream decoding scenario.
In the preferred embodiment of this invention, the parallel deblocking filter employing the parallelization described in
In actual implementation, the order of edge deblocking may differ from the one presented in
1. Vertical luma edges (Y 10|11, . . . , Y 43|44);
2. Horizontal luma edges (Y 01-11, . . . , Y 34-44);
3. Vertical Cb edges (Cb 10|11, . . . , Cb 21|22);
4. Horizontal Cb edges (Cb 01-11, . . . , Cb 12-22);
5. Vertical Cr edges (Cr 10|11, . . . , Cr 21|22);
6. Horizontal Cr edges (Cr 01-11, . . . , Cr 12-22).
Each of the 6 sets of edges is allocated to a set of computational units, on which it is deblocked in a set of iterations. If two sets of edges are executed in overlapping sets of iterations, this implies that they are executed in parallel. If a set of edges is allocated to a non-singleton set of computational units (i.e., more than a single computational unit), this implies an internal parallelization in the processing of said set of edges.
The actual allocation and order of processing is subject to availability of computational units and data dependency. In the following, we show allocation of processing in the preferred embodiments of the invention, though other allocations are also possible.
The most computationally-efficient embodiment of the parallel deblocking filter according to our invention is possible on a parallel architecture consisting of at least eight independent processing units. In the AVIOR architecture, this corresponds to one group in the eight-cluster configuration.
In the first iteration, four vertical luma edges (Y 10|11, . . . , Y 40|41 according to the edge numbering convention presented in
In the second iteration, the next four vertical luma edges to the right (Y 11|12, . . . , Y 41|42) are deblocked in parallel on clusters 0-3. On the remaining clusters 4-7, vertical chroma edges (Cb 11|12, Cb 21|22 and Cr 11|12, Cr 21|22) are deblocked.
In the third iteration, the next four vertical luma edges to the right (Y 12|13, . . . , Y 42|43) are processed in parallel on clusters 0-3. On one of the remaining clusters, e.g. 4, a data independent horizontal luma edge Y 01-11 can be deblocked.
In the fourth iteration, the last four vertical luma edges (Y 13|14, . . . , Y 43|44) are deblocked in parallel on clusters 0-3. On two of the remaining clusters, e.g. 4-5, data independent horizontal luma edges Y 11-21 and Y 02-12 can be deblocked.
In the fifth iteration, luma edges Y 21-31, Y 12-22, Y 03-13 and Y 04-14 are deblocked in parallel on clusters 4-7. On the remaining clusters 0-3, horizontal chroma edges (Cb 01-11, Cb 02-12 and Cr 01-11, Cr 02-12) are deblocked.
In the sixth iteration, luma edges Y 31-41, Y 22-32, Y 13-23 and Y 14-24 are deblocked in parallel on clusters 4-7. On the remaining clusters 0-3, horizontal chroma edges (Cb 11-21, Cb 12-22 and Cr 11-21, Cr 12-22) are deblocked.
In the seventh iteration, luma edges Y 32-42, Y 23-33 and Y 24-34 are deblocked in parallel on any three of the available clusters 0-7, e.g., on clusters 5-7.
In the eighth iteration, the last luma edges Y 33-43 and Y 34-44 are deblocked in parallel on any two of the available clusters 0-7, e.g., on clusters 6-7. The total number of iterations is 8.
Using our notation of edge, iteration and cluster sets, we have:
Other allocations are possible as well, with the same efficiency. For example, the allocation of Cb and Cr blocks processing can be exchanged.
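For concreteness, the eight-iteration schedule described above can be captured in software as a static lookup table. The encoding below is hypothetical, and the assignment of a particular edge to a particular cluster within an iteration is illustrative (the description above only requires that the listed edges be processed in parallel during that iteration); the edge labels follow the numbering convention used in this description.

```c
/* Hypothetical encoding of the eight-iteration, eight-cluster schedule.
   Each row is one iteration; entry i is the edge assigned to cluster i,
   and "" means the cluster is idle during that iteration. */
static const char *schedule_8cluster[8][8] = {
    /* clusters:  0            1            2            3            4            5            6            7        */
    { "Y 10|11",  "Y 20|21",  "Y 30|31",  "Y 40|41",  "Cb 10|11", "Cb 20|21", "Cr 10|11", "Cr 20|21" }, /* iter 1 */
    { "Y 11|12",  "Y 21|22",  "Y 31|32",  "Y 41|42",  "Cb 11|12", "Cb 21|22", "Cr 11|12", "Cr 21|22" }, /* iter 2 */
    { "Y 12|13",  "Y 22|23",  "Y 32|33",  "Y 42|43",  "Y 01-11",  "",         "",         ""         }, /* iter 3 */
    { "Y 13|14",  "Y 23|24",  "Y 33|34",  "Y 43|44",  "Y 11-21",  "Y 02-12",  "",         ""         }, /* iter 4 */
    { "Cb 01-11", "Cb 02-12", "Cr 01-11", "Cr 02-12", "Y 21-31",  "Y 12-22",  "Y 03-13",  "Y 04-14"  }, /* iter 5 */
    { "Cb 11-21", "Cb 12-22", "Cr 11-21", "Cr 12-22", "Y 31-41",  "Y 22-32",  "Y 13-23",  "Y 14-24"  }, /* iter 6 */
    { "",         "",         "",         "",         "",         "Y 32-42",  "Y 23-33",  "Y 24-34"  }, /* iter 7 */
    { "",         "",         "",         "",         "",         "",         "Y 33-43",  "Y 34-44"  }, /* iter 8 */
};

/* A driver would walk the iterations in order, dispatching every non-idle entry
   to its cluster and waiting for all clusters to finish before starting the
   next iteration, which preserves the data dependencies described above. */
```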
In the first iteration, four vertical luma edges (Y 10|11, . . . , Y 40|41 according to the edge numbering convention presented in
In the second iteration, the next four vertical luma edges to the right (Y 11|12, . . . , Y 41|42) are deblocked in parallel on clusters 0-3.
In the third iteration, the next four vertical luma edges to the right (Y 12|13, . . . , Y 42|43) are processed in parallel on clusters 0-3.
In the fourth iteration, the last four vertical luma edges (Y 13|14, . . . , Y 43|44) are deblocked in parallel on clusters 0-3. This finishes the vertical luma edges.
In the fifth iteration, four horizontal luma edges (Y 01-11, . . . , Y 04-14) are deblocked in parallel on clusters 0-3.
In the sixth iteration, the next four horizontal luma edges to the right (Y 11-21, . . . , Y 14-24) are deblocked in parallel on clusters 0-3.
In the seventh iteration, the next four horizontal luma edges to the right (Y 21-31, . . . , Y 24-34) are processed in parallel on clusters 0-3.
In the eighth iteration, the last four horizontal luma edges (Y 31-41, . . . , Y 34-44) are deblocked in parallel on clusters 0-3. This finishes the horizontal luma edges.
In the ninth iteration, two vertical chroma edges (Cb 10|11 and Cb 20|21) are deblocked in parallel on clusters 0-1, and two vertical chroma edges (Cr 10|11 and Cr 20|21) are deblocked in parallel on clusters 2-3.
In the tenth iteration, two vertical chroma edges (Cb 11|12 and Cb 21|22) are deblocked in parallel on clusters 0-1, and two vertical chroma edges (Cr 11|12 and Cr 21|22) are deblocked in parallel on clusters 2-3. This finishes the vertical chroma edges.
In the eleventh iteration, two horizontal chroma edges (Cb 01-11 and Cb 02-12) are deblocked in parallel on clusters 0-1, and two horizontal chroma edges (Cr 01-11 and Cr 02-12) are deblocked in parallel on clusters 2-3.
In the twelfth iteration, two horizontal chroma edges (Cb 11-21 and Cb 12-22) are deblocked in parallel on clusters 0-1, and two horizontal chroma edges (Cr 11-21 and Cr 12-22) are deblocked in parallel on clusters 2-3. This finishes the horizontal chroma edges. The total number of iterations is 12.
Using our notation of edge, iteration and cluster sets, we have:
Other allocations are possible as well, with the same efficiency.
In the first iteration, two top vertical luma edges from the first column (Y 10|11 and Y 20|21, according to the edge numbering convention presented in
In the second iteration, two bottom vertical luma edges from the first column (Y 30|31 and Y 40|41) are deblocked in parallel on clusters 0-1.
In the third iteration, two top vertical luma edges from the second column (Y 11|12 and Y 21|22) are deblocked in parallel on clusters 0-1.
In the fourth iteration, two bottom vertical luma edges from the second column (Y 31|32 and Y 41|42) are deblocked in parallel on clusters 0-1.
In the fifth iteration, two top vertical luma edges from the third column (Y 12|13 and Y 22|23) are deblocked in parallel on clusters 0-1.
In the sixth iteration, two bottom vertical luma edges from the third column (Y 32|33 and Y 42|43) are deblocked in parallel on clusters 0-1.
In the seventh iteration, two top vertical luma edges from the fourth column (Y 13|14 and Y 23|24) are deblocked in parallel on clusters 0-1.
In the eighth iteration, two bottom vertical luma edges from the fourth column (Y 33|34 and Y 43|44) are deblocked in parallel on clusters 0-1. This finishes the vertical luma edges.
In the ninth iteration, two left horizontal luma edges from the first row (Y 01-11 and Y 02-12) are deblocked in parallel on clusters 0-1.
In the tenth iteration, two right horizontal luma edges from the first row (Y 03-13 and Y 04-14) are deblocked in parallel on clusters 0-1.
In the eleventh iteration, two left horizontal luma edges from the second row (Y 11-21 and Y 12-22) are deblocked in parallel on clusters 0-1.
In the twelfth iteration, two right horizontal luma edges from the second row (Y 13-23 and Y 14-24) are deblocked in parallel on clusters 0-1.
In the thirteenth iteration, two left horizontal luma edges from the third row (Y 21-31 and Y 22-32) are deblocked in parallel on clusters 0-1.
In the fourteenth iteration, two right horizontal luma edges from the third row (Y 23-33 and Y 24-34) are deblocked in parallel on clusters 0-1.
In the fifteenth iteration, two left horizontal luma edges from the fourth row (Y 31-41 and Y 32-42) are deblocked in parallel on clusters 0-1.
In the sixteenth iteration, two right horizontal luma edges from the fourth row (Y 33-43 and Y 34-44) are deblocked in parallel on clusters 0-1. This finishes the horizontal luma edges.
In the seventeenth iteration, two vertical Cb chroma edges from the first column (Cb 10|11 and Cb 20|21) are deblocked in parallel on clusters 0-1.
In the eighteenth iteration, two vertical Cb chroma edges from the second column (Cb 11|12 and Cb 21|22) are deblocked in parallel on clusters 0-1. This finishes the vertical Cb chroma edges.
In the nineteenth iteration, two horizontal Cb chroma edges from the first row (Cb 01-11 and Cb 02-12) are deblocked in parallel on clusters 0-1.
In the twentieth iteration, two horizontal Cb chroma edges from the second row (Cb 11-21 and Cb 12-22) are deblocked in parallel on clusters 0-1. This finishes the horizontal Cb chroma edges.
In the twenty-first iteration, two vertical Cr chroma edges from the first column (Cr 10|11 and Cr 20|21) are deblocked in parallel on clusters 0-1.
In the twenty-second iteration, two vertical Cr chroma edges from the second column (Cr 11|12 and Cr 21|22) are deblocked in parallel on clusters 0-1. This finishes the vertical Cr chroma edges.
In the twenty-third iteration, two horizontal Cr chroma edges from the first row (Cr 01-11 and Cr 02-12) are deblocked in parallel on clusters 0-1.
In the twenty-fourth iteration, two horizontal Cr chroma edges from the second row (Cr 11-21 and Cr 12-22) are deblocked in parallel on clusters 0-1. This finishes the horizontal Cr chroma edges. The total number of iterations is 24.
Using our notation of edge, iteration and cluster sets, we have:
Other allocations are possible as well, with the same efficiency (for example, the order of Cb and Cr processing can be exchanged).
Though the H.264 standard defines the derivation process for the computation of all the filters in the filtering unit 500, it does not define the exact implementation of these mathematical operations. To do them on a sequential processor may require many sequential scalar operations, and thus be inefficient.
According to the teachings of the invention, it is novel to perform the edge deblocking process by expressing the filter in terms of mathematical tensor operations on two 4×4 blocks of pixels denoted by P and Q in
In order to deblock vertical edge 340, each of the four rows of pixels (comprising a row of 4 pixels to the left of the edge 340 in the block 342 and a row of 4 pixels to the right of the edge 340 in the block 344) must be filtered. The filter output is computed as a weighted sum of the pixels in the rows. The actual filtering process and the weights are defined in the H.264 standard and depend on parameters computed separately. For each row of pixels, there are three possible outcomes: long filter result, short filter result and no filter at all.
According to the teachings of this invention, a generic edge filtering is performed using a process with data flow schematically represented in
Example of Luma Filter Implementation Using Tensor Operations

The following is an example of a vertical edge deblocking filter corresponding to a boundary strength value of 4, as defined in the H.264 standard. The process goes as follows:
1. Compute the long filter result (two 4×4 blocks, Pl and Ql):
2. Compute the short filter result (two 4×4 blocks, Ps and Qs):
Ps=(p3 p2 p1 (2p1+p0+q1+2)>>2)
Qs=((2q1+q0+p1+2)>>2 q1 q2 q3)
3. Compute the masking matrices, which represent the selection operations:
DELTA=|(p0 p0 p0 p0)−(q0 q0 q0 q0)|
C1=DELTA<α
C2=|(q1 q1 q1 q1)−(q0 q0 q0 q0)|<β
C3=|(p1 p1 p1 p1)−(p0 p0 p0 p0)|<β
M0=C1&C2&C3
C6=DELTA<((α>>2)+2)
Mp=|(p2 p2 p2 p2)−(p0 p0 p0 p0)|<β & C6
Mq=|(q2 q2 q2 q2)−(q0 q0 q0 q0)|<β & C6
< denotes a comparison operation, applied element-wise to a 4×4 matrix as a SIMD operation and resulting in a binary matrix, in which each element equals 1 if the condition holds and 0 otherwise. For example,
& denotes a logical AND operation, applied element-wise to binary 4×4 matrices (i.e., matrices whose elements are either 1 or 0) as a SIMD operation. For example,
The matrices M0, Mp and Mq are 4×4 matrices with binary rows (i.e., each row consists entirely of 1s or entirely of 0s) and are used as a mathematical representation of the selector 604 in
4. Combine the long and short filtered results using the masks:
Pf=M0&(Mp&Pl+(!Mp)&Ps)+(!M0)&P
Qf=M0&(Mq&Ql+(!Mq)&Qs)+(!M0)&Q
Other filtering processes according to the H.264 standard are implemented in a similar manner.
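As a concrete illustration of the process just described, the sketch below applies the boundary-strength-4 luma filtering to a single line of pixels in scalar form; the tensor formulation of the preferred embodiment performs the same arithmetic on whole 4×4 blocks, i.e., on all four lines of the edge at once. The long-filter expressions are the strong-filter equations of the H.264 standard, and alpha and beta are the thresholds the standard derives from the quantization parameter; consult the standard for their normative derivation.

```c
#include <stdlib.h>

/* Per-line sketch of the boundary-strength-4 luma filter described above.
   p[0..3] hold p0..p3 (left of the edge, p0 adjacent to it); q[0..3] hold
   q0..q3.  All filtered outputs are computed from the unfiltered inputs. */
static void filter_line_bs4(int p[4], int q[4], int alpha, int beta)
{
    int p0 = p[0], p1 = p[1], p2 = p[2], p3 = p[3];
    int q0 = q[0], q1 = q[1], q2 = q[2], q3 = q[3];
    int delta = abs(p0 - q0);

    /* M0: decide whether this line is filtered at all. */
    if (!(delta < alpha && abs(p1 - p0) < beta && abs(q1 - q0) < beta))
        return;

    int c6 = delta < ((alpha >> 2) + 2);

    if (c6 && abs(p2 - p0) < beta) {      /* Mp holds: long filter on the P side */
        p[0] = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3;
        p[1] = (p2 + p1 + p0 + q0 + 2) >> 2;
        p[2] = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3;
    } else {                              /* otherwise the short filter */
        p[0] = (2 * p1 + p0 + q1 + 2) >> 2;
    }

    if (c6 && abs(q2 - q0) < beta) {      /* Mq holds: long filter on the Q side */
        q[0] = (q2 + 2 * q1 + 2 * q0 + 2 * p0 + p1 + 4) >> 3;
        q[1] = (q2 + q1 + q0 + p0 + 2) >> 2;
        q[2] = (2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3;
    } else {
        q[0] = (2 * q1 + q0 + p1 + 2) >> 2;
    }
}
```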
Although the invention has been disclosed in terms of the preferred and alternative embodiments disclosed herein, those skilled in the art will appreciate other alternative embodiments which do not depart from the ideas expressed herein. All such alternative embodiments are intended to be included within the scope of the claims appended hereto.