(A) Field of the Invention
The present invention relates to a memory-efficient parallel architecture for motion estimation, and more particularly to a method of data reuse for motion estimation.
(B) Description of the Related Art
H.264/AVC is the latest video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). Its new features include variable block-size motion estimation with multiple reference frames, a 4×4 integer discrete cosine transform, an in-loop deblocking filter, and context-adaptive binary arithmetic coding (CABAC). H.264/AVC can save up to 50% of the bit-rate compared to the MPEG-4 simple profile at the same video quality level. However, a large amount of computation is required: a profiling report shows that motion estimation consumes over 90% of the total encoding time. Moreover, a large amount of pixel data is required, creating a demand for ultra-high memory and bus bandwidth. Therefore, a data reuse methodology is quite important.
In traditional hardware designs for motion estimation, macroblocks are processed serially. However, there is a large overlap between the search windows (SW) of neighboring macroblocks, as depicted in
Motion estimation algorithms exploit the temporal redundancy of a video sequence. Among all motion estimation algorithms, the full-search block-matching algorithm evaluates every candidate position in the search window by computing the sum of absolute differences (SAD), as shown in the following equation:

    SAD(i, j) = Σ_{y=0}^{N−1} Σ_{x=0}^{N−1} | CB(x, y) − RB(x + i, y + j) |

where CB represents the current block, RB represents the reference block, N is the block size, and (i, j) is the candidate motion vector. In H.264/AVC, each picture of a video is partitioned into macroblocks of 16×16 pixels, and each macroblock can be subdivided into seven kinds of variable-size sub-blocks (one 16×16 sub-block, two 16×8 sub-blocks, two 8×16 sub-blocks, four 8×8 sub-blocks, eight 8×4 sub-blocks, eight 4×8 sub-blocks, or sixteen 4×4 sub-blocks). Therefore, a motion vector needs to be found, and the associated minimum SAD needs to be calculated, for each of the 41 sub-blocks.
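For illustration only, the following C sketch computes the SAD of one candidate position according to the equation above; the 16×16 block size, the pointer-and-stride data layout, and the function name are assumptions made for the sketch, not part of the claimed architecture.

    /* Illustrative sketch only: SAD between a current block CB and the
       reference block RB displaced by the candidate motion vector (i, j).
       rb points to the search window origin, so the caller must keep
       (i, j) within the search range. */
    #define N 16

    unsigned sad_16x16(const unsigned char *cb, int cb_stride,
                       const unsigned char *rb, int rb_stride,
                       int i, int j)
    {
        unsigned acc = 0;
        for (int y = 0; y < N; y++)
            for (int x = 0; x < N; x++) {
                int d = cb[y * cb_stride + x]
                      - rb[(y + j) * rb_stride + (x + i)];
                acc += (unsigned)(d < 0 ? -d : d);
            }
        return acc;
    }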
As shown in
In “On the Data Reuse and Memory Bandwidth Analysis for Full-Search Block-Matching VLSI Architecture,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, pp. 61-72, January 2002, Jen-Chieh Tuan, Tian-Sheuan Chang, and Chein-Wei Jen provide four levels of data reuse methods: (a) local locality within a candidate block; (b) local locality among adjacent candidate block strips; (c) global locality within a search area strip; and (d) global locality among adjacent search area strips. These four methods trade off local memory size against memory bandwidth: a larger local memory results in lower memory bandwidth but higher hardware cost. All four methods effectively decrease off-chip memory bandwidth.
In “Analysis and Architecture Design of an HDTV720p 30 Frames/s H.264/AVC Encoder,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16, No. 6, pp. 673-688, June 2006, Tung-Chien Chen, Shao-Yi Chien, Yu-Wen Huang, Chen-Han Tsai, Ching-Yeh Chen, To-Wei Chen, and Liang-Gee Chen take advantage of inter-candidate parallelism, as shown in
The present invention provides a new data reuse methodology for motion estimation, e.g., as used in the H.264/AVC standard, so as to relieve the demand for ultra-high memory and bus bandwidth when performing motion estimation.
In accordance with a first embodiment of the present invention, a so-called inter-macroblock parallelism is proposed. First, pixel data of one of a plurality of consecutive candidate blocks are read from an overlapped region of the search windows of a plurality of current blocks, the search windows lying in a reference frame that includes reference blocks corresponding to the current blocks, and the pixel data are transferred to a plurality of processing element (PE) arrays in parallel. The plurality of PE arrays are used to determine how well the current blocks match the reference blocks. The above process is then repeated for the rest of the candidate blocks in sequence. For example, if there are four current blocks CB1-CB4 and four consecutive candidate blocks, the data of the first candidate block are first read and transferred to four PE arrays in parallel; the second, third, and fourth candidate blocks follow in sequence, and the four PE arrays calculate SADs for CB1 to CB4, respectively.
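A minimal software model of this broadcast pattern is sketched below; it is illustrative only, and the array shapes and names are assumptions rather than the claimed hardware. Each iteration of the inner k-loop stands in for one of the parallel PE arrays, so every candidate pixel is read from memory once and consumed four times.

    /* Illustrative model of the first embodiment: one candidate block
       is read once and broadcast to NBLK SAD accumulators, one per
       current block CB1..CB4 (inter-macroblock parallelism). */
    #define NBLK 4   /* current blocks served in parallel */
    #define N    16  /* macroblock size */

    void inter_mb_parallel(const unsigned char cand[N][N],
                           const unsigned char cur[NBLK][N][N],
                           unsigned sad_out[NBLK])
    {
        for (int k = 0; k < NBLK; k++)
            sad_out[k] = 0;
        for (int y = 0; y < N; y++)
            for (int x = 0; x < N; x++) {
                unsigned char p = cand[y][x];    /* single on-chip read */
                for (int k = 0; k < NBLK; k++) { /* models NBLK PE arrays */
                    int d = cur[k][y][x] - p;
                    sad_out[k] += (unsigned)(d < 0 ? -d : d);
                }
            }
    }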
In accordance with a second embodiment of the present invention, a so-called inter-macroblock and inter-candidate parallelism is proposed. Pixel data of several consecutive candidate blocks are read from an overlapped region of the search windows of a plurality of current blocks, the search windows lying in a reference frame that includes reference blocks corresponding to the current blocks, and the pixel data are transferred in parallel to a plurality of groups, each group including several processing element (PE) arrays. The PE arrays of each group are used to determine how well the current blocks match the reference blocks. For example, if there are four current blocks CB1-CB4 and four consecutive candidate blocks, the data of the first, second, third, and fourth candidate blocks are read and transferred to four groups of PE arrays in parallel. Each group includes four PE arrays for calculating SADs for CB1 to CB4.
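The following sketch models the combined scheme in the same illustrative style, again with assumed array shapes and names: four candidate blocks are fetched together, and each group's four accumulators stand in for four PE arrays. Relative to the first embodiment, the fetch of NGRP candidates per step supplies the additional inter-candidate parallelism.

    /* Illustrative model of the second embodiment: NGRP consecutive
       candidate blocks are fetched together; each is broadcast to its
       own group of NBLK SAD accumulators (inter-candidate plus
       inter-macroblock parallelism). */
    #define NGRP 4   /* candidate blocks per fetch (inter-candidate)  */
    #define NBLK 4   /* PE arrays per group (inter-macroblock)        */
    #define N    16  /* macroblock size */

    void dual_parallel(const unsigned char cand[NGRP][N][N],
                       const unsigned char cur[NBLK][N][N],
                       unsigned sad_out[NGRP][NBLK])
    {
        for (int g = 0; g < NGRP; g++)          /* one group per candidate */
            for (int k = 0; k < NBLK; k++) {    /* one PE array per block  */
                unsigned acc = 0;
                for (int y = 0; y < N; y++)
                    for (int x = 0; x < N; x++) {
                        int d = cur[k][y][x] - cand[g][y][x];
                        acc += (unsigned)(d < 0 ? -d : d);
                    }
                sad_out[g][k] = acc;
            }
    }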
According to the methodology of this invention, on-chip memory bandwidth can be significantly decreased and the number of memory accesses can be reduced; therefore, power consumption is reduced.
The objectives and advantages of the present invention will become apparent upon reading the following description and upon reference to the accompanying drawings in which:
FIGS. 2(a)-2(c) show processing steps of a traditional method without parallel processing for motion estimation;
To solve the problems mentioned above, a new data reuse methodology, which takes advantage of inter-macroblock parallelism, is proposed.
As shown in
In summary, to increase the data reuse ratio, the data of each candidate block in the overlapped region are read one at a time and transferred in parallel to four 2D processing element (PE) arrays. Each PE array is responsible for calculating the SAD for one current macroblock. This method reduces on-chip memory bandwidth by a factor of N through the parallel processing of N 2D PE arrays.
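As a purely illustrative calculation, with all figures assumed rather than taken from the specification: with N = 4 PE arrays and 16×16 candidate blocks of 8-bit pixels, four serially processed macroblocks would each read the same candidate block, costing 4 × 256 = 1,024 on-chip pixel reads per candidate position; a single broadcast read costs only 256 reads, the four-fold reduction corresponding to N = 4 above.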
In order to further increase the data reuse ratio and reduce on-chip memory bandwidth, a combination of inter-candidate parallelism methodology and inter-macroblock parallelism methodology is proposed.
In summary, the degree of both parallelisms can be extended according to the expected throughput. There are sixteen 2D PE arrays in total in the proposed architecture, and each of them consists of 256 processing elements (PEs). The sixteen 2D PE arrays are divided into four groups. Four consecutive candidate blocks are read at a time and passed in parallel to the four groups; each group calculates the SADs of one candidate block for four macroblocks. Therefore, the architecture can evaluate sixteen candidates in one clock cycle when the pipeline is full. Additionally, the search order in the architecture is column-major order, for realizing inter-macroblock parallelism.
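As an illustrative throughput estimate, assuming a search range of ±16 pixels (an assumption, not a limitation of the invention): such a search window contains 33 × 33 = 1,089 candidate positions, so at four candidates per cycle, each evaluated for four macroblocks at once, the architecture finishes the full search for four macroblocks in about 273 cycles (1,089 / 4, rounded up), or roughly 68 cycles per macroblock on average, once the pipeline is full.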
Moreover, both the proposed inter-macroblock parallelism method and the combined inter-candidate and inter-macroblock parallelism method can reach 100% hardware utilization, so no hardware or power is wasted. For example, the detailed timing diagram of the proposed inter-macroblock parallelism method is shown in
Because each reference pixel is read only once, the proposed methodology reduces the required number of memory accesses. Moreover, the system stores only one candidate block strip instead of one search area strip, and hence reduces the necessary memory size.
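As a rough, purely illustrative comparison, with every figure below an assumption rather than a value from the specification: with 16×16 macroblocks and a search range of ±16 pixels, a candidate block strip 16 pixels wide and one search window (48 pixels) tall holds 16 × 48 = 768 reference pixels, whereas a search area strip spanning a 1280-pixel-wide frame at the same 48-pixel height would hold 1280 × 48 = 61,440 pixels, roughly eighty times more on-chip storage.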
On-chip and off-chip memory bandwidth under six different conditions are analyzed. Different memory sizes and different reuse methodologies are used in these conditions. The details of the six conditions are shown below, and the results are shown in Table 1 and Table 2. In addition to memory bandwidth, the hardware cost and throughput of the six conditions are analyzed; Table 3 shows the details.
In addition, a real case is used to analyze the necessary memory size and memory bandwidth of the six conditions. The settings of the experiment are shown below and
In this invention, a new data reuse methodology for motion estimation in H.264/AVC is proposed. Experimental results show that the methodology can reduce on-chip memory bandwidth by 97.7% (from 128.3 GBytes/s to 2.9 GBytes/s). It also reduces the number of memory accesses and therefore reduces power consumption. Finally, the hardware utilization of the proposed architecture remains 100%.
The above-described embodiments of the present invention are intended to be illustrative only. Numerous alternative embodiments may be devised by those skilled in the art without departing from the scope of the following claims.