This invention generally relates to a method and apparatus for reference area transfer. More specifically, it relates to performing pre-analysis for transferring a specific reference area.
In video processing, minimizing the amount of data transfer from external memory to internal memory for motion estimation (ME) and motion compensation (MC) is critical to reduce power consumption. In general, there is a trade-off between the amount of data transfer and internal memory size, i.e., the amount of data transfer can be reduced by increasing internal memory size and vice versa.
However, because internal memory size is fixed based on silicon area, the amount of data transfer needs to be minimized for a given internal memory size. Thus, there is a need for a reference data transfer method and apparatus that minimizes the amount of data transfer using pre-analysis information for a given internal memory size and that improves coding efficiency.
An embodiment of the present invention provides a method and apparatus for reducing reference data transfer and improving coding efficiency. The method includes performing pre-analysis on a decimated version of an image, and utilizing the predictions of the pre-analysis to transfer a smaller reference area.
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
Accurate pre-analysis makes it possible to control the amount of data transfer for a given internal memory size and improves PSNR performance, and thus coding efficiency. The proposed method minimizes hardware resources, such as power consumption and internal memory size, when encoding high-resolution videos or videos with fast or complex motion, while improving coding efficiency.
For example, minimizing the amount of data transfer from external memory to internal memory for motion estimation and motion compensation is critical to reduce power consumption of a video codec. In general, there is a trade-off between the amount of data transfer and internal memory size, i.e., the amount of data transfer can be reduced by increasing internal memory size and vice versa. However, because internal memory size is fixed based on silicon area, the amount of data transfer needs to be minimized for a given internal memory size. Pre-analysis can provide various information, such as, initial motion search point, motion boundary, partition size, etc., which may be utilized to perform motion estimation that minimizes the amount of data transfer and improves coding efficiency.
In one embodiment, as shown in
Usually, motion search on the 4:1 domain is performed on a 16x16 block basis (64x64 on the 1:1 domain). However, it generates motion vectors (MVs) for smaller blocks within a 16x16 block as well as a motion vector for the 16x16 block itself. Neighboring motion vectors (left, upper-left, upper and upper-right) and the global MV are used as initial prediction points. In pre-analysis, the cost may be evaluated at each point, and the point that produces the minimum cost is chosen. More motion vectors, such as co-located motion vectors, can be added to increase prediction accuracy. For each initial prediction point, the costs of smaller partitions (16x8, 8x16, 8x8 and 4x4) are also evaluated. Each partition has its own best motion vector.
After the best initial motion vector is determined, more points may be searched around it so that accurate motion is found. All points within 16x16 and 8x8 search areas around the motion vector are searched for P-type and B-type frames, respectively. Each partition keeps updating its best motion vector during the refinement, so that after the refinement each partition has its own best motion vector. To minimize the total cost, more combinations with 8x8 and 4x4 partitions are generated. First, we determine the best cost for each 8x8 partition (one 8x8 block or four 4x4 blocks). Then, we compare the best partition to the 16x16, 16x8 and 8x16 partitions.
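The two-stage partition decision described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `choose_partition` and the way costs are passed in (already computed per partition) are assumptions made for the example.

```python
def choose_partition(cost_16x16, cost_16x8, cost_8x16, cost_8x8, cost_4x4):
    """Pick the cheapest partitioning of one 16x16 block.

    cost_16x8 / cost_8x16: two costs each (one per half).
    cost_8x8: four costs, one per 8x8 quadrant.
    cost_4x4: four lists of four costs, one list per 8x8 quadrant.
    All names and the input layout are hypothetical.
    """
    sub_choice = []   # per-quadrant decision: "8x8" or "4x4"
    sub_total = 0     # total cost of the best 8x8/4x4 combination
    for c8, c4 in zip(cost_8x8, cost_4x4):
        four = sum(c4)
        if c8 <= four:
            sub_choice.append("8x8")
            sub_total += c8
        else:
            sub_choice.append("4x4")
            sub_total += four

    # Stage 2: compare the best sub-partitioning to the larger partitions.
    candidates = {
        "16x16": cost_16x16,
        "16x8": sum(cost_16x8),
        "8x16": sum(cost_8x16),
        "8x8": sub_total,
    }
    best = min(candidates, key=candidates.get)
    return best, candidates[best], sub_choice


if __name__ == "__main__":
    print(choose_partition(
        100, [40, 45], [60, 50], [30, 20, 25, 30],
        [[5, 5, 5, 5], [10, 10, 10, 10], [6, 6, 6, 6], [9, 9, 9, 9]],
    ))
```

Ties are resolved in favor of the larger partition here; the description does not say how ties are broken, so that choice is also an assumption.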
The cost for a search point consists of the sum of absolute differences (SAD) and the cost of the motion vector: cost = SAD + lambda * MVD_bits, where lambda is a Lagrangian multiplier and MVD_bits is the number of bits needed to encode the MV difference between the current motion vector and the motion vector predictor (MVP). The motion vector predictor is the median of the neighboring motion vectors (left, upper and upper-right). An accurate motion vector predictor is available for the 16x16 block; however, for smaller partitions, because the motion vectors of neighboring blocks have not yet been determined, the motion vector predictor of the 16x16 block is used.
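As an illustration of the cost formula above, the following sketch computes cost = SAD + lambda * MVD_bits with a component-wise median predictor. The description does not state how MVD_bits is counted; the signed Exp-Golomb code length used here is an assumption, as are all function names.

```python
def median_mv(left, upper, upper_right):
    """Component-wise median of three neighboring motion vectors (x, y)."""
    xs = sorted(v[0] for v in (left, upper, upper_right))
    ys = sorted(v[1] for v in (left, upper, upper_right))
    return (xs[1], ys[1])


def sad(cur, ref):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(c - r)
               for row_c, row_r in zip(cur, ref)
               for c, r in zip(row_c, row_r))


def signed_exp_golomb_bits(v):
    """Length in bits of the signed Exp-Golomb code for v (assumed bit model)."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1


def mvd_bits(mv, mvp):
    """Bits for the MV difference, summed over the x and y components."""
    return sum(signed_exp_golomb_bits(m - p) for m, p in zip(mv, mvp))


def search_point_cost(sad_value, mv, mvp, lam):
    """cost = SAD + lambda * MVD_bits, as in the text."""
    return sad_value + lam * mvd_bits(mv, mvp)


if __name__ == "__main__":
    mvp = median_mv((1, 2), (3, 0), (2, 5))
    print(mvp, search_point_cost(100, (3, 2), mvp, lam=4))
```

For the smaller partitions, the same `mvp` computed for the 16x16 block would simply be passed to `search_point_cost` for every partition, following the text.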
In one embodiment, the search area on the 4:1 domain can be determined based on the available data transfer bandwidth and internal memory size. The computational complexity of initial predictor evaluation on the 4:1 domain is similar to that on the 1:1 domain. Refinement of 4:1 domain motion estimation requires more sum of absolute difference calculations, whereas the main motion estimation may need, for example, 6-tap filtering and 18 sum of absolute difference calculations for fractional-pel search. Thus, assuming the computational complexity per 16x16 block is roughly similar to that of main motion estimation, the total extra computational complexity is (num_16x16 / 16) * comp_per_16x16, where num_16x16 is the number of 16x16 blocks in a frame and comp_per_16x16 is the computational complexity per 16x16 block on the 1:1 domain.
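A short worked example of the complexity estimate, using the 3840x2160 frame size that appears elsewhere in this description. The function name and the choice to express comp_per_16x16 in arbitrary units are assumptions for illustration.

```python
import math


def extra_preanalysis_complexity(width, height, comp_per_16x16=1.0):
    """(num_16x16 / 16) * comp_per_16x16 for a frame of the given size.

    comp_per_16x16 is in arbitrary units (hypothetical); with the default
    of 1.0 the result is in "main-ME 16x16 block equivalents".
    """
    num_16x16 = math.ceil(width / 16) * math.ceil(height / 16)
    return (num_16x16 / 16) * comp_per_16x16


if __name__ == "__main__":
    num_blocks = (3840 // 16) * (2160 // 16)          # 240 * 135 blocks
    extra = extra_preanalysis_complexity(3840, 2160)
    print(num_blocks, extra, extra / num_blocks)
```

Under the stated assumption that one 16x16 block costs about the same on either domain, pre-analysis adds 1/16 (6.25%) on top of the main motion estimation, because the decimated frame contains 1/16 as many blocks.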
Pre-analysis produces one MV for each 16x16 block on the 1:1 domain. Let crude motion vector (CMV) denote the MV from pre-analysis, because it is crude on the 1:1 domain. The search area on the 1:1 domain is determined for each 16x16 block using the crude motion vector. The reference window, which is the actual area used for motion estimation, is calculated based on the search range, the number of pixels required for fractional-pel search, and the block size (16x16). For example, when the search area is +/-9 around the CMV in the vertical and horizontal directions, the reference window becomes a 40x40 area around the position indicated by the CMV (2 * 9 + 16 + 3 + 3 = 40 in each direction in H.264/AVC).
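The reference window geometry can be sketched as below. The sketch assumes integer-pel CMVs on the 1:1 domain and a 3-pixel margin on each side for the 6-tap interpolation filter, which matches the 3 + 3 + 16 = 22 figure given for the skip/direct window elsewhere in this description; the function names and the rectangle convention (x0, y0, x1, y1 with exclusive right and bottom edges) are hypothetical.

```python
def window_size(search_range, block=16, margin=3):
    """Side length of the square reference window in pixels."""
    return 2 * search_range + block + 2 * margin


def reference_window(block_x, block_y, cmv, search_range, block=16, margin=3):
    """Rectangle (x0, y0, x1, y1) of the window centered on the CMV target.

    block_x, block_y: top-left corner of the current 16x16 block.
    cmv: crude motion vector (dx, dy) in integer pixels (assumption).
    """
    ext = search_range + margin          # pixels needed beyond the block edge
    x0 = block_x + cmv[0] - ext
    y0 = block_y + cmv[1] - ext
    size = window_size(search_range, block, margin)
    return (x0, y0, x0 + size, y0 + size)


if __name__ == "__main__":
    print(window_size(9))    # main window for a +/-9 search range
    print(window_size(0))    # skip/direct window with no search range
    print(reference_window(64, 32, (5, -2), 9))
```

With these assumptions, a +/-9 search range yields the 40x40 main window, and a zero search range yields the 22x22 skip/direct window quoted later in the text.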
For motion search on the 1:1 domain, neighboring motion vectors, the global MV, temporal motion vectors and crude motion vectors are used as initial predictors. However, if a motion vector is not within the valid search area determined by the crude motion vector, that motion vector is excluded. Also, the crude motion vector alone may be used as the initial predictor to reduce computational complexity at the cost of PSNR performance. Similarly, the best initial predictor may be refined by using a 3-step search or a grid search. For the best search point, a fractional-pel search may be performed.
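The predictor filtering step above can be sketched as follows. The function name, the square validity test, and the `cost_fn` callback are assumptions for illustration; the description only states that predictors outside the CMV-determined search area are excluded.

```python
def select_initial_predictor(candidates, cmv, search_range, cost_fn):
    """Return the cheapest predictor inside the CMV-centered search area.

    candidates: iterable of (dx, dy) motion vectors (neighbors, global,
                temporal); cmv is always added as a predictor.
    cost_fn:    hypothetical callback mapping a motion vector to its cost.
    """
    valid = [
        mv for mv in dict.fromkeys(candidates)        # drop duplicates, keep order
        if abs(mv[0] - cmv[0]) <= search_range
        and abs(mv[1] - cmv[1]) <= search_range
    ]
    if cmv not in valid:
        valid.append(cmv)                             # the CMV is valid by construction
    return min(valid, key=cost_fn)


if __name__ == "__main__":
    # Toy cost: distance to a "true" motion of (12, -4).
    cost = lambda mv: abs(mv[0] - 12) + abs(mv[1] + 4)
    print(select_initial_predictor([(0, 0), (20, 3), (12, -4)], (10, 0), 9, cost))
```

Passing an empty candidate list reduces this to the lower-complexity variant in which only the crude motion vector serves as the initial predictor.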
When the skip/direct MV is not within a valid search range, the reference area for the skip/direct motion vector may be transferred from external to internal memory; hence, the cost of the skip/direct motion vector can always be evaluated.
At the final stage, we select the mode (inter or intra) that produces the minimal cost. Since each 16x16 block has its own reference window, the reference window must be transferred from external to internal memory. However, if there is an overlapped area between the current reference window and a neighboring reference window, only the non-overlapped area may be transferred.
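The saving from transferring only the non-overlapped area can be illustrated with axis-aligned rectangles. This is a sketch under stated assumptions: windows are (x0, y0, x1, y1) with exclusive right and bottom edges, one byte per luma pixel, and a single neighboring window; the function names are hypothetical.

```python
def overlap_area(a, b):
    """Area of the intersection of two rectangles (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def non_overlapped_bytes(current, neighbor):
    """Bytes to transfer for `current` when `neighbor` is already in memory."""
    area = (current[2] - current[0]) * (current[3] - current[1])
    return area - overlap_area(current, neighbor)


if __name__ == "__main__":
    left = (0, 0, 40, 40)        # window of the left neighbor block
    current = (16, 0, 56, 40)    # same CMV, block shifted right by 16
    print(non_overlapped_bytes(current, left))
```

When neighboring blocks share the same crude motion vector, consecutive 40x40 windows are offset by 16 pixels, so only a 16x40 strip is new for each block.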
Alternatively, the larger overlapped area is selected and the corresponding non-overlapped area is transferred, which increases data transfer but avoids both the total overlapped area calculation and complex data transfer. In
A skip/direct motion vector may not be within a valid search range. In such a case, the reference area for the skip/direct motion vector is transferred. In one embodiment, the transferred reference area is 22x22 (3 + 3 + 16 = 22 in each direction in H.264). No overlapped area is calculated between the skip/direct motion vector reference window and the main 40x40 window, i.e., the two data transfers are done separately.
In order to ensure real-time operation, the instantaneous and average data transfer rates should meet the hardware requirement. For example, the data transfer rate in IVAHD2.0 is 3584 bytes per 16x16 block for 3840x2160 @ 30 fps. The amount of data transfer (on the 1:1 domain) may be estimated as the sum of the non-overlapped areas of all 16x16 blocks within a frame. Hence, when the reference window size is 40x40 for a P-type frame, the maximum amount of data transfer is 40*40 + 24*24 = 2176 bytes per 16x16 block. For a B-type frame, if the reference window size is 32x32, the maximum amount of data transfer is 2 * (32*32 + 24*24) = 3200 bytes per 16x16 block. In both cases, the maximum amount of data transfer is less than 3584 bytes per 16x16 block, which guarantees real-time operation. If overlapped areas are considered, the actual amount of data transfer is much less than the maximum amount.
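The worst-case budget check above is simple arithmetic and can be reproduced as follows. The constants come from the text (3584-byte budget, 24x24 second term, one reference for P-type and two for B-type frames); the function and constant names are hypothetical.

```python
BUDGET_BYTES_PER_BLOCK = 3584   # figure quoted in the text for 3840x2160 @ 30 fps


def max_transfer_bytes(window, extra_window=24, num_refs=1):
    """Worst-case bytes per 16x16 block when no overlap is exploited.

    window:       side of the main reference window (40 for P, 32 for B).
    extra_window: side of the second window in the text's formula (24).
    num_refs:     1 for P-type frames, 2 for B-type frames.
    """
    return num_refs * (window * window + extra_window * extra_window)


if __name__ == "__main__":
    p_bytes = max_transfer_bytes(40, num_refs=1)
    b_bytes = max_transfer_bytes(32, num_refs=2)
    for name, value in (("P", p_bytes), ("B", b_bytes)):
        print(name, value, value <= BUDGET_BYTES_PER_BLOCK)
```

Both worst cases fit within the budget, with 1408 bytes of headroom for P-type frames and 384 bytes for B-type frames.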
The required internal memory size (for the 1:1 domain) may be estimated by combining the overlapped areas between the current reference window and the left or upper reference window. If Left_Overlap is larger than Upper_Overlap, Upper_Overlap does not need to be stored, and the left overlapped area may be released from internal memory immediately after the current window finishes motion search. However, if Upper_Overlap is larger than Left_Overlap, Upper_Overlap needs to be stored in internal memory until the current window finishes motion search.
The frame size of a 4:1 decimated frame is 1/16 of the original frame size. For example, the 4:1 decimated frame size for 3840x2160 video is 960x540. If a vertical sliding window scheme is used with a vertical search range of +/-64 (+/-256 on the 1:1 domain), the total internal memory size for a B-type frame is 2 * ((2 * 64 + 16) * (960 + 32)) = 285696 bytes per 16x16 block. The maximum horizontal search range is the same as the frame width (+/-960). The amount of data transfer of the vertical sliding window scheme is roughly 16 bytes per 4x4 block on the 4:1 domain (luma only), which means an additional transfer of 16 bytes per 16x16 block on the 1:1 domain is needed.
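The sliding-window memory figure can be checked with a few lines of arithmetic. The parameter names are hypothetical; the 32-pixel horizontal term is taken directly from the formula in the text, which does not explain its origin.

```python
def decimated_size(width, height, factor=4):
    """Frame size after decimating by `factor` in each dimension."""
    return (width // factor, height // factor)


def sliding_window_bytes(dec_width, v_range, block=16, h_extra=32, num_refs=2):
    """Internal memory for a vertical sliding window on the 4:1 domain.

    Implements num_refs * ((2 * v_range + block) * (dec_width + h_extra)),
    assuming one byte per luma pixel. num_refs=2 corresponds to a B-type frame.
    """
    return num_refs * ((2 * v_range + block) * (dec_width + h_extra))


if __name__ == "__main__":
    w, h = decimated_size(3840, 2160)
    print(w, h, sliding_window_bytes(w, v_range=64))
```

The window spans the full decimated width, which is why the horizontal search range is limited only by the frame width.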
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
This application is a continuation of U.S. Pat. Application Serial No. 13/542,171, filed Jul. 5, 2012, which is scheduled to issue as U.S. Pat. No. 11,582,479 on Feb. 14, 2023, and which claims priority to U.S. Provisional Pat. Application Serial No. 61/504,587, filed Jul. 5, 2011, the entireties of each of which are incorporated by reference herein.
Number | Date | Country
---|---|---
61504587 | Jul 2011 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 13542171 | Jul 2012 | US
Child | 18108773 | | US