The present invention relates to a system and method for image processing, and, in particular embodiments, to a system and method for motion estimation for large-size blocks.
Video coding deals with the representation of video data, for storage and/or transmission, for example for digital video. Video coding can be implemented with captured video as well as computer-generated video and graphics. Goals of video coding are to accurately and compactly represent the video data, provide navigation of the video (e.g., searching forwards and backwards, random access, etc.), and provide additional authoring and content benefits, such as text (subtitles), meta information for searching/browsing, and digital rights management. Video data is typically processed in blocks of data bytes or bits, where multiple blocks form an image frame. Video coding can be performed by a processor on the transmitting end (also referred to as an encoder) to compress original video into a format suitable for transmission. Video coding can also be performed by a trans-coder that converts digital video data from one encoding format to another. The encoder and trans-coder may include software components implemented via a processor or firmware. Video coding functions include motion estimation (ME), which is the process of determining motion vectors that describe the transformation from one two-dimensional (2D) image to another.
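The block-matching idea behind motion estimation can be illustrated with a minimal sketch. This is illustrative background only, not the scheme claimed herein; it uses the sum of absolute differences (SAD), a common matching cost, and all function names are hypothetical.

```python
# Illustrative sketch of block-matching motion estimation (not the
# claimed scheme): find the offset in a reference frame that best
# matches a block of the current frame, using SAD as the cost.

def sad(ref, cur, rx, ry, cx, cy, n):
    """Sum of absolute differences between the n x n block of `cur`
    at (cx, cy) and the n x n block of `ref` at (rx, ry)."""
    return sum(abs(ref[ry + i][rx + j] - cur[cy + i][cx + j])
               for i in range(n) for j in range(n))

def motion_vector(ref, cur, bx, by, n, search):
    """Exhaustively search +/- `search` pixels around (bx, by) and
    return the motion vector (dx, dy) with the lowest SAD."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            if 0 <= rx and rx + n <= w and 0 <= ry and ry + n <= h:
                cost = sad(ref, cur, rx, ry, bx, by, n)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2]
```

In practice, encoders rarely search exhaustively; the point here is only that each candidate line of the search window must be fetched and compared, which is the source of the per-line cycle overhead discussed below.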
High-Efficiency Video Coding (HEVC) is a recent video coding standard being developed by the Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T and ISO/IEC. The HEVC standard is incorporated herein by reference. In HEVC, the size of processed blocks (for an image frame) is relatively large, such as 64×64 blocks of data units. The processing of large-size blocks for ME is a computationally intensive operation, which can substantially reduce computation performance and/or increase hardware or chip cost and complexity.
In one embodiment, a method for motion estimation (ME) for a large-size block of image data is disclosed. The method includes obtaining a large-size block for ME processing and dividing the large-size block into a plurality of small-size blocks. The method also includes processing the small-size blocks in parallel using a small-size block ME processing algorithm. The large-size block comprises an integer multiple of the small-size blocks. In an example, the small-size blocks are 16×16 blocks of data bytes.
In another embodiment, an apparatus for implementing ME for a large-size block of image data is disclosed. The apparatus comprises a processor configured to obtain a 64×64 block of bytes of image data for ME processing and divide the 64×64 block into a plurality of 16×16 blocks of data bytes. The processor is also configured to process the 16×16 blocks in parallel using a ME processing algorithm for 16×16 blocks.
In yet another embodiment, a network component for video coding is disclosed. The network component comprises a processor configured to obtain a large-size block of bytes of image data for motion estimation (ME), divide the large-size block into a plurality of small-size blocks of bytes that comprise a same data, and process the small-size blocks for ME individually in parallel using a corresponding small-size block ME processing algorithm. The network component further comprises a single shared register for storing at different times the small-size blocks.
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
The making and using of the presently preferred embodiments are discussed in detail below. It should be appreciated, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the invention, and do not limit the scope of the invention.
In the recent video compression standard HEVC, large-size blocks of image data that belong to image frames, such as 64×64, 64×32, 32×64, 32×32, 32×16, and 16×32 blocks, are used in ME. The blocks comprise bytes of data and may be represented in the form of matrices. Compared to small-size blocks (e.g., 16×16 blocks or smaller), the large-size blocks require more overhead for ME, such as in number of processor cycles (i.e., clock cycles). For example, processing a 16×16 block may take 16 cycles before starting the actual motion search calculation. Using the same ME architecture in video encoder chips, a 64×64 block typically requires 64 cycles to start the actual motion search calculation. Generally, ME is performed for a plurality of lines for the same block, for example for multiples of 16 lines. Thus, the ME overhead (in number of cycles) is proportional to both the block size and the number of lines for motion search. For instance, when there are 64 lines to be processed for a 64×64 block, the number of cycles needed for ME is equal to 64×64 or 4096 cycles.
Using a typical video processing chip and logic, such as based on a 1080P60 HD format, each 64×64 block may have only 6,400 cycles that can be used for overall ME computing. Thus, the actual computing time for ME, e.g., for actual motion search calculation, is reduced significantly after using 4096 of the cycles for line motion searches (for 64 lines per block). The cycles that remain for performing actual motion search calculation may be limited and reduce ME performance in comparison to the case of small-size blocks (e.g., 16×16 blocks). To compensate for this overhead, more complex hardware or chip logic may be used, which increases chip cost and resource (e.g., power) consumption. Thus, improving motion estimation efficiency and simplifying chip logic for large-size blocks are beneficial to significantly improve performance and reduce chip cost for video coding and processing.
To decrease the time for line motion search and the chip cost and improve ME performance for large-size blocks, embodiments are disclosed herein that use fewer cycles than the current approach to efficiently process large-size blocks. An embodiment method may be implemented by an apparatus, a processor (e.g., an encoder), or a network component and includes dividing a large-size block into multiple equivalent 16×16 blocks, and then processing the individual 16×16 blocks using a standard or current ME processing method for such small-size blocks. For example, a 64×64 block may be divided into 16 small-size 16×16 blocks that represent the same data, where each 16×16 block needs 16 cycles of overhead for ME. As such, the resulting total number of cycles for processing the data of the 64×64 block becomes equal to 16×64 or 1024 cycles instead of 64×64 or 4096 cycles, as required by standard large-size block ME processing. Using this method, the overhead for ME in number of cycles may be reduced by a ratio of about ¾ (i.e., about a 75% overhead reduction). The resulting freed-up cycles may be used for actual motion search calculation, which improves ME efficiency and performance. Additionally or alternatively, this reduced overhead may reduce chip complexity and logic, cost, and power consumption.
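The subdivision step above can be sketched as follows. The helper is hypothetical (the specification does not prescribe an implementation); it simply tiles a square block into equal small-size blocks carrying the same data, in raster order.

```python
# Sketch (hypothetical helper) of dividing a large-size block into
# equivalent small-size blocks that together hold the same data.

def split_into_tiles(block, tile):
    """Split a square `block` (a list of row lists) into
    (n / tile)**2 tiles of size tile x tile, in raster order."""
    n = len(block)
    assert n % tile == 0, "block size must be an integer multiple of tile size"
    return [[row[x:x + tile] for row in block[y:y + tile]]
            for y in range(0, n, tile)
            for x in range(0, n, tile)]

# A 64x64 block yields 16 tiles of 16x16; each tile carries
# 16x16x8 = 2048 bits of the original 64x64x8 = 32768 bits.
```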
The ME processing scheme 100 typically uses 64 processor cycles to perform one line motion search for a 64×64 block. The number of lines that are considered for ME may correspond to the number of data rows of the image frame portion 110, i.e., V. Thus, the total number of cycles for line motion searches is equal to V×64 cycles. The number of data rows, V, may be a multiple of 16. For example, when V is equal to 16, the total number of cycles for line motion searches is equal to 16×64 or 1024 cycles, and when V is equal to 64, the total number of cycles is equal to 64×64 or 4096 cycles. Thus, the overhead for ME may substantially increase as the block size increases and as the number of line motion searches or V increases. Additionally, the scheme 100 uses a 64×64 8-bit register, i.e., a total of 64×64×8 or 32K bits, to store the 64×64 block data for processing. Due to the requirements above, it is more feasible to implement the scheme 100 via hardware, e.g., using a HEVC standard chip, with or without software, such as in the case of real-time processing/communications applications.
The ME processing scheme 200 may first divide the large-size block into a plurality of equivalent small-size blocks, for instance a plurality of 16×16 blocks, and process the equivalent 16×16 blocks in parallel using a current small-size block ME scheme from existing video coding standards, referred to herein as 16×16 micro-block ME. For example, a 64×64 block may be processed by dividing the block into 16 small-size 16×16 blocks and then processing the individual 16×16 blocks in parallel, e.g., at about the same time using time division multiplexing. Each 16×16 block may be processed using an efficient existing or standard ME processing scheme for small-size 16×16 blocks. Each 16×16 block may need 16 line motion searches, where one line motion search requires 16 processor cycles for ME. Since the resulting 16 small-size 16×16 blocks are processed in parallel, the 16 line motion searches can be implemented at about the same time. As such, the total number of cycles for all the blocks is equal to 16×16 (or 256) cycles, and the overhead for ME may be substantially reduced (by about 75%) in comparison to the ME processing scheme 100. The savings in overhead (i.e., in number of cycles) may be used for actual motion search calculation to improve processing efficiency and performance. The savings in overhead may also translate into savings in chip cost and power consumption, for example while maintaining the same level of performance as the current scheme 100.
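The cycle accounting in the two schemes can be sketched as follows. These function names and the critical-path model are illustrative assumptions, not from the specification; the point is that when the tiles run concurrently, only one tile's line searches appear on the critical path.

```python
# Sketch of the cycle accounting for the two schemes described above.
# Scheme 100 searches the 64x64 block as one large block; scheme 200
# runs 16 tiles concurrently, so the tile count does not multiply the
# overhead on the critical path.

def scheme_100_overhead(cycles_per_line=64, num_lines=64):
    """Overhead when the 64x64 block is searched as one large block."""
    return cycles_per_line * num_lines

def scheme_200_overhead(cycles_per_line=16, lines_per_tile=16):
    """Critical-path overhead when all 16x16 tiles run in parallel:
    only one tile's line searches contribute."""
    return cycles_per_line * lines_per_tile
```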
Additionally, the scheme 200 may use a 16×16 8-bit register, i.e., a total of 16×16×8 (or 2K) bits, to store the 16×16 block data for processing. Since the 16 small-size 16×16 blocks are processed in parallel, e.g., via time division multiplexing, a single 2K-bit register can be shared to store all the blocks at different times. This corresponds to a ratio of 15/16 in register size savings in comparison to the scheme 100. The savings in register size or memory further reduce cost and power consumption and simplify chip logic.
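The register-size comparison reduces to simple arithmetic; the sketch below (with an illustrative helper name) restates it numerically.

```python
# Sketch of the register-size comparison: scheme 100 buffers the whole
# 64x64 block of 8-bit samples, while scheme 200 time-shares a single
# 16x16 buffer across all 16 tiles.

def register_bits(width, height, bits_per_sample=8):
    """Bits needed to hold a width x height block of samples."""
    return width * height * bits_per_sample

full_block = register_bits(64, 64)   # scheme 100: 32768 bits (32K)
shared = register_bits(16, 16)       # scheme 200: 2048 bits (2K)
savings_ratio = 1 - shared / full_block   # 15/16 of the storage saved
```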
Program code, e.g., the code implementing the algorithms disclosed above, and data can be stored in a memory 420. The memory 420 can be read only memory (or ROM), a local memory such as DRAM or mass storage such as a hard drive, optical drive or other storage (which may be local or remote). While the memory 420 is illustrated functionally with a single block, it is understood that one or more hardware blocks can be used to implement this function. The memory 420 may comprise the shared register that is used to process the small-size blocks in the scheme 200 and the method 300.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.
| Number | Name | Date | Kind |
|---|---|---|---|
| 5170259 | Niihara | Dec 1992 | A |
| 5319457 | Nakahashi et al. | Jun 1994 | A |
| 6118901 | Chen et al. | Sep 2000 | A |
| 7126991 | Mimar | Oct 2006 | B1 |
| 20030020965 | LaRocca et al. | Jan 2003 | A1 |
| 20040008780 | Lai et al. | Jan 2004 | A1 |
| 20120328003 | Chien et al. | Dec 2012 | A1 |
| 20140056355 | Li | Feb 2014 | A1 |
| Entry |
|---|
| Bross, B., et al., "High efficiency video coding (HEVC) text specification draft 8," Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-J1003-d7, 10th Meeting, Stockholm, SE, Jul. 11-20, 2012, 260 pages. |
| Number | Date | Country |
|---|---|---|
| 20140092974 A1 | Apr 2014 | US |