So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
A method and apparatus for motion estimation in a video encoder are described. One or more aspects of the invention relate to video coding compliant with the H.264 video coding standard. The documents establishing the AVC/H.264 video coding standard, namely ITU-T Rec. H.264 | ISO/IEC 14496-10 version 4 (1 Mar. 2005), are incorporated by reference herein. Although the present method and apparatus for motion estimation is compatible with and will be explained using H.264 standard guidelines, those skilled in the art will appreciate that the motion estimation of the present invention may be modified and used as best serves a particular standard or application.
Each of the frames is formed of macroblocks of pixels. Each macroblock in a frame includes a 16×8 pixel region. Each reference to pixel dimensions herein includes the vertical pixels first followed by the horizontal pixels (V×H) and is in HHR terms unless otherwise indicated. When discussing H.264 terms, the horizontal dimension should be multiplied by two (i.e., in H.264 terms, each macroblock includes a 16×16 pixel region). Each macroblock comprises two interlaced fields: field-0 (also referred to as the even field) and field-1 (also referred to as the odd field). Each field in a single macroblock includes an 8×8 pixel region. As described below, the FPME module 206 processes the macroblocks of a current frame in vertical pairs. Each macroblock pair includes a 32×8 pixel region. Thus, each field of a macroblock pair includes a 16×8 pixel region. In frame terms, each macroblock pair can be divided into a frame top having 16×8 pixels and a frame bottom having 16×8 pixels.
The FPME module 206 performs a full search across a search region in a reference frame (“reference window”). In one embodiment, the reference window comprises a 128×128 pixel region. In general, the motion vector search for a current macroblock pair begins by placing the macroblock pair at the top left corner of the reference window and performing pixel-for-pixel subtractions. The pixel differences are used to compute various sums of absolute differences (SADs). The computed SADs are minimized to produce motion vector data for the current macroblock pair. The current macroblock pair is then shifted one pixel to the right and the process is repeated across all 128 horizontal pixel locations of the reference window. Then the current macroblock pair is shifted down one line and the process is repeated for all lines of the reference window.
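By way of illustration only, the full search described above can be sketched in software as follows. The sketch is not the hardware implementation: the function names `sad` and `full_search` and the list-of-rows pixel layout are assumptions made for the example, and the sketch visits only those positions at which the block fits entirely inside the window.

```python
def sad(block, window, dy, dx):
    """Sum of absolute differences with the block's top-left pixel at (dy, dx)."""
    return sum(
        abs(window[dy + r][dx + c] - block[r][c])
        for r in range(len(block))
        for c in range(len(block[0]))
    )


def full_search(block, window):
    """Return (minimum SAD, (dy, dx)) over every position where the block fits."""
    rows, cols = len(block), len(block[0])
    best = None
    for dy in range(len(window) - rows + 1):          # shift down one line at a time
        for dx in range(len(window[0]) - cols + 1):   # shift right one pixel at a time
            s = sad(block, window, dy, dx)
            if best is None or s < best[0]:           # keep the first minimum found
                best = (s, (dy, dx))
    return best


# Usage: a 6x6 window and a 2x2 block copied from position (2, 3).
window = [[r * 10 + c for c in range(6)] for r in range(6)]
block = [row[3:5] for row in window[2:4]]
print(full_search(block, window))
```

In this usage the block is an exact copy of part of the window, so the search is expected to report a zero SAD at the position the block was copied from.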
In particular, the memory controller 304 retrieves a macroblock pair of a current frame. The memory controller 304 loads field-0 of the macroblock pair into the register 310 and field-1 of the macroblock pair into the register 312. The memory controller 304 retrieves pixels of the reference window from the memory 302 and loads the pixels for field-0 of the reference window in the field-0 FIFO logic 306, and the pixels for field-1 of the reference window in the field-1 FIFO logic 308. Each of the field-0 FIFO logic 306 and the field-1 FIFO logic 308 is initialized such that the current macroblock pair is placed in the top left corner of the reference window. The memory controller 304 pushes new pixel data into the FIFO logic 306 and the FIFO logic 308 to effectively shift the current macroblock pair within the reference window.
The processing logic 314 is coupled to the field-0 register 310, the field-1 register 312, the field-0 FIFO logic 306, the field-1 FIFO logic 308, and the cost function module 320. The processing logic 314 is configured to compute SADs and motion vector data for the current macroblock pair. In particular, the processing logic 314 computes pixel differences separately between field-0 of the current macroblock pair and field-0 of the reference window (“field-0 even”), field-1 of the current macroblock pair and field-0 of the reference window (“field-1 odd”), field-0 of the current macroblock pair and field-1 of the reference window (“field-0 odd”), and field-1 of the current macroblock pair and field-1 of the reference window (“field-1 even”). The terms “even” and “odd” refer to the parity. Even parity denotes field-0 and/or field-1 lines of the current macroblock compared with field-0 and/or field-1 lines of the reference window, respectively. Odd parity denotes field-0 and/or field-1 lines of the current macroblock compared with field-1 and/or field-0 lines of the reference window, respectively.
From the pixel differences, the processing logic 314 computes SADs for each of field-0 even, field-0 odd, field-1 even, and field-1 odd comparisons (“field SADs”). The processing logic 314 uses the field SADs to compute SADs for frame top even, frame top odd, frame bottom even, and frame bottom odd comparisons (“frame SADs”). The field SADs are costed and minimized to produce motion vector data for field-0 and field-1 of the current macroblock pair. The frame SADs are costed and minimized to produce motion vector data for the top frame and the bottom frame of the current macroblock pair.
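For illustration, the relationship between the four field SADs and the four frame SADs at a single search position might be modeled as below. This is a sketch under stated assumptions: the names `sad_rows` and `field_and_frame_sads` are hypothetical, the reference blocks are assumed to be already cropped to the current search position, and the frame top and frame bottom are assumed to correspond to the upper and lower halves of each field's lines.

```python
def sad_rows(cur, ref, rows):
    """SAD between same-sized pixel blocks, restricted to the given row indices."""
    return sum(abs(ref[r][c] - cur[r][c]) for r in rows for c in range(len(cur[0])))


def field_and_frame_sads(cur_f0, cur_f1, ref_f0, ref_f1):
    """Return the four field SADs and four frame SADs for one search position."""
    n = len(cur_f0)
    top, bottom = range(n // 2), range(n // 2, n)
    pairs = {
        "f0-f0": (cur_f0, ref_f0),  # even parity
        "f0-f1": (cur_f0, ref_f1),  # odd parity
        "f1-f1": (cur_f1, ref_f1),  # even parity
        "f1-f0": (cur_f1, ref_f0),  # odd parity
    }
    # Half-field SADs are computed once and shared by the field and frame totals.
    half = {k: (sad_rows(c, r, top), sad_rows(c, r, bottom)) for k, (c, r) in pairs.items()}
    return {
        "field-0 even": sum(half["f0-f0"]),
        "field-0 odd": sum(half["f0-f1"]),
        "field-1 even": sum(half["f1-f1"]),
        "field-1 odd": sum(half["f1-f0"]),
        "frame top even": half["f0-f0"][0] + half["f1-f1"][0],
        "frame bottom even": half["f0-f0"][1] + half["f1-f1"][1],
        "frame top odd": half["f0-f1"][0] + half["f1-f0"][0],
        "frame bottom odd": half["f0-f1"][1] + half["f1-f0"][1],
    }


# Usage with tiny two-line, one-pixel-wide fields.
result = field_and_frame_sads([[1], [2]], [[3], [4]], [[1], [5]], [[0], [9]])
print(result)
```

The point of the sketch is that each frame SAD reuses pixel differences already needed for the field SADs, so no additional subtractions are required for the frame comparisons.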
In H.264, a macroblock can be partitioned into smaller block sizes. For example, a macroblock can be divided into sixteen 4×4 partitions, eight 4×8 partitions, eight 8×4 partitions, four 8×8 partitions, two 8×16 partitions, two 16×8 partitions, and one 16×16 partition for a total of 41 partitions per macroblock. Motion estimation in H.264 allows for referencing these partitions when computing motion vectors. In one embodiment, the processing logic 314 is configured to compute SADs for each of the partitions in the current macroblock pair. Alternatively, the processing logic 314 may be configured to process a subset of the partitions, which reduces the clock speed and data bandwidth requirements. For example, the processing logic 314 may be configured to process only the 8×8, 8×16, 16×8, and 16×16 partitions for a total of nine partitions per macroblock. The processing logic 314 generates six motion vectors and associated costed SADs for each partition (i.e., motion vectors and costed SADs for field-0 even, field-0 odd, field-1 even, field-1 odd, frame top, and frame bottom for each partition). The output of the processing logic 314 is stored in the storage FIFO 322. The processing is repeated for additional macroblock pairs in the current frame and for additional frames.
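The partition totals stated above can be checked with a short calculation. The constant and function names below are illustrative only, and the final figures simply multiply the partition counts by the six results per partition mentioned above.

```python
MB_SIZE = 16  # macroblock is 16x16 in H.264 terms

# Partition shapes as (vertical, horizontal) pixel dimensions.
ALL_SHAPES = [(4, 4), (4, 8), (8, 4), (8, 8), (8, 16), (16, 8), (16, 16)]
REDUCED_SHAPES = [(8, 8), (8, 16), (16, 8), (16, 16)]
RESULTS_PER_PARTITION = 6  # field-0 even/odd, field-1 even/odd, frame top, frame bottom


def partition_count(shapes):
    """Number of partitions obtained by tiling the macroblock with each shape."""
    return sum((MB_SIZE // v) * (MB_SIZE // h) for v, h in shapes)


print(partition_count(ALL_SHAPES), partition_count(REDUCED_SHAPES))
print(partition_count(ALL_SHAPES) * RESULTS_PER_PARTITION,
      partition_count(REDUCED_SHAPES) * RESULTS_PER_PARTITION)
```

Restricting processing to the reduced set lowers the number of motion vector and costed SAD results per macroblock from 246 to 54, which is consistent with the reduced clock speed and bandwidth requirements noted above.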
In one embodiment, data is reloaded into the field-0 FIFO logic 306 and the field-1 FIFO logic 308 for each macroblock pair to allow a new center for each reference window. Alternatively, the reference window data is not reloaded. Rather, additional pixels for the next macroblock pair reference window are shifted into the field-0 and field-1 FIFO logic 306 and 308, keeping the center of the search window the same relative to each macroblock pair. While this increases design efficiency, the search area is limited.
Each SAD computed by the processing logic 314 is “costed” by adding a cost computed by the cost function 320. The cost function 320 implements the following:
where MVx and MVy are x and y components, respectively, of the motion vector for the SAD, PMVx and PMVy are x and y components, respectively, of the median of motion vectors of neighboring macroblock pairs, selen is the signed exponential Golomb length, and λ is a constant for the entire current frame. In one embodiment, PMV may be computed from any combination of the neighbor motion vectors. The cost function 320 computes a cost for frame top, frame bottom, field 0, and field 1. In addition, the constant λ may be dynamically selected based on the partition associated with the SAD that is being costed (e.g., there may be different λ constants for 4×4 SADs, 4×8 SADs, 8×8 SADs, etc.). In one embodiment, λ may be different for each macroblock pair based on several factors, such as macroblock relative spatial activity and quantization level. The neighbor module 317 is configured to select previous motion vector(s) (if any) from the storage 316, and the PMV calculation module 318 is configured to compute the median of the retrieved motion vector(s) (if any) to compute the PMV.
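The expression implemented by the cost function 320 is not reproduced in this text. One form that is consistent with the symbol definitions above is cost = λ · (selen(MVx − PMVx) + selen(MVy − PMVy)); this form is an assumption and not a statement of the actual expression. Under that assumption, and taking selen to be the bit length of the H.264 signed Exp-Golomb code se(v), the costing might be sketched as follows. The function names `selen` and `mv_cost` are hypothetical.

```python
def selen(v):
    """Bit length of the signed Exp-Golomb code for integer v (H.264 se(v) mapping)."""
    # se(v) maps v > 0 to codeNum 2v - 1 and v <= 0 to codeNum -2v.
    code_num = 2 * v - 1 if v > 0 else -2 * v
    # An Exp-Golomb codeword for codeNum k has 2 * floor(log2(k + 1)) + 1 bits.
    return 2 * ((code_num + 1).bit_length() - 1) + 1


def mv_cost(mv, pmv, lam):
    """Assumed cost form: lambda times the coded length of the MV difference."""
    return lam * (selen(mv[0] - pmv[0]) + selen(mv[1] - pmv[1]))


def costed_sad(sad_value, mv, pmv, lam):
    """A SAD is 'costed' by adding the motion vector cost to it."""
    return sad_value + mv_cost(mv, pmv, lam)


print(costed_sad(100, (3, -1), (1, -1), 4))
```

With this form, candidate motion vectors that lie far from the predicted motion vector receive a larger cost, so they are selected only when their SAD is correspondingly smaller.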
In particular, previous motion vectors are stored in the previous MV storage 316. Given a current macroblock pair, the neighbor module 317 determines which, if any, previous motion vectors should be included in the median calculation for the frame top, frame bottom, field-0, and field-1 PMVs. Assume the selectable neighbors for a current macroblock pair are designated north, northeast, northwest, and west. The north neighbor is above, the northeast neighbor is above and to the right, the northwest neighbor is above and to the left, and the west neighbor is to the left of the current macroblock pair. If the current macroblock pair is from the top left corner of the frame, then it is the first macroblock pair processed and thus there are no previous motion vectors in the storage 316. The PMVs are zero.
If the current macroblock pair is from the top edge of the frame (other than the top left corner), then the neighbor module 317 retrieves previous motion vector data associated with the west neighbor. The PMVs are the previous motion vectors for frame top, frame bottom, field-0, and field-1 for the west neighbor. If the current macroblock pair is from the left edge of the frame (other than the top left corner), then the neighbor module 317 retrieves previous motion vector data associated with the north neighbor and the northeast neighbor. The frame top PMV is the median of the frame top motion vectors of the north and northeast neighbors, the frame bottom PMV is the median of the frame bottom motion vectors of the north and northeast neighbors, the field-0 PMV is the median of the field-0 motion vectors of the north and northeast neighbors, and the field-1 PMV is the median of the field-1 motion vectors of the north and northeast neighbors.
If the current macroblock pair is from the right edge of the frame (other than the top right corner), then the neighbor module 317 retrieves previous motion vector data associated with the west, north, and northwest neighbors. Each type of PMV is the median of the like type of previous motion vectors of the west, north, and northwest neighbors. For every other macroblock pair in the frame, the neighbor module 317 retrieves previous motion vector data associated with the west, north, and northeast neighbors. Each type of PMV is the median of the like types of previous motion vectors of the west, north, and northeast neighbors. The previous motion vector storage 316, the neighbor module 317, the PMV calculation module 318, and the cost function 320 are generally referred to as costing logic. The cost function 320 is also configured to store at least a portion of the motion vectors produced by the processing logic 314 in the previous motion vector storage 316.
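The neighbor selection rules described above can be summarized in a short sketch. The function names, the column/row indexing of macroblock pairs, and the string labels are assumptions for the example. The median is taken per component, which is also an assumption; for an even number of neighbors the sketch uses the lower of the two middle values, since the text does not specify that case.

```python
from statistics import median_low


def select_neighbors(col, row, cols):
    """Return the neighbor labels used for the PMV of the pair at (col, row)."""
    if row == 0 and col == 0:
        return []                      # first pair processed: PMV is zero
    if row == 0:
        return ["W"]                   # top edge
    if col == 0:
        return ["N", "NE"]             # left edge
    if col == cols - 1:
        return ["W", "N", "NW"]        # right edge
    return ["W", "N", "NE"]            # every other pair


def compute_pmv(neighbor_mvs):
    """Component-wise median of the neighbor motion vectors, or zero if none."""
    if not neighbor_mvs:
        return (0, 0)
    xs, ys = zip(*neighbor_mvs)
    return (median_low(xs), median_low(ys))


print(select_neighbors(4, 2, 10), compute_pmv([(1, 5), (3, -2), (2, 9)]))
```

The same selection would be applied separately to each type of motion vector (frame top, frame bottom, field-0, and field-1), using only previous motion vectors of the like type.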
The computation block 402 includes common sum modules 414 through 420, SAD modules 422 through 436, and compare modules 438 through 444. Aspects of operation for the computation block 402 may be understood with respect to
The basic building block for computing a SAD is a SAD of two pixels, which is defined as:
|REF_{m,n} − CMB_{m,n}| + |REF_{m,n+1} − CMB_{m,n+1}|,
where REF denotes the reference window, CMB denotes the current macroblock (a 16×8 HHR pixel region), and m and n denote pixel locations in the coordinate space 500. Summing two 2-pixel SADs yields a SAD for a 2×4 region (non-HHR). A SAD for a 4×4 partition (e.g., partition 0) can be computed by summing two 2×4 region SADs. Likewise, a SAD for an 8×8 partition (e.g., partition 0-1-2-3) can be computed by summing eight 2×4 region SADs and so on for other partition types.
In general, each of the 4×4, 4×8, 8×4, 8×8, 8×16, 16×8, and 16×16 partitions of an even/odd field can be computed by summing a combination of 2×4 region SADs for that even/odd field. In addition, each of the 4×4, 4×8, 8×4, 8×8, 8×16, 16×8, and 16×16 partitions of a top/bottom frame can be computed by summing a combination of 2×4 region SADs for both even and odd fields. For example, a SAD for a 4×4 partition in a top or bottom frame can be computed by summing a 2×4 region SAD of field-0 with a 2×4 region SAD of field-1. For this reason, if the processing logic 314 is configured to process all of the partition types, the 2×4 region SAD for a field can be considered to be a “common sum”. As discussed above, in some embodiments, not every partition type is processed. For example, in one embodiment, only the 8×8, 8×16, 16×8, and 16×16 partitions are processed. In such a case, a 4×8 region SAD is a common sum. For a field, SADs for the 8×8, 8×16, 16×8, and 16×16 partitions can be computed by summing combinations of the 4×8 region SADs. For a frame, SADs for the 8×8, 8×16, 16×8, and 16×16 partitions can be computed by summing combinations of the 4×8 region SADs for field-0 and field-1.
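As an illustration of the reduced-partition case, the sketch below builds the nine partition SADs of one field from 4×8 common sums, and the 8×8 frame partition SADs from the common sums of two fields. The arrangement of the common sums as a grid of four rows by two columns, the ordering of the outputs, and the function names are assumptions made for the example.

```python
def field_partition_sads(cs):
    """cs: 4x2 grid of 4x8 common sums (region rows top to bottom, columns left to right)."""
    # Each 8x8 partition stacks two vertically adjacent 4x8 regions.
    s8x8 = [cs[2 * i][j] + cs[2 * i + 1][j] for i in range(2) for j in range(2)]
    return {
        "8x8": s8x8,                                      # TL, TR, BL, BR
        "8x16": [s8x8[0] + s8x8[1], s8x8[2] + s8x8[3]],   # upper and lower halves
        "16x8": [s8x8[0] + s8x8[2], s8x8[1] + s8x8[3]],   # left and right halves
        "16x16": [sum(s8x8)],
    }


def frame_8x8_sads(cs_a, cs_b, first_row):
    """8x8 frame partitions: four lines from each field, starting at region row first_row."""
    return [cs_a[first_row + i][j] + cs_b[first_row + i][j]
            for i in range(2) for j in range(2)]


cs_f0 = [[1, 2], [3, 4], [5, 6], [7, 8]]
cs_f1 = [[10, 20], [30, 40], [50, 60], [70, 80]]
print(field_partition_sads(cs_f0))
print(frame_8x8_sads(cs_f0, cs_f1, 0), frame_8x8_sads(cs_f0, cs_f1, 2))
```

For even parity, `cs_a` and `cs_b` would hold the f0-f0 and f1-f1 common sums; for odd parity, the f0-f1 and f1-f0 common sums. Passing `first_row` as 0 or 2 selects the frame top or frame bottom under the layout assumed here.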
The common sum module 414 (“f0-f0 module”) computes common sums for current field-0 (Cf0) and reference field-0 (Rf0). The common sum module 416 (“f0-f1 module”) computes common sums for current field-0 and reference field-1 (Rf1). The common sum module 418 (“f1-f1 module”) computes common sums for current field-1 (Cf1) and reference field-1. The common sum module 420 (“f1-f0 module”) computes common sums for current field-1 and reference field-0.
The SAD module 422 (“frame top even”) receives common sums from the f0-f0 and f1-f1 modules 414 and 418 and computes SADs for partitions in the top frame with even parity. The SAD module 424 (“frame bottom even”) receives common sums from the f0-f0 and f1-f1 modules 414 and 418 and computes SADs for partitions in the bottom frame with even parity. The SAD module 426 (“field-0 even”) receives common sums from the f0-f0 module 414 and computes SADs for the partitions in field-0 with even parity. The SAD module 428 (“field-0 odd”) receives common sums from the f0-f1 module 416 and computes SADs for the partitions in field-0 with odd parity. The SAD module 430 (“field-1 even”) receives common sums from the f1-f1 module 418 and computes SADs for the partitions in field-1 with even parity. The SAD module 432 (“field-1 odd”) receives common sums from the f1-f0 module 420 and computes SADs for the partitions in field-1 with odd parity. The SAD module 434 (“frame top odd”) receives common sums from the f0-f1 and f1-f0 modules 416 and 420 and computes SADs for the partitions in the top frame with odd parity. The SAD module 436 (“frame bottom odd”) receives common sums from the f0-f1 and f1-f0 modules 416 and 420 and computes SADs for the partitions in the bottom frame with odd parity. The SAD modules 422 through 436 may compute SADs for all partitions or less than all partitions, as discussed above.
The compare module 438 (“frame top compare module”) receives SADs from the frame top even SAD module 422 and the frame top odd SAD module 434. The compare module 438 also receives cost data from the cost function 320. The compare module 438 performs a two stage compare for each partition type: First, for each partition type, the frame top compare module 438 adds the associated costs to the SADs and compares the costed frame top even SAD with the costed frame top odd SAD to select a minimum frame top SAD. For each partition type, the compare module 438 maintains a running minimum costed SAD for all shifts of the current macroblock pair in the reference window. In the second stage, for each partition type, the frame top compare module 438 compares the minimum frame top SAD obtained from the first stage with the running minimum. If a new running minimum is found and stored, the motion vector associated with that new minimum is also stored.
The compare module 440 (“field-0 compare module”) receives SADs from the field-0 even SAD module 426 and the field-0 odd SAD module 428. The compare module 440 also receives cost data from the cost function 320. The compare module 440 performs a two stage compare for each partition type, similar to the frame top compare module 438. First, for each partition type, the field-0 compare module 440 adds the associated costs to the SADs and compares the costed field-0 even SAD with the costed field-0 odd SAD to select a minimum field-0 SAD. Second, for each partition type, the field-0 compare module 440 compares the minimum field-0 SAD obtained from the first stage with the running minimum. If a new running minimum is found and stored, the motion vector associated with that new minimum is also stored. In another embodiment, the field-0 even and field-0 odd results have separate compare modules.
The compare module 442 (“field-1 compare module”) receives SADs from the field-1 even SAD module 430 and the field-1 odd SAD module 432. The compare module 442 also receives cost data from the cost function 320. Again, the compare module 442 performs a two stage compare for each partition type. First, for each partition type, the field-1 compare module 442 adds the associated costs to the SADs and compares the costed field-1 even SAD with the costed field-1 odd SAD to select a minimum field-1 SAD. Second, for each partition type, the field-1 compare module 442 compares the minimum field-1 SAD obtained from the first stage with the running minimum. If a new running minimum is found and stored, the motion vector associated with that new minimum is also stored. In another embodiment, the field-1 even and field-1 odd results have separate compare modules.
The compare module 444 (“frame bottom compare module”) receives SADs from the frame bottom even SAD module 424 and the frame bottom odd SAD module 436. The compare module 444 also receives cost data from the cost function 320. The compare module 444 performs a two stage compare for each partition type. First, for each partition type, the frame bottom compare module 444 adds the associated costs to the SADs and compares the costed frame bottom even SAD with the costed frame bottom odd SAD to select a minimum frame bottom SAD. Second, for each partition type, the frame bottom compare module 444 compares the minimum frame bottom SAD obtained from the first stage with the running minimum. If a new running minimum is found and stored, the motion vector associated with that new minimum is also stored.
The minimum compare module 406 (“final frame top”) receives, for each partition, a minimum SAD and associated motion vector from the frame top compare module 438 in each of the computation blocks 402 and 404. The final frame top compare module 406 compares the results from the two computation blocks 402 and 404 and selects the minimum as the final frame top SAD. The minimum compare module 408 (“final field-0”) receives, for each partition, a minimum SAD and associated motion vector from the field-0 compare module 440 in each of the computation blocks 402 and 404. The final field-0 compare module 408 compares the results from the two computation blocks 402 and 404 and selects the minimum as the final field-0 SAD. The minimum compare module 410 (“final field-1”) receives, for each partition, a minimum SAD and associated motion vector from the field-1 compare module 442 in each of the computation blocks 402 and 404. The final field-1 compare module 410 compares the results from the two computation blocks 402 and 404 and selects the minimum as the final field-1 SAD. The minimum compare module 412 (“final frame bottom”) receives, for each partition, a minimum SAD and associated motion vector from the frame bottom compare module 444 in each of the computation blocks 402 and 404. The final frame bottom compare module 412 compares the results from the two computation blocks 402 and 404 and selects the minimum as the final frame bottom SAD. In this manner, the processing logic 314 generates costed SADs and motion vectors for partitions in frame top, frame bottom, field-0, and field-1 of the current macroblock pair. The processing logic 314 repeats the operation described above for additional macroblock pairs in the current frame, and then for additional frames in the input video.
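The two stage compare and the final minimum selection described above might be modeled in software as follows, for a single partition. The class and function names are hypothetical, and the tie-breaking choices (even parity preferred on equal costed SADs, the earlier result kept on equal minima) are assumptions, since the text does not specify them.

```python
class CompareModule:
    """Running minimum of costed SADs for one partition of one comparison type."""

    def __init__(self):
        self.min_sad = None
        self.min_mv = None
        self.min_parity = None

    def update(self, even_sad, odd_sad, even_cost, odd_cost, mv):
        # Stage 1: add the costs and pick the smaller of the even and odd results.
        costed_even = even_sad + even_cost
        costed_odd = odd_sad + odd_cost
        if costed_even <= costed_odd:
            candidate, parity = costed_even, "even"
        else:
            candidate, parity = costed_odd, "odd"
        # Stage 2: compare against the running minimum and store any new minimum
        # together with its motion vector.
        if self.min_sad is None or candidate < self.min_sad:
            self.min_sad, self.min_mv, self.min_parity = candidate, mv, parity
        return self.result()

    def result(self):
        return (self.min_sad, self.min_mv, self.min_parity)


def final_compare(result_a, result_b):
    """Select the smaller of the results from two computation blocks."""
    return result_a if result_a[0] <= result_b[0] else result_b


block_a, block_b = CompareModule(), CompareModule()
block_a.update(10, 12, 3, 0, (0, 0))
block_a.update(9, 20, 5, 1, (0, 1))
block_a.update(4, 30, 2, 2, (1, 0))
block_b.update(7, 8, 1, 1, (2, 2))
print(final_compare(block_a.result(), block_b.result()))
```

One such running minimum would be kept per partition and per comparison type (frame top, frame bottom, field-0, and field-1) in each computation block.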
At step 912, partition SADs are generated for the current macroblock pair from combinations of the common SADs. As discussed above, SADs can be computed for all or a subset of partitions for frame top, frame bottom, even field, and odd field of the current macroblock pair for both even and odd parity with respect to the reference window. At step 914, the partition SADs are costed. At step 916, the costed partition SADs are minimized. Notably, like-type partition SADs are minimized for each of frame top, frame bottom, even field, and odd field as between even and odd parity. The results are then compared against running minimum partition SADs to determine if new minimums have been found.
At step 918, a determination is made whether the search has been completed. If not, the method 900 continues to step 919, where the reference window FIFO logic is shifted. The method 900 returns to step 909, where new pixel differences are computed. If the search is complete, the method 900 proceeds from step 918 to step 920. At step 920, costed SADs and associated motion vectors are output for all or a subset of partitions of top frame, bottom frame, even field, and odd field of the current macroblock pair. The method 900 may be repeated for each macroblock pair in the current frame, and for multiple frames.
The demultiplexer 702 includes a single input terminal and nine output terminals. The output terminals of the demultiplexer 702 are coupled to input terminals of the FIFOs 706, respectively. The output of the FIFO 706-9 is coupled to an input of the register 710-9. Each of the multiplexers 708 includes two input terminals and one output terminal. The FIFOs 706-1 through 706-8 are coupled to first input terminals of the multiplexers 708-1 through 708-8, respectively. Output terminals of the registers 710 are coupled to input terminals of the FIFOs 712. An output terminal of the FIFO 712-9 is coupled to the second input terminal of the multiplexer 708-8; an output terminal of the FIFO 712-8 is coupled to the second input terminal of the multiplexer 708-7; an output terminal of the FIFO 712-7 is coupled to the second input terminal of the multiplexer 708-6; and so on until the output terminal of the FIFO 712-2 is coupled to the second input terminal of the multiplexer 708-1. The input terminal and output terminals of the demultiplexer 702 are 64 bits (8 bytes) wide. The input terminals of the FIFOs 706 are 8 bytes wide. The output terminals of the FIFOs 706 are one byte wide. The FIFOs 706 are 32 bytes deep. The input and output terminals of the multiplexers 708, the registers 710, and the FIFOs 712 are one byte wide. The registers 710 are configured to store 8 bytes. The FIFOs 712 are 128 bytes deep. The demultiplexer 704, the FIFOs 714, the multiplexers 716, the registers 718, and the FIFOs 720 are configured identically to the demultiplexer 702, the FIFOs 706, the multiplexers 708, the registers 710, and the FIFOs 712.
As described above, the motion vector search is performed starting at the top left corner of the reference window and proceeds across 128 locations for each of the 64 field lines. The dual spiral cylinder 700 includes a 128 byte deep secondary stage FIFO (i.e., FIFOs 712 and FIFOs 720). Each of the FIFOs 712 and 720 represents one line of the reference window, 128 pixels across (each pixel is assumed to be one byte). The FIFOs 712 represent odd lines 1 through 17, and the FIFOs 720 represent even lines 0 through 16. That is, odd lines are stored in the spiral cylinder 701 and even lines are stored in the spiral cylinder 703. The registers 710 and 718 represent data accessible for SAD calculations. That is, the registers 710 store an 8×18 pixel array. The first stage FIFO (i.e., FIFOs 706 and 714) provides a buffer between the memory controller 304 and the registers 710 and 718. The input terminals of the demultiplexers 702 and 704 are configured to receive data from the memory controller 304. The multiplexers 708 and 716 allow for two modes of operation: parallel load and spiral load.
In the parallel load mode, data is gathered from the memory 302 in 32-byte bursts. Each burst represents a single line of 32 bytes (32 pixels). The first burst is stored into the FIFO 714-1, after which data is sent byte-wide serially through the register 718-1, where the data is stored. The next line is read in a similar fashion and so on for lines 0 through 17. Each even line read is stored into the spiral cylinder 703, while each odd line read is stored into the spiral cylinder 701. The dual spiral cylinder 700 stores data for one field. Another dual spiral cylinder stores data for the other field.
Once all 18 lines have been loaded for 32 pixels each, SAD calculations can begin. Since there are two spiral cylinders 701 and 703, SADs can be calculated for line 0, as well as line 1. While the first set of SADs is being calculated, another chunk of 18×32 bytes of data is collected to continue the process. Pixels are shifted into the register array (registers 710 and 718), after which data is shifted into the secondary stage FIFO (FIFOs 712 and 720). This process continues until the entire secondary stage FIFO is full. This mode of operation is effectively parallel loading of the secondary stage FIFO.
The next stage of data collection changes data loading to only the bottom of the spiral cylinder 701 (the FIFO 706-9, the register 710-9, and the FIFO 712-9) and the bottom of the spiral cylinder 703 (the FIFO 714-9, the register 718-9, and the FIFO 720-9). All of the multiplexers 708 and 716 switch from the parallel data mode to the spiral mode. That is, in the parallel mode, the inputs of the multiplexers 708 and 716 that are coupled to the FIFOs 706 and 714 are selected. In the spiral mode, the inputs of the multiplexers 708 and 716 that are coupled to the FIFOs 712 and 720 are selected. In the spiral mode, the multiplexers 708 and 716 take data from the bottom most FIFO and feed the one above for every pixel data gathered from the memory 302. Since there are two spiral cylinders 701 and 703, two lines of data need to be loaded for the given field. Data is loaded again in 32 byte chunks, first shifting in 8 pixels on the bottom spiral cylinder 703, then the top spiral cylinder 701. After these two lines are loaded, SAD calculations continue. For every pixel shifted, the spiral cylinders 701 and 703 move pixels up. The top of each spiral cylinder 701 and 703 drops the pixels that are not needed.
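The spiral mode can be pictured with a greatly simplified software model of one spiral cylinder. The model below is an assumption-laden sketch, not the circuit: it merges each line's register and secondary stage FIFO into a single fixed-length queue, ignores the first stage FIFOs and burst timing, and uses hypothetical names.

```python
from collections import deque


def spiral_shift(lines, pixel):
    """Push one pixel into the bottom line; each line's oldest pixel moves up one line.

    lines[0] is the top line and lines[-1] the bottom line; each is a deque of
    equal, fixed length. Returns the pixel dropped from the top line.
    """
    carry = pixel
    for line in reversed(lines):   # bottom line first, then each line above it
        line.append(carry)         # newest pixel enters this line
        carry = line.popleft()     # oldest pixel leaves toward the line above
    return carry                   # no longer needed by the search


# Usage: three lines of two pixels each; load one new line (7, 8) at the bottom.
lines = [deque([1, 2]), deque([3, 4]), deque([5, 6])]
dropped = [spiral_shift(lines, p) for p in (7, 8)]
print([list(line) for line in lines], dropped)
```

In this model, once a full line of new pixels has been shifted in at the bottom, every stored line has moved up by one position and the former top line has been dropped, which corresponds to the window advancing by one line within the cylinder.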
While the foregoing is directed to illustrative embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.