The present disclosure relates to motion estimation and, more particularly, to structures and techniques for computing matching criteria typically employed in motion estimation.
Video coding employing Motion Estimation (ME) and/or Motion Compensation (MC) is widely used in various video coding standards and/or specifications, such as MPEG [see Moving Picture Experts Group, ISO/IEC/SC29/WG11 standard committee]. Recent advances in integrated circuit technology, for example, have made it possible to implement block matching techniques in hardware, such as with silicon or semiconductor devices. An excellent discussion of ME may be found in Bhaskaran and Konstantinides [see V. Bhaskaran and K. Konstantinides, “Image and Video Compression Standards: Algorithms and Architectures”, Kluwer Academic Publishers, 1995].
Usually, a search begins with the motion vector MV=(0,0), that is, no motion. For this particular embodiment, a search window is the block of pixels from a previous frame around MV=(0,0). The block size and choice of search window size typically reflect an implementation trade-off; therefore, again, no particular size is necessarily preferred over another in this context. For example, the larger the search window, the higher the computational complexity and memory/data bandwidth capability desired, but, likewise, the better the chance of finding a good match.
The subject matter is particularly pointed out and distinctly claimed in the concluding portion of the specification. The claimed subject matter, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the claimed subject matter. However, it will be understood by those skilled in the art that the claimed subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the claimed subject matter.
As indicated previously, a full search technique is typically computationally intensive. Therefore, for high speed video encoding applications, it has proven desirable to implement a logarithmic window search rather than a full search. In one embodiment of a logarithmic search technique, instead of computing the sum of absolute differences (SAD) value at every position within the search window, an SAD is first computed at, for example, location (0, 0) and 8 other search points, such as, for example, at coordinates (0, d), (0, −d), (d, 0), (−d, 0), (d, d), (d, −d), (−d, d), (−d, −d), where, in this example, 2d is the dimension of the search window (e.g., for a 32×32 search window, d=16). The location that gives the most desirable SAD is the match location, and the search is then narrowed down to a d×d search window centered at this location. This is continued until the search window size is narrowed down to size 1×1. Hence, for such an embodiment, the overall number of search locations is
8*log2(d)+1,
where 2d is the size of the initial search window. For example, with a 32×32 (i.e., d=16) search window, the total number of search points is 8*4+1=33.
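As a rough illustration of the logarithmic search described above, the following Python sketch evaluates the current center and its 8 neighbors at a step that halves each round. It is a sketch under stated assumptions, not the claimed implementation: the names `num_search_points`, `log_search` and `cost` are hypothetical, the window size is assumed to be a power of two, the step is assumed to start at d and halve for log2(d) rounds (a choice that reproduces the 8*4+1=33 count for a 32×32 window), and `cost(dx, dy)` stands in for the SAD at displacement (dx, dy).

```python
def num_search_points(window_size):
    """Search points for a window_size x window_size window (window_size = 2d)."""
    d = window_size // 2
    rounds = d.bit_length() - 1  # log2(d) for a power-of-two d
    return 8 * rounds + 1


def log_search(cost, window_size):
    """Logarithmic search sketch.

    cost(dx, dy) is assumed to return the SAD at displacement (dx, dy).
    Returns (best displacement, best cost, number of cost evaluations).
    """
    d = window_size // 2
    best = (0, 0)
    best_cost = cost(0, 0)  # the search starts at MV = (0, 0)
    evaluated = 1
    step = d
    for _ in range(d.bit_length() - 1):  # assumed: log2(d) rounds
        cx, cy = best  # center for this round is the best match so far
        for dx in (-step, 0, step):
            for dy in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue  # the center was already evaluated
                c = cost(cx + dx, cy + dy)
                evaluated += 1
                if c < best_cost:
                    best, best_cost = (cx + dx, cy + dy), c
        step //= 2  # narrow the search around the new best match
    return best, best_cost, evaluated
```

For example, `log_search(lambda dx, dy: abs(dx - 6) + abs(dy + 10), 32)` evaluates 33 positions with this step schedule. A real encoder would replace the lambda with an SAD over pixel data and would clamp candidates to the frame boundaries.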
Referring to
A representative or sample raw performance and/or bandwidth capability to implement a logarithmic search (LS) method may be calculated. Computing a motion vector, where, for example, the Sum of Absolute Differences (SAD) is employed, involves a comparison between a reference block and a corresponding block in a previous frame at 33 selected positions in a 32×32 search window and at 41 selected positions in a 64×64 window, for example. Assume that the size of a search window is S×S, the resolution of the video is M×N, and the frame rate is F frames per second. For a 16×16 macroblock, for example, the number of SAD computations per second involved in LS motion estimation is F*33*(M*N)/(16*16) for a 32×32 search window and F*41*(M*N)/(16*16) for a 64×64 search window.
As is well-known, the CCIR standard for video employs resolution of 720×480 at 30 frames per second. In MPEG2 and MPEG4 video, the size of a search window for block matching is 32×32 and the corresponding search window selection mode is indicated by a variable, Fcode=1. For Fcode=2, 3, . . . , the search window sizes are 64×64, 128×128, . . . , respectively. Although the claimed subject matter is not limited to these block sizes, resolutions or particular search windows, nonetheless, employing them to perform calculations for a potential implementation is instructive. Hence, the computational burden involved for 720×480 resolution video at 30 frames per second is approximately 1.4M SAD computations per second for a 32×32 search window (Fcode=1).
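The SAD rate for this example can be worked through with a short Python sketch. It assumes one SAD per search position per 16×16 macroblock, which is the reading consistent with the roughly 1.4M SADs per second used in the later calculations; the function name `sads_per_second` and its parameters are hypothetical.

```python
def sads_per_second(width, height, fps, search_points, block=16):
    """SAD computations per second, assuming one SAD per search point per macroblock."""
    macroblocks_per_frame = (width * height) // (block * block)
    return fps * search_points * macroblocks_per_frame


# 720x480 at 30 frames per second, as in the example above.
RATE_32 = sads_per_second(720, 480, 30, search_points=33)  # 32x32 window
RATE_64 = sads_per_second(720, 480, 30, search_points=41)  # 64x64 window
```

With these inputs the 32×32 case comes to 1,336,500 SADs per second, on the order of the 1.4M figure, and the 64×64 case to 1,660,500.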
An embodiment of a method for motion estimation employing an architecture 100 that includes a cross-bar coupled image signal processor (ISP) is described. Such an embodiment provides advantages in terms of computational performance and/or bandwidth utilization, as described in more detail hereinafter. Here, an ISP may comprise several basic processing elements (PE) coupled together via a register file switch, as shown in
Although the claimed subject matter is not limited in scope in this respect, in this particular embodiment, a register file 200 comprises a bank of 16 registers. In this embodiment, a register may be written to by any PE and may be read by any PE. Thus, a register may be used as a link to send data from one PE to another. A register has 8 write ports, so that, for this particular embodiment, any PE may write to it. Likewise, here a register has 8 read ports that couple to all PEs. The register file in this embodiment also includes a stalling mechanism that stalls a PE attempting to write when (a) there is a higher priority PE that is also attempting to write in the same cycle and/or (b) the register has unread data. It is of course appreciated that alternate embodiments may omit a register file or may employ a register file with additional and/or different capabilities.
Using general-purpose registers (GPRs) in the register file switch, a PE may communicate with another PE in the ISP in this particular embodiment. Here, there are up to 16 GPRs in a register file switch allowing concurrent communication between various PEs at substantially the same time, if desired.
In this particular embodiment, a GPR may be written and read by any PE. Likewise, in this particular embodiment, a PE may write to and read from any GPR. For example, PE0 may use GR0 to send data to PE1. At substantially the same time, PE2 may use GR2 to send data to PE4, etc. Thus, although the claimed subject matter is not limited in scope in this respect, there may be up to 16 concurrent transfers occurring on a given cycle.
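The register-based link and its stalling rules can be modeled in a few lines of Python. This is only a behavioral sketch under assumptions: the class name `SwitchRegister` and its methods are hypothetical, a lower PE index is assumed to mean higher priority, and a read is assumed to clear the "unread data" condition.

```python
class SwitchRegister:
    """Behavioral model of one GPR used as a link between PEs."""

    def __init__(self):
        self.value = None
        self.unread = False  # True while written data has not been read

    def write_cycle(self, requests):
        """Process one cycle of write attempts.

        requests maps PE index -> value for every PE writing this cycle.
        Returns the set of PE indices that are stalled.
        """
        if not requests:
            return set()
        if self.unread:
            # Rule (b): the register still holds unread data, so all writers stall.
            return set(requests)
        # Rule (a): only the highest-priority writer succeeds.
        winner = min(requests)  # assumption: lowest index = highest priority
        self.value = requests[winner]
        self.unread = True
        return set(requests) - {winner}

    def read(self):
        """Read the register; returns None if there is nothing new to read."""
        if not self.unread:
            return None
        self.unread = False
        return self.value
```

For instance, if PE0 and PE2 both write in the same cycle, PE2 stalls; PE2 keeps stalling until a reader consumes PE0's value, after which its write goes through.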
In this embodiment, therefore, the register file switch provides a mechanism for sharing data between PEs. Although the claimed subject matter is not limited in scope in this respect, in this embodiment, a PE has a dual SAD computation capability by performing SAD computations in parallel. A SAD may be implemented in this embodiment using a special instruction, directed to the processing elements (PEs). One aspect of implementing this particular embodiment is mapping tasks to this architecture so that communication between PEs occurs efficiently with relatively low communication overhead.
In this particular embodiment, as illustrated in
In this embodiment, the PEs within an ISP may be customized to perform specific functions. For example, an input PE (IPE) may be employed to move data into registers on the ISP from a source external to the ISP. Similarly, one or more memory PEs (MPEs) may provide local storage to the PEs. An output PE (OPE) may be employed to move processed data out of an ISP. For example, an IPE and/or OPE may interface to SDR/DDR or other memory technology, for example, to move data into and out of an image processor. A general-purpose PE (GPE) may provide general-purpose processing functionality. In this embodiment, then, although the claimed subject matter is not limited in scope in this respect, for example, an ISP may comprise: an IPE, an OPE, 1 or more MPEs and 1 or more GPEs. The configuration of the ISP may depend, at least in part, on the particular application, including the mapping approach used to map the computation process to the ISP, as described in more detail hereinafter.
Since the computational power and bandwidth desired may in some instances be relatively high, using a single high-performance processor or a DSP to perform motion estimation may not provide a practical solution. In this embodiment, instead, the LS process is, in essence, “mapped” to multiple ISPs to take advantage of the ISP engines described above. In this particular embodiment, although the claimed subject matter is not limited in scope in this regard, the data and computation flows within the ISP are distributed amongst the PEs, as shown in
(16×16)+(32×32)+(16×16) Bytes=˜1.5KB (MPE0)
(32*32)+(32*32)=2KB (MPE1)
In order to mitigate potential bandwidth constraints, 3 PEs (e.g., GPE0, GPE1, GPE2 in
In order to illustrate the concept, consider the case where GPE0, GPE1, and GPE2 run in parallel to compute an SAD for consecutive positions in the search window. The MPEs may store the reference macroblock and the search region, such as in MPE0 and MPE1, and feed the 3 PEs with data. In this embodiment, the reference macroblock may be fed to PEs using a set of 3 GPRs. The data from a search window in a previous frame may be fed to the PEs using a GPR.
Since the PEs are computing the SADs for discrete positions, as alluded to above, pixel data may be shared in this particular embodiment, although the claimed subject matter is not limited in scope in this respect. For example, one GPR may be employed for sending reference block information to 3 PEs, although the claimed subject matter is not limited in scope in this respect. For a row of SAD computation, for example, PE0, PE1 and PE2 may share 16 pixels of the reference region. Hence, to feed data to 3 PEs working in parallel, 16*4 pixel data per row for 3 SAD computations may be employed for this embodiment, although, again, the claimed subject matter is not limited in scope to this example embodiment.
For the following discussion, reference is made to
Another assumption for convenience and/or simplicity, as previously indicated, although the claimed subject matter is not limited in scope in this respect, is that a reference block is stored in one block of memory and a search window is stored in another. Thus, two accesses (one for reference block data and another for search window data) are employed per cycle. In
A parallel process to compute 3 SADs with such an architecture may be expressed in terms of pseudo-code as follows, although the subject matter is not limited in scope in this respect (let us assume that x0, x1, . . . , x15 are the pixels from a row of the reference block and y0, y1, y2, . . . are the corresponding data from the search window to be matched):
Begin
/* PE0-PE5 in the following are GPE0-GPE5 in FIG. 4. */
IPE:
MPEs:
row = 0 to 15
(sequentially 16 rows are computed)
T1: Par
Par
;
T2: PE4
Par:
← Accumulate partial SADi (row);
T3: PE5:
End.
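The row-by-row accumulation of three SADs may be sketched in Python as follows. This models only the data flow, not the parallel timing of the PEs: each reference row is fetched once and shared by three candidate positions, and each position accumulates its partial SAD over the 16 rows. The names `row_sad` and `three_sads`, and the representation of blocks as lists of rows, are assumptions made for illustration.

```python
def row_sad(x_row, y_row):
    """Partial SAD for one row: sum of |x_i - y_i|."""
    return sum(abs(x - y) for x, y in zip(x_row, y_row))


def three_sads(ref, search, positions):
    """Accumulate one SAD per candidate position, row by row.

    ref:       reference macroblock as a list of rows (e.g. 16 rows of 16 pixels)
    search:    search region as a list of rows
    positions: (top, left) offsets into the search region, one per PE
    """
    totals = [0] * len(positions)
    width = len(ref[0])
    for row, x_row in enumerate(ref):  # rows are processed sequentially
        # x_row is fetched once and shared by every position (PE).
        for i, (top, left) in enumerate(positions):
            y_row = search[top + row][left:left + width]
            totals[i] += row_sad(x_row, y_row)  # accumulate partial SAD_i(row)
    return totals
```

In hardware the inner loop over positions would run concurrently on the PEs; here it runs sequentially, which yields the same totals.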
For this particular embodiment, the bandwidth capability desired may be recomputed as follows:
Bandwidth to compute 3 SADs=2*(4*8)*16 Bytes=1K Bytes
Bandwidth to compute 1.4M SADs=1.4M*1K/3=460 MB/s
That represents an overall saving of >35% compared to 716 MB/s bandwidth, as computed earlier. The clock cycles to compute a 16×16 SAD may also be determined for this embodiment, e.g., having 3 PEs working in parallel. As discussed, in this example, a PE may compute 2 SADs in parallel, resulting in a potential doubling of the compute performance of the PE.
Hence,
Clocks per PE per row of SAD computation=(16/2) clocks
(two SAD computations in parallel, from
Clocks per PE per 16 rows of SAD computation=(8)*16 clocks
(for a 16×16 macroblock)
Clocks per ISP per 16×16 SAD computation=(8*16)/3 clocks≈43 clocks
(3 PEs operating in parallel)
Clocks per ISP for 1.4M SAD computations=43*1.4M clocks≈57M clocks
Assuming that ISPs run at 266 MHz, 1 ISP therefore provides the capability to implement LS processing using a 32×32 search window (for a 64×64 search window, 1 ISP may still be employed).
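The clock and bandwidth budget above can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, and the SAD rate is taken as the unrounded value 30*33*(720*480)/(16*16) rather than the rounded 1.4M, so the results differ slightly from the rounded figures quoted in the text.

```python
ISP_CLOCK_HZ = 266e6
SAD_RATE = 30 * 33 * (720 * 480) // (16 * 16)  # 1,336,500 SADs per second


def clocks_per_block(block=16, pixels_per_clock=2):
    """Clocks for one PE to process a block: (16/2) clocks per row times 16 rows."""
    return (block // pixels_per_clock) * block


def isp_clocks_per_second(sad_rate, pes=3):
    """Clocks per second with the work spread over `pes` PEs in parallel."""
    return sad_rate * clocks_per_block() / pes


def shared_bandwidth(sad_rate):
    """Bytes per second when 3 SADs share 2*(4*8)*16 bytes of fetched data."""
    bytes_per_three_sads = 2 * (4 * 8) * 16  # 1K bytes
    return sad_rate * bytes_per_three_sads / 3


clocks = isp_clocks_per_second(SAD_RATE)   # about 57M clocks per second
bandwidth = shared_bandwidth(SAD_RATE)     # about 456 MB/s
utilization = clocks / ISP_CLOCK_HZ        # fraction of one 266 MHz ISP
```

Under these assumptions a single ISP is a little over one fifth utilized for the 32×32 case, which is in line with the conclusion that 1 ISP suffices.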
Likewise, bandwidth capability may be determined as follows. An MPE may supply 2 words (16 bits each) per cycle (i.e., 4 bytes per cycle), providing a total bandwidth out of an MPE of 4*266 MB/s or ˜1.064 GB/s. By employing in this embodiment 2 MPEs per ISP, total bandwidth capability exceeds 2 GB/s, which is higher than the desired bandwidth of 460 MB/s. Thus, as demonstrated, for this embodiment, 1 ISP may suitably handle the data bandwidth for a 32×32 search window for block matching.
The synchronous DRAM (SDR) and/or double-data-rate DRAM (DDR) bandwidth to download the reference block and search region information to one or more MPEs is now considered. The bandwidth (from
Bandwidth to download data for 1 macroblock=(16*16)+(32*32)+(16*16) Bytes
Bandwidth to download 1367 blocks=1367*1536 Bytes
Bandwidth desired per second=30*1367*1536 B/s=63 MB/s
Assuming one DDR channel (16 bits wide and running at 133 MHz) provides a total bandwidth of 2*133*2 MB/s, or 532 MB/s, this is more than sufficient. The top level bandwidth estimation at different communication points for this embodiment is illustrated in
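The supply and demand figures in this bandwidth discussion can be tabulated with a small Python sketch. The function names and default parameters are hypothetical and simply restate the numbers used above (2 MPEs at 4 bytes per cycle and 266 MHz, a 16-bit DDR channel at 133 MHz with two transfers per clock, and 1367 macroblocks per frame at 30 frames per second).

```python
def mpe_supply_mb_s(mpes=2, bytes_per_cycle=4, clock_mhz=266):
    """Aggregate bandwidth out of the MPEs, in MB/s."""
    return mpes * bytes_per_cycle * clock_mhz


def ddr_supply_mb_s(bus_bytes=2, clock_mhz=133, transfers_per_clock=2):
    """Peak bandwidth of one DDR channel, in MB/s."""
    return bus_bytes * clock_mhz * transfers_per_clock


def download_demand_bytes_s(fps=30, macroblocks=1367,
                            bytes_per_macroblock=16 * 16 + 32 * 32 + 16 * 16):
    """Bytes per second downloaded to the MPEs."""
    return fps * macroblocks * bytes_per_macroblock


mpe_supply = mpe_supply_mb_s()            # 2128 MB/s, above the 460 MB/s desired
ddr_supply = ddr_supply_mb_s()            # 532 MB/s
ddr_demand = download_demand_bytes_s()    # 62,991,360 B/s, about 63 MB/s
```

On these numbers the DDR channel has roughly eight times the bandwidth needed for the download traffic.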
It will, of course, be understood that, although particular embodiments have just been described, the claimed subject matter is not limited in scope to a particular embodiment or implementation. For example, one embodiment may be in hardware, such as implemented to operate on an integrated circuit chip, for example, whereas another embodiment may be in software. Likewise, an embodiment may be in firmware, or any combination of hardware, software, or firmware, for example. Likewise, although the claimed subject matter is not limited in scope in this respect, one embodiment may comprise an article, such as a storage medium. Such a storage medium, such as, for example, a CD-ROM, or a disk, may have stored thereon instructions, which when executed by a system, such as a computer system or platform, or an imaging or video system, for example, may result in an embodiment of a method in accordance with the claimed subject matter being executed, such as an embodiment of a method of performing motion estimation, for example, as previously described. For example, an image or video processing platform or another processing system may include a video or image processing unit, a video or image input/output device and/or memory.
While certain features of the claimed subject matter have been illustrated and described herein, many modifications, substitutions, changes and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the claimed subject matter.
This patent application is a U.S. Continuation-In-Part Patent Application of “Motion Estimation,” by Acharya et al., filed on Sep. 4, 2002, U.S. patent application Ser. No. 10/235,121 assigned to the assignee of the current invention and herein incorporated by reference.
Number | Date | Country |
---|---|---|
20040047422 A1 | Mar 2004 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 10235121 | Sep 2002 | US |
Child | 10242148 | | US |