The present invention generally relates to the field of processor architectures and, more specifically, to a processing unit that comprises a template Front End Processor (FEP) with an Extendable Data Path portion for customizing the FEP in accordance with a plurality of specific functional processing needs.
Media processing and communication devices comprise hardware and software systems that utilize interdependent processes to enable the processing and transmission of media. Media processing comprises a plurality of processing function needs such as entropy encoding, discrete cosine transform (DCT), inverse discrete cosine transform (IDCT), motion compensation, de-blocking filter, de-interlacing, and de-noising. Typically, different functional processing units may be dedicated to each of the aforementioned different functional needs and the structure of each functional unit is specific to the coding approach or standard being used in a given processing device. However, it is desirable to not have to design the structure of each of the functional processing units from scratch and have the structure of the functional processing unit designed in such a manner, that it can be programmed for use with any coding standard or approach.
For example, integer-based transform matrices are used for transform coding of digital signals, such as for coding image/video signals. Discrete Cosine Transforms (DCTs) are widely used in block-based transform coding of image/video signals, and have been adopted in many Joint Photographic Experts Group (JPEG), Moving Picture Experts Group (MPEG), and network protocol standards, such as MPEG-1, MPEG-2, H.261, H.263 and H.264. Ideally, a DCT is a normalized orthogonal transform that uses real-value numbers. This ideal DCT is referred to as a real DCT. Conventional DCT implementations use floating-point arithmetic that requires high computational resources. To reduce the computational burden, DCT algorithms have been developed that use fixed-point or large integer arithmetic to approximate the floating-point DCT.
In conventional forward DCT, image data is subdivided into small 2-dimensional segments, such as symmetrical 8×8 pixel blocks, and each of the 8×8 pixel blocks is processed through a 2-dimensional DCT. Implementing this process in hardware is resource intensive and becomes exponentially more demanding as the size of the pixel blocks to be transformed is increased. Also, prior art image processing typically uses separate hardware structures for DCT and IDCT. Additionally, prior art approaches to DCT and IDCT processing require different hardware to support codecs with differing DCT/IDCT processing methodologies. Therefore, different hardware would be required for DCT 4×4, IDCT 4×4, DCT 8×8, and IDCT 8×8, among other configurations.
Similarly, prior art video processing systems require separate hardware structures to do quantization and de-quantization for different CODECs. Prior art motion compensation processing units also use multiple processing units (different DSPs) for handling various codecs such as H.264, MPEG 2 and 4, VC-1, AVS. However, it is desirable to have a motion compensation processing unit that is highly configurable, programmable, scalable and uses a single data path to handle a plurality of codecs at clock frequencies below 500 MHz. It is also desirable to have efficient processing using fewer clock cycles without excessive cost.
Additionally, de-blocking filters (DBFs) are needed because they remove discontinuities between the processed blocks in a frame. Frames are processed on a block-by-block level. When a frame is reconstructed by placing all the blocks together, discontinuities may exist between blocks that need to be smoothed. The filtering needs to be responsive to the boundary difference. Too much filtering creates artifacts. Too little fails to remove the choppiness/blockiness of the image. Typically, deblocking is done sequentially, taking each edge of each block and working through all block edges. The blocks can be of any size: 16×16, 4×4 (if H.264), or 8×8 (if AVS or VC-1).
To perform DBF properly, the right data needs to be available, at the right time, to filter. Persons of ordinary skill in the art would appreciate that to get high orders of processing speeds (for example, 30 frames per second) the DBF needs to be tailored to a specific codec, like H.264. Programmable DBFs can use a generic RISC processor, but it will not be optimized for any one codec and, therefore, high processing speeds (i.e., 30 frames per second) will not be achieved. Given that each codec has a different approach to when, and in what sequence, DBF should occur, it becomes challenging to tailor a single deblocking DSP to perform DBF for multiple codecs.
Accordingly, there is a need for a template processing structure that can be tailored to each processing unit needed for the various functional processing needs. A need further exists for combining the DCT and IDCT functions into a single processing block, and for a unified hardware structure that can be used to do both quantization and de-quantization on 8 words in a single clock cycle.
There is yet a further need in the art for a hardware processing structure that is flexible enough to implement different equations in order to support multiple CODEC standards and has the capability of computing significant coefficients on the fly with no overhead to speed up processing for entropy coding. Accordingly, there is a need in the art for a de-blocking filter DSP that a) can be programmed to be used for any codec, particularly H.264, AVS, MPEG-2, MPEG-4, VC-1 and derivatives or updates thereof, and b) can operate at a rate of at least 30 frames per second.
Additionally, there is also a need for a two dimensional register set arrangement to facilitate two dimensional processing in a single clock cycle, thereby accelerating the processing function. In processors, data registers are used to upload operands for an operation and then store the output. They are typically accessible in only one dimension.
There is also a need for a media processing unit that can be used to perform a given processing function for various kinds of media data, such as graphics, text, and video, and can be tailored to work with any coding standard or approach. It would further be preferred that such a processing unit provides optimal data/memory management along with a unified processing approach to enable a cost-effective and efficient processing system. More specifically, a system on chip architecture is needed that can be efficiently scaled to meet new processing requirements, while at the same time enabling high processing throughputs.
The present specification discloses a processing architecture that has multiple levels of parallelism and is highly configurable, yet optimized for media processing. Specifically, the novel architecture has three levels of parallelism. At the highest level, the architecture is structured to enable each processor, which is dedicated to a specific media processing function, to operate substantially in parallel. For example, as shown in
The processor therefore has no inherent limits on how much data can be processed. Unlike other processors, the presently disclosed processor has no limitation on the number of functional data paths or execution units that can be implemented because of the multiple data buses, namely a program data bus and two data buses, which operate in parallel and where each bus is configurable such that it can carry one or N number of operands.
In addition to this multi-layered parallelism, the processor has multiple layers of configurability. Referring to
In one embodiment, the present invention is directed toward a processor with a configurable functional data path, comprising: a plurality of address generator units; a program flow control unit; a plurality of data and address registers; an instruction controller; a programmable functional data path; and at least two memory data buses, wherein each of said two memory data buses is in data communication with said plurality of address generator units, program flow control unit, plurality of data and address registers, instruction controller, and programmable functional data path. Optionally, the programmable function data path comprises circuitry configured to perform entropy encoding, discrete cosine transform (DCT), inverse discrete cosine transform (IDCT), motion compensation, motion estimation, de-blocking filter, de-interlacing, de-noising, quantization, or dequantization on data input into said programmable function data path. Optionally, the circuitry configured to perform entropy encoding, discrete cosine transform (DCT), inverse discrete cosine transform (IDCT), motion compensation, motion estimation, de-blocking filter, de-interlacing, de-noising, quantization, or dequantization processing on data input into said programmable function data path can be logically programmed to perform that processing in accordance with any of the H.264, MPEG-2, MPEG-4, VC-1, or AVS protocols without modifying the physical circuitry. Optionally, any of the aforementioned processing can be performed to enable a display of video at a rate of at least 30 frames per second at a processor frequency of 500 MHz or below.
In another embodiment, the present invention is directed toward a processor, comprising: a plurality of address generator units; a program flow control unit; a plurality of data and address registers; an instruction controller; and a programmable functional data path, wherein said programmable function data path comprises circuitry configured to perform any one of the following processing functions on data input into said programmable function data path: DCT processing, IDCT processing, motion estimation, motion compensation, entropy encoding, de-interlacing, de-noising, quantization, or dequantization. Optionally, the circuitry can be logically programmed to perform said processing functions in accordance with any of the H.264, MPEG-2, MPEG-4, VC-1, or AVS protocols without modifying the physical circuitry. The processing functions can be performed to enable a display of video at a rate of at least 30 frames per second at a processor frequency of 500 MHz or below.
In another embodiment, the present invention is a system on chip comprising at least five processors of claim 1 and a task scheduler wherein a first processor comprises a programmable function data path configured to perform entropy encoding on data input into said programmable function data path; a second processor comprises a programmable function data path configured to perform discrete cosine transform processing on data input into said programmable function data path; a third processor comprises a programmable function data path configured to perform motion compensation on data input into said programmable function data path; a fourth processor comprises a programmable function data path configured to perform deblocking filtration on data input into said programmable function data path; and a fifth processor comprises a programmable function data path configured to perform de-interlacing on data input into said programmable function data path. Additional processors can be included that are directed to any of the processing functions described herein.
Therefore, it is an object of the present invention to provide a media processing unit that comprises a template Front End Processor (FEP) with an Extendable Data Path portion for customizing the FEP in accordance with a plurality of specific functional processing needs.
It is another object of the present invention to provide a two dimensional register set arrangement to facilitate two dimensional processing in a single clock cycle, thereby accelerating media processing functions.
According to another objective, a processing unit of the present invention combines DCT and IDCT functions in a single unified block. A single programmable processing block allows for computationally efficient processing of 2, 4, and 8 point forward and reverse DCT.
It is also an object of the present invention to provide a processing unit that combines Quantization (QT) and De-Quantization (DQT) functions in a single unified block and is flexible enough to implement different equations in order to support multiple CODEC standards and has the capability of computing significant coefficients on the fly with no overhead to speed up processing for entropy coding. Accordingly, in one embodiment a unified processing unit is used to do both quantization and de-quantization on 8 words in a single clock cycle.
According to another object of the present invention a motion compensation processing unit uses a single data path to process multiple codecs.
It is another object of the present invention to have a de-blocking filter DSP that can be programmed to be used for any codec and can also operate at a rate of at least 30 frames per second.
It is yet another object of the present invention to have a media processing unit that can be used to perform a given processing function for various kinds of media data, such as graphics, text, and video, and can be tailored to work with any coding standard or approach. Accordingly, in one embodiment the media processing unit of the present invention provides optimal data/memory management along with a unified processing approach to enable a cost-effective and efficient processing system.
These and other features and advantages of the present invention will be appreciated, as they become better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
a is a first representation of an 8 row×8 column matrix representation of an 8-point forward DCT;
b is a second representation of an 8 row×8 column matrix representation of an 8-point forward DCT;
c is a third representation of an 8 row×8 column matrix representation of an 8-point forward DCT;
a shows a circuit structure of an 8-point DCT system of the present invention;
b is a structure of an addition and subtraction circuit comprising a pair of an adder and a subtractor implemented in the present invention;
c is a structure of a multiplication circuit implemented in the present invention;
a is a first representation of an 8 row×8 column matrix representation of an 8-point Inverse DCT;
b is a second representation of an 8 row×8 column matrix representation of an 8-point Inverse DCT;
c is a third representation of an 8 row×8 column matrix representation of an 8-point Inverse DCT;
a shows a circuit structure of an 8-point inverse DCT of the present invention;
b is a view of a structure of a multiplication circuit implemented in the present invention;
a is a first representation of a 4 row×4 column matrix representation of a 4-point forward DCT;
b is a second representation of a 4 row×4 column matrix representation of a 4-point forward DCT;
c is a third representation of a 4 row×4 column matrix representation of a 4-point forward DCT;
a shows a circuit structure of a 4-point DCT system of the present invention;
b is a view of a structure of an addition and subtraction circuit comprising a pair of an adder and a subtractor;
c is a view of a structure of a multiplication circuit;
a is a first representation of a 4 row×4 column matrix representation of a 4-point Inverse DCT;
b is a second representation of a 4 row×4 column matrix representation of a 4-point Inverse DCT;
c is a third representation of a 4 row×4 column matrix representation of a 4-point Inverse DCT;
a is a first representation of a 2 row×2 column matrix representation of a 2-point forward DCT;
b is a second representation of a 2 row×2 column matrix representation of a 2-point forward DCT;
c is a third representation of a 2 row×2 column matrix representation of a 2-point forward DCT;
While the present invention may be embodied in many different forms, for the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Any alterations and further modifications in the described embodiments, and any further applications of the principles of the invention as described herein are contemplated as would normally occur to one skilled in the art to which the invention relates. Where arrows are utilized in the drawings, it would be appreciated by one of ordinary skill in the art that the arrows represent the interconnection of elements and/or components via buses or any other type of communication channel.
The present invention will presently be described with reference to the aforementioned drawings. Headers will be used for purposes of clarity and are not meant to limit or otherwise restrict the disclosures made herein.
It should be appreciated that this processor, when configured for a specific processing function, can be implemented in a system architecture that may comprise a plurality of processors, 1901-1910, with each processor being dedicated to a specific processing function, such as entropy encoding (1901), discrete cosine transform (DCT) (1902), inverse discrete cosine transform (IDCT) (1903), motion compensation (1904), motion estimation (1905), de-blocking filter (1906), de-interlacing (1907), de-noising (1908), quantization (1909), and dequantization (1910), and being managed by a task scheduler 1911. In addition to processor-level parallelism, each processing unit (1901-1910) can operate on multiple words in parallel, rather than just a single word per clock cycle. Finally, at the instruction level, the control data memory (shown as 125 in
The FEP 105 comprises two Address Generation Units (AGU) 120 connected to a data memory 125 via data bus 130 that in one embodiment is a 128 bit data bus. The data bus further connects PCU 16×16 register file 135, address registers 140, program control 145, program memory 150, arithmetic logic unit (ALU) 155, instruction dispatch and control register 160 and engine interface 165. Block 190 depicts a MOVE block. The FEP 105 receives and manages instructions, forwarding the data path specific instructions to the Extendable Data Path 110, and manages the registers that contain the data being processed.
In one embodiment the FEP 105 has 128 data registers that are further divided into upper 96 registers for the Extendable Data Path 110 and lower 32 registers for the FEP 105. During operation the instruction set is transmitted to Extendable Data Path 110 and the FEP 105 directs requisite data to the registers (the AGU 120 decodes instructions to know what data to put into the registers), allocating the data to be executed on by the Extendable Data Path 110 into the upper 96 registers. For example, if the instruction set is R3=R0+R1 then since this is done in the ALU 155, the data values for it are stored in the lower 32 registers. However, if another instruction is a filter instruction that needs to be executed by the Extendable Data Path 110, the required data is stored in the upper 96 registers.
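By way of illustration only, the register split described above can be sketched in Python as follows. The names `FEP_REGISTERS`, `EDP_REGISTERS`, and `register_bank` are hypothetical, and the sketch assumes that "lower" refers to register indices 0 to 31 and "upper" to indices 32 to 127; the specification does not state the numbering.

```python
# Hypothetical model of the 128 data registers described above.
# Assumption: "lower 32" means indices 0..31, "upper 96" means 32..127.
FEP_REGISTERS = range(0, 32)     # operands for instructions executed in the ALU
EDP_REGISTERS = range(32, 128)   # operands for the Extendable Data Path


def register_bank(executing_unit):
    """Return the register range that holds operands for the given unit."""
    if executing_unit == "ALU":
        return FEP_REGISTERS
    if executing_unit == "EDP":
        return EDP_REGISTERS
    raise ValueError(f"unknown executing unit: {executing_unit}")
```

Under these assumptions, the operands of R3=R0+R1 fall in the range returned by `register_bank("ALU")`, while the operands of a filter instruction would be placed in the range returned by `register_bank("EDP")`.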
The Extendable Data Path 110 further comprises instruction decoder and controller 170 and has an independent path 175 from Variable Size Engine Register File 180 to data memory 185. This path 175 can be of any size, such as 1028 bits, 2056 bits, or other sizes, and customized to each Function Specific Data Path 115. This provides flexibility in the amount of data that can be processed in any given clock cycle. Persons of ordinary skill in the art should note that in order to make the Extendable Data Path 110 useful for its intended purpose, the processing unit 100 is flexible enough to accept a wide range of instructions. The instruction format 200 of
While each functional path specific to one or more media processing functions will be described in greater detail below, a novel system and method of enabling rapid data access, employed by one or more of such functional paths specific to one or more media processing functions, uses a two dimensional data register set.
When compared with prior art one dimensional register set 300 of
Thus, during processing, when Register0 is processed (to do a transformation such as ‘Discrete Cosine Transform’) an entire clock cycle is used in accessing only Register0 in the prior art one dimensional register. However, in the two dimensional register set of the present invention a single clock cycle can be used to not only access/process Register0 but also the column (defined as Register 0 to Register N), which is a logically different register that occupies the same physical space as Register0.
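The row and column views described above can be modelled in software as two ways of reading the same storage. The following is a minimal behavioural sketch only; the class name `TwoDimensionalRegisterSet` and its methods are hypothetical, and the sketch says nothing about how single-cycle access is achieved in hardware.

```python
class TwoDimensionalRegisterSet:
    """Behavioural model: one square storage array, readable by row or column."""

    def __init__(self, size):
        self.size = size
        self.cells = [[0] * size for _ in range(size)]

    def write_row(self, index, values):
        """Store one register (a row of `size` words)."""
        if len(values) != self.size:
            raise ValueError("row length must equal register set size")
        self.cells[index] = list(values)

    def read_row(self, index):
        """Conventional access: all words of one register."""
        return list(self.cells[index])

    def read_column(self, index):
        """Second dimension: word `index` of every register, as one logical register."""
        return [row[index] for row in self.cells]
```

Reading a column returns the data that would otherwise require one access per row, which is the property the two dimensional arrangement relies on.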
It should be appreciated that the DCT unit 513, array of transpose registers 514, 515, scaling memory 518, and 8 quantizers 517, represent elements of the function specific data path, shown as 115 in
Additionally, as discussed above, the same circuit structure useful for processing a DCT/IDCT function in accordance with one standard or protocol can be repurposed and configured to process a different standard or protocol. In particular, the DCT/IDCT functional data path for processing data in accordance with H.264 can also be used to process data in accordance with VC-1, MPEG-2, MPEG-4, or AVS. Accordingly, different sized blocks in an image can be DCT or IDCT processed with processor 500. For example, 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, 4×4, and 2×2 macro-blocks can be transformed using horizontal and vertical transform matrices of sizes 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4.
Referring to
A typical forward DCT can be mathematically expressed as Y = C X C^T, where C is a transformation matrix, C^T is its transpose, X is the input matrix and Y is the matrix of output transformed coefficients. For an 8-point forward DCT, this equation can be implemented mathematically in the form of 8×8 matrices as shown in
Thus, in an 8-point forward DCT mode, 8×8 blocks of pixel information are transformed into 8×8 matrices of corresponding frequency coefficients. To do this transformation, the present invention uses row-column approach where each row of the input matrix is transformed first using 8-point DCT, followed by transposition of the intermediate data, and then another round of column-wise transformation. Each time 8-point DCT is performed, 8 coefficients are produced from the matrix multiplication shown below:
{y0 y1 y2 y3 y4 y5 y6 y7} = {x0 x1 x2 x3 x4 x5 x6 x7} × A
where:
In one embodiment, the above mentioned equations are implemented in three pipeline stages, producing eight coefficients at a time, as shown in
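The row-column approach described above (transform each row, transpose, transform again) can be checked numerically with the following minimal sketch. It is an illustration under stated assumptions, not the disclosed circuit: it uses the floating-point real DCT as a stand-in for the integer transformation matrix, it does not model the three pipeline stages, and the function names are hypothetical.

```python
import math


def dct_matrix(n=8):
    """Orthonormal real DCT-II matrix C (floating-point stand-in)."""
    matrix = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        matrix.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                       for i in range(n)])
    return matrix


def transform_rows(block, c):
    """Apply the 1-D transform to every row: each row x becomes x * C^T."""
    return [[sum(c[k][i] * row[i] for i in range(len(row)))
             for k in range(len(c))]
            for row in block]


def transpose(block):
    return [list(column) for column in zip(*block)]


def forward_dct_2d(block):
    """Row pass, transpose, second pass, transpose back: yields C X C^T."""
    c = dct_matrix(len(block))
    row_pass = transform_rows(block, c)                 # X C^T
    second_pass = transform_rows(transpose(row_pass), c)  # C X^T C^T
    return transpose(second_pass)                       # C X C^T
```

For a constant 8×8 block of ones, the only non-zero output is the DC coefficient, which equals 8 with this normalization.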
Referring now to
In stage two the second intermediate values a8 to a11 and first intermediate values a1, a3, a5, a7 are selectively paired, written to first stage intermediate value holding registers 720 from where they are output pair-wise to multiplication circuits where they are multiplied with parameters c1 to c7. For example, second intermediate values a8=a0+a2 and a10=a4+a6 are multiplied with a pair of parameters c4, c4 in multiplication circuit 7021 to obtain a quadruple of intermediate values k0=a8c4, k1=a10c4, k2=a8c4 and k3=a10c4 that are written to second stage intermediate value holding registers 721. Persons of ordinary skill in the art would appreciate that values k0, k1, k2 and k3 are equivalent to [(x0+x7)+(x3+x4)]c4, [(x1+x6)+(x2+x5)]c4, [(x0+x7)+(x3+x4)]c4, [(x1+x6)+(x2+x5)]c4 respectively. Similarly, values k4 to k23 are obtained as evident from the logic flow diagram of
In stage three, a routing switch 725 is used that outputs intermediate values k0 to k23 in selective pairs for further adding or subtraction. For example, values k0 and k1 are added to obtain intermediate value m0=k0+k1 while values k6 and k7 are subtracted to obtain intermediate value m3=k6−k7 and so on as shown in
Since the inverse and forward DCT are orthogonal, the inverse DCT is given as X = C^T Y C, where C is the transformation matrix, C^T is its transpose, Y is the matrix of input transformed coefficients and X is the matrix of output inverse transformed samples. For an 8-point inverse DCT, this equation can be implemented mathematically in the form of 8×8 matrices as shown in
For H.264 codec:
a0=y0+y4;
a4=y0−y4;
a2=(y2>>1)−y6;
a6=y2+(y6>>1);
a1=−y3+y5−y7−(y7>>1);
a3=y1+y7−y3−(y3>>1);
a5=−y1+y7+y5+(y5>>1); and
a7=y3+y5+y1+(y1>>1).
b0=a0+a6;
b2=a4+a2;
b4=a4−a2;
b6=a0−a6;
b1=a1+(a7>>2);
b7=−(a1>>2)+a7;
b3=a3+(a5>>2); and
b5=(a3>>2)−a5.
Yet further:
m0=b0+b7;
m1=b2+b5;
m2=b4+b3;
m3=b6+b1;
m4=b6−b1;
m5=b4−b3;
m6=b2−b5; and
m7=b0−b7.
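The three groups of H.264 equations above translate directly into integer adds, subtracts, and shifts. The following minimal sketch evaluates them in software for one 8-point input; it reads each shift as applying to a single operand, as in the H.264 specification, and relies on Python's `>>` being an arithmetic (sign-preserving) shift. The function name is hypothetical and the sketch does not model the pipeline stages.

```python
def h264_inverse_transform_8(y):
    """One 1-D pass of the H.264 8-point inverse transform (integer only)."""
    y0, y1, y2, y3, y4, y5, y6, y7 = y

    # First group: even and odd input combinations.
    a0 = y0 + y4
    a4 = y0 - y4
    a2 = (y2 >> 1) - y6
    a6 = y2 + (y6 >> 1)
    a1 = -y3 + y5 - y7 - (y7 >> 1)
    a3 = y1 + y7 - y3 - (y3 >> 1)
    a5 = -y1 + y7 + y5 + (y5 >> 1)
    a7 = y3 + y5 + y1 + (y1 >> 1)

    # Second group: each shift applies to a single term.
    b0 = a0 + a6
    b2 = a4 + a2
    b4 = a4 - a2
    b6 = a0 - a6
    b1 = a1 + (a7 >> 2)
    b7 = -(a1 >> 2) + a7
    b3 = a3 + (a5 >> 2)
    b5 = (a3 >> 2) - a5

    # Third group: final butterflies m0..m7.
    return [b0 + b7, b2 + b5, b4 + b3, b6 + b1,
            b6 - b1, b4 - b3, b2 - b5, b0 - b7]
```

As a quick check, an input containing only the DC term, such as `[8, 0, 0, 0, 0, 0, 0, 0]`, produces the flat output `[8, 8, 8, 8, 8, 8, 8, 8]`.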
8-point Inverse DCT can be viewed as matrix multiplication as shown below:
{x0 x1 x2 x3 x4 x5 x6 x7} = {y0 y1 y2 y3 y4 y5 y6 y7} × B
where:
For H.264 codec:
a0=y0+y4=k0+k1=m0=m6;
a4=y0−y4=k0−k1=m2=m4;
a2=(y2>>1)−y6=k6−k7=m3=m5;
a6=y2+(y6>>1)=k4+k5=m1=m7;
a1=−y3+y5−y7−(y7>>1)=(y5)−(y3+y7+y7>>1)=(k10+k13)−(k16+k23)=m14−m15=p7;
a3=y1+y7−y3−(y3>>1)=(y1)−(y3+y3>>1−y7)=(k12+k9)−(k20−k17)=m12−m13=p6;
a5=−y1+y7+y5+(y5>>1)=−((y1−(y5+y5>>1))−y7)=−((k14−k11)−(k22+k19))=−(m10−m11)=−p5; and
a7=y3+y5+y1+(y1>>1)=((y1+y1>>1)+y5)+(y3)=(k8+k15)+(k18+k21)=m8+m9=p4.
b0=a0+a6=m0+m1=p0;
b2=a4+a2=m2+m3=p1;
b4=a4−a2=m4−m5=p2;
b6=a0−a6=m6−m7=p3;
b1=a1+a7>>2=p7+p4>>2=q4;
b3=a3+a5>>2=p6+(−(−p5>>2))=q5;
b5=a3>>2−a5=p6>>2+(−p5)=q6; and
b7=−a1>>2+a7=−p7>>2+p4=q7.
Yet further:
m0=b0+b7=p0+q7=x0;
m1=b2+b5=p1+q6=x1;
m2=b4+b3=p2+q5=x2;
m3=b6+b1=p3+q4=x3;
m4=b6−b1=p3−q4=x4;
m5=b4−b3=p2−q5=x5;
m6=b2−b5=p1−q6=x6; and
m7=b0−b7=p0−q7=x7.
These equations are implemented in pipeline stages, producing eight output inverse transforms at a time, as shown in
As illustrated in
For a 4-point forward DCT, the transformation can be implemented mathematically in the form of 4×4 matrices as shown in
Each time 4-point DCT is used, 4 coefficients are produced from matrix multiplication as shown below:
Again, the logic structure 700 of
b is a view of the basic structure of the addition and subtraction circuit 1101 comprising a pair of an adder 1105 and a subtractor 1106. The input data x0 and x1 are input to the adder 1105 and the subtractor 1106. The adder 1105 outputs the result of the addition of x0 and x1 as x0+x1, while the subtractor 1106 outputs the result of subtraction of x0 and x1 as x0−x1.
For a 4-point inverse DCT, the transformation can be implemented mathematically in the form of 4×4 matrices as shown in
4-point Inverse DCT can be implemented by matrix multiplication as shown below:
These equations are implemented in pipeline stages, producing eight output inverse transforms at a time, as shown in
For a 2-point forward DCT, the transformation can be implemented mathematically in the form of 2×2 matrices as shown in
Each time 2-point DCT is used, 2 coefficients are produced from 2×1 by 2×2 matrix multiplication as shown below:
As discussed above, the logical circuit 1500 in
Referring back to
The amount of quantization is controlled by a step value referred to as Quantization Parameter (QP). QP determines the scaling value with which each element of the block is quantized or scaled. These scaling values are stored in lookup tables, such as within a scaling memory, at the time of initialization, and are retrieved later during the quantization operation. The QP is used to compute the pointer into this table. Thus, the quantizer is programmed with a quantization level or step size.
According to an important aspect of the present invention, quantization and de-quantization occur in the same pipeline stage, and the operations are therefore performed in sequence, one after the other, using the same hardware structure. In other words, according to a novel aspect, the hardware structure of the present invention is configurable and generic enough to support different types of equations (depending upon the video encoding standard or CODEC). This is accomplished by breaking down the hardware into simpler functions and then controlling them through instructions to perform the different types of equations required by different video encoding standards or CODECs.
Referring to
As mentioned earlier the quantization techniques used depend on the encoding standard. For example, the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) drafted a video coding standard titled ITU-T Recommendation H.264 and ISO/IEC MPEG-4 Advanced Video Coding, which is incorporated herein by reference. In the H.264 standard, video is encoded on a macroblock-by-macroblock basis.
Generally, the human eye is more perceptive to the luma characteristics of video, compared to the chroma red and chroma blue characteristics. Accordingly, there are more pixels in the luma grid 1705 compared to the chroma red grid 1706 and the chroma blue grid 1707. In the H.264 standard, the chroma red grid 1706 and the chroma blue grid 1707 have half as many pixels as the luma grid 1705 in each direction. Therefore, the chroma red grid 1706 and the chroma blue grid 1707 each have one quarter as many total pixels as the luma grid 1705. Also, H.264 uses a non-linear scalar, where each component in the block is quantized using a different step value.
In one embodiment there are two lookup tables namely LevelScale 2130 and LevelOffset 2140, shown as inputs into the quantization layers 2105 in
LevelScale=LevelScale4×4Luma[1][luma_qp_rem]
LevelOffset=LevelOffset4×4Luma [1][luma_qp_per]
level=[(abs(input)*LevelScale[indxPtr])+(LevelOffset[indxPtr])]>>(qbits)
output=level*sign(input)
LevelScale=LevelScale4×4Chroma [CrCb][Intra][cr_qp_rem or cb_qp_rem]
LevelOffset=LevelOffset4×4Chroma [CrCb][Intra][cr_qp_per or cb_qp_per]
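The level and output equations above can be exercised with the following minimal sketch. It is an illustration only: the scale, offset, and `qbits` values passed in are hypothetical placeholders rather than entries from the LevelScale and LevelOffset tables, and the `split_qp` helper assumes the common H.264 convention that the "per" and "rem" indices are the quotient and remainder of QP divided by 6, which this specification does not itself state.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def split_qp(qp):
    """Assumed convention: qp_per = QP // 6, qp_rem = QP % 6."""
    return qp // 6, qp % 6


def quantize(coefficient, level_scale, level_offset, qbits):
    """level = (abs(input) * LevelScale + LevelOffset) >> qbits, sign restored."""
    level = (abs(coefficient) * level_scale + level_offset) >> qbits
    return level * sign(coefficient)
```

For example, with the placeholder values `level_scale=13107`, `level_offset=10922`, and `qbits=15`, an input of 100 quantizes to 40 and an input of -100 quantizes to -40, showing that the magnitude is scaled first and the sign is reapplied afterwards.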
VC-1 is a standard promulgated by the SMPTE, and by Microsoft Corporation (as Windows Media 9 or WM9).
Output=[(input*DQScaleTable[DCStepSize])+(1<<17)]>>18
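The VC-1 expression above multiplies by a table entry, adds a rounding term of 1<<17, and shifts right by 18 bits. The sketch below illustrates that arithmetic under one assumption that is not stated in this specification: that each table entry approximates 2^18 divided by the step size, so that the multiply and shift together act as a rounded division. The table built here is hypothetical and is not the DQScaleTable of the standard.

```python
SHIFT = 18
ROUNDING = 1 << (SHIFT - 1)   # 1 << 17, as in the expression above


def build_dq_scale_table(max_step):
    """Hypothetical table: entry s approximates 2**18 / s (entry 0 unused)."""
    return [0] + [round((1 << SHIFT) / step) for step in range(1, max_step + 1)]


def scale_dc(value, dc_step_size, table):
    """Output = (input * table[DCStepSize] + (1 << 17)) >> 18."""
    return (value * table[dc_step_size] + ROUNDING) >> SHIFT
```

With a step size of 4 the assumed table entry is 65536, so an input of 102 yields 26 (25.5 rounded up) and an input of 100 yields 25.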
De-Quantization is the inverse of quantization, where the quantized coefficients are scaled up to their normal range before transforming back to the spatial domain. Similar to quantization, there are equations (provided below) for the de-quantization.
One embodiment uses a single lookup table, InvLevelScale. During the de-quantization process, values from this table are read and used in the equations (provided below) using index pointers that are computed using QP.
InvLevelScale=InvLevelScale4×4Luma[1][luma_qp_rem]
InvLevelScale=InvLevelScale4×4Chroma [CrCb][Intra][cr_qp_rem or cb_qp_rem]
In one embodiment, assuming 16-bits for Level Scale, Inverse Level Scale & Level Offset, the total memory required for Level Scale is 1344 Bytes, and for Level Offset & Inverse Level Scale together is 1728 Bytes. With 128-bit wide memory, one instance of 84 & one instance of 108 deep memories are needed, in one embodiment.
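The memory depths stated above follow from dividing the table sizes by the width of one memory word: a 128-bit word holds 16 bytes, so 1344 bytes occupy 84 words and 1728 bytes occupy 108 words. A minimal sketch of that arithmetic, with a hypothetical helper name, is:

```python
def memory_depth(total_bytes, word_width_bits=128):
    """Number of memory words needed to hold total_bytes, rounding up."""
    bytes_per_word = word_width_bits // 8
    depth, remainder = divmod(total_bytes, bytes_per_word)
    return depth + 1 if remainder else depth
```

Each 128-bit word also holds eight 16-bit table entries, which is consistent with operating on 8 words in a single clock cycle.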
Standards such as MPEG, AVS, VC-1, ITU-T H.263 and ITU-T H.264 support video coding techniques that utilize similarities between successive video frames, referred to as temporal or inter-frame correlation, to provide inter-frame compression. The inter-frame compression techniques exploit data redundancy across frames by converting pixel-based representations of video frames to motion representations. In addition, some video coding techniques may utilize similarities within frames, referred to as spatial or intra-frame correlation, to further compress the video frames. The video frames are often divided into smaller video blocks, and the inter-frame or intra-frame correlation is applied at the video block level.
In order to achieve video frame compression, a digital video device typically includes an encoder for compressing digital video sequences, and a decoder for decompressing the digital video sequences. In many cases, the encoder and decoder form an integrated “codec” that operates on blocks of pixels within frames that define the video sequence. For each video block in the video frame, a codec searches similarly sized video blocks of one or more immediately preceding video frames (or subsequent frames) to identify the most similar video block, referred to as the “best prediction.” The process of comparing a current video block to video blocks of other frames is generally referred to as motion estimation. Once a “best prediction” is identified for a current video block during motion estimation, the codec can code the differences between the current video block and the best prediction.
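The search for a "best prediction" described above can be sketched as an exhaustive comparison of the current block against every same-sized block of a reference frame. The sketch below is an illustration only: it assumes the sum of absolute differences (SAD) as the similarity measure, which this specification does not mandate, searches the whole frame rather than a limited window, and uses hypothetical function names.

```python
def extract_block(frame, top, left, size):
    """Return the size x size block whose top-left pixel is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def best_prediction(current_block, reference_frame, size):
    """Full search: return (cost, top, left) of the most similar block."""
    best = None
    for top in range(len(reference_frame) - size + 1):
        for left in range(len(reference_frame[0]) - size + 1):
            cost = sad(current_block,
                       extract_block(reference_frame, top, left, size))
            if best is None or cost < best[0]:
                best = (cost, top, left)
    return best
```

The returned position, relative to the position of the current block, gives the motion vector that a subsequent motion compensation step would use.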
This process of coding the differences between the current video block and the best prediction includes a process referred to as motion compensation. Motion compensation comprises a process of creating a difference block indicative of the differences between the current video block to be coded and the best prediction. In particular, motion compensation usually refers to the act of fetching the best prediction block using a motion vector, and then subtracting the best prediction from an input block to generate a difference block. The difference block typically includes substantially less data than the original video block represented by the difference block.
The present invention provides a motion compensation processor that is a highly configurable, programmable, scalable processing unit that handles a plurality of codecs. In one embodiment, the motion compensation processor comprises the front end processor with an extendable data path and, more specifically, a functional data path configured to provide motion compensation processing. In one embodiment, this processor runs at or below 500 MHz, more preferably 250 MHz. In another embodiment, the physical circuit structure of this processor can be logically programmed to process high definition content using multiple different codecs, protocols, or standards, including H.264, AVS, H.263, VC-1, or MPEG (any generation), while running at or below 250 MHz.
Referring to
More specifically, the adaptive Deblocking Filter (hereinafter referred to as DBF) of the present invention comprises Front-End Processor (FEP) 2505 and extendable data path DBF 2510. The extendable data path DBF 2510 uses the Extended Data Path (EDP) of FEP 2505 acting as a co-processor, decoding instructions forwarded by FEP 2505 and executing them in Control Data Path (CDP) 2515 and configurable 1-D filter 2520. The FEP 2505 provides unified programming interface for DBF 2510. The extendable data path DBF 2510 comprises a first Transpose module (T0) 2521 and a second Transpose module (T1) 2522, Control Data Path (CDP) 2515, Configurable Parallel-In/Parallel-Out 1-D Filter 2520, Instruction Decoder 2525, Parameters Register File (PRF) 2530, and Engine Register File (DBFRF) 2535.
In one embodiment, the transpose modules 2521, 2522 are each 8×4 pixel arrays that are used to store and process two adjacent 4×4 blocks, row by row. Modules 2521, 2522 use transpose functions when performing vertical filtering on H-boundaries (horizontal boundaries) and regular functions when performing horizontal filtering on V-boundaries. The two modules are used as ping-pong arrays to speed up the filtering process.
CDP 2515 is used to compute the conditions needed to decide the filtering, and in one embodiment implements the H.264/AVC, VC-1, and AVS codecs. It also contains three look-up tables needed to compute different thresholds. The 1-D filter 2520 is a two-stage pipelined filter comprising adders and shifters. Parameter control 2530 comprises all information/parameters related to the current macro block that the DBF 2505 is processing. The information/parameters are provided by a content manager (CM). The parameters are used in CDP 2515 to make filtering decisions. Engine Register File 2535 comprises information used from the extended function-specific instructions inside DBF 2505.
Table 1 below shows the comparison of the main properties of DBF 2505 for different codecs covered in one embodiment. A preferred picture resolution targeted herein is at least 1080i/p (1080×1920 @ 30 Hz) High Definition.
The architecture of the adaptive DBF of the present invention can take any block size and transpose as necessary in order to abide by the filtering requirements of a specific codec. To achieve this, the architecture first organizes the memory in a manner that can support any of the various codecs' approaches to doing DBF. Specifically, the memory organization ensures that whatever data is needed from neighbor blocks (or as a result of processing that was just completed) is readily available. Persons of ordinary skill in the art would appreciate that the actual filtering algorithm is defined by the codec being used, the use of the transpose function is defined by the codec being used and the size/number of blocks is defined by the codec being used.
The data path pipeline stages are shown in
Max Requirement
1080i/p @ 30 Hz (30 frames/sec)
((1080+offset)*1920)/(16*16) = (1088*1920)/256 = 8160 MB/frame
1/(30*8160) s = 4.085*1E−6 s = 4085 ns/MB
4085 ns/(1/235 MHz) = 4085 ns/4.26 ns = 958.92 clock cycles ≈ 959 clock cycles per MB
Based on
Actual Performance
100 cycles+16(HLuma)*8 cycles+4(HCb)*8 cycles+4(HCr)*8 cycles+16(VLuma)*10 cycles+4(VCb)*10 cycles+4(VCr)*10 cycles+100 cycles+200 cycles=832 cycles
The calculations above show that the filter fits within the target performance requirement for processing one macro block (MB): the 832 cycles actually needed are fewer than the per-macro-block cycle budget.
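The budget and the actual cycle count can be checked with a short sketch such as the following, offered for illustration only. All function names are hypothetical, and the three fixed terms (100, 100 and 200 cycles) are taken from the expression above without further interpretation. Evaluated without rounding the clock period, the budget comes to just under 960 cycles per macro block, in line with the rounded figure given in the text.

```python
import math

MB_SIZE = 16  # a macro block is 16x16 pixels


def macroblocks_per_frame(width: int, height: int) -> int:
    # Each dimension is rounded up to whole macro blocks (1080 rows -> 1088).
    return math.ceil(width / MB_SIZE) * math.ceil(height / MB_SIZE)


def cycle_budget_per_mb(width: int, height: int, fps: int, clock_hz: float) -> float:
    # Clock cycles available per macro block at the given frame rate.
    return clock_hz / (macroblocks_per_frame(width, height) * fps)


def actual_cycles_per_mb() -> int:
    fixed_terms = 100 + 100 + 200       # fixed terms as listed in the text
    horizontal = (16 + 4 + 4) * 8       # HLuma, HCb, HCr edges at 8 cycles each
    vertical = (16 + 4 + 4) * 10        # VLuma, VCb, VCr edges at 10 cycles each
    return fixed_terms + horizontal + vertical


budget = cycle_budget_per_mb(1920, 1080, fps=30, clock_hz=235e6)
actual = actual_cycles_per_mb()
```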
The deblocking filtering is done on a macro block basis, with macro blocks being processed in raster-scan order throughout the picture frame. Each MB contains 16×16 pixels, and the block size for motion compensation can be further partitioned down to 4×4 (the smallest block size for inter prediction). H.264/AVC and VC-1 can have 4×4, 8×4, 4×8, and 8×8 block sizes, and AVS can have only the 8×8 block size. Persons of ordinary skill in the art would realize that mixed block sizes within the MB boundary are also possible.
In order to ensure a match in the filtering process between decoder and encoder, the filtering preferably follows a pre-defined order. One embodiment of the filtering order for H.264/AVC is shown in
The filtering process also affects the boundaries of the already reconstructed macro blocks above and to the left of the current macro block. In one embodiment, frame boundaries are not filtered.
Similarly, the same order applies for macro blocks in AVS, but on the 8×8 boundary. The order of the internal filtered edges is the same as in H.264. In VC-1 the filtering order is different. For I, B, and BI pictures, filtering is performed on all 8×8 boundaries, whereas for P pictures filtering may be performed on 4×4, 4×8, 8×4, and 8×8 boundaries. For P pictures, the filtering order is as follows. First, all blocks or sub-blocks that have horizontal boundaries along the 8th, 16th, 24th, etc. horizontal lines are filtered. Next, all sub-blocks that have horizontal boundaries along the 4th, 12th, 20th, etc. horizontal lines are filtered. Next, all sub-blocks that have vertical boundaries along the 8th, 16th, 24th, etc. vertical lines are filtered. Last, all sub-blocks that have vertical boundaries along the 4th, 12th, 20th, etc. vertical lines are filtered.
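The four-pass ordering described for VC-1 P pictures may be sketched as below. This is an illustrative enumeration of candidate boundary lines only. The function name is hypothetical, and a real decoder would filter a line only where a block or sub-block boundary actually exists. Lines coinciding with the frame edge are excluded.

```python
def vc1_p_picture_edge_order(width: int, height: int):
    """List (orientation, line) pairs in the four-pass order described above.

    "H" marks a horizontal boundary (identified by its row), "V" a vertical
    boundary (identified by its column).
    """
    order = []
    order += [("H", y) for y in range(8, height, 8)]   # 8th, 16th, 24th, ... rows
    order += [("H", y) for y in range(4, height, 8)]   # 4th, 12th, 20th, ... rows
    order += [("V", x) for x in range(8, width, 8)]    # 8th, 16th, 24th, ... columns
    order += [("V", x) for x in range(4, width, 8)]    # 4th, 12th, 20th, ... columns
    return order


edges_16x16 = vc1_p_picture_edge_order(16, 16)
```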
In H.264/AVC for each boundary between adjacent luma blocks a “Boundary Strength” parameter bS is assigned as shown on
To preserve image sharpness, the true edges need to be left unfiltered as much as possible while filtering artificial edges to reduce their visibility. For that purpose the deblocking filtering is applied to a line of 8 samples (p3, p2, p1, p0, q0, q1, q2, q3) of two adjacent blocks in any direction, with the boundary line 3115 between p0 3105 and q0 3125 as shown in
Filtering does not take place for edges with bS equal to zero (bS=0). For edges with nonzero bS value, a pair of quantization-dependent threshold parameters, referred to as α and β, are used in the content activity check that determines whether each set of 8 samples is filtered. In one embodiment, sets of samples across this edge are only filtered if the following condition is true:
filterFlag=(bS≠0 &&|p0−q0|<α &&|p1−p0|<β &&|q1−q0|<β) (1-1)
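Condition (1-1) can be expressed as a small predicate, shown here purely as a sketch. The function name and the sample threshold values in the usage lines are hypothetical; in practice α and β are derived from the quantization parameters.

```python
def filter_flag(bs: int, p1: int, p0: int, q0: int, q1: int,
                alpha: int, beta: int) -> bool:
    """Evaluate condition (1-1) for one line of samples across an edge."""
    return (bs != 0
            and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)


# A small step across the edge passes the activity check and is filtered.
small_step = filter_flag(2, p1=100, p0=102, q0=105, q1=106, alpha=10, beta=4)
# A large step is treated as a true image edge and left unfiltered.
large_step = filter_flag(2, p1=60, p0=60, q0=120, q1=120, alpha=10, beta=4)
```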
Up to 3 pixels on each side of the boundary can be filtered in H.264/AVC. The values of the thresholds α and β are dependent on the average value of the quantization parameters (qPp and qPq) for the two blocks, as well as on a pair of index offsets, “FilterOffsetA” and “FilterOffsetB”, that may be transmitted in the slice header for the purpose of modifying the characteristics of the filter.
Overlap transform or smoothing is performed across the edges of two neighboring Intra blocks for both luma and chroma channels. This process is performed subsequent to decoding the frame and prior to deblocking filter. Overlap transforms are modified block based transforms that exchange information across the block boundary. Overlap smoothing is performed on the edges of 8×8 blocks that separate two Intra blocks.
The overlap smoothing is performed on the un-clipped 10 bit/pel reconstructed data. This is important because the overlap function can result in range expansion beyond the 8 bit/pel range.
Vertical edges are filtered first followed by the horizontal edges.
For I, B, and BI pictures the filtering is performed at all 8×8 block boundaries (luma, Cb or Cr plane). For P pictures the blocks may be Intra-coded or Inter-coded. If the blocks are Intra-coded, filtering is performed on 8×8 boundaries, and if the blocks are Inter-coded, filtering is performed on 4×4, 4×8, 8×4, and 8×8 boundaries.
The pixels for filtering are divided into 4×4 segments. In each segment the 3rd row is always filtered first. The result of this filtering determines whether the other 3 rows will be filtered. The Boolean value of ‘filter_other_3_pixels’ defines whether the remaining 3 rows in the segment are also to be filtered. If ‘filter_other_3_pixels’==TRUE, then they are filtered; otherwise they are not filtered and the filtering operation proceeds to the next 4×4 pixel segment.
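The control flow for one segment may be sketched as follows. The sketch captures only the ordering and the gating by the Boolean flag. The two callbacks stand in for the actual VC-1 filter arithmetic, which is not reproduced here, and all names are hypothetical.

```python
from typing import Callable, Sequence

Row = Sequence[int]


def filter_segment(rows: Sequence[Row],
                   filter_third_row: Callable[[Row], bool],
                   filter_row: Callable[[Row], None]) -> bool:
    """Process one four-row segment; return the flag produced by the 3rd row."""
    if len(rows) != 4:
        raise ValueError("a segment holds exactly four rows")
    # The 3rd row (index 2) is always filtered first and yields the flag.
    filter_other_3_pixels = filter_third_row(rows[2])
    if filter_other_3_pixels:
        # Only then are the remaining three rows filtered.
        for index in (0, 1, 3):
            filter_row(rows[index])
    return filter_other_3_pixels
```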
In VC-1 up to one pixel on each side of the boundary can be filtered. The following four exceptions are described in the Main Profile deblocking for P picture:
Referring to
The ME engine 3400 is provided with a dedicated pixel memory 3405, with different address mapping for different interfaces such as ME Filter 3401 and ME Array 3402 in the ME engine, as well as for related functional processing units of a media processing system, such as motion compensation (MC) and Debug. In one embodiment, the ME pixel memory 3405 comprises four vertical banks with the provision of multiple simultaneous writes across banks by means of address aliasing across the banks.
The ME Control block 3406 contains the circuitry and logic for controlling and coordinating the operation of various blocks in the ME engine 3400. It also interfaces with the Front End processor (FEP) 3407 which runs the firmware to control various functional processing units in a media processing system.
Data access and writes to the memory are facilitated through a set of four multiplexers (MUX) in the ME engine. While the Filter SRC MUX 3408 and REF SRC MUX 3409 interface with the pixel memory 3405 as well as external memory, the CUR SRC MUX 3410 is used to receive data from external memory and the Output Mux 3411 is used when data is to be written to the external memory.
During motion estimation processing, in order to progress through the frame, the selected window shifts down a pixel row for every clock cycle. Therefore, the ME Array 3402 is provided with a set of registers 3412 called Row 16 registers, which are used to store pixel data corresponding to the last row.
Referring to
The filters 3510 are designed to support loads from both external memory and internal memory 3505, and are capable of the following filter operation sizes:
The integrated circuit details for the filter design are illustrated in
In existing motion estimation systems, the structure of the ME array is designed to move data in two directions, and it takes 16 cycles to load a 16×16 array. However, in the motion estimation system of the present invention, the 16×16 motion estimation array is designed such that it moves data in 3 directions. An exemplary structure of such an ME Array is illustrated in
Further, the vertical intermediate columns of the array 3700, illustrated as [0:3] 4802, [4:7] 4803 and so on, help to save additional data by avoiding new loads for an adjacent coordinate. Another novel feature of the array structure of
The novel array structure of the present invention allows for data movement in three directions: up, down, and left. The array structure is capable of supporting loads from external memory as well as internal memory, and supports the following search sizes:
The array structure also permits optional data flipping on the byte boundary for write operations. The advantages and features of the ME array structure will become clearer when described with reference to the operation of the motion estimation engine of the present invention in the forthcoming sections.
It is known in the art that each frame in an image signal is divided into two kinds of blocks, known as luminance and chrominance blocks, as discussed above. For coding efficiency, motion estimation is applied to the luminance block.
After the best match is found amongst the candidate blocks, a motion vector for the best matching block is determined. This is shown in step 3805. The motion vector represents the displacement of the matched block to the present frame.
Thereafter, the input frame is subtracted from the prediction of the reference frame, as shown in step 3806. This allows just the motion vector and the resulting error to be transmitted instead of the original luminance block. This process of motion estimation is repeated for all the frames in the image signal, as illustrated in step 3807. As a result of using motion estimation, inter-frame redundancy is reduced, thereby achieving data compression.
On the decoder side, a given frame is rebuilt by adding the difference signal from the received data to the reference frames. The addition reproduces the present frame.
Functionally, motion estimation uses a specific window size, such as 8×8 or 16×16 pixels, and the current window is moved around to obtain motion estimation for the entire block. Thus, a motion estimation algorithm needs to be exhaustive, covering all the pixels across the block. For this purpose, an algorithm can use a larger window size; however, this comes at the cost of additional clock cycles. The motion estimation engine of the present invention implements a unique method of efficiently moving the search window around, making use of the novel ME Array structure (as described previously). According to this method:
1. Using the reference frame, a set of pixels corresponding to the chosen window size is loaded in the ME Array. The beginning point is the upper left corner of the frame.
2. At the same time when a set of pixels corresponding to the window is loaded, a “ghost column” to the right of the window is also loaded. As previously mentioned, the ME Array contains a ghost column after every fourth array column. That ghost column includes pixels to the right of the window and keeps them ready for processing when the window moves one pixel to the right.
3. To move around the frame, the window moves down by one pixel row every clock cycle. Each time it moves down, pixels at the top of the window move out of the array and new pixels at the bottom move in. This continues until the bottom of the frame is reached. Once the bottom is reached, the window moves one column to the right, thereby including the pixels in the ghost column.
4. The process is repeated, except that this time the window moves from bottom to top, that is, the frame moves down. On reaching the top of the frame, the window shifts to the right again, and again makes use of the ghost column.
Thus, the ghost column acts to significantly minimize loads, regardless of what window size is chosen.
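The window movement set out in steps 1 through 4 amounts to a serpentine scan, which may be sketched as follows. The generator name is hypothetical, and the sketch models only the sequence of window positions, not the ME Array registers or the ghost column itself. Because consecutive positions differ by a single row or a single column, only one new row (or the already loaded ghost column) is needed at each step.

```python
def serpentine_window_positions(frame_w: int, frame_h: int,
                                win_w: int, win_h: int):
    """Yield the (x, y) top-left corner of the window in scan order."""
    max_x = frame_w - win_w
    max_y = frame_h - win_h
    going_down = True
    for x in range(max_x + 1):
        # Sweep down the frame, then back up on the next column.
        ys = range(max_y + 1) if going_down else range(max_y, -1, -1)
        for y in ys:
            yield (x, y)
        going_down = not going_down


positions = list(serpentine_window_positions(3, 4, 2, 2))
```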
As previously disclosed, motion estimation involves identifying the best match between a current frame and a reference frame. To do so, the ME engine applies a window to the reference frame, extracts each pixel value into an array and, at each processing element in the array, performs a calculation to determine the sum of the differences. The processing element contains arithmetic units and two registers to hold the current pixel and reference pixel values. Because the window moves by a pixel row every clock cycle to progress through the frame, and shifts to the right on reaching the end of a column, only one clock cycle is needed to load the data required to analyze a search point during this integer search.
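A software analogue of the per-window calculation is sketched below for illustration. It assumes that the "sum of the differences" is the commonly used sum of absolute differences (SAD), which the text does not state explicitly. The function names are hypothetical, and the exhaustive double loop stands in for the hardware array, which evaluates all processing elements in parallel.

```python
def sad(current, reference):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(c - r)
               for cur_row, ref_row in zip(current, reference)
               for c, r in zip(cur_row, ref_row))


def window(frame, x, y, w, h):
    """Extract the w-by-h block whose top-left corner is at (x, y)."""
    return [row[x:x + w] for row in frame[y:y + h]]


def best_integer_match(current, reference_frame):
    """Return (cost, x, y) of the lowest-cost window in the reference frame."""
    h, w = len(current), len(current[0])
    best = None
    for y in range(len(reference_frame) - h + 1):
        for x in range(len(reference_frame[0]) - w + 1):
            cost = sad(current, window(reference_frame, x, y, w, h))
            if best is None or cost < best[0]:
                best = (cost, x, y)
    return best
```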
When doing an integer search, a motion estimation method may stop on obtaining an initial match. However, in the motion estimation method of the present invention, when the best match is found in a frame, the corresponding window is captured and sent to a filter to calculate the ½ pixel (½ pel) and ¼ pixel (¼ pel) values. This is referred to as interpolation. Thus, on finding the best integer match, all the required data around the search location is downloaded and interpolation is performed around it. At the same time, reference information for carrying out the next search also needs to be downloaded. The architecture of the motion estimation system of the present invention enables searches and interpolation to be performed concurrently. That is, data for a search can be loaded at the same time as data for filtering. To implement this parallel operation, the FEP executes two instructions: one to perform filtering and one to carry out searching. The memory structure of the motion estimation engine of the present invention is also designed to allow simultaneous loading of data, thereby enabling parallel searching and interpolation/filtering.
Specifically, with the integer search, every time the window is moved by a row or a column, data for the new row or column is loaded in, while data from the other rows or columns is retained. This is because during integer search, a majority of the rows or columns are reused in new calculations in subsequent processing steps. This automatically lowers the number of clock cycles required per search point to just one. However, for ½ pixel or ¼ pixel search, the data being used for each search point is not reused from the immediately prior calculation. In fact, each time, the data is completely new.
This fact is illustrated by means of
This implies that the entire data needs to be reloaded for each search point. If each column or row were to be loaded in the conventional manner, it would require 16 clock cycles for a 16×16 window, which is very inefficient.
In order to address this problem of inefficient data loading, the system of the present invention employs a novel design for the ME Array comprising horizontal banking. The concept of horizontal banking has been mentioned previously. Specifically, horizontal banking in the ME Array of the present invention involves having four separate memory banks, each of which is responsible for loading a portion of the window data. They can be used to load data either horizontally or vertically. By using four separate memory banks to load data for each search point, a search point can be processed in just 4 clock cycles, instead of 16. One of ordinary skill in the art will appreciate that the number of separate, dedicated memory banks in the ME Array is not limited to four, and may be determined on the basis of the window size chosen for motion estimation processing. The registers of the ME Array are able to determine when data is required to be loaded from the memory banks, and are capable of automatically computing the address of the memory bank from which data is to be accessed.
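The reduction from 16 to 4 clock cycles can be illustrated with the simple schedule below. It assumes that each bank delivers one window row per clock cycle and that consecutive rows are spread across the banks. That row-to-bank assignment is an assumption made for the sketch and is not specified in the text, and the function name is hypothetical.

```python
def load_schedule(window_rows: int = 16, banks: int = 4):
    """Return, for each clock cycle, the window rows delivered in that cycle.

    One row per bank per cycle is assumed.
    """
    return [list(range(start, min(start + banks, window_rows)))
            for start in range(0, window_rows, banks)]


banked = load_schedule(16, 4)        # four banks working in parallel
conventional = load_schedule(16, 1)  # one row per cycle
```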
The ME Engine of the present invention employs another novel design feature to further speed up the processing. The novel design feature involves provision of a shadow memory that is used in between the external memory interface (EMIF) and internal memory interface (IMIF). This is illustrated in
This problem is addressed in the system of the present invention by making use of shadow memory 4160. The shadow memory comprises a set of three circular disks of memories: SM1 4161, SM2 4162, and SM3 4163. The shadow memories 4160 are used to load certain data blocks and store them for future use, permitting the DMA 4120 to keep filling the memory 4110. An exemplary operation of shadow memories is illustrated by means of a table in
Referring to
Referring now to the instruction format 4200 of
The Loop slot 4205 provides a way to specify zero-overhead hardware loops of a single packet or multiple packets. DP0 and DP1 slots are used for engine-specific instructions and ALU instructions (Bit 17 differentiates the two). This is illustrated in the following table:
The engine instruction set is not explicitly defined here as it is different for every media processing function engine. For example, Motion Estimation engine provides an instruction set, and the DCT engine provides its own instruction set. These engine instructions are not executed in the FEP. The FEP issues the instruction to the media processing function engines and the engines execute them.
ALU instructions can be 18-bit or 36-bit. If the DP0 slot has a 36-bit ALU instruction, then the DP1 slot cannot have an instruction. AGU0 and AGU1 slots are used for AGU (Address Generation Units) instructions. If the AGU0 slot has an instruction with an immediate operand, then the least significant 16-bits of the AGU1 slot contains the 16-bit immediate operand and therefore the AGU1 slot cannot have an instruction. Referring now to the pipeline diagram of the FEP of
The FEP supports zero-overhead hardware loops. If the loop count (LC) is specified using the immediate value in the instruction, the maximum value allowed is 32. If the loop count is specified using the LC register, the maximum value allowed is 2048. An 8 entry loop counter stack is provided in the hardware to support up to 8 nested loops. The loop counter stack is pushed (popped) when the LC register is written (read). This allows the software to extend the stack by moving it to memory.
The DP0 and DP1 slots support ALU instructions and engine-specific instructions. The ALU instructions are executed in the FEP. The ALU instructions provide simple operations on the data registers (DR). The general format is DRk=DRi op DRj. The DP0 slot and DP1 slot instruction table has a list of instructions supported by the FEP ALU. The AGU instructions include load from memory, store to memory, and data movement between all kinds of registers (address registers, data registers, special registers, and engine-specific registers), compare data registers, branch instruction, and return instruction.
As mentioned earlier, the FEP has 8 address registers and 4 increment registers (also known as offset registers). The different processing units use a 24 bit address bus to address the different memories. Of these 24 bits, the top 8 bits coming from the bottom 8 bits of the Address Prefix register identify the memory that is to be addressed and the remaining 16-bits coming from the Address Register address the specific memory. Even though the data word size is 16-bits inside the FEP, the addresses it generates are byte-addresses. This may be useful for some media processing function engines that need to know where the data is coming from at a pixel (byte) level. The FEP also supports an indexed addressing mode. In this mode, the top 8 bits of the address come from the top 8 bits of the Address Prefix register. The next 10 bits come from the top 10 bits of the Array Pointer register. The next 5 bits come from the instructions. The last bit is always 0. In this mode, the data type is 16-bits or more. Load Byte, and Store Byte instructions are not supported. The FEP also supports another address increment scheme specially suited for the scaling function in the video post-processor. In this scheme, the address update is done according to the following equation: {An, ASn[7:0]}={An, ASn[7:0]}+In, where { } is the concatenation operation, An refers to the address register, ASn refers to the address suffix register, and In refers to the increment register.
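The three address-formation schemes described above may be sketched as bit manipulations, as shown below for illustration only. The sketch assumes 16-bit Address Prefix and Array Pointer registers, treats the increment In as an unsigned integer, and wraps the concatenated {An, ASn[7:0]} value at 24 bits. None of these details is stated in the text, and all function names are hypothetical.

```python
def direct_address(address_prefix: int, address_reg: int) -> int:
    """24-bit byte address: low 8 bits of the prefix, then the 16-bit register."""
    return ((address_prefix & 0xFF) << 16) | (address_reg & 0xFFFF)


def indexed_address(address_prefix: int, array_pointer: int, index: int) -> int:
    """24-bit address for the indexed mode (8 + 10 + 5 bits, last bit 0)."""
    top8 = (address_prefix >> 8) & 0xFF     # top 8 bits of the assumed 16-bit prefix
    mid10 = (array_pointer >> 6) & 0x3FF    # top 10 bits of the assumed 16-bit pointer
    return (top8 << 16) | (mid10 << 6) | ((index & 0x1F) << 1)


def scaled_increment(a_n: int, as_n: int, i_n: int):
    """Apply {An, ASn[7:0]} = {An, ASn[7:0]} + In; return the new (An, ASn)."""
    combined = ((a_n & 0xFFFF) << 8) | (as_n & 0xFF)
    combined = (combined + i_n) & 0xFFFFFF  # 24-bit wrap-around is an assumption
    return combined >> 8, combined & 0xFF
```

In the scaled scheme, the address suffix behaves as an 8-bit fractional part: an increment smaller than 256 advances An only when the suffix overflows.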
Two data registers (DRi, DRj) can be compared using the Compare instructions. CMP_S assumes that the two data registers hold signed numbers and CMP_U assumes that the two data registers hold unsigned numbers. The FLAG register contains the output of a comparison operation. For example, if DRi was less than DRj, the LT bit will be set. For further information on the FLAG register please refer to the Register Definition section.
Conditional branch instructions allow two types of conditions. The conditional branch can check any bit in the FLAG register for a ‘1’ or a ‘0’. The second type of condition allows the programmer to check any bit in any Data Register for a ‘1’ or a ‘0’. Bit 7 and bit 6 of the FLAG register are read only and are set to 0 and 1 respectively. This can be used to implement unconditional branches.
The Branch instruction also has an option (‘U’ bit is set to ‘1’) to save the PC of the instruction following the delay slot (PC+2) into the SPC (saved PC) stack. This helps support subroutines along with a return instruction which uses SPC as the target address. The SPC stack is 16-deep and it is also used to implement DSL-DEL loops. The SPC stack is pushed (popped) whenever the SPC register is written (read), either implicitly or explicitly. This allows software to extend the stack by moving it to memory.
The Branch instruction has an always executed delay slot. There are “kill” options which may help the programmer to fill the delay slot flexibly. There is an option to kill the delay slot when the branch is taken (KT bit) and another option to kill when the branch is not taken (KF bit). The following table illustrates how these two bits can be used:
The flag register is updated whenever the FEP executes either an ALU or a compare instruction. Bits [13:8] are updated by ALU instructions and bits [5:0] are updated by compare instructions. Bits 15 and 7 have a fixed value of 0 and bits 14 and 6 are fixed to a value of 1. Those fixed bits can be used to simulate unconditional branches.
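The use of the fixed bits to obtain an unconditional branch can be illustrated with the following sketch. It models only the fixed bit positions given above and the bit test performed by a conditional branch. The function names are hypothetical and the remaining flag bits are left uninterpreted.

```python
FIXED_ZERO_BITS = (15, 7)
FIXED_ONE_BITS = (14, 6)


def normalize_flag(raw: int) -> int:
    """Force the fixed bits of a 16-bit FLAG value to their hard-wired levels."""
    value = raw & 0xFFFF
    for bit in FIXED_ZERO_BITS:
        value &= ~(1 << bit)
    for bit in FIXED_ONE_BITS:
        value |= 1 << bit
    return value


def branch_taken(flag: int, bit: int, expected: int) -> bool:
    """A conditional branch tests one FLAG bit for '1' or '0'."""
    return ((flag >> bit) & 1) == expected


# Testing bit 6 for '1' (or bit 7 for '0') succeeds for any flag contents,
# so such a branch behaves as an unconditional branch.
always_taken = all(branch_taken(normalize_flag(raw), 6, 1)
                   for raw in (0x0000, 0x1234, 0xFFFF))
```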
Bit 0 is the master interrupt enable. At reset, it is set to ‘1’ which is enabled. When the FEP takes an interrupt it clears this bit and then goes into the Interrupt Service Routine. In the ISR, the programmer can decide whether the code can take further interrupts and set this bit again. The RTI instruction (return from ISR) will also set this bit.
Bit 1 is the master debug enable. At reset, it will be set to ‘1’ which is enabled. The programmer can shield some portion of the firmware from debug mode. In some media processing function engines, some of the optimized sections of code may not be stalled and debug mode is implemented using stalls.
Bit 2 is the cycle count enable. At reset, it will be cleared to ‘0’ which disables the cycle counters. The programmer can write ‘0’ to CCL and CCH and then set this bit to ‘1’. This will enable the cycle counter. CCL is the least significant 16-bits of the counter and CCH is the most significant 16-bits of the counter.
Bit 3 is the software interrupt enable. At reset, it will be set to ‘0’ which means disabled, ‘1’ means enabled. If this bit is ‘0’, SWI instruction will be ignored and if this bit is ‘1’, SWI instruction will make the FEP take an interrupt and go to the vector address 0x2.
The deblocking filter utilizes the Front-End Processor (FEP), which is a 5-slot VLIW controller. The format of the FEP instructions is as follows:
The Loop Slot is used to specify LOOP, DLOOP (Delayed LOOP) and NOOP instructions. Any instruction in the DP slots is passed on to the DBF data path for execution. These slots can be used to specify two 18-bit data path instructions or a single 36-bit instruction. AGU slots are used to load data from internal memories to the DBF using the two Internal Memory Interfaces (IMIF0, IMIF1). The LOAD instruction in AGU Slot 0 or AGU Slot 1 can be used for such loads. There are 89 DBF internal registers, D32:D120.
Static hazards are hazards that occur between instructions in different execution slots but within the same instruction packet. The rules below are designed to prevent such hazards from occurring.
The FEP handles all the pipeline hazards that are due to data dependencies. All the explicit dependencies are handled automatically by the FEP. In most cases, the data is forwarded (bypassed) to the execution unit that needs the data to increase performance. In some cases this forwarding is not possible and the FEP stalls the pipeline. A good understanding of these cases could help the programmer to minimize stall cycles. The following are the cases for which the FEP stalls automatically:
The FEP does not handle the implicit dependencies. Implicit dependencies are the cases in which the dependency is due to an implicit operand in the instruction (that is, the operand is not explicitly spelled out in the instruction). The following are the cases for which the FEP does not stall and so these implicit dependencies have to be handled in firmware:
In addition to the above cases, stall cycles may be introduced when memory is accessed; their number depends on the external implementation.
The FEP supports one interrupt input, INT_REQ. There is an interrupt controller outside the FEP which supports 16 different interrupts. A single-packet repeat instruction that uses the immediate value as the Loop Count is not interrupted. Similarly, a branch delay slot is not interrupted. The FEP checks for these two conditions and, if neither is present, it takes the interrupt and branches to the interrupt vector (INT_VECTOR). The return address is saved in the SPC stack. This is the only state information that is saved by hardware. The software is responsible for saving anything that is modified by the Interrupt Service Routine (ISR). The RTI instruction (Return from ISR) returns the code to the interrupted program address.
Bit 0 of the FEP control register (part of the special register set) is a master interrupt enable bit. At reset, this bit is set to ‘1’ which means interrupts are enabled. When an interrupt is taken, the FEP clears the interrupt enable bit. The RTI instruction sets the master interrupt enable bit. In the Interrupt Service Routine, the programmer can decide whether the code can take further interrupts and set this bit again if necessary. Before setting this bit, the programmer must clear the interrupt using the Interrupt Clear register inside the interrupt controller.
The interrupt controller has the following registers that are accessible to the FEP through special registers. The special register ICS corresponds to interrupt control register when writing and interrupt status register when reading. The special register IMR corresponds to the interrupt mask register.
These 16 interrupts have interrupt vector address 0x4. The interrupt service routine can read the Interrupt Status Register to identify the specific interrupt source. In addition to these hardware interrupt bits, the SWI instruction can be used to interrupt the FEP. If SWI_EN bit in the FEP Control register is ‘1’, this instruction makes the FEP take an interrupt and branch to the interrupt vector address which is fixed at 0x2. This also clears the master interrupt enable bit in the FEP Control register. The RTI instruction can be used to return from the ISR. A 4-cycle gap is needed between the instruction clearing the interrupt (the write to ICS register) and the RTI instruction.
The debug interface is designed to provide the following features:
1. Read and write the program memory
2. Stop the program based on the program address that FEP is executing
3. Stop the program based on any other event
4. Step through the program one instruction packet at a time
5. Read and write the FEP registers.
6. Read and write the memories that are accessible to the FEP.
The FEP supports these features with the help of a debug controller.
The FEP has the following ports:
It should be appreciated that the present invention has been described with respect to specific embodiments, but is not limited thereto. In particular, the present invention is directed toward integrated chip architecture for a motion estimation engine, capable of processing multiple standard coded video, audio, and graphics data, and devices that use such architectures.
Although described above in connection with particular embodiments of the present invention, it should be understood the descriptions of the embodiments are illustrative of the invention and are not intended to be limiting. Various modifications and applications may occur to those skilled in the art without departing from the true spirit and scope of the invention as defined in the appended claims.
The present invention relies on the following provisional applications for priority: U.S. Provisional Application Nos. 61/151,540, filed on Feb. 11, 2009, 61/151,542, filed on Feb. 11, 2009, 61/151,546, filed on Feb. 11, 2009, and 61/151,547, filed on Feb. 11, 2009. The present application is also related to the following U.S. patent application Ser. Nos. 11/813,519, filed on Nov. 14, 2007, 11/971,871, filed on Jan. 9, 2008, 11/971,868, filed on Jan. 9, 2008, 12/101,851, filed on Apr. 11, 2008, 12/114,746, filed on May 3, 2008, 12/114,747, filed on May 3, 2008, 12/134,283, filed on Jun. 6, 2008, 11/875,592, filed on Oct. 19, 2007, and 12/263,129, filed on Oct. 31, 2008. The specifications of all of the aforementioned applications are herein incorporated by reference in their entirety.
Number | Date | Country
---|---|---
61151540 | Feb 2009 | US
61151542 | Feb 2009 | US
61151546 | Feb 2009 | US
61151547 | Feb 2009 | US