1. Field of the Invention
The present invention relates to video encoders and, more particularly, to a method and apparatus for a real-time video encoder.
2. Description of the Background Art
The International Telecommunication Union (ITU) H.264 video coding standard is able to compress video much more efficiently than earlier video coding standards, such as ITU H.263, MPEG-2 (Moving Picture Experts Group), and MPEG-4. H.264 is also known as MPEG-4 Part 10 and Advanced Video Coding (AVC). H.264 exhibits a combination of new techniques and increased degrees of freedom in using existing techniques. Among the new techniques defined in H.264 are the 4×4 discrete cosine transform (DCT), multi-frame prediction, context adaptive variable length coding (CAVLC), SI/SP frames, and context-adaptive binary arithmetic coding (CABAC). The increased degrees of freedom come about by allowing multiple reference frames for prediction and many more tessellations of a 16×16 pixel macroblock. These new tools and methods add to the coding efficiency at the cost of increased encoding and decoding complexity in terms of logic, memory, and number of operations. This complexity far surpasses that of H.263 and MPEG-4 and underscores the need for efficient implementations.
The H.264 standard belongs to the hybrid motion-compensated DCT (MC-DCT) family of codecs. H.264 is able to generate an efficient representation of the source video by reducing temporal and spatial redundancies and allowing distortions. Temporal redundancies are removed by a combination of motion estimation (ME) and motion compensation (MC). ME is the process of estimating the motion of a current frame in the source video from previously coded frame(s). This motion information is used to motion compensate the previously coded frame(s) to form a prediction. The prediction is then subtracted from the original frame to form a displaced frame difference (DFD). The motion information is present for each block of pixel data. In H.264, there are seven possible block sizes within a macroblock—16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4 (also referred to as tessellations or partitions). Thus, a 16×16 pixel macroblock (MB) can be tessellated into: (A) one 16×16 macroblock region; (B) two 16×8 tessellations; (C) two 8×16 tessellations; and (D) four 8×8 tessellations. Furthermore, each of the 8×8 tessellations can be decomposed into: (a) one 8×8 region; (b) two 8×4 regions; (c) two 4×8 regions; and (d) four 4×4 regions.
Thus, there are 41 possible tessellations of a single macroblock. Further, the motion vector for each block is unique and can point to different reference frames. The job of the encoder is to find the optimal way of breaking down a 16×16 macroblock into smaller blocks (along with the corresponding motion vectors) in order to maximize compression efficiency. This breaking down of the macroblock into a specific pattern is commonly referred to as “mode selection” or “mode decision.”
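For purposes of illustration only, the count of 41 follows directly from the enumeration above; a minimal sketch in C:

```c
#include <stdio.h>

/* Verifies the count of 41 block regions per 16x16 macroblock from the
 * enumeration above: one 16x16, two 16x8, two 8x16, four 8x8, and, for
 * each of the four 8x8 regions, two 8x4, two 4x8, and four 4x4. */
int main(void) {
    int top_level = 1 + 2 + 2 + 4;      /* 16x16, 16x8, 8x16, 8x8 regions */
    int per_8x8   = 2 + 2 + 4;          /* 8x4, 4x8, 4x4 within one 8x8   */
    int total     = top_level + 4 * per_8x8;
    printf("block regions per macroblock: %d\n", total);   /* prints 41 */
    return 0;
}
```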
However, current mode selection and mode decision processes demand a significant amount of encoder resources, thereby hindering performance and increasing processing times. The result is an overwhelming increase in complexity, rendering the encoder practically non-realizable in some applications, such as real-time applications. Accordingly, there exists a need in the art for a real-time encoder capable of generating video streams in a more efficient manner.
In one embodiment, the present invention discloses a real-time encoder, e.g., a real-time H.264 compliant encoder or a real-time AVC compliant encoder. For example, the encoder comprises a first digital signal processor (DSP) for processing a first panel of an input image and a second digital signal processor (DSP) for processing a second panel of the input image. The encoder further comprises a field programmable gate array (FPGA) for supporting both the first DSP and the second DSP.
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
A method and apparatus for implementing a video encoder are described. More specifically, the present invention discloses an implementation of a real-time H.264 encoder. As discussed above, as encoding methods incorporate ever more complex algorithms, there is a need to provide a hardware implementation where the complex encoding algorithms can be implemented in real-time applications.
Before describing the present hardware architecture, the various encoding functions performed by an H.264 encoder or an H.264-like encoder are first briefly described. One or more of these encoding functions are then described in the context of the present hardware architecture, thereby illustrating the real-time processing capability of the present hardware architecture.
Embodiments of the invention use the following definitions:
The DCT module 104 transforms the difference signal from the pixel domain to the frequency domain using a DCT algorithm to produce a set of coefficients. The quantizer 106 quantizes the DCT coefficients. The entropy coder 108 codes the quantized DCT coefficients to produce a coded frame.

The inverse quantizer 110 performs the inverse operation of the quantizer 106 to recover the DCT coefficients. The inverse DCT module 112 performs the inverse operation of the DCT module 104 to produce an estimated difference signal. The estimated difference signal is added to the predicted frame by the summer 114 to produce an estimated frame, which is coupled to the deblocking filter 116. The deblocking filter 116 deblocks the estimated frame and stores the resulting reference frame in the frame memory 118. The motion compensated predictor 120 and the motion estimator 124 are coupled to the frame memory 118 and are configured to obtain one or more previously estimated frames (previously coded frames).
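The data flow just described can be pictured as the following per-block loop; this is a minimal C sketch in which the transform, quantizer, and entropy-coder functions are trivial stand-ins (not the normative H.264 algorithms) and all function names are illustrative:

```c
#include <string.h>

/* Trivial stand-ins for modules 104-112; a real encoder would use the
 * H.264 4x4 integer transform and the normative quantizer. */
static void dct4x4(const short *in, short *out)  { memcpy(out, in, 16 * sizeof *in); }
static void idct4x4(const short *in, short *out) { memcpy(out, in, 16 * sizeof *in); }
static void quantize(const short *in, int qp, short *out) {
    for (int i = 0; i < 16; i++) out[i] = (short)(in[i] >> (qp / 6));
}
static void inverse_quantize(const short *in, int qp, short *out) {
    for (int i = 0; i < 16; i++) out[i] = (short)(in[i] << (qp / 6));
}
static void entropy_code(const short *q) { (void)q; /* bits emitted here */ }

/* Forward path (DCT 104, quantizer 106, entropy coder 108) followed by
 * the reconstruction path (inverse quantizer 110, inverse DCT 112,
 * summer 114) for a single 4x4 block. */
void encode_block(const unsigned char src[16], const unsigned char pred[16],
                  int qp, unsigned char recon[16]) {
    short diff[16], coeffs[16], qcoeffs[16], rcoeffs[16], rdiff[16];

    for (int i = 0; i < 16; i++)                 /* difference signal */
        diff[i] = (short)(src[i] - pred[i]);

    dct4x4(diff, coeffs);
    quantize(coeffs, qp, qcoeffs);
    entropy_code(qcoeffs);

    inverse_quantize(qcoeffs, qp, rcoeffs);
    idct4x4(rcoeffs, rdiff);

    for (int i = 0; i < 16; i++) {               /* summer 114, with clipping */
        int v = pred[i] + rdiff[i];
        recon[i] = (unsigned char)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
    /* the reconstructed block then passes through the deblocking filter
     * 116 and is stored in the frame memory 118 */
}
```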
The motion estimator 124 also receives the source frame. The motion estimator 124 performs a motion estimation algorithm using the source frame and a previous estimated frame (i.e., reference frame) to produce motion estimation data. For example, the motion estimation data includes motion vectors and minimum sums of absolute differences (SADs) for the macroblocks of the source frame. The motion estimation data is provided to the entropy coder 108 and the motion compensated predictor 120. The entropy coder 108 codes the motion estimation data to produce coded motion data. The motion compensated predictor 120 performs a motion compensation algorithm using a previous estimated frame and the motion estimation data to produce the predicted frame, which is coupled to the intra/inter switch 122. Motion estimation and motion compensation algorithms are well known in the art.
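For example, a minimum SAD might be found by a full-pel full search; the following is a minimal sketch assuming 16×16 blocks, a common frame stride, and no border handling (sub-pel refinement is omitted):

```c
#include <limits.h>
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 source macroblock and a
 * candidate reference block; 'stride' is the width of both frames. */
static int sad16x16(const unsigned char *src, const unsigned char *ref, int stride) {
    int sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += abs(src[y * stride + x] - ref[y * stride + x]);
    return sad;
}

/* Full-pel full search over a +/-range window around (0,0); returns the
 * minimum SAD and writes the winning motion vector to (*mvx, *mvy). */
static int full_search(const unsigned char *src, const unsigned char *ref,
                       int stride, int range, int *mvx, int *mvy) {
    int best = INT_MAX;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int sad = sad16x16(src, ref + dy * stride + dx, stride);
            if (sad < best) { best = sad; *mvx = dx; *mvy = dy; }
        }
    }
    return best;
}
```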
To illustrate, the motion estimator 124 may include mode decision logic 126. The mode decision logic 126 can be configured to select a mode for each macroblock in a predictive (INTER) frame. The “mode” of a macroblock is its partitioning scheme. That is, the mode decision logic 126 selects a MODE for each macroblock in a predictive frame, which is defined by values for MB_TYPE and SUB_MB_TYPE. For example, a rate-distortion (R-D) optimization method may attempt to minimize a Lagrangian cost function. In one embodiment, the mode decision logic 126 may optimize an estimated cost function, defined as:
Ĵ = D̂ + λR̂    (Eq. 2)
The estimate D̂ is mainly a function of QP and represents distortion. QP is a fixed value for a macroblock. In one embodiment, D̂(QP, SAD) is assumed to be constant for a given QP. Assuming that D̂ is constant for a macroblock, R-D optimization reduces to a minimization of R̂, which is the estimate of the number of bits needed to encode a macroblock. R̂ can be broken down into the components R̂_DCT, R̂_MV, R̂_MODE, and R̂_MISC.
The quantity R̂_MISC is the component that is independent of the mode decision (e.g., bits for the quantization parameter, QP). The quantity R̂_MV is the bit cost for transmitting the motion vector; this value can be computed exactly for a given motion vector without actually encoding. The quantity R̂_MODE is the bit cost associated with encoding the mode; it, too, can be determined exactly without encoding the data. The quantity R̂_DCT comprises all the bits associated with encoding the residual block data, including the bits to encode “coded_block_pattern,” the zeros, and the runs and levels of the DCT coefficients. It is not feasible to compute R̂_DCT exactly without actually going through the encoding process. Hence, in one embodiment, the function R̂_DCT(QP) is calibrated statistically through simulations.
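Under these assumptions the estimated cost can be assembled as below; this is a minimal sketch in which the R̂_DCT table and the bit-cost helpers are hypothetical toy placeholders for the exact and statistically calibrated quantities described above:

```c
#include <stdlib.h>

/* Hypothetical placeholders: a real encoder would fill the table from
 * offline simulations and compute the exact bit costs from the syntax. */
static const int rhat_dct_table[52] = { 0 };   /* R_DCT(QP), calibrated offline */
static int mv_bits(int mvx, int mvy) { return abs(mvx) + abs(mvy) + 2; } /* toy */
static int mode_bits(int mb_type, int sub_mb_type) {                    /* toy */
    return 1 + mb_type + sub_mb_type;
}

/* Estimated cost J = D + lambda * R for one candidate mode. Because D
 * is assumed constant for a given QP, minimizing J over modes reduces
 * to minimizing R = R_DCT + R_MV + R_MODE + R_MISC, per Eq. 2 above. */
int estimated_cost(int qp, int lambda, int mvx, int mvy,
                   int mb_type, int sub_mb_type, int r_misc) {
    int r_hat = rhat_dct_table[qp]                 /* residual bits (estimate) */
              + mv_bits(mvx, mvy)                  /* motion vector bits       */
              + mode_bits(mb_type, sub_mb_type)    /* mode syntax bits         */
              + r_misc;                            /* mode-independent bits    */
    return lambda * r_hat;                         /* constant D omitted       */
}
```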
The above description provides only a brief view of the various complex algorithms that must be executed to produce the encoded bitstreams generated by an H.264 encoder. The increase in complexity is often a result of a desire to provide better encoding characteristics, e.g., less distortion in the encoded images while using fewer bits to transmit the encoded images. In order to achieve these improved encoding characteristics, it is often necessary to increase the overall computational overhead of an encoder. Unfortunately, the increase in computational overhead also increases the difficulty in implementing a real-time H.264 encoder.
One novel aspect of the present invention is the unique interaction of the FPGA 206 and the two DSPs 203-204 in each PPE pair unit 201. More specifically, one unique aspect is the ability of each PPE pair unit 201 to perform load balancing between the two DSPs 203-204 and the FPGA 206. For example, in one embodiment, the FPGA performs quarter-pel motion estimation (among other functions) in support of both DSPs: when the FPGA finishes the quarter-pel computation for one DSP, it performs the quarter-pel computation for the other DSP, then returns to the first DSP, and so on. This ability to distribute complex encoding algorithms among the two DSPs and the FPGA allows the present real-time H.264 encoder to be realized. Furthermore, the use of a plurality of PPE pair units 201 further increases the capability of the present hardware architecture in that it can easily be scaled to handle images of different resolutions.
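The alternating service pattern can be pictured in software as a round-robin loop; the following is a minimal C sketch with hypothetical queue and work-item types (the actual FPGA logic is, of course, hardware rather than software):

```c
#include <stdbool.h>

/* Hypothetical per-DSP request queue; in the FPGA this would be a FIFO. */
typedef struct { int pending; } work_queue_t;

static bool pop_request(work_queue_t *q) {
    if (q->pending == 0) return false;
    q->pending--;
    return true;
}
static void do_quarter_pel(void) { /* interpolate quarter-pel samples here */ }

/* Finish the quarter-pel computation for one DSP, then serve the other
 * DSP, then return to the first, and so on, as described above. */
void fpga_service_loop(work_queue_t *dsp1_q, work_queue_t *dsp2_q) {
    int turn = 0;
    for (;;) {
        work_queue_t *q = (turn == 0) ? dsp1_q : dsp2_q;
        if (pop_request(q))
            do_quarter_pel();
        turn ^= 1;                    /* alternate between the two DSPs */
    }
}
```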
In one embodiment of the present invention, each PPE pair unit 201 is tasked with processing two successive panels of an input image. A panel is broadly defined as comprising “x” number of rows of macroblocks of the input image, where x is an even number. Thus, an input image can be divided, at minimum, into two panels, or it can be divided, at maximum, into “y” number of panels, where y represents the number of rows of macroblocks of the input image divided by two. As such, in one embodiment, if there are only two panels for each input image, then a single PPE pair unit 201 can be used to process the input image. However, if there are four panels, then two PPE pair units 201 are used to process the input image, and so on.
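As a concrete illustration, assuming a 1080-line image is coded as 68 rows of 16-pixel macroblocks (1088 luma lines, a common padding assumption not stated in the text), the panel arithmetic works out as in this minimal sketch:

```c
#include <stdio.h>

/* Panel bookkeeping per the definition above: a panel is an even number
 * of macroblock rows, the maximum panel count is mb_rows / 2 (panels of
 * two rows each), and each PPE pair unit processes two successive
 * panels. The 1088-line coded height is an assumption for 1080-line video. */
int main(void) {
    int coded_height = 1088;              /* 1080 lines padded to a MB multiple */
    int mb_rows      = coded_height / 16; /* 68 rows of macroblocks             */
    int max_panels   = mb_rows / 2;       /* 34 two-row panels                  */
    int ppe_pairs    = max_panels / 2;    /* 17 PPE pair units at maximum split */
    printf("MB rows=%d, panels=%d, PPE pairs=%d\n", mb_rows, max_panels, ppe_pairs);
    return 0;
}
```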
Thus, the FPGA 206 may be connected to other FPGAs 206 that exist in the overall encoding system via a plurality of connections, such as a neighborhood and deblock interface (NDI) ring or chain 209, a RECON ring or chain 210, a full pel motion vector (FPMV) ring or chain 211, a luma ring or chain 212, a chroma ring or chain 213, and the like. Each ring or chain provides a separate type of information among the various FPGAs.
In operation, the SDRAM S1 units contain luma and chroma pixels from the current panel macroblocks (MBs), Adaptive Quantization Level (AQL) information, collocated luma motion vectors and Refids (Reference indices) for all partitions, and reconstructed chroma reference pixels for their respective DSP. The DDR2 S2 unit 205, which is attached to the FPGA 206, contains reconstructed luma reference pixels that correspond to the DSP pair 203-204.
In one embodiment, the PPE pair unit 201 obtains various forms of original data from the plurality of rings or chains. Specifically, the DSP may receive original input luma pixel data, original input chroma pixel data, neighborhood and deblock data, and full pel motion vector data from the luma chain 212, the chroma chain 213, the NDI ring 209, and the FPMV chain 211, respectively. The use of the ring communication channel allows the present hardware architecture to provide the real-time processing capability of the present real-time H.264 encoder. Namely, various encoding processes are distributed within the encoding system. For example, full pel motion estimation is performed by a separate motion estimation module (not shown) that is coupled to the ring communication channel. More specifically, the full pel motion vectors are received on the FPMV chain 211.
This distributed processing approach is also implemented within each of the PPE pair units 201. For example, spatial and temporal encoding often require information from one or more neighboring macroblocks or one or more neighboring frames. As such, it is often necessary for a processing unit to obtain information from one or more neighboring macroblocks (or previous macroblocks in terms of time) or one or more neighboring frames in order to process a current macroblock. Proper management of how a DSP and an FPGA are used in processing previous macroblocks and a current macroblock greatly enhances the real-time processing capability of an encoding system.
To illustrate, in general, the PPE DSP pair 203-204 processes the received original data by using the quarter pixel motion estimation information that is generated by the FPGA. More specifically, while a DSP is in the process of receiving data for a current macroblock, the FPGA is generating quarter pel motion estimation data for a previous macroblock, which is then provided to the DSP. In turn, the DSP uses the quarter pel motion estimation data to perform a mode decision operation for the previous macroblock. Furthermore, the DSP then builds neighborhood information and generates motion compensation data for the current macroblock and forwards both sets of data to the FPGA for processing. The FPGA uses the received data to perform quarter pel processing on the current macroblock.
Having provided the necessary information for the FPGA to work on the current macroblock, the DSP then turns its attention back to the previous macroblock to complete its processing. Namely, the DSP performs chroma processing, deblocking, and reconstruction on the previous macroblock. The DSP also then encodes the previous macroblock, e.g., using a Context Adaptive Binary Arithmetic Coding (CABAC) video encoding algorithm. The resultant processed data is then sent out, via a PCI connection 218, as a CABAC stream to the central DSP 202, which is the main processing unit that controls the encoding system or encoder 200.
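The interleaving described in the preceding paragraphs can be summarized as one per-macroblock pipeline iteration; in this minimal sketch every function name is an illustrative placeholder for an operation described above:

```c
/* Illustrative stubs; each stands for an operation described above. */
static void collect_luma_chroma_colocation(int i) { (void)i; }
static void receive_quarter_pel_from_fpga(int i)  { (void)i; }
static void mode_decision(int i)                  { (void)i; }
static void build_neighborhood(int i)             { (void)i; }
static void generate_motion_compensation(int i)   { (void)i; }
static void send_to_fpga(int i)                   { (void)i; }
static void chroma_deblock_reconstruct(int i)     { (void)i; }
static void encode_cabac(int i)                   { (void)i; }

/* One iteration of the DSP/FPGA pipeline: while the DSP collects data
 * for MB(i), the FPGA finishes quarter-pel work for MB(i-1); the DSP
 * then completes MB(i-1) while the FPGA starts on MB(i). */
void process_macroblock(int i) {
    collect_luma_chroma_colocation(i);     /* current MB(i), from the rings    */

    receive_quarter_pel_from_fpga(i - 1);  /* FPGA results for MB(i-1)         */
    mode_decision(i - 1);                  /* select MODE for MB(i-1)          */

    build_neighborhood(i);                 /* MBAFF neighborhood for MB(i)     */
    generate_motion_compensation(i);       /* MC data usable by the FPGA       */
    send_to_fpga(i);                       /* FPGA begins quarter-pel on MB(i) */

    chroma_deblock_reconstruct(i - 1);     /* finish the previous macroblock   */
    encode_cabac(i - 1);                   /* CABAC stream to central DSP 202  */
}
```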
In one embodiment, the present invention is configured to process macroblocks (MBs) in various ways. For example, the order in which MBs are processed may depend on the frame format being used to display the image (e.g., whether the image utilizes interlaced or progressive frames and how many panels or lines are to be utilized).
Panel 304 (e.g., 2 rows of macroblocks) also demonstrates this processing aspect. Specifically, panel 304 illustrates a panel of only two rows of macroblocks for an input image having a resolution of 1920×1080 in an interlaced format.
Panel 404 (e.g., 2 rows of macroblocks) also demonstrates this progressive processing aspect. Specifically, panel 404 illustrates a panel of only two rows of macroblocks for an input image having a resolution of 1920×1080 in a progressive format.
In one embodiment, the neighborhood 500 comprises a plurality of 8×8 block data structures (e.g., 16 subblocks) that are used to store and compress data in a more efficient manner. Since internal memory is valuable, a macroblock adaptive frame field (MBAFF) neighborhood enables the encoder 100 to store relevant neighboring macroblock data, such as motion vectors (MVs) and Refids, in less space.
The present invention is designed to encode a plurality of macroblocks. Although the MBs are initially received and ultimately encoded in a sequential order (e.g., MB(0), MB(1), MB(2), etc.), the MBs are processed in a unique, non-sequential manner by the present invention. For example, suppose the encoder has previously received one or more prior MBs (e.g., MB(0) and MB(1), which will be explained below) and the encoder initiates the collection of data from a new macroblock (e.g., MB(2)). In one embodiment, the collected data may comprise luma data, chroma data, and co-location data of the new MB. For example, this data can be collected by DSP1 203. After this data is collected, the DSP1 203 begins performing two parallel functions, e.g., processing a current macroblock and processing a previous macroblock. First, the DSP1 203 performs a mode decision operation on a previously processed macroblock (e.g., MB(1)). In one embodiment, the DSP1 203 may ascertain the best three modes from a plurality of different configurations. For example, the encoding system 200 may consider various INTRA modes (e.g., 16×16, 8×8, and 4×4), a plurality of predicted modes (e.g., 16×16, 16×8, and 8×8), a direct mode, and a skipped mode.
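Retaining the best three modes amounts to keeping a small sorted list of candidate costs; the following is a minimal sketch with a toy cost function standing in for the estimated cost Ĵ described earlier:

```c
#include <limits.h>

/* Candidate modes considered above: INTRA 16x16/8x8/4x4, predicted
 * (INTER) 16x16/16x8/8x8, direct, and skipped. */
enum mode { INTRA_16x16, INTRA_8x8, INTRA_4x4,
            INTER_16x16, INTER_16x8, INTER_8x8, DIRECT, SKIP, NUM_MODES };

/* Toy stand-in for the estimated cost J of encoding the macroblock in
 * mode m; a real encoder would evaluate the cost function above. */
static int mode_cost(int m) { return (m * 37) % 11; }

/* Scan every candidate mode and retain the three with the lowest
 * estimated cost, in ascending order of cost. */
void best_three_modes(enum mode best[3]) {
    int best_cost[3] = { INT_MAX, INT_MAX, INT_MAX };
    best[0] = best[1] = best[2] = SKIP;
    for (int m = 0; m < NUM_MODES; m++) {
        int c = mode_cost(m);
        for (int k = 0; k < 3; k++) {            /* insert into top-3 list */
            if (c < best_cost[k]) {
                for (int j = 2; j > k; j--) {    /* shift worse entries down */
                    best_cost[j] = best_cost[j - 1];
                    best[j]      = best[j - 1];
                }
                best_cost[k] = c;
                best[k]      = (enum mode)m;
                break;
            }
        }
    }
}
```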
Once the mode decision processing on a previous macroblock is completed, the second operation is performed, i.e., building a neighborhood data structure (e.g., the neighborhood 500 described above).
The method 600 begins at step 602 and proceeds to step 604, where the luma data, chroma data, and MB co-location data for a current MB(i) are collected. In one embodiment, this data is typically provided over the NDI Ring 209. In the event that the current macroblock is not in the first panel to be processed, the DSP may also obtain neighborhood data over the NDI Ring 209. It should be noted that while the DSP is collecting the data in step 604, the FPGA is concurrently generating quarter pel motion estimation data for a previous macroblock.
At step 605, data is received from the FPGA for a previous MB(i-1). For example, quarter pel results may be received from the FPGA.
At step 608, a mode decision operation is performed. More specifically, data processed by the FPGA 206 for a previous MB is utilized in this step. In one embodiment, the DSP performs a mode decision operation on a previous macroblock MB(i-1). The mode decision operation may entail the determination of what motion vectors are associated with the macroblock as well as the partition type of the macroblock (e.g., 16×16, 8×4, 4×4, etc.). In one embodiment, this step is initially skipped if there is not a “previous” MB.
The method 600 continues to step 610, where a neighborhood data structure 500 is built for the current macroblock MB(i). In one embodiment, the neighborhood data structure is a 4×4 MBAFF neighborhood structure as described above.
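One way to picture such a structure is as a compact record per 4×4 subblock; this is a minimal sketch in which the field widths are illustrative assumptions, not taken from the text:

```c
#include <stdint.h>

/* Compact MBAFF neighborhood record: one entry per 4x4 subblock, holding
 * the motion vector and reference index (Refid) needed by a neighboring
 * macroblock. Field widths are illustrative only. */
typedef struct {
    int16_t mvx, mvy;     /* luma motion vector, quarter-pel units      */
    int8_t  refid;        /* reference index; -1 if intra/unavailable   */
    uint8_t flags;        /* e.g., field/frame coding of the neighbor   */
} nbr4x4_t;

/* A 4x4 grid of subblock records covering the relevant neighbors of the
 * current macroblock, as in the neighborhood 500 described above. */
typedef struct {
    nbr4x4_t sub[4][4];   /* 16 subblocks */
} mbaff_neighborhood_t;
```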
At step 612, data is generated, e.g., motion compensation data for the current macroblock MB(i). In one embodiment, the DSP utilizes the collected chroma and luma data to generate motion compensation data that is usable by the FPGA 206.
At step 614, the generated data is sent to the FPGA 206.
At step 618, chroma processing, deblocking, and reconstruction processes are performed. It should be noted that these processes are performed on the previous MB(i-1).
The method 600 continues to step 620 where the previous MB(i-1) is encoded and method 600 then returns to step 604 to repeat the process with a new macroblock.
Again, it should be noted that while the DSP is performing the chroma processing, deblocking, and reconstruction processes in step 618 and the encoding process in step 620, the FPGA is concurrently performing quarter pel processing on the current MB(i).
In one embodiment, the memory 703 stores processor-executable instructions and/or data that may be executed by and/or used by the processor 701 as described further below. These processor-executable instructions may comprise hardware, firmware, software, and the like, or some combination thereof. Modules having processor-executable instructions that are stored in the memory 703 may include the encoding module 712. The encoding module 712 is configured to perform the method 600 described above.
An aspect of the invention is implemented as a program product for execution by a processor. Program(s) of the program product defines functions of embodiments and can be contained on a variety of signal-bearing media (computer readable media), which include, but are not limited to: (i) information permanently stored on non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM or DVD-ROM disks readable by a CD-ROM drive or a DVD drive); (ii) alterable information stored on writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive or read/writable CD or read/writable DVD); or (iii) information conveyed to a computer by a communications medium, such as through a computer or telephone network, including wireless communications. The latter embodiment specifically includes information downloaded from the Internet and other networks. Such signal-bearing media, when carrying computer-readable instructions that direct functions of the invention, represent embodiments of the invention.
While the foregoing is directed to illustrative embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.