This disclosure relates to digital video processing and, more particularly, encoding of video sequences.
Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless communication devices, personal digital assistants (PDAs), laptop computers, desktop computers, digital cameras, digital recording devices, cellular or satellite radio telephones, and the like. Digital video devices can provide significant improvements over conventional analog video systems in creating, modifying, transmitting, storing, recording and playing full motion video sequences.
A number of different video encoding standards have been established for encoding digital video sequences. The Moving Picture Experts Group (MPEG), for example, has developed a number of standards including MPEG-1, MPEG-2 and MPEG-4. Other standards include the International Telecommunication Union (ITU) H.263 standard, QuickTime™ technology developed by Apple Computer of Cupertino, Calif., Video for Windows™ developed by Microsoft Corporation of Redmond, Wash., Indeo™ developed by Intel Corporation, RealVideo™ from RealNetworks, Inc. of Seattle, Wash., and Cinepak™ developed by SuperMac, Inc. New standards continue to emerge and evolve, including the ITU H.264 standard and a number of proprietary standards.
Many video encoding standards allow for improved transmission rates of video sequences by encoding data in a compressed fashion. Compression can reduce the overall amount of data that needs to be transmitted for effective transmission of video frames. Most video encoding standards, for example, utilize graphics and video compression techniques designed to facilitate video and image transmission over a narrower bandwidth than can be achieved without the compression.
The MPEG standards and the ITU H.263 and ITU H.264 standards, for example, support video encoding techniques that utilize similarities between successive video frames, referred to as temporal or inter-frame correlation, to provide inter-frame compression. The inter-frame compression techniques exploit data redundancy across frames by converting pixel-based representations of video frames to motion representations. In addition, some video encoding techniques may utilize similarities within frames, referred to as spatial or intra-frame correlation, to further compress the video frames.
In order to support compression, a digital video device typically includes an encoder for compressing digital video sequences, and a decoder for decompressing the digital video sequences. In many cases, the encoder and decoder form an integrated encoder/decoder (CODEC) that operates on blocks of pixels within frames that define the sequence of video images. In the MPEG-4 standard, for example, the encoder typically divides a video frame to be transmitted into “macroblocks,” which comprise 16 by 16 pixel arrays. The ITU H.264 standard supports 16 by 16 video blocks, 16 by 8 video blocks, 8 by 16 video blocks, 8 by 8 video blocks, 8 by 4 video blocks, 4 by 8 video blocks and 4 by 4 video blocks.
For each video block in the video frame, an encoder searches similarly sized video blocks of one or more immediately preceding video frames (or subsequent frames) to identify the most similar video block, referred to as the “best prediction.” The process of comparing a current video block to video blocks of other frames is generally referred to as motion estimation. Once a “best prediction” is identified for a video block, the encoder can encode the differences between the current video block and the best prediction. This process of encoding the differences between the current video block and the best prediction includes a process referred to as motion compensation. Motion compensation comprises a process of creating a difference block, indicative of the differences between the current video block to be encoded and the best prediction. Motion compensation usually refers to the act of fetching the best prediction block using a motion vector, and then subtracting the best prediction from an input block to generate a difference block.
After motion compensation has created the difference block, a series of additional encoding steps are typically performed to encode the difference block. These additional encoding steps may depend on the encoding standard being used. In MPEG-4 compliant encoders, for example, the additional encoding steps may include an 8×8 discrete cosine transform, followed by scalar quantization, followed by a raster-to-zigzag reordering, followed by run-length encoding, followed by Huffman encoding.
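By way of illustration, the raster-to-zigzag reordering step can be sketched in a few lines of C. The scan pattern below is the conventional 8×8 zigzag order, and the function name and array layout are illustrative assumptions rather than part of any particular standard's reference code.

```c
#include <stdint.h>

/* Reorder an 8x8 block of quantized coefficients from raster order into
 * zigzag scan order, so that runs of zero coefficients tend to cluster at
 * the end of the scan, which benefits the subsequent run-length coding. */
static void raster_to_zigzag(const int16_t raster[64], int16_t zigzag[64])
{
    int idx = 0;
    for (int d = 0; d < 15; d++) {            /* anti-diagonals r + c = d      */
        int r_lo = d > 7 ? d - 7 : 0;
        int r_hi = d < 7 ? d : 7;
        if (d & 1) {                          /* odd diagonals: top to bottom  */
            for (int r = r_lo; r <= r_hi; r++)
                zigzag[idx++] = raster[r * 8 + (d - r)];
        } else {                              /* even diagonals: bottom to top */
            for (int r = r_hi; r >= r_lo; r--)
                zigzag[idx++] = raster[r * 8 + (d - r)];
        }
    }
}
```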
An encoded difference block can be transmitted along with a motion vector that indicates which video block from the previous frame was used for the encoding. A decoder receives the motion vector and the encoded difference block, and decodes the received information to reconstruct the video sequences.
In many standards, half-pixel values are also generated during the motion estimation and motion compensation. In MPEG-4, for example, half-pixel values are generated as the average of two adjacent pixel values. The half-pixels are used in candidate video blocks, and may form part of the best prediction identified during motion estimation. Relatively simple two-tap filters can be used to generate the half-pixel values, as they are needed in the motion estimation and motion compensation processes. The generation of non-integer pixel values can improve the resolution of inter-frame correlation, but generally complicates the encoding and decoding processes.
This disclosure describes video encoding techniques and video encoding devices that implement such techniques. The described video encoding techniques may be useful for a wide variety of encoding standards that allow for non-integer pixel values in motion estimation and motion compensation. In particular, video encoding standards such as the ITU H.264 standard, which uses half-pixel and quarter-pixel values in motion estimation and motion compensation, may specifically benefit from the techniques described herein. More generally, any standard that specifies a three-tap filter or greater in the generation of non-integer pixel values in a given dimension, e.g., vertical or horizontal, may benefit from the techniques described herein. The techniques are particularly useful for portable devices, where processing overhead can significantly influence device size and battery consumption.
In one embodiment, this disclosure describes a video encoding device comprising a motion estimator that generates non-integer pixel values for motion estimation, the motion estimator including a filter that receives at least three inputs of integer pixel values. The device also includes a memory that stores the non-integer pixel values generated by the motion estimator, and a motion compensator that uses the stored non-integer pixel values for motion compensation. For compliance with the ITU H.264 standard, for example, the motion estimator may generate half-pixel values using a six-tap filter, and store the half-pixel values for use in both motion estimation and motion compensation. The motion estimator may also generate quarter-pixel values using a two-tap filter, and use the quarter-pixel values in the motion estimation, without storing the quarter-pixel values for motion compensation. In that case, the motion compensator uses the stored half-pixel values that were generated by the motion estimator, but re-generates the quarter-pixel values using another two-tap filter. In some cases, separate filters are implemented for both horizontal and vertical interpolation, but the output of any large filters (of three taps or greater) is reused for motion estimation and motion compensation. In other cases, the same large filter may be used for both horizontal and vertical interpolation. In those cases, however, the clock speed of the encoding device may need to be increased.
These and other techniques described herein may be implemented in a digital video device in hardware, software, firmware, or any combination thereof. If implemented in software, the techniques may be directed to a computer readable medium comprising program code, that when executed, performs one or more of the encoding techniques described herein. Additional details of various embodiments are set forth in the accompanying drawings and the description below. Other features, objects and advantages will become apparent from the description and drawings, and from the claims.
Communication link 15 may comprise a wireless link, a physical transmission line, fiber optics, a packet-based network such as a local area network, wide area network, or global network such as the Internet, a public switched telephone network (PSTN), or any other communication link capable of transferring data. Thus, communication link 15 represents any suitable communication medium, or possibly a collection of different networks and links, for transmitting video data from source device 12 to receive device 14.
Source device 12 may be any digital video device capable of encoding and transmitting video data. Source device 12 may include a video memory 16 to store digital video sequences, a video encoder 18 to encode the sequences, and a transmitter 20 to transmit the encoded sequences over communication link 15 to receive device 14. Video encoder 18 may include, for example, various hardware, software or firmware, or one or more digital signal processors (DSP) that execute programmable software modules to control the video encoding techniques, as described herein. Associated memory and logic circuitry may be provided to support the DSP in controlling the video encoding techniques. As will be described, video encoder 18 may be configured to generate non-integer pixel values, and may use the generated non-integer pixel values for both motion estimation and motion compensation.
Source device 12 may also include a video capture device 23, such as a video camera, to capture video sequences and store the captured sequences in memory 16. In particular, video capture device 23 may include a charge coupled device (CCD), a charge injection device, an array of photodiodes, a complementary metal oxide semiconductor (CMOS) device, or any other photosensitive device capable of capturing video images or digital video sequences.
As further examples, video capture device 23 may be a video converter that converts analog video data to digital video data, e.g., from a television, video cassette recorder, camcorder, or another video device. In some embodiments, source device 12 may be configured to transmit real-time video sequences over communication link 15. In that case, receive device 14 may receive the real-time video sequences and display the video sequences to a user. Alternatively, source device 12 may capture and encode video sequences that are sent to receive device 14 as video data files, i.e., not in real-time. Thus, source device 12 and receive device 14 may support applications such as video clip playback, video mail, or video conferencing, e.g., in a mobile wireless network. Devices 12 and 14 may include various other elements that are not specifically illustrated in
Receive device 14 may take the form of any digital video device capable of receiving and decoding video data. For example, receive device 14 may include a receiver 22 to receive encoded digital video sequences from transmitter 20, e.g., via intermediate links, routers, other network equipment, and the like. Receive device 14 also may include a video decoder 24 for decoding the sequences, and a display device 26 to display the sequences to a user. In some embodiments, however, receive device 14 may not include an integrated display device. In such cases, receive device 14 may serve as a receiver that decodes the received video data to drive a discrete display device, e.g., a television or monitor.
Example devices for source device 12 and receive device 14 include servers located on a computer network, workstations or other desktop computing devices, and mobile computing devices such as laptop computers or personal digital assistants (PDAs). Other examples include digital television broadcasting satellites and receiving devices such as digital televisions, digital cameras, digital video cameras or other digital recording devices, digital video telephones such as mobile telephones having video capabilities, direct two-way communication devices with video capabilities, other wireless video devices, and the like.
In some cases, source device 12 and receive device 14 each include an encoder/decoder (CODEC) (not shown) for encoding and decoding digital video data. In particular, both source device 12 and receive device 14 may include transmitters and receivers as well as memory and displays. Many of the encoding techniques outlined below are described in the context of a digital video device that includes an encoder. It is understood, however, that the encoder may form part of a CODEC. In that case, the CODEC may be implemented within hardware, software, firmware, a DSP, a microprocessor, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), discrete hardware components, or various combinations thereof. Moreover, the encoding techniques described herein may allow for various digital filters or hardware components to be used for both encoding and decoding applications.
Video encoder 18 within source device 12 operates on blocks of pixels within a sequence of video frames in order to encode the video data. For example, video encoder 18 may execute motion estimation and motion compensation techniques in which a video frame to be transmitted is divided into blocks of pixels (referred to as video blocks). The video blocks, for purposes of illustration, may comprise any size of blocks, and may vary within a given video sequence. As an example, the ITU H.264 standard supports 16 by 16 video blocks, 16 by 8 video blocks, 8 by 16 video blocks, 8 by 8 video blocks, 8 by 4 video blocks, 4 by 8 video blocks and 4 by 4 video blocks. Smaller video blocks can provide better resolution in the encoding, and may be specifically used for locations of a video frame that include higher levels of detail. Moreover, as described below, video encoder 18 may be designed to operate on 4 by 4 video blocks in a pipelined manner, and reconstruct larger video blocks from the 4 by 4 video blocks, as needed.
Each pixel in a video block may be represented by an n-bit value, e.g., 8 bits, that defines visual characteristics of the pixel such as the color and intensity in values of chrominance and luminance. However, motion estimation is often performed only on the luminance component because human vision is more sensitive to changes in luminance than to chromaticity. Accordingly, for purposes of motion estimation, the entire n-bit value may quantify luminance for a given pixel. The principles of this disclosure, however, are not limited to the format of the pixels, and may be extended for use with simpler fewer-bit pixel formats or more complex larger-bit pixel formats.
For each video block in the video frame, video encoder 18 of source device 12 performs motion estimation by searching video blocks, stored in memory 16, of one or more preceding video frames already transmitted (or of subsequent video frames) to identify a similar video block. Upon determining a “best prediction” from the preceding or subsequent video frame, video encoder 18 performs motion compensation to create a difference block indicative of the differences between the current video block to be encoded and the best prediction. Motion compensation usually refers to the act of fetching the best prediction block using a motion vector, and then subtracting the best prediction from an input block to generate a difference block.
After the motion compensation process has created the difference block, a series of additional encoding steps are typically performed to encode the difference block. These additional encoding steps may depend on the encoding standard being used. In MPEG-4 compliant encoders, for example, the additional encoding steps may include an 8×8 discrete cosine transform, followed by scalar quantization, followed by a raster-to-zigzag reordering, followed by run-length encoding, followed by Huffman encoding.
Once encoded, the encoded difference block can be transmitted along with a motion vector that identifies the video block from the previous frame (or subsequent frame) that was used for encoding. In this manner, instead of encoding each frame as an independent picture, video encoder 18 encodes the difference between adjacent frames. Such techniques can significantly reduce the amount of data needed to accurately represent each frame of a video sequence.
The motion vector may define a pixel location relative to the upper-left-hand corner of the video block being encoded, although other formats for motion vectors could be used. In any case, by encoding video blocks using motion vectors, the required bandwidth for transmission of streams of video data can be significantly reduced.
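By way of illustration, the following C sketch shows how a motion vector might be applied to fetch an integer-pixel prediction block from a reference frame. The structure name, the 16 by 16 block size, and the row-major luminance layout are assumptions made only for this example and are not drawn from this disclosure.

```c
#include <stdint.h>

/* Illustrative motion vector in whole-pixel units, expressed relative to the
 * position of the block being encoded (names and layout are hypothetical). */
typedef struct {
    int dx;   /* horizontal displacement in pixels */
    int dy;   /* vertical displacement in pixels   */
} MotionVector;

/* Copy the 16x16 prediction block that the motion vector points to out of a
 * reference frame stored as a row-major luminance plane.  The vector is
 * assumed to point inside the frame; no bounds checking is shown. */
static void fetch_prediction(const uint8_t *ref, int stride,
                             int block_x, int block_y, MotionVector mv,
                             uint8_t pred[16 * 16])
{
    const uint8_t *src = ref + (block_y + mv.dy) * stride + (block_x + mv.dx);
    for (int row = 0; row < 16; row++)
        for (int col = 0; col < 16; col++)
            pred[row * 16 + col] = src[row * stride + col];
}
```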
In some cases, video encoder 18 can support intra-frame encoding, in addition to inter-frame encoding. Intra-frame encoding utilizes similarities within frames, referred to as spatial or intra-frame correlation, to further compress the video frames. Intra-frame compression is typically based upon texture encoding for compressing still images, such as discrete cosine transform (DCT) encoding. Intra-frame compression is often used in conjunction with inter-frame compression, but may also be used as an alternative in some implementations.
Receiver 22 of receive device 14 may receive the encoded video data in the form of motion vectors and encoded difference blocks indicative of encoded differences between the video block being encoded and the best prediction used in motion estimation. Decoder 24 performs video decoding in order to generate video sequences for display to a user via display device 26. The decoder 24 of receive device 14 may also be implemented as an encoder/decoder (CODEC). In that case, both source device 12 and receive device 14 may be capable of encoding, transmitting, receiving and decoding digital video sequences.
In accordance with this disclosure, non-integer pixel values that are generated from three or more input pixel values during video encoding in a given dimension (horizontal or vertical) can be stored in a local memory of video encoder 18 and then used for both motion estimation and motion compensation. The stored non-integer pixel values may be separately buffered, or allocated to any specific memory locations, as long as the non-integer pixel values can be located and identified, when needed. In contrast, non-integer pixel values that are generated from two input pixel values in a given dimension need not be stored for any significant amount of time, but can be generally calculated as needed for the motion estimation or the motion compensation.
Video memory 16A typically comprises a relatively large memory space. Video memory 16A, for example, may comprise dynamic random access memory (DRAM), or FLASH memory. In other examples, video memory 16A may comprise a non-volatile memory or any other data storage device.
Video encoder 18A includes a local memory 25A, which may comprise a smaller and faster memory space relative to video memory 16A. By way of example, local memory 25A may comprise static random access memory (SRAM). Local memory 25A may also comprise “on-chip” memory integrated with the other components of video encoder 18A to provide for very fast access to data during the processor-intensive encoding process. During the encoding of a given video frame, the current video block to be encoded may be loaded from video memory 16A to local memory 25A. A search space used in locating the best prediction may also be loaded from video memory 16A to local memory 25A. The search space may comprise a subset of pixels of one or more of the preceding video frames (or subsequent frames). The chosen subset may be pre-identified as a likely location for identification of a best prediction that closely matches the current video block to be encoded.
In many video standards, fractional pixels or non-integer pixels are also considered during the encoding process. For example, in MPEG-4, half-pixel values are calculated as the average between two adjacent pixels. In MPEG-4 compliant encoders, the average between two adjacent pixels can be easily generated for a given dimension, as needed, using a relatively simple digital filter that has two inputs and one output, commonly referred to as a two-tap digital filter.
By way of example, in the simple MPEG-2 or MPEG-4 case, if interpolation is performed both horizontally and vertically, then two-tap digital filters can be used for each dimension. Alternatively, interpolation in two dimensions could be done as a single 4-tap averaging filter. When filters specify more than two inputs in a given dimension, or more than five inputs for two-dimensional interpolation, the techniques described herein become very useful.
The tap weights of the digital filter are specified by the encoding standard. In order to support MPEG-4, motion estimator 26A and motion compensator 28A may include similar two-tap digital filters, which allow the half-pixel values to be generated for horizontal and vertical dimensions at any time using the integer pixel values of the search space loaded in local memory 25A.
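By way of illustration, such a two-tap filter can be sketched in C as follows. Round-half-up rounding is assumed here for the sake of the example; the exact rounding rule, like the tap weights, is dictated by the standard being supported.

```c
#include <stdint.h>

/* Two-tap half-pixel interpolation: the half-pixel value is the average of
 * two adjacent integer pixels, here with round-half-up rounding. */
static inline uint8_t half_pel_2tap(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Horizontal half-pixels for one row: out[i] lies between in[i] and in[i+1]. */
static void half_pel_row(const uint8_t *in, uint8_t *out, int n)
{
    for (int i = 0; i + 1 < n; i++)
        out[i] = half_pel_2tap(in[i], in[i + 1]);
}
```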
For some newer standards, however, the generation of non-integer pixels is more complex. For example, many newer standards specify the generation of half-pixel values in a given dimension based on the weighted sum of more than two pixels. As one specific example, the ITU H.264 standard specifies calculation of half-pixel values in both the horizontal and vertical dimensions as the weighted average of six pixels. For fractional horizontal pixels, the three pixels on the left of the half-pixel value are weighted similarly to the three pixels on the right of the half-pixel value. For fractional vertical pixels, the three pixels on the top of the half-pixel value are weighted similarly to the three pixels on the bottom of the half-pixel value. In both cases, a filter having six inputs and one output (a six-tap digital filter) is generally needed to generate the half-pixel values.
Moreover, the ITU H.264 standard also specifies the generation of quarter-pixel values, which are calculated as the average between an integer pixel and an adjacent half-pixel. Thus, generation of quarter-pixel values typically involves the use of a six-tap filter to generate a half-pixel value, followed by the use of a two-tap filter to generate the quarter-pixel value. Many proprietary standards also use other weighted averaging rules for non-integer pixel generation, which can add significant complexity to the generation of non-integer pixel values.
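By way of illustration, the following C sketch shows both filters in one dimension. The (1, −5, 20, 20, −5, 1) tap weights and the final rounding and clipping follow the ITU H.264 luma half-pixel interpolation rule described above; the function names and the pointer convention are assumptions made only for this example.

```c
#include <stdint.h>

static inline uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Six-tap half-pixel interpolation in one dimension, with the (1, -5, 20,
 * 20, -5, 1) tap weights used by ITU H.264 for luma half-pixel positions.
 * p points at the integer pixel immediately to the left of (or above) the
 * half-pixel location, so p[-2]..p[3] are the six input pixels. */
static inline uint8_t half_pel_6tap(const uint8_t *p)
{
    int sum = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
    return clip255((sum + 16) >> 5);
}

/* Quarter-pixel value: the average of an integer pixel and an adjacent
 * half-pixel value, rounded up. */
static inline uint8_t quarter_pel_2tap(uint8_t integer_pel, uint8_t half_pel)
{
    return (uint8_t)((integer_pel + half_pel + 1) >> 1);
}
```

The contrast in complexity between the six-tap and two-tap computations is what motivates storing the six-tap output for reuse, as discussed below.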
In accordance with this disclosure, non-integer pixel values that are generated from three or more input pixel values in a given dimension can be stored in local memory 25A as part of the search space. The stored non-integer pixel values may be separately buffered, or allocated to any specific memory locations, as long as the non-integer pixel values can be located and identified, when needed. In contrast, non-integer pixel values that are generated from two input pixel values need not be stored for any significant amount of time, but can be generally calculated as needed.
This disclosure recognizes the trade-off between the need for additional memory space in local memory 25A to store any non-integer pixel values for significant amounts of time, and the hardware or processing power needed to filter inputs and generate the non-integer pixel values. Two-tap filters are very simple to implement in one dimension, and therefore, two-tap filters can be used in many locations of the video encoder to generate non-integer pixel values from two inputs, when needed. However, filters having three or more inputs for one dimension, and specifically the six-tap filters used for compliance with the ITU H.264 standard, are more complex. When these larger filters are needed, it is more advantageous to implement a single filter that receives three or more inputs, and then store or buffer the output of the large filter in local memory 25A for reuse in the encoding process, when needed.
For example, video encoder 18A includes a motion estimator 26A and a motion compensator 28A, which respectively perform motion estimation and motion compensation in the video encoding process. As shown in
In some cases, separate filters are implemented for both horizontal and vertical interpolation, but the output of any large filters (of three taps or greater) can be reused for motion estimation and motion compensation. In other cases, the same large filter may be used for both horizontal and vertical interpolation and the output of the large filter can be stored for use in both motion estimation and motion compensation. In those cases, however, the clock speed may need to be increased since a single filter is used for both horizontal and vertical interpolation, which may increase power consumption.
Local memory 25A is loaded with a current video block to be encoded and a search space, which comprises some or all of one or more different video frames used in inter-frame encoding. Motion estimator 26A compares the current video block to various video blocks in the search space in order to identify a best prediction. In some cases, however, an adequate match for the encoding may be identified more quickly, without specifically checking every possible candidate, and in that case, the adequate match may not actually be the “best” prediction, albeit adequate for effective video encoding.
Motion estimator 26A supports encoding schemes that use non-integer pixel values. In particular, non-integer pixel computation unit 32A may generate non-integer pixel values that expand the search space to fractional or non-integer pixel values. Both horizontal non-integer pixel values and vertical non-integer pixel values may be generated. Any non-integer pixel values generated from two inputs may be used and then discarded or overwritten in local memory 25A, as these non-integer pixel values generated from two inputs can be easily re-generated, as needed. However, any non-integer pixel values generated from three or more inputs may be used and maintained in local memory 25A for subsequent use in the encoding process, as these non-integer pixel values generated from three or more inputs are more complicated to generate and re-generate.
Video block matching unit 34A performs the comparisons between the current video block to be encoded and the candidate video blocks in the search space of memory 25A, including any candidate video blocks that include non-integer pixel values generated by non-integer pixel computation unit(s) 32A. For example, video block matching unit 34A may comprise a difference processor, or a software routine that performs difference calculations in order to identify a best prediction (or simply an adequate prediction).
By way of example, video block matching unit 34A may perform SAD techniques (sum of absolute difference techniques), SSD techniques (sum of squared difference techniques), or other comparison techniques, if desired. The SAD techniques involve the task of performing absolute difference computations between pixel values of the current video block to be encoded and pixel values of the candidate video block to which the current video block is being compared. The results of these absolute difference computations are summed, i.e., accumulated, in order to define a difference value indicative of the difference between the current video block and the candidate video block. For an 8 by 8 pixel image block, 64 differences may be computed and summed, and for a 16 by 16 pixel macroblock, 256 differences may be computed and summed. The overall summation of all of the computations can define the difference value for the candidate video block.
A lower difference value generally indicates that a candidate video block is a better match, and thus a better candidate for use in motion estimation encoding than other candidate video blocks yielding higher difference values, i.e. increased distortion. In some cases, computations may be terminated when an accumulated difference value exceeds a defined threshold, or when an adequate match is identified early, even if other candidate video blocks have not yet been considered.
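By way of illustration, a SAD computation with early termination can be sketched in C as follows. The function name, the stride parameters, and the use of INT_MAX to mark a rejected candidate are assumptions made only for this example.

```c
#include <stdint.h>
#include <limits.h>

/* Sum of absolute differences between a current block and a candidate block,
 * both w x h, with early termination once the running sum exceeds the best
 * difference value found so far among previously tested candidates. */
static int block_sad(const uint8_t *cur, int cur_stride,
                     const uint8_t *cand, int cand_stride,
                     int w, int h, int best_so_far)
{
    int sad = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int d = cur[y * cur_stride + x] - cand[y * cand_stride + x];
            sad += d < 0 ? -d : d;
        }
        if (sad > best_so_far)      /* early termination, as discussed above */
            return INT_MAX;         /* candidate cannot be the best match    */
    }
    return sad;
}
```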
The SSD techniques also involve the task of performing difference computations between pixel values of the current video block to be encoded and pixel values of the candidate video block. However, in the SSD techniques, the results of the difference computations are squared, and then the squared values are summed, i.e., accumulated, in order to define a difference value indicative of the difference between the current video block and the candidate video block to which the current video block is being compared. Alternatively, video block matching unit 34A may use other comparison techniques such as a Mean Square Error (MSE), a Normalized Cross Correlation Function (NCCF), or another suitable comparison algorithm.
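A minimal SSD variant, under the same illustrative assumptions as the SAD sketch above, differs only in squaring each difference before accumulation:

```c
#include <stdint.h>

/* Sum of squared differences: identical structure to the SAD computation,
 * but each pixel difference is squared before accumulation. */
static long block_ssd(const uint8_t *cur, int cur_stride,
                      const uint8_t *cand, int cand_stride, int w, int h)
{
    long ssd = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int d = cur[y * cur_stride + x] - cand[y * cand_stride + x];
            ssd += (long)d * d;
        }
    return ssd;
}
```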
Ultimately, video block matching unit 34A can identify a “best prediction,” which is the candidate video block that most closely matches the video block to be encoded. However, it is understood that, in many cases, an adequate match may be located before the best prediction, and in those cases, the adequate match may be used for the encoding. In the following description, reference is made to the “best prediction” identified by video block matching unit 34A, but it is understood that this disclosure is not limited in that respect, and any adequate match may be used and can possibly be identified more quickly than the best prediction.
In some embodiments, video block matching unit 34A may be implemented in a pipelined manner. For example, video block matching unit 34A may comprise a processing pipeline that simultaneously handles more than one video block. Moreover, in some cases, the processing pipeline may be designed to operate on 4-pixel by 4-pixel video blocks, even if the size of the video block to be encoded is larger than 4-pixel by 4-pixel video blocks. In that case, the difference computations for an adjacent set of 4-pixel by 4-pixel candidate video blocks may be summed to represent the difference computations for a larger video block, e.g., a 4-pixel by 8-pixel video block comprising two 4-pixel by 4-pixel candidates, an 8-pixel by 4-pixel video block comprising two 4-pixel by 4-pixel candidates, an 8-pixel by 8-pixel video block comprising four 4-pixel by 4-pixel candidates, an 8-pixel by 16-pixel video block comprising eight 4-pixel by 4-pixel candidates, a 16-pixel by 8-pixel video block comprising eight 4-pixel by 4-pixel candidates, a 16-pixel by 16-pixel video block comprising sixteen 4-pixel by 4-pixel candidates, and so forth.
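By way of illustration, the following C sketch shows how stored 4-pixel by 4-pixel difference values might be combined for the 8 by 8 and 16 by 16 cases. The array layout is an assumption made only for this example; the same accumulation extends to the 4 by 8, 8 by 4, 8 by 16, and 16 by 8 cases.

```c
/* Reuse the sixteen 4x4 SAD values of a 16x16 macroblock to form the SAD of
 * larger partitions, rather than recomputing pixel differences.  sad4x4[r][c]
 * is assumed to hold the SAD of the 4x4 sub-block in row r, column c. */
static void combine_sads(const int sad4x4[4][4],
                         int sad8x8[2][2], int *sad16x16)
{
    *sad16x16 = 0;
    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 2; c++) {
            sad8x8[r][c] = sad4x4[2 * r][2 * c]     + sad4x4[2 * r][2 * c + 1]
                         + sad4x4[2 * r + 1][2 * c] + sad4x4[2 * r + 1][2 * c + 1];
            *sad16x16 += sad8x8[r][c];
        }
}
```

In this way, the pixel-level difference computations are performed only once, at the 4-pixel by 4-pixel granularity, and the larger block sizes are evaluated from the stored partial sums.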
In any case, once a best prediction is identified by video block matching unit 34A for a video block, motion compensator 28A creates a difference block indicative of the differences between the current video block and the best prediction. Difference block encoder 39A may further encode the difference block to compress it, and the encoded difference block can be forwarded for transmission to another device, along with a motion vector that indicates which candidate video block from the search space was used for the encoding. For simplicity, the additional components used to perform encoding after motion compensation are generalized as difference block encoder 39A, as the specific components would vary depending on the specific standard being supported. In other words, difference block encoder 39A may perform one or more conventional encoding techniques on the difference block, which is generated as described above.
Motion compensator 28A includes non-integer pixel computation unit(s) 36A for generating any non-integer pixels of the best prediction. As outlined above, however, non-integer pixel computation unit(s) 36A of motion compensator 28A only include two-tap digital filters for a given dimension, and generally do not include larger digital filters, because the output of any larger digital filters of non-integer pixel computation unit(s) 32A of motion estimator 26A are stored in local memory 25A for use in both motion estimation and motion compensation. Accordingly, the need to implement digital filters that require three or more inputs for a given dimension in motion compensator 28A can be avoided.
Difference block calculation unit 38A generates a difference block, which generally represents the differences between the current video block and the best prediction. The difference block may also be referred to as a “prediction matrix” or as a “residual.” The difference block is generally a matrix of values that represent the difference in pixel values of the best prediction and the current video block. In other words:
Difference Block = Pixel values of best prediction − Pixel values of current video block
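By way of illustration, this calculation can be sketched in C as follows, with contiguous block buffers assumed only for simplicity of the example.

```c
#include <stdint.h>

/* Form the difference block (residual) for an n x n block, following the
 * relationship stated above: each entry is the best-prediction pixel value
 * minus the corresponding pixel value of the current video block. */
static void difference_block(const uint8_t *pred, const uint8_t *cur,
                             int16_t *diff, int n)
{
    for (int i = 0; i < n * n; i++)
        diff[i] = (int16_t)pred[i] - (int16_t)cur[i];
}
```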
Difference block encoder 39A encodes the difference block to compress the difference block, and the encoded video blocks are then forwarded to transmitter 20A for transmission to another device. In some cases, encoded video blocks may be temporarily stored in video memory 16A, where the encoded video blocks are accumulated and then sent by transmitter 20A as a stream of video frames. In any case, the encoded video blocks may take the form of encoded difference blocks and motion vectors. The difference block represents the difference in pixel values of the best prediction and the current video block. The motion vector identifies the location of the best prediction, either within the frame or within fractional pixels generated from the frame. Different video standards identify the frame to which the motion vector applies in different ways. For example, H.264 utilizes a reference picture index, whereas in MPEG-4 or MPEG-2 this information is carried in the macroblock header information.
As shown in
Video memory 16B typically comprises a relatively large memory space. Video memory 16B, for example, may comprise DRAM, FLASH memory, possibly a non-volatile memory, or any other data storage device.
Video encoder 18B includes a local memory 25B, which may comprise a smaller and faster memory space relative to video memory 16B. By way of example, local memory 25B may comprise static random access memory (SRAM). Local memory 25B may also comprise “on-chip” memory integrated with the other components of video encoder 18B to provide fast access during the processor-intensive encoding process. During the encoding of a given video frame, the current video block to be encoded may be loaded from video memory 16B to local memory 25B.
Motion estimator 26B compares the current video block to various video blocks in the search space in order to identify a best prediction. Motion estimator 26B supports the ITU H.264 encoding schemes that use half-pixel values and quarter-pixel values. In particular, non-integer pixel computation unit 32B may include a six-tap filter 31 for half-pixel interpolation, and a two-tap filter 33 for quarter-pixel interpolation. Both horizontal half- and quarter-pixel values and vertical half- and quarter-pixel values may be generated.
The half-pixel values are generated by six-tap filter 31 as the weighted average of six successive pixels, according to the ITU H.264 video encoding standard. The quarter-pixel values are generated by two-tap filter 33 as the average of an integer pixel value and an adjacent half-pixel value. In other words, the tap weights of the filters may be specified by the ITU H.264 video encoding standard, although this disclosure is not limited in that respect.
In some cases, separate six-tap filters are implemented in motion estimator 26B for both horizontal and vertical interpolation, and the output of both six-tap filters can be used in motion estimation and motion compensation. In other cases, the same six-tap filter may be used for both horizontal and vertical interpolation. In the latter case, however, the clock speed may need to be increased, which will increase power consumption. Accordingly, it may be more desirable to implement two six-tap digital filters for separate horizontal and vertical interpolation in motion estimation, and then reuse the output of both six-tap digital filters for horizontal and vertical interpolation in motion compensation. Regardless of whether motion estimator 26B implements two six-tap filters for separate horizontal and vertical half-pixel interpolation or a single six-tap filter for both the horizontal and vertical half-pixel interpolation, a single two-tap digital filter may be implemented in each of motion estimator 26B and motion compensator 28B for quarter-pixel interpolation. However, additional two-tap filters might also be included to increase processing speed.
In any case, in accordance with this disclosure, the half-pixel output of six-tap filter 31 is used for both motion estimation and motion compensation. In other words, the half-pixel output of six-tap filter 31 is used for motion estimation and then stored in memory 25B for subsequent use in motion compensation. In contrast, the quarter-pixel output of two-tap filter 33 is used only for motion estimation, and is then discarded or overwritten in memory 25B.
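By way of illustration, the following C sketch outlines one way this storage arrangement might look. The structure, its field names, and the quarter-pixel helper are assumptions made only for this example and are not drawn from this disclosure.

```c
#include <stdint.h>

/* Illustrative search-space layout: integer pixels plus a buffer of stored
 * half-pixel values.  The half-pixel buffer is filled once, by the six-tap
 * filter in the motion estimator, and read again during motion compensation,
 * so the motion compensator never needs its own six-tap filter. */
typedef struct {
    const uint8_t *integer_pels;   /* integer-pixel search space               */
    uint8_t       *half_pels;      /* half-pixel values from the 6-tap filter  */
    int            stride;         /* row stride of both planes                */
} SearchSpace;

/* Quarter-pixel values are cheap to compute, so they are generated on demand
 * with a two-tap average (and may be regenerated in motion compensation)
 * rather than being stored for the whole encoding pass. */
static inline uint8_t quarter_pel(uint8_t integer_pel, uint8_t half_pel)
{
    return (uint8_t)((integer_pel + half_pel + 1) >> 1);
}
```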
Video block matching unit 34B performs the comparisons between the current video block to be encoded and the candidate video blocks in the search space of memory 25B, including any candidate video blocks that include quarter- or half-pixel values generated by non-integer pixel computation units 32B. For example, video block matching unit 34B may comprise a difference processor, or a software routine that performs difference calculations in order to identify a best prediction (or simply an adequate prediction). By way of example, video block matching unit 34B may perform SAD techniques, SSD techniques, or other comparison techniques such as a Mean Square Error (MSE), a Normalized Cross Correlation Function (NCCF), or another suitable comparison algorithm.
Ultimately, video block matching unit 34B can identify a “best prediction,” which is the candidate video block that most closely matches the video block to be encoded. In some embodiments, video block matching unit 34B may be implemented in a pipelined manner. For example, video block matching unit 34B may comprise a processing pipeline that simultaneously handles more than one video block. Moreover, in some cases, the processing pipeline may be designed to operate on 4-pixel by 4-pixel video blocks, even if the size of the video block to be encoded is larger than 4-pixel by 4-pixel video blocks. In the pipelined embodiment, memory allocated to quarter-pixel storage may be overwritten once the pixels have been considered in the pipeline, which can reduce the amount of memory needed. Of course, half-pixel values are stored for subsequent use, as outlined herein.
Once a best prediction is identified by video block matching unit 34B for a video block, motion compensator 28B can generate a difference block indicative of the differences between the current video block and the best prediction. Motion compensator 28B can then forward the difference block to difference block encoder 39B, which performs various additional encoding supported by the ITU H.264 encoding standard. Difference block encoder 39B forwards encoded difference blocks to transmitter 20B via bus 35B for transmission to another device, along with a motion vector that indicates which candidate video block was used for the encoding.
Motion compensator 28B includes non-integer pixel computation unit 36B for generating any non-integer pixels of the best prediction that are not already stored in local memory 25B. Non-integer pixel computation unit 36B of motion compensator 28B only includes a two-tap digital filter 37 for generating quarter-pixel values, and generally does not include a six-tap digital filter for generating half-pixel values, because the half-pixel output of six-tap digital filter 31 of motion estimator 26B is stored in local memory 25B for use in both motion estimation and motion compensation. Accordingly, the need to implement a six-tap digital filter in motion compensator 28B can be avoided. Again, two-tap digital filters can be implemented very easily without requiring significant chip circuit area. Six-tap digital filters, in contrast, are much more complicated. Accordingly, the additional memory space required to buffer the half-pixel output of six-tap digital filter 31 for a significant amount of time during the encoding process of a given video block is worthwhile because it can eliminate the need for an additional six-tap digital filter.
Difference block calculation unit 38B generates a difference block, which generally represents the differences between the current video block and the best prediction. Again, the difference block is typically calculated as follows:
Difference Block = Pixel values of best prediction − Pixel values of current video block
Motion compensator 28B forwards difference blocks to difference block encoder 39B, which encodes and compresses the difference blocks and sends the encoded difference blocks to transmitter 20B for transmission to another device. The transmitted information may take the form of an encoded difference block and a motion vector. The difference block represents the difference in pixel values of the best prediction and the current video block. The motion vector identifies the location of the best prediction, either within the frame or within fractional pixels generated from the frame.
An additional buffer may be allocated for quarter-pixel values, but this buffer may be more limited in size. Quarter-pixel values can be stored in the quarter-pixel buffer, but then overwritten with other quarter-pixel values, after being considered. This disclosure recognizes that two-tap digital filters are less costly, from a chip-implementation standpoint, than the additional memory space that would otherwise be needed to store every generated quarter-pixel value for the entire encoding process of a given video block.
In addition, the same hardware may be used for both encoding and decoding. The decoding scheme is less intensive and generally requires the generation of pixel values, as needed. In accordance with this disclosure, the same digital filters used in the motion estimator and motion compensator may also be used in decoding to generate any non-integer pixel values.
Also, additional vertical half-pixel values may be generated from the set of pixels 72 to define another set of pixels 74. For example, pixel C03 may comprise the weighted sum of pixels B03-B53. Any quarter-pixel values may be similarly generated, as needed, using a two-tap digital filter with inputs being an integer pixel value and the adjacent half-pixel value. For example, a quarter-pixel value between pixels A02 and A03 that is closer to pixel A02 would be the average of A02 and B00. Similarly, the quarter-pixel value between pixels A02 and A03 that is closer to pixel A03 would be the average of B00 and A03.
Importantly, the same hardware used for the encoding, i.e., the six-tap digital filter and various two-tap digital filters, can be used to generate any output needed for decoding, based on search space 70 as the input. Accordingly, the encoding techniques described herein are entirely consistent with a decoding scheme in which the same hardware can be used for both the encoding and the decoding.
At this point, video block matching unit 34B can perform motion estimation difference computations for half-pixel video blocks, i.e., any video blocks that include half-pixel values (85). Two-tap digital filter 33 generates quarter-pixel values, e.g., as the average of an integer pixel value and an adjacent half-pixel value (86). The quarter-pixel values can be used for motion estimation, but need not be stored for any subsequent use. Video block matching unit 34B can perform motion estimation difference computations for quarter-pixel video blocks, i.e., any video blocks that include quarter-pixel values (87).
Once every candidate video block, including half-pixel blocks and quarter-pixel blocks, has been compared to the current video block to be encoded, motion estimator 26B identifies a best prediction (88). However, as mentioned above, this disclosure also contemplates the use of an adequate match, which is not necessarily the “best” match, albeit a suitable match for effective video encoding and compression. Motion compensation is then performed.
During motion compensation, motion compensator 28B uses the half-pixel values generated by six-tap filter 31 and stored in local memory 25B (89). However, two-tap filter 37 generates any quarter-pixel values needed for motion compensation (90). In that case, two-tap filter 37 may re-generate at least some of the quarter-pixel values that were previously generated by two-tap digital filter 33. Difference block calculation unit 38B generates a difference block, e.g., indicative of the difference between the current video block to be encoded and the best prediction video block (91). The difference block can then be encoded and transmitted with a motion vector identifying the location of the candidate video block used for the video encoding.
A number of different embodiments have been described. The techniques may be capable of improving video encoding by achieving an efficient balance between local memory space and hardware used to perform non-integer pixel computations. In these and possibly other ways, the techniques can improve video encoding according to standards such as ITU H.264 or any other video encoding standards that use non-integer pixel values, including any of a wide variety of proprietary standards. In particular, the techniques are useful whenever a video encoding standard calls for use of a three-tap filter or larger in the generation of non-integer pixel values in a specific dimension, i.e., for one-dimensional interpolation. The techniques may also be useful if a standard can be implemented using a five-tap filter or larger in two-dimensional interpolation. The given standard being supported may specify the tap weights of the various filters.
The techniques may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the techniques may be directed to a computer readable medium comprising program code, that when executed in a device that encodes video sequences, performs one or more of the methods mentioned above. In that case, the computer readable medium may comprise random access memory (RAM) such as synchronous dynamic random access memory (SDRAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, and the like.
The program code may be stored on memory in the form of computer readable instructions. In that case, a processor such as a DSP may execute instructions stored in memory in order to carry out one or more of the techniques described herein. In some cases, the techniques may be executed by a DSP that invokes various hardware components such as a motion estimator to accelerate the encoding process. In other cases, the video encoder may be implemented as a microprocessor, one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs), or some other hardware-software combination. These and other embodiments are within the scope of the following claims.