This disclosure relates to encoding and decoding video data using direct prediction.
Digital video streams typically represent video using a sequence of frames. Each frame can include a number of blocks, which in turn may contain information describing the value of color, brightness or other attributes for pixels. The amount of data in a typical video stream is large, and transmission and storage of video can use significant computing or communications resources. Various approaches have been proposed to reduce the amount of data in video streams, including compression and other encoding techniques. These approaches can involve predicting data of a frame using motion vectors and data from other frames.
Disclosed herein are implementations of systems, methods and apparatuses for encoding and decoding a video stream of frames having a plurality of macroblocks. One aspect of the disclosed implementations is a method for decoding a video stream of frames, each frame having a plurality of macroblocks. The method comprises receiving a frame of a video stream to be decoded. The method further comprises determining, based on a header of the frame, a variable length code defining sizes of one or more superblocks in the frame using a processor. Each superblock is formed of at least one macroblock encoded using direct prediction. The variable length code includes one bit to describe a most frequent size for the one or more superblocks in the frame and two bits to describe each of a second and a third most frequent size for the one or more superblocks in the frame. The method further comprises selecting a superblock of the frame. The method further comprises determining a size of the superblock based on a header of the superblock and the variable length code. The size of the superblock indicates a number of macroblocks belonging to the superblock. The method further comprises decoding those macroblocks belonging to the superblock using direct prediction. The method further comprises decoding any macroblocks of the frame that do not belong to the one or more superblocks.
Another aspect of the disclosed implementations is an apparatus for decoding a video stream of frames, each frame having a plurality of macroblocks. The apparatus comprises a memory and a processor. The processor is configured to execute instructions stored in the memory to receive a frame of a video stream to be decoded. The processor is further configured to execute instructions stored in the memory to determine, based on a header of the frame, a variable length code defining sizes of one or more superblocks in the frame. Each superblock is formed of at least one macroblock encoded using direct prediction. The variable length code includes one bit to describe a most frequent size for the one or more superblocks in the frame and two bits to describe each of a second and a third most frequent size for the one or more superblocks in the frame. The processor is further configured to execute instructions stored in the memory to select a superblock of the frame. The processor is further configured to execute instructions stored in the memory to determine a size of the superblock based on a header of the superblock and the variable length code. The size of the superblock indicates a number of macroblocks belonging to the superblock. The processor is further configured to execute instructions stored in the memory to decode those macroblocks belonging to the superblock using direct prediction. The processor is further configured to execute instructions stored in the memory to decode any macroblocks of the frame that do not belong to the one or more superblocks.
Another aspect of the disclosed implementations is a method for decoding a video signal using a computing device, the video signal including frames defining a video sequence, each frame having a plurality of macroblocks. The method comprises decoding a bitstream representative of the video signal. The bitstream includes a variable length code indicative of a size of one or more superblocks included in a frame of the video signal. The variable length code includes one bit to describe a most frequent size for the one or more superblocks and two bits to describe each of a second and third most frequent size for the one or more superblocks.
Variations in these and other implementations will be described in additional detail hereafter.
The description herein makes reference to the accompanying drawings wherein like reference numerals refer to like parts throughout the several views, and wherein:
Digital video is used for various purposes including, for example, remote business meetings via video conferencing, high definition video entertainment, video advertisements, and sharing of user-generated videos. As technology is evolving, users have higher expectations for video quality and expect high resolution video even when transmitted over communications channels having limited bandwidth.
One way to reduce the number of bits in an encoded video stream while maintaining acceptable video quality is to encode macroblocks using direct motion prediction, where inter-frame motion prediction is performed by re-using motion vectors from a previous frame. The bit savings from direct prediction are partly offset, however, by the need to add bits to each block (such as a 16×16 macroblock) to indicate which type of prediction was used to encode the block. Aspects of disclosed embodiments improve encoding and transmission or storage efficiency of video streams by constructing superblocks comprising a larger number of pixels. Superblocks can be formed by, for example, combining macroblocks depending upon calculated motion vectors. Then, the superblocks can be encoded using direct prediction. Since motion vectors for previous frames are re-used, no new motion vectors need to be included in the video stream along with the superblock residual data.
Combining macroblocks into superblocks with similar motion prediction attributes can achieve bit savings by reducing the number of bits used to indicate the motion prediction, since fewer blocks are included in the video stream. Additional bit savings can be realized by encoding the bits that indicate the size and shape of the superblock into a variable length field where the most commonly used superblocks are encoded using fewer bits. Details of implementations taught herein can first be obtained by reference to a system in which the teachings can be implemented.
A network 28 can connect transmitting station 12 and a receiving station 30 for encoding and decoding of the video stream. Specifically, the video stream can be encoded in transmitting station 12 and the encoded video stream can be decoded in receiving station 30. Network 28 can, for example, be the Internet. The network 28 can also be a local area network (LAN), wide area network (WAN), virtual private network (VPN), a cellular telephone network or any other means of transferring the video stream from transmitting station 12 to, in this example, receiving station 30.
Receiving station 30, in one example, can be a computing device having an internal configuration of hardware including a processor such as a CPU 32 and a memory 34. CPU 32 can be a controller for controlling the operations of receiving station 30. CPU 32 is connected to memory 34 by, for example, a memory bus. Memory 34 can be ROM, RAM or any other suitable memory device. Memory 34 can store data and program instructions that are used by CPU 32. Other suitable implementations of receiving station 30 are possible. For example, the processing of receiving station 30 can be distributed among multiple devices.
A display 36 configured to display a video stream can be connected to receiving station 30. Display 36 can be implemented in various ways, including by a liquid crystal display (LCD), a cathode-ray tube (CRT) or a light emitting diode (LED) display, such as an OLED display. Display 36 is coupled to CPU 32 and can be configured to display a rendering, or screen image, 138 of the video stream decoded by a decoder in receiving station 30.
In the implementations described, for example, an encoder is in transmitting station 12 and a decoder is in receiving station 30 as instructions in memory or a component separate from memory. However, an encoder or decoder can be connected to a respective station 12, 30 rather than in it. Further, one implementation can omit network 28 and/or display 36. In another implementation, a video stream can be encoded and then stored for transmission at a later time to receiving station 30 or any other device having memory. In one implementation, a video stream is received by receiving station 30 (e.g., via network 28, a computer bus and/or some communication pathway) and stored for later decoding. In another implementation, additional components can be added to encoder and decoder system 10. For example, a display or a video camera can be attached to transmitting station 12 to capture the video stream to be encoded. In an exemplary implementation, a real-time transport protocol (RTP) is used for transmission of the encoded video. In another implementation, a transport protocol other than RTP may be used, e.g., an HTTP-based video streaming protocol.
When video stream 50 is presented for encoding, each frame 56 within video stream 50 is processed in units of blocks. At intra/inter prediction stage 72, each block can be encoded using either intra-frame prediction (i.e., within a single frame) or inter-frame prediction (i.e. from frame to frame). In either case, a prediction block can be formed. In the case of intra-prediction, a prediction block can be formed from samples in the current frame that have been previously encoded and reconstructed. In the case of inter-prediction, a prediction block can be formed from samples in one or more previously constructed reference frames.
Next, still referring to
Quantization stage 76 converts the transform coefficients into discrete quantum values, which are referred to as quantized transform coefficients, using a quantizer or quantization level. The quantized transform coefficients are then entropy encoded by entropy encoding stage 78. The entropy-encoded coefficients, together with other information used to decode the block, which may include for example the type of prediction used, motion vectors, and quantizer, are then output to compressed bitstream 88. Compressed bitstream 88 can be formatted using various techniques, such as variable length encoding (VLE) and arithmetic coding. Compressed bitstream 88 can also be referred to as an encoded video stream and the terms will be used interchangeably herein.
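As a rough illustration of quantization stage 76 and its inverse, the following Python sketch applies a uniform scalar quantizer to a list of transform coefficients. The function names, the truncating division and the single step size `step` are assumptions made for illustration only; the disclosure does not specify a particular quantizer design.

```python
def quantize(coefficients, step):
    """Map each transform coefficient to an integer level (truncating toward zero)."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [int(c / step) for c in coefficients]


def dequantize(levels, step):
    """Approximate the original coefficients from the integer levels."""
    return [level * step for level in levels]


# Hypothetical coefficients and step size, chosen only to show the round trip.
levels = quantize([100, -37, 5], step=10)
restored = dequantize(levels, step=10)
```

In this sketch small coefficients collapse to zero, and the difference between the input and `restored` is information that a decoder cannot recover.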
The reconstruction path in
Other variations of encoder 70 can be used to encode compressed bitstream 88. For example, a non-transform based encoder 70 can quantize the residual signal directly without transform stage 74. In another embodiment, an encoder 70 can have quantization stage 76 and dequantization stage 80 combined into a single stage.
Decoder 100, similar to the reconstruction path of encoder 70 discussed above, includes in one example the following stages to perform various functions to produce an output video stream 116 from compressed bitstream 88: an entropy decoding stage 102, a dequantization stage 104, an inverse transform stage 106, an intra/inter prediction stage 108, a reconstruction stage 110, a loop filtering stage 112 and a deblocking filtering stage 114. Other structural variations of decoder 100 can be used to decode compressed bitstream 88.
When compressed bitstream 88 is presented for decoding, the data elements within compressed bitstream 88 can be decoded by entropy decoding stage 102 to produce a set of quantized transform coefficients. Dequantization stage 104 dequantizes the quantized transform coefficients, and inverse transform stage 106 inverse transforms the dequantized transform coefficients to produce a derivative residual that can be identical to that created by the inverse transform stage 82 in encoder 70. Using header information decoded from compressed bitstream 88, decoder 100 can use intra/inter prediction stage 108 to create the same prediction block as was created in encoder 70, e.g., at intra/inter prediction stage 72. At reconstruction stage 110, the prediction block can be added to the derivative residual to create a reconstructed block. Loop filtering stage 112 can be applied to the reconstructed block to reduce blocking artifacts. Deblocking filtering stage 114 can be applied to the reconstructed block to reduce blocking distortion, and the result is output as output video stream 116. Output video stream 116 can also be referred to as a decoded video stream and the terms will be used interchangeably herein.
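The addition performed at reconstruction stage 110 can be sketched as follows. This is a minimal Python illustration, assuming blocks are stored as lists of rows and that samples are clamped to an 8-bit range; the function name `reconstruct_block` and the `max_value` parameter are hypothetical and do not come from the disclosure.

```python
def reconstruct_block(prediction, residual, max_value=255):
    """Add the residual to the prediction, clamping each sample to [0, max_value]."""
    return [
        [min(max(p + r, 0), max_value) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]


# Hypothetical 2x2 prediction block and derivative residual.
block = reconstruct_block([[250, 10], [128, 64]], [[10, -20], [3, -4]])
```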
Other variations of decoder 100 can be used to decode compressed bitstream 88. For example, decoder 100 can produce output video stream 116 without deblocking filtering stage 114.
The order in which steps are included in method of operation 500 is exemplary; the order of the steps can be changed without departing from the meaning of the disclosed implementations. For example, method of operation 500 shows the direct motion prediction being calculated for the macroblocks of the frame before superblocks are formed and then encoded. Alternatively, direct motion prediction can be calculated for the macroblocks as a superblock is formed, and the resulting superblock can be encoded before the next macroblocks are processed.
At step 502, a frame of a video stream is received by a computing device, such as transmitting station 12 that is implementing method of operation 500. The stream has a plurality of macroblocks that in some cases can be organized into frames. Each frame can capture a scene with multiple objects, such as people, background elements, graphics, text, a blank wall, or anything that can be represented in video data. Video data can be received in any number of ways, such as by receiving the video data over a network, over a cable, or by reading the video data from a primary memory or other storage device, including a disk drive or removable media such as a CompactFlash (CF) card, Secure Digital (SD) card, or the like.
In some implementations, a frame of the video stream may be further subdivided into segments or slices that can be encoded and decoded separately. These segments or slices can represent subsets of the image data contained in a frame of the video stream or can represent the image data in multiple resolutions, for example. Disclosed implementations can operate on macroblocks in frames, segments or slices. Although examples used herein will refer to frames, the terms frames, segments or slices can be used interchangeably.
At step 504, a macroblock including, for example, 16×16 pixels is selected from the video frame. Macroblocks of a video frame can be selected in raster scan order starting from the upper left corner of the frame and proceeding along rows from left to right until the bottom right corner is reached, although other scan orders can be used. As used herein, the term “select” means to identify, construct, determine, specify or otherwise select in any manner whatsoever. At step 506, one or more motion vectors are calculated. A motion vector predicts the contents of a macroblock by comparing the contents of the macroblock with the translated contents of a corresponding block from a frame that can occur either before or in some cases after the frame containing the macroblock being processed (i.e., the current macroblock). If the difference between a translated macroblock from a temporally displaced frame and the current macroblock is sufficiently small, only the subtracted difference between the two macroblocks, called the residue or residual, is encoded. The encoded residue can include fewer bits when encoded, thereby saving bits in the encoded video stream.
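The residual described above can be illustrated with a short Python sketch. It assumes frames are stored as lists of rows of integer samples and that a motion vector is an `(dx, dy)` pair of whole-pixel offsets into the reference frame; these representations and the function name `block_residual` are assumptions for illustration, not part of the disclosure.

```python
def block_residual(current, reference, mv, x, y, size):
    """Subtract the motion-translated reference block from the current block.

    (x, y) is the top-left corner of the current block, and mv = (dx, dy)
    is the displacement applied when reading from the reference frame.
    """
    dx, dy = mv
    return [
        [
            current[y + row][x + col] - reference[y + row + dy][x + col + dx]
            for col in range(size)
        ]
        for row in range(size)
    ]
```

When the motion vector matches the true displacement, every entry of the residual is zero or close to zero, which is what makes the residual cheaper to encode than the block itself.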
At step 508, further bits in the encoded video stream can be saved by comparing the calculated motion vectors themselves to previously calculated motion vectors from the temporally displaced frame to see if direct motion vector prediction, also called direct prediction, can be used. For example, if the object represented in the portion of the video stream included in the current macroblock is moving smoothly and continuously with respect to the video frame, it can be expected that the motion vector that optimally predicts the current macroblock may also have predicted the temporally displaced macroblock. If the motion vector of the current block and the motion vector of the temporally displaced block are similar within predetermined limits, the previously calculated motion vector can be re-used. Therefore, the motion vector of the current macroblock does not have to be included in the video stream and the macroblock can be encoded using direct prediction.
At step 510, method of operation 500 has determined that the motion vector of the current macroblock is similar enough to the motion vector of the temporally displaced block that direct prediction can be used. The current macroblock is designated as a direct prediction macroblock. If direct prediction cannot be used in response to the query of step 508, the macroblock can instead be indicated as using a type of motion prediction other than direct prediction. In either case, processing advances to step 512, where method of operation 500 checks to see if any more macroblocks remain to be processed. If so, method of operation 500 loops back to select the next macroblock for processing at step 504. If no more macroblocks remain to be processed, method of operation 500 proceeds to step 514.
At step 514, each macroblock of the frame is again selected, for example, in raster scan order. At step 516, the current macroblock is tested to see if it has been designated at step 510 as being a direct prediction macroblock. If the current macroblock is a direct prediction macroblock, and if at least one adjacent macroblock is also a direct prediction macroblock, it is defined as a superblock by method of operation 500 and processing advances to step 518 to begin determining if other blocks should be added to the current superblock. If a macroblock is a direct prediction macroblock but is not adjacent to any other direct prediction macroblocks, the macroblock may be encoded and included in the bitstream as a macroblock rather than a superblock since there would be no bit savings in designating a macroblock as a superblock containing only one macroblock.
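The comparison made at step 508 can be sketched in Python as follows. The disclosure only says the vectors must be similar "within predetermined limits", so the per-component threshold `limit`, its default value and the `(dx, dy)` tuple representation are all assumptions chosen for illustration.

```python
def can_use_direct_prediction(current_mv, previous_mv, limit=1):
    """Return True when the two motion vectors differ by at most `limit`
    in each component, so the previous vector could be re-used."""
    dx = abs(current_mv[0] - previous_mv[0])
    dy = abs(current_mv[1] - previous_mv[1])
    return dx <= limit and dy <= limit
```

Other similarity measures, such as a bound on the Euclidean distance between the vectors, would fit the same description; the per-component test is simply the shortest one to write down.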
At step 518, method of operation 500 selects a macroblock from the frame adjacent to the current superblock and, at step 520, the selected macroblock is tested to see if it is a direct prediction macroblock. If the selected macroblock is a direct prediction macroblock, it is combined with the current superblock at step 522. At step 524, method of operation 500 checks to see if any more macroblocks remain adjacent to the current expanded superblock up to a size limit, such as 64×64 pixels. If yes, method of operation 500 loops back to step 518 to select the next adjacent macroblock. The particular order in which adjacent macroblocks are considered is not limited. In one example, each adjacent block is tested and combined with the current superblock along a row or column until a block is reached that is not a direct prediction block in response to the query of step 520 or until the current superblock has reached its size limit. Once done with the current superblock in step 524, processing advances to step 526 to determine whether there is a block of the frame not included in a superblock. If there is, method of operation 500 loops back to step 514, where the next macroblock not already included in a superblock is selected (e.g., in raster scan order).
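One possible reading of steps 514 through 526 is sketched below in Python. It assumes the frame is summarized as a grid of booleans marking direct prediction macroblocks, that superblocks are rectangular, and that growth proceeds first along the row and then down the columns; the function name `form_superblocks` and this particular growth order are illustrative assumptions, since the order in which adjacent macroblocks are considered is not limited. The default `max_side=4` corresponds to the 64×64 pixel limit with 16×16 macroblocks.

```python
def form_superblocks(direct, max_side=4):
    """Greedily group direct prediction macroblocks into rectangular superblocks.

    `direct[r][c]` is True when the macroblock at row r, column c was
    designated for direct prediction. Returns a list of
    (row, col, width, height) tuples measured in macroblocks.
    """
    rows, cols = len(direct), len(direct[0])
    used = [[False] * cols for _ in range(rows)]
    superblocks = []
    for r in range(rows):              # raster scan order
        for c in range(cols):
            if not direct[r][c] or used[r][c]:
                continue
            # Grow to the right along the current row.
            width = 1
            while (width < max_side and c + width < cols
                   and direct[r][c + width] and not used[r][c + width]):
                width += 1
            # Grow downward while the whole next row segment qualifies.
            height = 1
            while (height < max_side and r + height < rows
                   and all(direct[r + height][c + i] and not used[r + height][c + i]
                           for i in range(width))):
                height += 1
            if width * height == 1:
                continue  # a lone direct macroblock stays an ordinary macroblock
            for i in range(height):
                for j in range(width):
                    used[r + i][c + j] = True
            superblocks.append((r, c, width, height))
    return superblocks
```

A greedy scan like this does not necessarily find the largest possible superblocks; an encoder could trade additional search for better coverage.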
If all macroblocks have been selected and tested to determine if they can be included in a superblock, the macroblocks and superblocks of the frame can be encoded by computing device 12 using an encoder such as encoder 70 in step 528. At step 530, method of operation 500 can insert bits into the encoded video stream to indicate which superblocks have been encoded using direct prediction and to indicate the size of the superblocks in order to permit a decoder to properly decode the frame. As used herein, the term “indicate” means to signify, identify, determine, specify, designate or otherwise indicate in any manner whatsoever.
In one example, a one-bit field can be defined in the frame or segment/slice header to indicate the direct prediction mode according to the following definitions in Table 1.
In an implementation, four superblock modes can be defined as shown in Table 2, below.
Three out of the four possible superblock modes are mapped into three superblock variable-length codes of “0,” “10,” and “11”. The shortest 1-bit superblock variable length code, “0”, can be used to map the most often used mode, while the other two 2-bit superblock variable length codes will be used for the second and third most frequently used sizes, respectively. For the fourth and less frequently used superblock sizes, the superblock mode and its size can be specified by an indication field (e.g., indicating the coding mode) and two 2-bit fixed length fields as a multiplier to the macroblocks used to form the superblock. For example, a 32×32-pixel superblock can have two 2-bit multiplier fields of binary 01 in the X-direction and binary 01 in the Y-direction. In another example, a superblock with a rectangular size of 64×32 pixels can have two 2-bit fields including a binary 11 for the X-direction and binary 01 for the Y-direction. Examples of other superblocks sizes are described in relation to
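The two examples above (binary 01 for 32 pixels and binary 11 for 64 pixels) are consistent with each 2-bit field encoding one less than the number of 16-pixel macroblocks along that axis. The Python sketch below applies that mapping; the values it yields for binary 00 and binary 10 (16 and 48 pixels) are inferred from the pattern rather than stated in the text, and the function names are hypothetical.

```python
MACROBLOCK_SIZE = 16  # pixels per macroblock side, per the 16x16 example


def decode_dimension(field):
    """Convert a 2-bit multiplier field (0..3) to a length in pixels."""
    if not 0 <= field <= 3:
        raise ValueError("field must fit in two bits")
    return (field + 1) * MACROBLOCK_SIZE


def decode_superblock_size(x_field, y_field):
    """Return (width, height) in pixels for the X and Y multiplier fields."""
    return decode_dimension(x_field), decode_dimension(y_field)
```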
Table 3 is an example of a 19-bit frame or segment/slice header “1 10 01 01 00 10 10 01 11 01” that specifies three superblock modes.
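The following Python sketch parses the example header under one assumed layout that accounts for all 19 bits: a 1-bit direct prediction flag followed by three 6-bit entries, each holding a 2-bit superblock mode, a 2-bit X multiplier and a 2-bit Y multiplier, with the entries assigned in order to the variable length codes “0”, “10” and “11”. This layout, the meaning of a leading 0 bit, and the function name `parse_frame_header` are assumptions for illustration and should be checked against Tables 1, 2 and 3.

```python
MACROBLOCK_SIZE = 16  # pixels per macroblock side, per the 16x16 example


def parse_frame_header(bit_string):
    """Return {vlc_code: (mode, width_px, height_px)} for an assumed header layout."""
    bits = bit_string.replace(" ", "")
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("header must be a non-empty string of 0s and 1s")
    if bits[0] == "0":
        return {}  # assumed meaning: direct prediction not used in this region
    if len(bits) != 19:
        raise ValueError("expected 1 flag bit plus three 6-bit entries")
    table = {}
    for index, code in enumerate(("0", "10", "11")):
        entry = bits[1 + 6 * index: 7 + 6 * index]
        mode = int(entry[0:2], 2)
        width = (int(entry[2:4], 2) + 1) * MACROBLOCK_SIZE
        height = (int(entry[4:6], 2) + 1) * MACROBLOCK_SIZE
        table[code] = (mode, width, height)
    return table


header = parse_frame_header("1 10 01 01 00 10 10 01 11 01")
```

Under these assumptions the example header assigns a 32×32 superblock to code “0”, a 48×48 superblock to code “10” and a 64×32 superblock to code “11”.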
This design permits using three different modes out of the four, wherein the three modes can be mapped to use different superblock sizes. This permits using different superblock sizes to optimize the coding performance based on the statistics of each individual frame or segment/slice. It saves bits by introducing the capability to use the shortest 1-bit code to indicate the most often used superblock mode and its size. The choice of using different block sizes to represent a coded superblock is left to the optimization of different algorithms based on complexity and performance tradeoffs. Disclosed implementations use the largest superblock possible to cover the areas that can use direct prediction by the previous frame motion vectors while keeping the residual errors under a predetermined threshold to minimize total bit spending.
As the superblocks are encoded and inserted into the encoded video stream, the individual superblock headers will contain one of the superblock variable length codes “0”, “10” or “11” to indicate to the decoder how many macroblocks the superblock contains and how they are configured. In cases where the superblock is of a size and configuration not covered by the three codes given above, a code will be included that indicates the size and configuration of the superblock in four-bit format, examples of which are described in relation to
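Because “0”, “10” and “11” form a prefix code, a decoder can split a run of superblock codes without any separators. The Python sketch below shows this for the three codes only; it assumes the bits are given as a string of `0` and `1` characters, uses a hypothetical function name, and does not model how superblocks outside the three most frequent sizes are signalled.

```python
def read_superblock_codes(bits):
    """Split a bit string into the variable length codes "0", "10" and "11"."""
    codes = []
    pos = 0
    while pos < len(bits):
        if bits[pos] == "0":
            codes.append("0")          # 1-bit code: most frequent size
            pos += 1
        elif bits[pos] == "1":
            code = bits[pos:pos + 2]   # 2-bit code: second or third most frequent
            if len(code) < 2 or code[1] not in "01":
                raise ValueError("truncated or invalid code at bit %d" % pos)
            codes.append(code)
            pos += 2
        else:
            raise ValueError("unexpected character at bit %d" % pos)
    return codes


codes = read_superblock_codes("010110")
```

In this example four superblocks are identified with six bits, compared with eight bits if each used a fixed 2-bit code.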
The order in which steps included in method of operation 600 are presented is exemplary; the order of the steps can be changed without departing from the meaning of the disclosed implementations. For example, method of operation 600 shows macroblocks being selected and then decoded. However, all of the macroblocks can be selected or identified from the superblocks before being decoded.
At step 602, a frame of a video stream having superblocks is received by the computing device performing method of operation 600. At step 604, the frame, segment or slice header information describing the superblocks included in the frame as described in relation to Tables 1, 2 and 3, above, is determined by reading the header bits and decoding the information as described above.
At step 606, a block is selected from the frame. The block can be a macroblock or a superblock. Blocks can be read from the frame data in raster scan order as described above, or other scan orders can be used. If the block is not a superblock at next step 608, processing advances to step 610 to decode the macroblock (e.g., according to intra, inter or direct prediction). In contrast, if the block is a superblock at step 608, processing advances to step 612, where bits of the superblock header are read to determine the size and configuration of the superblock according to the encoding described above and in relation to
At step 618, a query is made to determine if all macroblocks belonging to the current superblock are decoded. If not, method of operation 600 loops back to step 614 to select another macroblock for processing. If all macroblocks of the current superblock are decoded, method of operation 600 tests at step 620 to see if all blocks of the current frame are decoded. If not, method of operation 600 loops back to select another block for processing at step 606. If all blocks are decoded, method of operation 600 ends.
As described in relation to Tables 1, 2 and 3 above, in cases where these superblock sizes are among the first, second or third most frequently used sizes in a frame, they can be denoted by a one- or two-bit code. Where the superblock sizes are the fourth or less frequently used sizes in a frame, they would be denoted by 4-bit codes, examples of which are shown above.
According to the teachings of the present disclosure, a variable-size superblock for direct mode prediction is disclosed. This allows the use of larger superblocks in direct mode prediction and thus saves mode indication overhead. It also gives the flexibility to use different sizes to cover different areas where motion vectors exhibit different behaviors. If direct mode will not save anything in a segment, a slice or a frame, the design provides flexibility to not use direct prediction mode at the slice, segment or frame level, thus spending only 1 bit for the given region. It gives the opportunity to fully exploit the direct prediction mode for bit savings frame by frame, slice by slice, or segment by segment.
The choice of using different block size to represent a coded superblock can be left to the optimization of different algorithms based on complexity and performance trade off. Desirably, a superblock as large as possible to cover the areas that can use direct prediction by the previous frame motion vectors can be used while keeping the residual errors under a certain threshold so that the total bit spending is reduced and/or minimized. Superblocks can also be implemented as non-rectangular, although the examples herein include rectangular superblocks. In such cases, bits in the superblock header may define a non-rectangular arrangement of macroblocks that are all to be encoded using direct prediction.
In some implementations, a macroblock not using direct prediction may be adjacent to or surrounded by macroblocks using direct prediction. In this case, the macroblock not using direct prediction can be changed to use direct prediction in order to permit a larger superblock to be constructed. The residual error of the macroblock that has been changed to direct prediction may be greater than without direct prediction. However, the savings in bits to be encoded in the bitstream due to forming a superblock can be large enough to fully offset the increase in bits due to the residual error, making such a change desirable.
The implementations of encoding and decoding described above illustrate some examples of encoding and decoding techniques. However, it is to be understood that encoding and decoding, as those terms are used in the claims, could mean compression, decompression, transformation, or any other processing or change of data.
The words “example” or “exemplary” are used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “example” or “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the words “example” or “exemplary” is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X includes A or B” is intended to mean any of the natural inclusive permutations. That is, if X includes A; X includes B; or X includes both A and B, then “X includes A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form. Moreover, use of the term “an embodiment” or “one embodiment” or “an implementation” or “one implementation” throughout is not intended to mean the same embodiment or implementation unless described as such.
The implementations of transmitting station 12 and/or receiving station 30 (and the algorithms, methods, instructions, etc., stored thereon and/or executed thereby, including by encoder 70 and decoder 100) can be realized in hardware, software, or any combination thereof. The hardware can include, for example, computers, intellectual property (IP) cores, application-specific integrated circuits (ASICs), programmable logic arrays, optical processors, programmable logic controllers, microcode, microcontrollers, servers, microprocessors, digital signal processors or any other suitable circuit. In the claims, the term “processor” should be understood as encompassing any of the foregoing hardware, either singly or in combination. The terms “signal” and “data” are used interchangeably. Further, portions of transmitting station 12 and receiving station 30 do not necessarily have to be implemented in the same manner.
Further, for example, transmitting station 12 or receiving station 30 can be implemented using a general purpose computer/processor with a computer program that, when executed, carries out any of the respective methods, algorithms and/or instructions described herein. In addition or alternatively, for example, a special purpose computer/processor can be utilized that can contain other hardware for carrying out any of the methods, algorithms, or instructions described herein.
Transmitting station 12 and receiving station 30 can, for example, be implemented on computers in a video conferencing system. Alternatively, transmitting station 12 can be implemented on a server and receiving station 30 can be implemented on a device separate from the server, such as a hand-held communications device (e.g., a cell phone). In this instance, transmitting station 12 can encode content using an encoder 70 into an encoded video signal and transmit the encoded video signal to the communications device. In turn, the communications device can then decode the encoded video signal using a decoder 100. Alternatively, the communications device can decode content stored locally on the communications device, i.e., content that was not transmitted by transmitting station 12. Other suitable transmitting station 12 and receiving station 30 implementation schemes are available. For example, receiving station 30 can be a generally stationary personal computer rather than a portable communications device and/or a device including an encoder 70 may also include a decoder 100.
Further, all or a portion of implementations of the present disclosure can take the form of a computer program product accessible from, for example, a computer-usable or computer-readable medium. A computer-usable or computer-readable medium can be any device that can, for example, tangibly contain, store, communicate, or transport the program for use by or in connection with any processor. The medium can be, for example, an electronic, magnetic, optical, electromagnetic, or a semiconductor device. Other suitable mediums are also available.
The above-described aspects, implementations and embodiments have been described in order to allow easy understanding of the present invention and do not limit the present invention. On the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims, which scope is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structure as is permitted under the law.
This application is a continuation of U.S. patent application Ser. No. 13/570,496, which was filed Aug. 9, 2012.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 13570496 | Aug 2012 | US |
Child | 15143940 | | US |