The invention relates in general to digital video encoding and decoding and in particular to digital video encoding and decoding which exploits frame-level parallelism.
Video compression is a process where, instead of transmitting a full set of data for each picture element or pixel on a display for each frame, a greatly reduced amount of data can be coded, transmitted, and decoded to achieve the same perceived picture quality. Generally, a pixel is a small dot on a display wherein hundreds of thousands of pixels make up the entire display. A pixel can be represented in a signal as a series of binary data bits. Compression of data often utilizes the assumption that data for a single pixel can be correlated with a neighboring pixel within the same frame and can also be correlated with the corresponding pixel in successive frames. A frame is a segment of data required to display a single picture or graphic. A series of consecutive frames is required to make a video image.
Since a value of a pixel can be predicted with some statistical error using neighboring pixels or pixels in consecutive frames, most video encoders use a two-stage hybrid coding scheme to compress and decompress video signals. Such a hybrid process combines a spatial transform coding for a single frame (reproducing pixel data based on neighboring pixels) with temporal prediction for the succession of frames (reproducing pixel data as it changes between frames).
Spatial transform coding can reduce the number of bits used to describe a still picture. Spatial transformation or intra-coding can include transforming image data from a spatial domain into a frequency-domain utilizing a discrete cosine transform (DCT), wavelets, or other processes. Subsequently, resulting coefficients can be quantized where low-frequency coefficients usually have a higher precision than high frequency coefficients. Afterwards, lossless entropy coding can be applied to the coefficients. By using transform coding, significant lossy image compression can be achieved whose characteristics can be adjusted to provide a pleasing visual perception for viewers.
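By way of illustration only, the transform-and-quantize step described above can be sketched as follows. The 8x8 block size, the floating-point DCT, and the two uniform quantization steps are assumptions made for this sketch; actual standards define their own integer transforms and quantization tables.

```python
import numpy as np

N = 8  # assumed block size

def dct_matrix(n=N):
    # Orthonormal DCT-II basis: C[k, i] = s(k) * cos(pi * (2i + 1) * k / (2n))
    c = np.zeros((n, n))
    for k in range(n):
        s = np.sqrt(1.0 / n) if k == 0 else np.sqrt(2.0 / n)
        for i in range(n):
            c[k, i] = s * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    return c

def transform_and_quantize(block, qstep_low=8, qstep_high=32):
    C = dct_matrix()
    coeffs = C @ block @ C.T                          # 2-D DCT of the block
    freq = np.add.outer(np.arange(N), np.arange(N))   # crude frequency measure
    # Higher precision (finer step) for low-frequency coefficients.
    qsteps = np.where(freq < N, qstep_low, qstep_high)
    return np.round(coeffs / qsteps).astype(int), qsteps

def dequantize_and_invert(levels, qsteps):
    C = dct_matrix()
    return C.T @ (levels * qsteps) @ C                # inverse 2-D DCT

block = np.linspace(0, 255, 64).reshape(N, N)         # toy luminance block
levels, qsteps = transform_and_quantize(block)
recon = dequantize_and_invert(levels, qsteps)
print(np.abs(recon - block).max())                    # reconstruction error caused by quantization
```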
Likewise, temporal prediction in streaming video can provide intra-coded frames to establish a baseline refresh, and then successive frames can be described digitally by their difference from another frame. This process is referred to as “inter-coding.” The “difference” data or signal, which has significantly less energy (which results in less data) than the full data set, is usually transformed and quantized similarly to the intra-coded signal but with different frequency characteristics. Inter-coding can provide a superior result over intra-coding if motion compensated prediction is combined with inter-coding. In this case, an unrestricted reference frame area in a reference frame is searched to locate an area which matches as closely as possible the texture of an area to be coded for a current frame. Then, the difference signals and the calculated motion can be transmitted in an inter-coded format. Traditional systems often restrict certain regions from being utilized as baseline data to reduce error propagation. All such encoding (i.e., intra-coding and inter-coding) is often referred to generically as data compression.
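Likewise by way of illustration only, motion-compensated prediction for a single block can be sketched as a full search over a reference frame area. The 16x16 block size, the ±8-pixel search range, and the sum-of-absolute-differences (SAD) cost are assumptions of this sketch, not requirements of any standard.

```python
import numpy as np

def motion_estimate(cur, ref, bx, by, bsize=16, search_range=8):
    """Find the motion vector (dx, dy) that best predicts the block at (bx, by)."""
    h, w = ref.shape
    target = cur[by:by + bsize, bx:bx + bsize].astype(int)
    best_mv, best_sad = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > h or x + bsize > w:
                continue          # candidate lies outside the reference frame area
            sad = np.abs(target - ref[y:y + bsize, x:x + bsize].astype(int)).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    dx, dy = best_mv
    prediction = ref[by + dy:by + dy + bsize, bx + dx:bx + dx + bsize].astype(int)
    residual = target - prediction            # "difference" signal to be coded
    return best_mv, residual
```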
Inter-coded transmissions utilize predicted frames, where predicted frames occur when the full set of data is not transmitted, but information regarding how the frame differs from a reference frame is utilized to “predict” and correspondingly construct the current frame. As stated above, intra-frame encoding is the creation of encoded data from a single image frame, while inter-frame encoding is the creation of encoded data from two or more consecutive image frames. The temporal prediction of frames is theoretically lossless, but such prediction can lead to a serious degradation in video quality when transmission errors occur and these transmission errors get replicated in consecutive frames. For example, if an error occurs in some content and subsequent transmissions rely on the content to predict future data, the error can multiply, causing widespread degradation of the video signal. In order to avoid infinite error propagation, intra-coded reference frames can be utilized to periodically refresh the data. The intra-coded data can be decoded “error-free” because it is independent of the previous possibly corrupt frames. Furthermore, the intra-coded frames can be used as an “entry point” or start point for decoding a compressed video data stream.
In short, a video stream consists of a series of frames. Compressed video signals consist of intra-coded frames (“intra-frames”) and inter-coded frames (“inter-frames”). Intra-frame coding compresses each frame individually, where each frame is intra-coded without any reference to other frames. However, inter-frames are coded based on some relationship to other frames and exploit the interdependencies of a sequence of frames using prediction and compensation as described above. Intra-frames are termed I-frames. Inter-frames are termed P-frames if one reference frame is used for the prediction. Inter-frames are termed B-frames if two or more reference frames are used.
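As a purely illustrative example, a fixed GOP pattern can be used to assign a coding type to each frame; the pattern below is an assumption made for the sketch, not a pattern mandated by any standard.

```python
def frame_type(frame_index, gop_pattern="IBBPBBPBBPBB"):
    """Return 'I', 'P', or 'B' for a frame by repeating a fixed GOP pattern."""
    return gop_pattern[frame_index % len(gop_pattern)]

# Frames 0..13 -> I B B P B B P B B P B B I B
print("".join(frame_type(n) for n in range(14)))
```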
In an encoder as shown in
If the encoder shown in
Some video standards allow combining inter-prediction and intra-prediction within a predicted frame at the macroblock level. That is, in this case some macroblocks can be intra-coded and some can be inter-coded. However, to ease understanding of the present invention, a combination of intra- and inter-prediction within a frame is not explained further. The lack of further explanation, however, must not be understood to limit a scope of the present invention.
The reconstructed difference frame 225 is computed from the output signal by applying inverse quantization and inverse transformation using the inverse quantization module 222 and the inverse orthogonal transformation module 224, respectively. Deblocking modules and other optional modules (depending on the video standard used) are not explained further but must not be understood to limit a scope of the present invention. The reconstructed current frame 227 is generated by an adder 226 from the predicted frame 213 and the reconstructed difference frame 225. The reconstructed current frame 227 then is stored in the frame buffer 230 which also holds the reconstructed previous frame 231.
A frame can be divided into slices. A slice can be a horizontal area of a frame. A simple example of such a slice of a frame is to horizontally split the frame in the middle. The upper part of the frame and the lower part of the frame can be of equal size and can be encoded or, if supported by an encoded video stream, decoded independently. The independent decoding requires that there are no dependencies between slices of a frame.
A macroblock is a block of data that describes a group of spatially adjacent pixels. A macroblock usually defines pixels in a rectangular region of the frame where the data in a macroblock can be processed together and somewhat separately from other macroblocks. Thus, a frame can be divided into numerous macroblocks and macroblocks are often defined in a matrix topology where there are x and y macroblocks. A macroblock can therefore have a designation such as, for example, (2, 3) where x and y can range from 1 to Z.
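By way of illustration, the macroblock matrix topology can be sketched as follows; the 16x16-pixel macroblock size and the 1-based (x, y) designation are assumptions chosen to match the example above.

```python
def macroblock_grid(frame_width, frame_height, mb_size=16):
    """Number of macroblocks in x and y for a frame (rounded up)."""
    return (frame_width + mb_size - 1) // mb_size, (frame_height + mb_size - 1) // mb_size

def macroblock_of_pixel(px, py, mb_size=16):
    """Map a pixel coordinate to its macroblock designation (x, y), 1-based."""
    return px // mb_size + 1, py // mb_size + 1

print(macroblock_grid(1920, 1080))   # (120, 68)
print(macroblock_of_pixel(30, 40))   # (2, 3) -- the designation used in the text
```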
Often, the most obvious distortion or artifacts in a video stream are the appearance of square or rectangular blocks in the decoded frame. Such artifacts are characteristic of block-based compression methods that make use of macroblocks. To reduce blocking distortion, a deblocking filter is applied to each decoded macroblock. The deblocking filter is applied after the inverse transformation in the encoder and in the decoder. The appearance of the decoded frames is improved by smoothing the block edges. The frame filtered by this technique is used for motion-compensated prediction of future images.
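The effect of a deblocking filter can be illustrated with the following sketch, which merely low-pass filters the samples on either side of each block edge; the normative deblocking filters of standards such as ITU-T H.264 are adaptive and considerably more elaborate than this assumed, simplified filter.

```python
import numpy as np

def deblock(frame, block_size=16):
    """Smooth samples across block edges; assumes dimensions are multiples of block_size."""
    f = frame.astype(float)
    out = f.copy()
    h, w = f.shape
    for x in range(block_size, w, block_size):       # vertical block edges
        out[:, x - 1] = (f[:, x - 2] + 2 * f[:, x - 1] + f[:, x]) / 4
        out[:, x] = (f[:, x - 1] + 2 * f[:, x] + f[:, x + 1]) / 4
    for y in range(block_size, h, block_size):       # horizontal block edges
        out[y - 1, :] = (out[y - 2, :] + 2 * out[y - 1, :] + out[y, :]) / 4
        out[y, :] = (out[y - 1, :] + 2 * out[y, :] + out[y + 1, :]) / 4
    return out
```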
A Group of Pictures (GOP) is a group of successive frames within a coded video stream. The GOP specifies the order in which frames are arranged. The frames of the GOP can be intra-frames or inter-frames. An arrangement of frames of subsequent GOPs can be kept constant for the whole video stream or for parts of the video stream.
Although state-of-the-art video compression standards provide high compression rates and even allow one to maintain high picture quality, the compression standards require substantial computational resources to encode the frames with a reasonable size and quality. High data throughput and high processing power are required. Therefore, parallelization is of high importance in architectures used for video encoding and decoding. However, common standards for video encoding and decoding strongly restrict the degree of parallelization. Therefore, one of the basic problems in video encoder and decoder development is to find the best way to split the encoding task of a video stream for parallel processing. A variety of parallelism methods are available in the art. Any such method has to handle the high data throughput and high processing power requirements. However, some disadvantages of known methods are asymmetric usage of the parallel processing units, reduced compression efficiency, and complex handling of boundary regions between parallel-processed data blocks. The present invention presents a new parallelization method that overcomes disadvantages of known solutions.
Known kinds of parallelization mechanisms exploit parallelism in space, function, and time. In spatial parallelism, a frame is split into regions where each region is processed independently from one another. In functional parallelism, each processing unit runs different functions. For instance, one processing unit runs motion estimation and another processing unit runs a discrete cosine transformation, etc. Temporal parallelism exploits time displacement of frames. These kinds of parallelization mechanisms may be applied separately or can be combined.
Nang et al. discloses, in “An Effective Parallelizing Scheme of MPEG-1 Video Encoding,” state-of-the-art parallelization methods which are in common use and to which practitioners are often referred. Four parallelization methods are identified as attractive parallelisms and are discussed in a series of experiments: macroblock-level, slice-level, frame-level, and GOP-level parallelism. Although published in 1997, this publication of Nang et al. is still cited by a series of other publications in the art and the parallelization methods discussed are still applied in today's video encoders and decoders.
Iwata et al. describes in U.S. Pat. No. 6,870,883, entitled “Parallel Encoding and Decoding Processor System and Method,” a macroblock-level and slice-level parallel method which uses parallel processors to encode or decode macroblocks of a frame in parallel.
In general, macroblock parallelism enables encoding and decoding on parallel processing systems. However, strong dependencies exist between successive macroblocks. Further, macroblock parallelism results in a large amount of data that has to be transferred between processors. Due to large differences in the computational time between macroblocks, the method disclosed by Iwata et al. suffers from poor processor load balancing. Other implementations which exploit macroblock-level parallelism even need additional and sophisticated workload balancing mechanisms to distribute the tasks among the processors.
Another method of parallelization is slice-level parallelism. Each frame is divided into two or more slices. Common video standards enable one to independently encode or decode the slices of a frame. The data dependencies between the slices are removed by not applying deblocking or prediction beyond slice borders. This results in greatly reduced compression efficiency and visual artifacts at slice borders. Optional deblocking at slice borders must be done sequentially, without efficient usage of the parallel units.
Knee et al. discloses in U.S. Pat. No. 5,640,210 entitled “High Definition Television Coder/Decoder which Divides an HDTV Signal into Slices for Individual Processing,” a method and apparatus that exploits slice-level parallelism. Each slice is coded or decoded by a single sub-coder or sub-decoder. Every Nth slice is coded or decoded by the same sub-coder/sub-decoder. The method of Knee et al. is shown in
Kimura et al. discloses in U.S. Pat. No. 7,701,160 entitled “Image Encoding and Decoding Apparatus,” an image encoding or decoding apparatus having a circuit for dividing a frame into slices to conduct an encoding or decoding operation for each slice, a code integrating circuit for integrating the codes of the respective slices, and a shared memory for storing frames locally.
However, problems of approaches exploiting slice-level parallelism appear at the slice boundaries: no prediction is possible beyond slice boundaries and with each additional slice the efficiency of exploiting spatial or temporal similarities is reduced. Furthermore, deblocking at the slice boundaries cannot be performed since it requires data from both slices for the filter operations. This results in visible blocking artifacts at the slice borders (sharp horizontal borders are visible in the image).
GOP-level parallelism is a kind of temporal parallelism. A decoder which exploits GOP-level parallelism splits the incoming stream into independent GOPs and decodes them on independent decoders. The parallel-decoded GOPs are put in the correct order by an output module. Such a system is disclosed in U.S. Pat. No. 5,883,671 to Keng et al. entitled “Method and Apparatus for Partitioning Compressed Digital Video Bitstream for Decoding by Multiple Independent Parallel Decoders.”
Tiwari et al. discloses in U.S. Pat. No. 5,694,170 entitled “Video Compression Using Multiple Computing Agents,” a system and method which uses multiple processors to perform video compression. A video sequence (i.e., an input signal) is partitioned into sub-sequences. Processing assignments for the sub-sequences are distributed among a plurality of processors. A frame type (whether it is an I-frame, P-frame, or B-frame) is then determined for each frame in each sub-sequence. Each frame is then compressed in accordance with its associated frame type.
Advantages of GOP-level parallelism include easy implementation and the ability to easily circumvent complex dependencies among the frames. However, disadvantages of GOP-level parallelism include a very high latency since parallel processing units operate on whole GOP data sets (thereby making such a system unsuitable for real-time live video streaming), high memory requirements, and difficult rate control issues.
Hence, a favorable parallelization method would have lower memory requirements while providing high throughput and efficient use of the processing power of the parallel processing elements. Moreover, favorable methods of parallelism should support real-time streaming and even enable high picture quality without artifacts such as those that occur in known methods which exploit slice-level parallelism.
Frame-level parallelism allows parallel encoding and/or decoding of whole frames instead of groups of frames (GOPs) or parts of frames (e.g., slices or macroblocks). The most obvious issue in frame-level parallelism is the dependency of predicted frames on previously reconstructed frames.
A method and processor to encode and/or decode a video stream exploiting frame-level parallelism is disclosed. The frames of the video stream are encoded and/or decoded using M processing units where each processing unit processes one different frame at a time. Each processing unit writes the reconstructed frame to a frame buffer. The next processing unit can use the reconstructed frame from that frame buffer as a reference frame.
The parallel encoding or decoding stages may be coupled via the frame buffers. A stage can read data from the input and reconstructed previous frames from previous stages by accessing the frame buffers of previous stages and can write a reconstructed frame to a frame buffer which is assigned to the current processing unit. A processing unit which runs an encoder or a decoder can start the encoding and/or decoding process once sufficient data of the reconstructed previous frame are available. A stream merging unit merges the output of all stages to at least one output stream.
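The coupling of the stages through the frame buffers can be illustrated with the following synchronization sketch. The names (SharedFrameBuffer, publish_row, wait_for_rows), the use of one thread per processing unit, the per-macroblock-row progress tracking, and the one-row look-ahead for the motion search are all assumptions made for the sketch; only the ordering constraints are modeled, not the actual encoding work or the stream merging.

```python
import threading

class SharedFrameBuffer:
    """Tracks, per reconstructed frame, how many macroblock rows are available."""
    def __init__(self):
        self.cond = threading.Condition()
        self.rows_ready = {}                       # frame index -> rows written

    def publish_row(self, frame, row):
        with self.cond:
            self.rows_ready[frame] = row + 1
            self.cond.notify_all()

    def wait_for_rows(self, frame, rows_needed):
        with self.cond:
            self.cond.wait_for(lambda: self.rows_ready.get(frame, 0) >= rows_needed)

def stage(stage_idx, num_stages, num_frames, rows_per_frame, buffers):
    own_buf = buffers[stage_idx]
    ref_buf = buffers[(stage_idx - 1) % num_stages]           # previous stage's buffer
    for frame in range(stage_idx, num_frames, num_stages):    # every M-th frame
        for row in range(rows_per_frame):
            if frame > 0:
                # Start only when the previous stage has reconstructed the rows
                # of the reference frame that this row's motion search may touch.
                ref_buf.wait_for_rows(frame - 1, min(row + 2, rows_per_frame))
            # ... encode macroblock row `row` of `frame` here (omitted) ...
            own_buf.publish_row(frame, row)

M, FRAMES, ROWS = 3, 9, 68
buffers = [SharedFrameBuffer() for _ in range(M)]
threads = [threading.Thread(target=stage, args=(k, M, FRAMES, ROWS, buffers))
           for k in range(M)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("all", FRAMES, "frames reconstructed in pipeline order")
```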
In one exemplary embodiment, the present invention is a video encoding apparatus which includes a divider circuit configured to divide an input video stream into a plurality of single frames and a plurality of encoding units coupled to the divider. Each of the plurality of encoding units is configured to receive at least one of the plurality of single frames. Each of a plurality of shared information memory units is coupled to and associated with at least one of the plurality of encoding units. Each of the plurality of encoding units is configured to write a reconstructed frame to at least one of the plurality of shared information memory units, and each of the plurality of shared information memory units is configured to distribute a frame reproduced by an associated encoding unit to at least one other encoding unit. A control unit is coupled to the plurality of encoding units and configured to determine which frame of a sequence of frames is processed by which of the plurality of encoding units. A stream merging unit is configured to assemble the coded signals produced by the encoding units into at least one output stream.
In another exemplary embodiment, the present invention is a method of encoding a video sequence exploiting frame-level parallelism. The method includes partitioning a video sequence into a plurality of frames, distributing the plurality of frames among a plurality of encoding units, encoding the plurality of frames according to a picture type assigned to each of the plurality of frames, assigning at least one shared information memory unit from a plurality of shared information memory units to each of the plurality of encoding units, storing reconstructed frames created by the plurality of encoding units to at least one of the plurality of shared information memory units, controlling which frame of the plurality of frames is processed by a specific one of the plurality of encoding units, and determining when a sufficient amount of data of the reconstructed frames is available for encoding in each of the information memory units.
In another exemplary embodiment, the present invention is a video decoding apparatus coupled to receive an encoded input video stream having a sequence of coded frames. The apparatus includes a divider circuit configured to divide the input video stream into a plurality of single coded frames and transfer at least one of the plurality of single coded frames to a plurality of decoding units and a plurality of shared information memory units. Each of the plurality of shared information memory units is coupled to and associated with at least one of the plurality of decoding units. Each of the plurality of decoding units is configured to write a reconstructed frame to at least one of the plurality of shared information memory units and each of the plurality of shared information memory units is configured to distribute a frame reproduced by an associated decoding unit to at least one other decoding unit. A control unit is coupled to the plurality of decoding units and configured to determine which frame of a sequence of frames is processed by which of the plurality of decoding units. A stream merging unit is configured to assemble the decoded signals produced by the decoding units into at least one output stream.
In another exemplary embodiment, the present invention is a method of decoding a coded video sequence exploiting frame-level parallelism. The method includes partitioning a video sequence into a plurality of coded frames, distributing the plurality of coded frames among a plurality of decoding units, decoding the plurality of coded frames according to a picture type assigned to each of the plurality of coded frames, assigning at least one shared information memory unit from a plurality of shared information memory units to each of the plurality of decoding units, storing reconstructed frames created by the plurality of decoding units to at least one of the plurality of shared information memory units, controlling which frame of the plurality of coded frames is processed by a specific one of the plurality of decoding units, and determining when a sufficient amount of data of the reconstructed frames is available for decoding in each of the information memory units.
The appended drawings illustrate exemplary embodiments of the invention and are not to be considered as limiting a scope of the present invention.
In the following description, a new method and apparatus to encode and/or decode a video stream exploiting frame-level parallelism is disclosed. The frames of the video stream are encoded and/or decoded using M processing units where each processing unit processes one different frame at a time. Each processing unit can write a reconstructed frame to a frame buffer. A subsequent processing unit can use the reconstructed frame from that frame buffer as a reference frame. Hence, the processing units are connected via the frame buffers. The frame processing occurs time-displaced and can start when sufficient input data are available and, if necessary, when sufficient data of the reconstructed previous frame which was calculated in a previous stage are available.
As explained above with reference to
Together, the encoder stage 1 and the encoder stage 2 shown in
When an encoder has coded a certain amount of image data of a frame N, e.g., encoder 1 has encoded and reconstructed the upper half of input image N, in most cases enough reconstructed image data are available for the motion estimation step in the encoding process of image N+1. The reference frame area for the motion estimation in the encoding process of image N+1, thus, has to be limited to the image data already processed.
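A minimal sketch of this limitation follows, assuming the motion search for a given macroblock row may look at most a fixed number of rows below it; the helper name and the row-based bookkeeping are illustrative assumptions only.

```python
def allowed_search_bottom(mb_row, rows_reconstructed, search_rows=1):
    """Lowest macroblock row of the reference frame the motion search may use."""
    wanted = mb_row + search_rows                 # unrestricted vertical limit
    return min(wanted, rows_reconstructed - 1)    # clamp to reconstructed data

# Frame N+1, macroblock row 3: if rows 0..33 of frame N are already
# reconstructed, the search window is not clipped ...
print(allowed_search_bottom(mb_row=3, rows_reconstructed=34))   # -> 4
# ... but if only rows 0..2 are reconstructed so far, it is clipped to row 2.
print(allowed_search_bottom(mb_row=3, rows_reconstructed=3))    # -> 2
```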
Once the second frame is received, encoder P2 can start the encoding process of frame 2. In other embodiments of the disclosure, the encoder P2 can start with the encoding process when only portions of the frame are available. Hence, the encoder P2 can use just parts of the reconstructed frame 1 calculated by encoder P1. Therefore, both encoders use a limited reference frame area as discussed above. The encoder P2 can use up to 2T of time to encode a frame as well. As soon as the encoder P1 has completed the encoding of frame 1 it can start with the encoding process of frame 3 using the reconstructed frame 2 stored in the frame buffer 330-2 (
The encoder stage 1 can forward its reconstructed frame to the encoder stage 2 using the first frame buffer 430-1, the encoder stage 2 can forward its reconstructed frame to the encoder stage 3 using the second frame buffer 430-2, and the encoder stage 3 can forward its reconstructed frame back to the encoder stage 1 using the third frame buffer 430-3.
In general, the input video streams can be distributed on M processors. Each frame can be encoded by a single processor. No slices within frames are required and, hence, no boundary effects appear. The encoding process of a frame, and even the decoding process in a decoder, can start once two conditions are fulfilled: sufficient data of the input frame and sufficient data of the reference frame (the reconstructed previous frame) must be available. The portion of the reference frame which is required for the encoding or decoding process is the reference frame area which is scanned for motion estimation. Motion estimation only needs the reference frame area of the reconstructed reference frame to determine the motion information. In the embodiment shown in
Another issue in video encoding and decoding is the scheduling of deblocking. Deblocking is required in some profiles of video standards like ITU H.264 and is used to enhance the image quality and to remove blocking artifacts as discussed above. The H.264 standard defines that deblocking operations shall be performed after the complete frame has been reconstructed. This means that the whole reference frame must be stored and deblocking can only start after the last macroblock in the frame has been processed. However, if the macroblocks are processed in a strictly increasing order (each line from left to right and the lines from top down to the bottom) the deblocking operation can be performed immediately after a macroblock has been reconstructed. This approach is called deblocking-on-the-fly and is state-of-the-art. Deblocking-on-the-fly leads to lower memory transfers and lower memory requirements but can only be applied to whole frames.
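The difference between deferred deblocking and deblocking-on-the-fly can be sketched as follows; reconstruct_mb() and deblock_mb_edges() are placeholder names assumed for the sketch and stand in for the actual per-macroblock operations.

```python
def reconstruct_mb(mb_x, mb_y):
    pass          # placeholder: prediction, inverse transform, reconstruction

def deblock_mb_edges(mb_x, mb_y):
    pass          # placeholder: filter the left and top edges of the macroblock

def decode_frame_on_the_fly(mbs_x, mbs_y):
    for mb_y in range(mbs_y):
        for mb_x in range(mbs_x):                 # strict raster order
            reconstruct_mb(mb_x, mb_y)
            # Left and top neighbours are already reconstructed, so the edges of
            # this macroblock can be filtered immediately: no second pass over
            # the frame and no extra copy of the unfiltered frame is needed.
            deblock_mb_edges(mb_x, mb_y)

def decode_frame_deferred(mbs_x, mbs_y):
    for mb_y in range(mbs_y):
        for mb_x in range(mbs_x):
            reconstruct_mb(mb_x, mb_y)
    for mb_y in range(mbs_y):                     # second pass over the frame:
        for mb_x in range(mbs_x):                 # the whole frame must be kept
            deblock_mb_edges(mb_x, mb_y)
```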
Although slice-level parallelism enables high parallelism, deblocking cannot be performed at the slice boundaries. Existing approaches use tricks to overlap the slice areas to overcome this problem. However, the deblocking of a lower slice can start only when deblocking of the upper slice is completed. Hence, deblocking-on-the-fly cannot be performed for all slices of a frame in slice-level parallelism. Consequently, this leads to higher memory usage and higher memory transfers as all slices that could not be deblocked on the fly have to be reloaded and deblocked after the whole frame has been reconstructed.
One of the advantages of the proposed method and system of the present invention is that deblocking-on-the-fly is enabled, since whole frames are processed instead of slices. Therefore, the memory transfers can be kept small. Moreover, as the present invention uses frame-level parallelism, no slices are required, which leads to a better coding efficiency. In addition to the aforementioned advantages, the present method has lower latency than GOP-level parallelism. The degraded coding efficiency of slice-level parallelism is analyzed by Chen et al. in “Towards Efficient Multi-Level Threading of H.264 Encoder on Intel Hyper-Threading Architectures.”
Each of the M processors (in
Each processor writes the reconstructed frame to a memory which can be accessed by one or more other processors. According to
The output streams of the processors are forwarded to stream buffers 142, 144, 146 and a stream merging unit 112 merges the single streams to a single output stream. In
The controller 114 controls the M processors (in the example of
In the example shown in
The first frame in the example depicted above by means of the processing architecture given in
However, the assignment of intra-mode frames need not be static as the handling of the next frame shows. Once the processor P1 122 has finished the encoding process of the first frame it can wait until the whole fourth frame (or wait until sufficient data of the fourth frame) is available. As the fourth frame has to be encoded as a P-frame, the processor P1 122 also waits until sufficient data of the reconstructed frame are available from the processor P3 126. The processor P1 122 then writes the output stream to the stream buffer 142 and the reconstructed frame to the first memory 132. In particular embodiments, the reconstructed frame is not written to the first memory 132 in a case where it is not needed. This can happen, for example, when the next frame (as the fifth frame in the example shown in
The architectures of
Other advantages of the disclosed method are that no slices are required for parallelization, resulting in a better coding performance and better image quality as artifacts such as horizontal lines do not appear in the image. Moreover, deblocking-on-the-fly is possible, resulting in fewer memory transfers. Another advantage over GOP-level parallelism is that the present method has lower latency. The processors have M times T of time to compute a frame. Since the method of the invention exploits temporal parallelism, other methods of parallelism such as spatial parallelism or functional parallelism can still be applied.
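The latency advantage over GOP-level parallelism can be made concrete with a back-of-the-envelope sketch under simplified, assumed conditions: frames arrive every T seconds, each of the M processing units may take up to M·T per frame (frame-level) or up to M·N·T per GOP of N frames (GOP-level), and output becomes available only when a unit finishes its work. The figures below are illustrative, not measurements.

```python
def frame_level_latency(num_units, frame_period):
    return num_units * frame_period                 # at most M*T per frame

def gop_level_latency(num_units, gop_length, frame_period):
    return num_units * gop_length * frame_period    # a unit may hold a whole GOP

M, N, T = 3, 12, 1 / 30          # 3 units, 12-frame GOPs, 30 frames per second
print(f"frame-level parallelism: {frame_level_latency(M, T) * 1000:.0f} ms")   # ~100 ms
print(f"GOP-level parallelism:   {gop_level_latency(M, N, T) * 1000:.0f} ms")  # ~1200 ms
```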
The only limitations to the decoders according to the present invention shown in
Despite these limitations the decoders of the present invention have the same advantages as the encoders and allow one to exploit frame-level parallelism and to combine frame-level parallelism with other forms of parallelism, e.g., spatial parallelism and/or functional parallelism. Moreover, artifacts known from slice-level parallelism are avoided and deblocking-on-the-fly is possible.
The hardware architectures as shown in
The present invention is described above with reference to specific embodiments thereof. It will, however, be evident to a skilled artisan that various modifications and changes can be made thereto without departing from the broader spirit and scope of the present invention as set forth in the appended claims. For example, particular embodiments describe a number of processors and memory units per stage. A skilled artisan will recognize that these numbers are flexible and the quantities shown herein are for exemplary purposes only. Additionally, a skilled artisan will recognize that various numbers of stages may be employed for various video stream sizes and applications. These and various other embodiments are all within a scope of the present invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
This application claims priority from U.S. Provisional Patent Application Ser. No. 60/871,141 entitled “Method and Apparatus for Encoding and Decoding of Video Streams,” filed Dec. 21, 2006 and which is hereby incorporated by reference in its entirety.