Embodiments relate generally to image encoding methods, image decoding methods, image encoding apparatuses, and image decoding apparatuses.
Software-based real-time video encoding is challenging on current desktop computers due to the encoder complexity and the large amount of video data to be processed. The introduction of H.264/AVC (Advanced Video Coding) has demonstrated significant improvement in video compression performance over previous coding standards such as H.263++ and MPEG-4. Recently, the Joint Video Team of the ITU-T VCEG and the ISO/IEC MPEG has also standardized Scalable Video Coding (SVC) as an extension of the H.264/AVC standard to provide efficient support for spatial, temporal and quality scalability. Though video scalability techniques have been proposed in the past, such as the scalable profiles for MPEG-2, H.263 and MPEG-4 Visual, they are less efficient and more complex than SVC. The coding efficiency gains of H.264 and SVC come at the price of high computational complexity, which is challenging for software encoders on personal computers. The H.264 encoder is 8 times more complex than the MPEG-2 encoder, and 5 to 10 times more complex than the H.263 encoder. Furthermore, with the trend towards high definition video, current high-performance uniprocessor architectures are not capable of performing real-time encoding using H.264 or SVC. In W. S. Lee, Y. H. Tan, J. Y. Tham, K. Goh, and D. Wu, “Lacing an improved motion estimation framework for scalable video coding”, in: ACM International Conference on Multimedia, pp. 769-772, October 2008, a method of parallel encoding of macroblocks has been investigated.
In the drawings, like reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments of the invention are described with reference to the following drawings, in which:
In various embodiments, encoding of an image may be performed in various steps, for example forward motion estimation, backward motion estimation, and intra-prediction. In various embodiments, the encoding is parallelized by performing any one of the steps, for which the corresponding input data is available, even if input data for the other steps is not available. Thereby, encoding of an image may be started before all of the input data is available, which allows for a higher degree of parallelization compared to commonly used methods.
In various embodiments, an image may be an image that is part of a sequence of images.
In various embodiments, an image may be an image of a video sequence.
In various embodiments, each image may have relations to other images in the temporal dimension, temporal direction, or time.
The following detailed description refers to the accompanying drawings that show, by way of illustration, specific details and embodiments in which the invention may be practiced. Other embodiments may be utilized and structural, logical, and electrical changes may be made without departing from the scope of the invention. The various embodiments are not necessarily mutually exclusive, as some embodiments can be combined with one or more other embodiments to form new embodiments. The following detailed description therefore, is not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.
Various embodiments are provided for devices or apparatuses, and various embodiments are provided for methods. It will be understood that the basic properties of the devices also hold for the methods, and vice versa. Therefore, for the sake of brevity, duplicate description of such properties may be omitted.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration”. Any embodiment or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or designs.
The image encoding apparatus according to various embodiments may include a memory which is for example used in the processing carried out by the image encoding apparatus. A memory used in the embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
The image decoding apparatus according to various embodiments may include a memory which is for example used in the processing carried out by the image decoding apparatus. A memory used in the embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
In various embodiments, in the first partial encoding step 102, partially encoded image data may be generated based on first input data after the entire first input data is available.
In various embodiments, in the second partial encoding step 104, second partially encoded image data may be generated based on second input data after the entire second input data is available, even before the entire first input data is available.
In various embodiments, the first input data may include at least one of encoded image data of a preceding image and encoded image data of a successive image.
In various embodiments, a preceding image may be an image that, in a temporal sequence of images to be encoded, is preceding the image currently to be encoded, while a successive image is succeeding the image currently to be encoded.
In various embodiments, the second input data may include at least one of encoded image data of a preceding image and encoded image data of a successive image.
In various embodiments, the first input data may include encoded image data of a preceding image and the second input data may include encoded image data of a successive image.
In various embodiments, the first input data may include encoded image data of a successive image and the second input data may include encoded image data of a preceding image.
In various embodiments, the first partial encoding step 102 may include at least one of a forward motion estimation step and a backward motion estimation step.
In various embodiments, the second partial encoding step 104 may include at least one of a forward motion estimation step and a backward motion estimation step.
In various embodiments, the first partial encoding step 102 may include a forward motion estimation step and the second partial encoding step 104 may include a backward motion estimation step.
In various embodiments, the first partial encoding step 102 may include a backward motion estimation step and the second partial encoding step 104 may include a forward motion estimation step.
In various embodiments, in a third partial encoding step, third partially encoded image data may be generated based on the first partially encoded image data and the second partially encoded image data. In various embodiments, in the encoded image data generating step 106, the encoded image data may be generated based on the first partially encoded image data, the second partially encoded image data and the third partially encoded image data.
In various embodiments, the third partial encoding step may include a bi-directional motion estimation step.
In various embodiments, the third partial encoding step may include an intra-prediction step.
In various embodiments, the encoded image data generated in the encoded image data generating step 106 may include at least one of encoded frame data, encoded slice data and encoded macroblock data. The various structures of frame data, slice data and macroblock data will be explained later in more detail.
In various embodiments, the first input data may include at least one of decoded image data of a preceding image and decoded image data of a successive image.
In various embodiments, the second input data may include at least one of decoded image data of a preceding image and decoded image data of a successive image.
In various embodiments, the first input data may include decoded image data of a preceding image and the second input data may include decoded image data of a successive image.
In various embodiments, the first input data may include decoded image data of a successive image and the second input data may include decoded image data of a preceding image.
In various embodiments, the first partial decoding step 202 may include at least one of a forward motion prediction step and a backward motion prediction step.
In various embodiments, the second partial decoding step 204 may include at least one of a forward motion prediction step and a backward motion prediction step.
In various embodiments, the first partial decoding step 202 may include a forward motion prediction step and the second partial decoding step 204 may include a backward motion prediction step.
In various embodiments, the first partial decoding step 202 may include a backward motion prediction step and the second partial decoding step 204 may include a forward motion prediction step.
In various embodiments, in a third partial decoding step, third partially decoded image data may be generated based on the first partially decoded image data and the second partially decoded image data. In various embodiments, in the decoded image data generating step 206, the decoded image data may be generated based on the first partially decoded image data, the second partially decoded image data and the third partially decoded image data.
In various embodiments, the decoded image data generated in the decoded image data generating step 206 may include at least one of decoded frame data, decoded slice data and decoded macroblock data.
Although the first partial encoder 302, the second partial encoder 304 and the encoded image data generator 306 are shown as being separate in FIG. 3, in various embodiments two or more of them may be combined in a single unit.
In various embodiments, the first input data may include at least one of encoded image data of a preceding image and encoded image data of a successive image.
In various embodiments, the second input data may include at least one of encoded image data of a preceding image and encoded image data of a successive image.
In various embodiments, the first input data may include encoded image data of a preceding image and the second input data may include encoded image data of a successive image.
In various embodiments, the first input data may include encoded image data of a successive image and the second input data may include encoded image data of a preceding image.
In various embodiments, the first partial encoder 302 may be configured to perform at least one of a forward motion estimation and a backward motion estimation.
In various embodiments, the second partial encoder 304 may be configured to perform at least one of a forward motion estimation and a backward motion estimation.
In various embodiments, the first partial encoder 302 may be configured to perform a forward motion estimation and the second partial encoder 304 may be configured to perform a backward motion estimation.
In various embodiments, the first partial encoder 302 may be configured to perform a backward motion estimation and the second partial encoder 304 may be configured to perform a forward motion estimation.
In various embodiments, the image encoding apparatus 300 may further include a third partial encoder (not shown) configured to generate third partially encoded image data based on the first partially encoded image data and the second partially encoded image data. In various embodiments, the encoded image data generator may be configured to generate the encoded image data based on the first partially encoded image data, the second partially encoded image data and the third partially encoded image data.
In various embodiments, the encoded image data generated by the encoded image data generator 306 may include at least one of encoded frame data, encoded slice data and encoded macroblock data.
In various embodiments, the first input data may include at least one of decoded image data of a preceding image and decoded image data of a successive image.
In various embodiments, the second input data may include at least one of decoded image data of a preceding image and decoded image data of a successive image.
In various embodiments, the first input data may include decoded image data of a preceding image and the second input data may include decoded image data of a successive image.
In various embodiments, the first input data may include decoded image data of a successive image and the second input data may include decoded image data of a preceding image.
In various embodiments, the first partial decoder 402 may be configured to perform at least one of a forward motion prediction and a backward motion prediction.
In various embodiments, the second partial decoder 404 may be configured to perform at least one of a forward motion prediction and a backward motion prediction.
In various embodiments, the first partial decoder 402 may be configured to perform a forward motion prediction and the second partial decoder 404 may be configured to perform a backward motion prediction.
In various embodiments, the first partial decoder 402 may be configured to perform a backward motion prediction and the second partial decoder 404 may be configured to perform a forward motion prediction.
In various embodiments, the image decoding apparatus 400 may further include a third partial decoder (not shown) configured to generate third partially decoded image data based on the first partially decoded image data and the second partially decoded image data. In various embodiments, the decoded image data generator 406 may be configured to generate the decoded image data based on the first partially decoded image data, the second partially decoded image data and the third partially decoded image data.
In various embodiments, the decoded image data generated by the decoded image data generator 406 may include at least one of decoded frame data, decoded slice data and decoded macroblock data.
In various embodiments, an image encoding method may be provided, including a first partial encoding step, wherein first partially encoded image data is generated based on first input data as soon as the first input data is available; a second partial encoding step, wherein second partially encoded image data is generated based on second input data as soon as the second input data is available; and an encoded image data generating step, wherein encoded image data is generated based on the first partially encoded image data and the second partially encoded image data.
In various embodiments, an image decoding method may be provided, including a first partial decoding step, wherein first partially decoded image data is generated based on first input data as soon as the first input data is available; a second partial decoding step, wherein second partially decoded image data is generated based on second input data as soon as the second input data is available; and a decoded image data generating step, wherein decoded image data is generated based on the first partially decoded image data and the second partially decoded image data.
In various embodiments, an image encoding apparatus may be provided, including a first partial encoder configured to generate first partially encoded image data based on first input data as soon as the first input data is available; a second partial encoder configured to generate second partially encoded image data based on second input data as soon as the second input data is available; and an encoded image data generator configured to generate encoded image data based on the first partially encoded image data and the second partially encoded image data.
In various embodiments, an image decoding apparatus may be provided, including a first partial decoder configured to generate first partially decoded image data based on first input data as soon as the first input data is available; a second partial decoder configured to generate second partially decoded image data based on second input data as soon as the second input data is available; and a decoded image data generator configured to generate decoded image data based on the first partially decoded image data and the second partially decoded image data.
Software-based real-time video encoding may be challenging on current desktop computers due to the encoder complexity and the large amount of video data to be processed. Methods and apparatuses according to various embodiments may parallelize a video encoder to exploit a multi-core architecture and scale its speed performance to the available processors. Methods and apparatuses according to various embodiments may combine data and functional decomposition to yield as many concurrent tasks as possible to maximize processing scalability, i.e., the performance of the video encoder and decoder may be enhanced, e.g. proportionally or almost proportionally, as more processors become available.
The coding efficiency gains of H.264 and SVC may come at the price of high computational complexity, which may be challenging for software encoders on personal computers. The H.264 encoder is 8 times more complex than the MPEG-2 encoder, and 5 to 10 times more complex than the H.263 encoder. Furthermore, with the trend towards high definition video, current high-performance uniprocessor architectures may not be capable of performing real-time encoding using H.264 or SVC.
Various speed performance enhancing methods may be used for the H.264 encoder. In one example, complexity reduction algorithms may be used to minimize computations in the encoder; examples are fast mode decision and fast motion estimation techniques. These methods may trade compressed video quality for speed. In another example, the encoder may be parallelized, as will be explained later.
All of the performance optimization methods for H.264 may be applicable to SVC. However, SVC may be at least one fold more computationally demanding than H.264 due to its additional coding modes, specifically for inter-layer prediction, and the need to process multiple layers of video data. This added complexity of SVC may also offer new task and data structures that are exploitable for encoder performance enhancements in terms of parallelization.
According to various embodiments, a method is provided to parallelize the hierarchical B-pictures (HB) encoding structure, as will be explained in more detail below, which is used in SVC for temporal scalability. It may be considered a new coding structure introduced in SVC. By exploiting the HB structure, the methods according to various embodiments may maximize the number of concurrent encoding tasks in SVC, and thus may allow a parallelized SVC encoder that scales well with additional processors.
Video layers 502 at different resolutions may be provided, for example a first layer 504 (which may also be referred to as Layer 1) and a second layer 506 (which may also be referred to as Layer 0).
Each layer may include one or more groups of pictures (GOP). For example, for the second layer 506, the groups of pictures in the layer are shown in FIG. 5.
Each GOP may include one or more frames. For example, for the first GOP 508, the frames in the GOP are shown in FIG. 5.
Each frame may include one or more slices. For example, for the first frame 518, the slices in the frame are shown in FIG. 5.
Each slice may include one or more macroblocks. For example, for the first slice 526, the macroblocks in the slice are shown in FIG. 5.
In H.264, the same data structure as illustrated in FIG. 5 may be used.
SVC may organize the compressed file into a base layer that is H.264/AVC (Advanced Video Coding) encoded, and enhancement layers that may provide additional information to scale the base or preceding video layer in quality, spatial or temporal resolution. The original input video sequence may be down-scaled in order to obtain the pictures for all of the different spatial layers, which may result in spatial scalability. In each layer, a hierarchical B-pictures (HB) decomposition, which may also be referred to as a motion-compensated pyramid decomposition, may be performed for each group of pictures to provide temporal scalability.
It will be understood that in various embodiments, a picture may be considered as being equivalent to a frame.
The second picture 604 to the ninth picture 618 may form a group of pictures of length 8.
For example, the first picture 602 may be an I-picture. It is to be noted that, since the first picture 602 may also serve as the last picture of the preceding GOP, it may be a P-picture when seen from the previous GOP. Likewise, the ninth picture 618 may be considered a P-picture (for the present GOP) or an I-picture (for the succeeding GOP). For example, the second picture 604 to the eighth picture 616 may be B-pictures. In various embodiments, each of the first picture 602 and the ninth picture 618 (the last picture in the GOP) may be either an I-picture or a P-picture. In various embodiments, the first frame of the first GOP may have to be an I-picture. Subsequently, the choice of an I-picture or a P-picture for the ninth picture 618 may be up to the runtime configuration of the encoder. In various embodiments, a P-picture may be used. In various embodiments, an I-picture may be used when the encoder desires to remove any dependency between the current and the next GOP. Examples of such a situation may be a scene change (for example for efficiency purposes) or an encoder parameter such as the IDR refresh interval.
Thus, the second picture 604 may depend on the first picture 602 as shown by arrow 620 and on the third picture 606 as shown by arrow 622. Furthermore, the third picture 606 may depend on the first picture 602 as shown by arrow 636 and on the fifth picture 610 as shown by arrow 638. Furthermore, the fourth picture 608 may depend on the third picture 606 as shown by arrow 624 and on the fifth picture 610 as shown by arrow 626. Furthermore, the fifth picture 610 may depend on the first picture 602 as shown by arrow 644 and on the ninth picture 618 as shown by arrow 646. Furthermore, the sixth picture 612 may depend on the fifth picture 610 as shown by arrow 628 and on the seventh picture 614 as shown by arrow 630. Furthermore, the seventh picture 614 may depend on the fifth picture 610 as shown by arrow 640 and on the ninth picture 618 as shown by arrow 642. Furthermore, the eighth picture 616 may depend on the seventh picture 614 as shown by arrow 632 and on the ninth picture 618 as shown by arrow 634. Furthermore, the ninth picture 618 may depend on the first picture 602 as shown by arrow 648.
The display order of the first picture 602 to the ninth picture 618 may be 0, 1, 2, 3, 4, 5, 6, 7, 8, i.e. the first picture 602 may be displayed as 0th picture, the second picture 604 may be displayed as the 1st picture, the third picture 606 may be displayed as the 2nd picture, the fourth picture 608 may be displayed as the 3rd picture, the fifth picture 610 may be displayed as the 4th picture, the sixth picture 612 may be displayed as the 5th picture, the seventh picture 614 may be displayed as the 6th picture, the eighth picture 616 may be displayed as the 7th picture, and the ninth picture 618 may be displayed as the 8th picture.
The coding order of the first picture 602 to the ninth picture 618 may be 0, 5, 3, 6, 2, 7, 4, 8, 1, i.e. the first picture 602 may be coded as 0th picture, the second picture 604 may be coded as 5th picture, the third picture 606 may be coded as 3rd picture, the fourth picture 608 may be coded as 6th picture, the fifth picture 610 may be coded as 2nd picture, the sixth picture 612 may be coded as 7th picture, the seventh picture 614 may be coded as 4th picture, the eighth picture 616 may be coded as 8th picture, and the ninth picture 618 may be coded as 1st picture.
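For illustration only (not part of the embodiments), this coding order follows from a dyadic midpoint decomposition of the GOP: the key pictures are coded first, then the midpoint B-picture of each remaining span, level by level. A minimal Python sketch may reproduce it; the function name and the list representation are assumptions chosen for the example:

```python
def hb_coding_order(gop_size):
    """Display indices of a hierarchical-B GOP, listed in coding order."""
    order = [0, gop_size]             # key pictures (I/P) are coded first
    spans = [(0, gop_size)]
    while spans:                      # each pass is one further temporal level
        next_spans = []
        for lo, hi in spans:
            if hi - lo > 1:
                mid = (lo + hi) // 2  # B-picture predicted from lo and hi
                order.append(mid)
                next_spans += [(lo, mid), (mid, hi)]
        spans = next_spans
    return order

# Matches the coding order stated above for a GOP of length 8:
# pictures 602..618 (display order 0..8) are coded as 0, 8, 4, 2, 6, 1, 3, 5, 7.
assert hb_coding_order(8) == [0, 8, 4, 2, 6, 1, 3, 5, 7]
```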
In the embodiment shown in
In various embodiments, the number of possible temporal levels may be limited by the number of pictures in a GOP. For example, let N be the number of pictures in the GOP. Then the temporal levels may range from 0 to T, where T = log2(N) rounded down to the nearest integer; for example, a GOP of N = 8 pictures yields T = 3.
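As a small illustrative check (the function name is an assumption of this sketch), the relation T = floor(log2(N)) may be computed exactly with integer arithmetic:

```python
def num_temporal_levels(n):
    # T = log2(N) rounded down; bit_length avoids floating-point rounding.
    return n.bit_length() - 1

assert num_temporal_levels(8) == 3  # a GOP of 8 pictures has levels 0..3
```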
SVC may use intra- and inter-prediction techniques similar to those in H.264/AVC for each picture frame. Additionally, in SVC, inter-layer prediction mechanisms may be used to reduce data redundancy between different layers. This may be achieved by reusing motion, residual and partitioning information of the lower spatial layers to predict enhancement layer pictures. When inter-layer prediction is used for a macroblock, only the residual signal (or prediction error) may be encoded. Such a macroblock is signaled by the syntax element base_mode_flag. The prediction mechanism used by an inter-layer predicted macroblock is determined by the mode of its corresponding 8×8 block in the reference layer.
An Inter-Layer Intra Prediction may be applied as follows: When the corresponding 8×8 block in the reference layer is intra-coded, the reconstructed data may be upsampled (for example by using a 4-tap FIR (finite impulse response) filter for luma samples and a bilinear filter for chroma samples) and used as an intra prediction for the macroblock in the current layer.
An Inter-Layer Motion Prediction may be applied as follows: If the co-located 8×8 block in the reference layer is inter-coded, its motion and partitioning information may be scaled and used for the enhancement layer macroblock. Optionally, quarter-pel (quarter-pixel) motion vectors may be computed and coded for refinement since the motion vector of the previous layer has only up to half-pel precision in the current layer due to spatial resolution difference.
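For illustration, the motion vector scaling for a dyadic spatial ratio may be sketched as follows; the function name, the quarter-pel unit convention and the printed example are assumptions of this sketch, and the standard's exact derivation process is not reproduced here:

```python
def upscale_layer_mv(mv, spatial_ratio=2):
    """Scale a reference-layer motion vector (in quarter-pel units) to the
    enhancement layer. With a dyadic ratio the scaled vector only reaches
    half-pel precision in the new layer, so an optional quarter-pel
    refinement search around it may follow (not shown)."""
    mvx, mvy = mv
    return (mvx * spatial_ratio, mvy * spatial_ratio)

# e.g. a (3, -2) quarter-pel vector in the lower layer maps to (6, -4)
print(upscale_layer_mv((3, -2)))
```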
An Inter-Layer Residual Prediction may be applied as follows: Signaled by the syntax element residual_prediction_flag, inter-coded macroblocks in the enhancement layer may utilize the bilinear-upsampled residual information of the co-located 8×8 block (intra- or inter-coded) from the reference layer as a prediction, so that only the difference signal may be coded in the enhancement layer.
In SVC, four additional coding modes may be used specifically for macroblocks in the enhancement layers. Analysis performed on the SVC shows that, for encoding two spatial layers at CIF (Common Intermediate Format, for example corresponding to a resolution of 352 pixels×288 pixels) and QCIF (Quarter CIF, for example corresponding to a resolution of 176 pixels×144 pixels) resolutions, the intra and inter-frame coding may be respectively 13.04-28.00% (avg. 20.43%) and 58.06-259.95% (avg. 133.32%) more complex than without the inter-layer prediction. In view of this complexity assessment and the trend towards high definition (HD) video coding, where HD resolution is 9 to 20 times that of CIF resolution, it may be impossible for SVC encoding to achieve real-time performance on a uniprocessor architecture. However, with multi-processor or multi-core technologies becoming increasingly common in consumer desktop computers, parallelizing the SVC encoder may be a feasible way to enhance its performance.
SVC may be considered an extension of H.264; thus, parallelism techniques for the H.264 encoder may also be applicable to it. The H.264 encoder may be parallelized either by task-level or data-level decomposition. For task-level decomposition, the H.264 encoding process may be partitioned into independent tasks, such as motion estimation and intra-prediction.
A speedup may be achieved with wavefront encoding. The encoding of a frame may start only after the encoding of the previous frame is finished.
Each task may represent a processing stage of the data and may be assigned to a different thread/processor. There may be several difficulties in using task-level decomposition to parallelize a video encoder. Scalability may be a problem in the task-level decomposition approach. Scalability may denote the ability of an application to expand efficiently to accommodate greater computing resources so as to improve application performance. One of the inhibiting factors to scalability may be that task-level decomposition may require significant synchronization and communication between tasks for moving data from one processing stage to the next. Another factor may be load imbalance. It may be important to distribute work evenly across threads/processors to avoid blocking tasks. However, this may be very difficult because the execution time for each task depends on the nature of the data being processed and is not known a priori. Due to these difficulties, task-level decomposition may not be a popular approach to parallelize the H.264 encoder. A parallelization model for the SVC encoder that may combine both task-level and data-level decomposition will be discussed further below.
In a further approach using data-level decomposition, the video data may be partitioned into parts that are each assigned to a different processor running the same task. The data structure used in H.264 video coding may offer different levels of possible data partitioning, i.e., GOP, frame, slice and macroblock, each representing a successively finer data granularity.
On a GOP level, video sequences may be processed and encoded as a series of GOPs in order to minimize memory load while exploiting inter-frame redundancy for compression. When the key frame of each GOP is an IDR (Instantaneous Decoding Refresh) picture, the GOPs may be independent and may be processed concurrently. This may be the coarsest-grained parallelism for the H.264 encoder. However, this approach may result in high latency and memory consumption.
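A minimal sketch of such GOP-level parallelism, assuming a hypothetical encode_gop routine and dummy input data, may look as follows:

```python
from concurrent.futures import ProcessPoolExecutor

def encode_gop(gop):
    # Stand-in for a full GOP encode; returns a placeholder bitstream chunk.
    return bytes(len(gop))

if __name__ == "__main__":
    gops = [[f"frame{i}" for i in range(8)] for _ in range(4)]  # dummy GOPs
    # With IDR key pictures there are no inter-GOP dependencies, so whole
    # GOPs may be farmed out to separate worker processes.
    with ProcessPoolExecutor() as pool:
        chunks = list(pool.map(encode_gop, gops))
```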
On a frame level, the number of frames that can be coded concurrently may be determined by the prediction structure within each GOP. In H.264, frames may be encoded using an IBBPBBP . . . structure, where the B-pictures may be non-referenced and thus may be processed concurrently after their corresponding referenced P-pictures.
On a slice level, in H.264 encoding, each picture frame may be partitioned into one or more slices in order to prevent error propagation across the frame in the presence of network transmission errors. Slices within each frame may be independent, i.e., no content of a slice may be used to predict elements of other slices in the same frame. Thus, all slices in the frame may be processed in parallel. However, the exploitable spatial data dependency within a picture may become limited as the number of slices increases, and this may have an adverse effect on rate-distortion performance.
On a macroblock level, each slice may be processed in blocks of 16×16 pixels called macroblocks. The macroblocks in a slice may be processed in scan order, i.e., left to right and top to bottom. Each macroblock may use information from previously encoded neighboring macroblocks for motion vector prediction, intra-prediction, and deblocking.
For example with the dependencies shown in
In the macroblock wavefront example shown in
By processing the macroblocks in a wavefront order depicted in
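For illustration, the wavefront grouping may be computed as in the following Python sketch (the function name and the (x, y) macroblock coordinates are assumptions of the sketch): macroblock (x, y) depends on its left, top and top-right neighbours, all of which lie in strictly earlier waves when the wave index is chosen as x + 2y.

```python
def wavefront_schedule(mb_cols, mb_rows):
    """Group macroblocks into waves; all macroblocks within one wave are
    mutually independent and may be processed in parallel."""
    waves = {}
    for y in range(mb_rows):
        for x in range(mb_cols):
            # left (x-1, y), top (x, y-1) and top-right (x+1, y-1) all have
            # smaller wave indices, so wave x + 2*y is safe to schedule
            waves.setdefault(x + 2 * y, []).append((x, y))
    return [waves[t] for t in sorted(waves)]

# e.g. a 3x2 slice: [[(0, 0)], [(1, 0)], [(2, 0), (0, 1)], [(1, 1)], [(2, 1)]]
print(wavefront_schedule(3, 2))
```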
A single approach to parallelism may not offer good scalability on many-core architectures. Generally, the solution may be to combine several levels of parallelism. Additionally, pixel-level parallelism may be applied, since most modern processors support SIMD (Single Instruction, Multiple Data) instruction sets, which may allow multiple data elements to be processed concurrently with a single instruction. Pixel-level parallelism may be largely an implementation problem, and it may be dependent on the CPU architecture platform, i.e., the same pixel-level parallelism technique may not be applied equally across different CPUs such as x86, PowerPC and the Cell processor.
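As a loose analogy only (NumPy is used here as a stand-in for SIMD intrinsics, and the function is an assumption of this sketch), a sum-of-absolute-differences computation over a whole macroblock may be expressed as one bulk operation:

```python
import numpy as np

def sad(a, b):
    # Sum of absolute differences between two 16x16 macroblocks; the
    # subtraction runs over all 256 pixels in bulk, much as SIMD
    # instructions process several pixels per instruction.
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

a = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
b = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
print(sad(a, b))
```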
Throughout the sub-figures of
In
In
In
In
In
In
As already discussed above, parallelization methods for H.264 may be applicable to SVC. According to various embodiments, methods are provided to exploit SVC's new coding structures in order to improve its processing scalability on multi-core computing platforms.
Temporal scalability in SVC may be provided by the hierarchical B-pictures (HB) prediction structure as illustrated in FIG. 6.
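In such a structure, all B-pictures of one temporal level have mutually independent references once the previous levels are encoded. A minimal Python sketch (the function name and the list representation are assumptions of the sketch) may enumerate the frames per level for a dyadic GOP:

```python
def hb_levels(gop_size):
    """Display indices grouped by temporal level; frames within one level
    may be encoded concurrently once the previous levels are complete."""
    levels = [[0, gop_size]]          # level 0: the key pictures
    step = gop_size
    while step > 1:
        levels.append(list(range(step // 2, gop_size, step)))
        step //= 2
    return levels

# For a GOP of 8: level 0 -> key pictures, level 3 -> four concurrent B-pictures
assert hb_levels(8) == [[0, 8], [4], [2, 6], [1, 3, 5, 7]]
```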
In 1202, an intra prediction frame task (which may also be referred to as Frame_Task(Intra)) may be started. In 1206, a forward motion estimation task (which may also be referred to as Frame_Task(L0_ME)) may be started. In 1220, a backward motion estimation task (which may also be referred to as Frame_Task(L1_ME)) may be started.
In 1214, processing may end. Likewise, in 1226 processing may end.
In 1204, an intra prediction task (which may be also referred to as MB_Task0(0,0,Intra_Pred)) may be spawned (in other words: started).
In 1212, children tasks (which may also be referred to as children Frame_Tasks), for example tasks for which input data is now available, may be spawned. In 1208, a forward motion estimation task (which may also be referred to as MB_Task0(0,0,L0_ME)) may be spawned. In 1222, a backward motion estimation task (which may also be referred to as MB_Task0(0,0,L1_ME)) may be spawned. In 1218, a bi-directional motion estimation task (which may also be referred to as MB_Task0(0,0,Bi_Pred)) may be spawned.
In 1210, it may be checked whether the frame is a P-frame. In case it is a P-frame (yes), processing may continue in 1204, where an intra prediction task (which may also be referred to as MB_Task0(0,0,Intra_Pred)) may be spawned. In case it is not a P-frame (no), processing may continue with the check 1216.
In 1216, it may be checked whether backward motion estimation has been done (e.g. whether L1_ME has been done). In case backward motion estimation has not been done (no), processing ends in 1214. In case backward motion estimation has been done (yes), processing continues in 1218, where a bi-directional motion estimation task (which may also be referred to as MB_Task0(0,0,Bi_Pred)) may be spawned.
In 1224, it may be checked whether forward motion estimation has been done (e.g. whether L0_ME has been done). In case forward motion estimation has not been done (no), processing ends in 1226. In case forward motion estimation has been done (yes), processing continues in 1218, where a bi-directional motion estimation task (which may also be referred to as MB_Task0(0,0,Bi_Pred)) may be spawned.
It is to be noted that the intra prediction task spawning 1204, the forward motion estimation task spawning 1208, the bi-directional motion estimation task spawning 1218 and the backward motion estimation task spawning 1222 may be considered macroblock wavefront tasks.
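A minimal Python sketch of this dependency-driven spawning for a single B-frame is given below; the class, the stage names as strings, and the print stand-in for actual encoding work are assumptions of the sketch, not the embodiment itself:

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

def encode_stage(frame_id, stage):
    print(f"frame {frame_id}: {stage} done")  # stand-in for real encoding work

class BFrameTask:
    """Forward (L0_ME) and backward (L1_ME) estimation start independently
    as soon as their respective reference frames exist; whichever of the two
    finishes last spawns the bi-directional pass (Bi_Pred)."""
    def __init__(self, frame_id, pool):
        self.frame_id, self.pool = frame_id, pool
        self.done, self.lock = set(), threading.Lock()

    def run(self, stage):
        encode_stage(self.frame_id, stage)
        with self.lock:
            self.done.add(stage)
            spawn_bi = self.done >= {"L0_ME", "L1_ME"} and stage != "Bi_Pred"
        if spawn_bi:
            self.pool.submit(self.run, "Bi_Pred")

pool = ThreadPoolExecutor()
task = BFrameTask(4, pool)
futures = [pool.submit(task.run, "L0_ME"),   # may start once frame 0 is encoded
           pool.submit(task.run, "L1_ME")]   # may start once frame 8 is encoded
wait(futures)             # by now Bi_Pred has been submitted by the later pass
pool.shutdown(wait=True)  # also waits for the bi-directional pass
```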
Throughout the coding example using the task model in accordance with an embodiment shown in
In various embodiments, each frame may have a set of data which may record results from the different stages of processing, for the purpose of picking the best result when all necessary processing stages are completed. Two of the key pieces of information recorded may be ‘distortion’ (which may measure the quality of the processing) and ‘rate’ (which may measure the cost of the resources required to achieve the corresponding level of ‘distortion’). Other parameters may depend on the processing stage, and essentially may include parameters derived from the processing stage that lead to the corresponding ‘distortion’ and ‘rate’. For example, L0_ME may have ‘motion vector’ and ‘partitioning mode’ information; the Intra_Pred stage may have ‘intra-prediction mode’ information.
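A minimal sketch of such a per-stage record and the selection of the best result, assuming a simple Lagrangian cost J = D + λ·R and an illustrative λ value, may look as follows:

```python
from dataclasses import dataclass, field

@dataclass
class StageResult:
    stage: str          # e.g. "L0_ME", "L1_ME", "Bi_Pred", "Intra_Pred"
    distortion: float   # quality of the prediction (e.g. SSD against source)
    rate: float         # estimated bits to code the corresponding parameters
    params: dict = field(default_factory=dict)  # e.g. motion vectors, modes

def pick_best(results, lam=0.85):
    # Lagrangian rate-distortion choice: minimise J = D + lambda * R.
    return min(results, key=lambda r: r.distortion + lam * r.rate)

best = pick_best([
    StageResult("L0_ME", distortion=1200.0, rate=96, params={"mv": (3, -2)}),
    StageResult("Intra_Pred", distortion=1100.0, rate=240, params={"mode": 2}),
])
print(best.stage)  # "L0_ME": 1200 + 0.85*96 = 1281.6 < 1100 + 0.85*240 = 1304.0
```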
In
In
In
In
In
In
For the graph 1400 it may be assumed that there is no limitation on the number of processors and that each macroblock requires only a unit time to complete processing. Therefore, for each unit time step, the completion of all concurrent macroblocks that are being processed may be assumed.
In the third curve 1406, the model using only macroblock wavefront may show a periodic saw-tooth pattern. This may be observed and inferred in accordance with
In the second curve 1408, because the data-partitioning model may allow concurrent processing of multiple independent frames, the distribution of the total number of concurrent macroblocks over time may show a cumulative saw-tooth pattern, since the number of concurrent frames increases in time (within each GOP). It is to be noted that this is also why the processing represented by the second curve 1408 may take only half the time to process the GOP as compared to the processing represented by the third curve 1406.
Similarly, for the processing represented by the first curve 1410, methods according to various embodiments may be able to spawn more concurrent macroblocks in each unit time for the processors to consume (as reflected by the higher peaks and the steeper slope of the saw-tooth graph). Therefore, the processing represented by the first curve 1410 may take even less time to complete processing the GOP than the processing represented by the second curve 1408.
To demonstrate the effect of generating more tasks for multi-core scalable video encoding, a simulation was set up as follows:
No. of cycles required to encode a GOP / Total number of tasks   (1)
For example, a signal indicated by a first arrow 2002 may indicate to the first frame 6021 in Layer 1 that the corresponding first frame 6020 in Layer 0 is completed and that the first frame 6021 in Layer 1 may begin processing. For example, a signal indicated by a second arrow 2004 may indicate to the second frame 6041 in Layer 1 that the corresponding second frame 6040 in Layer 0 is completed and that the second frame 6041 in Layer 1 may begin processing. For example, a signal indicated by a third arrow 2006 may indicate to the third frame 6061 in Layer 1 that the corresponding third frame 6060 in Layer 0 is completed and that the third frame 6061 in Layer 1 may begin processing. For example, a signal indicated by a fourth arrow 2008 may indicate to the fourth frame 6081 in Layer 1 that the corresponding fourth frame 6080 in Layer 0 is completed and that the fourth frame 6081 in Layer 1 may begin processing. For example, a signal indicated by a fifth arrow 2010 may indicate to the fifth frame 6101 in Layer 1 that the corresponding fifth frame 6100 in Layer 0 is completed and that the fifth frame 6101 in Layer 1 may begin processing. For example, a signal indicated by a sixth arrow 2012 may indicate to the sixth frame 6121 in Layer 1 that the corresponding sixth frame 6120 in Layer 0 is completed and that the sixth frame 6121 in Layer 1 may begin processing. For example, a signal indicated by a seventh arrow 2014 may indicate to the seventh frame 6141 in Layer 1 that the corresponding seventh frame 6140 in Layer 0 is completed and that the seventh frame 6141 in Layer 1 may begin processing. For example, a signal indicated by an eighth arrow 2016 may indicate to the eighth frame 6161 in Layer 1 that the corresponding eighth frame 6160 in Layer 0 is completed and that the eighth frame 6161 in Layer 1 may begin processing. For example, a signal indicated by a ninth arrow 2018 may indicate to the ninth frame 6181 in Layer 1 that the corresponding ninth frame 6180 in Layer 0 is completed and that the ninth frame 6181 in Layer 1 may begin processing. In case inter-layer prediction is not active in the video encoder, such signals may not be necessary and the processing as illustrated in illustration 2000 may not be necessary.
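A minimal Python sketch of such inter-layer completion signaling, assuming one event per base-layer frame and hypothetical per-layer encode routines, may look as follows:

```python
import threading

NUM_FRAMES = 9
frame_done = [threading.Event() for _ in range(NUM_FRAMES)]  # one per Layer-0 frame

def encode_layer0_frame(i):
    ...                   # encode base-layer frame i (omitted)
    frame_done[i].set()   # signal: inter-layer data for frame i is available

def encode_layer1_frame(i):
    frame_done[i].wait()  # block until the co-located base frame is complete
    ...                   # encode enhancement-layer frame i with inter-layer prediction
```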
According to various embodiments, a method may be provided to parallel-process video data (for example a method for parallel video encoding, or a method for parallel video decoding). Using a data and task partitioning approach, for example a data and functional partitioning approach, for example a data and task parallelization model for the hierarchical prediction structure in video coding, the method may maximize the number of concurrent tasks for encoding a group of video frames to obtain good performance scalability of the video encoder with additional processors or processor cores. The method may be applied to the hierarchical coding structure that is used in the Scalable Video Coding (SVC) standard.
According to various embodiments, parallelism in SVC's hierarchical B-pictures (HB) coding structure may be utilized.
According to various embodiments, the number of concurrent jobs may be maximized from the HB structure by using data partitioning (on a frame and macroblock level) and by functional partitioning (using intra-prediction, motion estimation, etc.).
According to various embodiments, good speedup versus number-of-CPUs performance may be achieved.
While the invention has been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.
Number | Date | Country | Kind |
---|---|---|---|
61/144851 | Jan 2009 | US | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---
PCT/SG10/00010 | 1/15/2010 | WO | 00 | 9/29/2011 |