The present invention relates to methods for performing video stitching in continuous-presence multipoint video conferences. In multipoint video conferences a plurality of remote conference participants communicate with one another via audio and video data which are transmitted between the participants. The location of each participant is commonly referred to as a video conference end-point. A video image of the participant at each respective end-point is recorded by a video camera and the participant's speech is likewise recorded by a microphone. The video and audio data recorded at each end-point are transmitted to the other end-points participating in the video conference. Thus, the video images of remote conference participants may be displayed on a local video monitor to be viewed by a conference participant at a local video conference end-point. The audio recorded at each of the remote end-points may likewise be reproduced by speakers located at the local end-point. Thus, the participant at the local end-point may see and hear each of the other video conference participants, as may all of the participants. Similarly, each of the participants at the remote end-points may see and hear all of the other participants, including the participant at the arbitrarily designated local end-point.
In a point-to-point video conference the video image of each participant is displayed on the video monitor of the opposite end-point. This is a straightforward proposition since there are only two end-points and the video monitor at each end-point need only display the single image of the other participant. In multipoint video conferences, however, the several video images of the multiple conference participants must somehow be displayed on a single video monitor so that a participant at one location can see and hear the participants at all of the other multiple locations. There are two operating modes that are commonly used to display the multiple participants participating in a multipoint video conference. The first is known as Voice Activation (VA) mode, wherein the image of the participant who is presently speaking (or the participant who is speaking loudest) is displayed on the video monitors of the other end-points. The second is Continuous Presence (CP) mode.
In CP mode multiple images of the multiple remote participants are combined into a single video image and displayed on the video monitor of the local end-point. If there are 5 or fewer participants in the video conference, the 4 (or fewer) remote participants may be displayed simultaneously on a single monitor in a 2×2 array, as shown in
Each end-point includes a number of similar components. The components that make up end-points 22, 24, 26, and 28 are substantially the same as those of end-point 20, which are now described. End-point 20 includes a video camera 30 for recording a video image of the corresponding participant and a microphone 32 for recording his or her voice. Similarly, end-point 20 includes a video monitor 34 for displaying the images of the other participants and a speaker 36 for reproducing their voices. Finally, end-point 20 includes a video conference appliance 38, which controls the video camera 30, microphone 32, video monitor 34 and speaker 36, and moreover, is responsible for transmitting the audio and video signals recorded by the video camera 30 and microphone 32 to a multipoint control unit 40 (MCU) and for receiving the combined audio and video data from the remote end-points via the MCU.
There are two ways of deploying a multipoint control unit (MCU) in a multipoint video conference: In a centralized architecture 39 shown in
To ensure compatibility of video conferencing equipment produced by diverse manufacturers, audio and video coding standards have been developed. So long as the coded syntax of bitstream output from a video conferencing device complies with a particular standard, other components participating in the video conference will be capable of decoding it regardless of the manufacturer.
At present, there are three video coding standards relevant to the present invention. These are ITU-T H.261, ITU-T H.263 and ITU-T H.264. Each of these standards describes a coded bitstream syntax and an exact process for decoding it. Each of these standards generally employs a block based video coding approach. The basic algorithms combine inter-frame prediction to exploit temporal statistical dependencies and intra-frame prediction to exploit spatial statistical dependencies. Intra-frame or I-coding is based solely on information within the individual frame being encoded. Inter-frame or P-coding relies on information from other frames within the video sequence, usually frames temporally preceding the frame being encoded.
Typically a video sequence will comprise a plurality of I and P coded frames, as shown in
According to each of these standards a video encoder receives input video data as video frames and produces an output bitstream which is compliant with the particular standard. A decoder receives the encoded bitstream and reverses the encoding process to re-generate each video frame in the video sequence. Each video frame includes three different sets of pixels Y, Cb and Cr. The standards deal with YCbCr data in a 4:2:0 format. In other words, the resolution of the Cb and Cr components is ¼ that of the Y component. The resolution of the Y component in video conferencing images is typically defined by one of the following picture formats:
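The sample counts implied by 4:2:0 subsampling can be sketched as follows; this is a minimal illustration assuming the conventional luma dimensions of the common video conferencing picture formats.

```python
# Minimal sketch: sample counts per frame for common picture formats in 4:2:0,
# where the Cb and Cr planes each have half the width and half the height of
# the Y plane (one quarter the samples).  Dimensions are the conventional values.

FORMATS = {           # luma (Y) width x height, assumed conventional dimensions
    "SQCIF": (128, 96),
    "QCIF":  (176, 144),
    "CIF":   (352, 288),
    "4CIF":  (704, 576),
    "16CIF": (1408, 1152),
}

def frame_samples(fmt):
    w, h = FORMATS[fmt]
    y = w * h
    cb = cr = (w // 2) * (h // 2)         # 4:2:0 subsampling
    return y, cb, cr, y + cb + cr         # total samples (bytes at 8 bits/sample)

for name in FORMATS:
    print(name, frame_samples(name))
```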
According to the H.261 video coding standard, a frame in a video sequence is segmented into pixel blocks, macroblocks and groups of blocks, as shown in
The syntax of an H.261 bitstream is shown in
At the GOB layer 76, each GOB data block comprises header information 92 and a plurality of macroblock data blocks 94, 96, and 98. Since each GOB comprises 3 rows of 11 macroblocks each, the GOB layer 76 will include a total of up to 33 macroblock data blocks. This number remains the same regardless of whether the video frame is a CIF or QCIF picture. At the macroblock layer 78, each macroblock data block comprises macroblock header information 100 followed by six pixel block data blocks, 102, 104, 106, 108, 110 and 112, one for each of the four Y pixel blocks that form the macroblock, one for the Cb component and one for the Cr component. At the block layer 88, each block data block includes transform coefficient data 113 followed by an End of Block marker 114. The transform coefficients are obtained by applying an 8×8 DCT transform on the 8×8 pixel data for intra macroblocks (i.e. macroblocks where no motion compensation is required for decoding) and on the 8×8 residual data for inter macroblocks (i.e. macroblocks where motion compensation is required for decoding). The residual is the difference between the raw pixel data and the predicted data from motion estimation.
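The transform step just described can be sketched as follows; this is a minimal illustration that builds the orthonormal DCT-II matrix directly and is not the bit-exact procedure of the standard.

```python
import numpy as np

# Minimal sketch of the 8x8 transform step: for an intra macroblock the DCT is
# applied to the raw 8x8 pixel block, for an inter macroblock it is applied to
# the 8x8 residual (raw pixels minus the motion-compensated prediction).

def dct_matrix(n=8):
    k = np.arange(n).reshape(-1, 1)
    m = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] *= 1.0 / np.sqrt(2.0)
    return c

C = dct_matrix()

def transform_block(raw_block, predicted_block=None):
    # intra: transform the pixels themselves; inter: transform the residual
    source = raw_block if predicted_block is None else raw_block - predicted_block
    return C @ source @ C.T              # separable 2-D 8x8 DCT

raw = np.random.randint(0, 256, (8, 8)).astype(float)
pred = np.clip(raw + np.random.randn(8, 8), 0, 255)
intra_coefficients = transform_block(raw)
inter_coefficients = transform_block(raw, pred)
```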
H.263 Video Coding
H.263 is similar to H.261 in that it retains a similar block and macroblock structure as well as the same basic coding algorithm. However, the initial version of H.263 included four optional negotiable modes (annexes) which provide better coding efficiency. The four annexes to the original version of the standard were unrestricted motion vector mode; syntax-based arithmetic coding mode; advanced prediction mode; and a PB-frames mode. What is more, version two of the standard included additional optional modes including: continuous presence multipoint mode; forward error correction mode; advanced intra coding mode; deblocking filter mode; slice structured mode; supplemental enhancement information mode; improved PB-frames mode; reference picture mode; reduced resolution update mode; independent segment decoding mode; alternative inter VLC mode; and modified quantization mode. A third, most recent version includes an enhanced reference picture selection mode; a data partitioned slice mode; and an additional supplemental enhancement information mode. H.263 supports SQCIF, QCIF, CIF, 4CIF, 16CIF, and custom picture formats.
Some of the optional modes commonly used in the video conferencing context include: unrestricted motion vector mode (Annex D), advanced prediction mode (Annex F), advanced intra coding mode (Annex I), deblocking filter mode (Annex J) and modified quantization mode (Annex T). In the unrestricted motion vector mode, motion vectors are allowed to point outside the picture. This allows for good prediction if there is motion along the boundaries of the picture. Also, longer motion vectors can be used. This is useful for larger picture formats such as 4CIF and 16CIF and for smaller picture formats when there is motion along the picture boundaries. In the advanced prediction mode (Annex F) four motion vectors are allowed per macroblock. This significantly improves the quality of motion prediction. Also, overlapped block motion compensation can be used, which reduces blocking artifacts. Next, in the advanced intra coding mode (Annex I) compression for intra macroblocks is improved. Prediction from neighboring intra macroblocks, modified inverse quantization of intra blocks, and a separate VLC table for intra coefficients are used. In the deblocking filter mode (Annex J), an in-loop filter is applied to the boundaries of the 8×8 blocks. This reduces the blocking artifacts that otherwise lead to poor picture quality and inaccurate prediction. Finally, in the modified quantization mode (Annex T), arbitrary quantizer selection is allowed at the macroblock level, which allows for more precise rate control.
The syntax of an H.263 bitstream is illustrated in
A significant difference between H.261 and H.263 video coding is the GOB structure. In H.261 coding, each GOB is 3 successive rows of 11 consecutive macroblocks, regardless of the image type (QCIF, CIF, 4CIF, etc.). In H.263, however, a QCIF GOB is a single row of 11 macroblocks, whereas a CIF GOB is a single row of 22 macroblocks. Other resolutions have yet different GOB definitions. This leads to complications when stitching H.263 encoded pictures in the compressed domain as will be described in more detail with regard to existing video stitching methods.
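The differing GOB geometries just discussed can be summarized in a small sketch; the macroblock dimensions of QCIF (11 by 9 macroblocks) and CIF (22 by 18 macroblocks) are standard values.

```python
# Minimal sketch contrasting the GOB structures discussed above.  Each entry
# gives (macroblock rows per GOB, macroblocks per row, GOBs per picture).

GOB_LAYOUT = {
    ("H.261", "QCIF"): (3, 11, 3),    # 3 GOBs of 3 x 11 = 33 macroblocks
    ("H.261", "CIF"):  (3, 11, 12),   # 12 GOBs of 33 macroblocks
    ("H.263", "QCIF"): (1, 11, 9),    # one 11-macroblock row per GOB
    ("H.263", "CIF"):  (1, 22, 18),   # one 22-macroblock row per GOB
}

def macroblocks_per_gob(standard, picture_format):
    rows, per_row, _ = GOB_LAYOUT[(standard, picture_format)]
    return rows * per_row

# H.261 GOBs are the same size for QCIF and CIF; H.263 GOBs are not,
# which is what complicates compressed domain stitching of H.263 pictures.
assert macroblocks_per_gob("H.261", "QCIF") == macroblocks_per_gob("H.261", "CIF") == 33
assert macroblocks_per_gob("H.263", "QCIF") != macroblocks_per_gob("H.263", "CIF")
```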
H.264 Coding
H.264 is the most recently developed video coding standard. Unlike H.261 and H.263 coding, H.264 has a more flexible block and macroblock structure, and introduces the concept of slices and slice groups. According to H.264, a pixel block may be defined as one of a 4×4, 8×8, 16×8, 8×16 or 16×16 array of pixels. Like in H.261 and H.263, a macroblock comprises a 16×16 array of Y pixels and corresponding 8×8 arrays of Cb and Cr pixels. In addition, a macroblock partition is defined as a block of luma samples and two corresponding blocks of chroma samples resulting from a partitioning of a macroblock; a macroblock partition is used as a basic unit for inter prediction. A slice group is defined as a subset of macroblocks that is a partitioning of the frame, and a slice is defined as an integer number of consecutive macroblocks in raster scan order within a slice group.
Macroblocks are distinguished based on how they are coded. In the Baseline profile of H.264, macroblocks which are coded using motion prediction based on information from other frames are referred to as inter- or P-macroblocks (in the Main and Extended profiles there is also a B-macroblock; only the Baseline profile is of interest in the context of video conference applications). Macroblocks which are coded using only information from within the same slice are referred to as intra- or I-macroblocks. An I-slice contains only I-macroblocks, while a P-slice may contain both I- and P-macroblocks. An H.264 video sequence 154 is shown in
A network abstraction layer unit stream 168 for a video sequence encoded according to H.264 is shown in
Approaches to Video Stitching
Referring back to
Conceptually, the pixel domain approach is straightforward and may be implemented irrespective of the coding standard used. The pixel domain approach is illustrated in
Although easy to understand, a pixel domain approach is computationally complex and memory intensive. Encoding video data is a much more complex process than decoding video data, regardless of the video standard employed. Thus, the step of re-encoding the combined video image after spatially composing the CIF image in the pixel domain greatly increases the processing requirements and cost of the MCU 40. Therefore, pixel domain video stitching is not a practical solution for low-cost video conferencing systems. Nonetheless, useful concepts can be derived from an understanding of pixel domain video stitching. Since the ideal stitched picture represents the best quality image possible after decoding the four individual QCIF data streams, it can be used as an objective benchmark for determining the efficacy of different methods for performing video stitching. Any subsequent coding of the ideal stitched picture will result in some degree of data loss and a corresponding degradation of image quality. The amount of data loss between the ideal stitched picture and a subsequently encoded and decoded image serves as a convenient point of comparison between various stitching methods.
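The spatial composition step at the heart of the pixel domain approach can be sketched as follows; this is a minimal illustration that omits the decode and re-encode stages surrounding it.

```python
import numpy as np

# Minimal sketch: four decoded QCIF frames (Y 176x144, Cb/Cr 88x72 in 4:2:0)
# are placed into the four quadrants of a CIF frame (Y 352x288) to form the
# ideal stitched picture.

def blank(width, height):
    return {"Y": np.zeros((height, width), np.uint8),
            "Cb": np.zeros((height // 2, width // 2), np.uint8),
            "Cr": np.zeros((height // 2, width // 2), np.uint8)}

def compose_cif(a, b, c, d):
    """a = top-left, b = top-right, c = bottom-left, d = bottom-right quadrant."""
    cif = blank(352, 288)
    for plane in ("Y", "Cb", "Cr"):
        h, w = a[plane].shape
        cif[plane][:h, :w] = a[plane]
        cif[plane][:h, w:] = b[plane]
        cif[plane][h:, :w] = c[plane]
        cif[plane][h:, w:] = d[plane]
    return cif

quadrants = [blank(176, 144) for _ in range(4)]
ideal_stitched = compose_cif(*quadrants)
```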
Because of the processing delays and added complexities of re-encoding the ideal stitched video sequence inherent to the pixel domain approach, a more resource efficient approach to video stitching is desirable. Hence, a compressed domain approach is desirable. Using this approach, video stitching is performed by directly manipulating the incoming QCIF bitstreams while employing a minimal amount of decoding and re-encoding. For reasons that will be explained below, pure compressed domain video stitching is possible only with H.261 video coding.
As has been described above with regard to the bitstream syntax of the various coding standards, a coded video bitstream contains two types of data: (i) headers—which carry key global information such as coding parameters and indexes; and (ii) the actual coded image data themselves. The decoding and re-encoding present in the compressed domain approach involves decoding and modifying some of the key headers in the video bitstream but not decoding the coded image data themselves. Thus, the computational and memory requirements of the compressed domain approach are a fraction of those of the pixel domain approach.
The compressed domain approach is illustrated in
To accomplish the mapping of the QCIF GOBs from pictures A, B, C, and D into the stitched CIF image 244, the header information in the QCIF images 236, 238, 240, 242 must be altered as follows. First, since the four individual QCIF images are to be combined into a single image, the picture header information 84 (see
It should be noted that in using the compressed domain approach only the GOB header and picture header information need to be re-encoded. This provides a significant reduction in the amount of processing necessary to perform the stitching operation as compared to stitching in the pixel domain. Unfortunately, true compressed domain video stitching is only possible for H.261 video coding.
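The header rewriting involved in H.261 compressed domain stitching can be sketched as follows, assuming the standard H.261 GOB numbering in which a QCIF picture carries GOBs 1, 3 and 5 and a CIF picture carries GOBs 1 through 12 arranged in two columns.

```python
# Minimal sketch of H.261 compressed domain stitching header rewriting: only the
# GOB number in each GOB header changes, the coded macroblock data are copied
# untouched, and the picture header is rewritten once to signal a CIF picture.

QCIF_GOB_NUMBERS = (1, 3, 5)

# For each stitched quadrant, QCIF GOBs 1, 3, 5 map to these CIF GOB numbers.
CIF_GOB_MAP = {
    "A": (1, 3, 5),     # top-left
    "B": (2, 4, 6),     # top-right
    "C": (7, 9, 11),    # bottom-left
    "D": (8, 10, 12),   # bottom-right
}

def remap_gob_number(quadrant, qcif_gob_number):
    index = QCIF_GOB_NUMBERS.index(qcif_gob_number)
    return CIF_GOB_MAP[quadrant][index]

# e.g. the GOB numbered 5 in the picture placed in quadrant C becomes CIF GOB 11
assert remap_gob_number("C", 5) == 11
```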
With H.263 stitching the GOB sizes are different between QCIF images and CIF images. As can be seen in
Similar complications arise when performing compressed domain stitching on H.264 coded images. In H.264 video sequences the presence of new image data in adjacent quadrants changes the intra or inter predictor of a given block/macroblock in several ways with respect to the ideal stitched video sequence. For example, since H.264 allows motion vectors to point outside a picture's boundaries, a QCIF motion vector may point into another QCIF picture in the stitched image. Again, this can cause unacceptable noise at or near the image boundaries that can propagate through the frame. Additional complications may also arise which make compressed domain video stitching impractical for H.264 video coding.
Additional problems arise when implementing video stitching in real-world applications. The MCU (or MCUs) controlling a video conference negotiate with the various endpoints involved in the conference in order to establish various parameters that will govern the conference. For example, such mode negotiations will determine the audio and video codecs that will be used during the conference. The MCU(s) also determine the nominal frame rates that will be employed to send video sequences from the end-points to the video stitcher in the MCU(s). Nonetheless, the actual frame rates of the various video sequences received from the endpoints may vary significantly from the nominal frame rate. Furthermore, the packetization process of the transmission network over which the video streams are transmitted may cause video frames to arrive at the video stitcher in erratic bursts. This can cause significant problems for the video stitcher, which under ideal conditions would assemble stitched video frames in one-to-one synchrony with the frames comprising the individual video sequences received from the endpoints.
Another real-world problem for performing video stitching in continuous presence multipoint video conferences is the problem of compensating for data that may have been lost during transmission. The severity of data loss may range from lost individual pixel blocks through the loss of entire video frames. The video stitcher must be capable of detecting such data loss and compensating for the lost data in a manner that has as small an impact on the quality of the stitched video sequence as possible.
Finally, some of the annexes to ITU-T H.263 afford the opportunity to perform video stitching in a manner that is almost entirely within the compressed domain. Also, video data transmitted over IP networks affords other possibilities for performing video stitching in a simpler and less expensive way.
Improved methods for performing video stitching are needed. Ideally such methods should be capable of being employed regardless of the video codec being used. Such methods are desired to have low processing requirements. Further, improved methods of video stitching should be capable of drift-free stitching so that encoder-decoder mismatch errors are not propagated throughout the image and from one frame to another within the video sequence. Improved video stitching methods must also be capable of compensating for and concealing lost data, including lost pixel blocks, lost macroblocks and even entire lost video frames. Finally, improved video stitching methods must be sufficiently robust to handle input video streams having diverse and variable frame rates, and be capable of dealing with video streams that enter and drop out of video conferences at different times.
The present invention relates to a drift-free hybrid approach to video stitching. The hybrid approach represents a compromise between the excessive processing requirements of a purely pixel domain approach and the difficulties of adapting the compressed domain approach to H.263 and H.264 encoded bitstreams.
According to the drift-free hybrid approach, incoming video bitstreams are decoded to produce pixel domain video images. The decoded images are spatially composed in the pixel domain to form an ideal stitched video sequence including the images from multiple incoming video bitstreams. Rather than re-encoding the stitched pixel domain ideal stitched image as done in pixel domain stitching, the prediction information from the individual incoming bitstreams is retained. Such prediction information is encoded into the incoming bitstreams when the individual video images are first encoded prior to being received by the video stitcher. While decoding the incoming video bitstreams, this prediction information is regenerated. The video stitcher then creates a stitched predictor for the various pixel blocks in a next frame of a stitched video sequence depending on whether the corresponding macroblocks were intra-coded or inter-coded. For an intra-coded macroblock, the stitched predictor is calculated by applying the retained intra prediction information on the blocks in its causal neighborhood (The causal neighborhood is already decoded before the current block). For an inter-coded macroblock, the stitched predictor is calculated from a previously constructed reference frame of the stitched video sequence. The retained prediction information from the individual decoded video bitstreams is applied to the various pixel blocks in the reference frame to generate the expected blocks in the next frame of the stitched video sequence.
The stitched predictor may differ from a corresponding pixel block in the corresponding frame of the ideal stitched video sequence. These differences can arise due to possible differences between the reference frame of the stitched video sequence and the corresponding frames of the individual video bitstreams that were decoded and spatially composed to create the ideal stitched video sequence. Therefore, a stitched raw residual block is formed by subtracting the stitched predictor from the corresponding pixel block in the corresponding frame of the ideal stitched video sequence. The stitched raw residual block is forward transformed, quantized and entropy encoded before being added to the coded stitched video bitstream.
The drift-free hybrid stitcher then acts essentially as a decoder, inverse transforming and dequantizing the forward transformed and quantized stitched raw residual block to form a stitched decoded residual block. The stitched decoded residual block is added to the stitched predictor to create the stitched reconstructed block. Because the drift-free hybrid stitcher performs substantially the same steps on the forward transformed and quantized stitched raw residual block as are performed by a decoder, the stitcher and decoder remain synchronized and drift errors are prevented from propagating.
The drift-free hybrid approach includes a number of additional steps over a pure compressed domain approach, but they are limited to decoding the incoming bitstreams; forming the stitched predictor; forming the stitched raw residual; forward and inverse transform and quantization; and entropy encoding. Nonetheless, these additional steps are far less complex than the process of completely re-encoding the ideal stitched video sequence. The main computational bottlenecks such as motion estimation, intra prediction estimation, prediction mode estimation, and rate control are all avoided by re-using the parameters that were estimated by the encoders that produced the original incoming video bitstreams.
Detailed steps for implementing drift-free stitching are provided for H.263 and H.264 bitstreams. In error-prone environments, the responsibility for error concealment lies with the decoder portion of the overall stitcher, and hence error-concealment procedures are provided as part of a complete stitching solution for H.263 and H.264. In addition, alternative (not necessarily drift-free) stitching solutions are provided for H.263 bitstreams. Additional features and advantages of the present invention are described in, and will be apparent from, the following Detailed Description of the Invention and the figures.
The present invention relates to improved methods for performing video stitching in multipoint video conferencing systems. The methods include a hybrid approach to video stitching that combines the benefits of pixel domain stitching with those of the compressed domain approach. The result is an effective, inexpensive method for providing video stitching in multi-point video conferences. Additional methods include a lossless method for H.263 video stitching using Annex K; a nearly compressed domain approach for H.263 video stitching without any of its optional annexes; and an alternative practical approach to H.263 stitching using payload header information in RTP packets over IP networks.
I. Hybrid Approach to Video Stitching
The drift-free hybrid approach provides a compromise between the excessive amounts of processing required to re-encode an ideal stitched video sequence assembled in the pixel domain, and the synchronization drift errors that may accumulate in the decoded stitched video sequence when using coding methods that incorporate motion vectors and other predictive techniques when performing video stitching in the compressed domain. Specific implementations of the present invention will vary according to the coding standard employed. However, the general drift-free hybrid approach may be applied to video conferencing systems employing any of the H.261, H.263 or H.264 and other video coders.
The general drift-free hybrid approach to video stitching will be described with reference to
The method for creating the stitched video sequence is summarized in the flow chart shown in
This process is shown in more detail in the block diagram of
In a typical video conference arrangement the stitched video bitstream 336 is transmitted from an MCU to one or more video conference appliances at various video conference end-points. The video conference appliance at the end-point decodes the stitched bitstream and displays the stitched video sequence on the video monitor associated with the end-point. According to the present invention, in addition to transmitting the stitched video bitstream to the various end-point appliances, the MCU retains the output data from the forward transform and quantization block 332. The MCU then performs substantially the same steps as those performed by the decoders in the various video conference end-point appliances to decode the stitched raw residual block and generate the stitched predicted block 324 for frame (n+1) 318 of the stitched video sequence. The MCU constructs and retains the next frame in the stitched video sequence so that it may be used as a reference frame for predicting blocks in one or more succeeding frames in the stitched video sequence. In order to construct the next frame 318 of the stitched video sequence, the MCU de-quantizes and inverse transforms the forward transformed and quantized stitched raw residual block in block 338. The output of the de-quantizer and inverse transform block 338 generates the stitched decoded residual block 340. The stitched decoded residual block 340 generated by the MCU will be substantially identical to that produced by the decoder at the end-point appliance. The MCU and the decoder having the stitched predicted block 324, construct the stitched reconstructed block 344 by adding the stitched decoded residual block 340 to the stitched predicted block at summing junction 342. Recall that the stitched raw residual block 330 was formed by subtracting the stitched predicted block 324 from the ideal stitched block 320. Thus, adding the stitched decoded residual block 340 to the stitched predicted block 324 produces a stitched reconstructed block 344 that is very nearly the same as the ideal stitched block 320. The only differences between the stitched reconstructed block 344 and the ideal stitched block 320 result from the data loss in quantizing and dequantizing the data comprising the stitched raw residual block 330. The same process takes place at the decoders.
It should be noted that in generating the stitched predicted block 324, the MCU and the decoder are operating on identical data that are available to both. The stitched sequence reference frame 314 is generated in the same manner at both the MCU and the decoder. Furthermore, the forward transformed and quantized residual block is inverse transformed and de-quantized to produce the stitched decoded residual block 340 in the same manner at the MCU and the decoder. Thus, the stitched decoded residual block 340 generated at the MCU is also identical to that produced by the end-point decoder. Accordingly, the stitched reconstructed block 344 of frame (n+1) of the stitched video sequence 310 resulting from the addition of the stitched predicted block 324 and the stitched decoded residual block 340 will be identical at both the MCU and the end-point appliance decoder. Differences will exist between the ideal stitched block 320 and the stitched reconstructed block 344 due to the loss of data in the quantization process. However, these differences will not accumulate from frame to frame because the MCU and the decoder remain synchronized, operating on the same data sets from frame to frame.
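The drift-free property just described can be sketched as follows; this minimal illustration uses a plain scalar quantizer and an identity transform purely for clarity, which are not the actual transform and quantization of any of the standards.

```python
import numpy as np

# Minimal sketch: the MCU and the end-point decoder both start from the same
# stitched predicted block and the same quantized residual, so their
# reconstructed blocks are identical; the only loss relative to the ideal
# stitched block is the quantization error, and it does not accumulate.

QP = 8  # illustrative quantizer step size

def quantize(residual):    return np.round(residual / QP).astype(int)
def dequantize(levels):    return levels * QP

def stitcher_side(ideal_block, stitched_predicted_block):
    raw_residual = ideal_block - stitched_predicted_block
    levels = quantize(raw_residual)                    # what gets entropy coded
    reconstructed = stitched_predicted_block + dequantize(levels)
    return levels, reconstructed

def decoder_side(levels, stitched_predicted_block):
    return stitched_predicted_block + dequantize(levels)

ideal = np.random.randint(0, 256, (16, 16)).astype(int)
predictor = np.clip(ideal + np.random.randint(-20, 20, (16, 16)), 0, 255)
levels, recon_mcu = stitcher_side(ideal, predictor)
recon_endpoint = decoder_side(levels, predictor)
assert np.array_equal(recon_mcu, recon_endpoint)       # no encoder-decoder drift
```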
Compared to a pure compressed domain approach, the drift-free hybrid approach of the present invention requires the additional steps of decoding the incoming QCIF bitstreams; generating the stitched prediction block; generating the stitched raw residual block; forward transforming and quantizing the stitched raw residual block; entropy encoding the forward transformed and quantized stitched raw residual block; and inverse transforming and de-quantizing this result. However, these additional steps are far less complex than performing a full-fledged re-encoding process as required in the pixel domain approach. The main computational bottlenecks of the full re-encoding process, such as motion estimation, intra prediction estimation, prediction mode estimation and rate control, are completely avoided. Rather, the stitcher re-uses the parameters that were estimated by the encoders that produced the QCIF bitstreams in the first place. Thus, the drift-free approach of the present invention presents an effective compromise between the pixel domain and compressed domain approaches.
From the description of the drift-free hybrid stitching approach, it should be apparent that the approach is not restricted to a single video coding standard for all the incoming bitstreams and the outgoing stitched bitstream. Indeed, the drift-free stitching approach will be applicable even when the incoming bitstreams conform to different video coding standards (such as two H.263 bitstreams, one H.261 bitstream and one H.264 bitstream); moreover, irrespective of the video coding standards used in the incoming bitstreams, the outgoing stitched bitstream can be designed to conform to any desired video coding standard. For instance, the incoming bitstreams can all conform to H.263, while the outgoing stitched bitstream can conform to H.264. The decoding portion of the drift-free hybrid stitching approach will decode the incoming bitstreams using decoders conforming to the respective video coding standards; the prediction parameters decoded from these bitstreams are then appropriately translated for the outgoing stitched video coding standard (e.g. if an incoming bitstream is coded using H.264 and the outgoing stitched bitstream is H.261, then multiple motion vectors for different partitions of a given macroblock on the incoming side have to be suitably translated to a single motion vector for the stitched bitstream); finally, the steps for forming the stitched predicted blocks and stitched decoded residual, and generating the stitched bitstream proceed according to the specifications of the outgoing video coding standard.
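One possible translation of prediction parameters across standards can be sketched as follows; the area-weighted average used here is only an illustrative heuristic and is not mandated by either standard, and the motion vectors are given in full-pel units for simplicity.

```python
# Minimal sketch: the several motion vectors of an H.264 macroblock's partitions
# are collapsed into the single full-macroblock motion vector that an
# H.261-style stitched output would need (H.261 carries one integer-pel motion
# vector per macroblock).

def collapse_motion_vectors(partitions):
    """partitions: list of (width, height, mv_x, mv_y) for one macroblock."""
    total_area = sum(w * h for w, h, _, _ in partitions)
    mv_x = sum(w * h * x for w, h, x, _ in partitions) / total_area
    mv_y = sum(w * h * y for w, h, _, y in partitions) / total_area
    return round(mv_x), round(mv_y)       # one integer-pel vector for the macroblock

# e.g. an H.264 macroblock coded as two 16x8 partitions
print(collapse_motion_vectors([(16, 8, 4, 0), (16, 8, 6, 2)]))   # -> (5, 1)
```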
II. H.264 Drift-Free Hybrid Approach
An embodiment of the drift-free hybrid approach to video stitching may be specially adapted for H.264 encoded video images. The basic outline of the drift-free hybrid stitching approach applied to H.264 video images is substantially the same as that described above. The incoming QCIF bitstreams are assumed to conform to the Baseline profile of H.264, and the outgoing CIF bitstream will also conform to the Baseline profile of H.264 (since the Baseline profile is of interest in the context of video conferencing). The proposed stitching algorithm produces only one video sequence. Hence, only one sequence parameter set is necessary. Moreover, the proposed stitching algorithm uses only one picture parameter set that will be applicable for every frame of the stitcher output (e.g. every frame will have the same slice group structure, the same chroma quantization parameter index offset, etc.). The sequence parameter set and picture parameter set will form the first two NAL units in the stitched bitstream. Subsequently, the only kind of NAL units in the bitstream will be Slice Layer without Partitioning NAL units. Each stitched picture will be coded using four slices, with each slice corresponding to a stitched quadrant. The very first outgoing access unit in the stitched bitstream is an IDR access unit and by definition consists of four I-slices (since it conforms to the Baseline profile), and, except for this very first access unit, all other access units of the stitched bitstream will contain only P-slices. Each stitched picture in the stitched video sequence is sequentially numbered using the variable frame_index, starting with 0. That is, frame_index=0 denotes the very first (IDR) picture, while frame_index=1 denotes the first non-IDR access unit and so on.
A. H.264 Stitching Process in a Simple Stitching Scenario
The following outlines the detailed steps for the drift-free H.264 stitcher to produce each NAL unit. A simple stitching scenario is assumed where the four input streams have exactly the same frame rate and arrive perfectly synchronized in time with respect to each other without encountering any losses during transmission. Moreover, the four input streams start and stop simultaneously; this implies that the IDR picture from each of the four streams arrives at the stitcher at the same instant, and the stitcher stitches these four IDR pictures to produce the outgoing IDR picture. At the next step, the stitcher is invoked with the next four access units from the four input streams, and so on. In addition, the simple stitching scenario also assumes that the incoming QCIF bitstreams always have the syntax elements ref_pic_list_reordering_flag_l0 and adaptive_ref_pic_marking_mode_flag set to 0. In other words, no reordering of reference picture lists or memory_management_control_operation (MMCO) commands are allowed in the simple scenario. The stitching steps will be enhanced in a later section to handle general scenarios. Note that even though the stitcher produces only one video sequence, each incoming bitstream is allowed to contain more than one video sequence. Whenever necessary, all slices in an IDR access unit in the incoming bitstreams will be converted to P-slices.
1. Sequence Parameter Set RBSP NAL Unit:
This will be the very first NAL unit in the stitched bitstream. The stitched bitstream continues to conform to the Baseline profile; this corresponds to a profile_idc of 66. The level_idc is set based on the expected output bitrate of the stitcher. As a specific example, the nominal bitrate of each incoming QCIF bitstream is assumed to be 80 kbps; for this example, a level of 1.3 (i.e. level_idc=13) is appropriate for the stitched bitstream because this level accommodates the nominal output bitrate of 4 times the input bitrate of 80 kbps and allows some excursion beyond it. When the nominal bitrate of each incoming QCIF bitstream is different from 80 kbps, the outgoing level can be appropriately determined in a similar manner. The MaxFrameNum to be used by the stitched bitstream is set to the maximum possible value of 65536. One or more of the incoming bitstreams may also use this value, hence short-term reference pictures could come from as far back as 65535 pictures. Picture order count type 2 is chosen. This implies that the picture order count is 2×n for the stitched picture whose frame_index is n. The number of reference frames is set to the maximum possible value of 16 because one or more of the incoming bitstreams may also use this value. No gaps are allowed in frame numbers, hence the value of the syntax element frame_num for a slice in the stitched picture given by frame_index n will be given by n % MaxFrameNum, which is equal to n & 0xFFFF (where 0xFFFF is hexadecimal notation for 65535). The resolution of a stitched picture will be CIF, i.e., width is 352 pixels and height is 288 pixels.
Throughout this discussion, any syntax element whose value is unambiguous is not explicitly mentioned; e.g. frame_mbs_only_flag is always 1 for the Baseline profile, and reserved_zero_5bits is always 0. Based on the above discussion, the syntax elements are set as follows.
The syntax elements are then encoded using the appropriate variable length codes (as specified in sub clauses 7.3.2.1 and 7.4.2.1 of the H.264 standard ) to produce the sequence parameter set RBSP. Subsequently, the sequence parameter set RBSP is encapsulated into a NAL unit by adding emulation_prevention_three_bytes whenever necessary (according to NAL unit semantics specified in sub clauses 7.3.1. and 7.4.1 of the H.264 standard).
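The bookkeeping implied by these sequence parameter set choices can be sketched as follows; the value log2_max_frame_num_minus4 = 12, noted in a comment, is an inference from MaxFrameNum = 65536 and should be treated as an assumption.

```python
# Minimal sketch: with MaxFrameNum = 65536 and picture order count type 2, the
# frame_num and picture order count of the stitched picture with index n follow
# directly from n, as discussed in the text.

MAX_FRAME_NUM = 65536              # implies log2_max_frame_num_minus4 = 12

def stitched_frame_num(frame_index):
    return frame_index % MAX_FRAME_NUM          # equivalently frame_index & 0xFFFF

def stitched_pic_order_cnt(frame_index):
    return 2 * frame_index                      # picture order count type 2

assert stitched_frame_num(70000) == 70000 - 65536
assert stitched_pic_order_cnt(5) == 10
```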
2. Picture Parameter Set RBSP NAL Unit:
This will be the second NAL unit in the stitched bitstream. Each stitched picture will be composed of four slice groups, where the slice groups spatially correspond to the quadrants occupied by the individual bitstreams. The number of active reference pictures is chosen as 16, since the stitcher may have to refer to all 16 reference frames, as discussed before. The initial quantization parameter for the picture is set to 26 (as the midpoint in the allowed quantization parameter range of 0 through 51); individual quantization parameters for each macroblock will be modified as needed at the macroblock layer inside the slice layer without partitioning RBSP. The relevant syntax elements are set as follows:
The syntax elements are then encoded using the appropriate variable length codes (as specified in sub clauses 7.3.2.2 and 7.4.2.2 of the H.264 standard) to produce the picture parameter set RBSP. Subsequently, the picture parameter set RBSP is encapsulated into a NAL unit by adding emulation_prevention_three_bytes whenever necessary (according to NAL unit semantics specified in sub clauses 7.3.1 and 7.4.1 of the H.264 standard).
3. Slice Layer Without Partitioning RBSP NAL Unit:
All the NAL units in the stitched bitstream after the first two are of this type. Each stitched picture is coded as four slices with each slice representing a quadrant, i.e., each slice coincides with the entire slice group as set in the picture parameter set RBSP above. A slice layer without partitioning RBSP has two main components: slice header and slice data.
The slice header consists of slice-specific syntax elements, and also syntax elements needed for reference picture list reordering and decoder reference picture marking. The relevant slice-specific syntax elements are set as follows for the stitched picture for which frame_index equals n:
The relevant syntax elements for reference picture list reordering are set as follows: ref_pic_list_reordering_flag_l0: 0
The relevant syntax elements for decoded reference picture marking are set as follows:
The above steps set the syntax elements that constitute the slice header. Before setting the syntax elements for slice data, the following process must be performed on each macroblock of the CIF picture to obtain the initial settings for certain parameters and syntax elements (these settings are “initial” because some of these settings may eventually be modified as discussed below). The syntax elements for each macroblock of the stitched frame are set next by using the information (syntax element or decoded attribute) from the corresponding macroblock in the current ideal stitched picture. For this purpose, the macroblock/block that is spatially located in the ideal stitched frame at the same position as the current macroblock/block in the stitched picture will be referred to as the co-located macroblock/block. Note that the word co-located used here should not be confused with the word co-located used in the context of decoding of direct mode for B-slices, in subclause 8.4.1.2.1 in the H.264 standard.
For frame_index equal to 0 (i.e. the IDR picture produced by the stitcher), the syntax element mb_type is set equal to mb_type of the co-located macroblock.
For frame_index not equal to 0 (i.e. non-IDR picture produced by the stitcher), the syntax element mb_type is set as follows:
If co-located macroblock belongs to an I-slice, then set mb_type equal to 5 added to the mb_type of the co-located macroblock.
Otherwise, if co-located macroblock belongs to a P-slice, then set mb_type equal to mb_type of the co-located macroblock. If the inferred value of mb_type of the co-located macroblock is P_SKIP, set mb_type to −1.
If the macroblock prediction mode (given by MbPartPredMode( ), as defined in Tables 7-8 and 7-10 in the H.264 standard) of the mb_type set above is Intra_4x4, then for each of the 16 constituent 4×4 luma blocks set the intra 4×4 prediction mode equal to that in the co-located block of the ideal stitched picture. Note that the actual intra 4×4 prediction mode is set here, and not the syntax elements prev_intra4x4_pred_mode_flag or rem_intra4x4_pred_mode.
If the macroblock prediction mode of the mb_type set above is Intra_4x4 or Intra_16x16, then the syntax element intra_chroma_pred_mode is set equal to the intra_chroma_pred_mode of the co-located macroblock.
If the macroblock prediction mode of the mb_type set above is not Intra_4x4 or Intra_16x16 and if the number of macroblock partitions (given by NumMbPart( ), as defined in Table 7-10 in the H.264 standard) of the mb_type is less than 4, then for each of the partitions of the macroblock set the reference picture index equal to that in the co-located macroblock partition. If the mb_type set above does not equal −1 (implying that the macroblock is not a P_SKIP), then both components of the motion vector must be set equal to those in the co-located macroblock partition of the ideal stitched picture. Note that the actual motion vector is set here, not the mvd_l0 syntax element. If the mb_type equals −1 (implying P_SKIP), then both components of the motion vector must be set to the predicted motion vector using the process outlined in sub clause 8.4.1.3 of the H.264 standard. If the resulting motion vector takes any part of the current macroblock outside those boundaries of the current quadrant which are shared by other quadrants, the mb_type is changed from P_SKIP to P_L0_16x16.
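The quadrant-boundary check for P_SKIP macroblocks described above can be sketched as follows; the quadrant coordinates, the per-quadrant sets of shared edges, and the assumption of quarter-pel motion vectors are illustrative choices rather than values taken from the text.

```python
# Minimal sketch: if the predicted motion vector carries any part of the 16x16
# macroblock across a quadrant boundary that is shared with another quadrant,
# the macroblock type is changed from P_SKIP to P_L0_16x16 so the motion vector
# can be coded explicitly.

QUADRANT_RECTS = {   # (x0, y0, x1, y1) in pixels within the CIF frame (assumed layout)
    "A": (0, 0, 176, 144), "B": (176, 0, 352, 144),
    "C": (0, 144, 176, 288), "D": (176, 144, 352, 288),
}
SHARED_EDGES = {     # which edges of each quadrant touch another quadrant
    "A": {"right", "bottom"}, "B": {"left", "bottom"},
    "C": {"right", "top"},    "D": {"left", "top"},
}

def pskip_crosses_shared_boundary(quadrant, mb_x, mb_y, mv_x_qpel, mv_y_qpel):
    x0, y0, x1, y1 = QUADRANT_RECTS[quadrant]
    left   = mb_x + mv_x_qpel / 4.0          # displaced macroblock corners, full-pel
    top    = mb_y + mv_y_qpel / 4.0
    right  = left + 16
    bottom = top + 16
    shared = SHARED_EDGES[quadrant]
    return (("left"   in shared and left   < x0) or
            ("right"  in shared and right  > x1) or
            ("top"    in shared and top    < y0) or
            ("bottom" in shared and bottom > y1))

mb_type = "P_SKIP"
if pskip_crosses_shared_boundary("A", 160, 128, mv_x_qpel=12, mv_y_qpel=0):
    mb_type = "P_L0_16x16"                   # motion vector is now coded explicitly
```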
If the macroblock prediction mode of the mb_type set above is not Intra_4x4 or Intra_16x16 and if the number of macroblock partitions of the mb_type is equal to 4, then, for each of the four partitions of the macroblock, the syntax element sub_mb_type is set equal to that in the co-located partition of the ideal stitched picture. Then, for each of the sub macroblock partitions, the reference picture index and both components of the motion vector are set equal to those in the co-located sub macroblock partition of the ideal stitched picture. Again, the actual motion vector is set here and not the mvd_l0 syntax element.
The parameter MbQpY is set equal to the luma quantization parameter used in the residual decoding process for the co-located macroblock of the ideal stitched picture. If no residual was decoded for the co-located macroblock (e.g. if coded_block_pattern was 0 and the macroblock prediction mode of the mb_type set above is not Intra_16x16, or it was a P_SKIP macroblock), then MbQpY is set to the MbQpY of the previously coded macroblock in raster scanning order inside that quadrant. If the macroblock is the very first macroblock of the quadrant, then the value of (26+pic_init_qp_minus26+slice_qp_delta) is used, where pic_init_qp_minus26 and slice_qp_delta are the corresponding syntax elements in the corresponding incoming bitstream. After completing the above initial settings, the following process is performed over each macroblock for which mb_type is not equal to I_PCM.
The stitched predicted blocks are now formed as follows. If the macroblock prediction mode of the mb_type set above is Intra_4x4, then for each of the 16 constituent 4×4 luma blocks in 4×4 luma block scanning order, perform Intra 4×4 prediction (according to the process defined in sub clause 8.3.1.2 of the H.264 standard), using the Intra_4x4 prediction mode set above and the neighboring stitched reconstructed blocks already formed prior to the current block in the stitched picture. If the macroblock prediction mode of the mb_type set above is Intra_16x16, perform Intra_16x16 prediction (according to the process defined in sub clause 8.3.2 of the H.264 standard), using the intra 16×16 prediction mode information contained in the mb_type as set above and the neighboring stitched reconstructed macroblocks already formed prior to the current block in the stitched picture. In either of the above two cases, perform the intra prediction process for chroma samples, according to the process defined in sub clause 8.3.3 of the H.264 standard, using already decoded blocks/macroblocks in a causal neighborhood of the current block/macroblock. If the macroblock prediction mode of the mb_type is neither Intra_4x4 nor Intra_16x16, then for each constituent partition in scanning order, perform inter prediction (according to the process defined in sub clause 8.4.2.2 of the H.264 standard), using the motion vector and reference picture index information set above. The reference picture index set above is used to select a reference picture according to the process described in sub clause 8.4.2.1 of the H.264 standard, but applied on the stitched reconstructed video sequence instead of the ideal stitched video sequence.
The stitched raw residual blocks are formed as follows. The 16 stitched raw residual blocks are obtained by subtracting the corresponding predicted block obtained as above from the co-located ideal stitched block.
The quantized transform coefficients are formed as follows. Use the forward transform and quantization process (appropriately designed for each macroblock type, logically equivalent to the implementation in the H.264 reference software) to obtain the quantized transform coefficients.
The stitched decoded residual blocks are formed as follows. According to the process outlined in sub clause 8.5 of the H.264 standard, decode the quantized transform coefficients obtained in the earlier step. This forms the 16 stitched decoded residual luma blocks, and the corresponding 4 stitched decoded Cb blocks and 4 Cr blocks.
The stitched reconstructed blocks are formed as follows. The stitched decoded residual blocks obtained above are added to the respective stitched predicted blocks to form the stitched reconstructed blocks for the given macroblock.
Once the entire stitched picture is reconstructed, a deblocking filter process is applied using the process outlined in sub clause 8.7 of the H.264 standard. This is followed by a decoded reference picture marking process as per sub clause 8.2.5 of the H.264 standard. This yields the stitched reconstructed picture.
The relevant syntax elements needed to encode the slice data are as follows:
Slice data specific syntax elements are set as follows:
Macroblock layer specific syntax elements are set as follows:
Macroblock prediction specific syntax elements are set as follows:
Sub-macroblock prediction specific syntax elements are set as follows:
Residual block CAVLC specific syntax elements are set as follows:
B. H.264 Stitching Process in a General Stitching Scenario
The previous section provided a detailed description of H.264 stitching in the simple stitching scenario where the incoming bitstreams are assumed to have identical frame rates and all of the video frames from each bitstream are assumed to arrive at the stitcher at the same time. This section adds further enhancements to the H.264 stitching procedure for a more general scenario in which the incoming video streams may have different frame rates, with video frames that may be arriving at different times, and wherein video data may occasionally be lost. Like in the simple scenario, there will continue to be two distinct and different operations that take place within the stitcher, namely, decoding the incoming QCIF video bitstreams and the rest of the stitching procedure. The decoding operation entails four logical decoding processes, i.e., one for each incoming stream. Each of these processes or decoders produces a frame at the output. The rest of the stitching procedure takes the available frames, and combines and codes them into a stitched bitstream. The distinction between the decoding step and the rest of the stitching procedure is important and will be maintained throughout this section.
In the simple stitching scenario, the four input streams would have exactly the same frame rate (i.e. the nominal frame rate agreed to at the beginning of the video conference) and the video frames from the input streams would arrive at the stitcher perfectly synchronized in time with respect to one another without encountering any losses. In reality, however, videoconferencing appliances or endpoints join/leave multipoint conferences at different times. They produce wavering non-constant frame rates (dictated by resource availability, texture and motion of the scene being encoded, etc.), and bunch packets together in time (instead of spacing them apart uniformly), and so forth. The situation is exacerbated by the fact that the network introduces a variable amount of delay on the packets as well as packet losses. A practical stitching system therefore requires a robust and sensible mechanism for handling the inconsistencies and vagaries of the separate video bitstreams received by the stitcher.
The following issues need to be considered in developing a proper robust stitching methodology:
According to the present invention the stitcher employs the following techniques in order to address the issues described above:
In the simple scenario the endpoints produce streams at unvarying nominal frame rates and packets arrive at the stitcher at uniform intervals. In these conditions the stitcher can indeed operate at the nominal frame rate at all times. In reality, however, the frame rates produced by the various endpoints can vary significantly around the nominal frame rate and/or on average can be substantially higher than the nominal frame rate. According to the present invention, the stitcher is designed to stitch a frame in the stitched video sequence whenever two complete access units, i.e., frames, are received in any incoming stream. This means that the stitcher will attempt to keep pace with a faster-than-nominal frame rate seen in any of the incoming streams. However, it should be kept in mind that in a real world system the stitcher has access to only a finite amount of resources, so it can only stitch as fast as those resources will allow. Therefore, a protection mechanism is provided in the stitching design through the specification of the maximum stitching frame rate parameter, fmax. In this case, whenever one of the incoming streams tries to drive up the stitching frame rate beyond fmax, the stitcher drops packets corresponding to complete access unit(s) in the offending stream so as to not exceed its capability. Note, however, that the corresponding frame still needs to be decoded by the decoder portion of the stitcher, although this frame is not used to form a stitched CIF picture.
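The fmax protection mechanism can be sketched as follows; this is a minimal illustration in which the "two complete access units" trigger is simplified to "a complete access unit has arrived", and decode() and stitch_current_frames() are placeholders standing in for the decoder portion and the rest of the stitching procedure.

```python
# Minimal sketch of the f_max cap: a stitch is skipped (the frame is dropped
# from the stitched output but still decoded) whenever stitching now would push
# the output frame rate above f_max.

F_MAX = 15.0                          # assumed maximum stitching frame rate (frames/s)
MIN_INTERVAL = 1.0 / F_MAX

last_stitch_time = None

def decode(stream_id, access_unit):
    pass                              # the quadrant decoder always runs

def stitch_current_frames():
    return "stitched CIF picture"     # placeholder for assembling one stitched frame

def on_complete_access_unit(stream_id, access_unit, now):
    global last_stitch_time
    decode(stream_id, access_unit)    # decode even if the frame is not stitched
    if last_stitch_time is not None and now - last_stitch_time < MIN_INTERVAL:
        return None                   # stitching now would exceed f_max: drop it
    last_stitch_time = now
    return stitch_current_frames()
```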
In order to get a better idea of what exactly goes into stitching together the incoming streams, it is instructive to look at some illustrative examples.
In this case, the stitcher can produce stitched frames at the nominal frame rate with the frames stitched together at different time instants as follows:
Now, consider the case of asynchronous incoming streams illustrated in
At time instant t−3, new frames are available from each of the streams, i.e., A0, B0, C0, D0, and are therefore stitched together. But at t−2, new frames are available from streams A and D, i.e., A1 and D1, but not from B and C. Therefore, the temporally previous frames from these streams, i.e., B0 and C0 from t−3, are repeated. In order to repeat the information in the previous quadrant, some coded information has to be invented by the stitcher so that the stitched stream carries this information. The H.264 standard offers a relatively easy solution to this problem through the availability of the concept of a P_SKIP macroblock. A P_SKIP macroblock carries no coded residual information and is intended as a copying mechanism from the most recent reference frame into the current frame. Therefore, a slice (quadrant) consisting of all P_SKIP macroblocks will provide an elegant and inexpensive solution to repeating a frame in one of the incoming bitstreams. The details of the construction of such a coded slice, referred to as MISSING_P_SLICE_WITH_P_SKIP_MBS, are described below.
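The per-quadrant decision just described can be sketched as follows; the names and the placeholder slice encoder are illustrative only, and the actual construction of the MISSING_P_SLICE_WITH_P_SKIP_MBS slice is detailed later in the text.

```python
# Minimal sketch: a quadrant whose stream delivered a new decoded frame is coded
# normally, while a quadrant with no new frame is filled with a slice of all
# P_SKIP macroblocks that repeats the previous reference picture.

def build_stitched_frame(new_frames):
    """new_frames maps quadrant -> newly decoded frame, or is missing the key if none arrived."""
    slices = {}
    for quadrant in ("A", "B", "C", "D"):
        frame = new_frames.get(quadrant)
        if frame is None:
            slices[quadrant] = "MISSING_P_SLICE_WITH_P_SKIP_MBS"   # repeat previous frame
        else:
            slices[quadrant] = encode_quadrant_slice(quadrant, frame)
    return slices

def encode_quadrant_slice(quadrant, frame):
    return f"P-slice for quadrant {quadrant} from {frame}"         # placeholder

# at t-2 in the example above, only streams A and D have new frames
print(build_stitched_frame({"A": "A1", "D": "D1"}))
```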
In the following discussion, the stitching of asynchronous incoming streams is described in a more detailed manner. The discussion assumes a packetized video stream, comprising a collection of coded video frames with each coded frame packaged into one or more IP packets for transmission. This assumption is consistent with most real world video conference applications. Consider the example shown in
The stitching at various time instants proceeds as follows:
Some important observations regarding this example are:
Stitching cannot be performed after reception of C4 (second complete access unit following C3) since that would exceed fmax.
When a multipoint call is established, not all of the endpoints involved join at the same time. Similarly, some of the endpoints may quit the call before the others. Therefore, whenever a quadrant is empty, i.e., no participant is available to be displayed in that quadrant, some information needs to be displayed by the stitcher. This information is usually in the form of a gray image or a static logo. As a specific example, a gray image will be assumed for the detailed description here. However, any other image can be substituted by making suitable modifications without departing from the spirit and scope of the details presented here. Such a gray frame has to be coded as a slice and inserted into the stitched stream. Following are the three different types of coded slices (and the respective scenarios where they are necessary) that have to be devised:
Although it is possible to use MISSING_P_SLICE_WITH_I_MBS in non-IDR stitched frames for as long as necessary, it is advantageous to use MISSING_P_SLICE_WITH_P_SKIP_MBS because it consumes less bandwidth and more importantly, it is much easier to decode for the endpoints receiving the stitched stream.
The parameter slice_ctr takes the values 0, 1, 2, 3 corresponding respectively to the quadrants A, B, C, D shown in
The MISSING_IDR_SLICE is constructed such that when it is decoded, it produces an all-gray quadrant whose Y, U, and V samples are all equal to 128. The specific syntax elements for the MISSING_IDR_SLICE are set as follows:
Slice Header syntax elements:
Decoded reference picture marking syntax elements are set as follows:
Macroblock layer syntax elements are set as follows:
Macroblock prediction syntax elements are set as follows:
The MISSING_P_SLICE_WITH_I_MBS is constructed such that when it is decoded, it produces an all-gray quadrant whose Y, U, and V samples are all equal to 128. The specific syntax elements for the MISSING_P_SLICE_WITH_I_MBS are set as follows:
Slice Header syntax elements are set as follows:
Reference picture reordering syntax elements are set as follows:
Decoded reference picture marking syntax elements are set as follows:
Slice data syntax elements are set as follows:
Macroblock layer syntax elements are set as follows:
Macroblock prediction syntax elements are set as follows:
Note that instead of MISSING_P_SLICE_WITH_I_MBS, a MISSING_I_SLICE_WITH_I_MBS could also be alternatively used (with a minor change in mb_type setting).
The MISSING_P_SLICE_WITH_P_SKIP_MBS is constructed such that the information for the slice (quadrant) is copied exactly from the previous reference frame. The specific syntax elements for the MISSING_P_SLICE_WITH_P_SKIP_MBS are set as follows:
Slice header syntax elements are set the same as that of
Slice data syntax elements are set as follows:
One interesting problem that arises in stitching asynchronous streams is that the multi-picture reference buffer seen by the stitching operation will not be aligned with the buffers seen by the individual QCIF decoders. In other words, assume that a given macroblock partition in a certain QCIF picture in one of the incoming streams used a particular reference picture (as given by the ref_idx_l0 syntax element coded for that macroblock partition) for inter-prediction. This same picture then goes on to occupy a quadrant in the stitched CIF picture. The reference picture in the stitched reconstructed video sequence that is referred to by the stored ref_idx_l0 may not temporally match the reference picture that was used for generating the ideal stitched video sequence. However, having said this, the proposed drift-free stitching approach (the drift here referring to that between the stitcher and the CIF decoder) will handle this scenario perfectly well. The only penalty paid for not making an attempt to try and align the reference buffers of the incoming and the stitched streams is an increase in the bitrate of the stitched output. This is because the different reference picture used along with the original motion vector during stitching may not provide a good prediction for a given macroblock partition. Therefore, it is well worth the effort to accomplish as much alignment of the reference buffers as possible. Specifically, this alignment will involve altering the syntax element ref_idx_l0 found in inter-coded blocks of the incoming picture so as to make it consistent with the stitched stream.
In order to keep the design simple, it is desired that the stitched output bitstream not use reference picture reordering or MMCO commands (as in the simple stitching scenario). As a result, a similar alignment issue can occur when the incoming QCIF pictures use reference picture reordering in their constituent slices and/or MMCO commands, even if there was no asynchrony in the incoming streams. For example, in the incoming stream, ref_idx_l0=2 in one QCIF slice may refer to the reference picture that was decoded temporally immediately prior to it. But since there is no reordering of reference pictures in the stitched bitstream, ref_idx_l0=2 will refer to the reference picture that is three pictures temporally prior to it. Even more serious alignment issues arise when incoming QCIF bitstreams use MMCO commands.
The alignment issues described above can be addressed by mapping the reference picture buffers between the four incoming streams and the stitched stream, as set forth below. Prior to that, however, it is important to review some of the properties of the stitched stream with respect to inter prediction:
As for mapping short-term reference pictures in the incoming streams to those in the stitched stream, each short-term reference picture can be uniquely identified by frame_num. Therefore, a mapping can be established between the frame_num of each of the incoming streams and the stitched stream. Four separate tables are maintained at the stitcher, each carrying the mapping between one of the incoming streams and the stitched stream. When a frame is stitched, the ref_idx_l0 found in each inter-coded block of the incoming QCIF picture is altered using the appropriate table in order to be consistent with the stitched stream. The tables are updated, if necessary, each time a stitched frame is generated.
It would be useful at this time to understand the mapping set forth previously through an example.
One consequence of the modification of the ref_idx_l0 syntax element is that a macroblock that was originally of type P_8x8ref0 needs to be changed to P_8x8 if the new ref_idx_l0 is not 0.
The above procedure for mapping short-term reference pictures from incoming streams to the stitched bitstream needs to be augmented in cases where an incoming QCIF frame is decoded but is dropped from the output of the stitcher due to limited resources at the stitcher. Recall that resource limitations may force the stitcher to maintain its output frame rate below fmax (as discussed earlier). As an example, continuing beyond the example shown in Table 1, suppose incoming frame_num=19 for the given incoming stream is decoded but is dropped from the stitcher output, and instead incoming frame_num=20 is stitched into stitched CIF frame_num=41. Suppose a macroblock partition in the incoming frame_num=20 used the dropped picture (frame_num=19) as reference. In this case, a mapping from incoming frame_num=19 would need to be artificially created such that it maps to the same stitched frame_num as the temporally previous incoming frame_num. In the example, the temporally previous incoming frame_num is 18, and that maps to stitched frame_num of 40. Hence, incoming frame_num=19 will be artificially mapped to stitched frame_num of 40.
The long-term reference pictures in the incoming streams are mapped to the short-term reference pictures in the stitched CIF stream as follows. The ref_idx_l0 of a long-term reference picture in any of the incoming streams is mapped to min(15, num_ref_idx_l0_active_minus1). The minimum of 15 and num_ref_idx_l0_active_minus1 is needed because the number of reference pictures in the stitched stream does not reach 16 until that many pictures have been output by the stitcher. The rationale for picking the 15th slot in the reference picture list is that such a slot is reasonably expected to contain the temporally oldest frame. Since no long-term pictures are allowed in the stitched stream, the temporally oldest frame in the reference picture buffer is the logical choice to approximate a long-term picture in an incoming stream.
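By way of illustration only, the following Python sketch shows one way the per-stream mapping tables and the ref_idx_l0 rewrite described above could be organized. The class and method names (e.g., RefMapper) are illustrative assumptions, not the exact implementation; translating an incoming ref_idx_l0 to its incoming frame_num (via the incoming decoder's own reference list) is assumed to happen before these helpers are called.

```python
class RefMapper:
    """Maps reference picture identities of one incoming QCIF stream to the stitched CIF stream."""

    def __init__(self):
        self.frame_num_map = {}            # incoming frame_num -> stitched frame_num
        self.last_stitched_frame_num = None

    def record_stitched(self, incoming_frame_num, stitched_frame_num):
        """Update the table each time an incoming frame is stitched into the output."""
        self.frame_num_map[incoming_frame_num] = stitched_frame_num
        self.last_stitched_frame_num = stitched_frame_num

    def record_dropped(self, incoming_frame_num):
        """Artificially map a decoded-but-dropped incoming frame to the same stitched
        frame_num as the temporally previous incoming frame (see the Table 1 example)."""
        self.frame_num_map[incoming_frame_num] = self.last_stitched_frame_num

    def remap_short_term(self, incoming_ref_frame_num, stitched_ref_list):
        """Return the ref_idx_l0 to write into the stitched stream for a block that
        referenced the given incoming short-term frame_num. stitched_ref_list holds
        the stitched frame_num values in list-0 order.
        Note: if the rewritten index is non-zero, a P_8x8ref0 macroblock must be
        re-labelled P_8x8, as noted in the text."""
        target = self.frame_num_map[incoming_ref_frame_num]
        return stitched_ref_list.index(target)

    @staticmethod
    def remap_long_term(num_ref_idx_l0_active_minus1):
        """Long-term references map to the (approximately) oldest short-term slot."""
        return min(15, num_ref_idx_l0_active_minus1)
```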
This completes the description of H.264 stitching in a general scenario. Note that the above description is easily applicable to other resolutions, such as stitching four CIF bitstreams into a 4CIF bitstream, with minor changes in the details.
A simplification in H.264 stitching is possible when one or more incoming quadrants are coded using only I-slices, the total number of slice groups in the incoming quadrants is less than or equal to 4 plus the number of incoming quadrants coded using only I-slices, and, furthermore, all the incoming quadrants that are coded using only I-slices have the same value for the syntax element chroma_qp_index_offset in their respective picture parameter sets (if there is only one incoming quadrant that is coded using only I-slices, the condition on chroma_qp_index_offset is automatically satisfied). As a special example, the conditions for the simplified stitching are satisfied when the stitcher produces the very first IDR stitched picture and the incoming quadrants are also IDR pictures, with the total number of slice groups in the incoming quadrants being less than or equal to 8 and the incoming quadrants using a common value for chroma_qp_index_offset. When the conditions for the simplified stitching are satisfied, there is no need for forming the stitched raw residual, and subsequently forward transforming and quantizing it, in the quadrants that were coded using only I-slices. For these quadrants, the NAL units as received from the incoming streams can therefore be sent out by the stitcher with only a few changes in the slice header. Note that more than one picture parameter set may be necessary; this is because if an incoming bitstream coded using only I-slices has a slice group structure different from interleaved (i.e., slice_group_map_type is not 0), the slice group structure for those quadrants cannot be captured using the slice group structure derived from the syntax element settings described above for the picture parameter set of the stitched bitstream. The few changes required to the slice header are as follows: first, the first_mb_in_slice syntax element has to be appropriately mapped from the QCIF picture to point to the correct location in the CIF picture; second, if the incoming slice_type was 7, it may have to be changed to 2 (both 2 and 7 represent an I-slice, but 7 means that all the slices in the picture are of type 7, which will not be true unless all four quadrants use only I-slices); third, pic_parameter_set_id may have to be changed from its original value to point to the appropriate picture parameter set used in the stitched bitstream; fourth, slice_qp_delta may have to be changed so that the SliceQPY computed as 26+pic_init_qp_minus26+slice_qp_delta (with pic_init_qp_minus26 as set in the stitched picture parameter set in use) equals the SliceQPY that was used for this slice in the incoming bitstream; furthermore, frame_num and the contents of the ref_pic_list_reordering and dec_ref_pic_marking syntax structures have to be set as described in detail earlier under the settings for the slice layer without partitioning RBSP NAL unit. In addition, further simplification can be accomplished by setting disable_deblocking_filter_idc to 1 in the slice header. The stitched reconstructed picture is obtained as follows: for the quadrants that were coded using only I-slices in the incoming bitstreams, the corresponding QCIF pictures obtained "prior to" the deblocking step in the respective decoders are placed in the CIF picture; the other quadrants (i.e., those not coded using only I-slices) are formed using the method described in detail earlier that constructs the stitched reconstructed blocks; the CIF picture thus obtained is then deblocked to produce the stitched reconstructed picture. Note that because there is no inter coding used in I-slices, the decoder of the stitched bitstream produces a picture identical to the stitched picture obtained in this manner; hence, the basic premise of drift-free stitching is maintained. However, note that the incoming bitstream still has to be decoded completely because it has to be retained for referencing future ideal pictures. When the total number of slice groups in the incoming quadrants is greater than 4 plus the number of incoming quadrants coded using only I-slices, the above simplification will not apply to some or all such quadrants, because slice groups in some or all quadrants will need to be merged to keep the total number of slice groups within the stitched picture at or below 8 in order to conform to the Baseline profile.
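The slice_qp_delta rewrite mentioned above follows directly from the SliceQPY formula. The following small sketch (helper name is an assumption) illustrates the arithmetic:

```python
def rewrite_slice_qp_delta(incoming_slice_qp_y, stitched_pic_init_qp_minus26):
    """Return the slice_qp_delta to code in the stitched slice header so that
    26 + pic_init_qp_minus26 + slice_qp_delta equals the incoming SliceQPY."""
    return incoming_slice_qp_y - 26 - stitched_pic_init_qp_minus26

# Example: an incoming slice used SliceQPY = 30 and the stitched picture
# parameter set has pic_init_qp_minus26 = 0, so slice_qp_delta becomes 4.
assert rewrite_slice_qp_delta(30, 0) == 4
```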
C. Error Concealment Procedure Used in the Decoder for H.264 Stitching in a General Stitching Scenario
In the detailed description of H.264 stitching in a general scenario, it was indicated that it is the individual decoder's responsibility to make a decoded frame available and indicate as much to the stitching operation. The details of the error concealment used by the decoder are described next. This procedure assumes that incoming video streams are packetized using the Real Time Protocol (RTP) in conjunction with the User Datagram Protocol (UDP) and the Internet Protocol (IP), and that the packets are sent over an IP-based LAN built over Ethernet (MTU=1500 bytes). Furthermore, a packet received at the decoder is assumed to be correct and without any bit errors; any packet corrupted during transmission is assumed to be detected and dropped by an underlying network mechanism. Therefore, the error is entirely in the form of packet losses.
In order to come up with effective error concealment strategies, it is important to understand the different types of packetization that are performed by the H.264 encoders/endpoints. The different scenarios of packetization are listed below (note: a slice is a NAL unit):
1. Slice→1 Packet
This type of packetization is commonly used for a P-slice of a picture. Typically, for small picture resolutions such as QCIF and relatively error-free transmission environments, only one slice is used per picture and therefore a packet contains an entire picture.
According to RTP payload format for H.264, this is “single NAL unit packet” because a packet contains a single whole NAL unit in the payload.
2. Multiple Slices → 1 Packet
This is used to pack some or all of the slices in a picture into a packet. Since pictures are generated at different time instants, only slices from the same picture are put into a packet. Putting slices from more than one picture into a packet would introduce delay, which is undesirable in applications such as videoconferencing.
According to RTP payload format for H.264, this is “single-time aggregation packet”.
3. Slice→Multiple Packets
This happens when a single slice is fragmented over multiple packets. It is typically used to pack an I-slice. Coded I-slices are typically large and therefore span multiple packets or fragments. It is important to note here that loss of a single packet or fragment means that the entire slice has to be discarded.
According to RTP payload format for H.264, this is “fragmentation unit”.
From the above discussion, it can be summarized that the loss of two types of video coding units has to be dealt with in error concealment at the decoder, namely,
An important aspect of error concealment is knowing whether the lost slice/picture was intra-coded or inter-coded. Intra-coding is typically employed by the encoder at the beginning of a video sequence, when there is a scene change, or when the motion is too fast or non-linear. Inter-coding is performed whenever there is smooth, linear motion between pictures. Spatial concealment is better suited for intra-coded coding units, and temporal concealment works better for inter-coded units.
It is important to note the following properties about an RTP stream containing coded video:
Using the above, it is easy to group the packets belonging to a particular picture as well as determine which packets got lost (corresponding to missing sequence numbers) during transmission.
The slice loss concealment procedure is described next. Slices can be categorized as I, P, or IDR. An IDR-slice is basically an I-slice that forms part of an IDR picture. An IDR picture is the first coded picture in a video sequence and has the ability to do an "instantaneous refresh" of the decoder. When transmission errors happen, the encoder and decoder lose synchrony and errors propagate due to the motion prediction that is performed between pictures. An IDR-picture is a very potent tool in this scenario since it "resynchronizes" the encoder and the decoder.
In dealing with lost slices, it is assumed that a picture consists of multiple slices and that at least one slice has been received by the decoder (otherwise, the situation is considered a picture loss rather than a slice loss). In order to conceal slice losses effectively, it is important to determine whether the lost slice was an I, P, or IDR slice. A lost slice in a picture is declared to be of type:
A lost slice can be identified as I or P with certainty only if one of the received slices has a slice_type of 7 or 5, respectively. When one of the received slices has a slice_type of 2 or 0, no such assurance exists. However, it is very likely in an interactive real-time application such as videoconferencing that all the slices in a picture are of the same slice_type. For example, in the case of a scene change, all the slices in the picture will be coded as I-slices. It should be remembered that a P-slice can be composed entirely of I-macroblocks; however, this is a very unlikely event. It is important to note that scattered I-macroblocks in a P-slice are not precluded, since this is likely to happen with forced intra-updating of macroblocks (as an error-resilience measure), local characteristics of the picture, etc.
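For illustration only, the inference rule just described can be expressed as the following sketch (the function name is an assumption; slice_type codes follow the H.264 convention quoted above, where 5 and 7 additionally promise that every slice in the picture has that type):

```python
def guess_lost_slice_type(received_slice_types):
    """Guess whether a lost slice was an I- or P-slice from the slices that
    did arrive for the same picture."""
    if 7 in received_slice_types:
        return "I"   # certain: all slices in the picture are I-slices
    if 5 in received_slice_types:
        return "P"   # certain: all slices in the picture are P-slices
    if 2 in received_slice_types:
        return "I"   # likely but not guaranteed
    return "P"       # default guess when only slice_type 0 is seen
```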
If the lost slice is determined to be an I-slice, spatial concealment can be performed, while if it is a P-slice, temporal concealment can be employed. Spatial concealment refers to the concealment of missing pixel information in a frame using pixel information from within that frame, while temporal concealment makes use of pixel information from other frames (typically the reference frames used in inter prediction). The effectiveness of spatial or temporal concealment depends on factors such as:
The following pseudo-code summarizes the slice concealment methodology:
The above algorithm does not employ any spatial concealment. This is because spatial concealment is most effective only in concealing isolated lost macroblocks. In this scenario, a lost macroblock is surrounded by received neighbors and therefore spatial concealment will yield good results. However, if an entire slice containing multiple macroblocks is lost, spatial concealment typically does not have the desired conditions to produce useful results. Taking into account the relative rareness of I-slices in the context of videoconferencing, it would make sense to solve the problem by requesting an IDR-picture through the H.241 signaling mechanism.
The crux of temporal concealment involves estimating the motion vector and the corresponding reference picture of a lost macroblock from its received neighbors. The estimated information is then used to perform motion compensation in order to obtain the pixel information for the lost macroblock. The reliability of the estimate depends, among other things, on how many neighbors are available. The estimation process, therefore, can be greatly aided if the encoder pays careful attention to the structuring of the slices in the picture. Details of the implementation of temporal concealment are provided in what follows. While decoding, a macroblock map is maintained and updated to indicate that a certain macroblock has been received. Once all of the information for a particular picture has been received, the map indicates the positions of the missing macroblocks. Temporal concealment is then initiated for each of these macroblocks. The temporal concealment technique described here is similar in spirit to the technique proposed in W. Lam, A. Reibman and B. Liu, "Recovery of Lost or Erroneously Received Motion Vectors," the teaching of which is incorporated herein by reference.
The following discussion explains the procedure of obtaining the motion information of the luma part of a lost macroblock. The chroma portions of the lost macroblock derive their motion information from the luma portion as described in the H.264 standard.
First, the ref_idx_l0 (reference picture) of each available neighbor is inspected and the most commonly occurring ref_idx_l0 is chosen as the estimated reference picture. Then, from those neighbors whose ref_idx_l0 is equal to the estimated value, the median of their motion vectors is taken as the estimated motion vector for the lost macroblock.
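A minimal sketch of this estimate is given below (function and variable names are assumptions; quarter-pel rounding and other refinements are omitted):

```python
from collections import Counter
from statistics import median

def estimate_lost_mb_motion(neighbors):
    """neighbors: list of (ref_idx_l0, (mvx, mvy)) for the received neighboring MBs.
    Returns (estimated ref_idx_l0, estimated motion vector), or None if no
    neighbor is available (in which case a (0, 0) vector with reference 0 is a
    reasonable fallback)."""
    if not neighbors:
        return None
    ref_counts = Counter(ref for ref, _ in neighbors)
    est_ref, _ = ref_counts.most_common(1)[0]          # most common reference index
    mvs = [mv for ref, mv in neighbors if ref == est_ref]
    est_mv = (median(mx for mx, _ in mvs),             # component-wise median
              median(my for _, my in mvs))
    return est_ref, est_mv
```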
Next we consider the picture loss concealment procedure. This deals with the contingency of losing an entire picture or multiple pictures. The best way to conceal the loss of a picture is to copy the pixel information from the temporally previous picture. The loss of pixel information, however, is only one of the many problems resulting from picture loss. In compensating for picture loss, it is important to determine the number of pictures that have been lost in transit at a given time. This information can then be used to shift the multi-picture reference buffer appropriately so that subsequent pictures do not incorrectly reference pictures in this buffer. When gaps in frame numbers are not allowed in the video stream, it is possible to determine from the frame_num of the current slice and that of the previously received slice how many frames/pictures were lost in transit. However, if gaps in frame_num are in fact allowed, then even with the knowledge of the exact number of packets lost (through RTP sequence numbering), it is not possible to determine the number of pictures lost. Another important piece of information that is lost with a picture is whether it was a short-term reference, long-term reference, or a non-reference picture. A wrong guess of any of these parameters may cause serious non-compliance problems for the decoder at some later stage of decoding.
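Assuming gaps in frame_num are not allowed (so that every reference picture increments frame_num by one, modulo MaxFrameNum), the count of lost pictures can be obtained as in the following hedged example:

```python
def count_lost_pictures(prev_frame_num, curr_frame_num, max_frame_num):
    """Number of pictures lost between two received slices when gaps in
    frame_num are not allowed; frame_num wraps modulo MaxFrameNum."""
    return (curr_frame_num - prev_frame_num - 1) % max_frame_num

# e.g. previous slice carried frame_num 14, current slice carries frame_num 1,
# MaxFrameNum = 16: pictures 15 and 0 were lost in transit.
assert count_lost_pictures(14, 1, 16) == 2
```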
The following approach is taken to combat loss of picture or pictures:
By placing a lost picture in the ShortTermReferencePicture buffer, a sliding window process is assumed as the default in the context of decoded reference picture marking. In case the lost picture had carried MMCO commands, the decoder will likely face a non-compliance problem at some point in time. Requesting an IDR-picture in such a scenario is an elegant and effective solution: receiving the IDR-picture clears all the reference buffers in the decoder and re-synchronizes it with the encoder.
The following is a list of conditions under which an IDR-picture (accompanied by appropriate parameter sets) is requested by initiating a videoFastUpdatePicture command through the H.241 signaling mechanism.
Another embodiment of the present invention applies the drift-free hybrid approach to video stitching to H.263 encoded video images. In this embodiment, four QCIF H.263 bitstreams are to be stitched into an H.263 CIF bitstream. Each individual incoming H.263 bitstream is allowed to use any combination of Annexes among the H.263 Annexes D, E, F, I, J, K, R, S, T, and U, independently of the other incoming H.263 bitstreams, but none of the incoming bitstreams may use PB frames (i.e., Annex G is not allowed). Finally, the stitched bitstream will be compliant with the H.263 standard without any Annexes. This feature is desirable so that all H.263 receivers will be able to decode the stitched bitstream.
The stitching procedure proceeds according to the general steps outlined above. First, decode the QCIF frames from each of the four incoming H.263 bitstreams and form the ideal stitched video picture by spatially composing the decoded QCIF pictures. Next, store the following information for each of the four decoded QCIF frames:
Note that this is the actual quantization parameter that was used to decode the macroblock, and not the differential value given by the syntax element DQUANT. If the COD for the given macroblock is 1 and the macroblock is the first macroblock of the picture, or if it is the first macroblock of a GOB whose GOB header was present, then the quantization parameter stored is the value of PQUANT or GQUANT in the picture or GOB header, respectively. If the COD for the given macroblock is 1 and the macroblock is not the first macroblock of the picture or of the GOB (if a GOB header was present), then the QUANT stored for this macroblock is equal to that of the previous macroblock in raster scanning order.
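As a small illustration of this bookkeeping rule for a macroblock with COD=1 (the function name and argument layout are assumptions, not part of the described method):

```python
def stored_quant_for_skipped_mb(is_first_in_picture, is_first_in_gob,
                                gob_header_present, pquant, gquant,
                                prev_mb_quant):
    """QUANT recorded for a COD = 1 macroblock in an incoming H.263 stream."""
    if is_first_in_picture:
        return pquant                  # PQUANT from the picture header
    if is_first_in_gob and gob_header_present:
        return gquant                  # GQUANT from the GOB header
    return prev_mb_quant               # inherit from the previous MB in raster order
```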
The next step is to form the stitched predicted blocks. For each macroblock for which the stored macroblock type is either INTER or INTER+Q or INTER4V or INTER4V+Q, motion compensation is carried out using bilinear interpolation as defined in sub clause 6.1.2 of the H.263 standard to form the prediction for the given macroblock. The motion compensation is performed on the actual stitched video sequence and not on the ideal stitched video sequence. Once the stitched predictor has been determined, the stitched raw residual and the stitched bitstream may be formed. For each macroblock in raster scanning order, the stitched raw residual is calculated as follows: For each macroblock, if the stored macroblock type is either INTRA or INTRA+Q, the stitched raw residual is formed by simply copying the co-located macroblock (i.e. having the same macroblock address) in the ideal stitched video picture; Otherwise, if the stored macroblock type is either INTER or INTER+Q or INTER4V or INTER4V+Q, then the stitched raw residual is formed by subtracting the stitched predictor from the co-located macroblock in the ideal stitched video picture.
The differential quantization parameter DQUANT for the given macroblock (except when the macroblock is the first macroblock in the picture) is formed by subtracting the QUANT value of the previous macroblock in raster scanning order (with respect to CIF picture resolution) from the QUANT of the given macroblock, and then clipping the result to the range {−2, −1, 0, 1, 2}. If this DQUANT is not 0, and the stored macroblock type is INTRA (value=3), the macroblock type must be changed to INTRA+Q (value=4). Similarly, if this DQUANT is not 0, and the stored macroblock type is INTER (value=0) or INTER4V (value=2), the macroblock type must be changed to INTER+Q (value=1). The stitched raw residual is then forward discrete cosine transformed (DCT) according to the process defined by Step A.2 in Annex A of H.263, and forward quantized using a quantization parameter obtained by adding the DQUANT set above to the QUANT of the previous macroblock in raster scanning order in the CIF picture (Note that this quantization parameter is guaranteed to be less than or equal to 31 and greater than or equal to 1). The QUANT value of the first macroblock in the picture is assigned to the PQUANT syntax element in the picture header. The result is then de-quantized and inverse transformed, and then added to stitched predicted blocks to produce the stitched reconstructed blocks. These stitched reconstructed blocks finally form the stitched video picture that will be used as a reference while stitching the subsequent picture.
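The DQUANT derivation and macroblock-type promotion just described can be illustrated as follows (a sketch only; the macroblock type codes are those quoted in the text, and the helper name is an assumption):

```python
def derive_dquant_and_type(prev_quant, curr_quant, mb_type):
    """Clip the QUANT difference to [-2, 2] and promote the MB type to its
    '+Q' variant when a non-zero DQUANT must be signalled.
    Type codes per the text: INTER=0, INTER+Q=1, INTER4V=2, INTRA=3, INTRA+Q=4."""
    dquant = max(-2, min(2, curr_quant - prev_quant))
    if dquant != 0:
        if mb_type == 3:             # INTRA  -> INTRA+Q
            mb_type = 4
        elif mb_type in (0, 2):      # INTER / INTER4V -> INTER+Q, per the rule above
            mb_type = 1
    used_quant = prev_quant + dquant  # QUANT actually used to quantize the residual
    return dquant, mb_type, used_quant
```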
Next a six-bit coded block pattern is computed for the given macroblock. The Nth bit of the six-bit coded block pattern will be 1 if the corresponding block (after forward transform and quantization in the above step) in the macroblock has at least one non-INTRADC coefficient (N=5 and 6 represent chroma blocks, while N=1,2,3,4 represent the luma blocks). The CBPC is set to the first two bits of the coded block pattern and CBPY is set to the last four bits of the coded block pattern. The value of COD for the given macroblock is set to 1 if all of these four conditions are satisfied: CBPC is 0, CBPY is 0, the DQUANT as set above is 0, and the luma motion vector is (0, 0). Otherwise, set COD to 0, and conditionally modify the macroblock type as follows: If the macroblock type is either INTER+Q (value=1), or INTER4V (value=2), or INTER4V+Q (value=3), and if DQUANT is set above to 0, then the macroblock type must be changed to INTER (value=0). If the macroblock type is INTRA+Q (value=4), and if DQUANT is set above to 0, then the macroblock type must be changed to INTRA (value=3). Note that the macroblock type for the first macroblock in the picture is always set to either INTRA or INTER.
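The coded block pattern and COD decision described above can be sketched as follows (names and the exact bit ordering within CBPY/CBPC are illustrative assumptions; blocks 0-3 are taken as luma and blocks 4-5 as chroma):

```python
def decide_cod_and_cbp(block_has_coeffs, dquant, luma_mv):
    """block_has_coeffs: six booleans, True if the corresponding block has at
    least one non-INTRADC coefficient after forward transform and quantization."""
    cbpy = sum(1 << (3 - n) for n in range(4) if block_has_coeffs[n])      # luma
    cbpc = sum(1 << (1 - n) for n in range(2) if block_has_coeffs[4 + n])  # chroma
    # COD = 1 only when there are no coefficients, no QUANT change, and a zero MV
    cod = int(cbpy == 0 and cbpc == 0 and dquant == 0 and luma_mv == (0, 0))
    return cod, cbpc, cbpy
```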
If the COD of the given macroblock is set as 0, the differential motion vector data MVD is formed by first forming the motion predictor for the given macroblock using the luma motion vectors of its neighbors, according to the process defined in 6.1.1 of H.263, assuming that the header of the current GOB is empty.
The stitched bitstream is formed as follows: At the picture layer, the optional PLUSPTYPE is never used (i.e., bits 6-8 in PTYPE are never set to "111"). These bits are set based on the resolution of the stitched output, e.g., if the stitched picture resolution is CIF, then bits 6-8 are "011". Bit 9 of PTYPE is set to "0" (INTRA, I-picture) if this is the very first output stitched picture; otherwise it is set to "1" (INTER, P-picture). CPM is set to off. No annexes are enabled. The GOB layer is coded without GOB headers. In the macroblock layer the syntax element COD is coded first. If COD=0, the syntax elements MCBPC, CBPY, DQUANT, and MVD (which have been set earlier) are entropy encoded according to Tables 7, 8, 9, 12, 13 and 14 in the H.263 standard. In the block layer, if COD=0, the forward transformed and quantized residual blocks are entropy encoded, using Tables 15, 16 and 17 in the H.263 standard, based on the coded block pattern information. Finally, the forward transformed and quantized residual coefficients are dequantized and inverse transformed, and the result is added to the stitched predicted block to obtain the stitched reconstructed block, thereby completing the loop of
It is pointed out here that for H.263 stitching in a general scenario, where incoming bitstreams are not synchronized with respect to each other and are transmitted over error-prone conditions, techniques similar to those described later for H.264 can be employed. In fact, the techniques for H.263 will be somewhat simpler. For example, there is no concept of coding a reference picture index in H.263, since the temporally previous picture is always the one used in H.263. The equivalent of MISSING_P_SLICE_WITH_P_SKIP_MBS (see later) can be devised by simply setting COD to 1 in the macroblocks of an entire quadrant. Also, as in H.264, error concealment is the responsibility of the H.263 decoder, and an error concealment procedure for the H.263 decoder is described separately towards the end of this description.
IV. Error Concealment for H.263 Decoder
The error concealment for H.263 decoder described here starts with similar assumptions as in H.264. As in the case of H.264, it is important to note the following properties about an RTP stream containing coded video:
Using the above, it is easy to group the packets belonging to a particular picture as well as determine which packets got lost (corresponding to missing sequence numbers) during transmission.
In order to come up with effective error concealment strategies, it is important to understand the different types of RTP packetization that are expected to be performed by the H.263 encoders/endpoints. For videoconferencing applications that utilize an H.263 baseline video codec, the RTP packetization is carried out in accordance with Internet Engineering Task Force RFC 2190, "RTP Payload Format for H.263 Video Streams," September 1997, in either mode A or mode B (as described earlier).
For mode A, the packetization is carried out on GOB or picture boundaries. The use of GOB headers or sync markers is highly recommended when mode A packetization is used. The primary advantages of this mode are the low overhead of 4 bytes per RTP packet and the simplicity of RTP encapsulation of the payload. The disadvantages are the coarse granularity of the payload size that can be accommodated (since the smallest payload is the compressed data for an entire GOB) and poor error resiliency. If GOB headers are used, we can identify the GOBs about which a received RTP packet contains information and thereby infer the GOBs for which no RTP packets have been received. For the MBs that correspond to the missing GOBs, temporal or spatial error concealment is applied. The GOB headers also help initialize the QUANT and MV information for the first macroblock in the RTP packet. In the absence of GOB headers, only picture or frame error concealment is possible.
For mode B, the packetization is carried out on MB boundaries. As a result, the payload can range from the compressed data of a single MB to the compressed data of an entire picture. An overhead of 8 bytes per RTP packet is used to provide for the starting GOB and MB address of the first MB in the RTP packet as well as its initial QUANT and MV data. This makes it easier to recover from missing RTP packets. The MBs corresponding to these missing RTP packets are inferred and temporal or spatial error concealment is applied. Note that picture or frame error concealment is needed only if an entire picture or frame is lost irrespective of whether GOB headers or sync markers are used.
In the case of H.263, there is no distinction between frame or picture loss error concealment and the treatment of missing access units or pictures due to asynchronous reception of RTP packets. In this respect, H.263 and H.264 are fundamentally different. This fundamental difference is due to the multiple reference pictures in the reference picture list utilized by H.264, whereas the H.263 baseline's reference picture is confined to its immediate predecessor. A dummy P-picture, all of whose MBs have COD=1, is used in place of the "missing" frame for purposes of frame error concealment.
Temporal error concealment for missing MBs is carried out by setting COD to 0, mb_type to INTER (and hence DQUANT to 0), and all coded block patterns CBPC, CBPY, and CBP to 0. The differential motion vectors in both directions are also set to 0. This ensures that the missing MBs are reconstructed with the best estimate of QUANT and MV that H.263 can provide. It is important to note, however, that in many cases one can do better than using the MV and QUANT information of all the MB's neighbors as in
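For illustration, the per-macroblock settings just listed can be captured as in the sketch below (field names are written out as an assumption; they mirror the H.263 syntax elements named above):

```python
def conceal_missing_mb():
    """Syntax-element settings used to temporally conceal a missing H.263 MB."""
    return {
        "COD": 0,          # macroblock is coded
        "mb_type": "INTER",
        "DQUANT": 0,       # implied by the INTER type
        "CBPC": 0,
        "CBPY": 0,
        "CBP": 0,
        "MVD": (0, 0),     # differential motion vectors in both directions
    }
```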
As in H.264, we have not employed any spatial concealment in H.263. The reason for this is the same as that in H.264: spatial concealment is most effective only in concealing an isolated lost macroblock that is surrounded by received neighbors. However, in situations where an entire RTP packet containing multiple macroblocks is lost, the conditions needed for spatial concealment to produce useful results are typically not present.
In a few instances, we can apply neither picture/frame error concealment nor temporal/spatial error concealment. These instances occur when part of, or an entire, I-picture is missing. In such cases, a videoFastUpdatePicture command is initiated using H.245 signaling to request an I-frame to refresh the decoder.
V. Alternative Practical Approaches for H.263 Stitching
Video stitching of H.263 video streams using the drift-free hybrid approach has been described above. The present invention further encompasses a number of alternative practical approaches to video stitching for combining H.263 video sequences. Three such approaches are:
A. Alternative Practical Approach for H.263 Stitching Employing Annex K
This method employs Annex K (with the Rectangular Slice submode) of the H.263 standard. Each component picture is assumed to have rectangular slices numbered from 0 to 9k−1, each of width 11i macroblocks (i.e., the slice width indication SWI is 11i−1), where k is 1, 2, or 4 and i is 1, 2, or 4 corresponding to QCIF, CIF, or 4CIF component picture resolution, respectively. The MBA numbering for these slices will be 11ij, where j is the slice number.
The stitching procedure is as follows:
Alternatively, invoke the Arbitrary Slice Ordering submode of Annex K (by modifying the SSS field of the stitched picture to “11”) and arrange the slices in any order
For the sake of simplicity of explanation, the stitching procedure assumed the width of a slice to be equal to that of a GOB as well as the same number of slices in each component picture. Although such assumptions would make the stitching procedure at the MCU uncomplicated, stitching can still be accomplished without these assumptions.
Note that this stitching approach is quite simple but may not be used when Annex D, F, or J (or a combination of these) is employed except when Annex R is also employed. Annexes D, F, and J cause a problem because they allow the motion vectors to extend beyond the boundaries of the picture. Annex J causes an additional problem because the deblocking filter operates across block boundaries and does not respect slice boundaries. Annex R solves these problems by extrapolating the appropriate slice in the reference picture to form predictions of the pixels which reference the out-of-bounds region and restricting the deblocking filter operation across slice boundaries.
B. Nearly Compressed Domain Approach for H.263 Stitching
This approach is performed in the compressed domain and entails the following main steps:
This approach is meant for the baseline profile of H.263, which does not include any of the optional coding tools specified in the annexes. Typically, in continuous presence multipoint calls, H.263 annexes are not employed in the interest of inter-operability. In any event, since the MCU is the entity that negotiates call capabilities with the endpoint appliance, it can ensure that no annexes or optional modes are used.
The detailed procedure is as follows. As in
The following procedure is employed to avoid incorrect motion vector prediction in the stitched picture. According to the H.263 standard, the motion vectors of macroblocks are coded in an efficient differential form. This motion vector differential, MVD, is computed as: MVD=MV−MVpred, where MVpred is the motion vector predictor for the motion vector MV. MVpred is formed from the motion vectors of the macroblocks neighboring the current macroblock. For example, MVpred=Median (MV1, MV2, MV3), where MV1 (left macroblock), MV2 (top macroblock), MV3 (top right macroblock) are the three candidate predictors in the causal neighborhood of MV (see
The above prediction process causes trouble for the stitching procedure at some of the component picture boundaries, i.e., wherever the component pictures meet in the stitched picture. Problems arise because component picture boundaries are not considered picture boundaries by the decoder (which has no conception of the stitching that took place at the MCU). In addition, the component pictures may skip some GOB headers, yet the existence of such GOB headers impacts the prediction process. These factors cause the encoder and the decoder to lose synchronization with respect to the motion vector prediction. Accordingly, errors will propagate to other macroblocks through motion prediction in subsequent pictures.
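Correcting this amounts to re-deriving the differential motion vector in the context of the CIF picture, using MVD = MV − MVpred with MVpred the component-wise median of the left, top, and top-right candidates. The following sketch illustrates the arithmetic only (function names are assumptions; the candidate substitution rules at picture and GOB boundaries defined in subclause 6.1.1 of H.263 are omitted):

```python
def median3(a, b, c):
    """Median of three scalars."""
    return sorted((a, b, c))[1]

def recompute_mvd(mv, mv_left, mv_top, mv_topright):
    """mv and the candidate predictors are (x, y) motion vectors valid in the
    stitched CIF picture; returns the differential MVD to code."""
    pred = (median3(mv_left[0], mv_top[0], mv_topright[0]),
            median3(mv_left[1], mv_top[1], mv_topright[1]))
    return (mv[0] - pred[0], mv[1] - pred[1])
```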
To solve the problem of incorrect motion vector prediction in the stitched picture, the following steps have to be performed during stitching:
The following procedure is used to avoid the use of the incorrect quantizer in the stitched picture. In the H.263 standard, every picture has a PQUANT (picture-level quantizer), GQUANT (GOB-level quantizer), and a DQUANT (macroblock-level quantizer). PQUANT (mandatory 5-bit field in the picture header) and GQUANT (mandatory 5-bit field in the GOB header) can take on values between 1 and 31 (both values inclusive) while DQUANT (2-bit field present in the macroblock depending on the macroblock type) can take on only 1 of 4 different values {−2, −1, 1, 2}. DQUANT is essentially a differential quantizer in the sense that it changes the current value of QUANT by the number it specifies. When encoding or decoding a macroblock, the QUANT value set via any of these three parameters will be used. It is important to note that while the picture header is mandatory, the GOB header may or may not be present in a GOB. GQUANT and DQUANT are made available in the standard so that flexible bitrate control may be achieved by controlling these parameters in some desired way.
During stitching, the three quantization parameters have to be handled carefully at the boundaries of the left-side and right-side QCIF GOBs. Without this procedure, the QUANT value used for a macroblock while decoding it may be incorrect, starting with the left-most macroblock of the right-side QCIF GOB.
The algorithm outlined below can be used to solve the problem of using an incorrect quantizer in the stitched picture. Since each GOB in the stitched CIF picture shall have a header (and therefore a GQUANT), the DQUANT adjustment can be done for each pair of QCIF GOBs separately. The parameter i denotes the macroblock index, taking on values from 0 through 11 corresponding to the right-most macroblock of the left-side QCIF GOB through to the last macroblock of the right-side QCIF GOB. The parameters MB[i], quant[i], and dquant[i] denote the data, QUANT, and DQUANT corresponding to the i-th macroblock, respectively. For each of the 18 pairs of QCIF GOBs, do the following on the right-side GOB macroblocks:
An example of using the above algorithm is shown in
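The exact adjustment algorithm is not reproduced here; the following is only a plausible sketch of the kind of processing it describes: walk the macroblocks of the right-side GOB, signal each QUANT change through DQUANT where it fits in the {-2, -1, 1, 2} range, and re-quantize with a clamped QUANT where it does not. All names, including the requantize_mb callback, are hypothetical.

```python
def adjust_right_gob_quants(left_gob_last_quant, right_gob_quants, requantize_mb):
    """left_gob_last_quant: QUANT of the right-most MB of the left-side QCIF GOB.
    right_gob_quants: original QUANT of each MB in the right-side QCIF GOB.
    requantize_mb(i, new_quant): hypothetical callback that re-quantizes MB i."""
    prev_quant = left_gob_last_quant
    for i, desired in enumerate(right_gob_quants):
        delta = desired - prev_quant
        if abs(delta) > 2:                       # DQUANT would be overloaded
            delta = 2 if delta > 0 else -2
            requantize_mb(i, prev_quant + delta)  # re-encode with achievable QUANT
        prev_quant = prev_quant + delta           # QUANT actually used in the CIF GOB
    return prev_quant
```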
In P-pictures, many P-macroblocks do not carry any data. This is indicated by the COD field in the macroblock being set to 1. When such macroblocks lie near the boundary between the left- and the right-side QCIF GOBs, it is possible to take advantage of them by re-encoding them as macroblocks with data, i.e., change COD field to 0, which leads to the following further additions to the macroblock:
Note that we can do this for such macroblocks regardless of whether they lie on the left side or the right side of the boundary. Furthermore, if there are consecutive such macroblocks on either side of the boundary, then we can take advantage of the entire string of such macroblocks. Finally, we note that some P-macroblocks may have the COD field set to 0 even though there is no transform coefficient data, as indicated by a zero Coded Block Pattern for both luminance and chrominance. We can take advantage of macroblocks of this type in the same manner if they lie near the boundary, except that we retain the original value of the differential motion vector in the last step instead of setting it to 0.
One way to improve the above algorithm is to have a process to decide whether to re-quantize and re-encode macroblocks in the left-side or the right-side GOB, instead of always choosing the macroblocks in the right-side GOB. When the QUANT values used on either side of the boundary between the left- and right-side QCIF GOBs differ by a large amount, the loss in quality due to the re-quantization process can be noticeable. Under such conditions, the following approach is used to mitigate the loss in quality:
This approach increases the complexity of the algorithm by a negligible amount, since this measure of the quality of stitching can be computed after the pair of QCIF GOBs has been decoded but prior to its stitching. Hence, the decision to distribute the re-quantization and re-encoding on either side of the boundary of the QCIF GOBs can be made prior to stitching. Finally, this situation happens very rarely (less than 1% of the time). For all of these reasons, this approach has been incorporated into the stitching algorithm.
The basic idea of the simplified compressed domain H.263 stitching, consisting of the three main steps (i.e., parsing of the individual QCIF bitstreams, differential motion vector modification, and DQUANT modification), has been described in D. J. Shiu, C. C. Ho, and J. C. Wu, "A DCT-Domain H.263 Based Video Combiner for Multipoint Continuous Presence Video Conferencing," Proc. IEEE Conf. Multimedia Computing and Systems (ICMCS 1999), Vol. 2, pp. 77-81, Florence, Italy, June 1999, the teaching of which is incorporated herein by reference. However, the specific details for DQUANT modification as proposed here are unique to the present invention.
C. Detailed Description of Alternative Practical Approach for H.263 Stitching Using H.263 Payload Header in RTP Packet
In the case of videoconferencing over IP networks, the audio and video information is transported using the Real Time Protocol (RTP). Once the appliance has encoded the input video frame into an H.263 bitstream, it is packaged into RTP packets according to RFC 2190. Each such RTP packet consists of a header and a payload. The RTP payload contains the H.263 payload header and the H.263 bitstream payload.
Three formats, Mode A, Mode B, and Mode C, are defined for the H.263 payload header:
First, it has to be determined which of the three modes is suitable for packetization of the stitched bitstream. Since the PB-frames option is not expected to be used in videoconferencing for delay reasons, mode C can be eliminated as a candidate. In order to determine whether mode A or mode B is suitable, the discussion of H.263 stitching from the previous section has to be recalled. During stitching, each pair of GOBs from the two QCIF quadrants is merged into a single CIF GOB. Two issues arise out of such a merging process:
The incorrect motion vector prediction problem can be solved rather easily by re-computing the correct motion vector predictors (in the context of the CIF picture) and thereafter the correct differential motion vectors to be coded into the stitched bitstream. The incorrect quantizer use problem is unfortunately not as easy to solve. The GOB merging process leads to DQUANT overloading in some rare cases thereby requiring re-quantization and re-encoding of the affected macroblocks. This may lead to a loss of quality (however small) in the stitched picture which is undesirable. This problem can be prevented only if DQUANT overloading can somehow be avoided during the process of merging the QCIF GOBs. One solution to this problem would be to figure out a way of setting QUANT to the desired value right before the start of the right-side QCIF GOB in the stitched bitstream. However, since the right-side QCIF GOB is no longer a GOB in the CIF picture, a GOB header cannot be inserted to provide the necessary QUANT value through GQUANT. This is exactly where mode B of RTP packetization, as described above, can be helpful. At the output of the stitcher, the two QCIF GOBs corresponding to a single CIF GOB can be packaged into different RTP packets. Then, the 5-bit QUANT field present in the H.263 payload header in mode B RTP packets (but not in mode A packets) can be used to set the desired QUANT value (the QUANT seen in the context of the QCIF picture) for the first MB in the packet containing the right-side QCIF GOB. This will ensure that there is no overloading of DQUANT and therefore no loss in picture quality.
One potential problem with the proposed lossless stitching technique described above is the following. The QUANT assigned to the first MB of the right-side QCIF GOB through the H.263 payload header in the RTP packet will not agree with the QUANT computed by the CIF decoder based on the QUANT of the previous MB and the DQUANT of the current MB (if the QUANT values did agree, there would be no need to insert a QUANT through the H.263 payload header). In this scenario, it is unclear as to which QUANT value will be picked by the decoder for the MB in question. The answer to this question probably depends on the strategy used by the decoder in a particular videoconferencing appliance.
It should be understood that various changes and modifications to the presently preferred embodiments described herein will be apparent to those skilled in the art. Such changes and modifications can be made without departing from the spirit and scope of the present invention and without diminishing its intended advantages. It is therefore intended that such changes and modifications be covered by the appended claims.
The present application claims benefit under 35 U.S.C. section 119(e) of the following U.S. Provisional Patent Applications, the entireties of which are incorporated herein by reference: (i) Application No. 60/467,457, filed May 2, 2003 ("Combining/Stitching of Standard Video Bitstreams for Continuous Presence Multipoint Videoconferenceing"); (ii) Application No. 60/471,002, filed May 16, 2003 ("Stitching of H.264 Bitstreams for Continuous Presence Multipoint Videoconferenceing"); and (iii) Application No. 60/508,216, filed Oct. 2, 2003 ("Stitching of Video for Continuous Presence Multipoint Videoconferenceing").