With the rapid improvement and development of network infrastructures, multimedia applications for a variety of devices with different capabilities have become very popular. These devices range from cell phones and PDAs with small screens and restricted processing power to high-end PCs with high-definition displays. These devices are connected to different types of networks with various bandwidth limitations and loss characteristics. Addressing this vast heterogeneity is a considerable challenge. A highly attractive solution, which has been under development for the past 20 years, is known as scalable video coding. The term “scalability” here means that certain parts of the bitstream can be removed in order to adapt it to the various requirements of end users as well as to varying network conditions or terminal capabilities.
The H.264/AVC scalable video coding standard, often referred to as SVC, is an extension (Annex G) of the H.264/AVC video compression standard. The H.264/AVC standard is a Moving Picture Experts Group (MPEG) video compression standard based on motion compensation. Motion compensation is a technique often used in video compression in which a frame is described in terms of a transformation with respect to a reference frame. The reference frame may be earlier in time or even from the future. This will be described in further detail below.
In MPEG video compression, frames can be grouped into sequences called a group of pictures (GOP). Each coded video stream consists of successive GOPs. A GOP can contain three main frame types: an I-frame, P-frame, or B-frame. An I-frame (intra-coded frame) is a reference frame which represents a fixed image and is independent of other picture types. A P-frame (predictive coded frame) contains motion-compensated information from a preceding I- or P-frame, and therefore is more compressible than an I-frame. For example, it may contain motion vectors that describe a transformation in reference to a preceding I- or P-frame. Finally, a B-frame (bi-directionally predictive coded frame) contains difference information from a preceding and following I- or P-picture within a GOP, and therefore can obtain the highest amount of data compression. An example of the different types of frames within a GOP will now be described in reference to
In this example, a GOP 116 includes six frames: I-frame 102, B-frame 104, B-frame 106, P-frame 108, B-frame 110 and B-frame 112. This particular example video consists of a sailboat slowly moving across an ocean, with a bird flying in the background. I-frame 102 (a reference frame) includes all the pictorial information: sailboat, ocean, sun, flying bird. In contrast, P-frame 108, which is predicted from I-frame 102, only includes the change with respect to I-frame 102; for example, since the sun and ocean do not really move, only the new positions of the sailboat and flying bird (moving objects) are contained within P-frame 108. Similarly, B-frame 106 contains only transformational information, as it is predicted from I-frame 102 and P-frame 108. In this particular example, since the sailboat is moving much slower than the flying bird, the only real difference between I-frame 102 and P-frame 108 is the position of the flying bird as it moves across the frame. Therefore, B-frame 106 contains only information describing the new position of the flying bird. Similarly, B-frame 104 is then predicted from I-frame 102 and B-frame 106, and contains only the change in the flying bird's position. In a similar fashion, B-frames 110 and 112 are predicted from P-frame 108 and I-frame 114. Note that I-frame 114 (the start of the next GOP) illustrates the full image information, now without the bird since it has moved off the frame.
Note that out of the six frames in GOP 116, only one frame (I-frame 102) contains information describing a full image; P-frame 108 contains only the transformation with respect to I-frame 102, B-frame 106 contains only the transformation with respect to I-frame 102 and P-frame 108, and so on. In this manner, it can be seen that, since they contain only transformations with respect to a reference frame, P-frames contain much less information than I-frames, and B-frames contain even less information than P-frames. Thus the use of B-frames and P-frames within a GOP is often implemented in MPEG video compression in order to provide very high compression ratios.
As mentioned previously, SVC is a standardized extension of an MPEG video compression standard (H.264/AVC). SVC allows for spatial, temporal, and quality scalabilities. Spatial scalability and temporal scalability describe cases in which subsets of the bit stream represent the source content with a reduced picture size (spatial resolution) or frame rate (temporal resolution). With quality scalability, the substream provides the same spatio-temporal resolution as the complete bit stream, but with a lower fidelity—where fidelity is often measured by signal-to-noise ratio (SNR). The SVC design enables the creation of a video bitstream that is structured in layers, consisting of a base layer and one or more enhancement layers. Each enhancement layer either improves the resolution (spatially or temporally) or the quality of the video sequence. The superb adaptability of SVC and its acceptable coding efficiency make SVC a suitable candidate for many video communication applications such as multicast, video surveillance, and peer-to-peer video sharing.
In SVC, temporal scalability is provided by the concept of hierarchical B-frames within a GOP. This will be discussed in further detail later with reference to
Quality scalability can be seen as a special case of spatial scalability with identical picture sizes for base and enhancement layers. The same prediction techniques are utilized except for the corresponding upsampling operations. This type of quality scalability is referred to as coarse-grain quality scalable coding (CGS). Since CGS can only provide a discrete set of decoding points, a variation of the CGS approach, which is referred to as medium-grain quality scalability (MGS), is included in the SVC design to increase the flexibility of the bit stream adaptation.
The coded video data of the SVC are organized into packets with an integer number of bytes, called Network Abstraction Layer (NAL) units. Each NAL unit belongs to a specific spatial, temporal, and quality layer. Moreover, a set of NAL units from all spatial layers having the same temporal instant constitute an Access Unit (AU).
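For illustration, the layer coordinates that an extractor works with can be modeled roughly as follows (a minimal sketch with hypothetical field names, not the actual SVC NAL unit syntax):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NalUnit:
    frame: int       # temporal instant (picture index)
    spatial: int     # spatial layer d
    temporal: int    # temporal layer
    quality: int     # quality level q (0 = base quality)
    size_bytes: int  # payload size, an integer number of bytes

def access_unit(units, frame):
    """All NAL units, from every spatial layer, sharing one temporal instant."""
    return [u for u in units if u.frame == frame]
```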
Video stream generator 202 provides the original video stream (containing all the frames with all the information) to video encoder 204 via channel 212. Video encoder 204 then encodes the frames into a bitstream consisting of GOPs (containing I, P, and B-frames) similar to that shown in
A key element of SVC is bit stream extraction. In order to extract a substream with a particular average bit rate and/or resolution, a bit stream extractor is employed. Bit stream extractor 220 lies within video encoder 204, which determines the specific substream to be extracted from the entire coded bit stream by deciding which NAL units to send, depending on the channel bit rate and target resolution for each target. For example, based on the spatial resolution of PC 210 and the bit rate of channel 218, bit stream extractor 220 determines the substream to be sent to PC 210 by deciding which NAL units of the entire coded bitstream are to be sent and which are to be discarded.
It should be noted that bit stream extractor 220 does not need to lie within video encoder 204. In some cases, a bit stream extractor may be a separate device from an encoder.
The issue with bit stream extraction is that there usually exists a huge number of possibilities (especially for MGS coding) that result in, approximately, the same bit rate. A very naïve method is to randomly discard NAL units until the desired bit rate is achieved. Nonetheless, the efficiency of the bit extractor can be substantially improved by assigning a priority identifier to each NAL unit during the encoding or a post-processing operation. The priority identifier is directly related to the contribution of the NAL unit in the overall quality of the video sequence. Therefore, bit stream extractor 220 first discards the NAL units with the lowest priority in order to reach the target bit rate of the given channel. In the example mentioned above regarding the substream sent to PC 210, this may involve discarding all NAL units corresponding to certain enhancement temporal, spatial, and quality levels and only keeping those corresponding to basic temporal, spatial and quality levels.
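A minimal sketch of this priority-driven discard step, assuming each unit carries a hypothetical priority field (higher meaning more important) assigned during encoding or in post-processing:

```python
def extract_by_priority(units, target_bits):
    """Drop the lowest-priority NAL units until the stream fits target_bits."""
    kept = sorted(units, key=lambda u: u.priority, reverse=True)
    total = sum(u.size_bytes for u in kept) * 8
    while kept and total > target_bits:
        dropped = kept.pop()            # lowest-priority unit goes first
        total -= dropped.size_bytes * 8
    return kept
```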
The problem of optimal extraction of the NAL units is a challenging task due to various temporal and spatial dependencies. The implementation of the different scalabilities of SVC (temporal, quality, spatial) will now be described in further detail in order to examine the various dependencies and to describe the conventional bit stream extraction methods.
As mentioned previously, in SVC temporal scalability is implemented by the use of hierarchical B-frames within a GOP. This will now be discussed in reference to
In this example, the bit stream is broken into four temporal layers—TL0, TL1, TL2 and TL3. Starting from the bottom, Temporal Layer 0 represents the base temporal layer and includes frames 314 and 316, which are either I-frames or P-frames. The next level up, TL1, includes only frame 318. Frame 318 is a B-frame which is predicted from frames 314 and 316, as indicated by the arrows. The next temporal layer, TL2, includes frames 320 and 322, which are B-frames based on 314 and 318 and on 318 and 316, respectively. The final temporal layer, TL3, includes frames 324, 326, 328, and 330 which are all B-frames based on frames from the previous temporal layers as indicated by the arrows.
Diagram 300 illustrates an example of the hierarchical prediction structure implemented in SVC bitstreams. Since, for example, the frames in TL2 are predicted from the frames in TL1 and in TL0, TL2 is said to be dependent on TL1 and TL0; thus a bit stream extractor should only include TL2 if both TL1 and TL0 were also included. Thus for extracting temporally scaled substreams from this bit stream, there are only four possibilities: include TL0 only, include TL0+TL1, include TL0+TL1+TL2, or include TL0+TL1+TL2+TL3. A substream including all temporal layers (TL0+TL1+TL2+TL3) would correspond to the highest temporal level (and would have the maximum frame rate), whereas a substream including only one temporal layer (TL0) would correspond to the lowest temporal level (would have the minimum frame rate).
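Because each temporal layer depends on all lower ones, a valid temporal extraction is always a prefix TL0…TLk. A sketch, assuming the dyadic hierarchy described above (each added layer doubles the frame rate):

```python
def temporal_substream(units, max_tl):
    """Keep only NAL units whose temporal layer does not exceed max_tl."""
    return [u for u in units if u.temporal <= max_tl]

def frame_rate(base_fps, max_tl):
    """With dyadic hierarchical B-frames, each layer doubles the frame rate."""
    return base_fps * (2 ** max_tl)

# e.g. a 3.75 fps base layer: TL0 -> 3.75 fps, +TL1 -> 7.5 fps,
# +TL2 -> 15 fps, +TL3 -> 30 fps
```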
In
GOP 312 is from
As with temporal scalability, a bit stream extractor can choose to discard SNR quality layers (starting with Quality Layer 3) in order to scale down the SNR quality as appropriate for the target. For example, a bit stream extractor can choose to discard Quality Layer 3 (GOP 406) and keep only Quality Layers 2, 1, and 0 (GOP 404, 402, 312).
In the example of
Block 400 is from
Full bitstream 500 represents an SVC bitstream that includes all the possible temporal, quality, and spatial layers. As mentioned earlier, to scale down this bitstream, the bit stream extractor must decide which frames, or NAL units, from full bitstream 500 must be discarded. In conventional bit stream extraction methods, scalability is implemented by discarding entire temporal, quality, or spatial layers. This will be further discussed in reference to
Note that unlike full bitstream 500 (which includes all four spatial layers), substream 600 includes only two spatial layers (blocks 400 and 502). So in this case, the bit stream extractor has decided to simply discard the top two spatial layers (blocks 504 and 506 of
Note that in each of blocks 702, 704, 706 and 708, the top two quality layers (Quality Layers 2 and 3) are not present; only Quality Layers 0 and 1 are present. Thus in this case, the bit stream extractor has decided to discard Quality Layers 2 and 3 in each spatial layer. In this manner, full bitstream 500 has been scaled down to a lower quality level.
In each of blocks 802, 804, 806, and 808, Temporal Layer 3 is missing; only Temporal Layers 0, 1, and 2 are present. Thus, in this case, the bit stream extractor has decided to discard Temporal Layer 3 in each quality layer and in each spatial layer. In this manner, full bitstream 500 has been scaled down to a lower temporal level.
Note that
The application/device for which the video is being decoded usually determines the target spatial and temporal resolutions. Therefore, the base layer of each spatial and temporal resolution lower than or equal to the target spatial and temporal resolutions has to be included first. Next, for each lower spatial resolution, NAL units of higher quality levels are ordered in increasing order of their temporal level. Finally, for the target spatial resolution, NAL units are ordered based on their quality level and are included until the target bit rate is reached.
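This conventional, content-independent ordering can be rendered as a sort key; the following is a rough sketch of the policy just described, not the normative extraction process:

```python
def basic_extraction_order(units, target_spatial):
    """Content-independent ordering: base layers first, then lower-resolution
    quality units by temporal level, then the target resolution's quality units."""
    usable = [u for u in units if u.spatial <= target_spatial]
    def key(u):
        if u.quality == 0:                 # all base layers come first
            return (0, u.spatial, u.temporal)
        if u.spatial < target_spatial:     # lower resolutions: by temporal level
            return (1, u.spatial, u.temporal, u.quality)
        return (2, u.quality, u.temporal)  # target resolution: by quality level
    return sorted(usable, key=key)

def take_until_budget(ordered_units, target_bits):
    """Include units in order until the target bit rate is reached."""
    out, total = [], 0
    for u in ordered_units:
        if total + u.size_bytes * 8 > target_bits:
            break
        out.append(u)
        total += u.size_bytes * 8
    return out
```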
A major drawback of this conventional basic extraction method is that its prioritization policy is independent of the video content. Since the distortion of a frame depends on the content of the frame in addition to the quantization parameter used, only a content-aware prioritization policy can ensure optimal extraction. Considering the fact that the standard does not specify the extraction process, one can devise an alternative, more efficient process.
What is needed is a bit stream extraction system and method that can optimally and efficiently extract NAL units from an SVC bit stream.
The present invention provides a system and method to optimally and efficiently extract NAL units from an SVC bit stream, in order to provide a scaled substream that results in minimal distortion for a given bit rate, or that can maximize the resulting bit rate for a given acceptable distortion.
In accordance with an aspect of the present invention, a device may be used with a frame generating portion that is arranged to receive picture data corresponding to a plurality of pictures and to generate encoded video data for transmission across a transmission channel having an available bandwidth. The frame generating portion can generate a frame for each of the plurality of pictures to create a plurality of frames. The encoded video data is based on the received picture data. The device includes a distortion estimating portion, an inclusion determining portion, and an extracting portion. The distortion estimating portion can estimate a distortion. The inclusion determining portion can establish an inclusion boundary based on the estimated distortion. The extracting portion can extract a frame from the plurality of frames based on the inclusion boundary.
Additional advantages and novel features of the invention are set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following or may be learned by practice of the invention. The advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The accompanying drawings, which are incorporated in and form a part of the specification, illustrate an exemplary embodiment of the present invention and, together with the description, serve to explain the principles of the invention. In the drawings:
In accordance with an aspect of the present invention, a bit stream extractor is able to efficiently select NAL units within SVC layers to create a scaled substream with the minimum estimated distortion for a given bit rate.
In accordance with another aspect of the present invention, the bit stream extractor is able to efficiently select NAL units within SVC layers to create a scaled substream with the minimum estimated distortion for a given bit rate and to address the problem of known packet losses over networks.
In accordance with another aspect of the present invention, the bit stream extractor is able to efficiently select NAL units within SVC layers to create a scaled substream with the minimum estimated distortion for a given bit rate and to address the problem of a known packet loss gradient over networks.
In accordance with another aspect of the present invention, a bit stream extractor is able to efficiently select NAL units within SVC layers to create a scaled substream with the maximum bit rate for a given maximum distortion.
In accordance with another aspect of the present invention, the bit stream extractor is able to efficiently select NAL units within SVC layers to create a scaled substream with the maximum bit rate for a given maximum distortion and to address the problem of known packet loss over networks.
In accordance with another aspect of the present invention, the bit stream extractor is able to efficiently select NAL units within SVC layers to create a scaled substream with the maximum bit rate for a given maximum distortion and to address the problem of a known packet loss gradient over networks.
As discussed previously, conventional methods of SVC bit stream extraction typically involve discarding entire temporal/spatial/quality layers in order to match a target bit rate and spatial resolution. Since the actual content of the NAL units is not taken into account, the extracted substream will likely be suboptimal (in terms of distortion, etc.). In accordance with an aspect of the present invention, an algorithm can accurately and efficiently approximate both the source and expected distortion of the resulting video sequence for any subset of the available NAL units. Then, in accordance with an aspect of the present invention, using a model based on optimizing bit rate and/or distortion, the optimal set of NAL units can be efficiently selected.
An example encoder in accordance with an aspect of the present invention will now be discussed below in reference to
Video stream generator 202 provides the original video stream (containing all the frames with all the information) to video encoder 902 via channel 212. It should be noted that the original video stream is made up of a plurality of individual pictures, each having picture data. As such, video encoder 902 is arranged to receive picture data corresponding to a plurality of pictures. Video encoder 902 is able to generate encoded video data based on the received picture data. The encoded video data will then be able to be transmitted across a transmission channel having an available bandwidth, for example channel 214 to cell phone 206, channel 216 to HDTV 208 and channel 218 to PC 210.
Frame generating portion 906 can generate a frame for each of the plurality of pictures to create a plurality of frames, for example discussed above with reference to
Distortion estimating portion 908 can estimate a distortion. As will be discussed in more detail below, in accordance with an aspect of the present invention, distortion estimating portion 908 can estimate a distortion that transmitted data will encounter when transmitted over a transmission channel, for example, transmission channel 214 to cell phone 206.
Further, in some embodiments, distortion estimating portion 908 can estimate a distortion that transmitted data will encounter over a communication channel having a known amount of packet loss. For example, presume that a user of cell phone 206 is driving in a car along a road with very good cellular reception. It may be known that channel 214 has an amount of packet loss such that cell phone 206 will not receive 1 out of every 50 image frames that were transmitted. Distortion estimating portion 908 will be able to take such a packet loss into account when estimating a distortion.
Still further, in some embodiments, distortion estimating portion 908 can estimate a distortion that transmitted data will encounter over a communication channel having a known packet loss gradient. Again, presume that a user of cell phone 206 is driving in a car along a road with very good cellular reception, during a first time period. Again, it may be known that at the first time period, channel 214 has an amount of packet loss such that cell phone 206 is not receiving 1 out of every 50 image frames that were transmitted. Now, presume that the user of cell phone 206 is driving in the car along the road with very bad cellular reception, during a second time period. It may be known that at the second time period, channel 214 has an amount of packet loss such that cell phone 206 is not receiving 3 out of every 50 image frames that were transmitted. The change in the amount of packet loss in channel 214 from the first time period to the second time period is referred to as the packet loss gradient. Distortion estimating portion 908 will take such a packet loss gradient into account when estimating a distortion.
Inclusion determining portion 910 can establish an inclusion boundary of the substreams of varying spatial, temporal, and quality levels based on the estimated distortion, as will be discussed in more detail below.
Extracting portion 912 can extract a frame from the plurality of frames based on the inclusion boundary. As will be discussed in more detail below, once an inclusion boundary of the substreams of varying spatial, temporal, and quality levels is established (based on the estimated distortion), the extracting portion may extract those frames that lie outside of the inclusion boundary. This specific frame selection in accordance with an aspect of the present invention is distinct from the frame selection based on a rigid spatial, temporal or quality level, as discussed above with reference to
As illustrated in the figure, frame generating portion 906, distortion estimating portion 908, inclusion determining portion 910 and extracting portion 912 are individual devices. In some embodiments, video encoder 902 may be implemented on a device that is operable to read a device-readable media having device-readable instructions stored thereon, wherein the device-readable instructions are capable of instructing the device to operate in the manner discussed herein.
In some embodiments, at least two of frame generating portion 906, distortion estimating portion 908, inclusion determining portion 910 and extracting portion 912 are a unitary device. Further, in some embodiments, at least two of frame generating portion 906, distortion estimating portion 908, inclusion determining portion 910 and extracting portion 912 may be implemented on a device that is operable to read a device-readable media having device-readable instructions stored thereon, wherein the device-readable instructions are capable of instructing the device to operate in the manner discussed herein.
In some embodiments, a bit stream extractor 904 is not within a video encoder. In such embodiments, a distinct video encoder is able to first encode an entire set of substreams of varying spatial, temporal, and quality levels. The distinct video encoder may then provide the entire set of substreams to a separate distinct bit stream extractor in accordance with an aspect of the present invention.
A more detailed discussion of example embodiments of a video encoder in accordance with the present invention will now be described.
Suppose that a video provider wants to provide a video having a predetermined guaranteed quality. In other words, the video provider wants to provide a video having a predetermined acceptable (maximum) distortion. In accordance with an aspect of the present invention, a video encoder may determine a specific video stream that will provide a maximum bit rate for transmission for a predetermined acceptable distortion.
Referring to
A case of extracting a substream with the maximum bit rate for a predetermined acceptable (maximum) distortion, in accordance with an aspect of the present invention, will now be discussed below with reference to
In graph 1000, note that the NAL units in Quality Layer 0 have a higher transmission rate than those in Quality Layer 1, and those NAL units in Quality Layer 1 have a higher transmission rate than those in Quality Layer 2, and so forth. This is because as more Quality Layers are included in the extracted substream, more NAL units need to be transmitted, therefore slowing down the transmission rate.
In accordance with an aspect of the present invention, the bit extraction algorithm determines the subset of NAL units that results in the maximum possible bit rate for a given acceptable distortion, using the transmission rates for each NAL as shown in
Note that the graph of
Note that inclusion border 1102 drawn on block 400 is the same as that shown in
Blocks 1402, 1404, 1406, and 1408 correspond to blocks 400, 502, 504, and 506 of full bitstream 1300, except now in each block only the NAL units designated to be included by their respective inclusion borders remain. In this manner, substream 1400 has been extracted from full bitstream 1300, by the use of inclusion maps based on the maximization of the transmission rate of the included NAL units.
Note that the bit stream extraction method of the present invention (as illustrated in
Now suppose that a video provider wants to provide a video stream having a minimized distortion for a given transmission bit rate. In other words, the video provider wants to provide a video, knowing that the transmission rate is limited to a fixed transmission bit rate. In this case, the video provider wants to minimize distortion. In accordance with an aspect of the present invention, a video encoder may determine a video stream that will determine a minimum distortion for a fixed bit rate of transmission.
Referring to
An example method of bit stream extraction to maximize transmission rate for a given distortion in accordance with an aspect of the present invention will now be described with reference to
Method 1500 starts (S1502) and the acceptable (maximum) distortion is determined (S1504). This determination may be made by the system operator.
For example, with reference to
On the other hand, the acceptable distortion to be used by inclusion determining portion 910 may be very large, which would result in a relatively low required transmission rate. In such a case, channel 216 to HDTV 208 and channel 214 to cell phone 206 would be able to support a resulting video stream. However, the resulting video stream provided to either device may have much distortion.
In light of the two examples discussed above, it is clear that the level of acceptable distortion may be adjusted to accommodate an intended transmission rate.
Once the acceptable distortion is determined, the full SVC bitstream (including all temporal, quality, and spatial layers) is generated (S1506). An example is full bitstream 500, shown in
Then the transmission rate of each NAL unit in the full SVC bitstream is determined (S1508), as illustrated in
With this information, the optimal subset of NAL units resulting in the maximum transmission rate for a given distortion is determined, and an inclusion map for each spatial layer is generated (S1510), as shown in
Then the desired NAL units (as indicated by the inclusion maps) are extracted (S1512), and the output bitstream is generated (S1514). An example of an extracted substream is illustrated in
In method 1500, the acceptable (maximum) distortion is determined (S1504) before the full SVC bitstream is generated (S1506). However, in other example embodiments, the full SVC bitstream may be generated (S1506) before the acceptable (maximum) distortion is determined (S1504).
So far the case of extracting a substream in order to maximize the overall bit rate for a given acceptable distortion has been discussed. The case of extracting a substream in order to minimize distortion for a given bit rate will now be discussed. The extraction method is similar in that the content of each individual NAL unit is considered and inclusion maps for each spatial layer are drawn to extract the optimal substream. The inclusion map for each spatial layer is chosen based on the specific subset of NAL units that would result in the least distortion for a given bit rate. An example method of bit stream extraction to minimize distortion for a given bit rate in accordance with an aspect of the present invention will now be described with reference to
Method 1600 starts (S1602) and the available transmission bit rate is determined (S1604).
Referring to
Once the available transmission bit rate is determined, the full SVC bitstream (including all temporal, quality, and spatial layers) is generated (S1606). An example is full bitstream 500, shown in
Then the expected distortion of the resulting video sequence of each possible substream (each possible set of NAL units) is determined (S1608). This will be described in greater detail below.
With this information, the subset of NAL units resulting in the minimum distortion for the given bit rate is determined, and an inclusion map for each spatial layer is generated (S1610). This will be described in greater detail below.
Then the desired NAL units (as indicated by the inclusion maps) are extracted (S1612), and the output bitstream is generated (S1614). After this, method 1600 ends (S1616).
In method 1600, the available transmission bit rate is determined (S1604) before the full SVC bitstream is generated (S1606). However, in other example embodiments, the full SVC bitstream may be generated (S1606) before the available transmission bit rate is determined (S1604).
A more detailed explanation of the distortion estimation and minimization for optimal bit stream extraction in accordance with aspects of the present invention will now be provided.
As a substitute for the content-independent packet prioritization of conventional extraction methods, a rate-distortion optimized priority-based framework is employed. In such a framework, a priority is computed for a NAL unit, which represents a frame or a portion of a frame (i.e., a residual frame) at a given spatial/temporal/quality level. Note that in this scheme, unlike the conventional basic extraction scheme, all pictures of a given layer do not necessarily follow the same prioritization order. In order to efficiently assign Quality Layers, NAL units have to be ordered according to their contribution to the overall quality of the video sequence. When the correct order is obtained, one can assign the Quality Layers to the NAL units either based on a quantization of their indices or based on an iterative merging algorithm. With an iterative merging algorithm, at each iteration the two adjacent quality increments with the minimum increase in the area below an R-D curve are selected and merged into one, until the target number of Quality Layers is reached.
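The merging rule can be sketched as follows. Here decoding points are modeled as (rate, distortion) pairs on a staircase R-D curve, and the cost of deleting an interior point is the area the staircase gains; this is a simplified stand-in for the area criterion described above, not the reference algorithm:

```python
def merge_into_quality_layers(points, n_layers):
    """Greedily merge adjacent decoding points until n_layers remain.

    points: (rate, distortion) pairs sorted by increasing rate (and thus
    decreasing distortion). Requires n_layers >= 2 so the end points survive.
    """
    assert n_layers >= 2
    pts = list(points)
    while len(pts) > n_layers:
        best_i, best_cost = None, float("inf")
        for i in range(1, len(pts) - 1):       # interior points only
            width = pts[i + 1][0] - pts[i][0]  # rate span the point controlled
            rise = pts[i - 1][1] - pts[i][1]   # distortion gap to left neighbor
            cost = width * rise                # area added if point i is dropped
            if cost < best_cost:
                best_i, best_cost = i, cost
        pts.pop(best_i)
    return pts
```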
Assuming an optimal order of the NAL units exists, it can be obtained if, for any bit rate R_min < R < R_max, an optimal subset of the available NAL units can be extracted. Here R_min and R_max denote the minimum and maximum possible bit rates of the scalable bit stream, respectively. As a result, the problem of optimal extraction of a substream at a provided bit rate R is considered. Once the solution to this problem is obtained, one can easily order packets and assign Quality Layers.
Let π(n, d, q) represent the NAL unit associated with frame n at spatial resolution d and quality level q (q=0 represents the base quality). Then, any “consistent” subset of quality increments, P, can be uniquely identified by an inclusion map φ defined by
φ(n,d)=|Q(n,d)|, (1)
where Q(n, d) := {q : π(n, d, q) ∈ P} and the notation |•| represents the cardinality of a set. The term “consistent” here refers to a set whose elements are all decodable by the scalable decoder (children do not appear in the set without parents). Note that φ(n, d) = 0 indicates that no NAL unit for frame n at resolution d has been included in the set. In this case, if d represents the base resolution, it is inferred that the base layer has been skipped and therefore the dependent frames are undecodable. An example of an inclusion map for a single resolution bit stream is discussed above with reference to
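Along the quality dimension, consistency simply means that for each (n, d) the included q values form a prefix {0, …, φ(n, d) − 1}. A small sketch of that check (dependencies across frames and spatial layers would need additional tests):

```python
from collections import Counter, defaultdict

def is_quality_consistent(included):
    """included: set of (n, d, q) triples. True iff, per (n, d), the included
    q's are exactly 0..k-1 for some k (no child without its parents)."""
    by_frame = defaultdict(set)
    for n, d, q in included:
        by_frame[(n, d)].add(q)
    return all(qs == set(range(len(qs))) for qs in by_frame.values())

def inclusion_map(included):
    """phi(n, d) = |Q(n, d)|: number of quality increments included."""
    return Counter((n, d) for n, d, _ in included)
```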
The problem of optimal selection of the quality increments with a target rate of R_T can be formulated as

φ* = arg min_{φ ∈ Φ} D(φ), subject to R(φ) ≤ R_T, (2)

where φ is a vector with elements φ(n,d) for all possible n and d. Furthermore, R(φ) and D(φ) denote the average bit rate and distortion of the video sequence computed using the substream associated with selection map φ. Here, Φ represents the set of all possible selection maps for which the resulting substream is decodable. In an example embodiment of the present invention, the distortion D is calculated using the mean squared error (MSE) metric with respect to the fully reconstructed video sequence (maximum quality). Note that for most applications, bit extraction is a post-processing operation, and thus the original video signal is not available for quality evaluation.
In principle, a solution to equation (2) can be found using a non-linear optimization scheme. Nevertheless, in order to converge to a solution, many evaluations of the objective function D(φ) are necessary. Unfortunately, because of various spatio/temporal dependencies, each evaluation of D(φ) requires performing motion compensation operations on several images (due to the hierarchical prediction structure) in addition to the computation cost for finding the MSE. Furthermore, motion compensation operations are known to be highly computationally intensive. Therefore, the computational burden of this optimization is unmanageable in practice. In order to overcome this difficulty, a computationally efficient model that provides an accurate estimate of the source distortion for any selection map φ ∈ Φ is provided.
As discussed previously, fast evaluation of the average sequence distortion plays an essential role in solving the optimization problem of equation (2). An aspect of the present invention provides an approximation method for the computation of this distortion. The examples discussed will be based on a single-resolution SVC stream. Nonetheless, the calculations can be directly applied to the more general multi-resolution case by imposing the constraint that all quality NAL units associated with lower resolution spatial layers are included before the base quality of a higher resolution. This constraint reduces the degrees of freedom associated with the selection function by one. Hence, it can be denoted by φ(n) since the spatial resolution d in φ(n,d) is fixed. Note that regardless of the number of spatial layers in the SVC bitstream, a target resolution has to be specified in order to evaluate the quality of the reconstructed sequence. The quality increments from spatial layers lower than the target resolution need to be up-sampled to the target resolution to evaluate their impact on the signal quality. Once again it shall be mentioned that throughout the example embodiments discussed herein, the video quality is measured with respect to the fully reconstructed signal.
The base layer of a picture usually contains motion vectors and a coarsely quantized version of its residual signal required for the construction of (at least) a low quality representation of the picture. In addition, for this reconstruction, the decoder also requires the base layer of the pictures used in the prediction of the current picture. Hence, two different distortion models are proposed, based on the status of the base layers. In the first model, the base layer of the frame is available and decodable by the decoder. In the second model, the base layer is either not available or undecodable due to loss of a required base layer. In the second model, an error concealment strategy may be employed, which includes some special considerations.
Since for MGS coding of SVC, motion compensated prediction is conducted using the highest available quality of the reference pictures (except for key frames), propagation of drift has to be taken into account whenever a refinement packet is missing. Let f_n^d and f_n denote a vector representation of the reconstructed n-th frame using all of its quality increments in the presence and absence of drift, respectively. Note that although all quality increments of frame n are included for the reconstruction of both f_n and f_n^d, f_n^d ≠ f_n, since in general some quality increments of the parent frames may be missing in the reconstruction of f_n^d. Moreover, let e_n(q) represent the error vector introduced by the inclusion of q ≤ Q quality increments for the n-th frame when no drift exists. This error is referred to as the EL truncation error. Here, Q represents the total number of quality levels (in all layers); hence, e_n(Q) = 0.
The total distortion of frame n due to drift and EL truncation (i.e., D_n^t) with respect to f_n is obtained according to

D_n^t = ∥e_n^d + e_n(q)∥² = D_n^d + D_n^s(q) + 2⟨e_n^d, e_n(q)⟩, (3)

where e_n^d := f_n − f_n^d is the drift error vector, and D_n^d and D_n^s(q) represent respectively the distortion, i.e., the sum of squared errors (SSE), due to drift and EL truncation (associated with the inclusion of q quality increments). The symbol ∥•∥ here represents the l2-norm. Since the Cauchy–Schwarz inequality provides an upper bound for the cross term in equation (3), the total distortion D_n^t can be approximated as

D_n^t ≈ D_n^d + D_n^s(q) + 2k√(D_n^d · D_n^s(q)), (4)
where k is a constant in the range 0 ≤ k ≤ 1 obtained experimentally from test sequences. Consequently, in order to calculate the total distortion, the drift and EL truncation distortions, D_n^d and D_n^s(q), respectively, are needed. Fortunately, the error due to EL truncation, D_n^s(q), can be easily computed either at the encoder when performing the quantization of the transform coefficients or by a post-processing operation. The drift distortions, on the other hand, depend on the computationally intensive motion compensation operations and propagate from a picture to its descendants. The parent-child relationship of pictures within a GOP will now be discussed in reference to
As shown in
Let the set S = {s0, s1, . . . , sN} represent the N pictures in the GOP plus the key picture of the preceding GOP, denoted by s0, as portrayed in
Therefore, an approximation to D_n^d can be obtained by a second-order Taylor expansion, around zero, of the function F that maps the total distortions D_i^t of the parent frames i ∈ Λn onto the drift distortion of frame n:

D_n^d ≈ γ + Σ_{i∈Λn} α_i·D_i^t + Σ_{i,j∈Λn} β_ij·D_i^t·D_j^t. (5)
Here the coefficients α_i and β_ij are the first- and second-order partial derivatives of F at zero and are obtained by fitting a 2-dimensional quadratic surface to the data points acquired by decodings of the frames with various qualities. The constant term γ = 0, since there is no drift distortion when both reference frames are fully reconstructed, i.e., D_i^t = 0, i = 1, 2.
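The surface fit is an ordinary least-squares problem. A sketch for the two-parent case, fitting D_n^d ≈ α1·x1 + α2·x2 + β11·x1² + β12·x1·x2 + β22·x2² with γ fixed to zero (data collection as described below):

```python
import numpy as np

def fit_drift_model(parent_d, drift_d):
    """Least-squares fit of the second-order drift model of equation (5).

    parent_d: (m, 2) array; each row holds the two parents' total distortions.
    drift_d:  (m,) array of observed drift distortions D_n^d for frame n.
    Returns the coefficients (a1, a2, b11, b12, b22); gamma is fixed to 0.
    """
    x1, x2 = parent_d[:, 0], parent_d[:, 1]
    design = np.column_stack([x1, x2, x1 * x1, x1 * x2, x2 * x2])
    coef, *_ = np.linalg.lstsq(design, drift_d, rcond=None)
    return coef

def predict_drift(coef, d1, d2):
    """Evaluate the fitted model at parent distortions (d1, d2)."""
    a1, a2, b11, b12, b22 = coef
    return a1 * d1 + a2 * d2 + b11 * d1 * d1 + b12 * d1 * d2 + b22 * d2 * d2
```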
Note that, technically, F is not a function, since the mapping from parent-frame distortions to drift distortion is not unique: distortions may be due to various error distributions. Therefore, equation (5) can only be justified as an approximation, since the errors arising from missing high-frequency components are usually widespread throughout the image and follow similar distributions. The coefficients of equation (5), for all frames except key frames, can be obtained by several decodings of different substreams extracted from the global SVC bit stream. Nevertheless, different methods for choosing the data points may exist. For instance, a suitable set of data can be computed using the following steps: first, for each temporal layer T, a random set of the quality increments is discarded from frames in temporal layers T and lower, while keeping all quality increments of the higher layers (to eliminate EL truncation distortion); and second, the resulting bit stream is decoded and all data points are collected: the distortion of each frame n in a temporal layer higher than T, along with the distortions of its parent frames (which belong to layers T or lower), forms a data point for that frame.
Once the coefficients α_i and β_ij are computed for each frame (except for key frames), the drift distortion D_n^d of a child frame can be efficiently estimated for various distortions of the parent frames. The total distortion D_n^t is then computed according to equation (4). The computed distortion of this frame is then used (as a parent frame) to approximate the drift distortion of its children. Therefore, the distortion of the whole GOP can be estimated recursively, starting from the key frame, which is not subject to drift distortion. An example calculation of estimated distortion in accordance with an aspect of the present invention will now be discussed with reference to
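Combining equations (4) and (5), the per-GOP recursion can be sketched as follows, reusing predict_drift from above; the frame list and the coefficient and truncation-distortion lookups are hypothetical interfaces:

```python
import math

def total_distortion(d_drift, d_trunc, k):
    """Equation (4): total SSE from drift and EL-truncation SSEs (0 <= k <= 1)."""
    return d_drift + d_trunc + 2.0 * k * math.sqrt(d_drift * d_trunc)

def estimate_gop_distortion(frames, coef, d_trunc, k):
    """Recursively estimate each frame's total distortion within one GOP.

    frames:  (frame_id, parent_ids) pairs in decoding order; the key frame
             comes first with parent_ids == () and suffers no drift.
    coef:    frame_id -> coefficients from fit_drift_model
    d_trunc: frame_id -> EL truncation SSE under the chosen inclusion map
    """
    total = {}
    for fid, parents in frames:
        if not parents:
            drift = 0.0                          # key frame: no drift distortion
        else:
            d1, d2 = (total[p] for p in parents)
            drift = max(0.0, predict_drift(coef[fid], d1, d2))
        total[fid] = total_distortion(drift, d_trunc[fid], k)
    return total
```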
In graph 1800, function 1802 (dotted line) is the estimated distortion for each frame calculated according to equation (5), in accordance with an aspect of the present invention. Function 1804 (solid line) is the actual distortion for each frame. Note that function 1802 closely matches function 1804. Therefore, it can be assumed that the estimation of distortion as discussed in reference to equation (5) is fairly accurate.
In accordance with aspects of the present invention, in addition to the enhancement layer NAL units, base layer NAL units are allowed to be skipped when resources are limited. Moreover, base layer NAL units may be damaged or lost in the channel and therefore would become unavailable to the decoder. In this scenario, all descendants of the frame to which the NAL unit belongs are also discarded by the decoder. Consequently, the decoder utilizes a concealment technique as an attempt to hide the lost information from the viewer. Here, a simple and popular concealment strategy is employed: the lost picture is replaced by the nearest temporal neighboring picture. To be able to determine the impact of a frame loss on the overall quality of the video sequence, the distortion of the lost frame after concealment needs to be computed.
Let D_{n,i}^con denote the distortion of a frame n concealed using frame i with a total distortion of D_i^t. Since D_{n,i}^con does not vary greatly with respect to D_i^t, it can be assumed that a linear relationship exists between them, i.e.,
D_{n,i}^con ≈ μ_i + ν_i·D_i^t, (6)
where μ_i and ν_i are constant coefficients calculated for each frame with all concealment options (different i's). For example, in diagram 1700 of
During optimization, whenever a frame is skipped or missing, the pre-calculated coefficients μ_i and ν_i, associated with the nearest available temporal neighbor i (which has a distortion D_i^t), are used according to equation (6) to estimate the distortion of the missing frame.
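The linear model of equation (6) amounts to a one-dimensional least-squares fit per (frame, concealment option) pair; a sketch:

```python
import numpy as np

def fit_concealment_model(neighbor_d, concealed_d):
    """Fit (mu_i, nu_i) of equation (6) by simple linear regression.

    neighbor_d:  observed total distortions D_i^t of the concealing frame
    concealed_d: measured distortions of frame n after concealment by frame i
    """
    nu, mu = np.polyfit(neighbor_d, concealed_d, 1)  # slope, intercept
    return mu, nu

def concealment_distortion(mu, nu, d_neighbor):
    """Equation (6): estimated distortion of a lost frame concealed by i."""
    return mu + nu * d_neighbor
```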
As briefly mentioned earlier, there is the possibility that NAL units may become damaged or lost in transmission. Therefore, the distortion of the video sequence at the decoder consists of transmission errors in addition to the aforementioned distortion due to quantization degradation. Consequently, an efficient and robust video communication system requires a circumspect combination of source optimization techniques that are well-integrated with error control techniques. In accordance with an aspect of the present invention, methods to account for errors resulting from transmission over packet loss networks are implemented in the bit stream extraction algorithm. These will now be discussed below with reference to
Furthermore, channel coding of each source packet is carried out by an RS(N, ν_k) code, where N indicates the total number of transport packets. Therefore, the loss probability of each source packet is given by

p = Σ_{i=t+1}^{N} C(N, i)·ε^i·(1−ε)^(N−i), (7)

where ε denotes the transport packet loss probability, C(N, i) is the binomial coefficient, and t = N − ν_k is the maximum number of transport packet losses allowed in order to recover the source packet.
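Assuming independent transport-packet losses, equation (7) is a binomial tail and can be computed directly; a small sketch:

```python
from math import comb

def source_packet_loss_prob(eps, n, t):
    """Equation (7): probability that an RS-protected source packet is lost,
    i.e., that more than t of its n transport packets are dropped."""
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
               for i in range(t + 1, n + 1))

# e.g. eps = 0.1, n = 10, t = 2 -> about 0.07 residual loss probability
```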
As previously discussed in reference to equation (2), an inclusion function φ is obtained by minimizing the distortion of the extracted video substream for a given bit rate. However, in the presence of packet losses and errors during bit stream transmission, the task of channel coding should also be considered. Further, what should be minimized is the expected distortion of the extracted substream, since the actual distortion cannot be precisely determined due to errors and losses.
In accordance with an aspect of the present invention, the problem of joint source extraction and channel rate allocation may be formulated as follows. Let ψ(n,d,q) denote the channel rate allocation associated with NAL unit π(n,d,q). Then, the optimal inclusion and channel rate functions are obtained by

(φ*, ψ*) = arg min_{φ ∈ Φ, ψ ∈ Ψ} E{D(φ, ψ)}, subject to R(φ, ψ) ≤ R_T, (8)

where ψ* is a matrix with elements ψ(n,d,q) and Ψ is the set of all possible channel coding rates. Here, due to the nondeterministic nature of channel losses, an expected distortion measure, E{D(φ,ψ)}, is assumed for video quality evaluation. Note that the variables φ(n,d) and ψ(n,d,q) are dependent variables, since the channel coding rate of a packet not included for transmission is always equal to 1. In other words, for any possible n and d, ψ(n,d,q) = 1 for all q > φ(n,d). An example of inclusion and channel coding rate functions for a single resolution bit stream as determined by equation (8) will now be discussed with reference to
As shown in graph 2000, inclusion function φ(n) 2002 delineates which packets are to be included in the extracted substream. Note that for all excluded packets, ψ(n,d,q)=1, since those packets will not be transmitted.
As mentioned previously, for applications in which transmission over a packet loss network is required, an expected distortion measure should be considered to evaluate the video quality at the encoder. For these applications, a method is provided to estimate the overall expected distortion at the decoder for the given channel with available Channel State Information (CSI) and error concealment method. In an example embodiment, in accordance with an aspect of the present invention, the expected distortion of a GOP is calculated based on the inclusion function φ(n) of the GOP. As mentioned previously, for the general case, φ(n) specifies the number of packets to be sent per frame n. The calculations to follow only account for single resolution bit streams; however, this technique can be applied to the general case if it is assumed that the quality increments of lower spatial layers always appear before quality increments of the current spatial layer. With this assumption, a 1-dimensional inclusion function φ(n) can describe a substream of the scalable bit stream.
For the following calculations of expected distortion, a generic case is considered, where a packet loss probability of p_n^q is assigned to the q-th quality level packet of frame n. Let D̃_n denote the distortion of frame n as seen by the encoder, i.e., D̃_n represents a random variable whose sample space is defined by the set of all possible distortions of frame n at the decoder. Then, assuming a total number of Q quality levels exist per frame, the conditional expected frame distortion E{D̃_n|BL}, given that the base layer is received intact, may be obtained by

E{D̃_n|BL} = Σ_{q=1}^{φ(n)} [Π_{j=1}^{q−1} (1−p_n^j)]·p_n^q·D_n(q−1) + [Π_{q=1}^{φ(n)} (1−p_n^q)]·D_n(φ(n)), (9)

where D_n(q) is the total distortion of frame n reconstructed by inclusion of q > 0 quality increments (the superscript t of D is omitted for simplicity).
The first term in equation (9) accounts for the cases in which all (q−1) quality segments have been successfully received but the q-th segment is lost; the reconstructed image quality is therefore D_n(q−1). The second term accounts for the case where all quality increments in the current frame sent by the transmitter (given by φ(n)) are received. Recall that D_n(q) depends on the distortion of the parent frames according to equations (4) and (5) for the cases in which the base layer is available and decodable. Unlike in these source distortion calculations, the exact distortions of the parent frames are not known by the encoder; therefore, D_i^t in equation (5) has to be replaced with its expected value given the base layer, E{D̃_i|BL}, for all i ∈ {s_n1, s_n2}. Similarly to the previous calculations, the expected distortion of each temporal layer, given the base layers, can be recursively computed starting with the lowest layer. Note that the lowest temporal layer (key frame) does not contain any drift distortion and hence its expected distortion can be computed by itself.
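A sketch of equation (9) for one frame, with p[q−1] the loss probability of the q-th enhancement packet and d[q] the distortion when q increments are decoded (d[0] being base quality only):

```python
def expected_frame_distortion_given_bl(p, d, phi_n):
    """Equation (9): E{D_n | base layer received}.

    p:     loss probabilities of the enhancement packets, p[0] for q = 1, ...
    d:     d[q] = distortion with q quality increments decoded, q = 0..phi_n
    phi_n: number of enhancement packets actually sent for this frame
    """
    exp_d, prev_ok = 0.0, 1.0
    for q in range(1, phi_n + 1):
        exp_d += prev_ok * p[q - 1] * d[q - 1]  # packet q lost, 1..q-1 received
        prev_ok *= 1.0 - p[q - 1]
    return exp_d + prev_ok * d[phi_n]           # every sent packet received
```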
Due to the hierarchical coding structure of SVC, decoding the base layer of a frame not only requires the base layer of that frame but also the base layers of all the preceding frames in the hierarchy that were used in prediction of the current frame. For instance, decoding any of the frames in the GOP demands that the key picture of the preceding GOP, s0, be available at the decoder. For each frame sn ∈ S, a set Δn can be formed consisting of all reference pictures in S that the decoder requires in order to decode a base quality of the frame. Note that for all n, Λn ⊂ Δn. Further, in an attempt to better formulate the expected distortion for the general case, a relation ≤ is defined on the set Δn such that if x, y ∈ Δn and x ≤ y, then y depends on x (y is a descendant of x) due to motion-compensated prediction. It can be verified that the set Δn, together with the relation ≤, forms a well-ordered set, since all four properties, i.e., reflexivity, antisymmetry, transitivity, and comparability (trichotomy law), hold. Note that because all frames in the GOP depend on the key picture of the preceding GOP and no frame in Δn depends on frame sn, for all n ≤ N there is
s0 ≤ x, ∀x ∈ Δn,
sn ≰ x, ∀x ∈ Δn. (10)
In the case that the base layer of any frame x ∈ Δn is lost, the decoder is unable to decode frame n and therefore has to perform concealment from the closest available neighboring frame in display order. Consequently, the expected distortion of frame n is computed according to
where k represents the concealing frame, sk, specified as the nearest available temporal neighbor of i, i.e.,
Here, g(x) indicates the display order frame number as defined before. The first term of equation (11) deals with situations in which the base layer of a predecessor frame i is lost (with probability p_i^0) and thus frame n has to be concealed using a decodable temporal neighbor, while the second term indicates the case in which all base layers are received. A final remark should be made regarding the computation of D_{n,k}^con based on equation (6). The distortion of frame sk given its base layer, referred to as D_k, may be needed in order to compute D_{n,k}^con. However, this distortion is a random variable and its exact value is unknown to the encoder. As a result, as before, its expected value E{D̃_k} is employed instead. The conditional expected distortions E{D̃_n|BL} should therefore be computed for each temporal layer (starting from the coarsest layer) before proceeding to the calculations of the unconditional expected values.
A challenge in solving the problem considered herein is the efficient evaluation of the sequence average quality for a provided mapping function φ(n) (see equation (2)), as discussed previously. Once the sequence average quality for any mapping function φ(n) is known, in theory, a nonlinear optimization scheme can be applied in order to find the best packet extraction pattern. In practice, however, careful consideration of the optimization method may be needed due to the coarse-grained, discrete nature of φ(n) and its highly complex relation to the overall distortion. Thus, in accordance with an aspect of the present invention, an example greedy algorithm is presented to efficiently find a solution to this problem.
The optimization can be performed over an arbitrary number of GOPs, denoted by M. Trivially, increasing the optimization window may result in a greater performance gain at the price of higher computational complexity. In the example algorithm in accordance with an aspect of the present invention, the base layers of the key pictures are given the highest priority and therefore are the first packets to be included. Then, packets are added one at a time based on their global distortion gradient. In other words, initially, the mapping function φ(n) = 1 if sn is a key frame, and otherwise φ(n) = 0. Then, at each time step i, a packet π(ni*, φ(ni*)) is added and φ(ni*) is incremented by one, where ni* is obtained by
Here, R_s(φ) represents the source rate associated with the current mapping function φ. This process continues until the rate constraint R_T is met or all available packets within the optimization window (i.e., M GOPs) are added to the ordering queue.
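A sketch of this greedy loop, with the distortion estimator abstracted as a callable (the recursive model sketched earlier would serve); the frame and packet-size encodings are hypothetical:

```python
def greedy_extraction(frames, packet_bits, estimate_distortion, rate_budget):
    """Greedy source extraction: key-frame base layers first, then always the
    packet with the steepest distortion decrease per added bit.

    frames: (frame_id, is_key, max_q) triples; packet_bits[(fid, q)] -> bits.
    estimate_distortion(phi) -> sequence distortion for inclusion map phi.
    Returns phi: frame_id -> number of included quality increments.
    """
    phi = {fid: (1 if is_key else 0) for fid, is_key, _ in frames}
    max_q = {fid: mq for fid, _, mq in frames}
    rate = sum(packet_bits[(fid, 0)] for fid, n in phi.items() if n == 1)
    d_cur = estimate_distortion(phi)
    while True:
        best, best_grad = None, 0.0
        for fid, n in phi.items():
            if n >= max_q[fid] or rate + packet_bits[(fid, n)] > rate_budget:
                continue
            trial = dict(phi)
            trial[fid] = n + 1
            grad = (d_cur - estimate_distortion(trial)) / packet_bits[(fid, n)]
            if grad > best_grad:
                best, best_grad = fid, grad
        if best is None:
            break                        # budget reached or nothing helps
        rate += packet_bits[(best, phi[best])]
        phi[best] += 1
        d_cur = estimate_distortion(phi)
    return phi
```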
Now the problem of equation (8) will be addressed. Here, an algorithm similar to the algorithm described for source-optimized bit extraction discussed earlier will be presented. Note that according to equation (9), the expected distortion of the video sequence directly depends on the source mapping function φ(n). Its dependency on the packet channel coding rates, on the other hand, is implicit in that equation. The packet loss probabilities p_n^q used in the computation of the expected distortion depend on the channel conditions as well as the particular channel code and rate employed.
Similar to the source optimization algorithm, an optimization window of M GOPs is considered here. The source mapping function φ(n) initially only includes the base layer of the key pictures with an initial channel coding rate less than 1. Then, at each time step, a decision is made whether to add a new packet to the transmission queue or increase the Forward Error Correction (FEC) protection of an existing packet. Let π(n*, q*) denote an existing packet (i.e., q*<φ(n*)) such that an increase in its channel protection results in the highest expected distortion gradient, δED*, obtained by
where ED and R_t represent the expected distortion and total rate associated with the current φ and ψ. Likewise, among candidate packets for inclusion, let π(n★, q★) denote the one with the highest expected distortion gradient, δED★, i.e.,
where q = φ(n). In cases in which δED* < δED★, the channel protection rate of the already included packet π(n*, q*) is incremented to the next level by padding additional parity bits. Conversely, when δED* > δED★, the source packet π(n★, q★) is included in the transmission queue with a channel coding rate ψ(n★, φ(n★)) obtained from equation (15). Note that in both scenarios, the corresponding functions φ and ψ are updated according to the changes made to the transmission queue. This process is continued until the bit rate budget for the current optimization window, R_T, is reached.
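Structurally, this is the same greedy loop as before, but each step's candidate set contains two move types; a sketch of the selection step, with state evaluation abstracted behind callables (the sign conventions for the gradients live in the elided equations (14)-(15), so this rendering, which uses expected-distortion decrease per bit, is an assumption):

```python
def best_joint_move(moves, expected_distortion, total_rate, cur_ed, cur_rate):
    """Pick the move with the steepest expected-distortion decrease per bit.

    moves: candidate (phi, psi) states reachable in one step, i.e. either one
    newly included source packet or one packet's FEC strengthened a level.
    """
    best, best_grad = None, 0.0
    for phi, psi in moves:
        added_bits = total_rate(phi, psi) - cur_rate
        grad = (cur_ed - expected_distortion(phi, psi)) / added_bits
        if grad > best_grad:
            best, best_grad = (phi, psi), grad
    return best      # None when no remaining move reduces expected distortion
```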
The performance of the optimized bit extraction scheme for the H.264/AVC SVC extension in accordance with an aspect of the present invention will now be evaluated. A simulation was implemented with the reference software Joint Scalable Video Model (JSVM) 9.10. Three video sequences (Foreman, City, and Bus) at CIF display resolution are considered in the following experiments. The sequences are encoded into two layers, a base layer and a quality layer, with base quantization parameters QP=36 and QP=24, respectively. Furthermore, the quality layer is divided into 5 MGS layers.
The source extraction scheme in accordance with an aspect of the present invention is compared to two conventional extraction approaches: 1) the JSVM optimized extraction with quality layers, referred to as “JSVM QL”, and 2) the content-independent JSVM basic extraction referred to as “JSVM Basic”. This comparison will now be described in reference to
As demonstrated by graphs 2102, 2104 and 2106 the extraction scheme in accordance with aspects of the present invention outperforms both of the JSVM extraction schemes by a maximum of over 1 dB. The provided gain of the extraction scheme in accordance with aspects of the present invention is mainly due to the accurate estimation of the distortion for any substream, which allows the bit extractor to freely select NAL units with the most contribution to the video quality. The JSVM QL extraction, on the other hand, only orders NAL units within a quality plane and therefore, provides a limited gain. The JSVM basic extraction scheme performs the worst, as expected, since it only uses the high level syntax elements of the NAL units to order them and thus, is unaware of their impact on the quality of the sequence.
To evaluate the performance of the unequal error protection (UEP) scheme, a memoryless channel with a packet loss rate of 10% was considered. The three transmission schemes considered were: 1) joint extraction with UEP in accordance with aspects of the present invention, referred to as “Opt Extraction+UEP”; 2) a source extraction scheme in accordance with aspects of the present invention, with the best fixed channel coding rate obtained exhaustively from the set of channel coding rates for each transmission bit rate, referred to as “Opt Extraction+EEP”; and 3) JSVM basic extraction with the best fixed channel coding rate, referred to as “JSVM Basic+EEP”. In order to establish fair comparison criteria, it is assumed that the base layer of the key frames is coded using the lowest channel coding rate and is therefore always received intact for all three schemes. The variation in performance of these three extraction and error protection schemes will now be discussed in reference to
As demonstrated by graphs 2202, 2204, and 2206 the joint extraction with UEP of the present invention outperforms the other two schemes. Note that packets in equal error protection schemes may be lost with a constant probability. However, the UEP scheme distributes parity bits such that important packets have smaller loss probabilities and therefore some less important packets have higher loss probabilities. Note that while the UEP scheme in accordance with aspects of the present invention provides added performance, most of the performance gain comes from the optimized source extraction scheme of the present invention.
As discussed above, in accordance with aspects of the present invention, a system and method accurately and efficiently estimates the quality degradation (distortion) resulting from discarding an arbitrary number of NAL units from multiple layers of a bitstream. Then, this estimated distortion is used to assign Quality Layers to NAL units for a more efficient extraction.
The foregoing description of various preferred embodiments of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments, as described above, were chosen and described in order to best explain the principles of the invention and its practical application, and thereby to enable others skilled in the art to best utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto.
The present application claims priority from U.S. Provisional Application No. 61/103,355 filed Oct. 7, 2008, the entire disclosure of which is incorporated herein by reference.