The present invention relates to the field of encoder control. More specifically, the present invention relates to frame encoding decisions for a video sequence to achieve a minimum sequence cost within a sequence resource budget.
Video compression takes advantage of the redundancies present in video data, with the intention of reducing the size of the video data while maintaining the quality as much as possible. Such compression is referred to as lossy (as opposed to lossless), since the original data cannot be recovered exactly. Most modern codecs take advantage of the correlation among video frames and transmit the differences between a current frame and a predicted frame that is compactly represented by prediction parameters. The predicted frame is usually similar to the original one. The residual information, i.e., the prediction error, is transmitted together with the prediction parameters. The error signal is usually much smaller than the original signal and can be compactly represented using a spatial transform, such as the discrete cosine transform (DCT), with subsequent quantization of the transform coefficients. The quantized coefficients and the prediction parameters are entropy-encoded to further reduce their redundancy.
Inter-prediction exploits the temporal redundancy by using temporal prediction. Due to the high level of similarity among consecutive frames, temporal prediction can largely reduce the information required to represent the video data. The efficiency of temporal prediction can be further improved by taking into account the object motion in the video sequence. The motion parameters associated with temporal prediction are transmitted to the decoder for reconstructing the temporal prediction. This type of prediction is used in MPEG-like codecs. The process of producing inter-prediction is called motion estimation. Typically, it is performed in a block-wise manner, and many modern codecs support motion estimation with blocks of adaptive size.
On the other hand, intra-prediction takes advantage of spatial redundancy and predicts portions of a frame (blocks) from neighboring blocks within the same frame. Such prediction is usually aware of spatial structures that may occur in a frame, namely smooth regions and edges. In general, a larger block size is more coding efficient for smooth areas and a smaller block size is more coding efficient for areas with more texture variations. In the latter case, prediction based on neighboring blocks can improve coding efficiency, and a directional prediction can improve the efficiency even further. Such intra-prediction is used in the recent H.264 Advanced Video Coding (AVC) standard.
The actual reduction in the amount of transmitted information is performed in the transmission of the residual. The residual frame is divided into blocks, each of which undergoes a DCT-like transform. The transform coefficients undergo quantization, usually performed by scaling and rounding. Quantization allows the coefficients to be represented with less precision, thus reducing the amount of information required. Quantized transform coefficients are transmitted using entropy coding. This type of coding exploits the statistical characteristics of the underlying data.
Color video is often represented in RGB color coordinates. However, in most video coding systems, the encoder uses the YCbCr color space because it is a more compact representation. Y is the luminance component (luma) and Cb and Cr are the chrominance components (chroma) of the color video. The chroma is typically down-sampled to half the frame size in each direction because human eyes are less sensitive to the chroma signals; such a format is referred to as the 4:2:0 format.
The performance of a lossy video codec is measured as the tradeoff between the number of bits required to describe the data and the distortion introduced by the compression, referred to as the rate-distortion (RD) curve. As the distortion criterion, the mean squared error (MSE) is usually used. The MSE is often converted into logarithmic units and represented as the peak signal-to-noise ratio (PSNR),
$$p = 10\log_{10}\left(\frac{y_{max}^2}{d}\right) \qquad (1)$$
where d is the MSE and $y_{max}$ is the maximum allowed value of the luma pixels, typically 255 if the luma data has an 8-bit precision and is represented in the range 0, . . . , 255.
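By way of a non-limiting illustration, the conversion from MSE to PSNR described above may be sketched as follows. The function names `mean_squared_error` and `psnr_from_mse` are assumptions introduced for this example only, and luma frames are assumed to be given as flat sequences of sample values.

```python
import math


def mean_squared_error(frame_a, frame_b):
    """Mean squared error between two equally sized sequences of luma samples."""
    if len(frame_a) != len(frame_b) or len(frame_a) == 0:
        raise ValueError("frames must be non-empty and of equal size")
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)


def psnr_from_mse(mse, y_max=255.0):
    """Convert an MSE value d into PSNR in decibels: 10*log10(y_max^2 / d)."""
    if mse <= 0:
        return math.inf  # identical frames, no distortion
    return 10.0 * math.log10(y_max ** 2 / mse)
```

For 8-bit luma data, `y_max` is left at its default of 255; a lower MSE yields a higher PSNR.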
The H.264 AVC is one of the most recent standards in video compression, offering significantly better compression rate and quality compared to the previous MPEG-2 and MPEG-4 standards and targeted to high definition (HD) content. For example, H.264 delivers the same quality as MPEG-2 at a third to half the data rate.
The encoding process can be briefly described as follows: a frame undergoing encoding is divided into non-overlapping macroblocks, each containing 16×16 luma pixels and 8×8 chroma pixels. Within each frame, macroblocks are arranged into slices, where a slice is a continuous raster scan of macroblocks. Each slice can be encoded independently of the other. Main slice types are P and I. An I slice may contain only I macroblocks; a P slice may contain P or I macroblocks. The macroblock type determines the way it is predicted. P refers to inter-predicted macroblocks; such macroblocks are subdivided into smaller blocks. I refers to intra-predicted macroblocks; such macroblocks are divided into 4×4 blocks (the luma component is divided into 16 blocks; each chroma component is divided into 4 blocks). In I mode (intra-prediction), the prediction macroblock is formed from pixels in the neighbor blocks in the current slice that have been previously encoded, decoded and reconstructed, prior to applying the in-loop deblocking filter. The reconstructed macroblock is formed by imitating the decoder operation in the encoder loop. In P mode (inter-prediction), the prediction macroblock is formed by motion compensation from reference frames. The prediction macroblock is subtracted from the current macroblock. The error undergoes transform, quantization and entropy coding. According to the length of the entropy code, the best prediction mode is selected (i.e., the choice between an I or a P macroblock, the motion vectors in case of a P macroblock and the prediction mode in case of an I macroblock). The encoded residual for the macroblock in the selected best mode is sent to the bitstream.
A special type of frame referred to as instantaneous decoding refresh (IDR) is used as a synchronization mechanism, in which the reference buffers are reset as if the decoder started “freshly” from the beginning of the sequence. An IDR frame is always an I-frame. The use of IDR frames allows, for example, decoding of a bitstream to start at a point other than its beginning. The set of P and I frames between two IDRs is called a group of pictures (GOP). A GOP always starts with an IDR frame. The maximum GOP size is limited by the standard.
The operating point on the RD curve is controlled by the quantization parameter, determining the “aggressiveness” of the residual quantization and the resulting distortion. In the H.264 standard, the quantization parameter is an integer in the range 0, . . . , 51, denoted here by q′. The quantization step doubles for every increment of 6 in q′. Sometimes, it is more convenient to use the quantization step rather than the quantization parameter, computed according to
In the following, q and q′ are used interchangeably.
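The doubling behavior stated above (the quantization step doubles for every increment of 6 in q′) may be sketched as follows. This is a minimal illustration only: the scale factor `q0` is a hypothetical placeholder for the step at q′ = 0 and is not taken from the present text, and the function name is likewise an assumption.

```python
def quantization_step(qp, q0=1.0):
    """Map a quantization parameter q' to a step q that doubles every 6 increments.

    q0 is a hypothetical scale factor (the step at q' = 0), assumed for illustration.
    """
    if not 0 <= qp <= 51:
        raise ValueError("quantization parameter must lie in the range 0..51")
    return q0 * 2.0 ** (qp / 6.0)
```

Whatever constant is used, the ratio between the steps at q′ + 6 and q′ is 2 under this form.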
Theoretically, optimal resource allocation for a video sequence requires encoding the sequence with different sets of parameters and selecting one achieving the best result. However, such an approach is impossible due to a very large number of possible combinations of parameters, which leads to a prohibitive computational complexity. Suboptimal resource allocation approaches usually try to model some typical behavior of the encoder as function of the encoding parameters. If the model has an analytical expression which can be efficiently computed, the optimization problem can be practically solved using mathematical optimization. However, since the model is only an approximate behavior of the encoder, the parameters selected using it may be suboptimal. The main difference between existing encoders is the decision process carried out by the bitrate controller that produces the encoder control parameters. Usually, the encoder parameters are selected in a way to achieve the best tradeoff between video quality and bitrate of the produced stream. Controllers of this type are referred to as RDO.
Parameters controlled by the bitrate controller typically include the frame type and the reference frame or frames (if the frame is a P-frame) on the sequence level, and the macroblock type and quantization parameter for each macroblock on the frame level. For this reason, it is natural and common to distinguish between two levels of bitrate control: sequence-level and frame-level. The sequence-level controller is usually responsible for the frame type selection and the allocation of the bit budget for the frame, and the frame-level controller is responsible for the selection of the quantization parameter for each macroblock within the frame.
In a conventional coding system, the sequence level control is very limited. The frame type of a frame in a sequence is usually based on its order in a GOP according to a pre-determined pattern. For example, the IBBPBBP . . . pattern is often used in MPEG standard where B is a bi-directional predicted frame. The GOP may be formed by partitioning the sequence into GOP of fixed size. However, when a scene change is detected, it may trigger the start of a new GOP for better coding efficiency. The bit rate allocated to each frame usually is based on a rate control strategy. Such resource allocation for sequence encoding only exercises limited resource allocation and there is more room to improve. Based on the discussion presented here, there is a need for an optimal resource allocation for video sequence as well as for video frame. The current invention addresses the resource allocation for video sequence.
The present invention provides an optimal resource allocation for video sequence to achieve the best possible quality within the sequence resource budget.
In an exemplary embodiment of this invention, the encoder provides as the output an action sequence of encoder parameters by selecting the frame type, quantization scale, and encoding complexity of picture to produce an encoded version of the sequence to achieve a minimum sequence cost within the sequence resource budget. Due to the high computational complexity of calculating the resource budget and video quality by actually performing the encoding process, the resource budget and the video quality are all estimated based on respective models. The resource budget used in this invention includes bit rate and encoding time.
The optimal encoding process by selecting all possible combinations of frame type for individual frames of the sequence would require extremely high complexity if exhaustive search is used. The current invention uses a branch and bound algorithm to simplify the search procedure for an optimal decision on frame type of the sequence.
The benefits of the current invention can be easily appreciated by any person skilled in the art of video compression.
a shows a block diagram of the optimal resource allocation for video content.
b shows a general optimal resource allocation process.
a. Example of buffer model: raw frame buffer.
b. Example of buffer model: encoder bit buffer.
c. Example of buffer model: decoder bit buffer.
The invention is directed to a system and method of encoding a video sequence that generally includes providing as the input a video sequence containing pictures, a resource budget for the sequence, and a sequence cost function. The output is an action sequence of encoder parameters configured to produce an encoded version of the sequence to achieve a minimum sequence cost within the sequence resource budget.
In one embodiment, the resource allocation and control mechanism are generic and in principle can be used with any MPEG-type encoder or a similar mechanism. For illustration, in the examples of embodiments below, it is assumed that the underlying encoding pipeline implements the H.264 AVC encoder, with independent slice encoders working simultaneously. Such an implementation is currently typical of real-time hardware encoders.
A system-level block diagram of exemplary controllers configured according to the invention is shown in
The detailed process of the Optimal Resource Allocation 10 is further explained in
In one embodiment, the system consists of the sequence-level controller, which is responsible for frame type and reference selection and for allocation of the bit budget and initial quantization parameter for the currently encoded frame. The frame-level controller allocates the resources on the sub-frame level, in order to utilize the total frame bit budget and achieve the highest perceptual visual quality. For this purpose, a perceptual quality model and a PSNR shaping mechanism are employed. Both controllers are based on the encoder model, which predicts the behavior of the encoder given a set of encoding parameters. The model, in turn, utilizes visual clues, a set of descriptors of video data.
The main focuses of the current invention are the encoder models and the sequence-level controller. The most important features of the sequence-level controller include:
Content-Aware Encoder Model
The purpose of encoder model is to estimate the amount of resources used by the encoder without resorting to the expensive process of encoding itself. Specifically, the amount of bits produced by the encoder, the distortion as the result of encoding and the time complexity of the encoding process are predicted according to the encoder model. The encoder model is disclosed in U.S. patent application Ser. No. 12/040,788, filed Feb. 29, 2008.
In general, if x is the quantity to be predicted (amount of bits, distortion, or time) for the current frame, the predicted value according to the encoder model is computed by
$$\hat{x} = K \sum_{i=1}^{N_{MB}} \hat{x}_i(\theta_i; z_i) \qquad (3)$$
where $\hat{x}_i$ is a macroblock-wise predictor, $\theta_i$ are the encoding parameters and $z_i$ are the visual clues in macroblock i, $N_{MB}$ is the number of macroblocks in the frame and K is some normalization factor. Specifically, the following models are used.
Bit Production Model
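In each macroblock, the amount of bits produced is inversely proportional to the quantization step $q_i$ (that is, proportional to $q_i^{-1}$) and proportional to the macroblock complexity $z_i$. Depending on the frame type, either texture complexity (in I-frames) or motion complexity (in P-frames) is used. The macroblock-wise predictor is given by
$$\hat{b}_i(q_i, z_i) = \alpha_1 + \alpha_2 z_i q_i^{-1} \qquad (4)$$
where $\alpha_1, \alpha_2 \ge 0$ are the model parameters obtained by training. Assuming that q is constant for the entire frame, the frame-wise predictor of the amount of bits can therefore be expressed as
$$\hat{b}(q) = \sum_{i=1}^{N_{MB}} \hat{b}_i(q, z_i) = \alpha^b + \beta^b q^{-1} \qquad (5)$$
where the coefficients $\alpha^b = N_{MB}\alpha_1$ and $\beta^b = \alpha_2 \sum_{i=1}^{N_{MB}} z_i$ depend on the frame content and frame type.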
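As a non-limiting sketch, the aggregation of the macroblock-wise bit predictor into a frame-wise predictor under a constant quantization step may be illustrated as follows. The function names and the use of a plain list of complexities are assumptions made for this example; the trained values of the model parameters are not given here, so the numbers used in any call are purely illustrative.

```python
def frame_bit_coefficients(complexities, alpha1, alpha2):
    """Frame-level coefficients obtained by summing the macroblock model over the frame.

    complexities: per-macroblock complexity values z_i (texture or motion).
    Returns (alpha_b, beta_b) such that predicted bits = alpha_b + beta_b / q.
    """
    alpha_b = len(complexities) * alpha1
    beta_b = alpha2 * sum(complexities)
    return alpha_b, beta_b


def predict_frame_bits(complexities, q, alpha1, alpha2):
    """Frame-wise bit estimate for a constant quantization step q."""
    alpha_b, beta_b = frame_bit_coefficients(complexities, alpha1, alpha2)
    return alpha_b + beta_b / q


def predict_frame_bits_per_macroblock(complexities, q, alpha1, alpha2):
    """Same estimate computed macroblock by macroblock, as a cross-check."""
    return sum(alpha1 + alpha2 * z / q for z in complexities)
```

Because the two coefficients depend only on the frame content, they can be computed once per frame and reused while the quantization step is varied.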
Distortion Model
The luma PSNR is used as an objective distortion criterion. In each macroblock, the PSNR decreases linearly with the quantization parameter $q'_i$ and with the macroblock complexity $z_i$ (texture complexity is used for both P- and I-frames):
$$\hat{p}_i(q_i, z_i) = \beta_1 - \beta_2 z_i - \beta_3 q'_i \qquad (6)$$
where $\beta_1, \beta_2, \beta_3 \ge 0$ are the model parameters obtained by training. Assuming that $q'_i$ is constant for the entire frame, the predictor of the frame-wise average PSNR can therefore be expressed as
$$\hat{p}(q') = \frac{1}{N_{MB}} \sum_{i=1}^{N_{MB}} \hat{p}_i(q', z_i) = \alpha^p + \beta^p q' \qquad (7)$$
where the coefficients
$$\alpha^p = \beta_1 - \frac{\beta_2}{N_{MB}} \sum_{i=1}^{N_{MB}} z_i$$
and
$$\beta^p = -\beta_3$$
depend on the frame content and frame type.
Distortion Model for Dropped Frames
In some cases, the sequence-level controller may decide to drop a frame. In such a case, it is assumed that the decoder displays the previous frame for a longer duration, equal to the combined duration of the previous and the current (dropped) frames. The distortion due to the drop is computed as the average PSNR between a downsampled (16 times in each axis) version of the dropped frame and the previous frame. As an exemplary implementation, the average PSNR is computed by first calculating the average MSE and then converting it into logarithmic scale (rather than averaging the PSNR values directly). For example, dropping a frame at the beginning of a new shot would result in a low PSNR, while dropping a frame in a sequence of frames with slow motion will result in a high PSNR. It may happen that, though the average PSNR is high, in some small portions of the image the distortion due to the frame drop is very large (such a situation is typical for a frame with abrupt motion of a foreground object, while the background is static). In order to take such situations into consideration, together with the average PSNR, we measure the 0.1% lowest quantile of the PSNR values, denoted by $\hat{p}_{min}$.
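The two distortion measures for a dropped frame may be sketched as follows. This illustration assumes that the comparison between the downsampled dropped frame and the previous frame has already been reduced to a list of per-region MSE values; that input format, the nearest-rank quantile rule, and the function names are assumptions of this example rather than details given in the text.

```python
import math


def psnr_db(mse, y_max=255.0):
    """PSNR in decibels for a given MSE; infinite when there is no distortion."""
    return math.inf if mse <= 0 else 10.0 * math.log10(y_max ** 2 / mse)


def dropped_frame_distortion(region_mse, y_max=255.0, quantile=0.001):
    """Return (average PSNR, low-quantile PSNR) for a dropped frame.

    The average PSNR is obtained from the average MSE, not by averaging
    per-region PSNR values. The low quantile uses a nearest-rank rule,
    which is an assumption of this sketch.
    """
    if not region_mse:
        raise ValueError("at least one region is required")
    avg_psnr = psnr_db(sum(region_mse) / len(region_mse), y_max)
    region_psnr = sorted(psnr_db(m, y_max) for m in region_mse)
    rank = max(math.ceil(quantile * len(region_psnr)) - 1, 0)
    return avg_psnr, region_psnr[rank]
```

With a static background and one region of abrupt motion, the average PSNR stays comparatively high while the low-quantile value exposes the badly distorted region.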
Time Complexity Model
The underlying assumption of the time complexity model is that the encoding pipeline used to encode a macroblock consists of multiple stages, some of which are bypassed according to the encoded macroblock type using some decision tree. Consequently, it can be simplistically assumed that encoding a macroblock of a certain type takes a fixed amount of time. Let $c_T$ denote the maximum amount of time it takes to encode a macroblock (nominal encoding time), which depends on the frame type and on the complexity scale c′. Let $\tau'_i$ denote the fraction of $c_T$ it takes to encode macroblock i, so that the time used to encode the macroblock is $\tau_i = c_T \tau'_i$. According to the model, the normalized encoding time for macroblock i is given by
$$\hat{\tau}'_i(q_i, z_i) = \max\{\min\{\gamma_1 + \gamma_2 z_i - \gamma_3 q'_i, 1\}, 0\}\,T \qquad (8)$$
where $\gamma_1, \gamma_2, \gamma_3 \ge 0$ are the model parameters obtained by training, and $z_i$ is the texture complexity (for I-frames) or the motion complexity (for P-frames). The minimum and maximum are taken in order to ensure that the predicted normalized time value is within a valid range of [0,T] (for a reasonable range of $z_i$ and q′, this assumption will hold). Assuming that q′ is constant for the entire frame, the predictor of the encoding time for the entire frame is approximated (neglecting the possible nonlinearity due to the minimum and maximum functions) as
where the coefficients α1=NMBγ1cT−ΣI=1N
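A minimal sketch of the clamped macroblock-wise time predictor and its summation over a frame is given below. It treats the normalized time as a fraction in [0, 1] of the nominal time $c_T$, following the relation between the macroblock time and its normalized fraction; this reading, the function names, and the parameter values used in any call are assumptions of the example.

```python
def normalized_macroblock_time(z, qp, gamma1, gamma2, gamma3):
    """Clamped linear predictor of the fraction of nominal time spent on one macroblock."""
    return max(min(gamma1 + gamma2 * z - gamma3 * qp, 1.0), 0.0)


def predict_frame_time(complexities, qp, gamma1, gamma2, gamma3, nominal_time):
    """Frame encoding time for a constant quantization parameter qp.

    nominal_time plays the role of c_T, the maximum time to encode one macroblock.
    """
    return nominal_time * sum(
        normalized_macroblock_time(z, qp, gamma1, gamma2, gamma3)
        for z in complexities
    )
```

Unlike the linear approximation for the whole frame, this per-macroblock evaluation keeps the clamping, so the two agree only when no macroblock hits either bound.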
Buffer Models
The purpose of the buffer models is to provide a simple means of modeling the encoder input and output as well as the hypothetical decoder input behavior, imposing constraints on bit and time allocation. The encoder model comprises a raw frame buffer, constituting the input of the system, to which raw frames are written, and the encoder bit buffer, constituting the output of the system, to which the encoded bitstream is written. The hypothetical decoder model comprises a decoder bit buffer, connected to the encoder bit buffer by a channel responsible for the transport of the coded stream.
Raw Frame Buffer Model
The fullness of the raw frame buffer at time t is denoted by $l^{raw}(t)$. We assume that initially $l^{raw}(0) = 0$, and that raw frames are being filled at constant time intervals of 1/F, where F denotes the input sequence frame rate. Though $l^{raw}$ may assume integer values only, in most parts of our treatment we will relax this restriction, treating $l^{raw}$ as a continuous quantity. The encoder starts reading the first frame from the raw buffer as soon as the raw buffer level reaches $l^{raw} \ge l_{min}^{raw}$, where $l_{min}^{raw} \ge 1$ denotes the minimum amount of look-ahead required by the encoder in the current implementation. The first frame in the sequence starts being encoded at the time
$$t_0 = \frac{l_{min}^{raw}}{F}. \qquad (10)$$
The encoding of the first frame takes $\tau_1$ seconds, after which the frame is removed from the raw buffer, decreasing its level by 1. The moment this happens is the encoding end time, denoted by
$$t_1^{enc} = t_0 + \tau_1. \qquad (11)$$
If $l^{raw}(t_n^{enc} + \epsilon) < l_{min}^{raw}$ for infinitesimal $\epsilon > 0$, encoder raw buffer underflow occurs; in this case the encoder will stall until the minimum buffer level is reached. We denote this idle time by $t_n^{idle}$.
In this notation, the encoding of the next frame will start at time $t_1^{enc} + t_1^{idle} = t_0 + \tau_1 + t_1^{idle}$. The raw frame buffer is assumed to be capable of holding up to $l_{max}^{raw}$ raw frames; when the buffer level reaches this limit, an encoder raw buffer overflow occurs, and the input process stalls. Whenever the duration of such a stall is not negligible, input data may be lost.
The buffer level has to be checked for overflow immediately before the encoder finishes encoding a frame, and for underflow immediately before the encoder starts encoding a frame. Combining the contributions of frame production by the input subsystem and consumption by the encoder, the raw frame buffer levels are given by
$$l^{raw}(t_n^{enc} - \epsilon) = t_n^{enc} F - n + 1. \qquad (13)$$
The raw frame buffer model is shown in
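The behavior of the raw frame buffer described above may be sketched with the following simulation. It uses the continuous relaxation of the buffer level (frames arrive at rate F, so the level is the elapsed time multiplied by F, minus the number of frames already removed), derives the stall duration from the requirement that the level return to the minimum look-ahead, and ignores overflow. The function name and the returned tuple layout are assumptions of this example.

```python
def simulate_raw_buffer(encoding_times, frame_rate, l_min=1.0):
    """Trace the raw frame buffer for a list of per-frame encoding durations.

    Returns one tuple per frame: (encoding end time,
    buffer level just before the frame is removed, idle time after the frame).
    """
    t = l_min / frame_rate  # encoding starts once l_min frames are buffered
    trace = []
    for n, tau in enumerate(encoding_times, start=1):
        end = t + tau
        # n - 1 frames were removed before this one finishes
        level_before_end = end * frame_rate - n + 1
        # after removal the level is end * F - n; stall until it reaches l_min
        idle = max((n + l_min) / frame_rate - end, 0.0)
        trace.append((end, level_before_end, idle))
        t = end + idle
    return trace
```

A fast encoder produces non-zero idle times (underflow), while a slow encoder lets the level before removal grow toward the buffer capacity.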
Encoder Bit Buffer Model
The fullness of the encoder bit buffer at time t is denoted by $l^{enc}(t)$. We assume that initially $l^{enc}(0) = 0$ and that at the time $t_n^{enc}$, when the encoder completes encoding frame n, $b_n$ bits corresponding to the access unit of the coded frame are added to the buffer. After the first time the buffer level exceeds $l_{init}^{enc}$, the buffer starts being drained at a rate $r(t) \le r_{max}(t)$, determined by the transport subsystem. If at a given time the buffer is empty, the instantaneous draining rate drops to r(t) = 0. This situation is referred to as encoder bit buffer underflow. Except for the inefficiency of not utilizing the full channel capacity, encoder buffer underflow poses no danger to the system. The complementary situation of encoder bit buffer overflow occurs when $l^{enc}(t_n^{enc}) + b_n \ge l_{tot}^{enc}$. As a consequence, the encoder will stall until the bit buffer contains enough space to accommodate the encoded bits. If the stall lasts for a non-negligible amount of time, unpredictable results (including input raw frame buffer overflow) may occur.
The buffer level has to be checked for underflow immediately before the encoder finishes encoding a frame, and for overflow immediately after the encoding is finished. Combining the contributions of bit production by the encoder and consumption by the transport subsystem, the encoder bit buffer levels are given by
$$l^{enc}(t_n^{enc} - \epsilon) = l^{enc}(t_{n-1}^{enc} + \epsilon) - \int_{t_{n-1}^{enc}}^{t_n^{enc}} r(t)\,dt \qquad (14)$$
$$l^{enc}(t_n^{enc} + \epsilon) = l^{enc}(t_n^{enc} - \epsilon) + b_n. \qquad (15)$$
The encoder bit buffer model is shown in
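We denote by
$$r_n = \frac{1}{t_n^{enc} - t_{n-1}^{enc}} \int_{t_{n-1}^{enc}}^{t_n^{enc}} r(t)\,dt \qquad (16)$$
the average rate while frame n is being encoded. Assuming that $r(t) = r_{max}(t)$ if $l^{enc}(t) > 0$, and r(t) = 0 otherwise, we obtain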
In these terms, we may write
$$l^{enc}(t_n^{enc} - \epsilon) = l^{enc}(t_{n-1}^{enc} + \epsilon) - (t_n^{enc} - t_{n-1}^{enc})\, r_n \qquad (18)$$
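The evolution of the encoder bit buffer level between consecutive encoding end times may be sketched as follows. The illustration makes several simplifying assumptions that are not part of the text: the channel rate is constant, draining starts immediately (as if the initial threshold were zero), and the encoder stall caused by an overflow is only flagged, not simulated. The function name and tuple layout are likewise hypothetical.

```python
def encoder_bit_buffer_trace(frame_bits, end_times, channel_rate, capacity):
    """Trace the encoder bit buffer level around each encoding end time.

    Returns one tuple per frame: (level just before the frame's bits are added,
    level just after, overflow flag).
    """
    level = 0.0
    previous_time = 0.0
    trace = []
    for bits, end in zip(frame_bits, end_times):
        # the buffer cannot be drained below empty (underflow lowers the average rate)
        drained = min(level, channel_rate * (end - previous_time))
        before = level - drained
        overflow = before + bits >= capacity
        after = before + bits
        trace.append((before, after, overflow))
        level, previous_time = after, end
    return trace
```

The quantity `drained / (end - previous_time)` corresponds to the average rate over the encoding interval of the frame.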
Decoder Bit Buffer Model
Bits drained from the encoder bit buffer at time t at rate r(t) are transported by the channel and are appended to the decoder bit buffer at time $t + \delta_{chan}$ at the same rate. The channel delay $\delta_{chan}$ need not be constant; in this case, the maximum channel delay may be assumed. The fullness of the decoder bit buffer at time t is denoted by $l^{dec}(t)$. Initially, $l^{dec}(0) = 0$; the buffer level remains zero until $t = t_1^{enc} + \delta_{chan}$, when the first bits of the first frame start arriving. The decoder remains idle until $l^{dec} \ge l_{init}^{dec}$; once the bit buffer is sufficiently full, the decoder removes $b_1$ bits from the buffer and starts decoding the first frame of the sequence. The time lapsing between the arrival of the first bit and the removal of the first access unit from the bit buffer is given by the smallest $\delta_{dec}$ satisfying
$$\int_{t_1^{enc} + \delta_{chan}}^{t_1^{enc} + \delta_{chan} + \delta_{dec}} r(t - \delta_{chan})\,dt \ge l_{init}^{dec}. \qquad (19)$$
We denote by
$$t_1^{dec} = t_1^{enc} + \delta_{chan} + \delta_{dec} \qquad (20)$$
the time when the first frame starts being decoded. The delay passing between production and consumption of the access unit corresponding to a frame is denoted by
$$\delta = \delta_{chan} + \delta_{dec}. \qquad (21)$$
The decoder removes access units corresponding to the encoded frames at a constant rate F, resulting in the following decoding schedule
$$t_n^{dec} = t_1^{dec} + \frac{n - 1}{F}. \qquad (22)$$
The decoder bit buffer level assumes highest and lowest values immediately before frame decoding start, and immediately after the decoding is started, respectively. Combining the contributions of bit production by the transport subsystem and consumption by the decoder, the decoder bit buffer levels are given by
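$$l^{dec}(t_n^{dec} - \epsilon) = l^{dec}(t_{n-1}^{dec} + \epsilon) + \int_{t_{n-1}^{dec}}^{t_n^{dec}} r(t - \delta_{chan})\,dt \qquad (23)$$
$$l^{dec}(t_n^{dec} + \epsilon) = l^{dec}(t_n^{dec} - \epsilon) - b_n. \qquad (24)$$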
The decoder bit buffer model is shown in
Sequence Resource Allocation
Assuming that the encoder is encoding the i-th frame in the current GOP, and is capable of observing n−1 additional future frames i+1, . . . , i+n−1 queued in the raw input buffer, the problem of sequence resource allocation consists of establishing a resource budget for the latter sequence of n frames. Here, the allocation of bit and encoding time budgets, denoted by $b_T$ and $\tau_T$, respectively, is considered. The display time stamp of the i-th frame, relative to the beginning of the current GOP, is denoted by $t_i$. Assuming that an estimate $\hat{\delta}$ of the decoder delay is available, the maximum GOP duration is denoted by $t_{max}$ (if the duration is smaller than the estimated delay, $t_{max} = \hat{\delta}$ is used). The minimum GOP duration is set to the minimum temporal distance between two consecutive IDR frames,
$$t_{min} = t_{min}^{idr}. \qquad (26)$$
The maximum amount of time remaining till the GOP end is denoted by
$$t_{rem} = t_{max} - t_i, \qquad (27)$$
from where the number of remaining frames,
$$n_{rem} = t_{rem} F \qquad (28)$$
is obtained.
The sequence-level controller flow diagram is shown in
Effective Bit Rate Estimation
The estimated rate $\hat{r}^{dec}$ at the decoder at the time when the i-th frame will be decoded is approximated by
$$\hat{r}^{dec} \approx r(t^{enc} + \delta), \qquad (29)$$
where r is the current rate. Assuming the simplistic model of the rate remaining constant at r in the interval $[t^{enc}, t^{enc} + \delta]$, and then remaining constant at $\hat{r}^{dec}$, the average bit rate of the encoder buffer drain in the interval $[t^{enc}, t^{enc} + t_{max}]$ is given by
$$r^{enc} = \frac{\delta\, r + (t_{max} - \delta)\, \hat{r}^{dec}}{t_{max}}. \qquad (30)$$
Under the assumption of nearly constant channel rate, the encoder and decoder bit buffers remain synchronized (up to an initial transient), connected by the relation
$$l^{enc}(t) + l^{dec}(t + \delta) = l_{tot}^{enc}. \qquad (31)$$
However, bit rate fluctuations result in a loss of such synchronization. For example, in the case of a future decrease in the channel capacity ($\hat{r}^{dec} < r$), encoding the next frames at the rate r will cause the decoder buffer level to drop and, eventually, to underflow, while keeping the encoder buffer level balanced. In the converse case of a future increase in the channel capacity ($\hat{r}^{dec} > r$), keeping the video stream rate at r will result in a decoder bit buffer overflow.
A solution proposed here consists of modifying the rate at which the stream is encoded such that both buffers remain as balanced as possible. We refer to such a modified rate as the effective bit rate,
$$r^{eff} = \max\{\min\{0.5(r^{enc} + \hat{r}^{dec}) + \Delta r,\ \max\{r^{enc}, \hat{r}^{dec}\}\},\ \min\{r^{enc}, \hat{r}^{dec}\}\}, \qquad (32)$$
where
Using $r^{eff}$ instead of r in the first example will cause both the encoder and decoder buffer levels to drop to a lesser extent.
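The clamping structure of the effective bit rate may be sketched as follows. Because the expression for the correction term is not reproduced here, the sketch takes it as an input argument `delta_r`; that argument, its default of zero, and the function name are assumptions of this example.

```python
def effective_bit_rate(r_enc, r_dec_hat, delta_r=0.0):
    """Midpoint of the two rates plus a correction, clamped between the two rates.

    delta_r stands in for the correction term, whose formula is not given here.
    """
    lower = min(r_enc, r_dec_hat)
    upper = max(r_enc, r_dec_hat)
    return max(min(0.5 * (r_enc + r_dec_hat) + delta_r, upper), lower)
```

Whatever the correction, the result never leaves the interval spanned by the encoder drain rate and the estimated decoder rate.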
Bit Budget Allocation
Given an estimated effective encoding bit rate $r^{eff}$, the bit budget remaining until the end of the GOP is given by
$$b_{rem} = r^{eff}\, t_{rem}. \qquad (34)$$
According to the bit production model (5), the estimated amount of bits generated by the encoder for the i-th frame given a quantizer $q_i$ is given by
$$\hat{b}_i = \alpha_i^b + \beta_i^b q_i^{-1},$$
where the coefficients $\alpha^b = (\alpha_1^b, \ldots, \alpha_n^b)^T$ and $\beta^b = (\beta_1^b, \ldots, \beta_n^b)^T$ depend on the frame content and frame type $f_i$. The latter is assumed to be assigned by a higher-level frame type decision algorithm. We let
$$\bar{\alpha}^b = \sum_{k=1}^{n} \alpha_k^b, \qquad \bar{\beta}^b = \sum_{k=1}^{n} \beta_k^b$$
denote the sums of the model coefficients of the observed n frames. Substituting $\bar{\alpha}^b$ and $\bar{\beta}^b$ into the bit production model yields the total estimate of bits produced by encoding the observed frames i, . . . , i+n−1 with a constant quantizer. Similarly, we denote by
αremb=(nrem−n)
βremb=(nrem−n)
the model coefficients for estimating the amount of bits of the remaining frames in the current GOP;
The goal of sequence bit budget allocation is to maintain a fair distribution of the encoding bits throughout the sequence, considering the long-term effect of the decided allocation, as well as reacting to changes in channel bit rate and in frame texture and motion complexity. Under the simplifying assumption that at the sequence level the visual quality is a function of the quantizer only, not depending on the frame content, the problem of sequence bit budget allocation can be translated to finding a single quantization scale q which, when assigned to the remainder of the GOP frames, produces $b_{rem}$ bits. Substituting the appropriate bit production models,
$$\bar{\alpha}^b + \alpha_{rem}^b + (\bar{\beta}^b + \beta_{rem}^b)\, q^{-1} = b_{rem},$$
where $\bar{\alpha}^b$ and $\bar{\beta}^b$ denote the sums of the model coefficients over the n observed frames, and solving for 1/q yields
$$\frac{1}{q} = \frac{b_{rem} - \bar{\alpha}^b - \alpha_{rem}^b}{\bar{\beta}^b + \beta_{rem}^b}.$$
Substituting the latter result into the bit production model yields the following bit budget for the sequence of the observed n frames.
is a scaling factor decreasing the bit budget for higher, and increasing it for lower, decoder bit buffer levels; $\hat{l}^{dec} = \hat{l}(t_i^{dec} - \epsilon)$ is the estimated decoder bit buffer fullness immediately prior to decoding the i-th frame, $\epsilon_1 = 3$, and h is the target relative decoder bit buffer level, defined as
To guarantee some level of bit allocation fairness, the budget is constrained to be within 25% to 200% of the average bit budget, denoted here by $\bar{b}_T$,
resulting in
$$b'_T = \min\{\max\{b''_T,\ 0.25\,\bar{b}_T\},\ 2\,\bar{b}_T\}.$$
Encoder and decoder bit buffer constraints are further imposed, yielding the following final expression for the sequence bit budget
where $l^{enc}$ is the encoder bit buffer level immediately prior to encoding the i-th frame.
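Two steps of the bit budget allocation described in this section may be sketched as follows: solving the constant-quantizer condition for 1/q to obtain a budget for the observed frames, and restricting a budget to between 25% and 200% of the average budget. The sketch deliberately omits the buffer-dependent scaling factor and the final buffer constraints, and all function and argument names (for example `alpha_obs` for the summed coefficients of the observed frames) are assumptions of this example.

```python
def constant_quantizer_budget(b_rem, alpha_obs, beta_obs, alpha_rem, beta_rem):
    """Bits assigned to the observed frames when one quantizer spends b_rem over the GOP.

    Solves alpha_obs + alpha_rem + (beta_obs + beta_rem) / q = b_rem for 1/q
    and substitutes the result into the model of the observed frames.
    """
    inverse_q = (b_rem - alpha_obs - alpha_rem) / (beta_obs + beta_rem)
    return alpha_obs + beta_obs * inverse_q


def clamp_bit_budget(budget, average_budget):
    """Restrict a budget to the range [0.25, 2.0] times the average budget."""
    return min(max(budget, 0.25 * average_budget), 2.0 * average_budget)
```

A typical use would compute the raw budget first and then pass it through the clamp before the buffer constraints are applied.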
Encoding Time Budget Allocation
Similar to the bit budget allocation, the time budget allocation problem consists of assigning the target encoding time $\tau_T$ for the n observed frames. However, due to the stricter constraints imposed by the encoder raw frame buffer, encoding time allocation operates over shorter terms. Since frames are added to the encoder input buffer at the rate of F frames per second, the average encoding time of a frame is 1/F, resulting in the average budget of n/F
for the sequence of n observed frames. Using this budget and assuming the encoder has encoded $n_{enc}$ frames so far (including the dropped ones), the time at the encoder should be $n_{enc}/F$. However, since the actual encoding time may differ from the allocated budget, the time at the encoder immediately prior to encoding the i-th frame (relative to the time $t_0$ when the first frame in the sequence starts being encoded) usually differs from the ideal value. We denote by $t_{dif}$
the time difference between the ideal and the actual encoding time; if $t_{dif} > 0$, the encoder is faster than the input raw frame rate and the encoding time budget has to be increased in order to avoid raw buffer underflow. Similarly, if $t_{dif} < 0$, the encoder is lagging behind the input and the time budget has to be decreased in order to prevent an overflow. Demanding that the encoder close the time gap in $n_{resp}$ frames ($n_{resp}/F$ seconds) yields the following encoding time budget
Typically, nresp≈5, depending on lmaxraw. To guarantee some level of fairness, the budget is constrained by
τTn=max{min{τ′T,1.5
Encoder bit buffer constraints are further imposed, yielding the final encoding time budget
and $\hat{b}_i$ are the expected bits produced by encoding the i-th frame. Due to the dependence on $\hat{b}_i$, the time allocation must be preceded by bit allocation.
Frame Resource Allocation
The problem of frame resource allocation consists of distributing a given budget of resources between n frames in a sequence. More formally, we say that the vector $x = (x_1, \ldots, x_n)^T$ is an allocation of a resource x, where each $x_i$ quantifies the amount of that resource allocated to the frame i. For example, x can represent the amount of coded bits or encoding time. Ideally, we would like to maximize the $x_i$ for each frame; however, the allocation has to satisfy some constraints, one of which is
$$\sum_{i=1}^{n} x_i = x_T,$$
where $x_T$ is the resource budget assigned to the sequence by a higher-level sequence resource allocation algorithm. Other constraints stemming, for example, from the encoder buffer conditions apply as well. Formally, we say that an allocation vector x is feasible if it satisfies that set of conditions, and infeasible otherwise. Resource allocation can therefore be thought of as finding a feasible solution to the maximization problem.
Since usually the encoder is controlled using a set of parameters different from the allocated resource itself (e.g., though coded bits are allocated as a resource, the amount of produced bits is controlled through the quantization parameter), it is more convenient to express x as a function of some vector of parameters θ=(θ1, . . . , θm) . This yields the following optimization problem
Note, however, that since the maximized objective is a vector rather than a scalar quantity, there exists no unique way to define what is meant by “maximum x”. Here, we adopt the following notion of vector objective optimality.
Definition 1 (Max-min optimality) A vector of resources x=(x1, . . . , xn)T is said to be max-min optimal if it is feasible, and for any 1≦i≦n and a feasible y=(y1, . . . , yn)T for which xi<yi, there is some j with xi≧xj and xj>yj.
Informally, this means that it is impossible to increase the resource allocated to a frame i without decreasing the resources allocated to frames that already have a resource allocation no larger than xi, and without violating the feasibility constraints. For this reason, a max-min optimal resource allocation can be thought of as fair. As to notation, given the vector of resources x as a function of some parameters θ and feasibility constraints θ ∈ Ω, we will henceforth interpret the vector objective maximization problem
as finding a max-min optimal resource allocation x*=x(θ*). In the remainder of this section, we are going to explore and formulate allocation problems for two types of resources considered here, namely coded bits and encoding time.
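To make Definition 1 concrete, the following minimal sketch checks the max-min condition by brute force over a finite list of candidate allocations. The function name, the finite candidate list, and the toy numbers are illustrative assumptions; in the actual problem the feasible set is continuous.

```python
from typing import Sequence


def is_max_min_optimal(x: Sequence[float],
                       feasible: Sequence[Sequence[float]]) -> bool:
    """Check Definition 1 for x against a finite list of feasible allocations."""
    if tuple(x) not in {tuple(y) for y in feasible}:
        return False  # x itself must be feasible
    n = len(x)
    for y in feasible:
        for i in range(n):
            if x[i] < y[i]:
                # y gives frame i more than x does; this is acceptable only if
                # some frame j that is no better off than i under x loses in y.
                if not any(x[i] >= x[j] and x[j] > y[j] for j in range(n)):
                    return False
    return True


# Toy feasible set (hypothetical numbers): 6 units over 3 frames,
# with frame 1 capped at 1 unit.
feasible = [(1, 2, 3), (1, 2.5, 2.5), (1, 3, 2), (0.5, 2.75, 2.75)]
print(is_max_min_optimal((1, 2.5, 2.5), feasible))
print(is_max_min_optimal((1, 2, 3), feasible))
```

In this toy set the even split (1, 2.5, 2.5) passes the check, while (1, 2, 3) fails because the second frame can be given more at the expense of a frame that already holds more.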
Bit Allocation
The first problem we are going to address is the distribution of the budget of bT bits between the frames, which can be denoted as a bit allocation vector b=(b1, . . . , bn)T. However, since the encoder is not controlled directly by b, but rather by the quantizer value, we reformulate the problem as allocating a vector q=(q1, . . . , qn)T of quantizer scales (or, equivalently, a vector q′ of quantization parameters). Though q and q′ may assume only discrete values, we relax this restriction by making them continuous variables. In these terms, the quantizer allocation problem can be formulated as max-min allocation of coding quality (in our case, average frame PSNR values p=(p1, . . . , pn)T) as a function of q and subject to feasibility constraints,
Note that the feasibility constraints are imposed on the amount of bits produced by the encoder with the control sequence q. Since in practice the exact amount of bits produced for a given quantizer is unknown a priori, b(q) has to be substituted with the estimated coded bits b̂(q). Such an estimate depends on the frame types, which are assumed to be decided by a higher-level frame type decision algorithm, detailed in the sequel. One of the feasibility constraints is, clearly, 1Tb̂(q)=bT. However, this condition is insufficient, as it may happen that the produced bits violate the buffer constraints for a frame i<n. In order to formulate the buffer constraints on the allocated bits, let us denote by tdec=(t1dec, . . . , tndec)T the decoding timestamps of the frames, assuming without loss of generality that t1dec=0 and that the frames are numbered in increasing order of tidec. We denote the estimated decoder bit buffer fullness immediately before and immediately after frame i is decoded by l̂i−dec=l̂dec(tidec−ε) and l̂i+dec=l̂dec(tidec+ε), respectively. We assume that the initial decoder buffer fullness l̂1−dec=l0dec is given as the input. In vector notation, we can write
K=J−I, and r̂dec is the estimated average decoder bit buffer filling rate on the time interval
The constrained optimal allocation problem becomes
where lmindec and lmaxdec are the minimum and the maximum decoder bit buffer levels, respectively. Since it is reasonable to assume a monotonically decreasing dependence of the PSNR on the quantization parameter (or the quantizer scale), the maximizer of (60) coincides with the minimizer of
A numerical solution of the max-min allocation problem (61) can be carried out using a variant of the bottleneck link algorithm, summarized in Algorithm 1. In the algorithm, we assume that the amount of coded bits produced for a frame i as a function of qi is given by the model
where the coefficients αb=(α1b, . . . , αnb)T and βb=(β1b, . . . , βnb)T depend on the frame content and frame types. In vector notation, this yields
where vector division is interpreted as an element-wise operation. The algorithm can be easily adapted to other models as well by replacing the closed-form solution in Steps 1, 8, and 9 with the more general equation
with respect to the scalar q.
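As an illustration of solving for a single scalar q, the sketch below finds the common quantizer scale at which the estimated bits of all frames sum to the budget. It assumes a per-frame model of the form b̂i(qi)=αib+βib/qi, which is our reading of the element-wise division mentioned above and not a quotation of the model equation; the function names and numbers are hypothetical. Buffer constraints are ignored, so this is only the unconstrained starting point of a bottleneck-style scheme, not Algorithm 1 itself.

```python
def common_quantizer(alpha, beta, bit_budget):
    """Closed form: scalar q with sum_i(alpha_i + beta_i / q) == bit_budget."""
    slack = bit_budget - sum(alpha)
    if slack <= 0:
        raise ValueError("budget does not cover the q-independent bits")
    return sum(beta) / slack


def common_quantizer_bisect(bits_model, n, bit_budget,
                            lo=1e-3, hi=1e3, iters=100):
    """Same equation for any model that decreases in q, solved by bisection.

    bits_model(i, q) is an assumed callable returning the estimated bits of
    frame i at quantizer scale q.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(bits_model(i, mid) for i in range(n))
        if total > bit_budget:
            lo = mid  # too many bits: a coarser quantizer is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)


alpha = [100.0, 50.0]     # hypothetical model coefficients
beta = [4000.0, 2000.0]
q = common_quantizer(alpha, beta, bit_budget=750.0)
q_num = common_quantizer_bisect(lambda i, s: alpha[i] + beta[i] / s, 2, 750.0)
```

The bisection variant corresponds to replacing the closed-form step with a numerical solve when the bit model has no convenient inverse.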
Encoding Time Allocation
The second problem we are going to address is the allocation of the encoding time budget τT between the frames, which can be denoted as an encoding time allocation vector τ=(τ1, . . . , τn)T. As in the case of bit allocation, since the encoder is not controlled directly by τ, but rather by the complexity scale, we reformulate the problem as allocating a vector c=(c1, . . . , cn)T of complexity scales. Again, we think of c as a continuous variable, though in practice it may be restricted to a set of discrete values. We denote by t̂enc=(t̂1enc, . . . , t̂nenc)T the vector whose i-th element is the time at which frame i's encoding is complete, assuming without loss of generality that the first frame starts being encoded at time 0. Furthermore, we assume that the encoding end time for a frame i coincides with the encoding start time for the frame i+1. In this notation,
$$\hat{t}^{\mathrm{enc}} = J\,\hat{\tau}(c) \tag{65}$$
where τ̂=(τ̂1, . . . , τ̂n)T denotes the estimated encoding times.
Assuming the quantizer allocation has fixed the estimated amount of bits b̂ produced by encoding each frame, our goal is to maximize the coding quality p(c) subject to the encoder buffer constraints. The encoder buffer constraints include both the encoder bit buffer constraints, applying to the output bit buffer, and the encoder raw buffer constraints, applying to the input raw frame buffer. Similar to the bit allocation problem, we denote the estimated encoder bit buffer fullness immediately before and immediately after frame i's encoding is complete by l̂i−enc=l̂enc(t̂ienc−ε) and l̂i+enc=l̂enc(t̂ienc+ε), respectively. The initial buffer level l0enc=lenc(0) is assumed to be known from a direct observation. Using this notation, the buffer levels are given by
$$\hat{l}_{-}^{\mathrm{enc}}(c) = l_0^{\mathrm{enc}} + K\hat{b} - rJ\hat{\tau}(c) \tag{66}$$
$$\hat{l}_{+}^{\mathrm{enc}}(c) = l_0^{\mathrm{enc}} + J\hat{b} - rJ\hat{\tau}(c) \tag{67}$$
In the same manner, l̂i−raw and l̂i+raw denote the estimated encoder raw frame buffer fullness immediately before and immediately after frame i's encoding is complete. The initial buffer level at time 0 is denoted by l0raw and is available from a direct observation. Since the filling rate of the input buffer is F, we have
$$\hat{l}^{\mathrm{raw}}(\hat{t}_i^{\mathrm{enc}} - \epsilon) = l_0^{\mathrm{raw}} + \hat{t}_i^{\mathrm{enc}}F - (i-1) \tag{68}$$
and l̂raw(t̂ienc+ε)=l̂raw(t̂ienc−ε)−1. In vector form, this yields
$$\hat{l}_{-}^{\mathrm{raw}}(c) = l_0^{\mathrm{raw}} + FJ\hat{\tau}(c) - K\mathbf{1} \tag{69}$$
$$\hat{l}_{+}^{\mathrm{raw}}(c) = l_0^{\mathrm{raw}} + FJ\hat{\tau}(c) - J\mathbf{1} \tag{70}$$
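The vector expressions (65) to (70) can be evaluated with running sums. The sketch below assumes that J acts as a cumulative sum (a lower-triangular matrix of ones), so that K=J−I sums over all preceding frames only; this reading, the function name, and the sample numbers are assumptions made for illustration.

```python
from itertools import accumulate


def encoder_buffer_levels(bits, times, l0_enc, l0_raw, r, F):
    """Estimated encoder bit and raw buffer levels around each frame's completion.

    bits  : estimated bits per frame (b-hat)
    times : estimated encoding time per frame (tau-hat), in seconds
    r     : encoder bit buffer draining rate; F : input frame rate
    """
    n = len(bits)
    t_enc = list(accumulate(times))     # J tau, eq. (65)
    cum_bits = list(accumulate(bits))   # J b
    prev_bits = [0.0] + cum_bits[:-1]   # K b, with K = J - I
    return {
        "t_enc": t_enc,
        "enc_before": [l0_enc + prev_bits[i] - r * t_enc[i] for i in range(n)],  # (66)
        "enc_after": [l0_enc + cum_bits[i] - r * t_enc[i] for i in range(n)],    # (67)
        "raw_before": [l0_raw + F * t_enc[i] - i for i in range(n)],             # (69)
        "raw_after": [l0_raw + F * t_enc[i] - (i + 1) for i in range(n)],        # (70)
    }


levels = encoder_buffer_levels(bits=[1000.0, 500.0], times=[0.5, 0.25],
                               l0_enc=600.0, l0_raw=2.0, r=1000.0, F=4.0)
```

Each "after" level differs from the matching "before" level by exactly the current frame: its bits are added to the bit buffer and one raw frame is removed from the input buffer.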
The buffer-constrained encoding complexity allocation problem can be expressed as
However, this formulation suffers from two potential problems. The first drawback stems from our assumption that the encoding end time for a frame i coincides with the encoding start time for the frame i+1. For example, if the i-th frame is dropped and the input buffer level falls below lminenc, the encoder will stall until the minimum buffer level is reached. This will make τi non-negligible, which in turn will require ci to be very high (or even infinite if we assume that the nominal encoding time for a dropped frame is strictly zero). We overcome this problem by relaxing the constraint c≦cmax. If some elements of the optimal complexity allocation vector c exceed the maximum complexity scale, we will say that the encoder yields a portion of its CPU time to other processes potentially competing with it over the CPU resources. This fact can be quantified by introducing a vector of CPU utilization η=(η1, . . . , ηn)T, where 0≦ηi≦1 expresses the fraction of the CPU time used by the encoder process in the time interval [t̂i−1enc, t̂ienc], t̂0enc=0. Setting
and c=min{c*,cmax}, the encoding complexity scale of frames with ci* exceeding cmax will be set to cmax, and the utilizable CPU time will be lower than 100%. The second difficulty in the allocation problem (71) stems from the fact that sometimes the input and output buffer constraints may be conflicting (or, more formally, the feasible region may be empty).
Definition 2 (Ordered constraints). Given an ordered m-tuple of indicator functions χ=(χ1, . . . , χm)T, χi: Rn→{0, 1}, a vector x is said to satisfy the ordered constraints χ(x) if there exists no y with χ(y)<χ(x), where < denotes the lexicographic order relation between binary strings.
Informally, this definition implies that in case of conflicting constraints, satisfying the first constraint is more important than satisfying the second one, and so on. Using this relaxed notion of constrained optimality, allocation problem (71) can be rewritten as
where the constraints are interpreted as ordered constraints.
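The ordered-constraint notion of Definition 2 can be illustrated on a finite candidate set. The sketch below assumes that each indicator returns 1 when its constraint is violated and 0 when it is met, so that a lexicographically smaller tuple is better; this convention, the function name, and the toy constraints are illustrative assumptions.

```python
def satisfies_ordered_constraints(x, candidates, indicators):
    """True if no candidate has a lexicographically smaller violation tuple than x."""
    def violations(v):
        # Python compares tuples lexicographically, first element first.
        return tuple(chi(v) for chi in indicators)

    vx = violations(x)
    return all(violations(y) >= vx for y in candidates)


# Two conflicting toy constraints on a scalar: c <= 4 (more important), c >= 6.
indicators = [lambda c: int(c > 4), lambda c: int(c < 6)]
candidates = range(10)
print(satisfies_ordered_constraints(3, candidates, indicators))
print(satisfies_ordered_constraints(7, candidates, indicators))
```

Since both constraints cannot hold at once, the value 3 (which meets the first constraint) satisfies the ordered constraints, whereas 7 (which meets only the second) does not.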
The time complexity allocation (73) is solved using Algorithm 2 similar to Algorithm 1 for bit allocation. In the algorithm, we assume that the encoding time of a frame i as a function of ci is given by the model
where the coefficients αt=(α1t, . . . , αnt)T and βt=(β1t, . . . , βnt)T depend on the frame content and frame types, and γt=(α1t+β1tq′1, . . . , αnt+βntq′n)T.
Joint Bit and Encoding Time Allocation
Allocation problems (61) and (73) tacitly assume that allocation of quantization and encoder time complexity scales are independent. However, while it is reasonable to assume that the coding quality p is a function of q only (and, thus, is independent of c), the amount of produced bits b clearly depends on both parameters. This dependence couples together the two problems through the more accurate expression for the encoder bit buffer levels
$$\hat{l}_{-}^{\mathrm{enc}}(q,c) = l_0^{\mathrm{enc}} + K\hat{b}(q,c) - rJ\hat{\tau}(c) \tag{75}$$
$$\hat{l}_{+}^{\mathrm{enc}}(q,c) = l_0^{\mathrm{enc}} + J\hat{b}(q,c) - rJ\hat{\tau}(c) \tag{76}$$
and the decoder bit buffer levels
$$\hat{l}_{-}^{\mathrm{dec}}(q,c) = l_0^{\mathrm{dec}} - K\hat{b}(q,c) + \hat{r}^{\mathrm{dec}}t^{\mathrm{dec}} \tag{77}$$
$$\hat{l}_{+}^{\mathrm{dec}}(q,c) = l_0^{\mathrm{dec}} - J\hat{b}(q,c) + \hat{r}^{\mathrm{dec}}t^{\mathrm{dec}} \tag{78}$$
which now depend on both q and c. Unfortunately, a joint bit and time complexity allocation problem is not well-defined, since combining the two vector-valued objectives q and c can no longer be treated using the max-min optimality framework, as the two vectors are non-commensurable. However, joint allocation can be performed by alternately solving the bit and time allocation problems, as suggested by the following algorithm
The convergence condition can be a bound on the change in c* and q*, the number of iterations, or any combination thereof. Our practice shows that a single iteration of this algorithm usually produces an acceptable allocation.
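The alternation between the two allocation problems can be sketched as a simple loop. The two solver callables, the order in which they are invoked, and the stopping rule below are assumptions made for illustration; they stand in for the bit and time allocation procedures rather than reproduce them.

```python
def joint_allocation(solve_bits, solve_time, c_init, max_iters=5, tol=1e-3):
    """Alternate between bit allocation (given c) and time allocation (given q).

    solve_bits(c) -> q and solve_time(q) -> c are assumed callables wrapping
    the two allocation procedures. Stops on a small change or after max_iters.
    """
    c = list(c_init)
    q = None
    for _ in range(max_iters):
        q_new = list(solve_bits(c))
        c_new = list(solve_time(q_new))
        change = max(abs(a - b) for a, b in zip(c_new, c))
        if q is not None:
            change = max(change, max(abs(a - b) for a, b in zip(q_new, q)))
        q, c = q_new, c_new
        if change <= tol:
            break
    return q, c


# Toy solvers with a known fixed point (purely illustrative).
q_star, c_star = joint_allocation(
    solve_bits=lambda c: [10.0 + 0.1 * ci for ci in c],
    solve_time=lambda q: [0.5 * qi for qi in q],
    c_init=[0.0], max_iters=50, tol=1e-9)
```

Setting max_iters=1 gives the single-pass variant, in which the bit allocation is solved once and the time allocation follows.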
Frame Type Decision
The purpose of frame type decision is to associate with a sequence of n frames an optimal sequence of frame types. The frame type with which an i-th frame is encoded is denoted by fi. To simplify notation, we assume that fi also specifies whether the frame is used as reference, whether it is an IDR, and which frames are used for its temporal prediction (unless it is a spatially predicted frame). For example a frame i=1 can be IDR, I, P predicted from frame 0, or DROP. The space of possible frame type assignment for a frame i is denoted by Fi, and depends solely on the status of the reference buffer immediately before the frame is encoded, denoted by Ri (Ri is defined as the list of reference frame indices). Speaking more broadly, Fi is a function of the encoder state immediately prior to encoding a frame i, defined as the pseudo-vector
$$\sigma_i = (R_i,\ l_{i-}^{\mathrm{enc}},\ l_{i-}^{\mathrm{raw}},\ l_{i-}^{\mathrm{dec}},\ \iota_i^{\mathrm{nd}},\ t_i^{\mathrm{idr}}) \tag{79}$$
where li−enc, li−raw, and li−dec denote the levels of the encoder bit buffer, raw buffer, and decoder bit buffer, respectively, ιind denotes the index of the last non-dropped frame, and tiidr denotes the presentation time of the last IDR frame. σi fully defines the instantaneous encoder state. In practice, only estimated buffer levels are available. We will henceforth denote the estimated encoder state by
$$\hat{\sigma}_i = (R_i,\ \hat{l}_{i-}^{\mathrm{enc}},\ \hat{l}_{i-}^{\mathrm{raw}},\ \hat{l}_{i-}^{\mathrm{dec}},\ \iota_i^{\mathrm{nd}},\ t_i^{\mathrm{idr}}) \tag{80}$$
Note that Ri, ιind, and tiidr in σ̂i are fully deterministic.
It is important to observe that fi does not fully define the set of control parameters required by the encoder in order to encode the i-th frame, as it does not specify the quantization and complexity scales qi and ci. In order to separate the optimization of frame type and reference indices from the optimal quantizer and complexity allocation, we assume that given the sequence of frame types f=(f1, . . . , fn), the allocation algorithms described in the previous section are invoked to find q(f)=(q*1, . . . , q*n) and c(f)=(c*1, . . . , c*n). As a consequence, the amount of bits produced and the amount of time consumed by encoding the i-th frame can be expressed as functions of f. To simplify notation, we will denote the latter quantities by bi and τi, respectively. Similarly, pi will denote the distortion of the i-th frame. In the case where fi=DROP, the distortion is evaluated as the PSNR of the difference between the original frame i and the last displayed frame ιind.
We jointly refer to ai=(fi, q*i, c*i) as the encoder action for the frame i. Note that although ai is defined for a single frame, the optimal values of q*i and c*i depend on the fi's of the entire sequence of frames 1, . . . , n. As a consequence, a can be determined only as a whole, i.e., ai+1 is required in order to determine ai.
Encoder State Update
Given the encoder state σi and the action ai, the next state σi+1 is unambiguously determined by the state update rule σi+1=σ(σi,ai). In practice, the update rule is applied to the estimated state, σ̂i+1=σ(σ̂i,ai). The update for the buffer levels is given by
$$\begin{aligned}
\hat{l}_{i+1}^{\mathrm{enc}} &= \hat{l}_{i-}^{\mathrm{enc}} + \hat{b}_i - r_i\hat{\tau}_i \\
\hat{l}_{i+1}^{\mathrm{raw}} &= \hat{l}_{i-}^{\mathrm{raw}} + F\hat{\tau}_i - 1 \\
\hat{l}_{i+1}^{\mathrm{dec}} &= \hat{l}_{i-}^{\mathrm{dec}} - \hat{b}_i + \hat{r}_i^{\mathrm{dec}}\,(t_{i+1}^{\mathrm{dec}} - t_i^{\mathrm{dec}}),
\end{aligned} \tag{81}$$
where tidec is the decoding time stamp of the i-th frame, ri is the average encoder bit buffer draining rate at the time t̂ienc, and r̂idec is the predicted decoder bit buffer filling rate at the time tidec. The last displayed frame index ιind is updated according to
The last IDR presentation time stamp is updated according to
The reference buffer is updated according to the sliding window policy
where min Ri denotes the smallest frame index found in the reference buffer, |Ri| denotes the number of frames in the reference buffer, and Rmax stands for the maximum reference buffer occupancy. It is important to emphasize that though the next state σi+1 depends only on σi and ai, ai itself depends on a1, . . . , ai−1, ai+1, . . . , an. Formally, this can be expressed by saying that the update of the full encoder state is non-Markovian. However, some constituents of the encoder state do satisfy the Markovian property. We denote by
$$\sigma_i^{M} = (R_i,\ \iota_i^{\mathrm{nd}},\ t_i^{\mathrm{idr}}) \tag{85}$$
the Markovian part of the state, whose update rule can be expressed as
$$\sigma_{i+1}^{M} = \sigma(\sigma_i^{M}, f_i) \tag{86}$$
(note the dependence on fi only). On the other hand, the remaining constituents of the encoder state
$$\hat{\sigma}_i^{NM} = (l_{i-}^{\mathrm{enc}},\ l_{i-}^{\mathrm{raw}},\ l_{i-}^{\mathrm{dec}}) \tag{87}$$
are non-Markovian, since their update rule requires b̂i and τ̂i, which, in turn, depend on the entire sequence f through q* and c*. The update rule for σiNM is a function of the initial state σ1NM and the entire f,
$$\sigma_i^{NM} = \sigma(\sigma_1^{NM}, f) \tag{88}$$
Action Sequence Cost
Given a sequence of encoder actions α=(α1, . . . ,αn), we associate with it a sequence cost ρ(α), defined as
where typically λbuf=10, λdis=1, λbit=100, λidr=0.5, λqp=0.01. The constituent terms of the cost function are defined as follows.
Buffer cost penalizing the estimated decoder bit buffer violation is defined as
is a single-sided hyperbolic penalty function, and ε is a small number, typically ε≈10^−6.
Distortion cost penalizing the frame distortion is given by
$$\rho_{\mathrm{dis}}(\hat{p}_i) = 255^2 \cdot 10^{-0.1\,\hat{p}_i}. \tag{92}$$
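The distortion cost (92) is the standard conversion of a PSNR value in dB back to a mean squared error for 8-bit samples (peak value 255). The short sketch below shows the conversion and its inverse; the function names are illustrative.

```python
import math


def distortion_cost(psnr_db: float) -> float:
    """Eq. (92): map a PSNR in dB to the equivalent mean squared error."""
    return 255.0 ** 2 * 10.0 ** (-0.1 * psnr_db)


def psnr_from_mse(mse: float) -> float:
    """Inverse mapping, PSNR = 10 log10(255^2 / MSE), for 8-bit samples."""
    return 10.0 * math.log10(255.0 ** 2 / mse)


# Round trip: an MSE of 42 converted to PSNR and back.
print(distortion_cost(psnr_from_mse(42.0)))
```

Because the cost is exponential in the PSNR, a drop of 10 dB raises the penalty by a factor of ten, so low-quality frames dominate the sum.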
Drop cost penalizing the dropped frame distortion is given by
where p̂imin is the estimated minimum PSNR. For a dropped frame, p̂imin is computed as the 0.1%-quantile of the PSNR of the difference between the original frame i and the last displayed frame ιind. For a non-dropped frame, p̂imin=p̂i.
IDR cost penalizes a too early IDR frame,
For fi≠IDR, the cost is given by
penalizing a too late IDR frame. The cost is constructed in such a way that an IDR is placed in the time interval [tminidr,tmaxgop].
Quantizer fluctuation cost penalizes the deviation from the average sequence quantization parameter according to
P
idr(q′i)=max{2q′
where
The decay parameter is set to λ=0.99, and may be adjusted according to the sequence frame rate.
Bit budget deviation cost penalizes for action sequences resulting in a deviation from the allocated bit budget bT
and ∈=0.1.
Action Sequence Optimization
Using the notion of the action sequence cost, the frame type optimization problem can be expressed as the minimization problem
Since the search space F1× . . . ×Fn is usually very large, the complexity of finding the best action sequence by exhaustive search is prohibitive. Due to the non-Markovian property of the cost function, no greedy Markov decision algorithms can be employed. However, one can construct a Markovian lower bound for ρ, depending on fi only, which allows the search space to be pruned significantly.
We observe that though the estimated frame distortion p̂i depends on the selection of the quantizer, it can be bounded below by the distortion achieved if bT bits were allocated to the i-th frame alone. The lower bound p̂−i on the frame distortion can be expressed as
where q̂(b) is the inverse function of b̂(q) given by
(the model coefficients αib and βib depend on fi). For dropped frames, the bound is exact; moreover, p̂imin can be estimated as well.
Aggregating the terms of ρ that do not depend on the quantizer, the following lower bound on the action cost is obtained.
(though a lower bound on the buffer penalty ρbuf can also be devised, we do not use it here for simplicity). Note that unlike ρ, the lower bound ρ̲ is additive, that is
The lower bound ρ̲(α) is used in the branch and bound Algorithm 4 for solving the combinatorial minimization problem (102). The algorithm is first invoked with f=Ø, ρ=∞, and the current state of the encoder. The order of the loop searching over all feasible frame types F1 should be selected to maximize the probability of decreasing the bound ρ as fast as possible. Typically, the best ordering of F1 is DROP, followed by P (if multiple reference frames are available, they are ordered by increasing display time difference relative to the current frame), followed by IDR. Though not considered here, non-IDR I frames and non-reference P or B frames can be straightforwardly allowed for as well.
In the case where complexity allocation is performed after bit allocation and the latter is not reiterated, complexity allocation may be removed from the frame type decision algorithm and performed after the optimal frame types are assigned.
Branch and Bound Frame Type Decision Algorithm
A flow chart corresponding to the Branch and bound frame type decision algorithm is shown in
When the recursion does not reach its end, the system initializes the sequence cost to infinity at step 312 and starts iterating over every possible frame type f1. The number of frame types in F1 is denoted by L(F1); the index i is used to process all possible frame types, being initialized to 1 at step 314 and checked against the ending condition i≦L(F1) at step 366. For each frame type f1, the corresponding sequence cost is computed and compared with the lower bound ρ̲ at step 320. If the computed cost is higher than the lower bound, no further processing is needed, as shown by the “yes” branch at step 320.
If the computed sequence cost is smaller than the lower bound, the Markovian encoder state is updated at step 322 and the frame type f1 is added to the frame type sequence f at step 324. Both the frame index n and the frame type index i are updated at step 326. After the frame index is decremented by 1, the Frame Type Decision recursion occurs, since step 350 is within the main function at step 300. The corresponding action sequence, sequence cost, and lower bound on the sequence cost are updated at step 360. The sequence cost is then compared with the optimal cost at step 362, and the optimal action sequence and sequence cost are updated at step 364 if the newly computed cost is lower.
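The pruning idea can be sketched generically as follows. This is a simplified illustration and not a transcription of Algorithm 4: it assumes a fixed set of frame types for every frame (whereas Fi depends on the encoder state), a hypothetical per-frame bound `step_bound`, and a hypothetical full-sequence cost `sequence_cost` that is never smaller than the sum of the per-frame bounds.

```python
import math


def branch_and_bound(n, frame_types, step_bound, sequence_cost):
    """Minimize sequence_cost over frame type sequences of length n.

    step_bound(i, t): additive lower-bound contribution of type t at position i.
    sequence_cost(seq): full (non-additive) cost, assumed >= sum of step bounds.
    """
    best = {"cost": math.inf, "seq": None}

    def recurse(prefix, bound_so_far):
        if len(prefix) == n:
            cost = sequence_cost(tuple(prefix))
            if cost < best["cost"]:
                best["cost"], best["seq"] = cost, tuple(prefix)
            return
        for t in frame_types:  # try the most promising types first
            bound = bound_so_far + step_bound(len(prefix), t)
            if bound >= best["cost"]:
                continue  # prune: this branch cannot beat the incumbent
            prefix.append(t)
            recurse(prefix, bound)
            prefix.pop()

    recurse([], 0.0)
    return best["seq"], best["cost"]


# Toy costs (hypothetical): per-type bound plus a penalty if no IDR is used.
per_type = {"DROP": 5, "P": 1, "IDR": 3}
seq, cost = branch_and_bound(
    3, ("DROP", "P", "IDR"),
    step_bound=lambda i, t: per_type[t],
    sequence_cost=lambda s: sum(per_type[t] for t in s) + (0 if "IDR" in s else 10))
```

The additivity of the bound is what allows it to be accumulated along a partial sequence, so that a whole subtree is discarded as soon as the running bound reaches the best cost found so far.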
Encoding Order Optimization
Given a sequence of n frames indexed by incrementing display order, the purpose of encoding order optimization is to find their optimal encoding order. For notational convenience, we denote the frame reordering as a permutation π of {1, . . . , n}. We also denote by α*(π) and ρ*(π) the optimal action sequence and its cost, respectively, found using the frame type decision algorithm from the previous section applied to the ordered sequence of frames π1, . . . , πn. Using this formalism, encoding order optimization can be expressed as finding the permutation minimizing
Since out-of-display-order encoding requires temporarily storing the decoded frames until the next consecutive frame has been decoded, the search space of feasible permutations is constrained by the decoded picture buffer level. In H.264, the decoded picture buffer is shared with the reference buffer.
We augment the Markovian component of the encoder state with the last displayed frame index ιidisp specifying the largest frame index that has been consecutively decoded from the sequence start at the time when frame i is being encoded. The last displayed frame is updated according to
where con(Ri) denotes the largest subsequence of consecutive frame indices in Ri starting with min Ri (e.g., if Ri={1, 2, 3, 5}, con(Ri)={1, 2, 3}; if Ri={1, 3, 4, 5}, con(Ri)={1}, etc.). We also modify the state update rule for the reference buffer as
and con(Ri,k) denotes the sequence of at most k smallest elements in con(Ri) (or less, if |con(Ri)|<k). Note that πi replaces i, and the update scheme now allows frames to be locked in the reference buffer until a consecutive sequence is formed.
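The operators con(Ri) and con(Ri,k) can be written in a few lines; the sketch below reproduces the two examples given above. The function name and the use of sorted lists for the result are illustrative choices.

```python
def con(ref_buffer, k=None):
    """Longest run of consecutive indices in ref_buffer starting at its minimum.

    With k given, return at most the k smallest elements of that run.
    """
    run = []
    for idx in sorted(ref_buffer):
        if run and idx != run[-1] + 1:
            break  # gap found: the consecutive run ends here
        run.append(idx)
    return run if k is None else run[:k]


print(con({1, 2, 3, 5}))      # run 1, 2, 3
print(con({1, 3, 4, 5}))      # run 1 only
print(con({1, 2, 3, 5}, 2))   # at most two elements of the run
```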
In order to satisfy the reference buffer size constraints, |Ri|≦Rmax has to hold for every frame in the sequence. This condition can be verified prior to minimizing ρ*(π); permutations not satisfying the reference buffer constraints are discarded from the search space. In some applications, additional constraints may apply to Π in order to enforce further frame ordering regularity. For example, Π may be restricted to a few pre-defined ordering patterns providing regular single- or multi-level temporal scalability of the encoded sequence. The encoding order optimization procedure is summarized in the following algorithm.
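A brute-force version of the search over permutations can be sketched as below. The feasibility check is a simplified stand-in for the reference buffer constraint (it only counts frames waiting to become displayable), and `order_cost` is a hypothetical callable playing the role of ρ*(π); neither is taken from the algorithm listing.

```python
from itertools import permutations


def fits_buffer(order, capacity):
    """Simplified check: frames decoded ahead of display order wait in a buffer."""
    waiting, next_display = set(), 1
    for frame in order:
        waiting.add(frame)
        while next_display in waiting:  # release frames that became displayable
            waiting.remove(next_display)
            next_display += 1
        if len(waiting) > capacity:
            return False
    return True


def best_encoding_order(n, order_cost, capacity):
    """Exhaustively pick the feasible permutation of 1..n with minimum cost."""
    best_order, best_cost = None, float("inf")
    for order in permutations(range(1, n + 1)):
        if not fits_buffer(order, capacity):
            continue  # discard permutations violating the buffer limit
        cost = order_cost(order)
        if cost < best_cost:
            best_order, best_cost = order, cost
    return best_order, best_cost


# Hypothetical cost table standing in for the frame type decision result.
costs = {(1, 2, 3): 10.0, (1, 3, 2): 8.0, (3, 2, 1): 5.0}
order, cost = best_encoding_order(3, lambda o: costs.get(o, 20.0), capacity=1)
```

With a buffer of one frame the cheapest order (3, 2, 1) is infeasible and (1, 3, 2) is selected. Restricting the search to a few pre-defined patterns amounts to replacing the permutation generator with a short list.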
Encoding Order Optimization Algorithm
The flow chart of the encoding order optimization algorithm is shown in