The present invention relates generally to rate control in video coding. Particularly, the present invention relates to coding tree unit level rate-distortion optimization for rate control in video coding.
Recently, several studies have been conducted to improve rate control (RC) optimization in High Efficiency Video Coding (HEVC). RC algorithms for HEVC fall into three categories: the quadratic model [1], the ρ-domain model [2] and the R-λ model. More specifically, Li et al. [3] first proposed λ-domain RC based on the relationship between coding bits and the Lagrange multiplier. Owing to its low complexity and high efficiency, the R-λ model has been adopted in the HEVC reference software as the default RC algorithm. Lee et al. [4] investigated the Laplacian probability distribution function (PDF) to model the residue and proposed independent R-Q models that relate the quantization parameters to the coding bits, including texture and non-texture bits. Intra-frame RC algorithms have also been studied. Li et al. [5] proposed an adaptive bit allocation algorithm to improve the R-λ model RC algorithm for intra frames. In [6], the sum of absolute transformed differences (SATD) was used to measure intra-frame complexity, which further improves the performance. Wang et al. [7] proposed an intra R-λ model in which the gradient was used to characterize the picture complexity.
In
The disclosures of the above references are incorporated herein by reference in their entirety.
The present disclosure relates to methods based on coding tree unit (CTU) level rate-distortion (R-D) optimization for rate control (RC) in video coding which can effectively improve the perceptual rate-distortion performance and coding efficiency. Firstly, a perceptual R-D model is established using a divisive normalization framework, which characterizes the relationship between local visual quality and coding bits. Subsequently, the established perceptual R-D model is applied to overall distortion optimization which is transformed into a global optimization problem and solved with convex optimization algorithms to obtain optimal CTU level coding bit allocation.
Embodiments of the invention are described in more detail hereinafter with reference to the drawings, in which:
In the following description, methods and apparatus using coding tree unit (CTU) level rate-distortion optimization for rate control (RC) in video coding are set forth as preferred examples. It will be apparent to those skilled in the art that modifications, including additions and/or substitutions may be made without departing from the scope and spirit of the invention. Specific details may be omitted so as not to obscure the invention; however, the disclosure is written to enable one skilled in the art to practice the teachings herein without undue experimentation.
In one aspect, the present invention may be implemented in the video coding system 100 of
Encoding is defined as computationally modifying a video source 10 to a different form. Encoding may include compression (in which the amount of data is reduced), enhancement, resolution changes, and aspect ratio changes. In one aspect, the encoding may be performed according to the High Efficiency Video Coding (HEVC)/H.265 standard. Frames are generated in a frame generating module 25. Subsequently, each frame may be divided into one or more CTUs in CTU module 22.
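As an illustration of the division of a frame into CTUs, the following minimal sketch lists the CTU rectangles of a frame in raster order. The function name `partition_into_ctus` and the default CTU size of 64×64 (the largest size HEVC permits) are assumptions made for this example and are not taken from the disclosure.

```python
def partition_into_ctus(width, height, ctu_size=64):
    """Return (x, y, w, h) rectangles covering a frame in raster order.

    ctu_size=64 is an assumption for illustration; CTUs on the right and
    bottom borders are cropped to the frame boundary.
    """
    ctus = []
    for y in range(0, height, ctu_size):
        for x in range(0, width, ctu_size):
            w = min(ctu_size, width - x)   # crop at the right border
            h = min(ctu_size, height - y)  # crop at the bottom border
            ctus.append((x, y, w, h))
    return ctus
```

For a 416×240 frame, this sketch yields 7 columns and 4 rows of CTUs, with the last column and last row cropped.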
Video encoded by the encoder 20 forms a video bitstream 30 that represents information from the video source 10. The video bitstream 30 is transmitted or transferred to decoder 50 over transmission medium 40.
Transmission medium 40 may be a wired or wireless communication network or a file transfer to decoder 50.
Decoder 50 takes the video bitstream 30 and creates a video stream 60, which is a computationally modified version of the video source 10. The video stream 60 may have properties that differ from those of the source, such as a different frame rate, resolution, color parameters, view order, aspect ratio, or any combination thereof.
The video stream 60 is transmitted to a display medium 70 including a display processor 75. The display processor 75 can receive the video stream 60 from the video decoder 50 for display by the display medium 70.
The video coding system 100 can employ a variety of video coding syntax structures. For example, the video coding system 100 can encode and decode video information using High Efficiency Video Coding/H.265 (HEVC), scalable extensions for HEVC (SHVC), or other video coding syntax structures.
The video encoder 20 and the video decoder 50 may be implemented by hardware, software, or a combination thereof. For example, the video encoder 20 may be implemented with custom circuitry, a digital signal processor, a microprocessor, or a combination thereof. In another example, the video decoder 50 can be implemented with custom circuitry, a digital signal processor, a microprocessor, or a combination thereof.
According to one embodiment of the present invention, a method based on CTU level rate-distortion optimization for RC in video coding is provided. The method may be implemented in the system of
$$D'(R) = \frac{D(R)}{f^{2}} \tag{1}$$
where D(R) is the MSE distortion, D′(R) is the normalized perceptual distortion, and R is the bit rate of the CTU. In general, the MSE distortion, D(R), may be defined by:
$$D(R) = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left[I(i,j) - K(i,j)\right]^{2} \tag{2}$$
where I and K are the original frame and the reconstructed frame, respectively; I(i, j) and K(i, j) are the corresponding pixel values; and m and n are the numbers of rows and columns in a frame, respectively.
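The two quantities defined above can be sketched as follows. This is a minimal illustration in plain Python, assuming frames are given as equally sized lists of pixel rows and that the divisive normalization factor f has already been computed elsewhere; the function names are hypothetical.

```python
def mse(original, reconstructed):
    """Mean squared error between two equally sized 2-D pixel arrays."""
    m = len(original)       # number of rows
    n = len(original[0])    # number of columns
    total = 0.0
    for row_o, row_r in zip(original, reconstructed):
        for p_o, p_r in zip(row_o, row_r):
            total += (p_o - p_r) ** 2
    return total / (m * n)


def perceptual_distortion(mse_value, f):
    """Equation (1): MSE divided by the squared normalization factor f."""
    return mse_value / (f * f)
```

A larger factor f lowers the normalized distortion for the same MSE, which is how the normalization weights pixel errors by local content.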
To obtain the divisive normalization factor f, each CTU can be divided into I sub-blocks for the discrete cosine transform (DCT), and the factor f is obtained from the Structural Similarity (SSIM) index in the DCT domain:
where E(·) denotes the expectation over the whole frame, U(j) and V(j) denote the DCT coefficients of the input and reconstructed signals, respectively, and Ui(j) and Vi(j) are the corresponding j-th DCT coefficients in the i-th sub-block.
In some embodiments, the DCT coefficients of the reconstructed signals are approximated by those of the original input signals, because the frame has not yet been encoded when the normalization factors are derived. CI is the constant in accordance with the definition of the SSIM index. NL is the sub-block size, and may be set to 16. However, it should be understood by those skilled in the art that the sub-block size can be set to any other value for deriving the divisive normalization factor f.
Given the available bit rate allocated to the frame, the CTU level rate control may be achieved by CTU level bit allocation through optimizing the perceptual distortion by minimizing a perceptual rate distortion cost function J defined by:
where λ is the Lagrange multiplier in HEVC, which is also used when the distortion is normalized with the divisive normalization strategy, D′(Ri) is the perceptual distortion of the i-th CTU with a coding bit rate Ri, and N is the number of CTUs in one frame.
In some embodiments, a global optimization approach for optimizing the CTU level coding bit allocation may be used, wherein all CTUs in a frame compete for resources under the constraint of the target frame-level coding bits. Therefore, the CTU level coding bit allocation can be performed effectively by solving an optimization problem. The scheme of the present invention not only improves the reconstruction quality and coding efficiency in terms of perceptual rate-distortion, but also supports more accurate R-D modelling.
In the global optimization approach, the CTUs of a frame, denoted as CTU1, CTU2, . . . , CTUN, may be allocated utilities in the form of coding bit rates R1, R2, . . . , RN, respectively. Each possible combination of utilities may be expressed as a utility vector Um=(R1m, R2m, . . . , RNm), m∈[0, M], where M is the number of possible utility combinations.
As the utility set U=(U1, U2, . . . , UM) is non-empty and bounded, and the set of feasible utilities is convex, the CTU level rate control can be achieved by an optimal bit rate allocation. The optimal bit allocation may be found by minimizing the average distortion, which depends on the perceptual distortion D′(R). As such, the CTU level bit allocation may be formulated as:
where N is the number of CTUs of one frame and Rc is the frame-level bit rate.
Therefore, the perceptual distortion optimization problem can be converted from a constrained optimization problem into an unconstrained optimization problem and the cost function J of Equation (4) may be converted to:
Equation (6) seeks the minimum of a differentiable convex function over a convex set. Therefore, the Karush-Kuhn-Tucker (KKT) conditions ensure that a local optimal solution of Equation (6) is a KKT point and that this local optimal solution is also the global optimal solution.
Taking the video content into consideration, the relationship between the normalized perceptual distortion D′(R) and the bit rate R may be depicted with a logarithmic R-D model:
$$D'(R) = \ln\left(c \times R^{-k}\right) \tag{7}$$
where c and k are model parameters depending on the video content.
The prediction accuracy of the logarithmic R-D model may be validated by calculating the average Pearson correlation coefficient between the predicted and actual values for a series of test sequences with different QPs, including PeopleOnStreet (1600p), ParkScene (1080p), FourPeople (720p), BQMall (832×480) and BQSquare (416×240). It can be seen from a table in
The effectiveness of the logarithmic R-D model is also validated using a Low Delay B (LDB) coding structure with reference image in HM 16.8. The values of c and k are obtained by fitting the actual values with the model.
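The fitting and validation steps described above can be sketched as follows. The sketch assumes that Equation (7) is read as D′(R) = ln(c × R^(−k)) = ln c − k ln R, so that c and k follow from an ordinary least-squares line fit of D′ against ln R; the fitting procedure and all function names are assumptions for illustration, since the text does not state how the fit is performed.

```python
import math


def fit_log_rd(rates, distortions):
    """Least-squares fit of D' = ln(c) - k * ln(R); returns (c, k)."""
    xs = [math.log(r) for r in rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(distortions) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, distortions))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # c = e^intercept, k = -slope


def predict_distortion(c, k, rate):
    """Assumed reading of Equation (7)."""
    return math.log(c) - k * math.log(rate)


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

Given measured (rate, distortion) pairs for a CTU, `fit_log_rd` returns the model parameters, and `pearson` applied to the predicted and measured distortions gives the correlation used as the accuracy measure.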
Based on the logarithmic R-D model, the optimal coding bit for each CTU may be obtained by solving the equation:
where Rj and kj are the coding bit and model parameter for the j-th CTU, respectively.
Given Eq. (8), we have,
Then the following relationship can be derived by
Subsequently, by substituting Equation (10) into Equation (8), we have
Accordingly, the optimal coding bits of the jth CTU, denoted as R*j, may be determined by
After obtaining the parameter kj and R*j for the j-th CTU of the current to-be-encoded i-th frame, the CTU level target bit budget may be further adjusted by:
where ωa is an adjustment term that regularizes the CTU level bits such that the frame-level budget can be met, and Ract,p and R*p are the actual bits and the target bits after bit allocation, respectively. The corresponding QP can be obtained for each CTU through the R-Q model disclosed in [1].
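The allocation that precedes this adjustment can be sketched under two assumptions, since the intermediate equations are not reproduced in this text: Equation (7) is read as D′(R) = ln(c × R^(−k)), and the frame budget is met with equality. Setting the derivative of the Lagrangian to zero then gives kj/Rj = λ for every CTU, and summing over all CTUs gives R*j = Rc × kj / Σ ki. The code below is a reconstruction under those assumptions, not necessarily the exact expression of Equation (12), and it does not implement the adjustment term ωa.

```python
import math


def allocate_ctu_bits(k_values, frame_bits):
    """Split frame_bits across CTUs in proportion to their k parameters."""
    total_k = sum(k_values)
    if total_k <= 0:
        raise ValueError("sum of k parameters must be positive")
    return [frame_bits * k / total_k for k in k_values]


def total_distortion(c_values, k_values, bits):
    """Sum of ln(c_j) - k_j * ln(R_j) over all CTUs (assumed model)."""
    return sum(math.log(c) - k * math.log(r)
               for c, k, r in zip(c_values, k_values, bits))
```

Only the k parameters influence the split; the c parameters shift the total distortion by a constant and therefore drop out of the optimum.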
In some embodiments, optimal values of the parameter k for each CTU in a current to-be-encoded i-th frame may be estimated with an updating strategy based on the coding statistics of a previously encoded (i−1)-th frame. In particular, the optimal value of k for a j-th CTU in the i-th frame may be obtained by minimizing the difference between a true distortion Dreal of the j-th CTU in the previously encoded (i−1)-th frame and an estimated distortion Dcomp for the j-th CTU of the i-th frame.
The true distortion Dreal may be estimated with Equation (1), $D'(R) = D(R)/f^{2}$. Keeping the distortion of adjacent frames close is important for consistent quality, and the distortion of the current CTU is similar to that of the co-located CTU in the previous frame. Therefore, the distortion of the co-located CTU may be used to obtain Dcomp.
The difference between Dreal and Dcomp can be represented by a squared error function, denoted as $e^{2}$, which is expressed as:
$$e^{2} = \left(D_{real} - D_{comp}\right)^{2} \tag{14}$$
By taking the derivative of $e^{2}$ with respect to k, we have
Based on the Taylor expansion, the optimal value of k for the j-th CTU of the i-th frame, knew, may be obtained by:
where λk is a constant which is preferably set to be 0.05, kold is the value of k of the co-located CTU in the previous frame, and R is the bit rate for the to-be-encoded CTU.
It should be noted that λk in Equation (16) can be adapted to the video content, and that the consistency of the model parameters between two consecutive frames is important for quality control in video coding. For rate control that produces videos with consistent quality, the model parameters of a CTU should be consistent with those of the co-located CTU in the previous frame. Therefore, the value of k of the co-located CTU in the previous frame may be used as kold in Equation (16) for computing knew. The initial value of k may be set to an arbitrary value, such as the value of 2.5 used in the experiments. It is also worth mentioning that the initial value of k is not critical for CTU level rate control in the present invention, as the value is continually updated during the actual coding process.
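The updating strategy can be sketched as a single gradient step on the squared error of Equation (14). The sketch assumes that Dcomp is computed from the logarithmic model read as ln(c × R^(−k)), and that the update takes the form knew = kold − λk × ∂e²/∂k; the exact expressions of Equations (15) and (16) are not reproduced in this text, so both the form of the step and the function names are assumptions.

```python
import math


def estimate_distortion(c, k, rate):
    """Assumed reading of Equation (7): D'(R) = ln(c * R**(-k))."""
    return math.log(c) - k * math.log(rate)


def update_k(k_old, c, rate, d_real, step=0.05):
    """One gradient step on e^2 = (d_real - d_comp)^2 with respect to k.

    step corresponds to the constant lambda_k (0.05 in the text).
    """
    d_comp = estimate_distortion(c, k_old, rate)
    # d(d_comp)/dk = -ln(rate), hence d(e^2)/dk = 2 * (d_real - d_comp) * ln(rate)
    gradient = 2.0 * (d_real - d_comp) * math.log(rate)
    return k_old - step * gradient
```

Because the gradient scales with ln R, the effective step size in this sketch depends on the unit chosen for R, which is one reason an adaptive λk may be preferable to a fixed constant.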
Experiments have been carried out to compare performance of the RC method provided in the present invention in various aspects with some state-of-the-art RC methods. In the experiments, an LDB coding structure was used and both non-hierarchical (N-Hie) and hierarchical (Hie) encoding were involved.
The visual quality of the 120th frame of the BQSquare sequence, encoded at a target bit rate of 160 kbps under the hierarchical configuration with some state-of-the-art RC methods and with the method of the present invention, is compared to investigate the subjective quality improvement.
Similarly, the visual quality of the 120th frame of the BQSquare sequence, encoded at a target bit rate of 780 kbps under the hierarchical configuration with the same methods, is also compared.
It can be seen that the method of the present invention can produce better visual quality at a similar bit rate. The experimental results also show that the other state-of-the-art RC methods are more likely to suffer from structural deformation, blocking effects and color artifacts, which degrade the visual quality. Moreover, the method of the present invention achieves better quality in texture areas.
Quality smoothness is another factor influencing the visual quality of experience.
Buffer occupancy is another important factor in rate control, as both overflow and underflow should be avoided. Therefore, stable buffer occupancy is of great importance in evaluating RC performance. The buffer occupancy is mainly determined by the target bits and the actual bits, and may be assessed against the buffer size, Buf, which is defined as:
$$B_{uf} = D_{elay} \times T_{ar} \tag{17}$$
where Delay is the delay time and Tar is the bandwidth.
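Equation (17) and the bookkeeping it supports can be sketched as follows. The occupancy tracker is an assumption for illustration only: it accumulates the surplus of actual bits over target bits frame by frame, which is one simple way to observe whether the occupancy stays within the buffer size. The function names and units (seconds, bits per second) are hypothetical.

```python
def buffer_size(delay_seconds, bandwidth_bps):
    """Equation (17): buffer size as delay time multiplied by bandwidth."""
    return delay_seconds * bandwidth_bps


def track_occupancy(target_bits, actual_bits, start=0.0):
    """Running surplus of actual over target bits after each frame.

    This bookkeeping rule is an assumption, not taken from the disclosure.
    """
    occupancy = start
    trace = []
    for target, actual in zip(target_bits, actual_bits):
        occupancy += actual - target
        trace.append(occupancy)
    return trace
```

A trace that stays close to a constant level and well inside `buffer_size(...)` corresponds to the stable behaviour described above.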
The accuracy of the bit rate at the frame level is also investigated for mismatch error, which is calculated as follows,
where Ract and Rtar are the actual bits and the target bits at the frame level, respectively.
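The expression for the mismatch error is not reproduced in this text. A common definition, assumed here for illustration, is the absolute deviation of the actual bits from the target bits, relative to the target and expressed in percent; the function name is hypothetical.

```python
def mismatch_error_percent(actual_bits, target_bits):
    """Assumed definition: |Ract - Rtar| / Rtar, in percent."""
    if target_bits <= 0:
        raise ValueError("target_bits must be positive")
    return 100.0 * abs(actual_bits - target_bits) / target_bits
```

Under this assumed definition, a frame coded with 10,500 bits against a 10,000-bit target has a mismatch error of 5%.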
The computational complexity of the method of the present invention and of some state-of-the-art RC methods is also compared in terms of encoding time, which is calculated by:
where Tpro and Torg are the encoding times of the scheme of the present invention and of the HM 16.8 anchor, respectively.
The robustness of the RC algorithm of the present invention and of some state-of-the-art methods under the hierarchical configuration is also evaluated and compared on video sequences with dynamic scene changes, including Mobisode, Kimono and Tennis.
The methods based on CTU level rate-distortion optimization for rate control in video coding may be implemented in the apparatus described above and can be incorporated into systems including high definition televisions, mobile or personal computing devices (e.g. “tablet” computer, laptop computer, and personal computer), kiosks, printers, digital cameras, scanners or photocopiers or user terminals having built-in or peripheral electronic displays. The apparatus, including the encoder, may include machine instructions for performing the algorithms; wherein the machine instructions can be executed using general purpose or specialized computing devices, computer processors, or electronic circuitries including, but not limited to, digital signal processors (DSP), application specific integrated circuits (ASIC), field programmable gate arrays (FPGA), and other programmable logic devices. The apparatus may also comprise computer storage media having computer instructions or software codes stored therein which can be used to program computers or microprocessors to perform any of the processes of the present invention. The storage media can include, but are not limited to, floppy disks, optical discs, Blu-ray Disc, DVD, CD-ROMs, and magneto-optical disks, ROMs, RAMs, flash memory devices, or any type of media or devices suitable for storing instructions, codes, and/or data.
The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Various of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art, each of which is also intended to be encompassed by the disclosed embodiments.