The present invention relates generally to processing and coding of image sequences, and more particularly to encoding and decoding images using motion compensation.
Motion-compensated prediction is widely used in image sequence processing and coding, though it is burdened by several disadvantages, such as causal processing of images, challenging rate allocation problems due to dependent quantization, and limited scalability. In recent years, new methods for representing groups of successive images have been developed that take the motion among the successive images into account. Such representations offer perfect reconstruction, allow for multiresolution analysis and synthesis, and aim at compacting the energy of the video signal into a small number of representing coefficients.
A new method for constructing an orthogonal representation for general motion fields has been introduced in U.S. Pat. No. 8,346,000. The transforms that generate these coefficients are defined by a sequence of incremental transforms, which are realized by so-called Euler rotations. Examples are uni-directional transforms, bi-directional transforms, and basic half-pel accurate motion-compensated transforms which are limited to simple averaging interpolation filters. The problem of general sub-pel accurate motion-compensated transforms that support general interpolation filters is not solved in U.S. Pat. No. 8,346,000.
The present invention offers an efficient solution for general sub-pel accurate motion-compensated transforms that support general interpolation filters. First, general interpolation filters are introduced as a constraint in the high-dimensional transform. Second, the transform is obtained in two steps, namely energy compaction and energy redistribution.
Fractional-pel accurate motion is widely used in video processing and coding. For sub-band processing and coding, fractional-pel accuracy is challenging since it is difficult to handle general motion fields with temporal transforms. In the prior invention U.S. Pat. No. 8,346,000, we designed integer-accurate motion-adaptive transforms (MAT) which can transform integer-accurate motion-connected coefficients. In the present invention, we extend the integer MAT to fractional-pel accuracy. The integer MAT allows only one reference coefficient to be the low-band coefficient. In the present invention, we design the transform such that it permits multiple references and generates multiple low-band coefficients. In addition, our fractional-pel MAT can incorporate a general interpolation filter into the basis vector, such that the high-band coefficients produced by the transform can be generated with interpolation filters that are commonly used for sub-pel accurate motion-compensated prediction. The fractional-pel MAT offers perfect reconstruction, orthogonality, and improved coding efficiency.
The present invention is directed to a compact representation of image sequences and related approaches, their uses and systems for the same.
The present invention describes orthogonal motion-adaptive transforms (MAT) which represent the image sequences in a compact representation, while allowing energy conservation and fractional-pel motion accuracy with arbitrary interpolation filters.
A specific implementation of the compact representation compacts the energy of the pixels into fewer pixels. An n-dimensional MAT generates n−1 energy-compacted lowband coefficients and one energy-removed highband coefficient. The interpolation filter is incorporated in the MAT as one basis vector to generate the energy-removed coefficient.
This embodiment describes the energy compaction step of the MAT. This step generates one energy-compacted lowband coefficient and n−1 energy-removed highband coefficients, where the last of them is determined by the interpolation filter.
The first basis vector of the energy compaction transform is determined by the scale factors. The last basis vector is determined by an interpolation filter. The remaining basis vectors can be found by, e.g., the Gram-Schmidt orthogonalization algorithm.
Scale factors are used to track the energy compaction under the assumption of ideal motion. Let x=[x1, x2, . . . , xn]T be a vector of coefficients connected by motion. Let the vector of scale factors associated with x be c=[c1, c2, . . . , cn]T. Each xi (i∈{1, 2, . . . , n}) can also be considered as a lowband coefficient. The scale factor ci is used to represent the compacted energy in the coefficient as xi=cixi′, where xi′ is the original intensity value. Ideal motion assumes that x1′=x2′= . . . =xn′=x′, i.e., these n pixels have the same original intensity x′. Then, the input coefficients can be expressed as
x=x′c. (1)
In the following, a simple example with two coefficients and a Haar transform is used to illustrate the use of scale factors. Let x1 and x2 be the original intensity values, x1=x2=x′. If we compact the energy of x1 and x2 into one lowband coefficient y1, i.e.,
y_1 = (x_1 + x_2)/√2 = √2 x′, (2)
the output lowband coefficient y1=√2x′ becomes a scaled x′ with a factor √2. The scale factor of y1 is √2, which is determined by the factor of x′ in (2). In general, y1 is likely to be used further in hierarchical transforms. Thus, it is helpful to track the energy compaction of each lowband coefficient.
Similarly, if the energy of n pixels of x is compacted to one lowband coefficient, the corresponding scale factor is √n. The scale factors are determined by the motion information alone. They do not require extra information to be encoded.
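The scale-factor bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under the ideal-motion assumption; the helper name `haar` and the intensity value are hypothetical and not part of the specification.

```python
import math

def haar(x1, x2):
    """Orthonormal 2-point Haar step: returns (lowband, highband)."""
    s = math.sqrt(2.0)
    return (x1 + x2) / s, (x2 - x1) / s

x_prime = 10.0  # hypothetical original intensity shared by all pixels

# Level 1: two pairs of pixels, each pixel with scale factor 1.
low_a, high_a = haar(x_prime, x_prime)
low_b, high_b = haar(x_prime, x_prime)

# Level 2: the two lowbands (both with scale factor sqrt(2)) are merged.
low, high = haar(low_a, low_b)

scale_level1 = low_a / x_prime  # energy of 2 pixels in one coefficient
scale_level2 = low / x_prime    # energy of 4 pixels in one coefficient
```

After one level the lowband carries the factor √2, and after two levels the factor √4, which matches the statement that compacting n pixels yields a scale factor of √n.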
Let T be an n×n transform matrix, and y=[y1, y2, . . . , yn]T the output. The transform gives y=TTx. The transform compacts the energy into one lowband coefficient and produces n−1 highband coefficients.
With the assumption of ideal motion (1), the aim is to find an orthonormal transform matrix T that perfectly compacts the energy of x into one lowband coefficient. Let t1, t2, . . . , tn be the basis vectors of T, where t1 represents the lowband vector and tn the highest highband vector. The output coefficients are
y_i = x′ t_i^T c, for i=1, . . . , n. (3)
The first coefficient y1=x′t1Tc is designed to capture the total energy of the signal x. Thus, t1 needs to be collinear with c,
t_1 = c/√(c^T c). (4)
Then, y1=x′√(cTc) contains the total energy of x, and no energy is left in other dimensions. Since t1 represents one dimension in the n-dimensional space, and all the other basis vectors t2, . . . , tn are orthogonal to t1, all highband coefficients y2, . . . , yn are zero. With this, the transform T is able to compact the energy perfectly. The constraint on t1 in (4) is referred to as the subspace constraint.
If x deviates from ideal motion, i.e., x1, x2, . . . , xn are affected by noise, the transform will not give perfect energy compaction into one coefficient. However, the subspace constraint on t1 is kept, as it reflects ideal energy compaction for ideal motion.
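A minimal Python sketch of the subspace constraint follows. It assumes only what the text states, namely that t1 is the unit vector collinear with c; the scale factors, the intensity, and the noise values are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def lowband_vector(c):
    """Unit vector collinear with the scale-factor vector c."""
    norm = math.sqrt(dot(c, c))
    return [ci / norm for ci in c]

c = [1.0, math.sqrt(2.0), 1.0]   # hypothetical scale factors, ||c|| = 2
x_prime = 5.0                    # hypothetical original intensity
t1 = lowband_vector(c)

# Ideal motion: x = x' * c, so all energy lands in y1.
x_ideal = [ci * x_prime for ci in c]
y1_ideal = dot(t1, x_ideal)
left_ideal = dot(x_ideal, x_ideal) - y1_ideal ** 2

# Non-ideal motion: some energy remains for the highband coefficients.
x_noisy = [x_ideal[0] + 0.5, x_ideal[1], x_ideal[2] - 0.25]
y1_noisy = dot(t1, x_noisy)
left_noisy = dot(x_noisy, x_noisy) - y1_noisy ** 2
```

Here `left_ideal` is zero up to rounding, while `left_noisy` is strictly positive, which is the energy that the highband coefficients have to represent.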
Next, the highband vectors are constructed. The highband vectors need to be orthogonal to t1 and are not unique. For fractional-pel accurate motion compensation, an interpolation filter is used over several reference pixel values to better approximate the current target pixel value. Hence, one solution is to design tn based on the interpolation filter.
Consider the input x=[x1, x2, . . . , xn]T, where the first n−1 coefficients x1, x2, . . . , xn−1 are the integer-sample references for the target xn. The first n−1 coefficients can be viewed as the reference pixel values in the reference frame which are used to generate an interpolation value. Let an interpolation filter be h=[h1, h2, . . . , hn−1], where Σ_{i=1}^{n−1} h_i=1. The interpolated value is x̂n=Σ_{i=1}^{n−1} h_i x_i, and the approximation error between the interpolated value and the target is xn−x̂n=xn−Σ_{i=1}^{n−1} h_i x_i.
When using an orthonormal transform, the energy of the highband-to-be xn is expected to be removed as much as possible. In the transform, the last highband coefficient is given by the last basis vector. Thus, the interpolation filter is incorporated into the transform. A first approach is to form a basis vector as tn=[−h, 1]T. This generates a highband coefficient
y_n = t_n^T x = −Σ_{i=1}^{n−1} h_i x_i + x_n, (5)
which is consistent with the above-defined approximation error.
The motion-adaptive transforms consider scale factors by design. Assuming ideal motion, the input signal is expressed as x=[c1x′, c2x′, . . . , cnx′]T. To reuse this concept, we use the scale factors to adjust the coefficients of the interpolation filter. Then, the last basis vector tn is
t_n = [−h_1/c_1, . . . , −h_{n−1}/c_{n−1}, 1/c_n]^T, (6)
which can be normalized to t_n/∥t_n∥. For ideal motion, the highband coefficient yn=tnTx is zero, since the filter coefficients sum to one.
For non-ideal motion, the highband coefficient yn will reflect the approximation error.
Note that the basis vector tn is orthogonal to t1, as Σ_{i=1}^{n−1} h_i=1.
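The following Python sketch illustrates one way to build such a highband vector. It assumes that the scale-factor adjustment divides each filter tap by the scale factor of its reference, i.e., tn is proportional to [−h1/c1, . . . , −hn−1/cn−1, 1/cn]; this form is an assumption chosen so that tn is orthogonal to c, and the function name and numeric values are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def highband_vector(h, c):
    """Normalized highband vector from n-1 filter taps h (summing to 1)
    and n scale factors c (assumed form, see lead-in)."""
    raw = [-hi / ci for hi, ci in zip(h, c[:-1])] + [1.0 / c[-1]]
    norm = math.sqrt(dot(raw, raw))
    return [v / norm for v in raw]

h = [0.25, 0.75]        # hypothetical 2-tap interpolation filter
c = [1.0, 2.0, 1.5]     # hypothetical scale factors
t_n = highband_vector(h, c)

x_prime = 8.0
x_ideal = [ci * x_prime for ci in c]
y_ideal = dot(t_n, x_ideal)                  # zero under ideal motion

x_off = x_ideal[:-1] + [x_ideal[-1] + 3.0]   # target deviates from prediction
y_off = dot(t_n, x_off)                      # reflects the approximation error
```

Because the taps sum to one, `dot(t_n, c)` vanishes, so the vector is orthogonal to the lowband vector t1 regardless of the particular scale factors.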
For vertical or horizontal fractional-pel positions, the references are aligned in one dimension. The interpolation filter can be directly used to form tn. For non-vertical/horizontal fractional-pel positions, the references are distributed in two dimensions and tn cannot be obtained directly. For example, to interpolate a diagonal half-pel (HP) position, HEVC first uses the 8-tap interpolation filter along the rows to generate eight horizontal HP values and then uses the 8-tap filter again along the columns to filter these eight horizontal HP values to generate the final interpolated value. Thus, to obtain tn, we need to consider the interpolation filters in both dimensions.
Let hh be the p-tap horizontal interpolation filter and hv the q-tap vertical filter. Let H=hhThv be the filter coefficient matrix of size p×q and X the matrix of references of the same size. The interpolated value is x̂n=Σ_{ij} H_ij X_ij. Similar to the one-dimensional case, the highband coefficient is
y_n = x_n − x̂_n = x_n − Σ_{ij} H_ij X_ij = t_n^T x. (7)
Reshaping H and X into vectors gives
t_n = [−H_11, −H_12, . . . , −H_pq, 1]^T, (8)
x = [X_11, X_12, . . . , X_pq, x_n]^T. (9)
Again, tn is normalized to unit length, i.e., t_n/∥t_n∥.
Since tn is of dimension (pq+1)×1, this approach is not separable. When scale factors are used, an approach similar to (6) is necessary.
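As an illustration of the two-dimensional case, the sketch below forms the outer product of a horizontal and a vertical filter and reshapes it into the (unnormalized) vector of (8). The taps used are the HEVC luma half-sample filter; scale factors are assumed to be one, and the function name and the flat test block are hypothetical.

```python
# HEVC luma half-sample interpolation taps (to be divided by 64).
HEVC_HALF_PEL = [-1, 4, -11, 40, 40, -11, 4, -1]

def separable_highband(h_h, h_v):
    """Unnormalized t_n = [-H_11, ..., -H_pq, 1] with H the outer product."""
    H = [[a * b for b in h_v] for a in h_h]          # p x q coefficient matrix
    return [-v for row in H for v in row] + [1.0]    # row-major reshape

h = [t / 64.0 for t in HEVC_HALF_PEL]
t_n = separable_highband(h, h)

# A flat 8x8 reference block plus a target of the same intensity.
x = [100.0] * 64 + [100.0]
y_n = sum(t * v for t, v in zip(t_n, x))
```

For the flat block the highband coefficient `y_n` is zero, since the 64 entries of H sum to one. The resulting vector has pq+1 = 65 entries, which shows why the transform itself is no longer separable.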
In an n-dimensional space, two basis vectors are determined by t1 and tn. The remaining (n−2)-dimensional subspace is not unique for n>3. There are many ways to find a basis for the remaining subspace, e.g., decomposing the n-dimensional space using Gram-Schmidt or finding a certain matrix with its eigenvector matrix satisfying these constraints. Different approaches give different sets of t2, . . . , tn−1. One example is to use an approach based on Gram-Schmidt decomposition.
Let an n-dimensional space be spanned by orthonormal vectors f1, . . . , fn (fj∈Rn for j=1, . . . , n). We decompose this space for the given vectors t1 and tn using Gram-Schmidt orthonormalization. Let the projection of vector fj onto the vector ti be proj(fj, ti)=fjTti·ti. For fj, we find a vector that is orthogonal to t1, i.e., the orthogonal vector ej=fj−proj(fj,t1). By subtracting the projections proj(f1, t1), . . . ,proj(fn,t1), we reduce the n-dimensional space by one dimension. Since tn⊥t1, tn is a vector in the (n−1)-dimensional subspace. Again, reduce the dimensionality by subtracting the projections of e1, . . . , en onto tn. Then, we obtain an (n−2)-dimensional subspace. This subspace is orthogonal to both t1 and tn. The remaining basis vectors can be easily found within this subspace by using Gram-Schmidt, i.e.,
Equation (10) implies that ẽj is obtained by subtracting all the projected parts of fj−1 onto t1, . . . , tj−1 and tn. The basis vectors t1, . . . , tj−1 and tn have been orthogonalized in the previous steps. Hence, ẽj is guaranteed to be orthogonal to all the previously calculated basis vectors.
The advantage of the Gram-Schmidt algorithm is that the algorithm does not modify the set of vectors if the input set of vectors is already optimal. That is, if the input vectors are orthogonal and decorrelate the signal (i.e., the KLT basis), the algorithm outputs the same set of vectors. Assume that f1, . . . , fn are the KLT basis vectors and that t1=f1 and tn=fn. We need to find vectors that are orthogonal to f1 and fn. Since the KLT basis vectors are orthogonal to each other, we always obtain proj(fj−i, ti)=0 in (10), and thus, tj=fj. That is, the algorithm will not degrade the performance of an efficient initial orthogonal basis. In general, it is possible to choose an arbitrary set {f1, . . . , fn} for decomposition. Each will lead to a possible decomposition.
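The completion of the basis can be sketched as follows. This is one possible Gram-Schmidt variant (the projections are subtracted sequentially), not necessarily the exact recursion of (10); it assumes that t1 and tn are already orthonormal, and the function name `complete_basis` is hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def complete_basis(t1, tn, candidates, eps=1e-10):
    """Return [t1, t2, ..., t_{n-1}, tn], where the middle vectors are
    obtained from the candidate vectors f_j by Gram-Schmidt."""
    n = len(t1)
    middle = []
    for f in candidates:
        e = list(f)
        # Remove the components along all basis vectors fixed so far.
        for t in [t1, tn] + middle:
            p = dot(e, t)
            e = [ei - p * ti for ei, ti in zip(e, t)]
        norm = math.sqrt(dot(e, e))
        if norm > eps:               # skip candidates that are already spanned
            middle.append([ei / norm for ei in e])
        if len(middle) == n - 2:
            break
    return [t1] + middle + [tn]

# 3-dimensional example with the standard basis as candidate set.
s3, s6 = math.sqrt(3.0), math.sqrt(6.0)
t1 = [1 / s3, 1 / s3, 1 / s3]
tn = [-1 / s6, -1 / s6, 2 / s6]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
basis = complete_basis(t1, tn, identity)

# If the candidates already form the desired basis, they are returned unchanged.
e4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
unchanged = complete_basis(e4[0], e4[3], e4)
```

In the 3-dimensional example the remaining vector comes out as [1/√2, −1/√2, 0]; its sign depends on the candidate set. The second call illustrates the property noted above: an input set that is already orthogonal and contains t1 and tn is not modified.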
This embodiment describes the energy redistribution step of MAT. With this, the energy is redistributed from one coefficient to k (1≤k<n) coefficients.
The energy compaction process in Embodiment 2 compacts the energy into one coefficient and determines a highband coefficient. For fractional-pel MAT, there are two major challenges at this point. First, the transform in Embodiment 2 compacts the energy to only one coefficient. Since fractional-pel motion estimation refers to multiple references, the compacted energy needs to be shifted to other references. Second, since there will be multiple lowband coefficients, the scale factors associated with these lowband coefficients need to be determined.
The main concept of creating multiple lowband coefficients includes two steps: First, compact the energy of the input signal to one coefficient, and second, redistribute the energy from one coefficient to multiple coefficients. The energy should be conserved. Thus, the transforms in the two steps need to be orthonormal.
Consider x as the input and y the output of the energy-compacting transform. Assume y1 to be the lowband coefficient. The energy of y1 is redistributed to k energy-redistributed coefficients x̃k=[x̃1, . . . , x̃k]T. For fractional-pel accurate motion, k=n−1, i.e., the energy is redistributed to all the n−1 references. In general, 1≤k<n and k∈Z. Let Uk be the transform for energy redistribution. A k-dimensional orthonormal transform Uk is used to redistribute the energy to x̃k,
x̃_k = U_k^T y_k, (12)
where yk denotes the first k elements of y.
The energy compaction is given by y=TTx. As T is orthonormal, the inverse process of energy compaction is then x=Ty. Analogously, with T̃k=UkT, the energy redistribution can be written as
x̃_k = T̃_k y_k. (13)
To determine T̃k, the scale factors of x̃k are needed. Let T̃k=[t̃1, . . . , t̃k]. Similar to T, the lowband vector t̃1 needs to satisfy the subspace constraint determined by the scale factors c̃k of x̃k, i.e.,
t̃_1 = c̃_k/√(c̃_k^T c̃_k). (14)
Given c̃k, the matrix T̃k can be constructed using, e.g., Gram-Schmidt orthonormalization.
In conclusion, in the first step, TT compacts the energy of the input to one energy-compacted coefficient. In the second step, T̃k redistributes the compacted energy to k references for further processing. The final n-dimensional output is
[x̃_k^T, y_{k+1}, . . . , y_n]^T = [T̃_k, 0_{k×(n−k)}; 0_{(n−k)×k}, I_{n−k}] T^T x, (15)
where the semicolon separates the block rows, 0k×(n−k) and 0(n−k)×k are zero matrices, and In−k is the identity matrix. In the fractional-pel case, T̃k with k=n−1 can be viewed as a rotation around the nth basis vector. Constructing T̃k does not affect the highband vector tn.
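For k=2, the redistribution step can be sketched as below. The sketch assumes that the first column of the redistribution matrix is the normalized vector of updated scale factors and that the second column is simply chosen orthogonal to it; the function names and numeric values are hypothetical.

```python
import math

def redistribution_matrix_2d(c_tilde):
    """2x2 orthonormal matrix whose first column is c~/||c~||."""
    c1, c2 = c_tilde
    norm = math.hypot(c1, c2)
    return [[c1 / norm, -c2 / norm],
            [c2 / norm,  c1 / norm]]

def redistribute(y_k, c_tilde):
    """Apply x~_k = T~_k y_k for k = 2."""
    T = redistribution_matrix_2d(c_tilde)
    return [T[0][0] * y_k[0] + T[0][1] * y_k[1],
            T[1][0] * y_k[0] + T[1][1] * y_k[1]]

x_prime = 4.0                       # hypothetical original intensity
c_tilde = [math.sqrt(2.0), 1.0]     # hypothetical updated scale factors

# Under ideal motion the compaction step leaves y_k = [||c~|| x', 0].
y_k = [math.hypot(c_tilde[0], c_tilde[1]) * x_prime, 0.0]
x_tilde = redistribute(y_k, c_tilde)

energy_in = sum(v * v for v in y_k)
energy_out = sum(v * v for v in x_tilde)
```

The outputs equal c̃1x′ and c̃2x′, and `energy_in` matches `energy_out`, as required for an orthonormal redistribution.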
The scale factors c̃k are updated to track the energy of the lowbands. c̃k is related to the energy that is to be distributed to each lowband coefficient. One solution is to redistribute the lowband energy equally to the k coefficients, and thus, update the scale factors equally. Alternatively, since nearby references contribute more to the interpolated value according to the interpolation filter, it is reasonable to redistribute more energy to nearby references and less energy to faraway references.
A specific updating example is given below. Consider a simple 3-dimensional example with input x=[x1, x2, x3]T, where x1 and x2 are the references of x3. The energy of x3 is distributed to the two references x1 and x2, which become x̃1 and x̃2, respectively. Let the interpolation filter coefficients associated with x1 and x2 be h1 and h2, respectively. The energy is expected to be equally distributed to these two coefficients if there is no particular preference for any of the two, i.e., h1=h2. As shown in
A small
means that x1 contributes less to the interpolation; thus, it is reasonable to distribute less energy to x1, and vice versa. Then, s1 and s2 are
Now, consider the ideal motion assumption that the input is represented as x=[x1, x2, x3]T=[c1, c2, c3]Tx′, where x′ is the original pixel value and c1, c2, c3 are the scale factors. The energy of x3 is E=x3²=c3²x′². From (17), we obtain that
The energies of x̃1 and x̃2 are updated to
respectively. Let c̃1 and c̃2 be the scale factors of x̃1 and x̃2, respectively. Since Ẽ1=c̃1²x′² and Ẽ2=c̃2²x′², the scale factors are updated as
Note that the scale factors are only determined by the final energy. The intermediate variable yk discussed in (13) does not affect the update of scale factors.
In general, when the energy E is distributed to k references, E=Σ_{j=1}^{k} s_j² can be viewed as a hypersphere. Extending the line from the origin through (|h1|, . . . , |hk|) such that it intersects the hypersphere, we find the coordinates of the intersection point. Similar to (16), we have |s1|:|s2|: . . . :|sk|=|h1|:|h2|: . . . :|hk|, and we obtain that
Under the ideal motion assumption, the energy can be expressed as E=c²x′², where c is the scale factor associated with E. The scale factor ĉi of the ith reference is updated according to
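A possible reading of this update rule is sketched below in Python. It assumes that the energy shares satisfy sj² = E·hj²/Σi hi², which follows from the stated ratio |s1|: . . . :|sk|=|h1|: . . . :|hk| together with E=Σ sj², and that each share is added to the energy already held by the reference. The function name and inputs are hypothetical.

```python
import math

def update_scale_factors(c_refs, c_target, h):
    """Updated scale factors of the k references after absorbing the
    energy of a target with scale factor c_target (assumed rule)."""
    denom = sum(hj * hj for hj in h)
    return [math.sqrt(ci * ci + c_target * c_target * hi * hi / denom)
            for ci, hi in zip(c_refs, h)]

# Equal taps: the energy of the target is split evenly.
equal = update_scale_factors([1.0, 1.0], 1.0, [0.5, 0.5])

# Unequal taps: the reference with the larger tap receives more energy.
unequal = update_scale_factors([1.0, 1.0], 1.0, [0.25, 0.75])
total_energy = sum(v * v for v in unequal)   # stays at c1^2 + c2^2 + c^2
```

With equal taps both scale factors become √(3/2). With unequal taps the squared scale factors are 1.1 and 1.9, so the total of 3 is preserved while the reference with the larger filter tap receives the larger share.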
Here is a simple example to construct a half-pel MAT (HP-MAT) with two references. Let the input be x=[x1, x2, x3]T, where x1 and x2 are the two references for x3. Assume x1=x2=x3=x are the original intensity values associated with scale factors of one. Since there are only two references, let the interpolation filter be h=[½, ½]. Then, the transform T is a 3×3 matrix, and the basis vectors can be determined using (4) and (6), i.e.,
Then, decomposing a 3-dimensional space, the remaining vector is orthogonal to both t1 and tn as
With T, the energy of x is compacted to the first coefficient.
For energy redistribution, since there are only two references with equal filtering weights, the scale factors are updated according to (22) as
Then, the transform for redistribution is
It can be easily verified that T^T T=I and T̃_2^T T̃_2=I.
Thus, the MAT matrix according to (15) is
The last row of TMAT is given by t3, which is the highband vector determined by c and h as shown in (6). Applying TMAT to x=[x, x, x]T, we obtain the final output
The energy of x is compacted to two lowband references and the highband turns to zero.
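The numbers of this example can be checked with a short Python sketch. The matrices below are assembled from the stated constraints (t1 collinear with c=[1, 1, 1], t3 from h=[½, ½], the first column of the redistribution matrix collinear with equal updated scale factors); the sign of the middle vector is a free choice, and all variable names are hypothetical.

```python
import math

s2, s3, s6 = math.sqrt(2.0), math.sqrt(3.0), math.sqrt(6.0)

# Rows of T^T: lowband, remaining vector, filter-based highband.
T_t = [[1 / s3, 1 / s3, 1 / s3],
       [1 / s2, -1 / s2, 0.0],
       [-1 / s6, -1 / s6, 2 / s6]]

# Block-diagonal redistribution: a 2x2 orthonormal block and a 1 for the highband.
R = [[1 / s2, 1 / s2, 0.0],
     [1 / s2, -1 / s2, 0.0],
     [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

T_mat = matmul(R, T_t)

x = 7.0                                         # hypothetical intensity
out = [sum(T_mat[i][j] * x for j in range(3)) for i in range(3)]

# Orthonormality check: T_mat times its transpose should be the identity.
T_mat_t = [[T_mat[j][i] for j in range(3)] for i in range(3)]
gram = matmul(T_mat, T_mat_t)
```

The first two outputs equal √(3/2)·x and the third is zero, so the total energy 3x² is shared by the two references. The last row of `T_mat` is unchanged from the highband row of `T_t`.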
For half-pel (HP) accuracy, the half-pel MCOT (HP-MCOT) considers only two references, horizontally or vertically. HP-MCOT is a sequence of Euler rotations that rotate the signal step by step. Assume a 3-dimensional signal with scale factors c1, c2, c3, and let the scale factors after the update be c̃1, c̃2. We implement the energy compaction step as x2→x1, x3→x1, and the redistribution step as x1→x2. The transform matrix of the energy compaction step in HP-MCOT is
where ∥c12∥=√(c1²+c2²) and ∥c123∥=√(c1²+c2²+c3²). The energy redistribution step of HP-MCOT is the same as that of MAT, since it is a two-dimensional fixed matrix where one basis vector [c̃1, c̃2]T is given and the other vector is orthogonal to the given one.
It can be seen from (28) that if c1=c2=c3, H1 is the transpose of T in (24) up to sign differences. However, if the scale factors are not equal, the third row of H1 will be different from tn as discussed in (6). Then, HP-MCOT gives a different transform matrix than HP-MAT. In higher dimensions these two transforms are also different, since the HP-MAT has a highband vector determined by the interpolation filter, while HP-MCOT does not have such a vector.
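The difference between the two transforms can be made concrete with the following Python sketch. The rotation matrices are an assumed realization of the steps x2→x1 and x3→x1 (the sign convention is a guess), and the HP-MAT highband vector uses the assumed form proportional to [−h1/c1, −h2/c2, 1/c3]; all names are hypothetical.

```python
import math

def hp_mcot_compaction(c):
    """Product of two plane rotations compacting x2 and then x3 into x1."""
    c1, c2, c3 = c
    n12 = math.hypot(c1, c2)
    n123 = math.sqrt(c1 * c1 + c2 * c2 + c3 * c3)
    r1 = [[c1 / n12, c2 / n12, 0.0],       # rotation in the (x1, x2) plane
          [-c2 / n12, c1 / n12, 0.0],
          [0.0, 0.0, 1.0]]
    r2 = [[n12 / n123, 0.0, c3 / n123],    # rotation in the (x1, x3) plane
          [0.0, 1.0, 0.0],
          [-c3 / n123, 0.0, n12 / n123]]
    return [[sum(r2[i][k] * r1[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def hp_mat_highband(h, c):
    """Assumed HP-MAT highband vector, normalized."""
    raw = [-h[0] / c[0], -h[1] / c[1], 1.0 / c[2]]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

h = [0.5, 0.5]

# Equal scale factors: the last rows coincide.
row_eq = hp_mcot_compaction([1.0, 1.0, 1.0])[2]
t_eq = hp_mat_highband(h, [1.0, 1.0, 1.0])

# Unequal scale factors: the last rows point in different directions.
row_ne = hp_mcot_compaction([1.0, 2.0, 1.0])[2]
t_ne = hp_mat_highband(h, [1.0, 2.0, 1.0])
cosine = sum(a * b for a, b in zip(row_ne, t_ne))
```

For equal scale factors both vectors equal [−1, −1, 2]/√6. For c=[1, 2, 1] the rotation-based row is proportional to [−1, −2, 5], whereas the filter-based vector is proportional to [−2, −1, 4], so `cosine` is below one.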
This embodiment is a combination of Embodiment 2 and Embodiment 3.
The application of Embodiments 1-4 is not limited to scalable video coding or temporal transforms. They can be applied to other areas where energy compaction is needed. One example is their application in the spatial domain, where hierarchical spatial transforms are needed.
Number | Date | Country | Kind |
---|---|---|---|
62638851 | Mar 2018 | US | national |
This application claims priority, under 35 U.S.C. § 119(e), of U.S. Patent Application Ser. No. 62/638,851, entitled “Methods and Arrangements for Sub-Pel Motion-Adaptive Image Processing,” and filed on Mar. 5, 2018, which is fully incorporated herein by reference.