The present application relates generally to video encoding/decoding (codec) schemes and, more specifically, to a method and apparatus for a video codec scheme that supports decoding of video that has been encoded with minimal computation.
Current video coding technology has developed on the assumption that a high-complexity encoder in a broadcast tower would support millions of low-complexity decoders in receiving devices. However, with the proliferation of inexpensive camcorders and cellphones, User-Generated Content (UGC) will become commonplace, and there is a need for low-complexity video-encoding technology that can be deployed in these low-cost devices.
U.S. Pat. No. 7,233,269 B1 (Chen), US 2009/0225830 (He), US 2009/0122868 A1 (Chen) and US 2009/0323798 A1 (He) describe technologies that use Wyner-Ziv theory to shift the computationally complex motion-estimation block from the encoder to the decoder, thus reducing encoder complexity. Although these inventions reduce encoder complexity compared to the standardized codecs, their encoders still have relatively high complexity because they require transform-domain processing and quantization. Furthermore, Wyner-Ziv encoders usually require a feedback channel from the decoder to the encoder to determine the correct encoding rate. Such feedback channels are impractical for UGC creation. To avoid feedback channels, some Wyner-Ziv encoders, such as that of US 2009/0323798 A1 (He), use rate-estimation blocks. Unfortunately, these blocks also increase encoder complexity.
US 2009/0196513 A1 (Tian) and US 2010/0080473 A1 (Han) exploit compressive sampling to improve the coding performance of standardized encoders. Although compressive sampling theoretically enables low-complexity encoding of certain data sources, these inventions attempt to augment standardized encoders with a compressive-sampling block to increase compression ratios. Therefore, these implementations still have high complexity.
In “Compressive Coded Aperture Imaging,” SPIE Electronic Imaging, 2009 (Marcia et al.), compressive sampling is used to implement a low-complexity video encoder in which a hardware component directly converts video frames into a compressed set of measurements. To reconstruct the video frames, the decoder solves an optimization problem. However, because the decoder does not explicitly account for the motion of objects between video frames, this method achieves low compression ratios.
In “A Multiscale Framework for Compressive Sensing of Video,” Picture Coding Symposium (PCS 2009), Chicago, 2009 (Park et al.), compressive sampling is used for video encoding. This implementation does model object motion between video frames and hence provides higher compression ratios than that of Marcia et al. However, the implementation requires the encoder to compute the wavelet transform of each video frame. Hence, this implementation has relatively high complexity.
There exists a need for a low-complexity video encoder in which the encoder performs minimal computations. To achieve moderate compression ratios, the corresponding decoder must account for inter-frame object motion. Additionally, the encoder and decoder must function independently, without a feedback channel.
A method for encoding a video is provided. A first plurality of random measurements is taken for a first frame at an encoder. A subsequent plurality of random measurements is taken for each subsequent frame at the encoder such that the first plurality of random measurements is greater than each subsequent plurality of random measurements. Each plurality of random measurements is encoded into a bitstream.
An apparatus for encoding video is provided. The apparatus includes a compressive sampling (CS) unit and an entropy coder. The CS unit takes a first plurality of random measurements for a first frame, and takes a subsequent plurality of random measurements for each subsequent frame at the encoder. The first plurality of random measurements is greater than each subsequent plurality of random measurements. The entropy coder encodes each plurality of random measurements into a bitstream.
A method for decoding video is provided. An encoded bitstream, which includes a current input frame, is received at a decoder. A sparse recovery is performed on the current input frame to generate an initial version of a currently reconstructed frame based on the current input frame. At least one subsequent version of the currently reconstructed frame is generated based on a last version of the currently reconstructed frame. Each subsequent version of the currently reconstructed frame has a higher image quality than the last version of the currently reconstructed frame.
An apparatus for decoding video is provided. The apparatus includes a decoder and a controller. The decoder receives an encoded bitstream that includes a current input frame, generates an initial version of a currently reconstructed frame based on the current input frame, and generates at least one subsequent version of the currently reconstructed frame based on a last version of the currently reconstructed frame. Each subsequent version of the currently reconstructed frame has a higher image quality than the last version of the currently reconstructed frame. The controller determines how many subsequent versions of the currently reconstructed frame are to be generated. The decoder includes a sparse recovery unit that generates the initial version of the currently reconstructed frame by performing a sparse recovery on the current input frame.
Before undertaking the DETAILED DESCRIPTION OF THE INVENTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document: the terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation; the term “or” is inclusive, meaning and/or; the phrases “associated with” and “associated therewith,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like; and the term “controller” means any device, system or part thereof that controls at least one operation; such a device may be implemented in hardware, firmware or software, or some combination of at least two of the same. It should be noted that the functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. Definitions for certain words and phrases are provided throughout this patent document; those of ordinary skill in the art should understand that in many, if not most, instances such definitions apply to prior, as well as future, uses of such defined words and phrases.
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numerals represent like parts:
As noted above, to achieve moderate compression ratios, the corresponding decoder must account for inter-frame object motion, and the encoder and decoder must function independently, without a feedback channel. Embodiments of the present disclosure operate at approximately the “Desired Operating Point” shown in the accompanying figures.
In compressive sampling, the video frame 300 having N×N pixels may be converted to an N²×1 vector xN that is sampled using a random sensing matrix A (i.e. a measurement matrix) having a size of M×N² (i.e. matrix A has N² elements in each row and M rows, where M is smaller than N²). This may be mathematically represented as a matrix multiplication of the random sensing matrix A and the vector xN, which produces an M×1 vector y, according to Equation 1 below:
y=AxN [Eqn. 1]
The resulting product is the bitstream 320, which is an M×1 vector y. As M (the number of elements in the bitstream 320) is less than N² (the number of elements in the vector xN of the original image 300), compression is achieved through a very simple process. It is noted that the above is a mathematical description of the CS process, which is generally performed in the CS device 310. Some examples of devices that enable CS include a digital micromirror device (DMD) of a single-pixel encoder, Fourier optics in a Fourier-domain random convolution encoder, a Complementary Metal-Oxide-Semiconductor (CMOS) in a spatial-domain random convolution encoder, a vibrating coded-aperture mask of a coded-aperture encoder, a noiselet-basis encoder, and any other device that supports the taking of random measurements from images.
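The measurement of Equation 1 can be illustrated with a short sketch. The following Python fragment is illustrative only: a dense Gaussian random matrix stands in for the hardware measurement devices listed above, and the shared seed is one hypothetical way for the encoder and decoder to agree on the sensing matrix A without a feedback channel.

```python
import numpy as np

def cs_encode(frame, M, seed=0):
    """Take M random measurements of an N x N frame (Eqn. 1: y = A @ x_N).

    A dense Gaussian sensing matrix is a software stand-in for hardware
    measurement devices such as a DMD or a coded aperture.
    """
    x = frame.reshape(-1).astype(float)           # the N^2 x 1 vector x_N
    rng = np.random.default_rng(seed)             # seed shared with the decoder
    A = rng.standard_normal((M, x.size)) / np.sqrt(M)
    return A @ x                                  # the M x 1 measurement vector y

# Example: a 64x64 frame (N^2 = 4096) compressed to M = 1024 measurements.
frame = np.random.rand(64, 64)
y = cs_encode(frame, M=1024)
assert y.shape == (1024,)
```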
It is noted that Mp&lt;Mi, meaning that less compression was used for x0 than for the subsequent frames. In other words, the first video frame is encoded with more measurements, while the subsequent video frames are encoded with fewer measurements. This is because, during the decoding process, the first bitstream y0 does not have a reconstructed previous video frame that can be used as a reference for generating the reconstructed frame {circumflex over (x)}0, which approximates frame x0 based on y0 alone. That is, frame x0 is reconstructed independently based on the bitstream y0. In contrast, frame x1 can be reconstructed based on the bitstream y1 and the reconstructed previous frame {circumflex over (x)}0 to generate the reconstructed frame {circumflex over (x)}1. Similarly, frame x2 may be reconstructed based on the bitstream y2 and the reconstructed previous frame {circumflex over (x)}1 to generate the reconstructed frame {circumflex over (x)}2, and frame x3 may be reconstructed based on the bitstream y3 and the reconstructed previous frame {circumflex over (x)}2 to generate the reconstructed frame {circumflex over (x)}3, and so forth. As such, the bitstream y0 corresponds to the I-Frame, the first reference frame, which is to be decoded independently by a decoder. Bitstreams y1, y2, and y3 correspond to P-Frames, each of which is to be predicted from a reference frame (i.e. the reconstructed previous frame) by the decoder. According to an embodiment, motion information from the first frame (x0) may be used to improve estimates of the subsequent frames.
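A minimal sketch of this measurement allocation, reusing the cs_encode fragment from above, might look as follows; the values of Mi and Mp are arbitrary examples rather than values prescribed by the present disclosure.

```python
# Illustrative measurement budget: the I-frame x0 receives Mi measurements
# and each P-frame receives Mp < Mi measurements (arbitrary example values).
Mi, Mp = 2048, 512

def encode_sequence(frames):
    """Yield one measurement vector per frame: y0 serves as the I-Frame
    bitstream, and y1, y2, ... serve as P-Frame bitstreams."""
    for i, frame in enumerate(frames):
        M = Mi if i == 0 else Mp
        yield cs_encode(frame, M, seed=i)   # per-frame seed known to the decoder
```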
There are several ways to improve the CS encoding process.
As previously discussed, different types of encoding techniques such as single-pixel encoding, Fourier-domain random convolution encoding, spatial-domain random convolution encoding, coded-aperture encoding, and noiselet-basis encoding may be used in various embodiments of the present disclosure. In some situations, one or more types of encoding techniques may be available during an encoding process. According to an embodiment, the encoder may determine the optimal random measurements and measurement technique for a given video.
{circumflex over (x)}=arg min∥ΨT{circumflex over (x)}∥1 subject to A{circumflex over (x)}=y [Eqn. 2a]

{circumflex over (x)}=arg min α∥ΨT{circumflex over (x)}∥1+∥A{circumflex over (x)}−y∥2 [Eqn. 2b]

where Ψ denotes any suitable sparse-representation basis, {circumflex over (x)} denotes the estimate of the vector xN of the original image 300, y denotes the vector yM of the bitstream 600, and A denotes the random sensing matrix that was used to generate the bitstream 600. In Equation 2a, Ψ and y are known and are used to determine a best estimate {circumflex over (x)} that corresponds to y. A different Ψ may be used according to the type of video to optimize decoding. In Equation 2b, α controls the tradeoff between the sparsity term ∥ΨT{circumflex over (x)}∥1 and the data-consistency term ∥A{circumflex over (x)}−y∥2. α may be selected based on many different factors, including noise, signal structure, matrix values, and so forth. These optimization problems may be referred to as sparse solvers, which accept A, Ψ, and y as input and output the signal estimate {circumflex over (x)}. Equation 2a and Equation 2b may be solved via a convex solver or approximated with a greedy algorithm.
The equality-constrained problem of Equation 2a can be made equivalent to the unconstrained form of Equation 2b, but only in a very loose sense. Choosing a very small value of α would result in Equations 2a and 2b giving solutions that are very close to each other. The equality-constrained problem (also called basis pursuit) is usually used when there is substantially no noise in the measurements and the underlying signal enjoys a very sparse representation. However, if there is some noise in the measurements, or if for whatever reason the signal estimate does not match the measurements exactly (which will be the case if only a low-resolution image is estimated from the measurements of a full-resolution image), then the equality constraint AxN=y may be relaxed to something similar to ∥A{circumflex over (x)}−y∥2≤ε for some small value of ε (also called basis pursuit de-noising). The unconstrained form in the present disclosure is equivalent to basis pursuit de-noising. In short, the relaxed form is used when the measurement constraints cannot be satisfied exactly, and the constrained form is used otherwise.
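As a concrete, non-authoritative illustration, the unconstrained form can be approximated with the well-known iterative shrinkage-thresholding algorithm (ISTA). The sketch below assumes an orthonormal Ψ and uses the common squared-error variant of the data-consistency term; it is a minimal example rather than the particular solver contemplated by the present disclosure.

```python
import numpy as np

def soft(v, t):
    """Soft thresholding: the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, Psi, y, alpha, iters=200):
    """Approximate the unconstrained form (cf. Eqn. 2b) with ISTA.

    Minimizes 0.5*||A Psi c - y||_2^2 + alpha*||c||_1 over coefficients c,
    where x_hat = Psi c; Psi is assumed to have orthonormal columns, so
    the l1 term matches the sparsity term of Equation 2b.
    """
    B = A @ Psi                                   # sense in the coefficient domain
    step = 1.0 / np.linalg.norm(B, 2) ** 2        # 1/L, L = Lipschitz constant
    c = np.zeros(B.shape[1])
    for _ in range(iters):
        grad = B.T @ (B @ c - y)                  # gradient of the data term
        c = soft(c - step * grad, step * alpha)   # gradient step, then l1 prox
    return Psi @ c                                # estimate in the image domain
```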
{circumflex over (x)}0=arg min∥Ψ0T{circumflex over (x)}∥1+α∥A{circumflex over (x)}−y∥2 [Eqn. 3]

where Ψ0 denotes the wavelet basis restricted to resolution-0 wavelets, which are the wavelets corresponding to the lowest defined resolution. The subsequent-resolution wavelets can be estimated according to Equation 4 below:

{circumflex over (x)}k=arg min∥ΨkT{circumflex over (x)}∥1+αk∥A{circumflex over (x)}−y∥2 [Eqn. 4]
where Ψk denotes the wavelet basis restricted to the resolution-k wavelets, for k=1, 2, 3, . . . , corresponding to each subsequent estimation, and where αk may change with k. Because minimization is over basis subsets, the recovery is more robust. Multi-resolution implies spatial and complexity scalability. For example, the number of iterations may be set in the decoder by a user or may be preconfigured. Alternatively, decoding may be halted at an intermediate resolution in low-complexity devices that do not support high resolution. It is noted that Equation 4 does not recover the signal approximation at any scale exactly. Rather, the number of iterations may be used to reach a particular level of approximation/resolution. The sparse recovery block 710 may perform sparse recovery in a feedback loop such that the estimated vector {circumflex over (x)}N from a current iteration may be used as an input, along with the next Ψk, for the next iteration in the loop. A controller (not shown) may determine the number of iterations. Furthermore, the multi-resolution approach can exploit motion information efficiently. According to another embodiment, the constrained forms of Equations 3 and 4 may be used.
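The multi-resolution recovery of Equations 3 and 4 might be sketched as follows, reusing the ista fragment from above. Here scale_columns is a hypothetical helper that returns the indices of the wavelet basis vectors up to resolution k; it is not an element of the present disclosure.

```python
def multires_recover(A, Psi, y, alphas, scale_columns):
    """Multi-resolution sparse recovery sketch (cf. Eqns. 3 and 4).

    scale_columns(k) (hypothetical helper) returns the indices of all
    wavelet basis vectors of Psi up to resolution k.  Each pass admits
    one more wavelet scale; capping the number of passes trades
    resolution for decoder complexity, as described above.
    """
    x_hat = None
    for k, alpha_k in enumerate(alphas):      # alpha_k may change with k
        Psi_k = Psi[:, scale_columns(k)]      # basis restricted to scales <= k
        # A practical decoder would warm-start each pass from the previous
        # x_hat (or a motion-compensated reference); for brevity, ista()
        # restarts from zero here.
        x_hat = ista(A, Psi_k, y, alpha_k)
    return x_hat
```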
In block 820, {circumflex over (x)}128, a low-resolution version of the image (i.e. an image of any size for which there is no confidence in wavelet coefficients on scales finer than the 128×128 resolution), is reconstructed from the input vector yindex (i.e. the input bitstream) by solving an optimization problem that determines the sparsest lowest-resolution wavelets which agree with the measurements, according to Equation 4. According to an embodiment, a previously reconstructed frame at the lowest resolution (e.g. {circumflex over (x)}128prev) may be used to initiate the optimization search for the lowest-resolution version of the reconstructed frame (e.g. {circumflex over (x)}128). When process 800 is performed as a feedback loop, block 820 may be construed as the operation that initializes the loop. That is, the lowest-resolution version of the P-frame, {circumflex over (x)}128, is decoded without motion information.
According to an embodiment, Equations 3 and 4 may be “warm-started,” using the estimate of the previous frame or a lower-resolution estimate of the current frame. This can help expedite the iterative updates and restrict the search space for candidate solutions.
In block 824, motion is estimated against the lowest-resolution version of the previous reconstructed frame (e.g. {circumflex over (x)}128prev) to determine motion vectors. According to an embodiment, various types of motion estimation may be used, such as phase-based motion estimation using complex wavelets, optical flow, block-based motion estimation, or mesh-based motion estimation. In the present disclosure, any of these or other motion-estimation techniques may be used wherever the term “motion estimation” occurs. In block 826, the resultant motion vectors are used to motion compensate a next-higher-resolution version of the previous frame (e.g. {circumflex over (x)}256prev), and this motion-compensated frame (e.g. {circumflex over (x)}256mc) initiates the optimization search for the next-higher-resolution version of the reconstructed frame. According to an embodiment, however, the motion compensation may be performed on image estimates at full resolution (i.e. the final reconstructed version of the previous frame). As shown in blocks 830, 834, and 840, these operations may be repeated until the highest-resolution version of the frame consistent with the measurements is recovered (i.e. {circumflex over (x)}N). As already mentioned, the number of iterations may be configured by a user, predetermined, adjusted at run-time, and so forth. When the current frame is reconstructed, process 800 may then be performed again, using the versions of the recovered frame {circumflex over (x)}N at the various resolutions as the new reference frames, to recover the next incoming frame. As such, the versions of the reference frames that support the various resolutions may be stored in memory or a set of registers. When performed as a feedback loop, the operations described in blocks 824, 826, and 830 may be looped such that the output of block 830 and the corresponding resolution version of the previous frame are used as the inputs for the next iteration in the loop. A controller (not shown) may control the feedback loop and determine the number of iterations.
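The overall loop structure of process 800 might be sketched as follows. The helpers solve_sparse, estimate_motion, and motion_compensate are hypothetical stand-ins for the operations of blocks 820-840, and the block numbers in the comments map the sketch back to the description above.

```python
def decode_p_frame_multiscale(y, prev_pyramid, A, Psi_scales, solve_sparse,
                              estimate_motion, motion_compensate):
    """Sketch of the multi-scale P-frame recovery of process 800.

    prev_pyramid[k] holds the previous reconstructed frame at resolution
    level k; Psi_scales[k] is the wavelet basis restricted to scales <= k.
    All helper functions are hypothetical stand-ins.
    """
    # Block 820: lowest-resolution estimate, optionally warm-started from
    # the previous frame's lowest-resolution version (no motion information).
    x_hat = solve_sparse(A, Psi_scales[0], y, init=prev_pyramid[0])
    pyramid = [x_hat]
    for k in range(1, len(Psi_scales)):
        mv = estimate_motion(x_hat, prev_pyramid[k - 1])      # block 824
        x_mc = motion_compensate(prev_pyramid[k], mv)         # block 826
        x_hat = solve_sparse(A, Psi_scales[k], y, init=x_mc)  # blocks 830/834
        pyramid.append(x_hat)
    return x_hat, pyramid   # pyramid becomes the reference for the next frame
```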
It is noted that although the intermediate versions of the reconstructed frame (e.g. {circumflex over (x)}128) imply a resolution of 128×128, this is merely used in the present disclosure as an example and is not intended to limit the scope of the present disclosure. In fact, {circumflex over (x)}128 also does not necessarily refer to a resolution or the actual size of the image. Instead, the {circumflex over (x)}128 notation should be regarded as any image for which there is insufficient confidence in wavelet coefficients on finer scales beyond the specified resolution level (here, 128×128). According to an embodiment, measurements may be taken at full resolution/size (i.e. number of pixels). As such, each intermediate version of the reconstructed image may be construed as having full size (i.e. number of pixels) in the spatial domain; the term “resolution” denotes how many scales of the wavelets were used to reconstruct the image. This similarly applies to references to versions of the reconstructed frame (e.g. lowest resolution version, low-resolution version, high-resolution version, next higher resolution version, previous lower resolution version, and such). Moreover, this applies to all embodiments of the present disclosure.
In block 920, a sparse recovery is performed from the input vector yindex by solving the sparse recovery problem to estimate {circumflex over (x)}N according to Equation 2. When process 900 is performed as a feedback loop, block 920 may be construed as the operation for initializing the loop.
In block 924, motion is estimated against the previous reconstructed frame to determine motion vectors. According to an embodiment, the motion vectors are estimated using complex-wavelet phase-based motion estimation, traditional block-based or mesh-based motion estimation, or optical flow. Alternatively, the CS decoder may use any elaborate motion-estimation scheme because, unlike in conventional coders, motion estimation at the decoder incurs no cost in terms of communication overhead. In block 926, the motion vectors are used to compute a motion compensated frame mc(xNprev) from the reference frame (i.e. the previous reconstructed frame xNprev).
In block 928, a sensing matrix A is applied to the motion compensated frame mc(xNprev). The operation is similar to multiplying the sensing matrix A with the motion compensated frame mc(xNprev) to get A(mc(xNprev)). In block 929, Δy is calculated as the difference between the input vector yindex and A(mc(xNprev)) (i.e. the output of block 928).
In block 930, Δy is used to estimate the motion compensated residual Δx by solving a sparse recovery problem according to Equation 5 below:

Δx=arg min∥ΨTΔx∥1+α∥AΔx−Δy∥2 [Eqn. 5]
Referring back to Equation 1, the following relationship may be derived according to Equation 6:
Δy=yindex−A(mc(xNprev))≡A(xindex−mc(xNprev)) [Eqn. 6]
where xindex denotes the original image that was encoded at an encoder. According to Equation 7:
Δx=xindex−mc(xNprev) [Eqn. 7]
Therefore, in block 932, the new estimate for xindex may be calculated according to Equation 8:
{circumflex over (x)}index=mc(xNprev)+Δx [Eqn. 8]
where {circumflex over (x)}index denotes the new {circumflex over (x)}N. Blocks 934, 936, 938, and 939 perform substantially the same operations as blocks 924, 926, 928, and 929, with the difference being that they operate on the new {circumflex over (x)}N. In other words, the operations of blocks 924-930 may be repeated with each updated {circumflex over (x)}N any number of times such that, with each subsequent iteration, the reconstruction of the original image is improved. The number of iterations may be preconfigured or adjusted. A controller (not shown) may determine the number of iterations. The last {circumflex over (x)}N that is estimated may then be set as the reference frame (i.e. the previous frame) by the decoder to reconstruct the next incoming video frame using process 900.
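A compact sketch of process 900 follows, again using hypothetical helper functions for sparse recovery, motion estimation, and motion compensation; the iters argument plays the role of the controller that fixes the number of refinement passes.

```python
def decode_p_frame_residual(y, x_prev, A, Psi, solve_sparse,
                            estimate_motion, motion_compensate, iters=3):
    """Sketch of the iterative residual recovery of process 900.

    Follows Eqns. 6-8: measure the motion-compensated reference through A,
    recover the sparse residual from the measurement difference, and add
    the residual back.  Helper functions are hypothetical stand-ins for
    blocks 920-939.
    """
    x_hat = solve_sparse(A, Psi, y)              # block 920: initial estimate
    for _ in range(iters):
        mv = estimate_motion(x_hat, x_prev)      # block 924
        x_mc = motion_compensate(x_prev, mv)     # block 926
        dy = y - A @ x_mc                        # blocks 928-929 (Eqn. 6)
        dx = solve_sparse(A, Psi, dy)            # block 930 (Eqn. 5)
        x_hat = x_mc + dx                        # block 932 (Eqn. 8)
    return x_hat                                 # reference for the next frame
```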
In block 1020, a low-resolution version of the image is reconstructed from the input vector yindex (i.e. the input bitstream) by solving an optimization problem that determines the sparsest lowest-resolution wavelets which agree with the measurements, according to Equation 4. When process 1000 is performed as a feedback loop, block 1020 may be construed as the operation that initializes the loop. That is, the lowest-resolution version of the P-frame, {circumflex over (x)}128, is decoded without motion information.
In block 1024, motion is estimated against the lowest-resolution version of the previous reconstructed frame (e.g. {circumflex over (x)}128prev) to determine motion vectors. In block 1026, the motion vectors are used to compute a motion compensated frame mc(x128prev) from the lowest-resolution version of the previous reconstructed frame {circumflex over (x)}128prev.
In block 1028, a sensing matrix A is applied to the motion compensated frame mc(x128prev). The operation is similar to multiplying the sensing matrix A with the motion compensated frame mc(x128prev) to get A(mc(x128prev)). As explained previously, this operation is well-defined because mc(x128prev) may be construed as having full-domain spatial size. In block 1029, Δy128 is calculated as the difference between the input vector yindex and A(mc(x128prev)) (i.e. the output of block 1028).
In block 1030, Δy128 is used to estimate the motion compensated residual at the next higher resolution (e.g. Δx256) by solving a sparse recovery problem according to Equation 5. In block 1031, the motion compensated frame mc(x128prev) is also upsampled to the next higher resolution. In block 1032, the new estimate at the next higher resolution (e.g. {circumflex over (x)}256) may be calculated according to Equation 8. As such, blocks 1024-1032 constitute one iteration for reconstructing the video frame.
Subsequent iterations (comprising the functions of blocks 1024-1032) reconstruct the images that support higher resolutions. A controller (not shown) may determine the number of iterations. As already discussed, the number of iterations may be configured by a user, predetermined, adjusted at run-time, and so forth. For example, in block 1031, the estimated image vector {circumflex over (x)}128 is upsampled (i.e. the size of the vector is increased by interleaving zeros and then interpolation filtering, or by wavelet-domain upsampling) to create a new image vector that can support a higher resolution (e.g. {circumflex over (x)}256). In an embodiment, a low-resolution image may be used for {circumflex over (x)}256 to reduce buffering costs. In such an embodiment, the upsampling in block 1031 creates the higher-resolution {circumflex over (x)}256 that is subsequently used by block 1032 for motion estimation. However, as previously discussed, the higher resolution does not necessarily indicate an increase in the spatial size of the image but, rather, an increase in the number of scales of the wavelets that were used to reconstruct the image. According to an embodiment, another upsample block may be added before each sensing matrix such that measurements at the sensing matrix are taken at full resolution (i.e. the number of pixels in the final image).
According to another embodiment, intermediate estimates may comprise full-spatial-size images that are reconstructed from wavelet approximations at different scales. According to yet another embodiment, in which buffering costs are not an issue, no upsampling blocks are required. In this embodiment, full resolution is maintained in all images, but the effective resolution is determined by the number of wavelet scales used for reconstruction. Therefore, for example, {circumflex over (x)}256 would use one more wavelet scale than {circumflex over (x)}128, although both of these images would have N×N pixels, where N is the maximum resolution and N may be larger than 256. Blocks 1034, 1036, 1038, and 1039 are substantially similar to blocks 1024, 1026, 1028, and 1029, respectively. Any number of iterations may be performed in a loop according to an embodiment until the highest-resolution version of the frame consistent with the measurements is recovered (i.e. {circumflex over (x)}N).
When the current frame is reconstructed, the decoder may set the versions of the recovered frame {circumflex over (x)}N at the various resolutions as the new reference frames to recover the next incoming frame using process 1000. As such, the versions of the reference frames at the various resolutions may be stored in memory or a set of registers. When performed as a feedback loop, the operations described in blocks 1024, 1026, 1028, 1029, 1030, and 1032 may be looped, with the estimated frame at each iteration being upsampled for the subsequent iteration, such that the output of block 1032 and the corresponding resolution version of the previous frame may be used as the inputs for the next iteration in the loop.
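One possible reading of process 1000, combining the resolution ladder of process 800 with the residual update of process 900, is sketched below. The upsample helper is a hypothetical stand-in for block 1031, the other helpers are the same hypothetical stand-ins used earlier, and the handling of intermediate sizes follows the full-spatial-size convention described above.

```python
def decode_p_frame_multiscale_residual(y, prev_pyramid, A, Psi_scales,
                                       solve_sparse, estimate_motion,
                                       motion_compensate, upsample):
    """Sketch of process 1000: residual recovery across resolutions.

    prev_pyramid[k] is the previous reconstructed frame at level k;
    upsample() stands in for block 1031 (zero interleaving followed by
    interpolation filtering, or wavelet-domain upsampling).
    """
    x_hat = solve_sparse(A, Psi_scales[0], y)              # block 1020
    pyramid = [x_hat]
    for k in range(1, len(Psi_scales)):
        mv = estimate_motion(x_hat, prev_pyramid[k - 1])   # block 1024
        x_mc = motion_compensate(prev_pyramid[k - 1], mv)  # block 1026
        dy = y - A @ x_mc                                  # blocks 1028-1029
        dx = solve_sparse(A, Psi_scales[k], dy)            # block 1030 (Eqn. 5)
        x_hat = upsample(x_mc) + dx                        # blocks 1031-1032 (Eqn. 8)
        pyramid.append(x_hat)
    return x_hat, pyramid   # references for decoding the next frame
```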
According to some embodiments, the encoding and decoding processes of the present disclosure may be performed in a transform domain.
While conventional recovery occurs iteratively in the wavelet domain under a spatial-domain constraint (e.g., see Equation 2a), with wavelet-domain measurements both the recovery and the constraint are in the wavelet domain, thus reducing decode time, according to Equation 9 below:

{circumflex over (λ)}=arg min∥{circumflex over (λ)}∥1+α∥A{circumflex over (λ)}−y∥2 [Eqn. 9]
where λ denotes the coefficients from the wavelet transform. The compression ratio will increase because random measurements of wavelet-domain frame differences have reduced entropy.
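A toy sketch of wavelet-domain measurement follows, reusing the ista fragment from above. The multi-level Haar transform is a deliberately simple stand-in for whatever wavelet the encoder would actually use; with the identity as the sparsity basis, both the recovery and the data constraint remain in the wavelet domain, in the spirit of Equation 9.

```python
import numpy as np

def haar(x, levels=5):
    """Multi-level orthonormal 1-D Haar transform (toy wavelet)."""
    lam = x.astype(float).copy()
    n = x.size
    for _ in range(levels):
        a = (lam[0:n:2] + lam[1:n:2]) / np.sqrt(2)   # approximation
        d = (lam[0:n:2] - lam[1:n:2]) / np.sqrt(2)   # detail
        lam[:n // 2], lam[n // 2:n] = a, d
        n //= 2
    return lam

# Encoder side: measure wavelet coefficients (or wavelet-domain frame
# differences) rather than pixels, so that y = A @ lam.
rng = np.random.default_rng(0)
x = np.repeat(rng.standard_normal(8), 32)    # piecewise-constant toy signal
lam = haar(x)                                # only ~8 of 256 coefficients nonzero
A = rng.standard_normal((64, lam.size)) / 8.0
y = A @ lam
# Decoder side: recover lam directly, with the identity as sparsity basis;
# the recovery never leaves the wavelet domain.
lam_hat = ista(A, np.eye(lam.size), y, alpha=0.01)
```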
For all embodiments disclosed, the analyticity of complex wavelet bases or overcomplete complex wavelet frames (or quaternion wavelet bases or overcomplete quaternion wavelet frames) may be exploited during the recovery process. Specifically, the complex wavelet transforms of real-world images are analytic functions with phase patterns that are predictable from local image structures. Examples of phase patterns may be found in “Signal Processing for Computer Vision,” by G. H. Granlund and H. Knutsson, Kluwer Academic Publishers, 1995. Therefore, the recovery process can be improved by imposing additional constraints on predicted phase patterns.
According to an embodiment, motion information may also be used in the wavelet domain. Normally, it is difficult to exploit motion information in the minimization of Equation 4 because wavelet bases Ψk are shift-variant and, hence, motion information is garbled. However, over-complete wavelet frames for Ψk are shift-invariant and, therefore, may be used such that motion information is made explicitly available using techniques such as phase-based motion estimation. In other embodiments, over-complete complex wavelet or over-complete quaternion frames may be used. Because minimization occurs in the decoder, the over-complete wavelet frame does not incur a compression penalty.
In some embodiments, the CS decoder may further be improved by parallelizing the decoding processes. For example, in processes 800 and 1000, the next frame may be processed while the estimate of the previous image is calculated at each increasing resolution level.
Decoder 1200, or any individual component thereof, may be implemented in one or more field-programmable gate arrays (FPGAs), in one or more application-specific integrated circuits (ASICs), or as software stored in a memory and executed by a processor or microcontroller. The CS decoder may be implemented in a television, monitor, computer display, portable display, or any other image/video decoding device.
The sparse recovery component 1210 solves the sparse recovery problem for an input vector, as discussed above.
According to an embodiment, components 1210-1250 may be integrated into a single component, or each component may be further divided into multiple sub-components. Furthermore, one or more of the components may not be included in a decoder, depending on the embodiment. For example, a decoder that reconstructs video using process 700 may not include the motion estimation & compensation component 1220 or the sensing matrix component 1230.
Although the present disclosure has been described with an exemplary embodiment, various changes and modifications may be suggested to one skilled in the art. It is intended that the present disclosure encompass such changes and modifications as fall within the scope of the appended claims.
The present application is related to U.S. Provisional Patent Application No. 61/377,360, filed Aug. 26, 2010, entitled “LOW COMPLEXITY VIDEO ENCODER (LoCVE)”. Provisional Patent Application No. 61/377,360 is assigned to the assignee of the present application and is hereby incorporated by reference into the present application as if fully set forth herein. The present application hereby claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application No. 61/377,360.
Number | Date | Country
61/377,360 | Aug 2010 | US