The following description relates generally to digital video coding, and more particularly to techniques for motion estimation using one or more reference frames of a temporal search range.
The evolution of computers and networking technologies from high-cost, low-performance data processing systems to low-cost, high-performance communication, problem solving, and entertainment systems has increased the need and desire for digitally storing and transmitting audio and video signals on computers or other electronic devices. For example, everyday computer users can play/record audio and video on personal computers. To facilitate this technology, audio/video signals can be encoded into one or more digital formats. Personal computers can be used to digitally encode signals from audio/video capture devices, such as video cameras, digital cameras, audio recorders, and the like. Additionally or alternatively, the devices themselves can encode the signals for storage on a digital medium. Digitally stored and encoded signals can be decoded for playback on the computer or other electronic device. Encoders/decoders can use a variety of formats to achieve digital archival, editing, and playback, including the Moving Picture Experts Group (MPEG) formats (MPEG-1, MPEG-2, MPEG-4, etc.), and the like.
Additionally, using these formats, the digital signals can be transmitted between devices over a computer network. For example, utilizing a computer and high-speed network, such as digital subscriber line (DSL), cable, T1/T3, etc., computer users can access and/or stream digital video content on systems across the world. Since the bandwidth available for such streaming is typically smaller than that of local access, and because processing power is ever-increasing at low cost, encoders/decoders often expend more processing during the encoding/decoding steps to decrease the amount of bandwidth required to transmit the signals.
Accordingly, encoding/decoding methods have been developed, such as motion estimation (ME), to provide pixel or region prediction based on a previous reference frame, thus reducing the amount of pixel/region information that must be transmitted across the bandwidth. Typically, this requires encoding of only a prediction error (e.g., a motion-compensated residue). Standards such as H.264 have been released to extend temporal search ranges to multiple previous reference frames (e.g., multiple reference frames motion estimation (MRFME)). However, as the number of frames utilized in MRFME increases, so does its computational complexity.
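For illustration, the core of block-based motion estimation described above can be sketched as follows. This is a minimal, illustrative Python sketch and not drawn from any particular codec; the `best_match` helper, the exhaustive search window, and the sum-of-absolute-differences (SAD) matching criterion are assumptions chosen for simplicity. An encoder searches a reference frame for the block that best matches the current block, then needs to encode only the motion vector and the motion-compensated residue.

```python
import numpy as np

def best_match(ref, block, top, left, search=4):
    """Exhaustive block-matching ME: find the displacement in `ref`
    whose candidate block has the smallest SAD versus `block`."""
    h, w = block.shape
    best_sad, best_mv = float("inf"), (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                continue  # candidate falls outside the reference frame
            sad = np.abs(ref[y:y+h, x:x+w].astype(int) - block.astype(int)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

# A moving 4x4 patch: the frame content shifts one pixel to the right.
ref = np.zeros((16, 16), dtype=np.uint8)
ref[4:8, 4:8] = 200
cur = np.roll(ref, 1, axis=1)  # current frame F

mv, sad = best_match(ref, cur[4:8, 5:9], 4, 5)
# Motion-compensated residue: only this (here, all-zero) error
# and the motion vector would need to be encoded.
residue = (cur[4:8, 5:9].astype(int)
           - ref[4 + mv[0]:8 + mv[0], 5 + mv[1]:9 + mv[1]].astype(int))
```

Because the patch simply translated by one pixel, the best motion vector recovers that motion exactly and the residue is zero, illustrating why transmitting only the prediction error can save bandwidth.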
The following presents a simplified summary in order to provide a basic understanding of some aspects described herein. This summary is not an extensive overview, nor is it intended to identify key/critical elements or to delineate the scope of the various aspects described herein. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
Variable frame motion estimation in video coding is provided where the gain of using single reference frame motion estimation (ME) or multiple reference frame motion estimation (MRFME), and/or a number of frames in MRFME can be determined. Where the gain meets or exceeds a desired threshold, the appropriate ME or MRFME can be utilized to predict a video block. The gain determination or calculation can be based on a linear model of motion-compensated residue over the evaluated reference frames. In this regard, performance gain of utilizing MRFME can be balanced with the computational complexity thereof to produce an efficient manner of estimating motion via MRFME.
For example, beginning with a first reference frame prior in time to the video block to be evaluated, if the motion-compensated residue of the reference frame, as compared to the video block, meets or exceeds a given gain threshold, MRFME can be performed, as opposed to regular ME. If motion-compensated residue of a subsequent reference frame, as compared to the previous reference frame, meets the same or another threshold, the MRFME can be performed with an additional reference frame, and so on until the gain of adding additional frames is no longer justified by the computational complexity of MRFME according to the given threshold.
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings. These aspects are indicative of various ways which can be practiced, all of which are intended to be covered herein. Other advantages and novel features may become apparent from the following detailed description when considered in conjunction with the drawings.
Efficient temporal search range prediction is provided for multiple reference frames motion estimation (MRFME) based on a linear model for motion-compensated residue. For example, gain of searching more or less reference frames in MRFME can be estimated by utilizing the current residue for a given region, pixel, or other portion of a frame. The temporal search range can be determined based on the estimation. Therefore, for a given portion of a frame, the advantage of using a number of previous reference frames for MRFME can be measured over the cost and complexity of MRFME. In this regard, MRFME can be utilized for portions having a gain over a given threshold when MRFME is used. Since MRFME can be computationally intensive (especially as the number of reference frames increases), it can be used over regular ME when it is advantageous according to the gain threshold.
In one example, the MRFME can be utilized over regular ME when the gain is at or above a threshold; however, in another example, the number of reference frames used in MRFME for a given portion can be adjusted based on a gain calculation of MRFME for the number of reference frames. The number of frames can be adjusted for a given portion to reach an optimal balance of computational intensity and accuracy or performance in encoding/decoding, for example. Moreover, the gain can relate to an average peak signal-to-noise ratio (PSNR) of MRFME (or a number of reference frames utilized in MRFME) relative to that of regular ME or a shorter temporal search range (e.g., a lesser number of reference frames utilized in MRFME), for example.
Various aspects of the subject disclosure are now described with reference to the annexed drawings, wherein like numerals refer to like or corresponding elements throughout. It should be understood, however, that the drawings and detailed description relating thereto are not intended to limit the claimed subject matter to the particular form disclosed. Rather, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the claimed subject matter.
Now turning to the figures,
By utilizing the H.264 coding standard, functionalities of the standard can be leveraged while increasing efficiency through aspects described herein. For example, the video coding component 104 can utilize the H.264 standard to select variable block sizes for motion estimation by the motion estimation component 102. Selecting the block sizes can be performed based on a configuration setting, an inferred performance gain of one block size over others, etc. Moreover, the H.264 standard can be used by the motion estimation component 102 to perform MRFME. In addition, the motion estimation component 102 can calculate gain of performing MRFME using a number of reference frames and/or performing regular ME (with one reference frame) for given blocks to determine motion estimation. As mentioned, MRFME can be computationally intensive as the number of reference frames utilized (e.g., temporal search range) increases, and sometimes such an increase in the number of frames used provides only a small benefit in predicting motion. Thus, the motion estimation component 102 can balance computational intensity of temporal search ranges in MRFME with accuracy and/or performance based on the gain, hereinafter referred to as MRFGain, to provide efficient motion estimation for a given block.
In one example, the MRFGain can be calculated by the motion estimation component 102 based at least in part on motion-compensated residue of a given block. As mentioned, this can be the prediction error for a given block based on the ME or MRFME chosen. For example, where the MRFGain for searching multiple reference frames of a video block is small, the process of utilizing the additional previous reference frames can yield a small performance improvement while incurring high computational complexity. In this regard, it can be more desirable to utilize a smaller temporal search range. Conversely, where the MRFGain of a video block is large (or beyond a certain threshold, for example), increasing the temporal search range can yield a benefit great enough to justify the increase in computational complexity; in this case, a larger temporal search range can be utilized. It is to be appreciated that the functionalities of the motion estimation component 102 and/or the video coding component 104 can be implemented in a variety of computers and/or electronic components.
In one example, the motion estimation component 102, video coding component 104, and/or the functionalities thereof, can be implemented in devices utilized in video editing and/or playback. Such devices can be utilized, in an example, in signal broadcasting technologies, storage technologies, conversational services (such as networking technologies, etc.), media streaming and/or messaging services, and the like, to provide efficient encoding/decoding of video to minimize bandwidth required for transmission. Thus, more emphasis can be placed on local processing power to accommodate lower bandwidth capabilities, in one example.
Referring to
As described above, the MRFGain calculation component 202 can calculate the MRFGain of shorter and longer temporal search ranges, which the motion estimation component 102 can then utilize in determining a balanced motion estimation considering the performance gain of the chosen estimation as well as its computational complexity. Moreover, as mentioned, the temporal search range can be chosen (and hence the MRFGain can be calculated) based at least in part on a linear model of motion-compensated residue (or prediction error) for a given block or frame.
For example, assuming F is the current frame or block for which video encoding is desired, previous frames can be denoted as {Ref (1), Ref (2), . . . Ref (k), . . . }, where k is the temporal distance between F and reference frame Ref (k). Thus, given a pixel s in F, p(k) can represent the prediction of s from Ref (k). Therefore, the motion-compensated residue, r(k), of s from Ref (k) can be r(k)=s−p(k). Moreover, r(k) can be a random variable with zero-mean and variance σr2(k). Additionally, r(k) can be decomposed as:
r(k)=rt(k)+rs(k),
where rt(k) can be the temporal innovation between F and Ref(k), and rs(k) can be the sub-integer pixel interpolation error in the reference frame Ref(k). Thus, representing the variances of rt(k) and rs(k) as σrt2(k) and σrs2(k), respectively, and assuming rt(k) and rs(k) to be uncorrelated, the residue variance can be decomposed as:
σr2(k)=σrt2(k)+σrs2(k).
As the temporal distance k increases, so does the temporal innovation between the current frame (e.g., F) and the reference frame (e.g., Ref(k)). Therefore, it can be assumed that σrt2(k) increases substantially linearly with k:
σrt2(k)=Ct*k,
where Ct is the increasing rate of σrt2(k) with respect to k. Moreover, the interpolation error variance σrs2(k) can be assumed substantially constant across the reference frames, such that σrs2(k)=Cs. Combining these terms yields the linear model:
σr2(k)=Cs+Ct*k.
Using this linear model, the MRFGain calculation component 202 can determine the MRFGain of utilizing ME, or one or more reference frames from the reference frame component 204 for MRFME, for a given frame or video block in the following manner. A block residue energy can be defined as the sum of squared residues over the N pixels of a block, R(k)=N*σr2(k)=N*(Cs+Ct*k). Subsequently, the change in residue energy from searching one additional reference frame can be expressed as Δ(k)=R(k+1)−R(k)=Δt(k)−Δs(k), where Δt(k) denotes the increase in residue energy due to the larger temporal innovation of Ref(k+1), and Δs(k) denotes the reduction in residue energy available where the interpolation error can be avoided or reduced in Ref(k+1).
In this case, if Δt(k)<Δs(k), Δ(k) would be negative, which can mean that searching one more reference frame Ref(k+1) from the reference frame component 204 results in smaller residue energy, and therefore, improved coding performance by the video coding component 104. Furthermore, for large Δs(k) and small Δt(k), large residue energy reduction, and thus large MRFGain, can be achieved by utilizing an additional reference frame in the motion estimation.
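As one way to make the linear model concrete, consider the following illustrative Python sketch. The least-squares fit is an assumed estimation strategy (the exact estimation formulas are not reproduced above), but it shows how Cs and Ct could, in principle, be recovered from the residue variances observed on the reference frames already searched:

```python
import numpy as np

def fit_linear_model(ks, variances):
    """Least-squares fit of the model sigma_r^2(k) = Cs + Ct*k to
    residue variances observed at temporal distances `ks`."""
    A = np.vstack([np.ones(len(ks)), np.asarray(ks, dtype=float)]).T
    (cs, ct), *_ = np.linalg.lstsq(A, np.asarray(variances, dtype=float),
                                   rcond=None)
    return cs, ct

# Synthetic residue variances that follow the model with Cs=10, Ct=2.
ks = [1, 2, 3, 4]
var = [10 + 2 * k for k in ks]
cs, ct = fit_linear_model(ks, var)
```

With Cs and Ct in hand, the residue energy of searching one more reference frame can be predicted from R(k+1) ≈ N*(Cs + Ct*(k+1)) and compared against R(k) as described above.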
In this example, the values of Δs(k) and Δt(k) are related to the parameters of the linear model provided supra (e.g., Cs and Ct). Parameter Cs can represent the interpolation error variance σrs2(k), and parameter Ct can represent the increasing rate of the temporal innovation variance σrt2(k).
In an example, once the MRFGain has been determined by the MRFGain calculation component 202, the following temporal search range prediction can be used for blocks or frames in the video. It is to be appreciated that other range predictions can be utilized with the MRFGain; this is just one example to facilitate explanation of using the gain calculation. Assuming MRFME is performed in a time-reverse manner where Ref(1) is the first reference frame to be searched, the estimations of MRFGain, G, can vary for different Ref(k) (e.g., k>1 vs. k=1). For example, assuming the current reference frame is Ref(k) (k>1), and the temporal search on this frame is complete, to determine if the next reference frame Ref(k+1) should be searched, Cs and Ct can be estimated from the information made available in searching the previous reference frames.
If the current reference frame is Ref(1) (k=1), however, the MRFGain G can be estimated using a tuning factor γ, where factor γ is tuned from training data. In some examples, a fixed value of γ can be used (such as γ=6) for different sequences.
To determine whether the MRFGain is sufficient for a given reference frame utilization factor in MRFME, the value of G can be compared with a predefined threshold TG. If G is larger than TG (G>TG), it can be assumed that searching more reference frames will improve the performance, so ME can continue with Ref(k+1). However, if G≦TG, MRFME of the current block can terminate, and the rest of the reference frames will not be searched. It is to be appreciated that the higher the TG, the more computation is saved; the lower the TG, the smaller the performance drop. The MRFGain calculation component 202, or another component, can appropriately tune the threshold to achieve a desired performance/complexity balance.
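The threshold test described above can be sketched as a simple loop. In this illustrative Python sketch, `gain_fn` is a hypothetical stand-in for the MRFGain estimate G computed after searching Ref(k); it is not part of any actual implementation:

```python
def predict_search_range(gain_fn, t_g, max_frames=8):
    """Search reference frames in time-reverse order, extending the
    temporal search range only while the estimated MRFGain G of
    searching one more frame exceeds the threshold T_G."""
    k = 1
    while k < max_frames:
        g = gain_fn(k)  # estimated gain of also searching Ref(k+1)
        if g <= t_g:    # gain no longer justifies the added computation
            break
        k += 1
    return k            # number of reference frames to search

# Toy gain estimate that decays with temporal distance k.
frames = predict_search_range(lambda k: 1.0 / k, t_g=0.3)
```

Raising `t_g` in this sketch terminates the search earlier (saving computation), while lowering it extends the range (reducing the performance drop), mirroring the tuning trade-off described above.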
Turning now to
According to an example, the MRFGain calculation component 202 can determine MRFGain of one or more temporal search ranges of reference frames from reference frame component 204 based on the calculations shown supra. Additionally, the motion vector component 302 can also determine an optimal temporal search range for a video block in some cases. For example, for a reference frame Ref(k) related to a current frame F, the motion vector component 302 can attempt to locate a motion vector MV(k). If the best motion vector MV(k) found is an integer pixel motion vector, it can be assumed that the object in the video block has integer motion between Ref(k) and F. Since there is no sub-pixel interpolation error in this case, little further residue reduction can be expected from searching additional reference frames, and the temporal search can terminate at Ref(k).
According to this example, motion can be estimated in the following manner. For k=1 (first reference frame Ref(1)), motion estimation can be performed with respect to Ref(k), and MV(k) and the gain G can be determined using the formulas provided above. Additionally, the motion vector component 302 can find a best motion vector MV(k) in the reference frame for the video block. If G≦TG (TG being a threshold gain) or MV(k) is an integer pixel motion vector, motion estimation can terminate. If MV(k) is an integer pixel motion vector, it can be used to determine the temporal search range; otherwise, G≦TG and the temporal search range is simply the first reference frame. The video coding component 104 can utilize this information to encode the video block as described above.
However, if G>TG or MV(k) is not an integer pixel motion vector, the MRFGain calculation component 202 can move to the next frame, setting k=k+1. Motion estimation can be performed with respect to Ref(k), and again MV(k) and the gain G can be determined using the formulas provided supra.
Again, the motion vector component 302 can find a best motion vector MV(k) in the reference frame. If G>TG or MV(k) is not an integer pixel motion vector, the MRFGain calculation component 202 can move to the next frame, setting k=k+1, and repeat this step. If G≦TG or MV(k) is an integer pixel motion vector, MRFME of the current block can terminate. If MV(k) is an integer pixel motion vector, it can be used to determine the temporal search range; otherwise, G≦TG and the temporal search range is the number of frames evaluated. It is to be appreciated that a maximum number of frames can be configured for searching to achieve desired efficiency as well.
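The steps above can be put together as follows. In this illustrative Python sketch, `search_frame` is a hypothetical helper standing in for the combined motion search, gain calculation, and integer-pixel test that the above example attributes to components 202 and 302:

```python
def mrfme_search_range(search_frame, t_g, max_frames=8):
    """Time-reverse MRFME per the procedure above: stop as soon as the
    gain G falls to or below T_G, or the best motion vector found is an
    integer pixel motion vector. `search_frame(k)` is assumed to return
    (G, mv_is_integer) for reference frame Ref(k)."""
    for k in range(1, max_frames + 1):
        g, mv_is_integer = search_frame(k)
        if mv_is_integer or g <= t_g:
            return k   # temporal search range, in reference frames
    return max_frames  # configured cap on the number of frames searched

# Toy search results: the gain halves each frame, and an integer pixel
# motion vector is found at Ref(3), terminating the search there.
results = {1: (0.8, False), 2: (0.4, False), 3: (0.2, True)}
span = mrfme_search_range(lambda k: results[k], t_g=0.1)
```

Note how the integer-pixel test provides a second, independent early exit: even while G remains above the threshold, finding integer motion indicates no sub-pixel interpolation error remains to be reduced.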
Referring now to
In one example, the MRFGain calculation component 202 can determine a temporal search range for a given video block for motion estimation as described supra (e.g., using the reference frame component 204 to obtain reference frames and performing calculations to determine the gain). According to an example, the inference component 402 can be utilized to determine a desired threshold (such as TG from the examples above). The threshold can be inferred based at least in part on one or more of a video/block type, video/block size, video source, encoding format, encoding application, prospective decoding device, storage format or location, previous thresholds for similar videos/blocks or those having similar characteristics, desired performance statistics, available processing power, available bandwidth, and the like. Moreover, the inference component 402 can be utilized to infer a maximum reference frame count for MRFME based in part on previous frame counts, etc.
Moreover, the inference component 402 can be leveraged by the video coding component 104 to infer an encoding format utilizing motion estimation from the motion estimation component 102. Additionally, the inference component 402 can be used to infer a block-size to send to the motion estimation component 102 for estimation, which can be based on similar factors to those used to determine a threshold, such as encoding format/application, suspected decoding device or capabilities thereof, storage format and location, available resources, etc. The inference component 402 can also be utilized in determining location or other metrics regarding a motion vector, and the like.
The aforementioned systems, architectures and the like have been described with respect to interaction between several components. It should be appreciated that such systems and components can include those components or sub-components specified therein, some of the specified components or sub-components, and/or additional components. Sub-components could also be implemented as components communicatively coupled to other components rather than included within parent components. Further yet, one or more components and/or sub-components may be combined into a single component to provide aggregate functionality. Communication between systems, components and/or sub-components can be accomplished in accordance with either a push and/or pull model. The components may also interact with one or more other components not specifically described herein for the sake of brevity, but known by those of skill in the art.
Furthermore, as will be appreciated, various portions of the disclosed systems and methods may include or consist of artificial intelligence, machine learning, or knowledge or rule based components, sub-components, processes, means, methodologies, or mechanisms (e.g., support vector machines, neural networks, expert systems, Bayesian belief networks, fuzzy logic, data fusion engines, classifiers . . . ). Such components, inter alia, can automate certain mechanisms or processes performed thereby to make portions of the systems and methods more adaptive as well as efficient and intelligent, for instance by inferring actions based on contextual information. By way of example and not limitation, such mechanisms can be employed with respect to generation of materialized views and the like.
In view of the exemplary systems described supra, methodologies that may be implemented in accordance with the disclosed subject matter will be better appreciated with reference to the flow charts of
By comparing the residue energy for the current reference frame of the block and a prior reference frame, a performance decision can be made on whether or not to extend the temporal search range to include more prior reference frames for block prediction. At 606, it is determined if a gain measured from the residue energy levels for the current and previous frame(s) is more than (or, in one example, equal to) a threshold gain (e.g., configured, inferred, or otherwise predetermined). If so, at 608 the temporal search range can be extended for MRFME by adding additional reference frames. It is to be appreciated that the method can return to 602 to start again, and compare the residue level of a frame prior to the prior frame, and so on. If the gain measured from the residue energy levels is not higher than the threshold, then at 610 the current reference frame is used to predict the video block. Again, if the method had continued and added more than one additional prior reference frame, substantially all of the prior reference frames added could be used at 610 to predict the video block.
If, however, G does meet the threshold and the motion vector is not an integer pixel motion vector, then at 710, motion estimation can be performed on a next reference frame (e.g., a next prior reference frame). At 712, the gain of motion estimation with the next prior reference frame and the first reference frame can be determined as well as a best motion vector of the next prior reference frame. The gain can be determined using the formulas provided supra where the calculation is based at least in part on the gain received from using the first frame in motion estimation. At 714, if the gain, G, meets the threshold gain explained above and the motion vector is not an integer pixel motion vector, then an additional reference frame can be utilized in the MRFME continuing at 710. If, however, G does not meet the threshold or the motion vector is an integer pixel motion vector, then at 708, the video block prediction can complete using the reference frames. In this regard, complexity caused by MRFME will only be used where it will result in a desired performance gain.
As used herein, the terms “component,” “system” and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an instance, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computer and the computer can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
The word “exemplary” is used herein to mean serving as an example, instance or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Furthermore, examples are provided solely for purposes of clarity and understanding and are not meant to limit the subject innovation or relevant portion thereof in any manner. It is to be appreciated that a myriad of additional or alternate examples could have been presented, but have been omitted for purposes of brevity.
Furthermore, all or portions of the subject innovation may be implemented as a method, apparatus or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed innovation. The term “article of manufacture” as used herein is intended to encompass a computer program accessible from any computer-readable device or media. For example, computer readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD) . . . ), smart cards, and flash memory devices (e.g., card, stick, key drive . . . ). Additionally, it should be appreciated that a carrier wave can be employed to carry computer-readable electronic data such as those used in transmitting and receiving electronic mail or in accessing a network such as the Internet or a local area network (LAN). Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.
In order to provide a context for the various aspects of the disclosed subject matter,
With reference to
The system memory 816 includes volatile and nonvolatile memory. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 812, such as during start-up, is stored in nonvolatile memory. By way of illustration, and not limitation, non-volatile memory can include read only memory (ROM). Volatile memory includes random access memory (RAM), which can act as external cache memory to facilitate processing.
Computer 812 also includes removable/non-removable, volatile/non-volatile computer storage media.
The computer 812 also includes one or more interface components 826 that are communicatively coupled to the bus 818 and facilitate interaction with the computer 812. By way of example, the interface component 826 can be a port (e.g., serial, parallel, PCMCIA, USB, FireWire . . . ) or an interface card (e.g., sound, video, network . . . ) or the like. The interface component 826 can receive input and provide output (wired or wirelessly). For instance, input can be received from devices including but not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, camera, other computer and the like. Output can also be supplied by the computer 812 to output device(s) via interface component 826. Output devices can include displays (e.g., CRT, LCD, plasma . . . ), speakers, printers and other computers, among other things.
The system 900 includes a communication framework 950 that can be employed to facilitate communications between the client(s) 910 and the server(s) 930. Here, the client(s) 910 can correspond to program application components and the server(s) 930 can provide the functionality of the interface and optionally the storage system, as previously described. The client(s) 910 are operatively connected to one or more client data store(s) 960 that can be employed to store information local to the client(s) 910. Similarly, the server(s) 930 are operatively connected to one or more server data store(s) 940 that can be employed to store information local to the servers 930.
By way of example, one or more clients 910 can request media content, which can be a video for example, from the one or more servers 930 via communication framework 950. The servers 930 can encode the video using the functionalities described herein, such as ME or MRFME calculating gain of utilizing one or more reference frames to predict blocks of the video, and store the encoded content (including error predictions) in server data store(s) 940. Subsequently, the server(s) 930 can transmit the data to the client(s) 910 utilizing the communication framework 950, for example. The client(s) 910 can decode the data according to one or more formats, such as H.264, utilizing the error prediction information to decode frames of the media. Alternatively or additionally, the client(s) 910 can store a portion of the received content within client data store(s) 960.
What has been described above includes examples of aspects of the claimed subject matter. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but one of ordinary skill in the art may recognize that many further combinations and permutations of the disclosed subject matter are possible. Accordingly, the disclosed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the terms “includes,” “has” or “having” or variations in form thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.