The present disclosure relates generally to video compression.
Embodiments of the present disclosure include systems and methods of compressing video using machine learning. In accordance with the technology described herein, a computer-implemented method for compressing a target video is disclosed. The computer-implemented method may be implemented in a computer system that may include one or more physical computer processors and non-transient electronic storage. The computer-implemented method may include obtaining, from the non-transient electronic storage, the target video. The computer-implemented method may include extracting, with the one or more physical computer processors, one or more frames from the target video. The one or more frames may include one or more of a key frame and a target frame. The computer-implemented method may also include generating, with the one or more physical computer processors, an estimated optical flow based on a displacement of pixels between the one or more frames.
In embodiments, the displacement of pixels may be between the key frame and the target frame.
In embodiments, the computer-implemented method may further include applying, with the one or more physical computer processors, the estimated optical flow to a trained optical flow model to generate a refined optical flow. The trained optical flow model may have been trained by using optical flow training data. The optical flow training data may include (i) optical flow data, (ii) a corresponding residual, (iii) a corresponding warped frame, and/or (iv) a corresponding target frame.
In embodiments, the computer-implemented method may further include generating, with the one or more physical computer processors, a warped target frame by applying the estimated optical flow to the key frame. The warped target frame may include a missing element not visible in the key frame. The computer-implemented method may also include identifying, with the one or more physical computer processors, the missing element in the warped target frame using supplemental information. The computer-implemented method may include synthesizing, with the one or more physical computer processors, the missing element from the warped target frame by applying the warped target frame to a trained interpolation model. The trained interpolation model may have been trained using interpolation training data. The interpolation training data may include (i) a user-defined value and/or (ii) multiple sets of frames. A given set of frames may include a previous training frame, a target training frame, and/or a subsequent training frame. The computer-implemented method may also include generating, with the one or more physical computer processors, a synthesized target frame.
In embodiments, the supplemental information may include one or more of a mask, the target frame, a given magnitude of a given estimated optical flow for a given object in the warped target frame, and/or a depth corresponding to the missing element.
In embodiments, identifying the missing element may include, based on the given magnitude of the given estimated optical flow of the given object, identifying, with the one or more physical computer processors, the given object as a foreground object when the magnitude reaches a threshold value. Identifying the missing element may also include identifying, with the one or more physical computer processors, the missing element in a background of the warped target frame using the displacement of the foreground object between the one or more frames.
In embodiments, identifying the missing element may include, based on a change of depth of an object between the one or more frames, identifying, with the one or more physical computer processors, the missing element using the estimated optical flow. Identifying the missing element may also include generating, with the one or more physical computer processors, an element to apply to the missing element. Identifying the missing element may include generating, with the one or more physical computer processors, a synthesized target frame.
In embodiments, the trained optical flow model and/or the trained interpolation model may include a convolutional neural network.
In embodiments, the computer-implemented method may further include encoding, with the one or more physical computer processors, the synthesized target frame. The computer-implemented method may include encoding, with the one or more physical computer processors, side information based on the encoded synthesized target frame. The side information may include one or more of the optical flow and/or a mask.
In accordance with additional aspects of the present disclosure, a system may include non-transient electronic storage and one or more physical computer processors. The one or more physical computer processors may be configured by machine-readable instructions to perform a number of operations. One operation may be to obtain, from the non-transient electronic storage, the target video. Another operation may be to extract, with the one or more physical computer processors, one or more frames from the target video. The one or more frames may include one or more of a key frame and/or a target frame. Yet another operation may be to generate, with the one or more physical computer processors, an estimated optical flow based on a displacement of pixels between the one or more frames.
In embodiments, another operation may be to apply, with the one or more physical computer processors, the estimated optical flow to a trained optical flow model to generate a refined optical flow. The trained optical flow model may have been trained by using optical flow training data. The optical flow training data may include (i) optical flow data, (ii) a corresponding residual, (iii) a corresponding warped frame, and/or (iv) a corresponding target frame.
In embodiments, another such operation may be to generate, with the one or more physical computer processors, a warped target frame by applying the estimated optical flow to the key frame. The warped target frame may include a missing element not visible in the key frame. Yet another such operation may be to identify, with the one or more physical computer processors, the missing element in the warped target frame using supplemental information. Another operation may be to synthesize, with the one or more physical computer processors, the missing element from the warped target frame by applying the warped target frame to a trained interpolation model. The trained interpolation model may have been trained using interpolation training data. The interpolation training data may include (i) a user-defined value and (ii) multiple sets of frames. A given set of frames may include a previous training frame, a target training frame, and/or a subsequent training frame. Another operation may be to generate, with the one or more physical computer processors, a synthesized target frame.
In embodiments, the supplemental information may include one or more of a mask, the target frame, a given magnitude of a given estimated optical flow for a given object in the warped target frame, and/or a depth corresponding to the missing element.
In embodiments, identifying the missing element may include, based on the given magnitude of the given estimated optical flow of the given object, identifying, with the one or more physical computer processors, the given object as a foreground object when the magnitude reaches a threshold value. Identifying the missing element may also include identifying, with the one or more physical computer processors, the missing element in a background of the warped target frame using the displacement of the foreground object between the one or more frames.
In embodiments, identifying the missing element may include, based on a change of depth of an object between the one or more frames, identifying, with the one or more physical computer processors, the missing element using the estimated optical flow. Identifying the missing element may also include generating, with the one or more physical computer processors, an element to apply to the missing element. Identifying the missing element may include generating, with the one or more physical computer processors, a synthesized target frame.
In embodiments, the trained optical flow model and/or the trained interpolation model may include a convolutional neural network.
In embodiments, another operation may be to encode, with the one or more physical computer processors, the synthesized target frame. Another operation may be to encode, with the one or more physical computer processors, side information based on the encoded synthesized target frame. The side information may include one or more of the optical flow and/or a mask.
In embodiments, the key frame may include one or more of a previous frame and/or a subsequent frame.
In embodiments, generating the estimated optical flow may include using, with the one or more physical computer processors, the previous frame and the target frame.
In accordance with additional aspects of the present disclosure, a non-transitory computer-readable medium may have executable instructions stored thereon that, when executed by one or more physical computer processors, cause the one or more physical computer processors to perform a number of operations. One operation may be to obtain, from the non-transient electronic storage, the target video. Another operation may be to extract, with the one or more physical computer processors, one or more frames from the target video. The one or more frames may include one or more of a key frame and a target frame. Yet another operation may be to generate, with the one or more physical computer processors, an estimated optical flow based on a displacement of pixels between the one or more frames.
Aspects of the present disclosure will be appreciated upon review of the detailed description of the various disclosed embodiments, described below, when taken in conjunction with the accompanying figures.
The figures, which are described in greater detail in the description and examples below, are provided for purposes of illustration only and merely depict typical or example embodiments of the disclosure. The figures are not intended to be exhaustive or to limit the disclosure to the precise form disclosed. It should also be understood that the disclosure may be practiced with modification or alteration, and that the disclosure may be limited only by the claims and the equivalents thereof.
The present disclosure relates to systems and methods for machine learning based video compression. For example, neural autoencoders have been applied to single image compression applications, but work applying machine learning (i.e., deep learning) to video compression has focused only on frame interpolation and its application to compression.
Embodiments disclosed herein are directed towards frame synthesis methods that include interpolation and extrapolation with multiple warping approaches, compression schemes that use intermediate frame interpolation results and/or compression schemes that employ correlation between images and related information, such as optical flow.
Video codecs used for video compression generally decompose video into a set of key frames encoded as single images, and a set of frames for which interpolation is used. In contrast, the present disclosure applies deep learning (e.g., neural networks) to encode, compress, and decode video. For example, the disclosed method may include interpolating frames using deep learning and applying various frame warping methods to correct image occlusions and/or other artifacts from using the optical flow. The method may use the deep learning algorithm to predict the interpolation result. Embodiments disclosed here may further apply forward warping to the interpolation to correlate flow maps and images for improved compression. In some embodiments, a video compression scheme may predict a current frame by encoding already available video frames, e.g., the current frame and one or more reference frames. This is comparable to video frame interpolation and extrapolation, with the difference that the predicted image is available at encoding time. Example video compression schemes may include motion estimation, image synthesis, and data encoding, as will be described herein.
In some embodiments, using available reference frames {ri | i ∈ 1 . . . n} (usually n=2), a new frame, or target frame, I, may be encoded. The reference frames may be selected to have some overlap with the content of I. Motion vector maps, or optical flow, may be estimated between the reference frames and the target frame. For example, a motion vector map may correspond to a 2d displacement of pixels from ri to I.
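By way of a non-limiting illustration, a motion vector map may be represented as a two-channel displacement field; the following sketch (the function name and the numpy representation are assumptions for illustration, not features of the disclosure) shows how such a field maps each pixel of ri to its end position in I:

```python
import numpy as np

def displaced_coordinates(flow: np.ndarray) -> np.ndarray:
    """Given a dense displacement field of shape (H, W, 2) holding (dx, dy)
    offsets from a reference frame r_i to the target frame I, return the
    (H, W, 2) end positions of every reference pixel, i.e. p' = p + F(p)."""
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]                     # pixel grid of r_i
    grid = np.stack([xs, ys], axis=-1).astype(np.float32)
    return grid + flow
```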
Frame synthesis may use the estimated optical flow to forward warp (e.g., from an earlier frame of the video to a later frame of the video) the reference frames ri and compute a prediction of the image to encode. The forward mapped image may be Wri→I.
Two types of frames may be used at encoding time: (1) the key frames, which rely entirely on single image compression, and (2) interpolated frames, which are the result of image synthesis. Encoding interpolated frames is more efficient because it takes advantage of the intermediate synthesis result Î. Any frame that is used as a reference frame must also encode the displacement map Fri→I from ri to I, which may be correlated to ri.
Optical Flow
Methods for estimating optical flow are disclosed herein. In some embodiments, for each reference frame ri, the 2d displacement for each pixel location may be predicted to match pixels from I.
A ground truth displacement map may be used to estimate optical flow. In this case, optical flow may be computed at encoding time, between the reference frame ri and the frame to encode I. This optical flow estimate may be encoded and transferred as part of the video data. In this example, the decoder only decodes the data to obtain the displacement map.
In some embodiments, the reference frames r1 and r2 are respectively situated before and after I. Assuming linear motion, optical flow can be estimated as:
Fr1→I=0.5*Fr1→r2+Rr1→I

where the term Rr1→I represents residual flow information that corrects for deviations from the linear motion assumption.
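By way of a non-limiting illustration, the linear-motion estimate above may be computed as follows; treating the residual term as an optional input is an assumption for illustration:

```python
import numpy as np
from typing import Optional

def estimate_flow_linear(flow_r1_to_r2: np.ndarray,
                         residual: Optional[np.ndarray] = None) -> np.ndarray:
    """Estimate Fr1->I under the linear-motion assumption,
    Fr1->I = 0.5 * Fr1->r2, optionally corrected by a residual Rr1->I."""
    estimate = 0.5 * flow_r1_to_r2
    if residual is not None:
        estimate = estimate + residual
    return estimate
```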
Some example embodiments include predicting multiple displacement maps. When predicting multiple displacement maps, the correlation between displacement maps may be used for better flow prediction and to reduce the size of the residual information needed. This is illustrated in the accompanying figures.
Frame Synthesis
Some examples of frame prediction include estimating a prediction from a single image. In the case where a single reference frame r1 is available, the motion field Fr1→I may be used to forward warp the reference frame and obtain an initial estimate Wr1→I. The resulting image may contain holes in regions occluded or not visible in r1. Using machine learning (e.g., a convolutional neural network), the missing parts may be synthesized and used to compute an approximation Î of I:

Î=Fs(Wr1→I)
Some example embodiments include a method for predicting residual motion from multiple images. Video compression may involve synthesis from a single frame using larger time intervals. These images may then be used for predicting in-between short-range frames. The proposed synthesis algorithm can take an optional supplementary input when available. Embodiments of the present disclosure include warping one or more reference frames using optical flow and providing the warping results as input for synthesis.
Î=Fs(Wr1→I, . . . , Wrn→I)
Image Warping
In some embodiments, before using machine learning (e.g., a convolutional neural network) to synthesize the frame Î, the reference image may be warped using the estimated optical flow.
In some embodiments, a forward approach may be used. For example, a pixel p from the reference frame r1 will contribute to the 4 pixel locations around its end position in Î. In embodiments, for a pixel location q, the resulting color may be the weighted average

(Σp∈Sq ωp·r1(p))/(Σp∈Sq ωp)

where Sq is the set of pixels from r1 contributing to location q with weight ωp. Bilinear weights may be used, as illustrated in the accompanying figures.
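By way of a non-limiting illustration, the forward (splatting) warp with bilinear weights may be sketched as follows; the accumulation buffers, the normalization, and the (dx, dy) flow layout are implementation assumptions rather than requirements of the disclosure:

```python
import numpy as np

def forward_warp(ref: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Splat a reference frame (H, W, 3) along a flow field (H, W, 2).
    Each source pixel contributes to the 4 integer locations around its end
    position with bilinear weights; colors are normalized by the accumulated
    weight, and unreached locations remain zero (holes)."""
    h, w, _ = ref.shape
    acc = np.zeros((h, w, 3), dtype=np.float64)
    wgt = np.zeros((h, w), dtype=np.float64)
    ys, xs = np.mgrid[0:h, 0:w]
    ex = xs + flow[..., 0]                     # end positions p' = p + F(p)
    ey = ys + flow[..., 1]
    x0 = np.floor(ex).astype(int)
    y0 = np.floor(ey).astype(int)
    for dx in (0, 1):
        for dy in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            wb = (1.0 - np.abs(ex - xi)) * (1.0 - np.abs(ey - yi))
            valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h) & (wb > 0)
            np.add.at(acc, (yi[valid], xi[valid]),
                      ref[valid] * wb[valid][:, None])
            np.add.at(wgt, (yi[valid], xi[valid]), wb[valid])
    out = np.zeros_like(acc)
    filled = wgt > 0
    out[filled] = acc[filled] / wgt[filled][:, None]
    return out
```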
If an occlusion occurs between r1 and I, using all contributing pixels in the sets Sq will create ghosting artifacts (see the accompanying figures).
In some examples, filling in occlusions may be estimated from the image. Contrary to frame interpolation, during video coding the ground truth colors of destination pixels are available and can be used to build the sets Sq. The first element is the pixel p* whose color best matches the ground truth color at q:

p*=arg minp∈Aq∥r1(p)−I(q)∥

where Aq denotes the candidate pixels of r1 contributing to location q.
From this, Sq may be defined as the subset of pixels p ∈ Aq whose colors are consistent with r1(p*), so that contributions from a differently colored occluding surface are excluded.
In embodiments, the sets Sq need not be explicitly built. Instead, pixels p that are not used may be marked and ignored in the warping. A morphological operation may be used to smooth the resulting mask around the occlusion by consecutively applying opening and closing with a kernel size of about 5 pixels. It should be appreciated that other processes may be applied to smooth the mask. At decoding time, the same warping approach may be used, but the mask may be transmitted with the optical flow.
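By way of a non-limiting illustration, the mask smoothing described above may be performed with binary opening followed by closing and a kernel of about 5 pixels; the use of scipy here is an assumption for illustration only:

```python
import numpy as np
from scipy import ndimage

def smooth_occlusion_mask(mask: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Smooth a binary occlusion mask (True marks pixels ignored during
    warping) by consecutively applying morphological opening and closing."""
    structure = np.ones((kernel_size, kernel_size), dtype=bool)
    opened = ndimage.binary_opening(mask, structure=structure)
    return ndimage.binary_closing(opened, structure=structure)
```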
In some examples, locations and colors of occlusions may be estimated from displacement. The previous solution requires the use of a supplementary mask, which is also encoded. In the present approach, the magnitude of the optical flow may be used to resolve occlusions. For example, a large motion is more likely to correspond to foreground objects. In this case, the first element is the pixel p* with the largest flow magnitude:

p*=arg maxp∈Aq∥Fr1→I(p)∥
Sq is defined as the set of pixels p ∈ Aq satisfying:
∥Fr1→I(p)∥≥∥Fr1→I(p*)∥−ϵ
Where ϵ may represent a user-defined threshold (e.g., based on the statistics of background motion). In embodiments, additional filtering may be used.
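By way of a non-limiting illustration, the magnitude-based criterion above may be evaluated as follows; the function and argument names are assumptions for illustration:

```python
import numpy as np

def keep_by_flow_magnitude(flow_at_candidates: np.ndarray,
                           flow_at_p_star: np.ndarray,
                           eps: float) -> np.ndarray:
    """Return a boolean mask over candidate pixels: a pixel p is kept when
    ||Fr1->I(p)|| >= ||Fr1->I(p*)|| - eps, i.e. when its motion is comparable
    to the largest (likely foreground) motion contributing to the location."""
    mag = np.linalg.norm(flow_at_candidates, axis=-1)
    mag_star = np.linalg.norm(flow_at_p_star, axis=-1)
    return mag >= mag_star - eps
```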
In some examples, occlusion may be estimated from depth. Depth ordering may be estimated with a machine learning process (e.g., a convolutional neural network). For example, a depth map network may estimate depth maps from an image or one or more monocular image sequences. Training data for the depth map network may include image sequences, depth maps, stereo image sequences, monocular sequences, and/or other content. After training an initial depth map network using the training data, a trained depth map network may receive content, estimate a depth map for the content, and estimate occlusions based on the depth map. Occluded pixels are identified with a depth test and simply ignored during warping. With sufficient computation power, more precise depth information can also be obtained using multi-view geometry techniques.
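By way of a non-limiting illustration, the depth test mentioned above may be realized as a z-buffer style splat in which, of two source pixels landing on the same target location, only the nearer one is kept; this concrete formulation and the function name are assumptions for illustration:

```python
import numpy as np

def splat_with_depth_test(target_xy: np.ndarray,
                          colors: np.ndarray,
                          depths: np.ndarray,
                          height: int,
                          width: int) -> np.ndarray:
    """Warp source pixels to integer landing positions target_xy (N, 2) with
    colors (N, 3) and estimated depths (N,). A pixel landing on an already
    filled location is treated as occluded and ignored unless it is closer."""
    out = np.zeros((height, width, 3), dtype=np.float64)
    zbuf = np.full((height, width), np.inf)
    for (x, y), color, z in zip(target_xy, colors, depths):
        if 0 <= x < width and 0 <= y < height and z < zbuf[y, x]:
            zbuf[y, x] = z
            out[y, x] = color
    return out
```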
The warping techniques described herein are complementary and can be combined in different ways. For example, displacement and depth may be correlated. Many of the computations may be shared between the two modalities and obtaining depth represents a relatively minor increment in computation time. Occlusion may be estimated from the ground truth image. Deciding if the warping mask should be used may be based on the encoding cost comparison between the mask and the image residual after synthesis. In embodiments, these may be user selected based on the given application.
Synthesis Network
Still referring to the accompanying figures, the synthesis network may be implemented using machine learning (e.g., a convolutional neural network) that receives the warped reference frames, along with any optional supplementary input, and outputs the synthesized frame Î.
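By way of a non-limiting illustration, such a synthesis network may be sketched as a small convolutional neural network that concatenates the warped reference frames along the channel axis; the class name and layer sizes below are assumptions for illustration, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class SynthesisNet(nn.Module):
    """Illustrative stand-in for the synthesis function Fs: it consumes one or
    more warped reference frames and predicts the synthesized frame."""
    def __init__(self, num_refs: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * num_refs, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
        )

    def forward(self, warped_refs):
        x = torch.cat(warped_refs, dim=1)   # (N, 3 * num_refs, H, W)
        return self.net(x)
```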
Training depends on the application case. For example, for interpolation from two reference frames r1 and r2, the network may be trained to minimize the objective function L over the dataset D consisting of triplets of input images (r1, r2) and the corresponding ground truth interpolation frame, I:

L=Σ(r1, r2, I)∈D C(Î, I)

where Î is the frame synthesized from r1 and r2.
For the loss C, the 1-norm of pixel differences may be used, which may lead to sharper results than the 2-norm:

C(Î, I)=∥I−Î∥1  (10)
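By way of a non-limiting illustration, the per-sample loss of equation (10) and a corresponding training step may be written as follows, reusing the illustrative synthesis network above; the optimizer and data handling are assumptions for illustration:

```python
import torch

def l1_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-sample loss C(I_hat, I) = ||I - I_hat||_1 of equation (10)."""
    return (target - pred).abs().mean()

def train_step(model, optimizer, warped_r1, warped_r2, target):
    """One interpolation training step: synthesize the target frame from the
    two warped references and minimize the 1-norm of pixel differences."""
    optimizer.zero_grad()
    pred = model([warped_r1, warped_r2])     # I_hat = Fs(Wr1->I, Wr2->I)
    loss = l1_loss(pred, target)
    loss.backward()
    optimizer.step()
    return loss.item()
```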
Compression
In some embodiments, image compression may be implemented through a compression network. In the following, C and D denote compression and decoding functions, respectively.
In some embodiments, key frames, which are not interpolated, may be compressed using a single image compression method (see the accompanying figures). The compression network may be trained, for example, with the total loss:
L(I, I′)=R(I, I′)+γε(ỹ)  (11)
with ỹ=C(I) and I′=D(ỹ). The total loss takes into account the reconstruction loss R(I, I′) and the rate loss (the entropy ε(ỹ)). In some embodiments, example video compression techniques may be described in greater detail in U.S. patent application Ser. No. 16/254,475, which is incorporated by reference in its entirety herein.
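By way of a non-limiting illustration, the rate-distortion objective of equation (11) may be sketched as follows; the entropy estimate from latent likelihoods and the 1-norm reconstruction term are assumptions for illustration:

```python
import torch

def rate_distortion_loss(image: torch.Tensor,
                         recon: torch.Tensor,
                         latent_likelihoods: torch.Tensor,
                         gamma: float) -> torch.Tensor:
    """Total loss L(I, I') = R(I, I') + gamma * entropy(y~): a reconstruction
    term plus a rate term estimated from the likelihoods of the quantized
    latent representation y~."""
    reconstruction = (image - recon).abs().mean()               # R(I, I')
    num_pixels = image.shape[-1] * image.shape[-2]
    rate = -torch.log2(latent_likelihoods).sum() / num_pixels   # bits per pixel
    return reconstruction + gamma * rate
```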
In some examples, for predicted frames, the compression process may include multiple steps, e.g., interpolation and image coding, to make the process more efficient.
In one example, the residual information may be explicitly encoded relative to the interpolation result, or the network may be allowed to learn a better scheme. Training data for the network may be multiple videos. Training may include, for example, using a warped frame and generating multiple predictions of the warped frame. Residuals may be generated based on differences between the multiple predictions and the original frame. The residuals may be used to train the network to improve itself. In embodiments, the network may include a variational autoencoder including one or more convolutions, downscaling operations, upscaling operations, and/or other processes. It should be appreciated that other components may be used instead of, or in addition to, the network. In both cases, the network as illustrated in the accompanying figures may be used.
In some embodiments, the image and the side information may be encoded at the same time. In this case, image colors and side information may be concatenated along channels and the compression network may predict the same number of channels.
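By way of a non-limiting illustration, the joint encoding of image colors and side information may concatenate the inputs along the channel dimension as follows; the particular channel counts (3-channel image, 2-channel flow, 1-channel mask) are assumptions for illustration:

```python
import torch

def pack_image_and_side_info(image: torch.Tensor,
                             flow: torch.Tensor,
                             mask: torch.Tensor) -> torch.Tensor:
    """Concatenate image colors (N, 3, H, W), optical flow (N, 2, H, W), and a
    warping mask (N, 1, H, W) along channels into a single (N, 6, H, W) tensor
    that the compression network encodes and reconstructs jointly."""
    return torch.cat([image, flow, mask], dim=1)
```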
In one embodiment, optical flow and image compression may be combined in one forward pass, as illustrated in the accompanying figures.
Some embodiments of the present disclosure may be implemented using a convolutional neural network, as illustrated in the accompanying figures.
As used herein, the term component might describe a given unit of functionality that can be performed in accordance with one or more embodiments of the technology disclosed herein. As used herein, a component might be implemented utilizing any form of hardware, software, or a combination thereof. For example, one or more processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a component. In implementation, the various components described herein might be implemented as discrete components or the functions and features described can be shared in part or in total among one or more components. In other words, as would be apparent to one of ordinary skill in the art after reading this description, the various features and functionality described herein may be implemented in any given application and can be implemented in one or more separate or shared components in various combinations and permutations. As used herein, the term engine may describe a collection of components configured to perform one or more specific tasks. Even though various features or elements of functionality may be individually described or claimed as separate components or engines, one of ordinary skill in the art will understand that these features and functionality can be shared among one or more common software and hardware elements, and such description shall not require or imply that separate hardware or software components are used to implement such features or functionality.
Where engines, components, or components of the technology are implemented in whole or in part using software, in one embodiment, these software elements can be implemented to operate with a computing or processing component capable of carrying out the functionality described with respect thereto. One such example computing component is shown in the accompanying figures.
Referring now to the accompanying figures, an example computing component 1000 is described.
Computing component 1000 might include, for example, one or more processors, controllers, control components, or other processing devices, such as a processor 1004. Processor 1004 might be implemented using a general-purpose or special-purpose processing engine such as, for example, a microprocessor, controller, or other control logic. In the illustrated example, processor 1004 is connected to a bus 1002, although any communication medium can be used to facilitate interaction with other components of computing component 1000 or to communicate externally.
Computing component 1000 might also include one or more memory components, simply referred to herein as main memory 1008. For example, preferably random access memory (RAM) or other dynamic memory might be used for storing information and instructions to be executed by processor 1004. Main memory 1008 might also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1004. Computing component 1000 might likewise include a read only memory (“ROM”) or other static storage device coupled to bus 1002 for storing static information and instructions for processor 1004.
The computing component 1000 might also include one or more various forms of information storage device 1010, which might include, for example, a media drive 1012 and a storage unit interface 1020. The media drive 1012 might include a drive or other mechanism to support fixed or removable storage media 1014. For example, a hard disk drive, a floppy disk drive, a magnetic tape drive, an optical disk drive, a CD or DVD drive (R or RW), or other removable or fixed media drive might be provided. Accordingly, storage media 1014 might include, for example, a hard disk, a floppy disk, magnetic tape, cartridge, optical disk, a CD or DVD, or other fixed or removable medium that is read by, written to, or accessed by media drive 1012. As these examples illustrate, the storage media 1014 can include a computer usable storage medium having stored therein computer software or data.
In alternative embodiments, information storage mechanism 1010 might include other similar instrumentalities for allowing computer programs or other instructions or data to be loaded into computing component 1000. Such instrumentalities might include, for example, a fixed or removable storage unit 1022 and an interface 1020. Examples of such storage units 1022 and interfaces 1020 can include a program cartridge and cartridge interface, a removable memory (for example, a flash memory or other removable memory component) and memory slot, a PCMCIA slot and card, and other fixed or removable storage units 1022 and interfaces 1020 that allow software and data to be transferred from the storage unit 1022 to computing component 1000.
Computing component 1000 might also include a communications interface 1024. Communications interface 1024 might be used to allow software and data to be transferred between computing component 1000 and external devices. Examples of communications interface 1024 might include a modem or softmodem, a network interface (such as an Ethernet, network interface card, WiMedia, IEEE 802.XX, or other interface), a communications port (such as for example, a USB port, IR port, RS232 port, Bluetooth® interface, or other port), or other communications interface. Software and data transferred via communications interface 1024 might typically be carried on signals, which can be electronic, electromagnetic (which includes optical) or other signals capable of being exchanged by a given communications interface 1024. These signals might be provided to communications interface 1024 via a channel 1028. This channel 1028 might carry signals and might be implemented using a wired or wireless communication medium. Some examples of a channel might include a phone line, a cellular link, an RF link, an optical link, a network interface, a local or wide area network, and other wired or wireless communications channels.
In this document, the terms “computer program medium” and “computer usable medium” are used to generally refer to media such as, for example, memory 1008, storage unit 1020, media 1014, and channel 1028. These and other various forms of computer program media or computer usable media may be involved in carrying one or more sequences of one or more instructions to a processing device for execution. Such instructions, embodied on the medium, are generally referred to as “computer program code” or a “computer program product” (which may be grouped in the form of computer programs or other groupings). When executed, such instructions might enable the computing component 1000 to perform features or functions of the disclosed technology as discussed herein.
While various embodiments of the disclosed technology have been described above, it should be understood that they have been presented by way of example only, and not of limitation. Likewise, the various diagrams may depict an example architectural or other configuration for the disclosed technology, which is done to aid in understanding the features and functionality that can be included in the disclosed technology. The disclosed technology is not restricted to the illustrated example architectures or configurations, but the desired features can be implemented using a variety of alternative architectures and configurations. Indeed, it will be apparent to one of skill in the art how alternative functional, logical or physical partitioning and configurations can be implemented to implement the desired features of the technology disclosed herein. Also, a multitude of different constituent component names other than those depicted herein can be applied to the various partitions. Additionally, with regard to flow diagrams, operational descriptions and method claims, the order in which the steps are presented herein shall not mandate that various embodiments be implemented to perform the recited functionality in the same order unless the context dictates otherwise.
Although the disclosed technology is described above in terms of various exemplary embodiments and implementations, it should be understood that the various features, aspects and functionality described in one or more of the individual embodiments are not limited in their applicability to the particular embodiment with which they are described, but instead can be applied, alone or in various combinations, to one or more of the other embodiments of the disclosed technology, whether or not such embodiments are described and whether or not such features are presented as being a part of a described embodiment. Thus, the breadth and scope of the technology disclosed herein should not be limited by any of the above-described exemplary embodiments.
Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. As examples of the foregoing: the term “including” should be read as meaning “including, without limitation” or the like; the term “example” is used to provide exemplary instances of the item in discussion, not an exhaustive or limiting list thereof; the terms “a” or “an” should be read as meaning “at least one,” “one or more” or the like; and adjectives such as “conventional,” “traditional,” “normal,” “standard,” “known” and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. Likewise, where this document refers to technologies that would be apparent or known to one of ordinary skill in the art, such technologies encompass those apparent or known to the skilled artisan now or at any time in the future.
The presence of broadening words and phrases such as “one or more,” “at least,” “but not limited to” or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent. The use of the term “component” does not imply that the components or functionality described or claimed as part of the component are all configured in a common package. Indeed, any or all of the various components of a component, whether control logic or other components, can be combined in a single package or separately maintained and can further be distributed in multiple groupings or packages or across multiple locations.
Additionally, the various embodiments set forth herein are described in terms of exemplary block diagrams, flow charts, and other illustrations. As will become apparent to one of ordinary skill in the art after reading this document, the illustrated embodiments and their various alternatives can be implemented without confinement to the illustrated examples. For example, block diagrams and their accompanying description should not be construed as mandating a particular architecture or configuration.
The present application claims priority to U.S. Patent Application No. 62/717,470 filed on Aug. 10, 2018, which is incorporated herein by reference in its entirety.
Related U.S. Application Data:
Provisional Application No. 62/717,470, filed Aug. 2018 (US).
Parent Application Ser. No. 16/261,441, filed Jan. 2019 (US); Child Application Ser. No. 18/049,262 (US).