The present disclosure relates generally to the field of computer vision technology. More specifically, the present disclosure relates to computer vision systems and methods for compositional pixel-level prediction.
A single image of a scene allows for a remarkable number of judgments to be made about the underlying world. For example, by looking at an image, a person can easily infer what the image depicts, such as, a stack of blocks falling over, a human holding a pullup bar, etc. While these inferences showcase humans' ability to understand what is, even more remarkable is their capability to predict what will occur. For example, looking at an image of stacked blocks falling over, a person can predict how the blocks will topple. Similarly, looking at a human holding the pullup bar, a person can predict that the human will lift his torso while keeping his hands in place.
Computer vision systems are capable of modeling multiple objects in physical systems. These systems use the relationships between objects and can predict trajectories over a long time horizon. However, these approaches typically model deterministic processes under simple visual (or often only state-based) input, while often relying on observed sequences instead of a single frame. Although some systems take raw images as input, they only make state predictions, not pixel-space predictions. Further, existing approaches apply variants of graph neural networks (“GNNs”) for future prediction, which are restricted to predefined state-spaces as opposed to pixels, and do not account for uncertainties using latent variables.
Therefore, there is a need for computer vision systems and methods capable of predicting future motions, movements, and events from a single image of a scene and at a pixel-level. These and other needs are addressed by the computer vision systems and methods of the present disclosure.
The present disclosure relates to computer vision systems and methods for compositional pixel-level prediction. The system processes input images fed into a pixel-level prediction engine to generate one or more sets of output images. For example, the input image can show three blocks falling over, and the pixel-level prediction engine would predict, in the output images, how the blocks fall. To generate a prediction, the system uses an entity predictor module to model how at least one entity present in an input image changes over time. The system then uses a frame decoder module to infer pixels by retaining the properties of each entity in the input image and resolving conflicts (e.g., occlusions) when composing the image. Next, the system accounts for a fundamental multi-modality in the task. Finally, the system uses a latent encoder module to predict a distribution over the latent variable $u$ using a target video.
The foregoing features of the invention will be apparent from the following Detailed Description of the Invention, taken in connection with the accompanying drawings, in which:
The present disclosure relates to computer vision systems and methods for compositional pixel-level prediction, as described in detail below in connection with the accompanying drawings.
Given an input image (which may be referred to as an input frame), along with known or detected locations of entities present in the input image, the system 10 predicts a sequence of future frames. Specifically, given a starting frame $f_0$ (e.g., an input image) and the locations of $N$ entities $\{b_0^n\}_{n=1}^{N}$, the system 10 generates $T$ future frames $f_1, f_2, \ldots, f_T$ (e.g., output images). By way of example, the pixel-level prediction engine 12 predicted output images 16a and 16b at 0.5 seconds from the input images 14a and 14b, and predicted output images 18a and 18b at 1.0 seconds from the input images 14a and 14b. It is noted that the system 10 is capable of processing a scene comprising multiple entities, thus accounting for different dynamics and interactions, and of handling the inherently multi-modal nature of the prediction task.
In step 24, the system 10 uses a frame decoder module to infer pixels by retaining the properties of each entity in the input image, respecting the predicted locations, and resolving conflicts (e.g., occlusions) when composing the image. In step 26, the system 10 accounts for a fundamental multi-modality in the task. Specifically, the system 10 uses a global random latent variable $u$ that implicitly captures ambiguities across an entire video. The latent variable $u$ deterministically (via a learned network) yields per time-step latent variables $z_t$, which aid the per time-step future predictions. Specifically, a predictor $P$ takes as input the per-entity representations $\{x_t^n\}$ along with the latent variable $z_t$, and predicts the entity representations at the next time step, $\{x_{t+1}^n\} \equiv P(\{x_t^n\}, z_t)$. The decoder $D$, using these predictions (and the initial frame $f_0$ to allow modeling of the background), composes a predicted frame.
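By way of a non-limiting illustration, the per time-step rollout described in steps 24 and 26 can be organized as in the following sketch. The use of a one-layer LSTM with $u$ as its cell state to yield the per time-step $z_t$ follows the description in this disclosure; the class names, tensor shapes, and other details are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class Rollout(nn.Module):
    """Iteratively applies the predictor P and decoder D to generate T future frames."""

    def __init__(self, predictor: nn.Module, decoder: nn.Module, latent_dim: int = 8):
        super().__init__()
        self.predictor = predictor        # P: ({x_t^n}, z_t) -> {x_{t+1}^n}
        self.decoder = decoder            # D: ({x_t^n}, f_0) -> f_t
        # A learned network (here a one-layer LSTM with u as its cell state)
        # deterministically yields the per time-step latent variables z_t.
        self.z_rnn = nn.LSTM(latent_dim, latent_dim, num_layers=1, batch_first=True)

    def forward(self, x0: torch.Tensor, f0: torch.Tensor, u: torch.Tensor, T: int):
        # x0: (B, N, D_ent) entity encodings at t=0; f0: (B, 3, H, W); u: (B, latent_dim)
        B, latent_dim = u.shape
        h0 = torch.zeros(1, B, latent_dim, device=u.device)
        c0 = u.unsqueeze(0)                                      # u as the cell state
        z_seq, _ = self.z_rnn(torch.zeros(B, T, latent_dim, device=u.device), (h0, c0))

        frames, x_t = [], x0
        for t in range(T):
            x_t = self.predictor(x_t, z_seq[:, t])               # next entity states
            frames.append(self.decoder(x_t, f0))                 # compose the frame
        return torch.stack(frames, dim=1)                        # (B, T, 3, H, W)
```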
In step 28, the system 10 uses a latent encoder module to predict a distribution over the latent variable $u$ using a target video. Specifically, the system 10 is trained to maximize the likelihood of the training sequences, comprising terms for both the frames and the entity locations. As is often the case when optimizing likelihood in models with unobserved latent variables, directly maximizing the likelihood is intractable, and therefore the system 10 maximizes a variational lower bound. The annotations of future frames/locations, as well as the latent encoder module, are used during training. During inference, the system 10 takes as input only a single frame along with the locations of the entities present, and generates multiple plausible future frames.
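As a sketch, such a variational lower bound takes the standard form shown below, where $q(u)$ denotes the distribution predicted by the latent encoder module described later in this disclosure and $p(u)$ denotes a unit Gaussian prior; the exact conditioning and weighting are a modeling choice.

\[
\log p\big(\{\hat f_t\}, \{\hat b_t^n\} \mid f_0\big) \;\ge\; \mathbb{E}_{q(u)}\Big[\log p\big(\{\hat f_t\}, \{\hat b_t^n\} \mid f_0, u\big)\Big] \;-\; \mathrm{KL}\big[\,q(u)\,\|\,p(u)\,\big]
\]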
The entity predictor module will now be discussed in detail. Given per-entity locations and implicit appearance features $\{x_t^n\}_{n=1}^{N} \equiv \{(b_t^n, a_t^n)\}_{n=1}^{N}$, the entity predictor module outputs the predictions for the next time step using the latent variable $z_t$. Iterative application of the entity predictor module therefore allows the system 10 to predict the future frames for the entire sequence using the encodings from the initial frame. To obtain the initial input to the entity predictor module (e.g., the entity encodings at the first time step $\{x_0^n\}_{n=1}^{N}$), the system 10 uses the known/detected entity locations $\{b_0^n\}$ and extracts the appearance features $\{a_0^n\}$ by applying a convolutional neural network (“CNN”) to the cropped region from the frame $f_0$. For example, the system 10 can use a standard ResNet-18 CNN.
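A minimal sketch of obtaining the initial entity encodings is shown below. The use of ResNet-18 on cropped regions, the appearance feature size ($|a| = 32$), and the fixed crop size ($d = 70$) follow the example parameters given later in this disclosure; the use of roi_align, the resized patch size, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

class EntityEncoder(nn.Module):
    """Extracts x_0^n = (b_0^n, a_0^n) from the initial frame and entity locations."""

    def __init__(self, feat_dim: int = 32, crop: int = 70):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)  # appearance head
        self.backbone = backbone
        self.crop = crop

    def forward(self, f0: torch.Tensor, centers: torch.Tensor):
        # f0: (B, 3, H, W); centers: (B, N, 2) entity centers (x, y) in pixels.
        B, N, _ = centers.shape
        half = self.crop / 2.0
        # Fixed-size boxes (batch_index, x1, y1, x2, y2) around each entity center.
        idx = torch.arange(B, dtype=f0.dtype, device=f0.device).repeat_interleave(N)
        c = centers.reshape(B * N, 2)
        rois = torch.cat([idx.unsqueeze(1), c - half, c + half], dim=1)
        patches = roi_align(f0, rois, output_size=(64, 64))      # (B*N, 3, 64, 64)
        a0 = self.backbone(patches).view(B, N, -1)               # appearance {a_0^n}
        return torch.cat([centers, a0], dim=-1)                  # x_0^n = (b_0^n, a_0^n)
```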
While the predictor $P$ infers per-entity features, the entity predictor module allows for interaction among these entities rather than predicting each of them independently (e.g., a block may or may not fall depending on the other ones around it). To enable this, the system 10 leverages a computer vision model in the graph neural network family, in particular one based on Interaction Networks, which take in a graph $G = (V, E)$ with associated features for each node and update these via iterative message passing and message aggregation. The predictor $P$ that infers $\{x_{t+1}^n\}$ from $(\{x_t^n\}, z_t)$ comprises four interaction blocks, where the first block takes as input the entity encodings concatenated with the latent feature: $\{x_t^n \oplus z_t\}_{n=1}^{N}$. Each of these blocks performs a message passing iteration using the underlying graph, and the final block outputs predictions for the entity features at the next time step, $\{x_{t+1}^n\}_{n=1}^{N} \equiv \{(b_{t+1}^n, a_{t+1}^n)\}_{n=1}^{N}$. This graph can either be fully connected, as in the synthetic data experiments, or more structured (e.g., a skeleton in the human video prediction experiments).
It should be noted that the entity predictor module can comprise Interaction Networks, a Graph Convolution Network (“GCN”), or any other network having subtle or substantial differences, both in architecture and application. For example, the entity predictor module can stack multiple interaction blocks for each time step rather than use a single interaction block to update node features. Additionally, the entity predictor module can use non-linear functions as messages for better performance, rather than use a predefined mechanism to compute edge weights and use linear operations for messages.
The frame decoder module will now be discussed in detail. The frame decoder module generates the pixels of the frame $f_t$ from a set of predicted entity representations. While the entity representations capture the moving aspects of the scene, the system 10 also incorporates the static background, and additionally uses the initial frame $f_0$ to do so. The decoder $D$ predicts $f_t \equiv D(\{x_t^n\}, f_0)$.
To account for the predicted locations of the entities when generating images, the frame decoder module decodes a normalized spatial representation for each entity and warps it to the image coordinates using the predicted 2D locations. To allow for occlusions among entities, the frame decoder module predicts an additional soft mask channel for each entity, where the values of the masks capture the visibility of the entities. Lastly, the frame decoder module overlays the (masked) spatial features predicted for the entities onto a canvas containing features from the initial frame $f_0$, and then predicts the future frame pixels using this composed feature.
Specifically, the frame decoder module denotes by $\phi_{bg}$ the spatial features predicted from the frame $f_0$ (using, for example, a CNN with an architecture similar to U-Net, a CNN originally developed for biomedical image segmentation). A decoding function maps each predicted entity representation $x_t^n$ to normalized spatial features and a soft mask, which are warped to the image coordinates using the predicted location $b_t^n$ to yield $(\phi^n, M^n)$ for each entity (Equation 1).
The warped mask and features $(\phi^n, M^n)$ for each entity are zero outside the predicted bounding box $b^n$, and the mask $M^n$ can further have variable values within this region. Using these independent background and entity features, the frame decoder module composes frame-level spatial features by combining them via a weighted average. Denoting by $M_{bg}$ a constant spatial mask (with value 0.1), the frame decoder module obtains the composed features as Equation 2, seen below:
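Assuming the masks serve as the weights in this average, with the constant background mask filling regions not claimed by any entity, the composed features take the following form:

\[
\phi_t \;=\; \frac{M_{bg} \odot \phi_{bg} \;+\; \sum_{n=1}^{N} M^{n} \odot \phi^{n}}{M_{bg} \;+\; \sum_{n=1}^{N} M^{n}} \qquad \text{(Equation 2)}
\]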
The composed features incorporate information from all entities at the appropriate spatial locations, allow for occlusions using the predicted masks, and incorporate the information from the background. The frame decoder module then decodes the pixels of the future frame from these composed features. The system 10 can select the spatial level at which the feature composition occurs: it can happen in feature space near the image resolution (late fusion), directly at the pixel level (where the variables all represent pixels), or alternatively at a lower resolution (mid/early fusion).
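A minimal sketch of this mask-weighted composition, assuming the weighted-average form of Equation 2 above, is shown below; the shapes and the constant background mask value (0.1) follow the surrounding text, and everything else is illustrative.

```python
import torch

def compose_features(phi_bg, phi_ent, masks, m_bg: float = 0.1):
    # phi_bg:  (B, C, H, W) background features predicted from f_0.
    # phi_ent: (B, N, C, H, W) warped per-entity features (zero outside each box).
    # masks:   (B, N, 1, H, W) warped soft visibility masks.
    weighted = (masks * phi_ent).sum(dim=1)            # occlusion-aware entity term
    total = masks.sum(dim=1) + m_bg                    # normalization weights
    return (m_bg * phi_bg + weighted) / total          # composed frame-level features
```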
The latent encoder module will now be discussed in detail. As discussed above, the system 10 is conditioned on the latent variable $u$, which in turn generates per time-step conditioning variables $z_t$ that are used in each prediction step.
During training, the system 10 maximizes the variational lower bound of the log-likelihood objective (rather than marginalizing the likelihood of the sequences over all possible values of the latent variable $u$). This is done by training the latent encoder module, which (during training) predicts a distribution over $u$ conditioned on a ground-truth video. For example, the system conditions on the first and last frame of the video (using, for example, a feed-forward neural network), where the predicted distribution is denoted by $q(u \mid f_0, \hat f_T)$. Given a particular $u$ sampled from this distribution, the system 10 recovers the $\{z_t\}$ via a one-layer long short-term memory (“LSTM”) network, which, using $u$ as the cell state, predicts the per time-step variables for the sequence.
It is noted that the training objective can be thought of as maximizing the log-likelihood of the ground-truth frame sequence $\{\hat f_t\}_{t=1}^{T}$. The system can further use training-time supervision for the locations of the entities, $\{\{\hat b_t^n\}_{n=1}^{N}\}_{t=1}^{T}$. While this objective has an interpretation of log-likelihood maximization, for simplicity, the system 10 can consider it as a loss $L$ composed of different terms, where the first term, $L_{pred}$, encourages the future frame and location predictions to match the ground truth. $L_{pred}$ can be expressed by Equation 3, seen below:
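One plausible form of Equation 3, assuming a per-frame reconstruction penalty together with a $\lambda_1$-weighted error on the entity locations (with $\lambda_1$ as in the training parameters listed later in this disclosure), and with the particular norms being a modeling choice, is:

\[
L_{pred} \;=\; \sum_{t=1}^{T} \big\| f_t - \hat f_t \big\|_1 \;+\; \lambda_1 \sum_{t=1}^{T} \sum_{n=1}^{N} \big\| b_t^n - \hat b_t^n \big\|_2^2 \qquad \text{(Equation 3)}
\]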
The second component corresponds to enforcing an information bottleneck on the latent variable distribution, expressed by Equation 4, below:
$L_{enc} = \mathrm{KL}\big[\, q(u) \,\|\, \mathcal{N}(0, I) \,\big]$ (Equation 4)
Lastly, to further ensure that the frame decoder module generates realistic composite frames, the system 10 includes an auto-encoding loss that enforces the system 10 to generate the correct frame when given entity representations $\{\hat x_t^n\}$ extracted from $\hat f_t$ (rather than the predicted ones) as input. This is expressed by Equation 5, below:
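One plausible form of Equation 5, assuming the decoder output for the ground-truth entity representations is penalized against the corresponding ground-truth frame, with the particular norm being a modeling choice, is:

\[
L_{dec} \;=\; \sum_{t=1}^{T} \big\| D\big(\{\hat x_t^n\},\, f_0\big) - \hat f_t \big\|_1 \qquad \text{(Equation 5)}
\]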
The system 10 determines the total loss as $L = L_{dec} + L_{pred} + \lambda_2 L_{enc}$, with the hyper-parameter $\lambda_2$ determining the trade-off between accurate prediction and the information bottleneck on the random variable.
Testing of the above systems and methods will now be discussed in greater detail. A goal of the testing is to show qualitative and quantitative results highlighting the benefits of the system 10 and its modules (the entity predictor module, the frame decoder module, and the latent encoder module). First, the testing validates the system 10 and methods of the present disclosure using a synthetic dataset comprised of stacked objects that fall over time (e.g., the Shapestacks dataset), and presents several ablations comparing each module with relevant baselines. The testing also presents qualitative and quantitative results on the Penn Action dataset, which comprises videos of humans performing various activities. The two datasets highlight the ability of the system 10 to work in different scenarios: one where the ‘entities’ correspond to distinct objects, and another where the ‘entities’ are the joints of the human body. In both settings, the testing evaluates the predicted entity locations using average mean squared error and the quality of generated frames using the Learned Perceptual Image Patch Similarity (“LPIPS”) metric.
Shapestacks is a synthetic dataset containing stacks of objects falling under gravity, with diverse blocks and configurations. The testing used a subset of this dataset containing three blocks. The blocks can be cubes, cylinders, or balls in different colors. The data is generated by simulating the given initial configurations in the advanced physics simulator MuJoCo for 16 steps. The testing used 1320 videos for training, 281 clips for validation, and 296 clips for testing. While the setting is deterministic under perfect state information (precise 3D position and pose, mass, friction, etc.), the prediction task is ambiguous given an image input.
The testing used the Shapestacks dataset to validate the different modules (e.g., the entity predictor module, the frame decoder module, and the latent encoder module) and the choices for the latent variables. A subtle detail in the evaluation is that, at inference, the prediction depends on a random variable $u$, and while only a single ground truth is observed, multiple predictions are possible. To account for this, the system 10 used the mean $u$ predicted by the latent encoder module. When ablating the choices of the latent representation itself, the testing drew K=100 prediction samples and reported the lowest error among them.
The testing showed that the entity predictor module's factorized prediction over per-entity locations and appearance, together with the relational reasoning enabled via GNNs, helps improve prediction. Specifically, the system 10 was compared against two alternate models: a) a No-Factor model and b) a No-Edge model. The No-Edge model does not allow for interactions among entities when predicting the future. The No-Factor model does not predict a per-entity appearance but simply outputs a global feature that is decoded to a foreground appearance and mask. The No-Factor model still takes as input (and outputs) the per-entity bounding boxes, although these are not used by the frame decoder module.
The comparison against the No-Factor model shows the benefits of composing different features for each entity while accounting for their predicted spatial locations. The testing also ablated whether this composition should occur directly at the pixel level or at some implicit feature level (early fusion, mid fusion, or late fusion). Across all of these ablations, the number of layers in the frame decoder module remained the same, and only the level at which features from entities are composed differed.
During testing, the choice of latent variables used in the pixel-level prediction engine was compared against alternatives that use a per time-step random variable $z_t$. Specifically, a No-Z baseline directly uses $u$ across all time steps, instead of predicting a per time-step $z_t$. In the Fixed Prior (“FP”) and Learned Prior (“LP”) baselines, the random variables are sampled per time step, either independently (as in FP) or depending on the previous prediction (as in LP). During training, both the FP and LP baseline models are trained using an encoder that predicts $z_t$ using the frames $f_t$ and $f_{t+1}$ (instead of using $f_0$ and $f_T$ to predict $u$, as in the system of the present disclosure).
To evaluate these different choices, given an initial frame from a test sequence, K=100 video predictions are sampled from each model, and the lowest error among these is measured (this allows evaluating all methods while using the same information, without penalizing diversity of predictions). The quantitative evaluations of these methods are shown in the graphs of the accompanying drawings.
Penn Action is a real video dataset of people playing various indoor and outdoor sports, with annotations of human joint locations. In an experiment, the system of the present disclosure was trained using Penn Action to generate video sequences of 1 second at 8 frames per second (“FPS”) given an initial frame. The Penn Action dataset features a) diverse backgrounds, b) noise in the annotations, and c) multiple activity classes with different dynamics.
The parameters of the present system used for this dataset are the same as those in the Shapestacks experiment, with the modification that the graph used for the interactions in the entity predictor module is based on the human skeleton, and is not fully connected. If a joint is missing in the video, the system instead links the edge to its parent joint. It is noted that while the graph depends on the skeleton, the interaction blocks are the same across each edge. The following subset of categories was used: bench press, clean and jerk, jumping jacks, pull up, push up, sit up, and squat; all are related to gym activities, because most videos in these classes do not have camera motion and their backgrounds are similar within each category. The categories are diverse in the scale of people, human poses, and view angles. The same scenes do not appear in both the training and testing sets, resulting in 290 clips for training and 291 for testing. To reduce overfitting, the present system augments data on the fly, including randomly selecting a starting frame for each clip, random spatial cropping, etc.
The architecture of the entity predictor module, the frame decoder module, and the latent encoder module will now be discussed. The entity predictor module leverages the graph neural network family, whose learning process can be abstracted as iterative message passing and message aggregation. In each round of message passing, each node (edge) representation is a parameterized function of its neighboring nodes and edges, and the parameters are updated by backpropagation. The architecture of the entity predictor module can be expressed by instantiating the message passing and aggregation operations as seen in Equations 6 and 7 below, where the $l$-th layer of message passing consists of two operations:
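One plausible instantiation of these two operations, consistent with the description that follows (concatenated node features as input to the node-to-edge function, and average pooling as the aggregation), is:

\[
e_{ij}^{(l)} \;=\; f_{v \to e}^{(l)}\big( v_i^{(l)} \oplus v_j^{(l)} \big) \qquad \text{(Equation 6)}
\]
\[
v_i^{(l+1)} \;=\; f_{e \to v}^{(l)}\Big( \tfrac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} e_{ij}^{(l)} \Big) \qquad \text{(Equation 7)}
\]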
The system 10 first performs a node-to-edge passing operation $f_{v \to e}^{(l)}$, where edge embeddings are implicitly learned. Then, the system 10 performs an edge-to-node operation $f_{e \to v}^{(l)}$ given the updated edge embeddings. The message passing block can be stacked to an arbitrary number of layers to perform multiple rounds of message passing between edges and nodes; for example, four blocks can be stacked. For each block, $f_{v \to e}^{(l)}$ and $f_{e \to v}^{(l)}$ are both implemented as a single fully connected layer. The aggregation operator is implemented as average pooling. The connections expressed in the edge set can come either from an explicitly specified graph, or from a fully connected graph when the relationships are not explicitly observed.
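A minimal sketch of one such message passing block is shown below. The single fully connected layers for $f_{v \to e}$ and $f_{e \to v}$ and the average-pooling aggregation follow the description above; the class name, shapes, and edge-index format are illustrative assumptions. When used as the entity predictor, the first block would take the entity encodings concatenated with $z_t$ as its node features.

```python
import torch
import torch.nn as nn

class MessagePassingBlock(nn.Module):
    """One node-to-edge / edge-to-node round; several (e.g., four) blocks can be stacked."""

    def __init__(self, node_dim: int, edge_dim: int):
        super().__init__()
        self.f_v2e = nn.Linear(2 * node_dim, edge_dim)           # f_{v->e}: one FC layer
        self.f_e2v = nn.Linear(node_dim + edge_dim, node_dim)    # f_{e->v}: one FC layer

    def forward(self, nodes: torch.Tensor, edge_index: torch.Tensor):
        # nodes: (B, N, node_dim); edge_index: (2, E) source/target indices, either
        # from an explicit graph (e.g., a skeleton) or a fully connected graph.
        B, N, _ = nodes.shape
        src, dst = edge_index
        # Node-to-edge: edge embeddings from the concatenated endpoint features.
        msgs = self.f_v2e(torch.cat([nodes[:, src], nodes[:, dst]], dim=-1))  # (B, E, edge_dim)
        # Aggregation: average pooling of incoming messages per target node.
        agg = torch.zeros(B, N, msgs.shape[-1], device=nodes.device).index_add(1, dst, msgs)
        deg = torch.zeros(N, device=nodes.device).index_add(
            0, dst, torch.ones_like(dst, dtype=nodes.dtype))
        agg = agg / deg.clamp(min=1).view(1, N, 1)
        # Edge-to-node: update node features from the aggregated edge embeddings.
        return self.f_e2v(torch.cat([nodes, agg], dim=-1))
```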
The frame decoder module uses a backbone of Cascaded Refinement Networks. Given features of shape (N, D, h0, w0), either from the entity predictor module or as background features, the frame decoder module upsamples the spatial resolution at the end of every unit. Each unit comprises Conv→BatchNorm→LeakyReLU. When the entity features are warped to image coordinates, the spatial transformation is implemented as a forward transformation to sharpen entities.
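A sketch of one such refinement unit is shown below; the Conv→BatchNorm→LeakyReLU structure and the upsampling at the end of every unit follow the description above, while the bilinear upsampling mode, the LeakyReLU slope, and the channel counts are illustrative assumptions. Units would be cascaded until the composed (h0, w0) features reach the output resolution.

```python
import torch.nn as nn

class RefinementUnit(nn.Module):
    """One decoder unit: Conv -> BatchNorm -> LeakyReLU, then upsample at the end."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        )

    def forward(self, x):
        # x: (N, in_ch, h, w) composed features; output is (N, out_ch, 2h, 2w).
        return self.block(x)
```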
At training, the latent encoder module takes in the concatenated features of two frames and applies a one-layer neural network to obtain a mean and a variance of $u$, where the system 10 resamples with reparameterization at training time. The resampled $u'$ is fed into a one-layer LSTM network as the cell unit to generate a sequence of $z'$. The system 10 optimizes the total loss with an Adam optimizer with a learning rate of 1e−4, $\lambda_1 = 100$, and $\lambda_2 =$ 1e−3. The dimensionality of the latent variables is 8, i.e., $|u| = |z_t| = 8$. The location feature is represented as the center of each entity, with $|b| = 2$, and the appearance feature has $|a| = 32$. The region of each entity is set to a fixed width and height large enough to cover the entity, such as, for example, $d = 70$. The generated frames are at a resolution of 224×224. It should be understood that these parameters are given by way of example, and that those skilled in the art would be able to use different parameters with the system of the present disclosure.
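A minimal sketch of the latent encoder with the reparameterized resampling described above is shown below. The latent dimensionality follows the example parameters ($|u| = |z_t| = 8$); the frame feature extraction, the log-variance parameterization, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Predicts a Gaussian over u from features of two frames and resamples u."""

    def __init__(self, frame_feat_dim: int, latent_dim: int = 8):
        super().__init__()
        # A one-layer network maps the concatenated frame features to mean and log-variance.
        self.to_gaussian = nn.Linear(2 * frame_feat_dim, 2 * latent_dim)

    def forward(self, feat_f0: torch.Tensor, feat_fT: torch.Tensor):
        stats = self.to_gaussian(torch.cat([feat_f0, feat_fT], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        # Reparameterization: u' = mu + sigma * eps, with eps ~ N(0, I).
        u = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # The resampled u' is then fed as the cell state of a one-layer LSTM to
        # generate {z_t}, and (mu, logvar) enter the KL term L_enc of Equation 4.
        return u, mu, logvar
```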
The functionality provided by the present disclosure could be provided by computer vision software code 106, which could be embodied as computer-readable program code stored on the storage device 104 and executed by the CPU 112 using any suitable, high or low level computing language, such as Python, Java, C, C++, C#, .NET, MATLAB, etc. The network interface 108 could include an Ethernet network interface device, a wireless network interface device, or any other suitable device which permits the server 102 to communicate via the network. The CPU 112 could include any suitable single-core or multiple-core microprocessor of any suitable architecture that is capable of implementing and running the computer vision software code 106 (e.g., Intel processor). The random access memory 114 could include any suitable, high-speed, random access memory typical of most modern computers, such as dynamic RAM (DRAM), etc.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art can make any variations and modification without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure.
This application claims priority to U.S. Provisional Patent Application Ser. No. 62/962,412 filed on Jan. 17, 2020 and U.S. Provisional Patent Application Ser. No. 62/993,800 filed on Mar. 24, 2020, each of which is hereby expressly incorporated by reference.