The present invention relates to scene prediction and, more particularly, to the prediction of future events within a dynamic scene.
The analysis of dynamic scenes (e.g., video feeds from security cameras or other changing scenes) may identify agents (such as, e.g., people or vehicles) and track their motion through the scene. The scene may include other elements (e.g., roads and crosswalks). Thus, in the example of monitoring a traffic scene, agents may be tracked across the scene elements.
A method for predicting a trajectory includes determining prediction samples for agents in a scene based on a past trajectory. The prediction samples are ranked according to a likelihood score that incorporates interactions between agents and semantic scene context. The prediction samples are iteratively refined using a regression function that accumulates scene context and agent interactions across iterations. A response activity is triggered when the prediction samples satisfy a predetermined condition.
A system for predicting a trajectory includes a prediction sample module configured to determine prediction samples for agents in a scene based on a past trajectory. A ranking/refinement module includes a processor configured to rank the prediction samples according to a likelihood score that incorporates interactions between agents and semantic scene context and to iteratively refine the prediction samples using a regression function that accumulates scene context and agent interactions across iterations. A response module is configured to trigger a response activity when the prediction samples satisfy a predetermined condition.
These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
Embodiments of the present principles predict behavior of agents in a dynamic scene. In particular, the present embodiments predict the locations of agents and the evolution of scene elements at future times using observations of the past states of the scene, for example in the form of agent trajectories and scene context derived from image-based features or other sensory data (if available).
The present embodiments thus adapt decision-making to make determinations about object and scene element types using image features. This stands in contrast to sensors like radar, which cannot generally resolve such features. The present embodiments furthermore handle the complex interactions between agents and scene elements in a scalable manner and account for the changes in behavior undertaken by agents in response to, or in anticipation of, the behavior of other agents.
To this end, the present embodiments formulate the prediction as an optimization that maximizes the potential future reward of the prediction. Because it is challenging to directly optimize the prediction, the present embodiments generate a diverse set of hypothetical predictions and then rank and refine those hypothetical predictions in an iterative fashion.
Referring now to
Thus, the present embodiments provide the ability to predict behavior in monitored scenes, aiding in early warning of dangerous or criminal activities, predicting dangerous conditions, providing steering information for autonomous vehicles, etc. It should be understood that, although the present embodiments are described in the context of the specific application of traffic scene analysis, these embodiments may be applied to any future-prediction task. The present embodiments provide scalability (because deep learning enables end-to-end training and easy incorporation with multiple cues from past motion, scene context, and agent interactions), diversity (the stochastic output is combined with an encoding of past observations to generate multiple prediction hypotheses that resolve the ambiguities and multimodalities of future prediction), and accuracy (long-term future rewards are accumulated for sampled trajectories and a deformation of the trajectory is learned to provide accurate predictions farther into the future).
Referring now to
RNNs are generalizations of feedforward neural networks to sequences. The power of RNNs for sequence-to-sequence modeling makes them useful for generating sequential future prediction outputs. The RNNs discussed herein use gated recurrent units (GRUs) rather than long short-term memory (LSTM) units, since the former are simpler but yield no degradation in performance. Contrary to existing RNN models, the present embodiments can predict trajectories of variable length.
Block 204 then determines the random prediction samples that are most likely to reflect future trajectories while incorporating scene context and interactions. Block 204 ranks the samples and refines them to incorporate contextual and interaction cues. Block 204 uses an RNN that is augmented with a fusion layer that incorporates interaction between agents and a convolutional neural network (CNN) that provides scene information. Block 204 uses training in a multi-task learning framework where the ranking objective is formulated using inverse optimal control (IOC) and the refinement objective is obtained by regression. In a testing phase, the ranking/refinement of block 204 is iterated to obtain more accurate refinements of the prediction of future trajectories.
Referring now to
Future prediction is inherently ambiguous, with uncertainties as to which of several plausible scenarios will result from a given trajectory. Following the example of
The present embodiments therefore create a deep generative model in the form of a conditional variational auto-encoder (CVAE). The CVAE can learn the distribution P(Yi|Xi) of the output Yi conditioned on the input trajectories Xi by introducing a stochastic latent variable zi. The CVAE uses multiple neural networks, including a recognition network Qϕ(zi|Yi,Xi), a prior network Pv(zi|Xi), and a generation network Pθ(Yi|Xi,zi), where θ, ϕ, and v denote parameters of each network.
The prior of the latent variables zi is modulated by the input Xi, but this can be relaxed to make the latent variables statistically independent of the input variables such that, e.g., Pv(zi|Xi)=Pv(zi). CVAE essentially introduces stochastic latent variables zi that are learned to encode a diverse set of predictions Yi given the input Xi, making it suitable for modeling the one-to-many mapping of predicting future trajectories.
During training, block 302 learns Qϕ(zi|Yi,Xi) such that the recognition network gives higher probability to values of zi that are likely to produce a reconstruction Ŷi that is close to actual predictions given the full context of training data for a given Xi and Yi. At test time, block 304 samples zi randomly from the prior distribution and decodes the latent variables through a decoder network to form a prediction hypothesis. This provides a probabilistic inference which serves to handle multi-modalities in the prediction space.
Block 302 encodes the training data, including a past trajectory Xi and a future Yi for an agent i using respective RNNs with separate sets of parameters. Encoding the training data converts the image to a vector representation. The resulting encodings, X
Upon successful training, the target distribution is learned in the latent variable zi, which provides random samples from a Gaussian distribution for the reconstruction of Yi in block 304. Since back-propagation is not possible through random sampling, reparameterization is used to make the model differentiable.
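As a non-limiting illustration, the reparameterization described above can be sketched as follows. The sketch assumes that the recognition network outputs a mean and a log-variance for each latent dimension; the names `reparameterize`, `mu`, and `log_var` are hypothetical and do not appear in the embodiments. The random draw is isolated in a noise term so that the sample is a differentiable function of the mean and the log-variance.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma * eps, elementwise, with eps drawn from N(0, 1).

    mu and log_var are assumed outputs of the recognition network
    (hypothetical names). The randomness lives only in eps, so z is a
    deterministic, differentiable function of mu and log_var.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)
z = reparameterize([0.5, -1.0], [0.0, -2.0], rng)

# With a vanishing variance the sample collapses onto the mean.
z_collapsed = reparameterize([1.0, 2.0], [-2000.0, -2000.0], rng)
```

At test time, under the same assumption, the latent variable would instead be drawn directly from the prior, with no recognition network involved.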
To model Pθ(Yi|Xi,zi), zi is combined with Xi as follows. The sampled latent variable zi is passed to one fully connected layer to match the dimension of X
There are two loss terms in training the CVAE-based RNN encoder/decoder in block 302. A reconstruction loss is defined as
The reconstruction loss measures how far the generated samples are from the actual ground truth. A Kullback-Leibler divergence loss is defined as lKLD=DKL(Qϕ(zi|Yi,Xi)∥P(zi)). The Kullback-Leibler divergence loss measures how close the sampling distribution at test-time is to the distribution of the latent variable learned during training (e.g., approximate inference).
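By way of example only, the Kullback-Leibler divergence loss has a closed form when two assumptions hold: the recognition network outputs a diagonal Gaussian (a mean and a log-variance per latent dimension), and the prior P(zi) is a standard normal distribution. Both are assumptions made for this sketch, and the function name is hypothetical.

```python
import math


def kld_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Assumes a diagonal Gaussian recognition distribution and a standard
    normal prior; both are illustrative assumptions.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


kld_zero = kld_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kld_shifted = kld_to_standard_normal([1.0], [0.0])
```

The divergence is zero when the recognition distribution equals the prior and grows as the mean shifts away from zero or the variance departs from one.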
At test time, block 304 does not have access to encodings of future trajectories, so the encodings of past trajectories X
Referring now to
Block 412 performs a softmax operation on the latent variable which is then combined with the encoded X input with the operation at block 414. An RNN decoder 416 then decodes the output of block 414 to produce a predicted future trajectory Ŷ.
Predicting a distant future trajectory can be significantly more challenging than predicting a trajectory into the near future. Reinforcement learning, where an agent is trained to choose its actions to maximize long-term rewards to achieve a goal, is used to help determine likelier trajectories. The present embodiments learn an unknown reward function with an RNN model that assigns rewards to each prediction hypothesis Ŷi(k) and attaches a score si(k) based on the accumulated long-term rewards. Block 204 further refines the prediction hypotheses by learning displacements ΔŶi(k) to the actual prediction through a fully connected layer.
Block 204 receives iterative feedback from regressed predictions and makes adjustments to produce increasingly accurate predictions. During the iterative refinement, past motion history through the embedding vector X, semantic scene context through a CNN with parameters ρ, and interaction among multiple agents using interaction features are combined.
The score s of an individual prediction hypothesis Ŷi(k) for an agent i on a kth sample is defined as:
where Ŷj\i∀ denotes the prediction samples of agents other than agent i, ŷi,t(k) is the kth prediction sample of an agent i at time t, Ŷτ<t∀ denotes the prediction samples before a time-step t, T is the maximum prediction length, and ψ(⋅) is a reward function that assigns a reward value at each time-step. The reward function ψ(⋅) may be implemented as a fully connected layer that is connected to the hidden vector at t of the RNN module.
The parameters of the fully connected layer are shared over all the time steps, such that the score s includes accumulated rewards over time, accounting for the entire future rewards being assigned to each hypothesis. The reward function ψ(⋅) includes both scene context as well as the interaction between agents through the past trajectories.
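As a non-limiting sketch, the accumulation of rewards into a score can be illustrated with a single linear layer whose weights are shared over all time steps. The hidden vectors are taken as given (in the embodiments they come from the RNN module), and the names `score_hypothesis`, `weights`, and `bias` are hypothetical.

```python
def score_hypothesis(hidden_states, weights, bias):
    """Accumulate per-step rewards over the whole prediction horizon.

    hidden_states: one hidden vector per time step (assumed RNN outputs).
    weights, bias: parameters of one linear reward layer, reused at every
    time step to mirror the parameter sharing described above.
    """
    total = 0.0
    for h in hidden_states:
        reward = sum(w * x for w, x in zip(weights, h)) + bias
        total += reward
    return total


hidden = [[1.0, 2.0], [0.5, 0.0]]
score = score_hypothesis(hidden, weights=[1.0, -1.0], bias=0.5)
```

Because the same layer is applied at each step, the score depends on every step of the hypothesis rather than only on its endpoint.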
Block 204 estimates a regression vector ΔŶi(k) that refines each prediction sample Ŷi(k). The regression vector for each agent i is obtained with a regression function η defined as follows:
ΔŶi(k)=η(Ŷi(k);,X,Ŷj\i∀
Represented as parameters of a neural network, the regression function η accumulates both scene contexts and all other agent dynamics from the past to entire future frames and estimates the best displacement vector ΔŶi(k) over the entire time horizon T. Similar to the score s, it accounts for what happens in the future both in terms of scene context and interactions among dynamic agents to produce an output. The regression function η is implemented as a fully connected layer that is connected to the last hidden vector of the RNN, which outputs an M×T dimensional vector, with M=2 being the dimension of the location state.
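For illustration only, the refinement step can be sketched as a linear layer that maps the last hidden vector to an M×T vector with M=2, which is then added to the prediction sample. The interleaved (x, y) layout of the output vector and all names below are assumptions made for this sketch.

```python
def refine(prediction, last_hidden, weights, bias):
    """Add a regressed displacement to each point of a prediction sample.

    prediction: list of T (x, y) locations.
    last_hidden: last hidden vector of the RNN (taken as given).
    weights, bias: a linear layer with 2 * T output rows; the output is
    assumed to be laid out as (dx_0, dy_0, dx_1, dy_1, ...).
    """
    horizon = len(prediction)
    flat = [sum(w * h for w, h in zip(row, last_hidden)) + b
            for row, b in zip(weights, bias)]
    if len(flat) != 2 * horizon:
        raise ValueError("regression output must have 2 * T entries")
    return [(x + flat[2 * t], y + flat[2 * t + 1])
            for t, (x, y) in enumerate(prediction)]


refined = refine(
    prediction=[(0.0, 0.0), (1.0, 1.0)],
    last_hidden=[1.0],
    weights=[[0.1], [0.2], [0.3], [0.4]],
    bias=[0.0, 0.0, 0.0, 0.0],
)
```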
There are two loss terms in ranking and refinement block 204: a cross-entropy loss and a regression loss. The cross-entropy loss is expressed as lCE=H(p,q), where a target distribution q is obtained by softmax(−d(Yi,Ŷi(k))) and where d(Yi,Ŷi(k))=max∥Ŷi(k)−Yi∥. The regression loss is expressed as
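As a hedged example, the target distribution q and the cross-entropy term can be computed as follows. The distance d is the maximum Euclidean displacement over time steps, as stated above. The cross-entropy is written in its textbook form H(p,q)=−Σ p log q; how the distribution p is obtained is not specified here, so it is passed in as an argument. All function names are hypothetical.

```python
import math


def max_displacement(pred, truth):
    """d(Y, Y_hat): largest Euclidean distance between matching time steps."""
    return max(math.hypot(px - tx, py - ty)
               for (px, py), (tx, ty) in zip(pred, truth))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def target_distribution(samples, truth):
    """q = softmax(-d): samples closer to the ground truth get more mass."""
    return softmax([-max_displacement(s, truth) for s in samples])


def cross_entropy(p, q):
    """Textbook H(p, q) = -sum_k p_k * log(q_k)."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q))


truth = [(0.0, 0.0), (1.0, 0.0)]
samples = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 3.0)]]
q = target_distribution(samples, truth)
```

In this toy case the first sample matches the ground truth exactly, so it receives the larger share of the target distribution.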
The total loss of the entire network is defined as a multi-task loss:
where N is the total number of agents in a batch.
Referring now to
The feature pooling block 502 provides its output to the RNN decoder 506, which processes the feature pooling output and the prediction samples, providing its output to scoring block 508, which scores the samples and tracks accumulated rewards, and to regression block 510, which provides the regression vector ΔŶi(k) as feedback that is combined with input 501 for the next iteration.
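The feedback loop described above can be sketched, in a non-limiting manner, as a loop that applies a regression step to every sample, repeats for a fixed number of iterations, and then ranks the refined samples by score. The regression and scoring callables below are toy stand-ins chosen for this sketch, not the learned networks of the embodiments, and all names are hypothetical.

```python
import math


def rank_and_refine(samples, score_fn, regress_fn, iterations):
    """Iteratively refine samples, then return them sorted by score."""
    for _ in range(iterations):
        samples = [
            [(x + dx, y + dy) for (x, y), (dx, dy) in zip(s, regress_fn(s))]
            for s in samples
        ]
    scores = [score_fn(s) for s in samples]
    order = sorted(range(len(samples)), key=lambda k: scores[k], reverse=True)
    return [samples[k] for k in order], [scores[k] for k in order]


# Toy stand-ins: pull each point halfway toward a reference path, and
# score a sample by the negative distance of its endpoint to a goal.
REFERENCE = [(0.0, 0.0), (2.0, 0.0)]


def toy_regress(sample):
    return [((rx - x) * 0.5, (ry - y) * 0.5)
            for (x, y), (rx, ry) in zip(sample, REFERENCE)]


def toy_score(sample):
    x, y = sample[-1]
    return -math.hypot(x - 2.0, y - 0.0)


ranked, ranked_scores = rank_and_refine(
    [[(0.0, 0.0), (1.0, 2.0)], [(0.0, 0.0), (1.0, 0.0)]],
    toy_score, toy_regress, iterations=2,
)
```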
The RNN decoder 506 in the ranking and refinement block 204 makes use of information about the past motion context of individual agents, the semantic scene context, and the interaction between multiple agents to provide hidden representations that can score and refine the prediction Ŷi(k). The RNN decoder 506 therefore takes as input:
xt=[γ(ŷ̇i,t),p(ŷi,t;ρ()),r(ŷi,t;ŷj\t,hŶ
where ŷ̇i,t is a velocity of Ŷi(k) at a time t, γ is a fully connected layer with an activation that maps the velocity to a high-dimensional representation space, p(ŷi,t;ρ()) is a pooling operation that pools the CNN feature ρ() at the location ŷi,t, and r(ŷi,t;ŷj\t,hŶ
The feature pooling block 502 implements a spatial grid. For each sample k of an agent i at time t, spatial grid cells are defined centered at ŷi,t(k). Over each grid cell g, the hidden representations of all the other agents' samples that fall within the cell are pooled, ∀j≠i, ∀k, ŷj,t(k)∈g. Average pooling and a log-polar grid may be used to define the spatial grid.
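By way of illustration, average pooling over a log-polar grid can be sketched as follows. The grid extents, the numbers of radial and angular bins, and all names are assumptions made for this sketch; agents outside the radial range are simply ignored.

```python
import math


def log_polar_cell(dx, dy, r_min, r_max, n_radial, n_angular):
    """Map an offset (dx, dy) to a (radial, angular) bin, or None if outside."""
    r = math.hypot(dx, dy)
    if r < r_min or r >= r_max:
        return None
    radial = int(n_radial * math.log(r / r_min) / math.log(r_max / r_min))
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    angular = int(n_angular * angle / (2.0 * math.pi)) % n_angular
    return min(radial, n_radial - 1), angular


def pool_interactions(center, others, hidden_dim,
                      r_min=0.5, r_max=4.0, n_radial=2, n_angular=4):
    """Average the hidden vectors of other agents' samples per grid cell.

    center: (x, y) of the sample being scored.
    others: list of ((x, y), hidden_vector) for other agents' samples.
    Returns one pooled vector per cell; empty cells stay at zero.
    """
    n_cells = n_radial * n_angular
    sums = [[0.0] * hidden_dim for _ in range(n_cells)]
    counts = [0] * n_cells
    for (x, y), h in others:
        cell = log_polar_cell(x - center[0], y - center[1],
                              r_min, r_max, n_radial, n_angular)
        if cell is None:
            continue
        index = cell[0] * n_angular + cell[1]
        counts[index] += 1
        sums[index] = [s + v for s, v in zip(sums[index], h)]
    return [[s / c for s in vec] if c else vec
            for vec, c in zip(sums, counts)]


pooled = pool_interactions(
    center=(0.0, 0.0),
    others=[((1.0, 0.0), [2.0, 0.0]),
            ((1.0, 0.5), [4.0, 2.0]),
            ((-1.0, 2.0), [1.0, 1.0]),
            ((10.0, 0.0), [9.0, 9.0])],
    hidden_dim=2,
)
```

A log-polar layout gives finer cells near the agent and coarser cells farther away, so nearby agents are resolved in more detail than distant ones.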
Embodiments described herein may be entirely hardware, entirely software, or may include both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.
Each computer program may be tangibly stored in a machine-readable storage medium or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage medium or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
Referring now to
A first storage device 622 and a second storage device 624 are operatively coupled to system bus 602 by the I/O adapter 620. The storage devices 622 and 624 can be any of a disk storage device (e.g., a magnetic or optical disk storage device), a solid state magnetic device, and so forth. The storage devices 622 and 624 can be the same type of storage device or different types of storage devices.
A speaker 632 is operatively coupled to system bus 602 by the sound adapter 630. A transceiver 642 is operatively coupled to system bus 602 by network adapter 640. A display device 662 is operatively coupled to system bus 602 by display adapter 660.
A first user input device 652, a second user input device 654, and a third user input device 656 are operatively coupled to system bus 602 by user interface adapter 650. The user input devices 652, 654, and 656 can be any of a keyboard, a mouse, a keypad, an image capture device, a motion sensing device, a microphone, a device incorporating the functionality of at least two of the preceding devices, and so forth. Of course, other types of input devices can also be used, while maintaining the spirit of the present principles. The user input devices 652, 654, and 656 can be the same type of user input device or different types of user input devices. The user input devices 652, 654, and 656 are used to input and output information to and from system 600.
Of course, the processing system 600 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other input devices and/or output devices can be included in processing system 600, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art. For example, various types of wireless and/or wired input and/or output devices can be used. Moreover, additional processors, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art. These and other variations of the processing system 600 are readily contemplated by one of ordinary skill in the art given the teachings of the present principles provided herein.
Referring now to
A training module 708 provides training for neural networks in prediction sample module 710 according to a set of input data that may be stored in the memory 704. The training module 708 uses previous examples of agent trajectories, using the known future trajectories from a particular point in time to train predictions based on a past trajectory. After training, the prediction sample module 710 generates sets of such predictions for the ranking/refinement module 712 to work with, ultimately producing one or more predictions that represent the most likely future trajectories for agents, taking into account the presence and likely actions of other agents in the scene.
A user interface 714 is implemented to provide a display that shows the future trajectories as an overlay of a most recent piece of sensor data, for example overlaying the most likely trajectory of agents in the field of view of a camera. A response module 716 provides manual or automated actions responsive to the determined trajectories, where a human operator can trigger a response through the user interface 714 or a response can be triggered automatically in response to the trajectories matching certain conditions. For example, a large number of agents being detected at a crosswalk, with likely trajectories of crossing the street, may trigger a change in a lighting system's pattern to provide a “walk” signal to those agents. In another embodiment, the response module 716 may recognize that an agent is likely to enter an area that is dangerous or off-limits, and the response module 716 may then raise an alarm or trigger a barrier to the agent's progress (e.g., by locking a door).
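As one hypothetical example of a predetermined condition, an automatic response could be triggered when at least a given number of agents have predicted trajectories that enter a rectangular region such as a crosswalk. The region shape, the threshold, and all names are assumptions made for this sketch.

```python
def enters_region(trajectory, region):
    """True if any predicted location falls inside an axis-aligned box."""
    x_min, y_min, x_max, y_max = region
    return any(x_min <= x <= x_max and y_min <= y <= y_max
               for x, y in trajectory)


def should_trigger(predicted, region, min_agents):
    """True when enough agents are predicted to enter the region.

    predicted: mapping from an agent identifier to its most likely
    predicted trajectory, given as a list of (x, y) locations.
    """
    count = sum(1 for traj in predicted.values()
                if enters_region(traj, region))
    return count >= min_agents


predicted = {
    "a": [(0.0, 0.0), (1.0, 1.0)],
    "b": [(0.0, 0.0), (0.5, 2.0)],
    "c": [(5.0, 5.0), (6.0, 6.0)],
}
crosswalk = (0.5, 0.5, 2.0, 2.0)
walk_signal = should_trigger(predicted, crosswalk, min_agents=2)
```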
The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that those skilled in the art may implement various modifications without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.
This application claims priority to U.S. Provisional Application Ser. Nos. 62/414,288, 62/418,442, and 62/422,086, filed on Oct. 28, 2016, Nov. 7, 2016, and Nov. 15, 2016, respectively, each of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
62414288 | Oct 2016 | US
62418442 | Nov 2016 | US
62422086 | Nov 2016 | US