This is the first application for this disclosure.
The present disclosure relates to predicting depth in monocular videos. Specifically, the present disclosure relates to generating depth estimation for monocular videos using a meta-learning approach.
With the burgeoning usage of social media and improvements in camera technology, especially in the context of smartphone devices, videos have become a core form of disseminating entertainment, education and awareness.
Every minute, hundreds of hours or more of new video content are uploaded to social media-based video sharing and streaming platforms. Most of these videos are monocular videos filmed with smartphones or other mobile devices. Accurate depth estimation for these videos is often needed across a wide range of tasks and applications. In addition, depth estimation is also important when analyzing surveillance videos, or when a vehicle needs to use a real-time camera video feed to generate guidance for drivers or for a computing system onboard an autonomous driving vehicle.
One known technique for predicting a depth value for a video frame is future frame prediction, where a novel future frame, in terms of RGB colours, is estimated based on past observed video frames or sequences. Such methods tend to generate blurry and distorted future frames. Most recent methods focus on extracting effective temporal representations from video frames and then predicting a future frame according to those temporal representations.
In addition, some other methods concentrate on generating a semantic map for a future frame, where the semantic map may combine data representing depth, optical flow and semantic segmentation; however, the accuracy of the depth predicted in such a semantic map tends to be far from satisfactory.
The estimation of depth in a video relies on both temporal and spatial features of frames. Existing methods for predicting depth values for a future video frame using machine learning models are rather limited, as they are highly reliant on training data, which means the resulting machine learning model cannot be used to predict a depth for a future frame of a brand new video with high accuracy.
The present disclosure describes example embodiments implemented to predict a depth for a future video frame using neural network models trained using a meta-learning approach. Such neural network models, once trained using one set of training data from one or more videos, can quickly adapt to a novel environment (e.g., a new scene from a new video) with very few samples from the novel environment. This means the trained neural network model can be used to predict a depth for a future frame in a new video based on as few as two or three existing video frames from the new video, achieving improved computing efficiency in estimating depth values for the new video while conserving computing resources.
The systems, methods and computer-readable medium disclosed herein can provide unique technical advantages such as quickly and efficiently generating estimated depth values for a video frame produced by monocular videos, as the system is configured to quickly adapt to novel video scenes using a meta-learning approach, with self-supervised training. The system can be used to predict depth values of a future video frame based on a few recent video frames from a current video, which may be used to generate a future state by a computing system on an autonomous vehicle to assist with navigation. The system is able to handle previously unseen video data with only a few batches of sample data, which means large-scale training datasets are not required for the system to adapt to new video scenes in order to generate estimated depth values for a frame in the new video.
In one aspect, the present disclosure provides a computer-implemented method for predicting a depth for a video frame, the method may include: receiving a plurality of training data Di=(Diimg, Didepth), i=1 . . . N, and for each Di: Diimg=(Di1img, Di2img . . . Ditimg), wherein Di1img, Di2img . . . Ditimg each respectively represents a video frame from a plurality of t consecutive video frames with consecutive timestamps; and Didepth is a depth representation of a future video frame immediately subsequent to the video frame Ditimg; receiving a pre-trained neural network model fθ having a plurality of weights θ; while the pre-trained neural network model fθ has not converged: computing a plurality of second weights θi′, based on the plurality of consecutive video frames Diimg in each Di, i=1 . . . N and the pre-trained neural network model fθ, and updating the plurality of weights θ, based on the plurality of training data Di=(Diimg, Didepth), i=1 . . . N and the plurality of second weights θi′; receiving a plurality of m new consecutive video frames Dnew=(D1new_img, D2new_img . . . Dmnew_img); and predicting a depth representation of a video frame Dm+1new_img immediately subsequent to the video frame Dmnew_img, based on the plurality of new consecutive video frames Dnew and the updated plurality of weights θ.
In some embodiments, computing the plurality of second weights θi′ may be based on the equation:
θ_i′ = θ − α ∇_θ L_{D_i^{img}}(f_θ)
In some embodiments, updating the plurality of weights θ may be based on the equation:
θ = θ − β Σ_{i=1}^{N} ∇_θ L_{T_i}(f_{θ_i′})
In some embodiments, predicting the depth representation of video frame Dm+1new_img may include the steps of: updating the plurality of second weights θi′, based on the plurality of new consecutive video frames Dnew=(D1new_img, D2new_img . . . Dmnew_img) and the pre-trained neural network model fθ; and generating the depth representation of the video frame Dm+1new_img based on the updated plurality of second weights θi′.
In some embodiments, updating the plurality of second weights θi′ may be based on the equation:
θ_i′ = θ − α ∇_θ L_{D^{new}}(f_θ)
In some embodiments, a training process of the pre-trained neural network model fθ may include a current frame reconstruction process and a future depth prediction process.
In some embodiments, the training process of the pre-trained neural network model fθ may include: receiving a plurality of consecutive video frames F1img, F2img . . . Fjimg with consecutive timestamps; setting a plurality of initial parameters of fθ with random values to be the plurality of weights θ; extracting a plurality of spatial features from the plurality of consecutive video frames F1img, F2img . . . Fjimg; during the current frame reconstruction process: reconstructing each of the plurality of consecutive video frames F1img, F2img . . . Fjimg based on the plurality of spatial features; and updating values for at least one of the plurality of weights θ based on the reconstructed video frames; and during the future depth prediction process: extracting temporal features of the plurality of consecutive video frames F1img, F2img . . . Fjimg based on the plurality of spatial features; generating a depth prediction for a video frame Fj+1img immediately subsequent to the video frame Fjimg based on the temporal features of the plurality of consecutive video frames F1img, F2img . . . Fjimg; and updating values for at least one of the plurality of weights θ based on the depth prediction for the video frame Fj+1img.
In some embodiments, extracting the temporal features may include using a 3D convolutional neural network to extract the temporal features.
In some embodiments, the depth representation of any video frame may include, for one or more surfaces in the video frame, a depth value representing an estimated distance of the respective surface from a viewpoint.
In some embodiments, the depth representation of any video frame may be a depth map for the video frame.
In another aspect, a system for predicting a depth for a video frame is disclosed, the system may include: a processing unit; and a memory coupled to the processing unit, the memory storing machine-executable instructions that, when executed by the processing unit, cause the system to: receive a plurality of training data Di=(Diimg, Didepth), i=1 . . . N, and for each Di: Diimg=(Di1img, Di2img . . . Ditimg), wherein Di1img, Di2img . . . Ditimg each respectively represents a video frame from a plurality of t consecutive video frames with consecutive timestamps; and Didepth is a depth representation of a future video frame immediately subsequent to the video frame Ditimg; receive a pre-trained neural network model fθ having a plurality of weights θ; while the pre-trained neural network model fθ has not converged: compute a plurality of second weights θi′, based on the plurality of consecutive video frames Diimg in each Di, i=1 . . . N and the pre-trained neural network model fθ; and update the plurality of weights θ, based on the plurality of training data Di=(Diimg, Didepth), i=1 . . . N and the plurality of second weights θi′; receive a plurality of m new consecutive video frames Dnew=(D1new_img, D2new_img . . . Dmnew_img); and predict a depth representation of a video frame Dm+1new_img immediately subsequent to the video frame Dmnew_img, based on the plurality of new consecutive video frames Dnew and the updated plurality of weights θ.
In some embodiments, computing the plurality of second weights θi′ may be based on the equation:
θ_i′ = θ − α ∇_θ L_{D_i^{img}}(f_θ)
In some embodiments, updating the plurality of weights θ may be based on the equation:
θ = θ − β Σ_{i=1}^{N} ∇_θ L_{T_i}(f_{θ_i′})
In some embodiments, predicting the depth representation of video frame Dm+1new_img may include: updating the plurality of second weights θi′, based on the plurality of new consecutive video frames Dnew=(D1new_img, D2new_img . . . Dmnew_img) and the pre-trained neural network model fθ; and generating the depth representation of the video frame Dm+1new_img based on the updated plurality of second weights θi′.
In some embodiments, updating the plurality of second weights θi′ may be based on the equation:
θ_i′ = θ − α ∇_θ L_{D^{new}}(f_θ)
In some embodiments, a training process of the pre-trained neural network model fθ may include a current frame reconstruction process and a future depth prediction process.
In some embodiments, during the training process of the pre-trained neural network model fθ, the machine-executable instructions, when executed by the processing unit, cause the system to: receive a plurality of consecutive video frames F1img, F2img . . . Fjimg with consecutive timestamps; set a plurality of initial parameters of fθ with random values to be the plurality of weights θ; extract a plurality of spatial features from the plurality of consecutive video frames F1img, F2img . . . Fjimg; during the current frame reconstruction process: reconstruct each of the plurality of consecutive video frames F1img, F2img . . . Fjimg based on the plurality of spatial features; and update values for at least one of the plurality of weights θ based on the reconstructed video frames; and during the future depth prediction process: extract temporal features of the plurality of consecutive video frames F1img, F2img . . . Fjimg based on the plurality of spatial features; generate a depth prediction for a video frame Fj+1img immediately subsequent to the video frame Fjimg based on the temporal features of the plurality of consecutive video frames F1img, F2img . . . Fjimg; and update values for at least one of the plurality of weights θ based on the depth prediction for the video frame Fj+1img.
In some embodiments, during extracting the temporal features, the machine-executable instructions, when executed by the processing unit, cause the system to: use a 3D convolutional neural network to extract the temporal features.
In some embodiments, the depth representation of any video frame may include, for one or more surfaces in the video frame, a depth value representing an estimated distance of the respective surface from a viewpoint.
In some embodiments, the depth representation of any video frame may be a depth map for the video frame.
In another aspect, a non-transitory computer readable medium storing machine-readable instructions for configuring a processing unit to predict a depth for a video frame is disclosed, the machine-readable instructions being configured to cause the processing unit to: receive a plurality of training data Di=(Diimg, Didepth), i=1 . . . N, and for each Di: Diimg=(Di1img, Di2img . . . Ditimg), wherein Di1img, Di2img . . . Ditimg each respectively represents a video frame from a plurality of t consecutive video frames with consecutive timestamps; and Didepth is a depth representation of a future video frame immediately subsequent to the video frame Ditimg; receive a pre-trained neural network model fθ having a plurality of weights θ; while the pre-trained neural network model fθ has not converged: compute a plurality of second weights θi′, based on the plurality of consecutive video frames Diimg in each Di, i=1 . . . N and the pre-trained neural network model fθ; and update the plurality of weights θ, based on the plurality of training data Di=(Diimg, Didepth), i=1 . . . N and the plurality of second weights θi′; and once the pre-trained neural network model fθ has converged: receive a plurality of m new consecutive video frames Dnew=(D1new_img, D2new_img . . . Dmnew_img); and predict a depth representation of a video frame Dm+1new_img immediately subsequent to the video frame Dmnew_img, based on the plurality of new consecutive video frames Dnew and the updated plurality of weights θ.
Reference will now be made, by way of example, to the accompanying drawings which show example embodiments of the present application, and in which:
Similar reference numerals may have been used in different figures to denote similar components.
The present disclosure is made with reference to the accompanying drawings, in which embodiments are shown. However, many different embodiments may be used, and thus the description should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. Like numbers refer to like elements throughout, and prime notation is used to indicate similar elements, operations or steps in alternative embodiments. Separate boxes or illustrated separation of functional elements of illustrated systems and devices does not necessarily require physical separation of such functions, as communication between such elements may occur by way of messaging, function calls, shared memory space, and so on, without any such physical separation. As such, functions need not be implemented in physically or logically separated platforms, although they are illustrated separately for ease of explanation herein. Different devices may have different designs, such that although some devices implement some functions in fixed function hardware, other devices may implement such functions in a programmable processor with code obtained from a machine readable medium.
As used herein, a “module” or “operation” can refer to a combination of a hardware processing circuit and machine-readable instructions (software and/or firmware) executable on the hardware processing circuit. A hardware processing circuit can include any or some combination of a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, a digital signal processor, a general processing unit, an accelerator unit, or another hardware processing circuit. In some examples, module can refer to a purpose configured hardware processing circuit.
As used herein, a “model” can refer to a combination of a hardware processing circuit and machine-readable instructions (software and/or firmware) executable on the hardware processing circuit that is configured to apply a processing function to input data to generate a corresponding output. A “machine learning model” can refer to a model for which the processing function has been learned or trained using machine learning (ML) techniques. Machine learning models can include, but are not limited to, models that are based on one or more of Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) and/or transformer architectures. Other possible types of machine learning models include models that are based on decision tree algorithms, support vector machine algorithms, and logistic regression algorithms.
As mentioned, current future frame prediction methods usually focus on predicting future frames in terms of RGB values or semantic maps. These methods usually have poor adaptivity, which means the machine learning models need to be trained on a new, large-scale dataset before they can be used to predict a depth value for a new video. In this disclosure, novel methods are described to predict depth values for a future video frame without having to first obtain the RGB values for the future video frame, and using only a few samples (e.g., video frames) of the new video to make the prediction. The systems and methods disclosed herein provide a technical solution that requires fewer computing resources and less time than known approaches to estimate depth values for a frame in a given video.
In some example embodiments, a meta-learning approach is used to help train a machine learning (e.g., neural network) model to quickly adapt to a novel environment with just a few samples. The basic concept of meta-learning is to teach a pre-trained machine learning model to generalize based on a new set of training data (e.g., video frames from a new video) quickly and efficiently, given that pre-trained models generally have difficulty adapting to new video scenes without extensive training.
In some embodiments, during meta-learning, a pre-trained machine learning model can be trained to adapt to a specific type of environment or scene based on training data that are readily available for that environment (e.g., frames from a new video showing the environment), and deployed to generate depth values for a future video frame in the same or similar environment in a computationally efficient manner.
In some embodiments, a pre-trained neural network model 117 may be a neural network model fθ 117 executed to receive a plurality of video frames F1, F2, . . . Fj with consecutive timestamps, and generate a set of depth values for the immediately subsequent video frame Fj+1 based on the spatial and temporal features of the plurality of video frames F1, F2, . . . Fj. For example, the neural network model fθ 117 may have been pre-trained using Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and/or Long Short-Term Memories (LSTMs). A neural network model fθ 117, once pre-trained, may have a plurality of parameters or weights, collectively represented by θ, as learned from the spatial and/or temporal features of the plurality of video frames F1, F2, . . . Fj.
Example processes for pre-training a neural network model are described next in detail with reference to the accompanying drawings.
The system 200 may include a spatial feature extraction branch or process 203, a current frame reconstruction branch or process 205, and a future depth prediction branch or process 207. A plurality of consecutive video frames may be received by the system 200. The plurality of consecutive video frames may be represented as F1img, F2img . . . Fjimg, where j indicates a total number of consecutive video frames and can be any natural number greater than or equal to 2. In this particular example, the system 200 receives three consecutive video frames F1img 210a, F2img 210b, F3img 210c, having consecutive timestamps t−2, t−1, and t, respectively. The third video frame F3img 210c is the most recent and is referred to as the current frame.
The system 200 includes a plurality of spatial feature extraction encoders 220a, 220b, 220c, from which a respective encoder is assigned for each of the plurality of consecutive video frames F1img 210a, F2img 210b, F3img 210c. During the spatial feature extraction process 203, the spatial feature extraction encoder 220a is configured to receive the video frame F1img 210a as input, and generate a plurality of spatial features 230a and associated weights θF1 for the spatial features 230a. The spatial feature extraction encoder 220b is configured to receive the video frame F2img 210b as input, and generate a plurality of spatial features 230b and associated weights θF2 for the spatial features 230b. The spatial feature extraction encoder 220c is configured to receive the video frame F3img 210c as input, and generate a plurality of spatial features 230c and associated weights θF3 for the spatial features 230c. The weights θF1, θF2, θF3 can be part of the plurality of weights θ, and updated throughout the current frame reconstruction process 205.
During the current frame reconstruction branch or process 205, a respective decoder 240a, 240b, 240c can be configured to reconstruct, respectively, each of the plurality of consecutive video frames F1img 210a, F2img 210b, F3img 210c based on the plurality of spatial features 230a, 230b, 230c from the spatial feature extraction process 203. In some embodiments, the decoder 240a may share weights θF1 with the encoder 220a to reconstruct the video frame F1img with the output being a reconstructed video frame 250a; the decoder 240b may share weights θF2 with the encoder 220b to reconstruct the video frame F2img with the output being a reconstructed video frame 250b; and the decoder 240c may share weights θF3 with the encoder 220c to reconstruct the video frame F3img with the output being a reconstructed video frame 250c. During this process, values for at least one of the plurality of weights θF1, θF2, θF3 may be updated based on the reconstructed video frames 250a, 250b, 250c.
During the future depth prediction branch or process 207, a 3D Convolutional Neural Network (CNN) 260 is configured to extract temporal features 270 of the plurality of consecutive video frames F1img, F2img, F3img based on the plurality of spatial features 230a, 230b, 230c, and a decoder 280 is configured to generate a depth prediction 290 for a video frame F4img immediately subsequent to the video frame F3img based on the temporal features 270 of the plurality of consecutive video frames F1img, F2img, F3img. During this process, values for at least one of the plurality of weights θF1, θF2, θF3 may be updated based on the depth prediction 290. In some embodiments, the future depth prediction process 207 only takes the spatial features from the most current video frame F3img 210c for concatenation. The system 200 can be directly used to estimate depth values for a video frame that is subsequent to the plurality of consecutive video frames F1img, F2img, F3img.
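By way of a non-limiting illustration only, the following Python (PyTorch) sketch outlines such a two-branch design: one spatial encoder per frame, one reconstruction decoder per frame, and a 3D convolution that fuses the per-frame features before a depth decoder. All class names, layer sizes and the single-convolution encoders and decoders are simplifying assumptions made for this description and do not describe the actual architecture of the system 200.

import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Spatial feature extraction for a single video frame (cf. encoders 220a-220c)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))

    def forward(self, x):                      # x: (B, 3, H, W)
        return self.conv(x)                    # spatial features: (B, 64, H/2, W/2)

class FrameDecoder(nn.Module):
    """Upsampling decoder used for reconstruction (3 channels) or depth (1 channel)."""
    def __init__(self, out_channels):
        super().__init__()
        self.deconv = nn.Sequential(nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
                                    nn.Conv2d(32, out_channels, 3, padding=1))

    def forward(self, feat):
        return self.deconv(feat)

class MetaDepthModel(nn.Module):
    """Toy stand-in for the pre-trained neural network model f_theta."""
    def __init__(self, num_frames=3):
        super().__init__()
        self.encoders = nn.ModuleList(FrameEncoder() for _ in range(num_frames))
        self.recon_decoders = nn.ModuleList(FrameDecoder(3) for _ in range(num_frames))
        # A 3D convolution over the stacked per-frame features extracts temporal features.
        self.temporal = nn.Conv3d(64, 64, (num_frames, 3, 3), padding=(0, 1, 1))
        self.depth_decoder = FrameDecoder(1)

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        feats = [enc(frames[:, t]) for t, enc in enumerate(self.encoders)]
        recons = [dec(f) for dec, f in zip(self.recon_decoders, feats)]
        stacked = torch.stack(feats, dim=2)    # (B, 64, T, H/2, W/2)
        temporal = self.temporal(stacked).squeeze(2)
        depth = self.depth_decoder(temporal)   # predicted depth for the frame at time T+1
        return recons, depth

recons, depth = MetaDepthModel()(torch.rand(1, 3, 3, 64, 64))
print(depth.shape)                             # torch.Size([1, 1, 64, 64])

In the system 200 described above, each reconstruction decoder may share weights with its corresponding encoder, and the depth branch may use only the spatial features of the most current frame for concatenation; those details are omitted from this sketch for brevity.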
During the spatial feature extraction process 303, an encoder, which may include multiple 2D convolution layers 340a, 340b, 340c, 340d, 340e, may be used to extract the spatial features 230a, 230b, 230c from the input F1img 210a, F2img 210b, F3img 210c. Each 2D convolution layer 340a, 340b, 340c, 340d, 340e may include at least one 2D convolutional neural network (CNN). For example, '3×3 conv, 64' represents a 2D CNN with a 3×3 kernel and 64 output channels for outputting the spatial features. For another example, '3×3 conv, 128' represents a 2D CNN with a 3×3 kernel and 128 output channels for outputting the spatial features. For yet another example, '3×3 conv, 256' represents a 2D CNN with a 3×3 kernel and 256 output channels for outputting the spatial features. In some of the 2D convolution layers 340a, 340b, 340c, 340d, which include two or more 2D CNNs, a pooling layer (e.g., a max pooling filter) may be applied to downsample the spatial features generated by the 2D CNNs.
Each of the 2D convolution layers 340a, 340b, 340c, 340d, 340e may generate a set of spatial features 330a, 330b, 330c, 330d, 330e, which are passed on to the decoders in the current frame reconstruction process 305 and the decoders in the future depth prediction process 307.
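For concreteness, one possible reading of the '3×3 conv, N' notation is sketched below in Python (PyTorch), as a non-limiting example. The number of convolutions per block, the use of ReLU activations, and the width of the deepest block are assumptions made for illustration only.

import torch.nn as nn

def conv_block(in_ch, out_ch, num_convs=2, pool=True):
    """One encoder block: 3x3 2D convolutions with out_ch channels, then optional pooling."""
    layers = []
    for k in range(num_convs):
        layers += [nn.Conv2d(in_ch if k == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))          # downsample the spatial features
    return nn.Sequential(*layers)

encoder = nn.Sequential(
    conv_block(3, 64),                          # '3x3 conv, 64'
    conv_block(64, 128),                        # '3x3 conv, 128'
    conv_block(128, 256),                       # '3x3 conv, 256'
    conv_block(256, 512),                       # '3x3 conv, 512'
    conv_block(512, 1024, num_convs=1, pool=False),  # deepest set of spatial features (e.g., 330e)
)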
In the current frame reconstruction process 305, a decoder may include multiple deconvolution layers 350a, 350b, 350c, 350d to generate the reconstructed video frames 250a, 250b, 250c based on the spatial features 330a, 330b, 330c, 330d, 330e. Each of the deconvolution layers 350a, 350b, 350c, 350d may include a deconvolutional neural network (shown as 'deconv') and multiple 2D CNNs (e.g., '3×3 conv, 512'). The numbers '1024', '512', '256', '128' after each concatenation operation (shown as 'C') represent the number of spatial features after the respective concatenation operation.
In the future depth prediction process 307, the set of spatial features 330e from the spatial feature extraction process 303 is sent to a 3D CNN 360, which then processes the spatial features 330e to generate temporal features 270, which are in turn passed on to a decoder. The decoder may include multiple deconvolution layers 370a, 370b, 370c, 370d to generate the depth prediction 290 for the video frame F4img immediately subsequent to the video frame F3img, based on the spatial features 330a, 330b, 330c, 330d, 330e. Each of the deconvolution layers 370a, 370b, 370c, 370d may include a deconvolutional neural network (shown as 'deconv') and multiple 2D CNNs (e.g., '3×3 conv, 256'). The numbers '1024', '512', '256', '128' after each concatenation operation represent the number of spatial features after the respective concatenation operation. In some embodiments, the future depth prediction process 307 only takes the spatial features from the most current video frame F3img 210c for concatenation.
During both processes 305 and 307, values for some of the plurality of weights θ of the neural network model fθ117 may be updated based on the reconstructed video frames 250a, 250b, 250c, and/or the depth prediction 290.
As mentioned, a pre-trained neural network model 117 generally has difficulties adapting to new video scenes without extensive training. Referring back to the meta-learning system 100, the pre-trained neural network model fθ 117 may therefore be further trained and adapted using a meta-training component 110 and a meta-testing component 120, as described below.
Within the meta-training component 110, N batches of input data 112a . . . 112n may be obtained. Each batch of input data 112a . . . 112n may be represented by Di=(Diimg, Didepth), where Diimg and Didepth are randomly sampled data, and i indicates different batches, ranging from 1 to N, where N is the total number of batches. To be more specific, Diimg and Didepth are paired data: Diimg includes a plurality of consecutive video frames in terms of RGB values (e.g., Di_1img, Di_2img, Di_3img); and Didepth represents depth values (e.g., ground truth) of a future video frame immediately subsequent to the last video frame in Diimg. Depth values may be represented in a matrix corresponding to the RGB values. For example, Didepth may be a matrix, where each element of the matrix corresponds to a depth value (e.g., 2 meters) for a respective pixel (or a respective group of pixels) of the future video frame. A depth value may represent a distance between a surface to which the pixel (or group of pixels) belongs and the viewpoint of the camera used to take the video. For instance, for a video frame containing a scene involving a table, a depth value may represent a distance (e.g., 0.8 meters) from a pixel (or a group of pixels) showing a surface of the table to the viewpoint of the camera.
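As a non-limiting illustration, the following Python sketch shows one way such paired batches could be assembled from a monocular video with per-frame ground truth depth, using a sliding window of t=3 consecutive frames. The tensor shapes and the helper function name are assumptions; in practice the pairs may be randomly sampled as described above.

import torch

def make_training_pairs(frames, depths, t=3):
    """frames: (num_frames, 3, H, W) RGB video; depths: (num_frames, H, W) ground truth depth."""
    pairs = []
    for start in range(frames.shape[0] - t):
        window = frames[start:start + t]        # D_i_img: t consecutive video frames
        target = depths[start + t]              # D_i_depth: depth of the immediately subsequent frame
        pairs.append((window, target))
    return pairs

video = torch.rand(10, 3, 64, 64)               # toy 10-frame monocular video
gt_depth = torch.rand(10, 64, 64)               # e.g., metres per pixel
batches = make_training_pairs(video, gt_depth)
print(len(batches), batches[0][0].shape, batches[0][1].shape)
# 7 torch.Size([3, 3, 64, 64]) torch.Size([64, 64])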
In some embodiments, Didepth containing depth values for a video frame may be represented as a depth map, which may be a graphical representation based on the depth values. In a depth map, depth values may be depicted by one or more colours based on a predetermined set of rules, which may include, for example: depth values within a first range (e.g., 0-2 meters) are represented by a first colour (e.g., red), depth values within a second range (e.g., 2.1 to 5 meters) are represented by a second colour (e.g., yellow), depth values within a third range (e.g., beyond 5 meters) are represented by a third colour (e.g., blue), and so on.
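A minimal Python sketch of such a rule-based colour mapping is shown below, following the example ranges above as a non-limiting illustration. The NumPy usage, array shapes and exact thresholds are assumptions for illustration only.

import numpy as np

def colourize_depth(depth):
    """depth: (H, W) array of depth values in metres -> (H, W, 3) RGB depth map."""
    rgb = np.zeros(depth.shape + (3,), dtype=np.uint8)
    rgb[depth <= 2.0] = (255, 0, 0)                        # first range (0-2 m): red
    rgb[(depth > 2.0) & (depth <= 5.0)] = (255, 255, 0)    # second range (2-5 m): yellow
    rgb[depth > 5.0] = (0, 0, 255)                         # third range (beyond 5 m): blue
    return rgb

depth_map = colourize_depth(np.random.uniform(0.0, 10.0, size=(4, 4)))
print(depth_map.shape)   # (4, 4, 3)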
Through the training process 115 in the meta-training component 110, the N batches of data 112a . . . 112n are used to train the pre-trained neural network model fθ117, which may include a current frame reconstruction phase (referred to as an inner loop update), and a future depth prediction phase (referred to as an outer loop update).
In some embodiments, the meta-training component 110 is configured to adapt the pre-trained neural network model fθ117 to a new task by updating the existing weights θ of the pre-trained neural network model fθ117 to updated weights θi′. The updated weights θi′ may be computed using one or more gradient descent updates on the new task.
During the current frame reconstruction phase of the training process 115, the pre-trained neural network model fθ 117 with an L1 loss may be adapted for current frame reconstruction using Diimg, and one gradient update may be used to update the weights of the spatial feature encoders (e.g., encoders 220a, 220b, 220c) and the weights of the spatial feature decoders (e.g., decoders 240a, 240b, 240c) from θ to θi′ based on the equation:
θ_i′ = θ − α ∇_θ L_{D_i^{img}}(f_θ)   (1)
where α represents a learning rate or step size, and L_{D_i^{img}} represents the loss function for current frame reconstruction computed on the video frames Diimg.
At this point, θi′, which may be referred to as a set or plurality of second weights, represents the model weights of the pre-trained neural network model fθ 117 as updated by the data Diimg. The reconstruction loss L_{D_i^{img}} is defined further below.
After the adapted model weights θ′ including θi′ are obtained using equation (1) above, the pre-trained neural network model fθ 117 with weights θ′ may be evaluated during a future depth prediction phase with the paired data Di=(Diimg, Didepth), i=1 . . . N for an outer loop update. The goal of the outer loop update is to ensure the features from the spatial feature encoder are suitable for predicting depth values for a future video frame. Each iteration of the outer loop update may update the plurality of weights θ based on the equation:
θ = θ − β Σ_{i=1}^{N} ∇_θ L_{T_i}(f_{θ_i′})   (2)
where β represents a learning rate or step size, and L_{T_i} represents the loss function for future depth prediction evaluated on the paired data Di using the adapted weights θi′.
In some embodiments, L_{T_i} may be an L1 loss between the depth values predicted by the adapted model f_{θ_i′} and the ground truth depth values Didepth.
The loss function L_{D_i^{img}} for current frame reconstruction and the loss function L_{T_i} for future depth prediction may be defined as:

L_{D_i^{img}}(f_θ) = ∥f_θ(D_i^{img}) − D_i^{img}∥_1

L_{T_i}(f_{θ_i′}) = ∥f_{θ_i′}(D_i^{img}) − D_i^{depth}∥_1
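Expressed as a non-limiting Python (PyTorch) sketch, and assuming a model interface that returns per-frame reconstructions and a future depth prediction (as in the earlier architecture sketch), the two L1 losses may be written as shown below; summing a per-frame mean L1 error is one simple realization of the ∥·∥_1 terms above.

import torch

def reconstruction_loss(recons, frames):
    """L_{D_i^img}: L1 error between the reconstructed frames and the input frames (B, T, 3, H, W)."""
    return torch.stack([(r - frames[:, t]).abs().mean()
                        for t, r in enumerate(recons)]).sum()

def depth_loss(pred_depth, gt_depth):
    """L_{T_i}: L1 error between the predicted and ground-truth future depth."""
    return (pred_depth - gt_depth).abs().mean()

frames = torch.rand(1, 3, 3, 64, 64)
print(reconstruction_loss([frames[:, t] for t in range(3)], frames))   # 0.0 for a perfect reconstruction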
In the meta-testing component 120, the pre-trained neural network model fθ 117 with the updated weights θ may be further trained based on previously unseen data to quickly and efficiently generate a depth prediction for a future video frame. In some embodiments, the previously unseen data may include a plurality of m new consecutive video frames 122, represented as Dnew=(D1new_img, D2new_img . . . Dmnew_img).
During meta-testing, a few video frames from the plurality of m new consecutive video frames 122 may be used as input to conduct the inner loop update with equation (1) during an adaption process 125, to obtain the adapted neural network model with an updated plurality of second weights θi′. The adapted neural network model 127 may then be applied to the rest of the plurality of m new consecutive video frames 122 to measure its performance. Finally, the system 100 may generate a depth representation 129 of the video frame Dm+1new_img that is immediately subsequent to the video frame Dmnew_img.
An overall example process performed by the meta-learning system 100 is presented below using pseudo code, as a non-limiting example. During the meta-training component 110:
θ_i′ = θ − α ∇_θ L_{D_i^{img}}(f_θ)   (inner loop update, equation (1))
θ = θ − β Σ_{i=1}^{N} ∇_θ L_{T_i}(f_{θ_i′})   (outer loop update, equation (2))
N may be a total number of training data batches.
During the meta-testing component 120:
θ′ = θ − α ∇_θ L_{D^{img}}(f_θ), where D^{img} here denotes a few of the plurality of m new consecutive video frames 122
Estimated_Depth_Values = f_{θ′}(D^{img})
The output Estimated_Depth_Values is the final output 129 of the system 100.
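As a non-limiting illustration of how the pseudo code above could be realized, the following Python (PyTorch) sketch implements the inner loop update of equation (1) on a self-supervised reconstruction loss and the outer loop update of equation (2) on the future depth loss. The tiny two-head model, the stacked-frame input, the learning rates and the use of torch.func.functional_call are assumptions made for illustration and do not describe the actual implementation of the system 100.

import torch
import torch.nn as nn
from torch.func import functional_call

class TinyModel(nn.Module):
    """Minimal stand-in for f_theta with a reconstruction head and a depth head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(9, 16, 3, padding=1)      # t=3 RGB frames stacked along channels
        self.recon_head = nn.Conv2d(16, 9, 3, padding=1)   # reconstructs the input frames
        self.depth_head = nn.Conv2d(16, 1, 3, padding=1)   # depth of the next (future) frame

    def forward(self, frames):                             # frames: (B, 9, H, W)
        h = torch.relu(self.encoder(frames))
        return self.recon_head(h), self.depth_head(h)

def adapt(model, frames, alpha):
    """Inner loop update (equation (1)): one gradient step on the reconstruction loss."""
    params = dict(model.named_parameters())
    recon, _ = model(frames)
    inner_loss = (recon - frames).abs().mean()             # self-supervised L1 reconstruction loss
    grads = torch.autograd.grad(inner_loss, list(params.values()),
                                create_graph=True, allow_unused=True)
    # Only weights used by the reconstruction branch receive the inner update (second weights theta_i').
    return {name: p if g is None else p - alpha * g
            for (name, p), g in zip(params.items(), grads)}

def meta_train(model, batches, alpha=1e-2, beta=1e-3, steps=10):
    """Outer loop update (equation (2)), accumulated over the N training batches D_i."""
    meta_opt = torch.optim.SGD(model.parameters(), lr=beta)
    for _ in range(steps):                                 # until f_theta converges
        meta_opt.zero_grad()
        for frames, depth in batches:                      # i = 1 ... N
            adapted = adapt(model, frames, alpha)
            _, pred = functional_call(model, adapted, (frames,))
            (pred - depth).abs().mean().backward()         # accumulates sum_i of grad L_{T_i}(f_theta_i')
        meta_opt.step()                                    # theta <- theta - beta * accumulated gradients

# Toy usage: N=2 batches, each pairing 3 stacked 64x64 RGB frames with the next frame's depth.
model = TinyModel()
batches = [(torch.rand(1, 9, 64, 64), torch.rand(1, 1, 64, 64)) for _ in range(2)]
meta_train(model, batches, steps=2)

Consistent with the description of the inner loop update above, only the parameters touched by the reconstruction branch receive the inner gradient step in this sketch; the depth head keeps the meta-learned weights θ until the outer loop update.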
The processor 402 may be embodied as any processing resource capable of executing computer program instructions, such as one or more processors on a computer or computing platform(s). The memory 404 may be embodied as any data storage resource, such as one or more disk drives, random access memory, or volatile or non-volatile memory on one or more computing platforms.
The memory 404 has stored thereon several types of computer programs in the form of executable instructions. It has thereon a set of executable instructions 410 for carrying out the methods described herein. It also has stored thereon one or more sets of instructions of trained neural networks or other machine learned models to generate estimated depth values for one or more video frames.
The memory 404 may have stored thereon several types of data 480. The data 480 may include, for example, matrix representations 412 representing pre-trained neural network model fθ with weights θ. The matrix representations 412 may include matrices or weights used as input to a neural network (e.g., pre-trained neural network model fθ), as well as matrices updated or generated by the neural network. The data 480 may also include matrix representations 414 representing a plurality of second weights θ′ as updated during the training process 115 and adaption process 125 in system 100. The data 480 may further include matrix representations 416 representing a plurality of new video frames and matrix representations 418 representing estimated depth values for a future video frame generated based on the plurality of new video frames 416.
At operation 510, the system may receive a plurality of training data Di=(Diimg, Didepth), i=1 . . . N, and for each Di: Diimg=(Di1img, Di2img . . . Ditimg). Di1img, Di2img . . . Ditimg each respectively represents a video frame from a plurality of t consecutive video frames with consecutive timestamps, and Didepth is a depth representation of a future video frame immediately subsequent to the video frame Ditimg.
At operation 520, the system may receive a pre-trained neural network model fθ having a plurality of weights θ. This neural network model may be pre-trained based on the training process described above with respect to the system 200.
In some embodiments, a training process of the pre-trained neural network model fθ may include a current frame reconstruction process and a future depth prediction process.
In some embodiments, during the training process of the pre-trained neural network model fθ, a system, which may be a separate system from system 100, may: receive a plurality of consecutive video frames F1img, F2img . . . Fjimg with consecutive timestamps; set a plurality of initial parameters of fθ with random values to be the plurality of weights θ; extract a plurality of spatial features from the plurality of consecutive video frames F1img, F2img . . . Fjimg; during the current frame reconstruction process: reconstruct each of the plurality of consecutive video frames F1img, F2img . . . Fjimg based on the plurality of spatial features; and update values for at least one of the plurality of weights θ based on the reconstructed video frames; and during the future depth prediction process: extract temporal features of the plurality of consecutive video frames F1img, F2img . . . Fjimg based on the plurality of spatial features; generate a depth prediction for a video frame Fj+1img immediately subsequent to the video frame Fjimg based on the temporal features of the plurality of consecutive video frames F1img, F2img . . . Fjimg; and update values for at least one of the plurality of weights θ based on the depth prediction for the video frame Fj+1img.
In some embodiments, a 3D CNN may be used to extract the temporal features during the future depth prediction process.
While the pre-trained neural network model fθ has not converged, operations 530 and 540 are performed. At operation 530, the system may compute a plurality of second weights θi′, based on the plurality of consecutive video frames Diimg in each Di, i=1 . . . N and the pre-trained neural network model fθ. For example, computing the plurality of second weights θi′ may be based on the equation (1) below using Diimg:
θ_i′ = θ − α ∇_θ L_{D_i^{img}}(f_θ)   (1)
where α represents a learning rate or step size, and L_{D_i^{img}} represents the loss function for current frame reconstruction computed on the video frames Diimg, as described above.
At operation 540, the system may update the plurality of weights θ, based on the plurality of training data Di=(Diimg, Didepth), i=1 . . . N and the plurality of second weights θi′. For example, updating the plurality of weights θ may be based on the equation (2) below:
θ = θ − β Σ_{i=1}^{N} ∇_θ L_{T_i}(f_{θ_i′})   (2)
where β represents a learning rate, and L_{T_i} represents the loss function for future depth prediction computed using the paired training data Di=(Diimg, Didepth) and the plurality of second weights θi′.
At operation 550, the system may receive a plurality of m new consecutive video frames Dnew=(D1new_img, D2new_img . . . Dmnew_img).
At operation 560, the system may predict a depth representation of video frame Dm+1new_img, which is immediately subsequent to the video frame Dmnew_img, based on the plurality of new consecutive video frames Dnew and the updated plurality of weights θ.
In some embodiments, predicting the depth representation of video frame Dm+1new_img may include: updating the plurality of second weights θi′ based on the plurality of new consecutive video frames Dnew and the pre-trained neural network model fθ; and generating the depth representation of the video frame Dm+1new_img based on the updated plurality of second weights θi′.
In some embodiments, updating the plurality of second weights θi′ may be based on the equation:
θ_i′ = θ − α ∇_θ L_{D^{new}}(f_θ)
where α is the learning rate or step size, and L_{D^{new}} represents the loss function for current frame reconstruction computed on the plurality of m new consecutive video frames Dnew.
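As a non-limiting illustration of operations 550 and 560, the following Python (PyTorch) sketch adapts the weights with a single self-supervised gradient step on a few new video frames (no ground truth depth is required) and then predicts the depth representation of the next frame with the adapted weights. The minimal two-head model, the stacked-frame input and the learning rate are assumptions for illustration only.

import torch
import torch.nn as nn
from torch.func import functional_call

class TinyTwoHead(nn.Module):
    """Minimal stand-in for the trained model f_theta with the updated weights theta."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(9, 16, 3, padding=1)       # m=3 new RGB frames stacked along channels
        self.recon_head = nn.Conv2d(16, 9, 3, padding=1)
        self.depth_head = nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        return self.recon_head(h), self.depth_head(h)

model = TinyTwoHead()                                        # would be loaded with the meta-trained weights
new_frames = torch.rand(1, 9, 64, 64)                        # D_new: a few frames from a new video

# Adaptation (operation 550 and the inner loop update): reconstruction loss on the new frames.
params = dict(model.named_parameters())
recon, _ = model(new_frames)
loss = (recon - new_frames).abs().mean()
grads = torch.autograd.grad(loss, list(params.values()), allow_unused=True)
alpha = 1e-2                                                 # assumed learning rate / step size
adapted = {n: p if g is None else p - alpha * g
           for (n, p), g in zip(params.items(), grads)}

# Prediction (operation 560): depth representation of the frame following the new frames.
with torch.no_grad():
    _, estimated_depth = functional_call(model, adapted, (new_frames,))
print(estimated_depth.shape)                                 # torch.Size([1, 1, 64, 64])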
In some embodiments, the depth values may be represented using a depth map.
The systems and methods described herein can quickly and efficiently generate or predict depth values for a video frame produced by monocular videos, as the system is configured to quickly adapt to novel video scenes using a meta-learning approach, with self-supervised training. The system can be used to predict depth values of a future video frame based on a few recent video frames from a current video, which may be used to generate a future state by a computing system on an autonomous vehicle to assist with navigation. The system is able to handle previously unseen video data with only a few batches of sample data, which means large-scale training datasets are not required for the system to adapt to new video scenes in order to generate estimated depth values for a frame in the new video.
Although the present disclosure describes methods and processes with steps in a certain order, one or more steps of the methods and processes may be omitted or altered as appropriate. One or more steps may take place in an order other than that in which they are described, as appropriate.
Although the present disclosure is described, at least in part, in terms of methods, a person of ordinary skill in the art will understand that the present disclosure is also directed to the various components for performing at least some of the aspects and features of the described methods, be it by way of hardware components, software or any combination of the two. Accordingly, the technical solution of the present disclosure may be embodied in the form of a software product. A suitable software product may be stored in a pre-recorded storage device or other similar non-volatile or non-transitory computer readable medium, including DVDs, CD-ROMs, USB flash disk, a removable hard disk, or other storage media, for example. The software product includes instructions tangibly stored thereon that enable a processing device (e.g., a personal computer, a server, or a network device) to execute examples of the methods disclosed herein.
The present disclosure may be embodied in other specific forms without departing from the subject matter of the claims. The described example embodiments are to be considered in all respects as being only illustrative and not restrictive. Selected features from one or more of the above-described embodiments may be combined to create alternative embodiments not explicitly described, features suitable for such combinations being understood within the scope of this disclosure.
All values and sub-ranges within disclosed ranges are also disclosed. Also, although the systems, devices and processes disclosed and shown herein may comprise a specific number of elements/components, the systems, devices and assemblies could be modified to include additional or fewer of such elements/components. For example, although any of the elements/components disclosed may be referenced as being singular, the embodiments disclosed herein could be modified to include a plurality of such elements/components. The subject matter described herein intends to cover and embrace all suitable changes in technology.