The present disclosure relates generally to the field of computer vision. More particularly, the present disclosure relates to computer vision systems and methods for unsupervised representation learning by sorting sequences.
Convolutional Neural Networks (CNNs) have been used in visual recognition tasks involving millions of manually annotated images. While CNNs have shown dominant performance in high-level recognition problems such as classification and detection, training a deep network often requires processing millions of manually labeled images. In addition to being time-consuming and inefficient, this approach substantially limits the scalability of CNNs to new problem domains because manual annotations are often expensive and, in some cases, scarce (e.g., labeling medical images requires significant expertise from humans, such as healthcare professionals).
The inherent limitation of the fully supervised training paradigm highlights the importance of unsupervised learning to leverage vast amounts of unlabeled data. Vast amounts of free unlabeled images and videos are readily available. Before the resurgence of CNNs, hand-crafted features were used to discover semantic classes using clustering or by mining discriminative mid-level features. With deep learning techniques, rich visual representations can be learned and extracted directly from images. Some systems focus on reconstruction-based learning. Inspired by the original single-layer auto-encoders, several variants have been developed, including stacked layer-by-layer restricted Boltzmann machines (RBMs) and auto-encoders.
Some solutions attempt to leverage the inherent structure of raw images and formulate a discriminative or reconstruction loss function to train formulated models. These solutions define a supervisory signal for learning using the structure of the raw visual data. The spatial context in an image provides a rich source of supervision. Accordingly, some solutions include predicting the relative position of patches, reconstructing missing pixel values conditioned on the known surrounding area, predicting one subset of the data channels from another (e.g., predicting color channels from a gray image), solving jigsaw puzzles, in-painting missing regions based on their surroundings, and using cross-channel prediction and split-brain auto-encoders. In addition to using only individual images, some solutions are directed to grouping visual entities using co-occurrence in space and time, using graph-based constraints, and cross-modal supervision from sounds. Compared to image data, videos potentially provide much richer information as they not only consist of large amounts of image samples, but also provide scene dynamics. In comparison to images, videos provide the advantage of having an additional time dimension. Videos provide examples of appearance variations of objects over time.
Therefore, there exists a need for a surrogate task for self-supervised learning using a large collection of unlabeled videos. These and other needs are addressed by the computer vision systems and methods of the present disclosure.
Computer vision systems and methods for unsupervised representation learning by sorting sequences are provided. An unsupervised representation learning approach is provided which uses videos without semantic labels. Temporal coherence can be leveraged as a supervisory signal by formulating representation learning as a sequence sorting task. A plurality of temporally shuffled frames (i.e., in non-chronological order) can be used as inputs, and a convolutional neural network can be trained to sort the shuffled sequences and to facilitate machine learning of features by the convolutional neural network. Similar to comparison-based sorting algorithms, features can be extracted from all frame pairs and aggregated to predict the correct sequence order. As sorting a shuffled image sequence requires an understanding of the statistical temporal structure of images, training with such a proxy task can allow a computer to learn rich and generalizable visual representations from digital images.
The foregoing features of the present disclosure will be apparent from the following Detailed Description, taken in connection with the accompanying drawings, in which:
The present disclosure relates to computer vision systems for unsupervised representation learning by sorting sequences, as discussed in detail below in connection with
The present disclosure can use up to four randomly shuffled frames sampled from a video as the input in step 6 discussed above. The sequence sorting problem can be described as a multi-class classification task. For each tuple of four frames, there are 4! = 24 possible permutations. However, as some actions are coherent both forward and backward (e.g., opening/closing a door), forward and backward permutations can be grouped into the same class (e.g., 24/2 = 12 classes for four frames). This forward-backward grouping is conceptually similar to the commonly used horizontal flipping for images.
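By way of non-limiting illustration, the grouping of forward and backward permutations into shared classes can be sketched as follows. The function name `order_classes` and the choice of class identifiers are assumptions made for illustration only and are not part of the disclosure.

```python
from itertools import permutations


def order_classes(num_frames):
    """Map each frame permutation to a class id, merging a permutation with its reverse."""
    class_of = {}
    next_id = 0
    for perm in permutations(range(num_frames)):
        if perm in class_of:
            continue
        # A forward ordering and its reversed ordering share one class.
        class_of[perm] = next_id
        class_of[perm[::-1]] = next_id
        next_id += 1
    return class_of


classes = order_classes(4)
num_classes = len(set(classes.values()))  # 4! / 2 = 12 for four frames
```

The same function can be applied to tuples of other lengths, since the number of classes is always half the number of permutations for two or more frames.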
In step 16, candidate frames are sampled from the unlabeled input video based on motion magnitude. Motion-aware tuple selection can use the magnitude of optical flow to select frames with large motion regions. In addition to using optical flow magnitude for frame selection, the system of the present disclosure can further select patches with large motion. Specifically, for video frames in the range [tmin, tmax], the system can use sliding windows to mine frame tuples {ta, tb, tc, td} with large motion. The system can also use a sliding-window approach on the optical flow fields to extract patch tuples with large motion magnitude.
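By way of non-limiting illustration, one possible form of motion-aware frame tuple selection is sketched below. The sketch assumes that a mean optical-flow magnitude has already been computed for each frame. The function name `select_tuple`, the `window` and `min_motion` parameters, and the random sampling of frames inside the chosen window are hypothetical choices, not details taken from the disclosure.

```python
import random


def select_tuple(flow_magnitude, window, tuple_size=4, min_motion=1.0, rng=None):
    """Pick frame indices from the sliding window with the largest mean motion.

    flow_magnitude[t] is a precomputed mean optical-flow magnitude for frame t
    (hypothetical input; computing optical flow is outside this sketch).
    window must be at least tuple_size.
    Returns a sorted list of indices, or None if no window reaches min_motion.
    """
    rng = rng or random.Random()
    best_start, best_score = None, min_motion
    for start in range(len(flow_magnitude) - window + 1):
        score = sum(flow_magnitude[start:start + window]) / window
        if score >= best_score:
            best_start, best_score = start, score
    if best_start is None:
        return None
    # Sample distinct frames inside the chosen window, kept in temporal order.
    return sorted(rng.sample(range(best_start, best_start + window), tuple_size))
```

A patch-level variant could apply the same windowed scoring over spatial regions of the optical flow field rather than over time.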
Turning back to
The system of the present disclosure can be implemented in the Caffe toolbox. In particular, CaffeNet, a slight modification of AlexNet, can be used to implement the convolutional layers of the CNN. The network of the present disclosure can take 80×80 patches as inputs, which can reduce the number of parameters and the training time. This implementation depends on only 5.8M parameters up to fc7. The system can use stochastic gradient descent with a momentum of 0.9 and a dropout rate of 0.5 on fully connected layers. The system can also use batch normalization on all layers. The system can extract 280k tuples from the UCF-101 dataset as the training data. To train the CNN, the batch size can be set to 128 and the base learning rate to 10^-2. The system can reduce the learning rate by a factor of 10 at 130k and 350k iterations, with a total of 200k iterations. The total training process can take about 40 hours on one Titan X GPU. Other GPUs can be used within the spirit of the present disclosure.
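By way of non-limiting illustration, the step-wise learning rate reduction described above can be sketched as a function of the iteration count. The function name `learning_rate` and the parameter name `gamma` are assumptions for illustration, and the sketch is independent of any particular training framework.

```python
def learning_rate(iteration, base_lr=1e-2, milestones=(130_000, 350_000), gamma=0.1):
    """Step schedule: multiply the rate by gamma at every milestone already reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops
```

With the values stated above, a run of 200k iterations reaches only the first milestone, so the rate would be 10^-2 before iteration 130k and 10^-3 afterwards.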
The system of the present disclosure also provides an experimental approach for determining the accuracy of the computer vision recognition system. Split 1 of the UCF-101 and HMDB-51 action recognition benchmark datasets can be used to evaluate the performance of the unsupervised pre-trained CNN. The UCF-101 dataset includes 101 action categories with about 9.5k videos for training and 3.5k videos for testing. The HMDB-51 dataset consists of 51 action categories with about 3.4k videos for training and 1.4k videos for testing. Tables 1 and 2 below show the results of the system of the present disclosure against other systems.
As can be seen above, the quantitative results imply that more difficult tasks provide stronger semantic supervisory signals and guide the network to learn more meaningful features. The system of the present disclosure obtains 57.3% accuracy compared to 52.1% from Vondrick et al. on the UCF-101 dataset. To compare with Purushwalkam et al., the CNN of the present disclosure can also be trained using VGG-M-2048. Note that Purushwalkam et al. use the UCF-101, HMDB-51 and ACT datasets to train their model (about 20k videos). In contrast, the system of the present disclosure uses only videos from the UCF-101 training set and outperforms Purushwalkam et al. by 5.1%.
The system of the present disclosure can also be compared with an O3N system. Because O3N uses stacks of frame differences (15 channels) as inputs rather than RGB images, the system of the present disclosure can take a single frame difference Diff(t) = RGB(t+1) − RGB(t) as input to train the model. The system can initialize the network with models trained on RGB and Diff features. As shown in Table 3 below, the system of the present disclosure compares favorably against O3N, with a gain of more than 10% on the UCF-101 dataset and 5% on the HMDB-51 dataset. The performance of initializing with the model trained on RGB features is similar to that of initializing with the CNN trained on frame differences. The results demonstrate the generalizability of the present disclosure.
The system of the present disclosure can also be evaluated for transferability of learned features. The system can initialize the weights with the model trained on the UCF-101 training set (without using any labels). Table 2 above shows the results, where the present system achieves 22.5% compared to 15.2% for another system under the same setting. The present system achieves slightly higher performance when there is no domain gap (i.e., using training videos from the HMDB-51 dataset). The results suggest that the present system is not heavily data dependent and is capable of learning generalizable representations.
To evaluate the generalization ability of the present system, the weights learned by the system can be used as pre-trained weights for classification and detection tasks. The PASCAL VOC 2007 dataset has 20 object classes and contains 5,011 images for training and 4,952 images for testing. For both tasks, a fine-tuning strategy known in the art can be used without a rescaling method. The CaffeNet architecture can be used, and the Fast-RCNN pipeline can be employed for the detection task. The mean average precision (mAP) can be used as the evaluation metric. Since the present system has fully connected layers that can differ from those of a standard CNN, the weights of the convolutional layers can be copied and the fully connected layers can be initialized from a Gaussian distribution with mean 0 and standard deviation 0.005. Table 4 below summarizes methods using static images and methods using videos.
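By way of non-limiting illustration, the weight initialization described above (copying convolutional weights and drawing fully connected weights from a zero-mean Gaussian with standard deviation 0.005) can be sketched as follows. The dictionary layout, the layer names such as `conv1` and `fc6`, and the function name `init_for_finetuning` are hypothetical and do not correspond to any particular framework's API.

```python
import random


def init_for_finetuning(pretrained, fc_sizes, std=0.005, seed=0):
    """Copy convolutional weights and draw fully connected weights from N(0, std^2).

    pretrained: dict of layer name -> flat list of weights (hypothetical layout).
    fc_sizes:   dict of fully connected layer name -> number of weights.
    """
    rng = random.Random(seed)
    weights = {}
    for name, values in pretrained.items():
        if name.startswith("conv"):
            weights[name] = list(values)  # reuse the pre-trained filters
    for name, count in fc_sizes.items():
        # Pre-trained fully connected weights are discarded and re-drawn.
        weights[name] = [rng.gauss(0.0, std) for _ in range(count)]
    return weights
```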
The system of the present disclosure can also perform ablation analysis. First, unsupervised pre-training can be performed using the videos from the training set. The learned weights are then used as the initialization for the supervised action recognition problem. The training tuples can be selected according to the magnitude of optical flow. The optical flow direction can also be used as a further restriction. Specifically, the motion in the selected interval must remain in the same direction. Table 5 below shows how these tuple selection methods affect the final performance. Random selection degrades the performance because the training data contain many similar patches that are difficult to sort (e.g., static regions). Eliminating the direction constraint can also improve performance, because the constraint oversimplifies the task. In particular, the direction constraint eliminates many tuples with shape deformation (e.g., pitching contains motions in the reverse direction). Under the direction constraint, the CNN is thus unable to learn meaningful high-level features.
Different patch sizes can also be used for training the CNN. Due to the structure of the fully connected layers, the patch size selection can affect the number of parameters and thus the training time. Table 6 below compares patch sizes of 80×80 and 120×120 with the entire image. The comparison shows that using 80×80 patches has an advantage in terms of the number of parameters, training time, and most importantly, performance. One potential reason for the lower performance with larger patches can be the insufficient amount of video training data.
Spatial jittering can be applied to frames in a tuple to prevent the CNN from learning low-level statistics, as noted above. Table 7 shows that spatial jittering helps the CNN learn better features.
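By way of non-limiting illustration, spatial jittering of the patches in a tuple can be sketched as a small random shift of the crop origin for each frame. The sketch assumes that the shift is drawn independently per frame. The function name `jittered_crops` and the `max_shift` radius are hypothetical parameters chosen for illustration.

```python
import random


def jittered_crops(x, y, patch, frame_w, frame_h, num_frames=4, max_shift=5, rng=None):
    """Return one (left, top) crop origin per frame, each shifted independently.

    (x, y) is the shared patch origin; max_shift is a hypothetical jitter radius in pixels.
    """
    rng = rng or random.Random()
    origins = []
    for _ in range(num_frames):
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        # Clamp so the patch stays fully inside the frame.
        left = min(max(x + dx, 0), frame_w - patch)
        top = min(max(y + dy, 0), frame_h - patch)
        origins.append((left, top))
    return origins
```

Because each frame receives a slightly different offset, pixel-level alignment between patches is no longer a reliable cue for recovering the temporal order.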
The system can also show the effect of the pairwise comparison stage as well as the performance correlation between the sequence sorting task and action recognition. The order prediction task can be evaluated on a held-out validation set from the automatically sampled data. Table 8 shows the results. For both 3-tuple and 4-tuple inputs, models with the pairwise comparison perform better than models with simple concatenation on both order prediction and action recognition tasks. The improvement of the pairwise comparison over concatenation is larger on 4-tuples than on 3-tuples due to the difficulty of the order prediction task.
The quality of the learned features can be demonstrated by visualizing the low-level first-layer filters (conv1) as well as the high-level activations (pool5).
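By way of non-limiting illustration, the difference between the pairwise comparison and simple concatenation can be sketched on plain feature vectors. The sketch shows only how the inputs to each variant are assembled, and assumes that per-frame features are given as lists. The function names are hypothetical, and the learned layers that would process each pair before aggregation are omitted.

```python
from itertools import combinations


def pairwise_features(frame_features):
    """Concatenate the feature vectors of every frame pair (i < j)."""
    return {
        (i, j): frame_features[i] + frame_features[j]
        for i, j in combinations(range(len(frame_features)), 2)
    }


def concatenated_features(frame_features):
    """Baseline: one flat vector holding all frame features in input order."""
    return [v for feat in frame_features for v in feat]
```

For a 4-tuple, the pairwise variant yields six pair inputs, whereas a 3-tuple yields three.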
In
The present disclosure is not limited to 3-tuple and 4-tuple video frames; a 5-tuple OPN can also be used. For 5-tuple input, the system can take a tuple of 5 frames as the input and the CNN can predict 5!/2 = 60 classes. Table 10 below shows the results of the 5-tuple OPN on action recognition, classification, and detection.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art may make variations and modifications without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure. What is intended to be protected by Letters Patent is set forth in the following claims.
This application claims the benefit of U.S. Provisional Patent Application No. 62/620,700 filed on Jan. 23, 2018, the entire disclosure of which is expressly incorporated herein by reference.
Number | Date | Country |
---|---|---|
62620700 | Jan 2018 | US |