The present invention relates to the field of image processing and motion image sequence processing and, more particularly, to the field of motion estimation, detection, and prediction of body poses.
In recent years, impressive motion capture results have been demonstrated using depth cameras, but three-dimensional (3D) body pose recovery from ordinary monocular video sequences remains extremely challenging. Nevertheless, there is great interest in doing so, both because cameras are becoming ever cheaper and more prevalent and because there are many potential applications. These include athletic training, surveillance, and entertainment.
Early approaches to monocular 3D pose tracking involved recursive frame-to-frame tracking and were found to be brittle, due to distractions and occlusions from other people or objects in the scene [43]. Since then, the focus has shifted to “tracking by detection” which involves detecting human pose more or less independently in every frame followed by linking the poses across the frames [2, 31], which is much more robust to algorithmic failures in isolated frames. More recently, an effective single-frame approach to learning a regressor from a kernel embedding of two-dimensional (2D) HOG features to 3D poses has been proposed by [17], hereinafter referred to as Ionescu. Excellent results have also been reported using a Convolutional Neural Net (CNN) [25], hereinafter referred to as Li.
However, inherent ambiguities of the projection from 3D to 2D, including self-occlusion and mirroring, can still confuse these state-of-the-art approaches. A linking procedure can correct for these ambiguities to a limited extent by exploiting motion information a posteriori to eliminate erroneous poses by selecting compatible candidates over consecutive frames. However, when such errors happen frequently for several frames in a row, enforcing temporal consistency afterwards is not enough. Therefore, in light of these deficiencies of the background art, strongly improved methods, devices, and systems are desired.
According to one aspect of the present invention, a method for predicting three-dimensional body poses from image sequences of an object is provided, the method performed on a processor of a computer having memory. Preferably, the method includes the steps of accessing the image sequences from the memory, finding bounding boxes around the object in consecutive frames of the image sequence, compensating motion of the object to form spatio-temporal volumes, and learning a mapping from the spatio-temporal volumes to a three-dimensional body pose in a central frame based on a mapping function.
According to another aspect of the present invention, a device for predicting three-dimensional body poses from image sequences of an object is provided, the device including a processor having access to a memory. Preferably, the processor is configured to access the image sequences from the memory, find bounding boxes around the object in consecutive frames of the image sequence, compensate motion of the object to form spatio-temporal volumes, and learn a mapping from the spatio-temporal volumes to a three-dimensional body pose in a central frame based on a mapping function.
According to still another aspect of the present invention, a non-transitory computer readable medium is provided. Preferably, the computer readable medium has computer instructions recorded thereon, the computer instructions configured to perform a method for predicting three-dimensional body poses from image sequences of an object when executed on a computer having memory. Moreover, the method further preferably includes the steps of accessing the image sequences from the memory, finding bounding boxes around the object in consecutive frames of the image sequence, compensating motion of the object to form spatio-temporal volumes, and learning a mapping from the spatio-temporal volumes to a three-dimensional body pose in a central frame based on a mapping function.
The above and other objects, features and advantages of the present invention and the manner of realizing them will become more apparent, and the invention itself will best be understood from a study of the following description with reference to the attached drawings showing some preferred embodiments of the invention.
The accompanying drawings, together with the tables, which are incorporated herein and constitute part of this specification, illustrate the presently preferred embodiments of the invention, and together with the general description given above and the detailed description given below, serve to explain features of the invention:
Table 1 shows different results of 3D joint position errors in Human3.6m using the metric of average Euclidean distance;
Table 2 shows different results for two actions, one for the Walking Dog action, which involves more movement, and one for the Greeting action, which involves less motion;
Table 3 shows different results that demonstrate the influence of the size of the temporal window;
Table 4 shows different results of 3D joint position errors (in mm) on the Walking and Boxing sequences of the HumanEva-I dataset;
Table 5 shows different results of 3D joint position errors (in mm) on the Combo sequence of the HumanEva-II dataset; and
Table 6 shows a comparison on the KTH Multiview Football II results of the present method using a single camera to those of using either single or two cameras.
Herein, identical reference numerals are used, where possible, to designate identical elements that are common to the figures. Also, the representations in the drawings are simplified for illustration purposes and may not be depicted to scale.
According to one aspect of the present invention, motion information is used from the start of the process. To this end, we learn a regression function that directly predicts the 3D pose in a given frame of a sequence from a spatio-temporal volume centered on it. This volume comprises bounding boxes surrounding the person in consecutive frames coming before and after the central one. It is shown that this approach is more effective than relying on regularizing initial estimates a posteriori. Different regression schemes have been evaluated and the best results are obtained by applying a Deep Network to the spatiotemporal features [21, 45] extracted from the image volume. Furthermore, we show that, for this approach to perform to its best, it is essential to align the successive bounding boxes of the spatio-temporal volume so that the person inside them remains centered. To this end, we trained two Convolutional Neural Networks to first predict large body shifts between consecutive frames and then refine them. This approach to motion compensation outperforms other more standard ones [28] and improves 3D human pose estimation accuracy significantly.
According to another aspect of the present method, device and system, one advantage is a principled approach to combining appearance and motion cues to predict 3D body pose in a discriminative manner. Furthermore, it is demonstrated that what makes this approach both practical and effective is the compensation for the body motion in consecutive frames of the spatiotemporal volume. It is shown that the proposed method, device, and system improves upon background methods [2, 3, 4, 17, 25] by a large margin on the Human3.6m of Ionescu [17], HumanEva [36], and KTH Multiview Football [6] 3D human pose estimation benchmarks.
Approaches to estimating the 3D human pose can be classified into two main categories, depending on whether they rely on still images or image sequences. These two categories are briefly discussed infra. In the results shown infra, it is demonstrated that the present method, device, and system outperforms the background art representatives of each of these two categories.
With respect to the first category, 3D human pose estimation in single images, early approaches tended to rely on generative models to search the state space for a plausible configuration of the skeleton that would align with the image evidence [12, 27, 35]. These methods remain competitive provided that a good enough initialization can be supplied. More recent ones [3, 6] extend 2D pictorial structure approaches [10] to the 3D domain. However, in addition to their high computational cost, they tend to have difficulty localizing people's arms accurately because the corresponding appearance cues are weak and easily confused with the background [33].
By contrast, discriminative regression-based approaches [1, 4, 16, 40] build a direct mapping from image evidence to 3D poses. Discriminative methods have been shown to be effective, especially if a large training dataset, such as that of Ionescu, is available. Within this context, rich features encoding depth [34] and body part information [16, 25] have been shown to be effective at increasing the estimation accuracy. However, these methods can still suffer from ambiguities such as self-occlusion, mirroring and foreshortening, as they rely on single images. To overcome these issues, the present application shows how to use not only appearance, but also motion features for discriminative 3D human pose estimation purposes.
In another notable study, [4] investigates merging image features across multiple views. Our method is fundamentally different as we do not rely on multiple cameras. Furthermore, we compensate for apparent motion of the person's body before collecting appearance and motion information from consecutive frames.
With respect to the second category, the 3D human pose estimation in image sequences, these approaches also fall into two main classes.
The first class involves frame-to-frame tracking and dynamical models [43] that rely on Markov dependencies on previous frames. Their main weakness is that they require initialization and cannot recover from tracking failures.
To address these shortcomings, the second class focuses on detecting candidate poses in individual frames and then linking them across frames in a temporally consistent manner. For example, in [2], initial pose estimates are refined using 2D tracklet-based estimates. In [47], dense optical flow is used to link articulated shape models in adjacent frames. In [7], non-maxima suppression is employed to merge pose estimates across frames. By contrast to these approaches, in the present method, device, and system, the temporal information is captured earlier in the process by extracting spatiotemporal features from image cubes of short sequences and regressing to 3D poses. Another approach [5] estimates a mapping from consecutive ground-truth 2D poses to a central 3D pose. Instead, the present method, device, and system does not require any such 2D pose annotations and directly uses as input a sequence of motion-compensated frames.
While they have long been used for action recognition [23, 45], person detection [28], and 2D pose estimation [11], spatiotemporal features have been underused for 3D body pose estimation purposes. The only recent approach is that of [46], which involves building a set of point trajectories corresponding to high joint responses and matching them to motion capture data. One drawback of this approach is its very high computational cost. Also, while the 2D results look promising, no quantitative 3D results are provided in the paper and no code is available for comparison purposes.
According to one aspect of the present method, device, and system, the approach involves finding bounding boxes around people in consecutive frames, compensating for the motion to form spatiotemporal volumes, and learning a mapping from these volumes to a 3D pose in their central frame. In the following discussion, the formalism and terms used in the present application are presented, and each individual step is then described, as depicted in the accompanying drawings.
According to one aspect of the proposed method, device, and system, an efficient approach to exploiting motion information from consecutive frames of a video sequence to recover the 3D pose of people is provided. Previous approaches typically compute candidate poses in individual frames and then link them in a post-processing step to resolve ambiguities. By contrast, in one aspect of the present method, device, and system, a regression is performed from a spatio-temporal volume of bounding boxes to a 3D pose in the central frame.
In addition, it is shown that, for the present method, device, and system to achieve its full potential, it is preferable to compensate for the motion in consecutive frames so that the subject remains centered. This then makes it possible to effectively overcome ambiguities and improve upon the state-of-the-art by a large margin on the Human3.6m, HumanEva, and KTH Multiview Football 3D human pose estimation benchmarks.
In the present application, 3D body poses are represented in the figures in terms of skeletons, such as those shown in the accompanying drawings.
Let I_i be the i-th image of a sequence containing a subject and Y_i ∈ ℝ^{3J} be a vector that encodes the corresponding 3D locations of its J joints. Typically, regression-based discriminative approaches to inferring Y_i involve learning a parametric [1, 18] or non-parametric [42] model of the mapping function, X_i → Y_i ≈ f(X_i), over training examples, where X_i = Ω(I_i; m_i) is a feature vector computed over the bounding box or the foreground mask, m_i, of the person in I_i. The model parameters are usually learned from a labeled set of N training examples, T = {(X_i, Y_i)}_{i=1}^{N}. As discussed supra, in such a setting, reliably estimating the 3D pose is hard due to the inherent ambiguities of 3D human pose estimation, such as self-occlusion and mirror ambiguity.
Instead, the mapping function f is modelled conditioned on a spatiotemporal 3D data volume made of a sequence of T frames centered at image i,

$$V_i = [I_{i-T/2+1}, \ldots, I_i, \ldots, I_{i+T/2}] \qquad (1)$$

and the pose is inferred through the mapping

$$Z_i \rightarrow Y_i \approx f(Z_i) \qquad (2)$$

where

$$Z_i = \xi(V_i; m_{i-T/2+1}, \ldots, m_i, \ldots, m_{i+T/2}) \qquad (3)$$

is a feature vector computed over the data volume V_i. The training set, in this case, is

$$\mathcal{T} = \{(Z_i, Y_i)\}_{i=1}^{N} \qquad (4)$$
where Y_i is the pose in the central frame of the image stack. In practice, every block of T consecutive frames is collected across all training videos to obtain data volumes. It is shown in the results section that this significantly improves performance and that the best results are obtained for volumes of T=24 to 48 images, that is, 0.5 to 1 second given the 50 fps of the sequences of the Human3.6m dataset of Ionescu.
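For illustration purposes only, the collection of such training pairs may be sketched as follows in Python; the frames and poses are assumed to be available as arrays, and the function name is illustrative rather than limiting:

```python
import numpy as np

def build_training_set(frames, poses, T=24):
    """Collect every block of T consecutive frames into a data volume V_i
    (Eq. 1) and pair it with the 3D pose Y_i of the central frame (Eq. 4).

    frames: list of H x W image arrays for one video
    poses:  list of 3*J joint-position vectors, one per frame
    T:      temporal window size; 24-48 frames works best per the text
    """
    volumes, targets = [], []
    half = T // 2
    for i in range(half - 1, len(frames) - half):
        # V_i = [I_{i-T/2+1}, ..., I_i, ..., I_{i+T/2}]
        V = np.stack(frames[i - half + 1 : i + half + 1])
        volumes.append(V)
        targets.append(poses[i])  # pose of the central frame
    return volumes, targets
```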
Regarding the spatiotemporal features, the feature vector Z is based on the 3D histogram of oriented gradients (HOG) descriptor [45], which simultaneously encodes appearance and motion information. It is computed by first subdividing a data volume, such as the one depicted in the accompanying drawings, into equally-sized spatio-temporal cells and then accumulating a histogram of oriented 3D gradients within each cell.
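For illustration, a simplified sketch of such a descriptor is given below; the actual 3D HOG of [45] quantizes gradient orientations using a regular polyhedron, whereas this illustrative approximation bins the spatial gradient angle together with the temporal gradient polarity:

```python
import numpy as np

def hog3d(volume, cells=(3, 4, 4), nbins=8):
    """Simplified 3D HOG over a T x H x W spatio-temporal volume.

    Orientation is quantized by spatial gradient angle plus temporal
    gradient polarity (an illustrative stand-in for the polyhedral
    quantization of the actual descriptor [45]).
    """
    gt, gy, gx = np.gradient(volume.astype(np.float64))
    mag = np.sqrt(gx**2 + gy**2 + gt**2)
    half = nbins // 2
    ang = np.arctan2(gy, gx)  # spatial orientation in [-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * half).astype(int) % half
    bins += (gt > 0).astype(int) * half  # temporal polarity
    ct, cy, cx = cells
    T, H, W = volume.shape
    desc = []
    for ts in np.array_split(np.arange(T), ct):
        for ys in np.array_split(np.arange(H), cy):
            for xs in np.array_split(np.arange(W), cx):
                idx = np.ix_(ts, ys, xs)
                hist = np.bincount(bins[idx].ravel(),
                                   weights=mag[idx].ravel(),
                                   minlength=nbins)
                desc.append(hist / (np.linalg.norm(hist) + 1e-6))
    return np.concatenate(desc)  # feature vector Z for the volume
```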
An alternative to encoding motion information in this way would have been to explicitly track the body pose in the spatiotemporal volume, as done in [2]. However, this involves detecting the body pose in individual frames, which is subject to the ambiguities caused by the projection from 3D to 2D explained in the background art discussion; not having to do this is a contributing factor to the good results presented below in Tables 1-6.
Another approach for spatiotemporal feature extraction could be to use 3D CNNs operating directly on the pixel intensities of the spatiotemporal volume. However, in our experiments, we have observed that 3D CNNs did not achieve any notable improvement in performance compared to spatial CNNs. This is likely because 3D CNNs remain stuck in local minima due to the complexity of the model and the large input dimensionality, as also observed in [19, 26].
Regarding motion compensation with CNNs, for the 3D HOG descriptors introduced above to be representative of a pose of a person, the temporal bins must correspond to specific body parts, which implies that the person should remain centered from frame to frame in the bounding boxes used to build the image volume. In the present application, the Deformable Part Model detector (DPM) [10] is used to obtain these bounding boxes, as it has proved effective in various applications. However, in practice, these bounding boxes may not be well-aligned on the person. Therefore, these boxes are shifted as shown in the accompanying drawings.
Accordingly, one aspect of the present method, device, and system provides an object-centric motion compensation scheme inspired by the one proposed in [32] for drone detection purposes, which was shown to perform better than optical-flow based alignment [28]. To this end, regressors are trained to estimate the shift of the person from the center of the bounding box. These shifts are applied to the frames of the image stack so that the subject remains centered, yielding what is called a rectified spatio-temporal volume (RSTV), as depicted in the accompanying drawings.
A schematic representation of the method as a flowchart, according to one aspect of the present invention, is shown in the accompanying drawings.
More formally, let m be an image patch extracted from a bounding box returned by DPM. An ideal regressor ψ(·) for this purpose would return the horizontal and vertical shifts δu and δv of the person from the center of m: ψ(m) = (δu, δv). In practice, to make the learning task easier, two separate regressors ψ_coarse and ψ_fine are introduced. The first one is trained to handle large shifts and the second to refine them. These regressors are used iteratively, as illustrated by the algorithm sketched below, which describes the object-centric motion compensation.
After each iteration, the images are shifted by the computed amount and a new shift is estimated. This process typically takes only four (4) iterations: two (2) using ψ_coarse and two (2) using ψ_fine.
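For illustration, the iterative procedure may be sketched as follows; ψ_coarse and ψ_fine stand in for the trained CNN regressors, and the (x, y, w, h) box convention and the helper function are assumptions made for illustration purposes:

```python
def crop(frame, box):
    """Extract the patch inside box = (x, y, w, h), clipped to the image."""
    x, y, w, h = box
    H, W = frame.shape[:2]
    return frame[max(0, y):min(H, y + h), max(0, x):min(W, x + w)]

def rectify_volume(frames, box, psi_coarse, psi_fine):
    """Object-centric motion compensation: iteratively recenter the
    bounding box using the coarse regressor twice, then the fine one
    twice, so that the subject stays centered in every patch."""
    rectified = []
    for frame in frames:
        for psi in (psi_coarse, psi_coarse, psi_fine, psi_fine):
            du, dv = psi(crop(frame, box))  # shift of subject from center
            x, y, w, h = box
            box = (int(round(x + du)), int(round(y + dv)), w, h)
        rectified.append(crop(frame, box))
        # the corrected box seeds the next frame, so the full DPM
        # detector only needs to run on the first frame
    return rectified  # list of patches forming the RSTV
```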
Both CNNs feature the same architecture, which includes convolutional, pooling, and fully connected layers, as schematically depicted in the accompanying drawings.
Training the CNNs requires a set of image windows centered on a subject, together with shifted versions of them with known displacements, such as those depicted in the accompanying drawings.
Using the CNNs requires an initial estimate of the bounding box for every person, which is given by DPM. However, applying the detector to every frame of the video is time-consuming. Thus, the DPM is applied only to the first frame.
The position of the detection is then refined, and the resulting bounding box is used as an initial estimate in the second frame. Similarly, its position is then corrected, and the procedure is iterated in subsequent frames. The initial person detector provides only rough location estimates, and the motion compensation algorithm naturally compensates even for relatively large positional inaccuracies using the regressor ψ_coarse. Some examples of the motion compensation algorithm, together with an analysis of its efficiency as compared to optical flow, are presented in the results below.
Regarding the pose regression, 3D pose estimation is cast in terms of finding a mapping Z → f(Z) ≈ Y, where Z is the 3D HOG descriptor computed over a spatiotemporal volume and Y is the 3D pose in its central frame. To learn f, Kernel Ridge Regression (KRR) [14] and Kernel Dependency Estimation (KDE) [8] were considered, as they were used in previous works on this task [16, 17], as well as Deep Networks (DN).
The KRR trains a model for each dimension of the pose vector separately. To find the mapping from spatiotemporal features to 3D poses, it solves a regularized least-squares problem of the following form:

$$\min_{W} \sum_{i} \left\| W^T \Phi_Z(Z_i) - Y_i \right\|_2^2 + \left\| W \right\|_F^2$$

where (Z_i, Y_i) are training pairs and Φ_Z is the Fourier approximation to the exponential-χ2 kernel, as in Ionescu. This problem can be solved in closed form by W = (Φ_Z(Z)^T Φ_Z(Z) + I)^{-1} Φ_Z(Z)^T Y.
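For illustration, a minimal sketch of this closed-form solution in Python, assuming the feature-mapped inputs are stacked row-wise into a matrix, is given below; the function name is illustrative:

```python
import numpy as np

def krr_fit(Phi, Y):
    """Closed-form KRR: W = (Phi^T Phi + I)^(-1) Phi^T Y.

    Phi: N x D matrix of feature-mapped inputs Phi_Z(Z_i), row-wise
    Y:   N x 51 matrix of ground-truth 3D poses
    """
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + np.eye(D), Phi.T @ Y)

# prediction for a new feature-mapped input phi: pose = phi @ W
```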
The KDE is a structured regressor that accounts for correlations in 3D pose space. To learn the regressor, not only the input, as in the case of KRR, but also the output vectors are lifted into high-dimensional Hilbert spaces using kernel mappings Φ_Z and Φ_Y, respectively [8, 17]. The dependency between the high-dimensional input and output spaces is modeled as a linear function, and the corresponding matrix W is computed by standard kernel ridge regression:

$$\min_{W} \sum_{i} \left\| W^T \Phi_Z(Z_i) - \Phi_Y(Y_i) \right\|_2^2 + \left\| W \right\|_F^2$$
To produce the final prediction, the difference between the linear prediction and the mapping of the output into the high-dimensional Hilbert space is minimized by finding:

$$\hat{Y} = \arg\min_{Y} \left\| W^T \Phi_Z(Z) - \Phi_Y(Y) \right\|_2^2$$
Although the problem is non-linear and non-convex, it can nevertheless be solved accurately given the KRR predictors for individual outputs to initialize the process. In practice, an input kernel embedding based on 15,000-dimensional random feature maps corresponding to an exponential-χ2 kernel is used, together with a 4,000-dimensional output embedding corresponding to a radial basis function kernel, as shown in [24].
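For illustration, a minimal sketch of this pre-image step, assuming random-feature approximations of both kernels and using a generic numerical optimizer in place of the exact procedure of [24], is:

```python
import numpy as np
from scipy.optimize import minimize

def kde_predict(phi_z, W, phi_y, y_init):
    """Pre-image step of KDE: find the pose Y whose output embedding
    Phi_Y(Y) best matches the linear prediction W^T Phi_Z(Z).

    phi_z:  feature-mapped input vector Phi_Z(Z), shape (D_in,)
    W:      D_in x D_out matrix learned by kernel ridge regression
    phi_y:  callable computing the output embedding Phi_Y(y)
    y_init: initialization from the per-dimension KRR predictors
    """
    target = W.T @ phi_z
    objective = lambda y: float(np.sum((phi_y(y) - target) ** 2))
    # non-convex; the KRR initialization is what makes this tractable
    return minimize(objective, y_init, method="L-BFGS-B").x
```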
The DN relies on a multilayered architecture to estimate the mapping to 3D poses. Three (3) fully-connected layers are used, with the rectified linear unit (ReLU) activation function in the first two (2) layers and a linear activation function in the last layer. The first two layers are made of 3000 neurons each, and the final layer has fifty-one (51) outputs, corresponding to seventeen (17) 3D joint positions. Cross-validation was performed across the network's hyperparameters, and the configuration with the best performance on a validation set was chosen. The squared difference between the prediction and the ground-truth 3D positions was minimized to find the mapping f parametrized by Θ:

$$\Theta^{*} = \arg\min_{\Theta} \sum_{i} \left\| f_{\Theta}(Z_i) - Y_i \right\|_2^2$$
The ADAM [20] gradient update method was used to drive the optimization, with a learning rate of 0.001 and dropout regularization to prevent overfitting. In the results section, it is shown that the proposed DN-based regressor outperforms KRR and KDE [16, 17].
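For illustration purposes, a sketch of such a network and of one training step, written with the PyTorch library, is given below; the dropout rate of 0.5 and the input dimensionality are assumed values, as the text does not specify them:

```python
import torch
import torch.nn as nn

Z_DIM = 15000  # placeholder for the 3D HOG feature vector length

# Three fully-connected layers: 3000 and 3000 hidden units with ReLU,
# then a linear layer with 51 outputs (17 joints x 3 coordinates).
model = nn.Sequential(
    nn.Linear(Z_DIM, 3000), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(3000, 3000), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(3000, 51),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # ADAM, lr = 0.001
loss_fn = nn.MSELoss()  # squared difference to ground-truth 3D positions

def train_step(Z_batch, Y_batch):
    optimizer.zero_grad()
    loss = loss_fn(model(Z_batch), Y_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```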
Next, the present method, device, and system were evaluated experimentally on the Human3.6m of Ionescu, HumanEva-I/II [36], and KTH Multiview Football II [6] datasets. Human3.6m is a recently released large-scale motion capture dataset that comprises 3.6 million images and corresponding 3D poses within complex motion scenarios. Eleven (11) subjects perform fifteen (15) different actions under four (4) different viewpoints. In Human3.6m, different people appear in the training and test data. Furthermore, the data exhibits large variations in terms of body shapes, clothing, poses and viewing angles within and across training/test splits [17]. The HumanEva-I/II datasets provide synchronized images and motion capture data and are standard benchmarks for 3D human pose estimation. Results on the KTH Multiview Football II dataset are further provided to demonstrate the performance of the present method, device, and system in a non-studio environment. In this dataset, the cameraman follows the players as they move around the pitch. Results of the present method are compared against several background art algorithms on these datasets. The baselines were chosen to be representative of the different approaches to 3D human pose estimation discussed above. For those for which the code was not available, the published performance numbers were used, and the present method was run on the corresponding data.
Regarding the evaluation on the Human3.6m dataset, to quantitatively evaluate the performance of the present method, device, and system, the recently released Human3.6m [17] dataset was used first. On this dataset, the regression-based method of [17] performed best at the time and was therefore used as a baseline. That method relies on a Fourier approximation of 2D HOG features using the χ2 comparison metric, and it is herein referred to as “eχ2”.
Li reported results on subjects S9 and S11 of Human3.6m, and Ionescu made their code available. To compare our results to both of these baselines, we therefore trained our regressors and those of Ionescu for fifteen (15) different actions. In the present method, five (5) subjects (S1, S5, S6, S7, S8) were used for training purposes and two (2) (S9 and S11) for testing. Training and testing are carried out in all camera views for each separate action, as described in Ionescu. Recall from the discussion above that 3D body poses are represented by skeletons with seventeen (17) joints, whose 3D locations are expressed relative to that of a root node in the coordinate system of the camera that captured the images.
Table 1 summarizes our results on Human3.6m.
Overall, our method significantly outperforms Ionescu's eχ2 baseline on this dataset.
In the following, the importance of motion compensation and the influence of the temporal window size on pose estimation accuracy are analyzed. To highlight the importance of motion compensation, the features were recomputed without it; this method is referred to as STV. A recent optical flow (OF) algorithm [28] was also tested for motion compensation.
Table 2 shows the results for two actions, which are representative in the sense that the Walking Dog action involves a lot of movement while subjects performing the Greeting action tend not to walk much. Even without motion compensation, regression on the features extracted from spatiotemporal volumes yields better accuracy than the method of Ionescu. Motion compensation significantly improves pose estimation performance as compared to STVs. Furthermore, the CNN-based approach to motion compensation (RSTV) yields higher accuracy than optical-flow based motion compensation [28]. Table 2 therefore demonstrates the importance of motion compensation: it compares the results of Ionescu against those of the present method, device, and system without motion compensation and with motion compensation using either the optical flow (OF) of [28] or the present CNN-based scheme.
Table 3 shows the influence of the size of the temporal window. In this table, the results of Ionescu are compared against those obtained using the present method, RSTV+DN, with increasing temporal window sizes. In these experiments, the effect of changing the size of the temporal windows from twelve (12) to forty-eight (48) frames is reported, again for two representative actions. Using temporal information clearly helps, and the best results are obtained in the range of twenty-four (24) to forty-eight (48) frames, which corresponds to 0.5 to 1 second at 50 fps. When the temporal window is small, the amount of information encoded in the features is not sufficient for accurate estimates. By contrast, with too large a window, overfitting can become a problem, as it becomes harder to account for variation in the input data. Note that a temporal window size of twelve (12) frames already yields better results than the method of Ionescu. In the experiments carried out on Human3.6m, twenty-four (24) frames were used, as this yields both accurate reconstructions and efficient feature extraction.
Next, the present method was further evaluated on the HumanEva-I and HumanEva-II datasets. The baselines considered are the frame-based methods of [4, 9, 15, 22, 39, 38, 44], frame-to-frame tracking approaches that impose dynamical priors on the motion [37, 41], and the tracking-by-detection framework of [2]. The mean Euclidean distance between the ground-truth and predicted joint positions is used to evaluate pose estimation performance. As the size of the training set in HumanEva is too small to train a deep network, RSTV+KDE was used instead of RSTV+DN.
The results shown in Tables 4 and 5 demonstrate that using temporal information earlier in the inference process, in a discriminative bottom-up fashion, yields more accurate results than the above-mentioned approaches that enforce top-down temporal priors on the motion. Table 4 shows 3D joint position errors, in this example in mm, on the Walking and Boxing sequences of HumanEva-I. The results of the present method were compared against methods that rely on discriminative regression [4, 22], 2D pose detectors [38, 39, 44], 3D pictorial structures [3], the CNN-based markerless motion capture method of [9], and methods that rely on top-down temporal priors [37, 41]. ‘-’ indicates that the results are not reported for the corresponding sequences.
For the experiments that were carried out on HumanEva-I, the regressor was trained on the training sequences of Subjects 1, 2, and 3 and evaluated on the “validation” sequences, in the same manner as the baselines compared against [3, 4, 9, 22, 37, 38, 39, 41, 44]. Spatiotemporal features are computed only from the first camera view. In Table 4, the performance of the present method, device, and system is reported on cyclic and acyclic motions, more precisely Walking and Boxing, and example 3D pose estimation results are depicted in the accompanying drawings.
On HumanEva-II, the present method, device, and system was compared against [2, 15], as they report the best monocular pose estimation results on this dataset. HumanEva-II provides only a test dataset and no training data; therefore, the regressors were trained on HumanEva-I using videos captured from different camera views. This demonstrates the generalization ability of the present method, device, and system to different camera views. Following [2], subjects S1, S2 and S3 from HumanEva-I were used for training, and pose estimation results are reported on the first 350 frames of the sequence featuring subject S2. Global 3D joint positions in HumanEva-I are projected to camera coordinates for each view. Spatiotemporal features extracted from each camera view are mapped to 3D joint positions in its respective camera coordinate system, as done in [29]. Whereas [2] uses additional training data from the “People” [30] and “Buffy” [11] datasets, only the training data from HumanEva-I was used. The method was evaluated using the official online evaluation tool. Table 5 shows 3D joint position errors (in mm) on the Combo sequence of the HumanEva-II dataset. The results of the present method were compared against the tracking-by-detection framework of [2] and the recognition-based method of [15]. ‘-’ indicates that the result is not reported for the corresponding sequence. As the comparison in Table 5 shows, the present method, device, and system achieves or exceeds the performance of the background art.
Moreover, the present method, device, and system has been evaluated on the KTH Multiview Football dataset. As in [3, 6], the method was tested on the sequence containing Player 2. The first half of the sequence is used for training and the second half for testing, as in the original work [6]. To compare the results of the present method to those of [3, 6], pose estimation accuracy is reported in terms of the percentage of correctly estimated parts (PCP) score. As in the HumanEva experiments, the results are provided for RSTV+KDE.
Accordingly, in the present application, it has been demonstrated that taking motion information into account very early in the modeling process yields significant performance improvements over doing so a posteriori by linking pose estimates in individual frames. It has been shown that extracting appearance and motion cues from rectified spatiotemporal volumes disambiguates challenging poses involving mirroring and self-occlusion, which brings about a substantial increase in accuracy over the background art methods on several 3D human pose estimation benchmarks. The proposed method is generic across different types of motion and could be applied to other kinds of articulated motion.
While the invention has been disclosed with reference to certain preferred embodiments, numerous modifications, alterations, and changes to the described embodiments, and equivalents thereof, are possible without departing from the spirit and scope of the invention. Accordingly, it is intended that the invention not be limited to the described embodiments, but be given the broadest reasonable interpretation in accordance with the language of the appended claims.
The present application claims priority to the United States provisional patent application with the Ser. No. 62/329,211 that was filed on Apr. 29, 2016, the entire contents of which are herewith incorporated by reference.