Pose estimation, i.e., locating body parts in images, is a computer vision task of increasing importance. Likewise, locating body parts in video, including the body parts of multiple people, has become increasingly desired.
While techniques exist for pose estimation and multi-person pose estimation in video, existing methods are inadequate. Many combine two separate models: they perform pose estimation on each frame, track the estimated results, and only then correct the results using temporal information contained in the video. This makes such methods computationally complicated and limits their running speed.
Embodiments provide a novel deep learning model particularly designed and optimized for video pose estimation, which inherently takes pose estimation results of previous frames as input to refine a new pose estimation of a current frame. Embodiments track estimated poses and make a model, i.e., a trained neural network, insensitive to occlusions. Moreover, embodiments of the present invention apply a backward reconstruction loop and temporal consistency to an objective function to alleviate inconsistent estimation between adjacent frames. This significantly mitigates shaking and vibration phenomena of estimated pose skeletons in video pose estimation.
An example embodiment is directed to a method of identifying joints and limbs in a current frame of video. Such an example embodiment, first, processes the current frame of video to determine initial predictions of joint and limb locations in the current frame. In turn, indications of the joint and limb locations in the current frame are generated by refining the initial predictions of the joint and limb locations based on indications of respective joint and limb locations from a previous frame.
Another embodiment generates an indication of pose for at least one object based upon the indications of the joint and limb locations in the current frame. Embodiments may be used to identify limbs and joints of any type of object. For example, in an embodiment, the indications of the joint and limb locations in the current frame correspond to joints and limbs of at least one of: a human, animal, machine, and robot, amongst other examples. According to an embodiment, the indication of joint locations in the current frame indicates a probability of a joint at each location in the current frame and the indication of limb locations in the current frame indicates a probability of a limb at each location in the current frame. In an example embodiment, the previous frame is adjacent in time to the current frame in the video.
In an embodiment, generating the indications of the joint and limb locations in the current frame comprises processing the initial prediction of joint locations in the current frame and the indications of joint locations from the previous frame with a first deep convolutional neural network to generate the indication of joint locations in the current frame. Further, in such an embodiment, an initial prediction of limb locations in the current frame and the indications of limb locations from the previous frame are processed with a second deep convolutional neural network to generate the indication of limb locations in the current frame.
Another embodiment processes the current frame of video to determine an initial prediction of limb orientation at each initial prediction of limb location in the current frame. Further, such an embodiment generates an indication of limb orientation in the current frame by refining the initial prediction of limb orientation at each initial prediction of limb location in the current frame using indications of limb orientations from the previous frame.
Another embodiment is directed to a computer system for identifying joints and limbs in a current frame of video. The computer system includes a processor and a memory with computer code instructions stored thereon. In such an embodiment, the processor and the memory, with the computer code instructions, are configured to cause the system to identify joints and limbs according to any embodiment described herein.
Yet another embodiment is directed to a computer program product for identifying joints and limbs in a current frame of video. The computer program product comprises one or more non-transitory computer-readable storage devices and program instructions stored on at least one of the one or more storage devices. The program instructions, when loaded and executed by a processor, cause an apparatus associated with the processor to identify joints and limbs in a frame of video as described herein.
An embodiment is directed to a method of training a neural network to identify joints and limbs in a current frame of video. Such a method embodiment performs forward and backward optimization between adjacent frames of video to refine joint location prediction results and limb location prediction results of a neural network. In turn, the neural network is updated based on the refined joint location prediction results and the refined limb location prediction results.
According to an embodiment, performing the forward optimization comprises calculating a loss between (i) joint location prediction results and limb location prediction results generated by the neural network for a frame of video and (ii) a ground truth indication of joint locations and limb locations in the frame of video. Further, according to an embodiment, performing the backward optimization comprises processing, with the neural network, (i) joint location prediction results generated by the neural network for a frame of video, (ii) limb location prediction results generated by the neural network for the frame of video, and (iii) a previous frame to determine an indication of joint locations and an indication of limb locations for the previous frame. Such an embodiment calculates a loss between (i) the determined indication of joint locations and the determined indication of limb locations for the previous frame and (ii) a ground truth indication of joint locations and limb locations for the previous frame.
In yet another embodiment, performing forward and backward optimization between adjacent frames of video to refine joint location prediction results and limb location prediction results of the neural network comprises calculating a temporal consistency loss by calculating a loss between (i) joint location prediction results and limb location prediction results of the neural network for a first frame and (ii) joint location prediction results and limb location prediction results of the neural network for a second frame, wherein the second frame is adjacent to the first frame.
It is noted that embodiments of the method, system, and computer program product may be configured to implement any embodiments described herein.
The foregoing will be apparent from the following more particular description of example embodiments, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments.
A description of example embodiments follows.
Pose estimation, which includes identifying joints and limbs in images and video, aims to estimate the poses of multiple people or other such target objects in a frame of video and has been a long-studied topic in computer vision [1, 6, 14, 9, 3] (bracketed numbers in this document refer to the enumerated list of references hereinbelow). Previous methods for human pose estimation utilized pictorial structures [1] or graphical models [3]. Recently, with the development and application of deep learning models, attempts have been made to utilize deep convolutional neural networks for 2D multi-person pose estimation. These attempts fall into two major categories: top-down methods and bottom-up methods.
Top-down methods first detect persons using a person detector and then apply single-person pose estimation to obtain a pose for each detected person. He et al. [7] extended the Mask-RCNN framework to human pose estimation by predicting a one-hot mask for each body part. Papandreou et al. [11] utilized a Faster RCNN detector to predict person boxes and applied ResNet in a fully convolutional fashion to predict heatmaps for every body part. Fang et al. [5] designed a symmetric spatial transformer network to alleviate the inaccurate bounding box problem.
Existing top-down methods always utilize a separately trained person detector to first detect people in the image. With the knowledge of the detected people, i.e., the bounding boxes of detected persons, top-down methods then do single-person keypoint estimation within each bounding box [7, 11, 5]. The problem with top-down methods is that if the person detection fails, the subsequent keypoint estimation also fails. Further, using two models, e.g., two neural networks, makes top-down methods slower and makes utilizing them for real-time applications difficult.
Bottom-up methods do not utilize person detectors. Instead, bottom-up methods detect all of the body joints in the whole image and then associate those joints with individual persons to form skeletons [12, 2, 10]. In general, bottom-up methods are less accurate than top-down methods. However, bottom-up methods can run faster than top-down methods in multi-person pose estimation because their inference time is largely independent of the number of persons in the image.
Bottom-up methods detect body parts first and then associate the body parts into persons. Insafutdinov et al. [12] proposed an integer linear programming method to solve the body part association problem. Cao et al. [2] introduced Part Affinity Fields to predict the direction and activations of each limb to help associate body parts. Newell et al. [10] utilized predicted pixel-wise embeddings to assign detected body parts into different groups.
Video-based multi-person pose estimation often involves tracking methods as post-processing. The post-processing methods track the detected person across adjacent frames and then track the keypoints of that person to avoid detection failures caused by motion blur and occlusions. Those tracking methods cannot be applied to bottom-up methods because bottom-up methods do not provide any knowledge of a person in each frame. Tracking joints without knowing the movement of a person leads to unsatisfactory results. In video applications, bottom-up methods are applied on each frame, which leads to inconsistent pose estimation across adjacent frames. The inconsistency causes problems like shaking and jumping of keypoint detections.
Embodiments provide functionality for two-dimensional (2D) multi-person pose estimation in video. In embodiments, the pose estimation is formulated as detecting 2D keypoints, e.g., joints and limbs, and connecting the keypoints of the same person into skeletons. Embodiments provide a bottom-up method in multi-person pose estimation. Different from other methods, embodiments directly predict a confidence map for a human skeleton to associate the detected body parts.
Embodiments of the present invention build on a state-of-the-art image-based bottom-up pose estimation method and specially optimize it for video applications to address occluded and inconsistent detections between adjacent frames. To utilize the temporal information contained in the video and to avoid inconsistent detection across frames, embodiments use previous frames to refine the pose estimation result of the current frame. As such, embodiments track the poses across frames and use the determined results, e.g., pose, from a previous frame to refine the results for a current frame. By implementing this functionality, embodiments are resistant to pose occlusions. Moreover, embodiments build a backward path, reconstruct the previous pose estimation refined by the current estimation, and minimize the difference between the previous estimation and the reconstructed estimation. Assuming the movement between two adjacent frames is minor, an embodiment penalizes the difference between the estimation on the previous frame and the estimation on the current frame to stabilize the pose estimation and alleviate shaking and vibration of the predicted poses in the video.
Embodiments (1) utilize the pose estimation results of previous frames to refine the current frame results, so as to track poses and handle occlusions, (2) apply a backward loop to reconstruct the previous pose estimation from the current frame to minimize inconsistent detection, and (3) penalize changes in detection between adjacent frames to avoid shaking and vibration in video pose estimation.
An embodiment of the method 220 further comprises generating an indication of pose for at least one object based upon the indications of the joint and limb locations in the current frame generated 222 for the current frame.
Embodiments of the method 220 may be used to identify limbs and joints of any type of object. For example, in an embodiment, the indications of the joint and limb locations in the current frame correspond to joints and limbs of at least one of: a human, animal, machine, and robot, amongst other examples. Moreover, embodiments may identify limbs and joints for multiple objects, e.g., people, in a frame.
According to an embodiment of the method 220, generating the indications of the joint and limb locations in the current frame 222 includes processing the initial prediction of joint locations in the current frame and the indications of joint locations from the previous frame with a first deep convolutional neural network to generate the indication of joint locations in the current frame. Further, in such an embodiment, the initial prediction of limb locations in the current frame and the indications of limb locations from the previous frame are processed with a second deep convolutional neural network to generate the indication of limb locations in the current frame.
Another embodiment of the method 220 identifies orientation of the identified limbs. Such an embodiment processes the current frame of video to determine an initial prediction of limb orientation at each initial prediction of limb location in the current frame. In turn, an indication of limb orientation in the current frame is generated by refining the initial prediction of limb orientation at each initial prediction of limb location in the current frame using indications of limb orientations from the previous frame. As such, the determination of limb orientation for a current frame is refined using the limb orientation results from a previous frame.
Hereinbelow, a problem formulation for limb and joint identification is provided and a framework for identifying joints and limbs according to an embodiment is described. Additional components of embodiments including joint prediction, limb prediction, backward reconstruction, temporal consistency, neural network training, and applying the trained neural network for video pose estimation are also further elaborated upon.
Problem Formulation
Let $F_i$ be a frame sampled from a video sequence containing $n$ frames $\{F_i\}_{i=1}^{n}$. Let $P_{i,j} = (x_{i,j}, y_{i,j})$ be the multi-person 2D pose keypoint coordinates of the $j$-th person in the frame $F_i$. Given the frames $\{F_i\}_{i=1}^{k}$, where $0 < k \le n$, i.e., the current frame and all previous frames, the target is to estimate the current keypoints $\{P_{k,j}\}_{j=1}^{m}$, where $m$ is the number of persons in the current frame. Moreover, in the embodiment, a deep convolutional neural network model $G$ takes the current frame and the previous frame as input and performs pose estimation, which can be described as $\{P_{k,j}\}_{j=1}^{m} = G(F_k, F_{k-1})$.
An implementation of the neural network follows an image-based 2D bottom-up pose estimation method [2] to estimate a joint heatmap $S_i$ and a limb heatmap $L_i$ and, then, associates the joint and limb heatmaps into keypoint results $P_{i,j}$ using an association method denoted by $M$. Such an embodiment of the method can then be described by $\{P_{k,j}\}_{j=1}^{m} = M(G(F_k, S_{k-1}, L_{k-1}))$.
Framework
In operation, $G$ 331 takes the current frame 332 $F_i$ as input and does an initial estimation using the submodules 333 $G_S^0$ and 334 $G_L^0$ to determine a joint heatmap 335 $S_i^0$ and a limb heatmap 336 $L_i^0$, respectively. In turn, the initial estimations 335 (joints) and 336 (limbs) are refined by the submodule 339 $G_S^R$ and the submodule 340 $G_L^R$ using the previous results 337 $S_{i-1}$ and 338 $L_{i-1}$. The refining by the submodules 339 and 340 produces 341 $S_i$ and 342 $L_i$, which are the joint heatmap 341 and limb heatmap 342 for the frame 332 $F_i$. In the framework 330, 333 $G_S^0$, 339 $G_S^R$, 334 $G_L^0$, and 340 $G_L^R$ are all deep convolutional neural networks. Further, in an embodiment, before being input to 339 $G_S^R$ and 340 $G_L^R$, the maps 335 $S_i^0$, 336 $L_i^0$, 337 $S_{i-1}$, and 338 $L_{i-1}$ are concatenated together in the channel dimension.
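For illustration only, below is a simplified PyTorch-style sketch of this two-stage forward pass. The names PoseRefinementNet and conv_stack, the channel counts, and the shallow submodules are assumptions made for the sketch, not the architecture of Table 1 hereinbelow; only the concatenate-and-refine structure follows the framework 330.

```python
import torch
import torch.nn as nn

def conv_stack(in_ch, out_ch):
    # Placeholder submodule; the actual G_S^0, G_L^0, G_S^R, and G_L^R are
    # deeper convolutional stacks (see Table 1 below).
    return nn.Sequential(
        nn.Conv2d(in_ch, 128, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(128, out_ch, kernel_size=1),
    )

class PoseRefinementNet(nn.Module):
    def __init__(self, feat_ch=128, p=19, q=18):  # p joints, q limbs (assumed counts)
        super().__init__()
        self.gs0 = conv_stack(feat_ch, p)       # initial joint heatmap S_i^0
        self.gl0 = conv_stack(feat_ch, 2 * q)   # initial limb heatmap L_i^0
        # The refiners take the initial maps plus the previous frame's results,
        # concatenated along the channel dimension.
        in_ch = 2 * p + 4 * q
        self.gsr = conv_stack(in_ch, p)         # refined joint heatmap S_i
        self.glr = conv_stack(in_ch, 2 * q)     # refined limb heatmap L_i

    def forward(self, feats, s_prev, l_prev):
        s0, l0 = self.gs0(feats), self.gl0(feats)
        stacked = torch.cat([s0, l0, s_prev, l_prev], dim=1)
        return self.gsr(stacked), self.glr(stacked), s0, l0
```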
Joint Heatmap Prediction
In the joint heatmap prediction, the proposed framework, e.g., the framework 330, generates a confidence map, e.g., 341, which is the probability of joints appearing at each location of the input image, e.g., the frame 332. For an input image of size H×W×3, the corresponding joint heatmap S will be of size H×W×p, where H and W are the height and width of the input image, and p is the number of joints.
To prepare a ground-truth heatmap prediction, e.g., the ground-truth predictions 441a and 442a discussed hereinbelow in relation to FIG. 4, an embodiment places a Gaussian peak at each labeled joint location, which can be expressed as:

$$S_i^{*l}(x) = \max_j \exp\left(-\frac{\lVert x - P_{i,j}^l \rVert_2^2}{2\sigma^2}\right) \quad (1)$$

where $P_{i,j}^l$ is the keypoint of the $l$-th joint of the $j$-th person in the $i$-th frame, and $\sigma$ is the standard deviation of the Gaussian distribution.
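As an illustrative sketch, the ground-truth map of equation (1) can be rasterized as follows; the function name and the value of sigma are assumptions:

```python
import numpy as np

def make_joint_heatmap(keypoints, height, width, sigma=7.0):
    """keypoints: array of shape (m, p, 2) -- m persons, p joints, (x, y) each.
    Returns a (height, width, p) map with a Gaussian peak per labeled joint."""
    ys, xs = np.mgrid[0:height, 0:width]
    m, p, _ = keypoints.shape
    heatmap = np.zeros((height, width, p), dtype=np.float32)
    for j in range(m):
        for l in range(p):
            x, y = keypoints[j, l]
            g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            heatmap[:, :, l] = np.maximum(heatmap[:, :, l], g)  # max over persons
    return heatmap
```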
An embodiment employs the idea of intermediate supervision such that the joint heatmap predictions output from $G_S^0$ and $G_S^R$ are compared with the ground-truth heatmap using an L2 loss function, which can be expressed as follows:

$$\mathcal{L}_{joint} = \lVert S_i^0 - S_i^* \rVert_2^2 + \lVert S_i - S_i^* \rVert_2^2 \quad (2)$$
In an embodiment, when minimizing the above joint prediction loss, the submodules $G_S^0$ and $G_S^R$ are trained to output the confidence maps of the joint predictions for given images, i.e., frames.
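A minimal sketch of this loss, assuming heatmap tensors of shape (batch, p, H, W) and an optional label mask (masks are discussed further hereinbelow):

```python
import torch.nn.functional as F

def joint_loss(s0, s_refined, s_gt, mask=None):
    # Both the initial and refined heatmaps are supervised against the same
    # ground truth (intermediate supervision).
    if mask is not None:  # optional binary mask over labeled regions
        s0, s_refined, s_gt = s0 * mask, s_refined * mask, s_gt * mask
    return F.mse_loss(s0, s_gt) + F.mse_loss(s_refined, s_gt)
```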
Limb Prediction
For limb prediction, an embodiment predicts a vector field indicating the position and orientation of limbs in given frames. The prediction can also be seen as a confidence map with size H×W×2q, where q is the number of limbs defined. To prepare the ground-truth confidence map for limb prediction, e.g., the ground-truth predictions 441b and 442b discussed hereinbelow in relation to FIG. 4, an embodiment first defines a limb region for each limb.
The limb region comprises all of the points in a rectangle whose distance $d$ from the given limb is within a threshold $\theta$, which represents half the width of the limb, i.e.,

$$R_{i,j}^{c} = \{\, x \mid d(x, \overline{P_{i,j}^{l_1} P_{i,j}^{l_2}}) \le \theta \,\} \quad (3)$$

where $P_{i,j}^{l_1}$ and $P_{i,j}^{l_2}$ are the two joints of limb $c$. Within the limb region, an embodiment fills each location in the limb region with the normalized vector of the limb denoted as:

$$L_i^{*c}(x) = \frac{P_{i,j}^{l_2} - P_{i,j}^{l_1}}{\lVert P_{i,j}^{l_2} - P_{i,j}^{l_1} \rVert_2}, \quad x \in R_{i,j}^{c} \quad (4)$$
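A sketch of rasterizing one limb's region and direction vector per the rule above; the function name and the default theta are assumptions:

```python
import numpy as np

def fill_limb_field(p1, p2, height, width, theta=4.0):
    """p1, p2: (x, y) endpoints of one limb. Returns a (height, width, 2) field."""
    field = np.zeros((height, width, 2), dtype=np.float32)
    limb = np.asarray(p2, np.float32) - np.asarray(p1, np.float32)
    length = np.linalg.norm(limb)
    if length < 1e-6:
        return field
    v = limb / length                                       # unit direction vector
    ys, xs = np.mgrid[0:height, 0:width]
    rel = np.stack([xs - p1[0], ys - p1[1]], axis=-1)       # vectors from p1
    along = rel @ v                                         # projection along limb
    perp = np.abs(rel[..., 0] * v[1] - rel[..., 1] * v[0])  # distance d from limb
    inside = (along >= 0) & (along <= length) & (perp <= theta)
    field[inside] = v
    return field
```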
Similar to joint prediction, an embodiment calculates an L2 loss between the predicted limb heatmaps and the ground-truth limb heatmap $L_i^*$:

$$\mathcal{L}_{limb} = \lVert L_i^0 - L_i^* \rVert_2^2 + \lVert L_i - L_i^* \rVert_2^2 \quad (5)$$
Backward Reconstruction
Embodiments introduce a backward loop that reconstructs the joint heatmap and limb heatmap of the previous frame from the prediction for the current frame, to increase the accuracy and robustness of inter-frame prediction. In detail, one such example embodiment inputs the current prediction and the previous frame to the neural network and predicts the joint heatmap and limb heatmap of the previous frame, $\hat{S}_{i-1}, \hat{L}_{i-1} = G(F_{i-1}, S_i, L_i)$. Then, such an embodiment compares the prediction with the ground truth and calculates reconstruction losses for the joint heatmap and limb heatmap, which can be expressed as follows:

$$\mathcal{L}_{rec}^{S} = \lVert \hat{S}_{i-1} - S_{i-1}^* \rVert_2^2 \quad (6)$$

$$\mathcal{L}_{rec}^{L} = \lVert \hat{L}_{i-1} - L_{i-1}^* \rVert_2^2 \quad (7)$$
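A sketch of this backward pass, reusing the hypothetical PoseRefinementNet from the Framework sketch above:

```python
import torch.nn.functional as F

def reconstruction_loss(model, feats_prev, s_cur, l_cur, s_prev_gt, l_prev_gt):
    # Feed the previous frame's features together with the *current* refined
    # predictions, and ask the network to reconstruct the previous frame's maps.
    s_rec, l_rec, _, _ = model(feats_prev, s_cur, l_cur)
    return F.mse_loss(s_rec, s_prev_gt) + F.mse_loss(l_rec, l_prev_gt)
```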
Temporal Consistency
To mitigate the shaking and vibration caused by inconsistent detection between adjacent frames, an embodiment penalizes the difference between the predictions generated for adjacent frames, assuming that the frame rate is high enough that the inter-frame movement is relatively small. Such an embodiment introduces a temporal consistency loss, which is the L2 loss between the predictions of adjacent frames, using the following equations:

$$\mathcal{L}_{temp}^{S} = \lVert S_i - S_{i-1} \rVert_2^2 \quad (8)$$

$$\mathcal{L}_{temp}^{L} = \lVert L_i - L_{i-1} \rVert_2^2 \quad (9)$$
By minimizing the temporal consistency loss, such an embodiment minimizes the difference between two adjacent frames and obtains a stable prediction with minimum shaking and vibration.
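Under the same tensor-shape assumptions as the earlier sketches, the temporal consistency loss is a direct L2 penalty between the refined maps of adjacent frames:

```python
import torch.nn.functional as F

def temporal_consistency_loss(s_cur, l_cur, s_prev, l_prev):
    # Penalize changes between adjacent frames' refined maps; valid when
    # inter-frame motion is small relative to the frame rate.
    return F.mse_loss(s_cur, s_prev) + F.mse_loss(l_cur, l_prev)
```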
To perform the forward optimization, a current frame 443, a ground-truth indication of joint locations 442a, and a ground-truth indication of limb locations 442b for a frame prior to the frame 443 (e.g., the frame 449) are processed by the neural network 444 to determine the indication of joint locations 445 and limb locations 446 for the frame 443. In turn, the loss 447 between (i) the joint location prediction results 445 and limb location prediction results 446 generated by the neural network 444 for the frame of video 443 and (ii) a ground-truth indication of joint locations 441a and limb locations 441b in the frame of video 443 is calculated. Further, the loss 447 may be calculated with the binary mask 448, which masks out unlabeled regions in the frame 443. According to an embodiment, in the dataset (e.g., a dataset used to train the neural network 444), not every single person has a label. As such, embodiments may output joint and limb predictions for unlabeled persons in the video. However, those predictions do not have any ground-truth label with which to calculate losses. Thus, embodiments may use the masks 448 and 453 to mask out those unlabeled persons; the masks 448 and 453 serve to disable those unlabeled areas when calculating the losses. According to an embodiment, the loss 447 is calculated as described hereinabove in relation to equations 2 and 5.
To perform the backward optimization in the framework 440, the neural network 444 processes (i) joint location prediction results 445 generated by the neural network 444 for the frame of video 443, (ii) limb location prediction results 446 generated by the neural network 444 for the frame of video 443, and (iii) a previous frame 449, to determine an indication of joint locations 450 and an indication of limb locations 451 for the previous frame 449. Then, the loss 452 is calculated. The loss 452 is the loss between (i) the determined indication of joint locations 450 and the determined indication of limb locations 451 for the previous frame 449 and (ii) a ground truth indication of joint locations 442a and limb locations 442b for the previous frame 449. Further, the loss 452 may be calculated with the binary mask 453 masking out unlabeled regions in the frame 449. According to an embodiment, the loss 452 is calculated as described hereinabove in relation to equations 6 and 7.
The framework 440 is also used to calculate the temporal consistency loss 454. The temporal consistency loss 454 is the loss between (i) the joint location prediction results 445 and limb location prediction results 446 of the neural network 444 for a first frame 443 and (ii) joint location prediction results 450 and limb location prediction results 451 of the neural network 444 for a second frame 449, wherein the second frame 449 is adjacent to the first frame 443. In an embodiment, the temporal consistency loss 454 is calculated as described hereinabove in relation to equations 8 and 9.
In an embodiment, the losses 447, 454, and 452 are used in the framework 440 to update and train the neural network 444. These losses may be combined as in Equation 10 described hereinbelow. The losses 447, 454, and 452 are indications of errors in estimating the joint and limb locations. By minimizing these losses via the optimization process during training, the network 444 is trained to estimate the locations of human body joints and limbs more accurately. The optimization is done by mathematically updating the neural network 444 by descending the gradient of the overall objective. More detail on training the network can be found below.
Overall Objectives
In an embodiment, there is an overall objective function combining the current prediction loss, the reconstruction loss, and the temporal consistency loss to optimize the proposed video 2D pose estimation neural network, which is denoted as

$$\mathcal{L} = \mathcal{L}_{joint} + \mathcal{L}_{limb} + \lambda_{rec}\left(\mathcal{L}_{rec}^{S} + \mathcal{L}_{rec}^{L}\right) + \lambda_{temp}\left(\mathcal{L}_{temp}^{S} + \mathcal{L}_{temp}^{L}\right) \quad (10)$$

where $\lambda_{rec}$ and $\lambda_{temp}$ are hyper-parameters which control the relative weights of the reconstruction loss and the temporal consistency loss in the overall objective function. In an example implementation, $\lambda_{rec} = 0.1$ and $\lambda_{temp} = 0.05$.
Below is a method for training the neural network for video 2D multi-person pose estimation with multi-frame refinement:

Initialize network parameters $\theta_G$;
While $\theta_G$ has not converged do
  Sample a pair of adjacent frames and keypoints $\{(F_i, P_i), (F_{i-1}, P_{i-1})\}$ from the data distribution $p_{data}(F, P)$;
  Prepare ground-truth joint heatmaps $S_i^*, S_{i-1}^*$ and limb heatmaps $L_i^*, L_{i-1}^*$;
  Predict initial joints and limbs for both frames: $S_i^0, L_i^0$ and $S_{i-1}^0, L_{i-1}^0$;
  Refine current frame results using previous frame ground truth: $S_i, L_i = G(F_i, S_{i-1}^*, L_{i-1}^*)$;
  Refine previous frame results using current frame ground truth: $S_{i-1}, L_{i-1} = G(F_{i-1}, S_i^*, L_i^*)$;
  Reconstruct previous frame results using current frame prediction: $\hat{S}_{i-1}, \hat{L}_{i-1} = G(F_{i-1}, S_i, L_i)$;
  Calculate the loss functions (equations 1-10);
  Update $G$ by descending the gradient of the overall objective: $\theta_G \leftarrow \theta_G - \eta\,\nabla_{\theta_G}\mathcal{L}$;
Output: Converged model parameters $\theta_G$.
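For illustration, a condensed sketch of one such optimization step, building on the hypothetical model and loss helpers sketched above; the weights $\lambda_{rec} = 0.1$ and $\lambda_{temp} = 0.05$ follow the example implementation, while the optimizer, learning rate, and helper names are assumptions:

```python
import torch

opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer/lr are assumptions

def train_step(f_cur, f_prev, s_gt_cur, l_gt_cur, s_gt_prev, l_gt_prev,
               lam_rec=0.1, lam_temp=0.05):
    # f_cur, f_prev: backbone features of the adjacent frames.
    # Refine each frame using the *other* frame's ground-truth maps.
    s_cur, l_cur, s0_cur, l0_cur = model(f_cur, s_gt_prev, l_gt_prev)
    s_prev, l_prev, _, _ = model(f_prev, s_gt_cur, l_gt_cur)
    # Prediction losses (equations 2 and 5); the limb terms reuse joint_loss
    # since both are plain L2 terms with intermediate supervision.
    pred = (joint_loss(s0_cur, s_cur, s_gt_cur)
            + joint_loss(l0_cur, l_cur, l_gt_cur))
    # Backward reconstruction (equations 6-7) and temporal consistency (8-9).
    rec = reconstruction_loss(model, f_prev, s_cur, l_cur, s_gt_prev, l_gt_prev)
    temp = temporal_consistency_loss(s_cur, l_cur, s_prev, l_prev)
    loss = pred + lam_rec * rec + lam_temp * temp   # overall objective (10)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```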
Training Method Embodiment
In the training phase, first, the data used for training the model is prepared. Then, a pair of adjacent frames with their ground-truth keypoints $(F_i, P_i)$ and $(F_{i-1}, P_{i-1})$ is randomly sampled from the data distribution $p_{data}(F, P)$. $F_i$ is of size H×W×3, where H and W are the height and width of the frames. $P_i$ is of size $m_i \times p \times 2$, where $m_i$ is the number of people in the frame and $p$ is the number of joints. For each type of joint, a Gaussian response is put in the joint heatmap $S_i$ for each person in $P_i$. In turn, $S_i$ with size H×W×p is obtained. The limbs are defined as the regions between joints with a width within a threshold $\theta$. For each limb region, such an embodiment fills each location with the limb direction denoted by a 2D normalized vector. Then, a limb map of size H×W×2q is formed. $S_i$ and $L_i$ are downsampled to sizes H/4×W/4×p and H/4×W/4×2q using nearest-neighbor interpolation. After preparing the input frames and the ground-truth joint and limb heatmaps, the variables are fed to the framework and the overall objectives are calculated. The network $G$ is continuously updated by descending the gradient of the overall objective using new pairs of data sampled from $p_{data}(F, P)$.
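For example, the 4× nearest-neighbor downsampling of the ground-truth maps can be sketched as follows (channels-first tensors assumed):

```python
import torch.nn.functional as F

def downsample_maps(s_gt, l_gt):
    # s_gt: (B, p, H, W), l_gt: (B, 2q, H, W) -> H/4 x W/4, matching the
    # network output resolution after two max-pooling stages.
    s_small = F.interpolate(s_gt, scale_factor=0.25, mode="nearest")
    l_small = F.interpolate(l_gt, scale_factor=0.25, mode="nearest")
    return s_small, l_small
```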
Network Architecture
Table 1 below shows an example network architecture of a proposed pose estimation neural network that may be used in an embodiment.
The example deep convolutional neural network comprises a backbone and four submodules, as shown in Table 1. In Table 1, N=Number of filters, K=Kernel size, S=Stride, P=Padding, RELU=Rectified Linear Units, MAXPOOL2d=Max pooling operation for spatial data, p=Number of joints, and q=Number of limbs.
The backbone is a VGG [13] style neural network used to extract pretrained features from a given frame. In an embodiment, the backbone is pretrained on the ImageNet dataset [4] and fine-tuned in a pose estimation application. In the backbone, the input frame is downsampled twice with MAXPOOL2d layers, which reduces the height and width by 4 times when outputting the joint heatmap and limb heatmap. The backbone network is followed by an initial joint prediction submodule $G_S^0$ and an initial limb prediction submodule $G_L^0$, which take the output of the backbone as their inputs and predict their results. After that, the prediction results are refined by the two refinement submodules $G_S^R$ and $G_L^R$, which utilize multi-frame refinement to improve the accuracy and consistency of the prediction results. Embodiments provide a neural network that is lightweight and runs quickly on devices, such as GPU-enabled devices. To further speed up operation, in an embodiment, the convolutional layers can be replaced by pairs of equivalent depthwise convolution layers and pointwise convolution layers, such as in the architecture proposed in MobileNet [8].
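A sketch of that depthwise/pointwise factoring in the MobileNet [8] style; the helper name is illustrative:

```python
import torch.nn as nn

def separable_conv(in_ch, out_ch, kernel_size=3, padding=1):
    # A standard 2D convolution can be factored into a depthwise convolution
    # (groups=in_ch) followed by a 1x1 pointwise convolution, trading a small
    # accuracy cost for far fewer multiply-adds.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )
```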
In operation, G 551 takes the current frame 556 as input and does an initial estimation using the submodules 552 and 553 to determine a joint heatmap 557 and a limb heatmap 558. In turn, the initial estimations 557 and 558 are refined by the submodule 554 and the submodule 555 using the initial estimations themselves, 557 and 558. The refining by the submodules 554 and 555 produces 559 and 560, which are the estimations of the joint heatmap and limb heatmap of the frame 556. In this way, the system 550 implements self-refinement. In the framework 550, the submodules 552, 553, 554, and 555 are all deep convolutional neural networks.
The system 550 continues, and using the pose association module 561, constructs the one or more skeletons 562 in the frame 556 using both the joint prediction 559 and limb prediction 560. An embodiment may use pose association methods known in the art to assemble joints and limbs into skeletons.
In operation, G 661 takes the current frame 666 as input and does an initial estimation using the submodule 662 and the submodule 663 to determine a joint heatmap 667 and a limb heatmap 668. In turn, the initial estimations 667 and 668 are refined by the submodule 664 and the submodule 665 using the joint estimation 673, i.e., heatmap, and limb estimation 674 from a previous frame of video. The refining by the submodules 664 and 665 produces the estimation of the joint heatmap 669 and the estimation of the limb heatmap 670 for the frame 666. In this way, the system 660 refines the current estimation results 667 and 668 using the results 673 and 674 from a previous frame. In an embodiment, the refinement is done by the trained network 661, which includes the submodules 662, 663, 664, and 665. This refinement can handle difficult cases in video pose estimation, such as motion blur and occlusion. The refinement can also reduce the shaking and vibration of estimated results. In the framework 660, the submodules 662, 663, 664, and 665 are all deep convolutional neural networks.
The system 660 continues, and using the pose association module 671, constructs the one or more skeletons 672 in the frame 666 using both the joint prediction 669 and limb prediction 670.
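For illustration, a sketch of running the trained network frame by frame under the assumptions of the earlier PoseRefinementNet sketch; associate_poses stands in for a pose association module, such as the module 671, and is hypothetical:

```python
import torch

@torch.no_grad()
def estimate_video(model, frame_features, p=19, q=18):
    """frame_features: iterable of backbone feature maps, one per frame."""
    results = []
    s_prev = l_prev = None
    for feats in frame_features:
        if s_prev is None:
            # First frame: no history yet, so self-refine with the initial
            # estimates (feed zero maps once to obtain S^0 and L^0).
            b, _, h, w = feats.shape
            zeros_s = feats.new_zeros(b, p, h, w)
            zeros_l = feats.new_zeros(b, 2 * q, h, w)
            _, _, s0, l0 = model(feats, zeros_s, zeros_l)
            s_prev, l_prev = s0, l0
        s_cur, l_cur, _, _ = model(feats, s_prev, l_prev)
        results.append(associate_poses(s_cur, l_cur))  # hypothetical association step
        s_prev, l_prev = s_cur, l_cur                  # carry refined maps forward
    return results
```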
Embodiments provide a novel deep learning model particularly optimized for video 2D multi-person pose estimation applications. Embodiments introduce multi-frame refinement and optimization to the bottom-up pose estimation method. The multi-frame refinement and optimization include a novel method of tracking, backward reconstruction, and temporal consistency. Multi-frame refinement enables the pose estimation model to track poses and handle occlusions. Backward reconstruction and temporal consistency minimize inconsistent detection, which mitigates the shaking and vibration and improves the robustness in video pose estimation applications.
Using multi-frame refinement as described herein can be considered an equivalent process to tracking. Tracking is a method to refine results by considering the temporal movement of objects in the video. Traditional approaches use the final output results of pose estimation to do tracking based on statistical assumptions. Tracking methods often stabilize the estimation results and improve the accuracy. Embodiments train the neural network to learn the movement of human bodies by feeding the neural network with previous frames. Then, the neural network can track the poses from previous frames and estimate the current poses more accurately, even under occlusions. Embodiments can also enforce temporal consistency between adjacent frames to stabilize the results. As such, embodiments can provide tracking by multi-frame refinement.
Embodiments tackle a video-based multi-person pose estimation problem using a deep learning framework with multi-frame refinement and optimization. In a particular embodiment, a method inherently tracks estimated poses and makes a model insensitive to occlusions. The method may apply a backward reconstruction loop and temporal consistency to an objective function, which mitigates inter-frame inconsistency and significantly reduces shaking and vibration phenomena of estimated pose skeletons in video pose estimation.
An embodiment of the invention utilizes pose estimation results of previous frames to refine a current frame result to track poses and handle occlusions. An embodiment of the invention applies a backward loop to reconstruct a previous pose estimation from a current frame to improve robustness and minimize inconsistent estimation. An embodiment of the invention introduces a temporal consistency loss that penalizes on temporal changes in detection between adjacent frames to avoid shaking and vibration in video pose estimation.
Embodiments generate more accurate and robust pose estimations than existing methods. An embodiment tracks multi-person human poses in videos and handles occlusions. Embodiments output pose estimations with temporal consistency across frames, which avoids shaking and vibration in video pose estimation. Embodiments are also computationally less expensive than other pose estimation methods that require extra tracking modules.
Embodiments can be applied in detecting human behaviors in monitoring systems. Embodiments can be applied in video games to use human body movement as input, such as Xbox® Kinect®. Embodiments can be applied in many interesting mobile apps that require human body movement as input such as personal fitting and training.
Video-based multi-person pose estimation often involves tracking methods to improve estimation accuracy by utilizing the temporal information in videos. The tracking methods track a detected person across adjacent frames and then track the keypoints of that person to avoid detection failures due to motion blur and occlusions. Those tracking methods cannot be applied to bottom-up methods since bottom-up methods do not provide any knowledge of the person in each frame. Tracking a person's joints (e.g., elbows, shoulders, knees) without knowing the movement of the person leads to unsatisfactory results. In video applications, pose estimation is applied frame by frame, which leads to inconsistent pose estimation across adjacent frames. The inconsistency causes problems like shaking and jumping of keypoint detections.
To solve the above problems, an embodiment of the invention provides a video multi-person pose estimation method, built on a state-of-the-art image-based bottom-up method, that is specially optimized for video applications to solve the inconsistent detection between adjacent frames. To utilize the temporal information contained in the video and avoid inconsistent detection across frames, a previous frame is used to refine a pose estimation result of a current frame. An embodiment tracks the persons' poses across frames to handle occlusions. Another embodiment builds a backward path, reconstructs a previous pose estimation refined by a current estimation, and penalizes inconsistency between adjacent pose estimations. Moreover, assuming the movement between two adjacent frames is minor, an embodiment also penalizes the difference between the estimation on a previous frame and the estimation on a current frame to stabilize the pose estimation and alleviate shaking and vibration of the estimated poses in videos. With the above techniques, embodiments establish robust and stable multi-person pose estimation that can be deployed in many applications that require human pose input.
In an embodiment, the input joint locations are results from the previous frame. The neural network takes the estimation from the previous frame to help estimate the joint locations of the current frame. The refined results here refer to the results of the current frame. By comparing the results with the ground-truth locations, an embodiment can update the network to correctly predict the joint locations of current frames.
It should be understood that the example embodiments described herein may be implemented in many different ways. In some instances, the various methods and systems described herein may each be implemented by a physical, virtual, or hybrid general-purpose computer, such as the computer system 770, or a computer network environment such as the computer environment 880, described hereinbelow in relation to FIG. 7 and FIG. 8, respectively.
Embodiments or aspects thereof may be implemented in the form of hardware, firmware, or software. If implemented in software, the software may be stored on any non-transitory computer-readable medium that is configured to enable a processor to load the software or subsets of instructions thereof. The processor then executes the instructions and is configured to operate or cause an apparatus to operate in a manner as described herein.
Further, firmware, software, routines, or instructions may be described herein as performing certain actions and/or functions of the data processors. However, it should be appreciated that such descriptions contained herein are merely for convenience and that such actions in fact result from computing devices, processors, controllers, or other devices executing the firmware, software, routines, instructions, etc.
It should be understood that the flow diagrams, block diagrams, and network diagrams may include more or fewer elements, be arranged differently, or be represented differently. But it further should be understood that certain implementations may dictate the block and network diagrams and the number of block and network diagrams illustrating the execution of the embodiments be implemented in a particular way.
Accordingly, further embodiments may also be implemented in a variety of computer architectures, physical, virtual, cloud computers, and/or some combination thereof, and thus, the data processors described herein are intended for purposes of illustration only and not as a limitation of the embodiments.
The teachings of all patents, published applications and references cited herein are incorporated by reference in their entirety.
While example embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the embodiments encompassed by the appended claims.
This application claims the benefit of U.S. Provisional Application No. 62/848,358, filed on May 15, 2019. The entire teachings of the above application are incorporated herein by reference.
Filing Document: PCT/US2020/032595 | Filing Date: May 13, 2020 | Country: WO