The present invention relates to the field of human behavior understanding, in particular to an interactive behavior understanding method for posture reconstruction based on features of skeleton and image.
In the existing technology, commonly used methods for human behavior understanding comprise behavior understanding algorithms based on human body posture estimation and target detection algorithms based on image information. The advantage of a human body posture classification algorithm that relies on human skeleton key points is that the skeleton key point information removes redundant noise from the image and preserves pure behavior information; however, completely discarding the image information causes the loss of effective information. The target detection algorithm relies on images and can therefore obtain sufficient image features and human body features, but the images contain a large amount of noise interference, which is not conducive to behavior understanding.
The model can quickly and accurately extract complete human skeleton information through the lightweight improvement of the OpenPose algorithm, occlusion prediction, and a three-dimensional human body posture estimation algorithm. However, algorithms that rely solely on human skeleton information do not perform well on interactive behavior: they easily misjudge ‘human-object’ interaction behaviors such as playing badminton or tennis, reading with both hands, and holding a water cup with both hands. Meanwhile, the performance on ‘human-human’ interaction behaviors such as stealing, fighting, and hugging is still poor when only skeleton data is used to distinguish them. The reason is that pure skeleton data completely abandons the image features; that is, the environmental perception ability of the model is not considered.
In order to comprehensively utilize the advantages of skeleton features and image features, and to enhance the model's environmental perception ability and interactive behavior understanding, it is necessary to propose an interactive behavior understanding method for posture reconstruction based on features of skeleton and image, which can quickly and accurately extract effective image features and further improve the accuracy of the model.
The objective of the present invention is to provide an interactive behavior understanding method for posture reconstruction based on features of skeleton and image. The method not only retains the purity of skeleton features for extracting human behavior information, but also uses image features to retain effective image information such as the environment, thereby further complementing the model's feature information. The skeleton features are extracted by a graph convolution network, which increases the relevance of the input skeleton point information and obtains accurate skeleton features, while effective image features are extracted quickly and accurately through a Vision Transformer network combined with the multi-head attention mechanism.
In order to achieve the above objective, the present invention provides an interactive behavior understanding method for posture reconstruction based on features of skeleton and image, the specific steps are as follows:
Preferably, in step S1, the construction and preprocessing of the data set comprise:
S11, construction of the data set: for the extraction of skeleton features, firstly, two-dimensional skeleton information of the human body is extracted via the improved OpenPose algorithm, and then complete three-dimensional human skeleton data is generated as the skeleton data via an occlusion prediction network and three-dimensional human body posture estimation.
Preferably, in step S11 construction of the data set, the steps of a three-dimensional human body posture estimation algorithm in the case of occlusion are as follows:
Preferably, in step S2, the steps of the extraction of skeleton features are as follows:
S21, skeleton features weight network: for the three-dimensional posture data input in step S1, a basic initial weight distribution is performed, and the attention weight is set through a normalized activation function, with the specific formula as follows:
w_ij = v · α_ij
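The weight assignment above can be sketched as follows. This is a minimal illustration that assumes the attention coefficients α_ij are obtained by a softmax normalization over each node's raw scores (the specific normalized activation is not fixed here) and that v is a learnable scaling parameter:

```python
import numpy as np

def attention_weights(scores, v=1.0):
    """Normalize raw attention scores per row with a softmax,
    then scale by the learnable parameter v: w_ij = v * alpha_ij."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # numerically stable softmax
    alpha = e / e.sum(axis=1, keepdims=True)
    return v * alpha

scores = np.array([[1.0, 2.0, 3.0],
                   [0.5, 0.5, 0.5]])
w = attention_weights(scores, v=2.0)
print(w.sum(axis=1))  # each row sums to v = 2.0
```

Because the softmax rows sum to one, scaling by v keeps the relative weighting among skeleton points while letting the network learn an overall magnitude.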
S22, graph convolution network: the convolution layer operation is obtained via the convolution of a signal x with a signal g, where the signal x denotes the input graph information and the signal g denotes the convolution kernel; the convolution of the two is obtained via the Fourier transform, where the function F denotes the Fourier transform, which maps a signal to the Fourier domain, as shown below:
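The spectral convolution just described can be sketched numerically as follows; this is a minimal illustration assuming the graph Fourier transform F is realized by the eigenvector matrix U of the normalized graph Laplacian, so that F(x) = Uᵀx and F⁻¹(x̂) = Ux̂:

```python
import numpy as np

def graph_conv(x, g, A):
    """Spectral graph convolution x * g = F^{-1}(F(x) ⊙ F(g)),
    where F maps a graph signal to the Fourier (spectral) domain
    via the eigenvectors U of the normalized graph Laplacian."""
    d = A.sum(axis=1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))  # normalized Laplacian
    _, U = np.linalg.eigh(L)                          # graph Fourier basis
    return U @ ((U.T @ g) * (U.T @ x))                # convolution theorem

# A tiny 3-node chain graph with a signal x and kernel g (illustrative values)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
x = np.array([1.0, 2.0, 3.0])
g = np.array([1.0, 0.5, 0.25])
y = graph_conv(x, g, A)
print(y.shape)  # (3,)
```

In practice a graph convolution network learns the kernel in the spectral domain (or a polynomial approximation of it) rather than a fixed g, but the transform-multiply-invert structure is the same.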
Preferably, in step S3 image features extraction, each encoder is composed of two sub-modules: a multi-head attention module and a feedforward neural network module, as shown below:
Preferably, in step S4 fusion and reconstruction of features, the Wide module consists of a linear module y=wTx+b, where x denotes an input feature vector in the form of x=[x1, x2, . . . , xn], w=[w1, w2, . . . , wn] is a model training parameter, and b denotes a model bias term; the input fusion features comprise original input feature vectors and transformed feature vectors, where the transformed features are obtained by cross product transformation, as shown below, where cki denotes a Boolean variable, that is, if the i-th feature is part of the k-th transformation φk, then cki is 1, otherwise it is 0:
Preferably, in step S5 experimental evaluation and validation, the model training environment is established in the Windows 10 environment, using CUDA 10.1 to set up the GPU environment for training, and Python 3.6.5 as the interpreter.
Therefore, the present invention adopts the above-mentioned interactive behavior understanding method for posture reconstruction based on features of skeleton and image. The method not only retains the purity of skeleton features for extracting human behavior information, but also uses image features to retain effective image information such as the environment, thereby further complementing the model's feature information. The skeleton features are extracted by a graph convolution network, which increases the relevance of the input skeleton point information and obtains accurate skeleton features, while effective image features are extracted quickly and accurately through a Vision Transformer network combined with the multi-head attention mechanism.
Further detailed descriptions of the technical scheme of the present invention can be found in the accompanying drawings and embodiments.
The technical scheme of the present invention is further explained below by drawings and embodiments.
Wherein, in step S11 construction of the data set, the steps of a three-dimensional human body posture estimation algorithm in the case of occlusion are as follows:
In order to make the occlusion prediction universally applicable and adaptable to different individuals and multiple target behaviors, the present invention uses the image data in the COCO human body posture data set, divides it into multiple actions to extract the key points of human skeletons via the improved OpenPose algorithm, and saves the complete human skeleton key point data as a training data set. As shown in
The Human3.6M data set is by far the largest public data set for three-dimensional human body posture estimation. It was collected from eleven professional actors performing seventeen actions, such as walking, making phone calls, and participating in discussions, for a total of 3.6 million samples. The data acquisition setup uses 4 video cameras and 10 motion-capture cameras, and the shooting area is 12 square meters, wherein the four video cameras shoot from different angles to provide video data from different perspectives, and the coordinate data of the three-dimensional human skeleton key points is collected by the motion capture device. Part of the video data in Human3.6M is shown in
In order to ensure the consistency between the data of the Human3.6M data set and the OpenPose algorithm structure, it is necessary to preprocess the data and align the positional relationship of different skeleton points. The skeleton point correspondence between the two is shown in the following table.
After obtaining the two-dimensional skeleton data, a nonlinear model is established to learn the mapping relationship between two-dimensional data and three-dimensional data. The input of the nonlinear network is designed as two-dimensional human body posture data X ∈ ℝ^(2n), the network output takes the form Y ∈ ℝ^(3n), and the learning function of the nonlinear network is G*: X ∈ ℝ^(2n) → Y ∈ ℝ^(3n). The purpose of minimizing the mean square error between the network predicted result and the real result is achieved by optimizing the model parameters, as follows, where ξ denotes the loss function, here the mean square error loss function:
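A minimal sketch of such a nonlinear mapping network follows, assuming (for illustration only) a single hidden layer with ReLU and n = 17 joints, so the input is 2n-dimensional, the output is 3n-dimensional, and ξ is the mean square error:

```python
import numpy as np

n = 17                           # number of skeleton joints (assumed)
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (2 * n, 64))   # hidden layer weights (toy sizes)
W2 = rng.normal(0, 0.1, (64, 3 * n))

def G(X):
    """Nonlinear mapping G*: R^{2n} -> R^{3n} with one hidden ReLU layer."""
    return np.maximum(X @ W1, 0) @ W2

def mse_loss(Y_pred, Y_true):
    """xi: mean square error between predicted and real 3D poses."""
    return np.mean((Y_pred - Y_true) ** 2)

X = rng.normal(size=(8, 2 * n))   # a batch of 2D poses
Y = rng.normal(size=(8, 3 * n))   # corresponding ground-truth 3D poses
print(G(X).shape)                 # (8, 51)
```

Training would minimize `mse_loss(G(X), Y)` over the model parameters by gradient descent; the depth and width of the network are not fixed by this description.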
The transformation relationship between the world coordinate system and the camera coordinate system is shown in
R = R_1 · R_2 · R_3
After obtaining the transformed coordinates, the data is normalized, and the data set is divided into a training set and a test set, wherein the data collected from the experimenters numbered (1, 5, 6, 7, 8) forms the training set, and the data of experimenters (9, 11) is set as the test set; the mean square error between the predicted value and the real value is used as the evaluation criterion of the model. The normalization calculation is as follows, where μ and σ are the mean and standard deviation of the sample respectively, x denotes the original data, and x′ denotes the normalized data;
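The normalization step described above is the standard zero-mean, unit-variance transform; a minimal sketch:

```python
import numpy as np

def z_normalize(x):
    """x' = (x - mu) / sigma: subtract the sample mean,
    divide by the sample standard deviation."""
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma

x = np.array([2.0, 4.0, 6.0, 8.0])   # toy coordinate values
x_norm = z_normalize(x)
print(abs(x_norm.mean()) < 1e-9)     # zero mean after normalization
```

The same statistics computed on the training set would be reused to normalize the test set, so that the evaluation does not leak test-set information.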
The generator output matrix X′ and predictive result matrix are as follows:
Where ⊙ denotes the Hadamard product, i.e., element-by-element multiplication.
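How the generator output is combined with the observed data through the Hadamard product can be sketched as below; this follows the standard generative-adversarial imputation formulation (the mask M marks observed entries with 1 and missing entries with 0), with a constant stand-in for the generator output:

```python
import numpy as np

# Observed skeleton data with missing entries (NaN) and its mask M
X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
M = (~np.isnan(X)).astype(float)    # 1 = observed, 0 = missing
X_tilde = np.nan_to_num(X)          # missing entries zero-filled for input

X_bar = np.full_like(X, 9.9)        # stand-in for the generator's prediction

# Imputed result: keep observed values, fill missing ones (⊙ = Hadamard)
X_hat = M * X_tilde + (1 - M) * X_bar
print(X_hat)  # [[1.  9.9] [9.9 4. ]]
```

Observed entries pass through unchanged, so the generator is only responsible for the positions the mask marks as missing.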
A prompt tensor H is introduced to determine the accurate mask value: when an element of H is 0.5, the accurate value of M cannot be obtained from H, while when the value is 0 or 1, the accurate value can be obtained; E denotes the expectation operator. Here the value V(D, G) is defined as follows:
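Construction of such a prompt tensor can be sketched as follows, assuming the common hint scheme H = B⊙M + 0.5(1−B), where B is a random revealing matrix (an assumption; the text only fixes the meaning of the values 0, 0.5 and 1):

```python
import numpy as np

rng = np.random.default_rng(1)
M = np.array([[1., 0.],
              [0., 1.]])                        # true mask (1 = observed)
B = (rng.random(M.shape) < 0.9).astype(float)   # reveal ~90% of entries

# Prompt tensor: revealed entries carry the true mask value (0 or 1),
# hidden entries are set to 0.5 so the discriminator cannot recover M there
H = B * M + 0.5 * (1 - B)
print(set(np.unique(H)) <= {0.0, 0.5, 1.0})  # True
```

Revealing most, but not all, of the mask gives the discriminator enough information to be a useful adversary while still forcing the generator to produce realistic values at the hidden positions.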
Due to the lack of human skeleton data caused by occlusion, relying solely on joint position information easily leads to the loss of effective features, that is, the loss of joint connection information and of skeleton structure. The model's efficient use of features is further improved by integrating the structural features of the joints. Here, the position feature of the posture is denoted by the extracted skeleton position coordinate together with an indicator scalar: when the indicator is 0, the position is missing, and when it is not 0, the position is not missing. The structural features of the joints are denoted by an association matrix whose elements take the values 0 and 1: 1 denotes that the joints of the row and column where the element is located are interconnected, and 0 denotes that they are not connected.
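The two feature types described above can be illustrated on a toy three-joint skeleton (values are illustrative only):

```python
import numpy as np

# Position feature: (x, y, z, indicator) per joint; indicator 0 = missing
positions = np.array([
    [0.1, 0.2, 0.3, 1.0],   # joint 0: observed
    [0.0, 0.0, 0.0, 0.0],   # joint 1: missing under occlusion
    [0.4, 0.5, 0.6, 1.0],   # joint 2: observed
])

# Structural feature: association matrix, 1 = joints connected, 0 = not
A = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])
print(positions[1, 3] == 0.0)   # joint 1 is flagged as missing
print((A == A.T).all())         # skeleton connections are symmetric here
```

Even when joint 1's coordinates are missing, the association matrix still tells the model that joint 1 connects joints 0 and 2, which is exactly the structural information the position feature alone would lose.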
The basic idea of generative adversarial networks lies in a dynamic game process whose final equilibrium point is the Nash equilibrium. The training of the network is realized by fixing different sub-networks at different stages; the discriminator network needs to be trained first to avoid problems such as mode collapse. When training the discriminator, the generator is first fixed; the missing data predicted by the generator and the original real data are fed into the discriminator, the error is calculated, and back-propagation is performed to update the discriminator parameters. When training the generator, the discriminator network is fixed, the predicted value output by the generator is input into the discriminator as a negative sample, and the parameters of the generator are updated by back-propagation according to the error of the discriminator. The specific network structure flow diagram is shown in
The present invention realizes the three-dimensional mapping learning of two-dimensional human body posture data by designing a nonlinear model, so that the model can obtain sufficient spatial information and solve the problem that the key point information of human skeleton output from different perspectives is not uniform.
When learning a new sample, the OWM module modifies the weight value in the direction orthogonal to the feature solution space of the old tasks in order to retain the previously learned features, so that the weight increment does not interfere with past tasks and the solution sought on the new sample still lies in the previous solution space. Here, it is assumed that A is the matrix of previously trained input vectors, the matrix I denotes the identity matrix, and α is a parameter; the direction orthogonal to the input space is then found as shown below:
ΔW=λPΔW′
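The orthogonal projection above can be sketched as follows, assuming the common orthogonal-weight-modification form P = I − A(AᵀA + αI)⁻¹Aᵀ (an assumption consistent with the symbols A, I, and α defined here); the weight increment is then ΔW = λPΔW′:

```python
import numpy as np

def owm_projector(A, alpha=1e-3):
    """P = I - A (A^T A + alpha I)^{-1} A^T projects onto the subspace
    orthogonal to the span of previous inputs (the columns of A)."""
    k = A.shape[1]
    return np.eye(A.shape[0]) - A @ np.linalg.inv(A.T @ A + alpha * np.eye(k)) @ A.T

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))     # previously trained input vectors
P = owm_projector(A)

dW_raw = rng.normal(size=(5,))  # raw weight increment ΔW′
dW = 0.1 * P @ dW_raw           # ΔW = λ P ΔW′ with λ = 0.1
print(np.abs(A.T @ dW).max())   # near zero: increment ~orthogonal to old inputs
```

Because the projected increment has almost no component along the old input directions, updating the weights with ΔW leaves the responses learned on previous tasks nearly unchanged, which is the continual-learning property the OWM module provides.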
As shown in
The specific model parameters for the setup are shown in Table 3.
As shown in Table 4, it is the error comparison table between the predicted value and the real value of the occlusion prediction comparison experiment on different actions. It can be found that the algorithm of the present invention performs best in predicting missing human skeleton key points under occlusion, with an average error of only 0.0657, and performs better in the evaluation of simple actions such as standing and walking.
As shown in
As shown in
As shown in
The environment of the three-dimensional posture estimation experiment of the present invention is shown in Table 5, and the accelerated training is realized by GPU.
The experiment uses Adam as the optimizer, all data sets are trained for 1000 rounds, and the initial learning rate is set to 0.001 and decays exponentially with the number of training epochs. BatchSize is set to 64, and the neural network is initialized with Kaiming initialization to ensure the stability of gradient back-propagation during training and improve the training speed of the model. The model training parameters are shown in Table 6:
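The exponential learning-rate decay can be sketched as below; the decay factor gamma is an assumed illustrative value, since only the initial rate (0.001) and the exponential form are stated:

```python
# Exponential learning-rate decay: lr_t = lr0 * gamma^t,
# with lr0 = 0.001 from the training setup and gamma assumed for illustration.
def lr_at_epoch(epoch, lr0=0.001, gamma=0.995):
    return lr0 * gamma ** epoch

print(lr_at_epoch(0))                        # 0.001
print(lr_at_epoch(1000) < lr_at_epoch(0))    # decayed after 1000 rounds
```

A schedule of this shape takes large steps early, then progressively smaller ones, which is what allows the 1000-round training to converge stably.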
In order to verify the effect of the model, the distance error between the three-dimensional human skeleton key point data predicted by different algorithms and the original three-dimensional human skeleton key point data is calculated in millimeters. Validation is performed on different actions such as Direct, Discuss, and Eating, and the experimental results are shown in Table 7:
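The distance error used here corresponds to the mean per-joint Euclidean distance between predicted and ground-truth key points, in millimeters; a minimal sketch:

```python
import numpy as np

def mean_joint_error_mm(pred, gt):
    """Mean Euclidean distance (mm) between predicted and ground-truth
    3D skeleton key points; pred and gt have shape (num_joints, 3)."""
    return np.linalg.norm(pred - gt, axis=1).mean()

gt = np.zeros((17, 3))                   # toy ground-truth skeleton
pred = gt + np.array([3.0, 0.0, 4.0])    # every joint offset by 5 mm
print(mean_joint_error_mm(pred, gt))     # 5.0
```

Averaging per joint, then per action, gives the per-action columns of a table like Table 7.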
As shown in
Aiming at the problem of missing human skeleton point data under occlusion and the problem of missing three-dimensional spatial information in the two-dimensional skeleton data of the human body posture estimation algorithm, the occlusion prediction network and the three-dimensional human body posture estimation model are established respectively. The generative adversarial imputation network comprehensively uses the skeleton point tensor and the human body association tensor to predict the missing data of the human body under occlusion; compared with interpolation algorithms such as MissForest, the effectiveness of the proposed algorithm on occlusion-missing data is verified, with the prediction error reduced by 54.1% on average relative to the best comparison algorithm. In addition, two-dimensional to three-dimensional human body posture estimation is realized by constructing a nonlinear network. Meanwhile, in order to improve the generalization ability of the model and enhance its continuous learning ability, the OWM module is introduced into the network, and experimental verification is carried out on the Human3.6M data set. Compared with algorithms such as the maximum marginal neural network, using the distance error between the predicted value and the real value as the evaluation index, the error is reduced by 13.8% on average relative to the best comparison algorithm, which verifies the effectiveness of the improvement measures.
As shown in
w_ij = v · α_ij
As shown in
The image features extraction of the present invention obtains the image feature tensor through the Vision Transformer architecture, which is composed of an encoder and a decoder; each encoder and decoder is composed of multi-head attention (MSA) and a fully connected network, with residual connections between each attention layer and the neural network layer. Firstly, the segmented rectangular region of the human body is input into the Vision Transformer as structural blocks; each block is converted into a feature vector of dimension D by linear transformation and combined with its position coding vector. The input image is thus divided into different image blocks, constructed into an image sequence z0, and input into the encoder. Each encoder is composed of two sub-modules, a multi-head attention module and a feedforward neural network module, wherein an LN (LayerNorm) normalization layer is added in front of each neural network module, and a GELU layer is added in the middle layer, as shown below:
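The block-embedding step above can be sketched as follows; the image size, block size, and embedding dimension D are assumed illustrative values:

```python
import numpy as np

def patchify(img, p):
    """Split an H x W x C image into flattened p x p x C blocks
    (H and W are assumed divisible by p)."""
    H, W, C = img.shape
    blocks = [img[i:i + p, j:j + p].reshape(-1)
              for i in range(0, H, p) for j in range(0, W, p)]
    return np.stack(blocks)

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))                    # toy human-body region
patches = patchify(img, p=4)                   # 4 blocks of 4*4*3 = 48 values

D = 16                                          # embedding dimension (assumed)
E = rng.normal(0, 0.02, (48, D))                # learned linear projection
pos = rng.normal(0, 0.02, (len(patches), D))    # position coding vectors
z0 = patches @ E + pos                          # sequence fed to the encoder
print(z0.shape)  # (4, 16)
```

The position coding is added rather than concatenated, so the encoder sees one D-dimensional token per image block while still being able to distinguish block locations.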
For the input image sequence, each element is multiplied by the key vector K, value vector V, and query vector Q generated during the training process; the dot product of the current element's Q value with the K values of the other elements is then calculated as the score value and normalized to ensure the stability of gradient back-propagation; finally, the multi-head attention feature weight is obtained by SoftMax.
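The score computation just described can be sketched for a single attention head (a multi-head layer repeats this with separate projections and concatenates the results); the scaling by √d is the normalization that stabilizes the gradients:

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single attention head: scores = Q K^T / sqrt(d), SoftMax, weight V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # scaled dot-product scores
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)    # SoftMax attention weights
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # sequence of 4 image tokens
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = attention(X, Wq, Wk, Wv)
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

Each output token is thus a weighted mixture of all value vectors, with the weights determined by query-key similarity across the whole image sequence.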
As shown in
After the skeleton features and image features of the same dimension are obtained, the two features are fused and input into the classification network. The present invention uses a Wide&Deep neural network for the reconstruction and fusion of features, and finally the probability of each behavior category is obtained through the SoftMax classifier. The network establishes a linear module and a nonlinear module respectively: the linear module is mainly used to fit the direct relationship between input and output, so that the model has good memory ability, while the nonlinear module retains the excellent fitting ability of the original neural network, further improving the generalization ability of the model and achieving a balance between nonlinear and linear features. As shown in
The Wide module consists of a linear module y=wTx+b, where x denotes an input feature vector in the form of x=[x1, x2, . . . , xn], w=[w1, w2, . . . , wn] is a model training parameter, and b denotes a model bias term; the input fusion features comprise original input feature vectors and transformed feature vectors, where the transformed features are obtained by cross product transformation, as shown below, where cki denotes a Boolean variable, that is, if the i-th feature is part of the k-th transformation φk, then cki is 1, otherwise it is 0:
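The cross product transformation can be sketched as φk(x) = ∏i xi^(cki), with cki ∈ {0, 1} selecting which features enter the k-th product; the feature values and transformation matrix below are illustrative:

```python
import numpy as np

def cross_product_transform(x, C):
    """phi_k(x) = prod_i x_i^{c_ki}: each row of the Boolean matrix C
    selects which input features are multiplied together."""
    return np.prod(np.power(x, C), axis=1)

x = np.array([2.0, 3.0, 5.0])
C = np.array([[1, 1, 0],     # phi_1 = x1 * x2
              [0, 1, 1]])    # phi_2 = x2 * x3
phi = cross_product_transform(x, C)
print(phi)  # [ 6. 15.]

# Wide module input: original features concatenated with transformed ones,
# then y = w^T [x, phi] + b with illustrative parameters
fused = np.concatenate([x, phi])
w = np.ones_like(fused)
y = w @ fused + 0.5
```

The cross products let the purely linear Wide module memorize feature co-occurrences (e.g. a particular skeleton pattern together with a particular image cue) that a single linear term over the raw features could not express.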
As shown in
The specific model parameters for the setup are shown in Table 9:
The experiment of the present invention evaluates the performance of the model through the ACC (Accuracy) index, and the model speed is evaluated by the FPS value, i.e., the number of pictures the model can recognize per second in the inference stage. The skeleton classification comparative experiment data set is composed of pure skeleton data; the corresponding category labels are assigned to each group of skeleton data, and then the LSTM, Transformer, and DNN algorithms are used for experimental evaluation. In the image target detection part, LabelMe is used to annotate different behaviors in the image data to form a JSON file containing image region and label information, and then YOLOv5 and other target detection algorithms are used for experimental evaluation. Data set evaluation is divided into individual behavior evaluation and interactive behavior evaluation, wherein individual behaviors comprise daily behaviors such as walking and standing, ‘human-object’ interactive behaviors comprise playing tennis and badminton, and ‘human-human’ interactive behaviors comprise fighting and hugging.
As shown in Table 10, it is the experimental performance of the behavior understanding algorithm in the local data set.
As shown in Table 11, it is the comparison of the experimental effects of the behavior understanding algorithm applied to public data sets from different perspectives.
From the analysis of the experimental results, it can be seen that the behavior understanding algorithm that relies solely on skeleton information has higher speed and higher recognition accuracy in individual behavior understanding, but performs poorly in interactive behavior understanding. The reason is that it ignores the original image information; for interactive behaviors, which rely on effective image information, a single-skeleton behavior understanding algorithm leads to a loss in information extraction.
Similarly, when the target detection algorithm that relies purely on image information is used for human behavior understanding, the complex structure of the algorithm model makes the running speed slow and the real-time performance poor. However, its recognition accuracy is higher than that of the single-skeleton behavior understanding algorithm.
By comparison, the behavior understanding algorithm that fuses image features and skeleton features comprehensively utilizes the effective features of the image, can better remove redundant noise, and performs best in recognition accuracy. Meanwhile, due to the lightweight improvement of the model, the running speed has also been improved to a certain extent, giving the method greater application value.
As shown in
As shown in
As shown in
As shown in
Therefore, the present invention adopts the above-mentioned interactive behavior understanding method for posture reconstruction based on features of skeleton and image, fuses skeleton features and image features, and reconstructs the features. The method not only retains the purity of skeleton features for extracting human behavior information, but also uses image features to retain effective image information such as the environment, thereby further complementing the model's feature information. Specifically, the skeleton features extracted by the graph convolution network make good use of the joint directed graph structure of the human skeleton, increase the relevance of the input skeleton point information, and yield accurate skeleton features. The image is then divided into image block sequences through the Vision Transformer network and, combined with the multi-head attention mechanism, effective image features are extracted quickly and accurately. In the experimental part, the algorithm of the present invention is compared with the pure skeleton feature recognition algorithms LSTM, Transformer, and DNN, and with the image target detection behavior classification algorithms Fast R-CNN and YOLOv5; finally, the accuracy of the algorithm of the present invention is improved by 7.2% and the speed by 28% compared with the best comparison algorithm, which verifies the efficiency and accuracy of the algorithm and indicates that it can be well applied to human behavior understanding.
Finally, it should be noted that the above examples are merely used for describing the technical solutions of the present invention, rather than limiting the same. Although the present invention has been described in detail with reference to the preferred examples, those of ordinary skill in the art should understand that the technical solutions of the present invention may still be modified or equivalently replaced. However, these modifications or substitutions should not make the modified technical solutions deviate from the spirit and scope of the technical solutions of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
2023108388989 | Jul 2023 | CN | national |