The present invention relates to the field of human behavior understanding, in particular to an interactive behavior understanding method for posture reconstruction based on features of skeleton and image.
In the existing technology, commonly used methods for human behavior understanding comprise behavior understanding algorithms based on human body posture estimation and target detection algorithms based on image information. The advantage of a human body posture classification algorithm that relies on human skeleton key points is that the skeleton key point information removes redundant noise from the image and preserves pure behavior information; however, completely discarding the image information causes the loss of effective information. The target detection algorithm relies on images and can therefore obtain sufficient image features and human body features, but the images contain a large amount of noise interference, which is not conducive to behavior understanding.
The model can quickly and accurately extract complete human skeleton information through the lightweight improvement of the OpenPose algorithm, occlusion prediction, and a three-dimensional human body posture estimation algorithm. However, algorithms that rely solely on human skeleton information do not perform well on interactive behavior: they easily misjudge ‘human-object’ interaction behaviors such as playing badminton or tennis, reading with both hands, and holding a water cup with both hands. Meanwhile, the performance on ‘human-human’ interaction behaviors such as stealing, fighting, and hugging is still poor when only skeleton data is used to distinguish them. The reason is that pure skeleton data completely abandons the image features; that is, the environmental perception ability of the model is not considered.
In order to comprehensively utilize the advantages of skeleton features and image features, and to enhance the model's environmental perception ability and interactive behavior understanding, it is necessary to propose an interactive behavior understanding method for posture reconstruction based on features of skeleton and image, which can quickly and accurately extract effective image features and further improve the accuracy of the model.
The objective of the present invention is to provide an interactive behavior understanding method for posture reconstruction based on features of skeleton and image. The method not only retains the purity of skeleton features for extracting human behavior information, but also uses image features to retain effective image information such as the environment, thereby further complementing the model's feature information. The skeleton features are extracted by a graph convolution network, which increases the relevance of the input skeleton point information and obtains accurate skeleton features, while effective image features are extracted quickly and accurately through a Vision Transformer network combined with the multi-head attention mechanism.
In order to achieve the above objective, the present invention provides an interactive behavior understanding method for posture reconstruction based on features of skeleton and image, the specific steps are as follows:
Preferably, in step S1, the construction and preprocessing of the data set comprise:
S11, construction of the data set: for the extraction of skeleton features, firstly, two-dimensional skeleton information of the human body is extracted via the improved OpenPose algorithm, and then complete three-dimensional human skeleton data is generated as the skeleton data via an occlusion prediction network and three-dimensional human body posture estimation.
Preferably, in step S11 construction of the data set, the steps of a three-dimensional human body posture estimation algorithm in the case of occlusion are as follows:
Preferably, in step S2, the steps of the extraction of skeleton features are as follows:
S21, skeleton features weight network: for the three-dimensional posture data input in step S1, a basic initial weight distribution is performed, and the attention weight is set through a normalized activation function, with the specific formula as follows:
w_ij = v · α_ij
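The weight assignment above can be sketched as follows. This is a minimal illustration that assumes the attention coefficients α_ij are obtained by a softmax normalization over each node's raw scores (the specific normalized activation is not fixed here) and that v is a learnable scaling parameter:

```python
import numpy as np

def attention_weights(scores, v=1.0):
    """Normalize raw attention scores per row with a softmax,
    then scale by the learnable parameter v: w_ij = v * alpha_ij."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # numerically stable softmax
    alpha = e / e.sum(axis=1, keepdims=True)
    return v * alpha

scores = np.array([[1.0, 2.0, 3.0],
                   [0.5, 0.5, 0.5]])
w = attention_weights(scores, v=2.0)
print(w.sum(axis=1))  # each row sums to v = 2.0
```

Because the softmax rows sum to one, scaling by v keeps the relative weighting among skeleton points while letting the network learn an overall magnitude.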
S22, graph convolution network: the convolution layer operation is obtained via the convolution of a signal x with a signal g, where the signal x denotes the input graph information and the signal g denotes the convolution kernel; the convolution of the two is obtained via the Fourier transform, where the function F denotes the Fourier transform, which maps a signal to the Fourier domain, as shown below:
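The spectral convolution just described can be sketched numerically as follows; this is a minimal illustration assuming the graph Fourier transform F is realized by the eigenvector matrix U of the normalized graph Laplacian, so that F(x) = Uᵀx and F⁻¹(x̂) = Ux̂:

```python
import numpy as np

def graph_conv(x, g, A):
    """Spectral graph convolution x * g = F^{-1}(F(x) ⊙ F(g)),
    where F maps a graph signal to the Fourier (spectral) domain
    via the eigenvectors U of the normalized graph Laplacian."""
    d = A.sum(axis=1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))  # normalized Laplacian
    _, U = np.linalg.eigh(L)                          # graph Fourier basis
    return U @ ((U.T @ g) * (U.T @ x))                # convolution theorem

# A tiny 3-node chain graph with a signal x and kernel g (illustrative values)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
x = np.array([1.0, 2.0, 3.0])
g = np.array([1.0, 0.5, 0.25])
y = graph_conv(x, g, A)
print(y.shape)  # (3,)
```

In practice a graph convolution network learns the kernel in the spectral domain (or a polynomial approximation of it) rather than a fixed g, but the transform-multiply-invert structure is the same.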
Preferably, in step S3 image features extraction, each encoder is composed of two sub-modules: a multi-head attention module and a feedforward neural network module, as shown below:
Preferably, in step S4 fusion and reconstruction of features, the Wide module consists of a linear module y=wTx+b, where x denotes an input feature vector in the form of x=[x1, x2, . . . , xn], w=[w1, w2, . . . , wn] is a model training parameter, and b denotes a model bias term; the input fusion features comprise original input feature vectors and transformed feature vectors, where the transformed features are obtained by cross product transformation, as shown below, where cki denotes a Boolean variable, that is, if the i-th feature is part of the k-th transformation φk, then cki is 1, otherwise it is 0:
Preferably, in step S5 experimental evaluation and validation, the model training environment is established in the Windows 10 environment, using CUDA 10.1 to set up the GPU environment for training, and Python 3.6.5 as the interpreter.
Therefore, the present invention adopts the above-mentioned interactive behavior understanding method for posture reconstruction based on features of skeleton and image. The method not only retains the purity of skeleton features for extracting human behavior information, but also uses image features to retain effective image information such as the environment, thereby further complementing the model's feature information. The skeleton features are extracted by a graph convolution network, which increases the relevance of the input skeleton point information and obtains accurate skeleton features, while effective image features are extracted quickly and accurately through a Vision Transformer network combined with the multi-head attention mechanism.
Further detailed descriptions of the technical scheme of the present invention can be found in the accompanying drawings and embodiments.
The technical scheme of the present invention is further explained below by drawings and embodiments.
Wherein, in step S11 construction of the data set, the steps of a three-dimensional human body posture estimation algorithm in the case of occlusion are as follows:
In order to make the occlusion prediction universally applicable and adaptable to different individuals and multiple target behaviors, the present invention uses the image data in the COCO human body posture data set, divides it into multiple actions to extract the key points of human skeletons via the improved OpenPose algorithm, and saves the complete human skeleton key point data as a training data set. As shown in
The Human3.6M data set is by far the largest public data set for three-dimensional human body posture estimation. It was collected from eleven professional actors performing seventeen actions, such as walking, making phone calls, and participating in discussions, for a total of 3.6 million samples. The data acquisition setup uses 4 video cameras and 10 motion-capture cameras, and the shooting area is 12 square meters, wherein the four video cameras shoot from different angles to provide video data from different perspectives, and the coordinate data of the three-dimensional human skeleton key points is collected by the motion capture device. Part of the video data in Human3.6M is shown in
In order to ensure the consistency between the data of the Human3.6M data set and the OpenPose algorithm structure, it is necessary to preprocess the data and align the positional relationship of different skeleton points. The skeleton point correspondence between the two is shown in the following table.
After obtaining the two-dimensional skeleton data, a nonlinear model is established to learn the mapping relationship between two-dimensional data and three-dimensional data. The input of the nonlinear network is designed as two-dimensional human body posture data X ∈ ℝ^(2n), the network output takes the form Y ∈ ℝ^(3n), and the learning function of the nonlinear network is G*: X ∈ ℝ^(2n) → Y ∈ ℝ^(3n). The purpose of minimizing the mean square error between the network predicted result and the real result is achieved by optimizing the model parameters, as follows, where ξ denotes the loss function, here the mean square error loss function:
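A minimal sketch of such a nonlinear mapping network follows, assuming (for illustration only) a single hidden layer with ReLU and n = 17 joints, so the input is 2n-dimensional, the output is 3n-dimensional, and ξ is the mean square error:

```python
import numpy as np

n = 17                           # number of skeleton joints (assumed)
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (2 * n, 64))   # hidden layer weights (toy sizes)
W2 = rng.normal(0, 0.1, (64, 3 * n))

def G(X):
    """Nonlinear mapping G*: R^{2n} -> R^{3n} with one hidden ReLU layer."""
    return np.maximum(X @ W1, 0) @ W2

def mse_loss(Y_pred, Y_true):
    """xi: mean square error between predicted and real 3D poses."""
    return np.mean((Y_pred - Y_true) ** 2)

X = rng.normal(size=(8, 2 * n))   # a batch of 2D poses
Y = rng.normal(size=(8, 3 * n))   # corresponding ground-truth 3D poses
print(G(X).shape)                 # (8, 51)
```

Training would minimize `mse_loss(G(X), Y)` over the model parameters by gradient descent; the depth and width of the network are not fixed by this description.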
The transformation relationship between the world coordinate system and the camera coordinate system is shown in
R = R_1 · R_2 · R_3
After obtaining the transformed coordinates, the data is normalized, and the data set is divided into a training set and a test set, wherein the data collected from the experimenters numbered (1, 5, 6, 7, 8) forms the training set, and the data of experimenters (9, 11) is set as the test set; the mean square error between the predicted value and the real value is used as the evaluation criterion of the model. The normalization calculation is as follows, where μ and σ are the mean and standard deviation of the sample respectively, x denotes the original data, and x′ denotes the normalized data;
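The normalization step described above is the standard zero-mean, unit-variance transform; a minimal sketch:

```python
import numpy as np

def z_normalize(x):
    """x' = (x - mu) / sigma: subtract the sample mean,
    divide by the sample standard deviation."""
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma

x = np.array([2.0, 4.0, 6.0, 8.0])   # toy coordinate values
x_norm = z_normalize(x)
print(abs(x_norm.mean()) < 1e-9)     # zero mean after normalization
```

The same statistics computed on the training set would be reused to normalize the test set, so that the evaluation does not leak test-set information.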
The generator output matrix X′ and predictive result matrix are as follows:
Where ⊙ denotes the Hadamard product, i.e., element-by-element multiplication.
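How the generator output is combined with the observed data through the Hadamard product can be sketched as below; this follows the standard generative-adversarial imputation formulation (the mask M marks observed entries with 1 and missing entries with 0), with a constant stand-in for the generator output:

```python
import numpy as np

# Observed skeleton data with missing entries (NaN) and its mask M
X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
M = (~np.isnan(X)).astype(float)    # 1 = observed, 0 = missing
X_tilde = np.nan_to_num(X)          # missing entries zero-filled for input

X_bar = np.full_like(X, 9.9)        # stand-in for the generator's prediction

# Imputed result: keep observed values, fill missing ones (⊙ = Hadamard)
X_hat = M * X_tilde + (1 - M) * X_bar
print(X_hat)  # [[1.  9.9] [9.9 4. ]]
```

Observed entries pass through unchanged, so the generator is only responsible for the positions the mask marks as missing.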
A prompt tensor H is introduced to determine the accurate mask value: when an element of H is 0.5, the accurate value of M cannot be obtained from H, while when the value is 0 or 1, the accurate value can be obtained; E denotes the expectation operator. Here the value V(D, G) is defined as follows:
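Construction of such a prompt tensor can be sketched as follows, assuming the common hint scheme H = B⊙M + 0.5(1−B), where B is a random revealing matrix (an assumption; the text only fixes the meaning of the values 0, 0.5 and 1):

```python
import numpy as np

rng = np.random.default_rng(1)
M = np.array([[1., 0.],
              [0., 1.]])                        # true mask (1 = observed)
B = (rng.random(M.shape) < 0.9).astype(float)   # reveal ~90% of entries

# Prompt tensor: revealed entries carry the true mask value (0 or 1),
# hidden entries are set to 0.5 so the discriminator cannot recover M there
H = B * M + 0.5 * (1 - B)
print(set(np.unique(H)) <= {0.0, 0.5, 1.0})  # True
```

Revealing most, but not all, of the mask gives the discriminator enough information to be a useful adversary while still forcing the generator to produce realistic values at the hidden positions.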
Due to the lack of human skeleton data caused by occlusion, relying solely on joint position information easily leads to the loss of effective features, that is, the loss of joint connection information and of skeleton structure. The model's efficient use of features is further improved by integrating the structural features of the joints. Here, the position feature of the posture is denoted by the extracted skeleton position coordinate together with an indicator scalar: when the indicator is 0, the position is missing, and when it is not 0, the position is not missing. The structural features of the joints are denoted by an association matrix whose elements take the values 0 and 1: 1 denotes that the joints of the row and column where the element is located are interconnected, and 0 denotes that they are not connected.
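The two feature types described above can be illustrated on a toy three-joint skeleton (values are illustrative only):

```python
import numpy as np

# Position feature: (x, y, z, indicator) per joint; indicator 0 = missing
positions = np.array([
    [0.1, 0.2, 0.3, 1.0],   # joint 0: observed
    [0.0, 0.0, 0.0, 0.0],   # joint 1: missing under occlusion
    [0.4, 0.5, 0.6, 1.0],   # joint 2: observed
])

# Structural feature: association matrix, 1 = joints connected, 0 = not
A = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])
print(positions[1, 3] == 0.0)   # joint 1 is flagged as missing
print((A == A.T).all())         # skeleton connections are symmetric here
```

Even when joint 1's coordinates are missing, the association matrix still tells the model that joint 1 connects joints 0 and 2, which is exactly the structural information the position feature alone would lose.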
The basic idea of generative adversarial networks lies in a dynamic game process whose final equilibrium point is the Nash equilibrium. The training of the network is realized by fixing different sub-networks at different stages; the discriminator network needs to be trained first to avoid problems such as mode collapse. When training the discriminator, the generator is first fixed; the missing data predicted by the generator and the original real data are fed into the discriminator, the error is calculated, and back-propagation is performed to update the discriminator parameters. When training the generator, the discriminator network is fixed, the predicted value output by the generator is input into the discriminator as a negative sample, and the parameters of the generator are updated by back-propagation according to the error of the discriminator. The specific network structure flow diagram is shown in
The present invention realizes the three-dimensional mapping learning of two-dimensional human body posture data by designing a nonlinear model, so that the model can obtain sufficient spatial information and solve the problem that the key point information of human skeleton output from different perspectives is not uniform.
When learning a new sample, the OWM module modifies the weight value in the direction orthogonal to the feature solution space of the old tasks in order to retain the previously learned features, so that the weight increment does not interfere with past tasks and the solution sought on the new sample still lies in the previous solution space. Here, it is assumed that A is the matrix of previously trained input vectors, the matrix I denotes the identity matrix, and α is a parameter; the direction orthogonal to the input space is then found as shown below:
ΔW=λPΔW′
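The orthogonal projection above can be sketched as follows, assuming the common orthogonal-weight-modification form P = I − A(AᵀA + αI)⁻¹Aᵀ (an assumption consistent with the symbols A, I, and α defined here); the weight increment is then ΔW = λPΔW′:

```python
import numpy as np

def owm_projector(A, alpha=1e-3):
    """P = I - A (A^T A + alpha I)^{-1} A^T projects onto the subspace
    orthogonal to the span of previous inputs (the columns of A)."""
    k = A.shape[1]
    return np.eye(A.shape[0]) - A @ np.linalg.inv(A.T @ A + alpha * np.eye(k)) @ A.T

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))     # previously trained input vectors
P = owm_projector(A)

dW_raw = rng.normal(size=(5,))  # raw weight increment ΔW′
dW = 0.1 * P @ dW_raw           # ΔW = λ P ΔW′ with λ = 0.1
print(np.abs(A.T @ dW).max())   # near zero: increment ~orthogonal to old inputs
```

Because the projected increment has almost no component along the old input directions, updating the weights with ΔW leaves the responses learned on previous tasks nearly unchanged, which is the continual-learning property the OWM module provides.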
As shown in
The specific model parameters for the setup are shown in Table 3.
As shown in Table 4, it is the error comparison table between the predicted value and the real value of the occlusion prediction comparison experiment on different actions. It can be found that the algorithm of the present invention performs best in predicting missing human skeleton key points under occlusion, with an average error of only 0.0657, and performs better in the evaluation of simple actions such as standing and walking.
As shown in
As shown in
As shown in
The environment of the three-dimensional posture estimation experiment of the present invention is shown in Table 5, and the accelerated training is realized by GPU.
The experiment uses Adam as the optimizer, all data sets are trained for 1000 rounds, and the initial learning rate is set to 0.001 and decays exponentially with the number of training epochs. BatchSize is set to 64, and the neural network is initialized with Kaiming initialization to ensure the stability of gradient back-propagation during training and improve the training speed of the model. The model training parameters are shown in Table 6:
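The exponential learning-rate decay can be sketched as below; the decay factor gamma is an assumed illustrative value, since only the initial rate (0.001) and the exponential form are stated:

```python
# Exponential learning-rate decay: lr_t = lr0 * gamma^t,
# with lr0 = 0.001 from the training setup and gamma assumed for illustration.
def lr_at_epoch(epoch, lr0=0.001, gamma=0.995):
    return lr0 * gamma ** epoch

print(lr_at_epoch(0))                        # 0.001
print(lr_at_epoch(1000) < lr_at_epoch(0))    # decayed after 1000 rounds
```

A schedule of this shape takes large steps early, then progressively smaller ones, which is what allows the 1000-round training to converge stably.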
In order to verify the effect of the model, the distance error between the three-dimensional human skeleton key point data predicted by different algorithms and the original three-dimensional human skeleton key point data is calculated in millimeters. Validation is performed on different actions such as Direct, Discuss, and Eating, and the experimental results are shown in Table 7:
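The distance error used here corresponds to the mean per-joint Euclidean distance between predicted and ground-truth key points, in millimeters; a minimal sketch:

```python
import numpy as np

def mean_joint_error_mm(pred, gt):
    """Mean Euclidean distance (mm) between predicted and ground-truth
    3D skeleton key points; pred and gt have shape (num_joints, 3)."""
    return np.linalg.norm(pred - gt, axis=1).mean()

gt = np.zeros((17, 3))                   # toy ground-truth skeleton
pred = gt + np.array([3.0, 0.0, 4.0])    # every joint offset by 5 mm
print(mean_joint_error_mm(pred, gt))     # 5.0
```

Averaging per joint, then per action, gives the per-action columns of a table like Table 7.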
As shown in
Aiming at the problem of missing human skeleton point data under occlusion and the problem of missing three-dimensional spatial information in the two-dimensional skeleton data of the human body posture estimation algorithm, the occlusion prediction network and the three-dimensional human body posture estimation model are established respectively. The generative adversarial imputation network comprehensively uses the skeleton point tensor and the human body association tensor to predict the missing data of the human body under occlusion; compared with interpolation algorithms such as MissForest, the effectiveness of the proposed algorithm on occlusion-missing data is verified, with the prediction error reduced by 54.1% on average relative to the best comparison algorithm. In addition, two-dimensional to three-dimensional human body posture estimation is realized by constructing a nonlinear network. Meanwhile, in order to improve the generalization ability of the model and enhance its continuous learning ability, the OWM module is introduced into the network, and experimental verification is carried out on the Human3.6M data set. Compared with algorithms such as the maximum marginal neural network, using the distance error between the predicted value and the real value as the evaluation index, the error is reduced by 13.8% on average relative to the best comparison algorithm, which verifies the effectiveness of the improvement measures.
As shown in
w_ij = v · α_ij
As shown in
The image features extraction of the present invention obtains the image feature tensor through the Vision Transformer architecture, which is composed of an encoder and a decoder; each encoder and decoder is composed of multi-head attention (MSA) and a fully connected network, with residual connections between each attention layer and the neural network layer. Firstly, the segmented rectangular region of the human body is input into the Vision Transformer as structural blocks; each block is converted into a feature vector of dimension D by linear transformation and combined with its position coding vector. The input image is thus divided into different image blocks, constructed into an image sequence z0, and input into the encoder. Each encoder is composed of two sub-modules, a multi-head attention module and a feedforward neural network module, wherein an LN (LayerNorm) normalization layer is added in front of each neural network module, and a GELU layer is added in the middle layer, as shown below:
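The block-embedding step above can be sketched as follows; the image size, block size, and embedding dimension D are assumed illustrative values:

```python
import numpy as np

def patchify(img, p):
    """Split an H x W x C image into flattened p x p x C blocks
    (H and W are assumed divisible by p)."""
    H, W, C = img.shape
    blocks = [img[i:i + p, j:j + p].reshape(-1)
              for i in range(0, H, p) for j in range(0, W, p)]
    return np.stack(blocks)

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))                    # toy human-body region
patches = patchify(img, p=4)                   # 4 blocks of 4*4*3 = 48 values

D = 16                                          # embedding dimension (assumed)
E = rng.normal(0, 0.02, (48, D))                # learned linear projection
pos = rng.normal(0, 0.02, (len(patches), D))    # position coding vectors
z0 = patches @ E + pos                          # sequence fed to the encoder
print(z0.shape)  # (4, 16)
```

The position coding is added rather than concatenated, so the encoder sees one D-dimensional token per image block while still being able to distinguish block locations.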
For the input image sequence, each element is multiplied by the key vector K, value vector V, and query vector Q generated during the training process; the dot product of the current element's Q value with the K values of the other elements is then calculated as the score value and normalized to ensure the stability of gradient back-propagation; finally, the multi-head attention feature weight is obtained by SoftMax.
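The score computation just described can be sketched for a single attention head (a multi-head layer repeats this with separate projections and concatenates the results); the scaling by √d is the normalization that stabilizes the gradients:

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single attention head: scores = Q K^T / sqrt(d), SoftMax, weight V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # scaled dot-product scores
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)    # SoftMax attention weights
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # sequence of 4 image tokens
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = attention(X, Wq, Wk, Wv)
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

Each output token is thus a weighted mixture of all value vectors, with the weights determined by query-key similarity across the whole image sequence.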
As shown in
After the skeleton features and image features of the same dimension are obtained, the two features are fused and input into the classification network. The present invention uses a Wide&Deep neural network for the reconstruction and fusion of features, and finally the probability of each behavior category is obtained through the SoftMax classifier. The network establishes a linear module and a nonlinear module respectively: the linear module is mainly used to fit the direct relationship between input and output, so that the model has good memory ability, while the nonlinear module retains the excellent fitting ability of the original neural network, further improving the generalization ability of the model and achieving a balance between nonlinear and linear features. As shown in
The Wide module consists of a linear module y=wTx+b, where x denotes an input feature vector in the form of x=[x1, x2, . . . , xn], w=[w1, w2, . . . , wn] is a model training parameter, and b denotes a model bias term; the input fusion features comprise original input feature vectors and transformed feature vectors, where the transformed features are obtained by cross product transformation, as shown below, where cki denotes a Boolean variable, that is, if the i-th feature is part of the k-th transformation φk, then cki is 1, otherwise it is 0:
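The cross product transformation can be sketched as φk(x) = ∏i xi^(cki), with cki ∈ {0, 1} selecting which features enter the k-th product; the feature values and transformation matrix below are illustrative:

```python
import numpy as np

def cross_product_transform(x, C):
    """phi_k(x) = prod_i x_i^{c_ki}: each row of the Boolean matrix C
    selects which input features are multiplied together."""
    return np.prod(np.power(x, C), axis=1)

x = np.array([2.0, 3.0, 5.0])
C = np.array([[1, 1, 0],     # phi_1 = x1 * x2
              [0, 1, 1]])    # phi_2 = x2 * x3
phi = cross_product_transform(x, C)
print(phi)  # [ 6. 15.]

# Wide module input: original features concatenated with transformed ones,
# then y = w^T [x, phi] + b with illustrative parameters
fused = np.concatenate([x, phi])
w = np.ones_like(fused)
y = w @ fused + 0.5
```

The cross products let the purely linear Wide module memorize feature co-occurrences (e.g. a particular skeleton pattern together with a particular image cue) that a single linear term over the raw features could not express.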
As shown in
The specific model parameters for the setup are shown in Table 9:
The experiment of the present invention evaluates the performance of the model through the ACC (Accuracy) index, and the model speed is evaluated by the FPS value, i.e., the number of pictures the model can recognize per second in the inference stage. The skeleton classification comparative experiment data set is composed of pure skeleton data; the corresponding category labels are assigned to each group of skeleton data, and then the LSTM, Transformer, and DNN algorithms are used for experimental evaluation. In the image target detection part, LabelMe is used to annotate different behaviors in the image data to form a JSON file containing image region and label information, and then YOLOv5 and other target detection algorithms are used for experimental evaluation. Data set evaluation is divided into individual behavior evaluation and interactive behavior evaluation, wherein individual behaviors comprise daily behaviors such as walking and standing, ‘human-object’ interactive behaviors comprise playing tennis and badminton, and ‘human-human’ interactive behaviors comprise fighting and hugging.
As shown in Table 10, it is the experimental performance of the behavior understanding algorithm in the local data set.
As shown in Table 11, it is the comparison of the experimental effects of the behavior understanding algorithm applied to public data sets from different perspectives.
From the analysis of the experimental results, it can be seen that the behavior understanding algorithm that relies solely on skeleton information has higher speed and higher recognition accuracy in individual behavior understanding, but performs poorly in interactive behavior understanding. The reason is that it ignores the original image information; for interactive behaviors, which rely on effective image information, a single-skeleton behavior understanding algorithm leads to a loss in information extraction.
Similarly, when the target detection algorithm that relies purely on image information is used for human behavior understanding, the complex structure of the algorithm model makes the running speed slow and the real-time performance poor. However, its recognition accuracy is higher than that of the single-skeleton behavior understanding algorithm.
By comparison, the behavior understanding algorithm that fuses image features and skeleton features comprehensively utilizes the effective features of the image, can better remove redundant noise, and performs best in recognition accuracy. Meanwhile, due to the lightweight improvement of the model, the running speed has also been improved to a certain extent, giving the method greater application value.
As shown in
As shown in
As shown in
As shown in
Therefore, the present invention adopts the above-mentioned interactive behavior understanding method for posture reconstruction based on features of skeleton and image, fuses skeleton features and image features, and reconstructs the features. The method not only retains the purity of skeleton features for extracting human behavior information, but also uses image features to retain effective image information such as the environment, thereby further complementing the model's feature information. Specifically, the skeleton features extracted by the graph convolution network make good use of the joint directed graph structure of the human skeleton, increase the relevance of the input skeleton point information, and yield accurate skeleton features. The image is then divided into image block sequences through the Vision Transformer network and, combined with the multi-head attention mechanism, effective image features are extracted quickly and accurately. In the experimental part, the algorithm of the present invention is compared with the pure skeleton feature recognition algorithms LSTM, Transformer, and DNN, and with the image target detection behavior classification algorithms Fast R-CNN and YOLOv5; finally, the accuracy of the algorithm of the present invention is improved by 7.2% and the speed by 28% compared with the best comparison algorithm, which verifies the efficiency and accuracy of the algorithm and indicates that it can be well applied to human behavior understanding.
Finally, it should be noted that the above examples are merely used for describing the technical solutions of the present invention, rather than limiting the same. Although the present invention has been described in detail with reference to the preferred examples, those of ordinary skill in the art should understand that the technical solutions of the present invention may still be modified or equivalently replaced. However, these modifications or substitutions should not make the modified technical solutions deviate from the spirit and scope of the technical solutions of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
2023108388989 | Jul 2023 | CN | national |