DEEP VIDEO LINKAGE FEATURE-BASED BEHAVIOR RECOGNITION METHOD

Information

  • Patent Application
  • 20240395069
  • Publication Number
    20240395069
  • Date Filed
    June 14, 2022
  • Date Published
    November 28, 2024
  • CPC
    • G06V40/28
    • G06V10/50
    • G06V10/774
    • G06V10/82
  • International Classifications
    • G06V40/20
    • G06V10/50
    • G06V10/774
    • G06V10/82
Abstract
Provided is a depth video linkage feature-based behavior recognition method, comprising: projecting a depth video of each behavior sample onto a front side, a right side, a left side and a top side; obtaining a dynamic image of each behavior sample by calculating a dynamic image of each projection sequence; inputting the dynamic image of each behavior sample into a respective feature extraction module and extracting features; inputting the extracted features into a multi-projection linkage feature extraction module and extracting a linkage feature of each projection combination; connecting all the extracted linkage features by channel, and inputting the connected features into an average pooling layer and a fully connected layer; constructing a depth video linkage feature-based behavior recognition network; inputting a depth video of each training behavior sample into the depth video linkage feature-based behavior recognition network, and training the network till convergence; and inputting a depth video of each behavior sample to be tested into the trained network to implement behavior recognition.
Description
TECHNICAL FIELD

The present invention relates to the field of computer vision technologies, in particular to a depth video linkage feature-based behavior recognition method.


BACKGROUND

Behavior recognition, which is now a research hotspot in the field of computer vision, is widely used in the fields of video surveillance, behavior analysis and the like.


With the development of depth cameras, depth videos, which contain a great deal of motion information, are readily accessible. Some scholars acquire the locations of human bone joints in a depth video and use the joint data for recognition; others directly input the depth video into a network for behavior recognition. However, bone joint-based behavior recognition is sensitive not only to the accuracy of bone joint acquisition but also to intra-class differences of behaviors and to occlusion of the bone joints, while directly inputting the depth video into a network fails to make full use of the three-dimensional information contained in the depth video and of the feature relationships between behaviors in different dimensions.


Therefore, a depth video linkage feature-based behavior recognition method is provided to solve the problems of the behavior recognition algorithms described above.


SUMMARY

The present invention is provided to solve the problems in the prior art, and its objective is to provide a depth video linkage feature-based behavior recognition method, so as to solve the problem that deep features extracted by an existing recognition method fail to make full use of three-dimensional information in a depth behavior video.


The depth video linkage feature-based behavior recognition method includes the following steps:

    • 1) projecting a depth video of each behavior sample onto a front side, a right side, a left side and a top side to obtain corresponding projection sequences;
    • 2) obtaining a dynamic image of each behavior sample by calculating a dynamic image of each projection sequence;
    • 3) inputting the dynamic image of each behavior sample into a respective feature extraction module and extracting features;
    • 4) inputting the extracted features into a multi-projection linkage feature extraction module and extracting a linkage feature of each projection combination;
    • 5) connecting the extracted linkage features of all projection combinations by channel, and inputting the connected features into an average pooling layer and a fully connected layer;
    • 6) constructing a depth video linkage feature-based behavior recognition network;
    • 7) inputting a depth video of each training behavior sample into the depth video linkage feature-based behavior recognition network, and training the network till convergence; and
    • 8) inputting a depth video of each behavior sample to be tested into the trained depth video linkage feature-based behavior recognition network to implement behavior recognition.


Preferably, the projection sequence is obtained in step 1) as follows:

    • acquiring a depth video of any behavior sample, each behavior sample consisting of all frames in the depth video of the behavior sample,







V = \{ I_t \mid t \in [1, N] \},






    • in which t represents a time index, and N is a total quantity of frames of the depth video V of the behavior sample. I_t ∈ ℝ^{R×C} is a matrix representation of a tth-frame depth image of the depth video V of the behavior sample, in which R and C correspond to the quantity of rows and the quantity of columns of the matrix representation of the tth-frame depth image respectively, and ℝ indicates that the matrix is a real matrix. It(xi, yi)=di represents a depth value of a point pi with coordinates (xi, yi) on the tth-frame depth image, i.e., a distance between the point pi and a depth camera. di∈[0, D], in which D represents the furthest distance detectable by the depth camera.





The depth video V of the behavior sample can be expressed as a set of projection sequences, which is denoted by a formula:







V = \{ V_{\text{front}}, V_{\text{right}}, V_{\text{left}}, V_{\text{top}} \},






    • in which Vfront represents a projection sequence obtained by projecting the depth video V of the behavior sample onto a front side, Vright represents a projection sequence obtained by projecting the depth video V of the behavior sample onto a right side, Vleft represents a projection sequence obtained by projecting the depth video V of the behavior sample onto a left side, and the Vtop represents a projection sequence obtained by projecting the depth video V of the behavior sample onto a top side.





The projection sequence Vfront is acquired as follows:


Vfront={Ft|t∈[1, N]}, in which Ft∈R×C represents a projection graph obtained by projecting the tth-frame depth image of the depth video V of the behavior sample onto a front side. An abscissa value, xi, an ordinate value yi and a depth value di of the point pi in the depth image respectively determine an abscissa value xiƒ, an ordinate value yiƒ and a pixel value ziƒ of a point projected from the point pi onto the projection graph Ft, which are denoted by formulas:









F_t(x_i^f, y_i^f) = z_i^f, \quad x_i^f = x_i, \quad y_i^f = y_i, \quad z_i^f = f_1(d_i),






    • in which ƒ1 is a linear function indicating that the depth value di is mapped to an interval [0,255], such that the smaller the depth value of a point is, the larger the pixel value of the point on the projection graph is, i.e., the closer the point is to the depth camera, the brighter the point is on a front-side projection graph.
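By way of illustration, the following is a minimal NumPy sketch of the front-side projection described above. The specific form of the linear map ƒ1 and the treatment of zero-valued (background) depth pixels are assumptions made here, not details specified by the method itself.

```python
import numpy as np

def project_front(depth_frame: np.ndarray, D: float) -> np.ndarray:
    """Front-side projection F_t of one depth frame.

    Keeps the (x, y) coordinates of every point and maps its depth d_i to a
    pixel value via a linear f1 so that closer points appear brighter.
    """
    # Assumed form of f1: map d in [0, D] linearly to [0, 255], inverted so
    # that a smaller depth (closer to the camera) gives a larger pixel value.
    # Pixels with depth 0 are treated as background and left black (assumption).
    front = np.where(depth_frame > 0, 255.0 * (1.0 - depth_frame / D), 0.0)
    return front.astype(np.uint8)
```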





The projection sequence Vright is acquired as follows:


Vright={Rt|t∈[1, N]}, in which Rt∈R×D represents a projection graph obtained by projecting the tth-frame depth image on a right side. At least one point is projected onto the same location on the projection graph when the depth image is projected onto the right side. A point closest to an observer, i.e., a point furthest from a projection plane, can be seen when a behavior is observed from the right side. An abscissa value of the point, furthest from the projection plane, on the depth image is reserved, and a pixel value of the point in this location of the projection graph is calculated according to the abscissa value. Points in the depth image are traversed column by column from a column with the smallest abscissa x on the depth image in a direction in which x increases, and are projected onto the projection graph. An abscissa value xi, an ordinate value yi and a depth value di of the point pi in the depth image respectively determine a pixel value zir, an ordinate value yir and an abscissa value xir of a point in a projection graph Rt, which are denoted by formulas:









R_t(x_i^r, y_i^r) = z_i^r, \quad x_i^r = d_i, \quad y_i^r = y_i, \quad z_i^r = f_2(x_i),






    • in which ƒ2 is a linear function indicating that the abscissa value xi is mapped to an interval [0,255]. In a case that x continues to increase, a new point is reserved if the new point and the previously projected point are projected onto the same location in the projection graph, i.e., a pixel value of this location in the projection graph is calculated by using the abscissa value of the point with the largest abscissa value, i.e., zir2(xm), in which xm=max xi, xi∈XR, XR is a set of abscissas of all points with ordinate values yir and depth values xir in the depth image, and max xi, xi∈XR represents a maximum abscissa value in the set XR.
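A rough NumPy sketch of the right-side projection follows. It assumes the depth values have already been quantized to integer bins 0..D−1 so that they can index the columns of R_t; the form of ƒ2 is again an assumed linear map to [0, 255].

```python
import numpy as np

def project_right(depth_frame: np.ndarray, D: int) -> np.ndarray:
    """Right-side projection R_t of one depth frame (R rows, D columns).

    Points falling on the same (y, d) cell keep the value of the point with
    the largest abscissa x, i.e. z = f2(max x), as described above.
    """
    R, C = depth_frame.shape
    right = np.zeros((R, D), dtype=np.uint8)
    # Traverse columns from small x to large x; a later (larger) x overwrites
    # an earlier one, so the stored value corresponds to the maximum abscissa.
    for x in range(C):
        for y in range(R):
            d = int(depth_frame[y, x])
            if 0 < d < D:                     # skip background pixels (assumption)
                right[y, d] = np.uint8(255.0 * x / max(C - 1, 1))  # f2: x -> [0, 255]
    return right
```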





The projection sequence Vleft is acquired as follows:


Vleft={Lt|t∈[1, N]}, in which Lt∈R×D: represents a projection graph obtained by projecting the tth-frame depth image onto a left side. In a case that multiple points are projected onto the same location on a left-side projection graph, a point furthest from the projection plane is reserved. Points in the depth image are traversed column by column from a column with the largest abscissa x on the depth image in a direction in which x decreases, and are projected onto the left-side projection graph. An abscissa value xi, an ordinate value yi and a depth value di of the point pi in the depth image respectively determine a pixel value zil, an ordinate value yil and an abscissa value xil of a point in the projection graph Lt. For a point projected onto the same coordinates (xil, yil) on the left-side projection graph, an abscissa value of the point with the smallest abscissa is selected to calculate a pixel value at the coordinates of the projection graph, which are denoted by formulas:









L_t(x_i^l, y_i^l) = z_i^l, \quad x_i^l = d_i, \quad y_i^l = y_i, \quad z_i^l = f_3(x_n),






    • in which ƒ3 is a linear function indicating that an abscissa value xn is mapped to an interval [0,255], xn=min xi, xi∈XL, in which XL is a set of abscissas of all points with ordinate values yil and depth values xil in the depth image, and min xi, xi∈XL represents a minimum abscissa value in the set XL.





The projection sequence Vtop is acquired as follows:


Vtop={Ot|t∈[1, N]}, in which Ot∈ℝ^{D×C} represents a projection graph obtained by projecting the tth-frame depth image onto a top side. In a case that multiple points are projected onto the same location on a top-side projection graph, a point furthest from the projection plane is reserved. Points in the depth image are traversed column by column from a column with the smallest ordinate y on the depth image in a direction in which y increases, and are projected onto the top-side projection graph. An abscissa value xi, an ordinate value yi and a depth value di of the point pi in the depth image respectively determine an abscissa value xio, a pixel value zio and an ordinate value yio of a point projected from the point pi onto the projection graph Ot. For a point projected onto the same coordinates (xio, yio) on the projection graph, an ordinate value of the point with the largest ordinate is selected to calculate a pixel value at the coordinates of the projection graph, which is denoted by formulas:









O_t(x_i^o, y_i^o) = z_i^o, \quad x_i^o = x_i, \quad y_i^o = d_i, \quad z_i^o = f_4(y_q),






    • in which ƒ4 is a linear function indicating that an ordinate value yq is mapped to an interval [0,255], yq=max yi, yi∈Yo, in which Yo is a set of ordinates of all points with abscissa values xio and depth values yio in the depth image, and max yi, yi∈Yo represents a maximum ordinate value in the set Yo.
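The left-side and top-side projections follow the same pattern; the sketch below mirrors the right-side code above, with the traversal order reversed (left) or taken over rows (top), and with the same assumptions about quantized depth bins and the linear maps ƒ3 and ƒ4.

```python
import numpy as np

def project_left(depth_frame: np.ndarray, D: int) -> np.ndarray:
    """Left-side projection L_t: x^l = d, y^l = y, z^l = f3(min x)."""
    R, C = depth_frame.shape
    left = np.zeros((R, D), dtype=np.uint8)
    # Traverse from large x to small x; the last write corresponds to the
    # minimum abscissa x_n, matching z^l = f3(x_n).
    for x in range(C - 1, -1, -1):
        for y in range(R):
            d = int(depth_frame[y, x])
            if 0 < d < D:
                left[y, d] = np.uint8(255.0 * x / max(C - 1, 1))   # f3
    return left

def project_top(depth_frame: np.ndarray, D: int) -> np.ndarray:
    """Top-side projection O_t: x^o = x, y^o = d, z^o = f4(max y)."""
    R, C = depth_frame.shape
    top = np.zeros((D, C), dtype=np.uint8)
    # Traverse rows from small y to large y; the last write corresponds to
    # the maximum ordinate y_q, matching z^o = f4(y_q).
    for y in range(R):
        for x in range(C):
            d = int(depth_frame[y, x])
            if 0 < d < D:
                top[d, x] = np.uint8(255.0 * y / max(R - 1, 1))    # f4
    return top
```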





Preferably, the dynamic image is calculated in step 2) as follows:

    • by taking a front-side projection sequence Vfront={Ft|t∈[1, N]} of the depth video V of the behavior sample as an example, vectorizing Ft first, i.e., connecting a row vector of Ft into a new row vector it;
    • solving an arithmetic square root of each element in the row vector it to obtain a new vector wt, i.e.:








w_t = \sqrt{i_t},






    • in which √{square root over (it)} indicates to solve an arithmetic square root of each element in the row vector it, and wt is denoted by a frame vector of a tth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample;

    • calculating a feature vector vt of the tth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample according to a formula:











v_t = \frac{1}{t} \sum_{\kappa=1}^{t} w_\kappa,






    • in which Σκ=1t wκ represents summation of frame vectors from a first-frame image to the tth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample;

    • calculating a score Bt of the tth-frame image Ft of the front-side projection sequence Vfront of the depth video V of the behavior sample according to a formula:











B_t = u^T \cdot v_t,






    • in which u is a vector of a dimension A, A=R×C, uT represents transposition of the vector u, and uT·vt indicates to calculate a dot product of the feature vector vt and a vector obtained by transposing the vector u;

    • calculating a value of u, such that frame images in the front-side projection sequence Vfront have higher and higher scores from front to back, i.e., the larger the t is, the higher the score Bt is, u being calculated by using RankSVM as follows:










u = \arg\min_u E(u), \quad E(u) = \frac{\lambda}{2} \lVert u \rVert^2 + \frac{2}{T(T-1)} \times \sum_{c>j} \max\{0, 1 - B_c + B_j\},






    • in which arg min_u E(u) represents the u that minimizes the value of E(u), T is the total quantity of frames of the projection sequence (i.e., T=N), λ is a constant, and ∥u∥2 indicates to calculate a sum of squares of each element in the vector u; Bc and Bj respectively represent a score of a cth-frame image and a score of a jth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample, and max {0,1−Bc+Bj} indicates to choose a larger value of 0 and 1−Bc+Bj; and

    • in response to calculating the vector u by using RankSVM, arranging the vector u in an image form with the same size as Ft to obtain u′∈R×C, u′ being a dynamic image of the front-side projection sequence Vfront of the depth video V of the behavior sample.
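The following sketch implements the dynamic-image construction above for one projection sequence. A plain sub-gradient descent on E(u) is used here as a simple stand-in for a dedicated RankSVM solver; the learning rate, iteration count and λ are illustrative values, not taken from the description.

```python
import numpy as np

def dynamic_image(frames: np.ndarray, lam: float = 1e-3,
                  lr: float = 1e-4, iters: int = 200) -> np.ndarray:
    """Dynamic image of a projection sequence given as an array (T, R, C).

    Flattens each frame to i_t, takes element-wise square roots (w_t),
    averages over time (v_t), and fits u so that the frame scores
    B_t = u . v_t increase with t, by sub-gradient descent on E(u).
    """
    T = frames.shape[0]
    flat = frames.reshape(T, -1).astype(np.float64)
    w = np.sqrt(flat)                                         # w_t = sqrt(i_t)
    v = np.cumsum(w, axis=0) / np.arange(1, T + 1)[:, None]   # v_t = mean of w_1..w_t

    u = np.zeros(v.shape[1])
    coef = 2.0 / (T * (T - 1))
    for _ in range(iters):
        B = v @ u                                             # frame scores B_t
        grad = lam * u                                        # gradient of (lambda/2)||u||^2
        for c in range(T):
            for j in range(c):                                # all pairs with c > j
                if 1.0 - B[c] + B[j] > 0.0:                   # active hinge term
                    grad += coef * (v[j] - v[c])
        u -= lr * grad
    # Arrange u back into an image of the same size as one frame (u').
    return u.reshape(frames.shape[1:])
```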


Preferably, the feature extraction module includes a convolution unit 1, a convolution unit 2, a convolution unit 3, a convolution unit 4, a convolution unit 5 and a multi-feature fusion unit, wherein outputs of the convolution unit 1, the convolution unit 2, the convolution unit 3, the convolution unit 4 and the convolution unit 5 are sequentially inputted into the multi-feature fusion unit, and a final output of the multi-feature fusion unit is M6.


The convolution unit 1 includes two convolution layers and one maximum pooling layer. Each convolution layer has 64 convolution kernels, and each convolution kernel has a size of 3×3. A pooling kernel of the maximum pooling layer has a size of 2×2. An output of the convolution unit 1 is C1.


The convolution unit 2 includes two convolution layers and one maximum pooling layer. Each convolution layer has 128 convolution kernels, and each convolution kernel has a size of 3×3. A pooling kernel of the maximum pooling layer has a size of 2×2. An input of the convolution unit 2 is C1 and an output thereof is C2.


The convolution unit 3 includes three convolution layers and one maximum pooling layer. Each convolution layer has 256 convolution kernels, and each convolution kernel has a size of 3×3. A pooling kernel of the maximum pooling layer has a size of 2×2. An input of the convolution unit 3 is C2 and an output thereof is C3.


The convolution unit 4 includes three convolution layers and one maximum pooling layer. Each convolution layer has 512 convolution kernels, and each convolution kernel has a size of 3×3. A pooling kernel of the maximum pooling layer has a size of 2×2. An input of the convolution unit 4 is C3 and an output thereof is C4.


The convolution unit 5 includes three convolution layers and one maximum pooling layer. Each convolution layer has 512 convolution kernels, and each convolution kernel has a size of 3×3. A pooling kernel of the maximum pooling layer has a size of 2×2. An input of the convolution unit 5 is C4 and an output thereof is C5.


Inputs of the multi-feature fusion unit are the output C1 of the convolution unit 1, the output C2 of the convolution unit 2, the output C3 of the convolution unit 3, the output C4 of the convolution unit 4 and the output C5 of the convolution unit 5. The output C1 of the convolution unit 1 is inputted into a maximum pooling layer 1 and a convolution layer 1 in the multi-feature fusion unit. A pooling kernel of the maximum pooling layer 1 has a size of 4×4. The convolution layer 1 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 1 is M1.


The output C2 of the convolution unit 2 is inputted into a maximum pooling layer 2 and a convolution layer 2 in the multi-feature fusion unit. A pooling kernel of the maximum pooling layer 2 has a size of 2×2. The convolution layer 2 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 2 is M2.


The output C3 of the convolution unit 3 is inputted into a convolution layer 3 in the multi-feature fusion unit. The convolution layer 3 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 3 is M3.


The output C4 of the convolution unit 4 is inputted into an up-sampling layer 1 and a convolution layer 4 in the multi-feature fusion unit. The convolution layer 4 has 512 convolution kernels, the convolution kernel has a size of 1×1, and an output of the convolution layer 4 is M4.


The output C5 of the convolution unit 5 is inputted into an up-sampling layer 2 and a convolution layer 5 in the multi-feature fusion unit. The convolution layer 5 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 5 is M5. The output M1 of the convolution layer 1, the output M2 of the convolution layer 2, the output M3 of the convolution layer 3, the output M4 of the convolution layer 4 and the output M5 of the convolution layer 5 are connected by channel and inputted into a convolution layer 6. The convolution layer 6 has 256 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 6 is M6. An output of the multi-feature fusion unit is the output M6 of the convolution layer 6.


Dynamic images of the front-side projection sequence, the right-side projection sequence, the left-side projection sequence and the top-side projection sequence of the depth video V of the behavior sample are respectively inputted into respective feature extraction modules, namely, a front-side projection feature extraction module, a right-side projection feature extraction module, a left-side projection feature extraction module and a top-side projection feature extraction module. During network training, the modules described above do not share parameters. The feature extraction modules described above respectively output features Qƒ, Qr, Ql and Qt.


Qƒ represents a feature that is extracted when the dynamic image of the front-side projection sequence of the depth video V of the behavior sample is inputted into the front-side projection feature extraction module; Qr represents a feature that is extracted when the dynamic image of the right-side projection sequence of the depth video V of the behavior sample is inputted into the right-side projection feature extraction module; Ql represents a feature that is extracted when the dynamic image of the left-side projection sequence of the depth video V of the behavior sample is inputted into the left-side projection feature extraction module: and Qt represents a feature that is extracted when the dynamic image of the top-side projection sequence of the depth video V of the behavior sample is inputted into the top-side projection feature extraction module.
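A possible PyTorch sketch of one such feature extraction module is given below. The description does not state the convolution padding, the activation functions, the position of the pooling layer inside each convolution unit, the number of input channels of the dynamic image, or how the up-sampling layers are realized; the choices here (3×3 convolutions with padding 1, ReLU, pooling last, one input channel, and bilinear interpolation of C4 and C5 to the spatial size of M3) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_unit(in_ch: int, out_ch: int, n_convs: int) -> nn.Sequential:
    """One convolution unit: n_convs 3x3 conv layers followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class FeatureExtractionModule(nn.Module):
    """Per-projection feature extraction module with the multi-feature fusion unit."""
    def __init__(self, in_ch: int = 1):
        super().__init__()
        self.unit1 = conv_unit(in_ch, 64, 2)
        self.unit2 = conv_unit(64, 128, 2)
        self.unit3 = conv_unit(128, 256, 3)
        self.unit4 = conv_unit(256, 512, 3)
        self.unit5 = conv_unit(512, 512, 3)
        # Multi-feature fusion unit.
        self.pool1 = nn.MaxPool2d(4)
        self.pool2 = nn.MaxPool2d(2)
        # Convolution layers 1-5 of the fusion unit: 512 kernels of size 1x1 each.
        self.lat = nn.ModuleList([nn.Conv2d(c, 512, 1) for c in (64, 128, 256, 512, 512)])
        self.fuse = nn.Conv2d(5 * 512, 256, 1)   # convolution layer 6: 256 kernels, 1x1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c1 = self.unit1(x)
        c2 = self.unit2(c1)
        c3 = self.unit3(c2)
        c4 = self.unit4(c3)
        c5 = self.unit5(c4)
        m1 = self.lat[0](self.pool1(c1))
        m2 = self.lat[1](self.pool2(c2))
        m3 = self.lat[2](c3)
        # Up-sampling layers realized as interpolation to M3's spatial size (assumption).
        m4 = self.lat[3](F.interpolate(c4, size=c3.shape[-2:], mode="bilinear",
                                       align_corners=False))
        m5 = self.lat[4](F.interpolate(c5, size=c3.shape[-2:], mode="bilinear",
                                       align_corners=False))
        return self.fuse(torch.cat([m1, m2, m3, m4, m5], dim=1))   # M6
```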


Preferably, a linkage feature is extracted in step 4) by combining every two, every three and every four of the features extracted by all the feature extraction modules in step 3) to obtain multiple projection combinations.


The linkage feature of each projection combination is calculated as follows:

    • connecting the features in the projection combination by channel to obtain a combined feature Q∈ℝ^{H×W×γJ}, in which H and W represent a height and a width of each feature in the projection combination respectively, J represents the quantity of channels of each feature in the projection combination, and γ represents the quantity of features in the projection combination; calculating an explicit linkage feature Zα of each projection combination and an implicit linkage feature Zβ of each projection combination; and calculating a linkage feature Z of the projection combination according to a formula:







Z = Z_\alpha \oplus Z_\beta,






    • in which ⊕ represents addition of elements in corresponding locations of matrices Zα and Zβ.





Preferably, in step 5), the linkage features of all the projection combinations are connected by channel, and inputted into the average pooling layer. An output Γ of the average pooling layer is inputted into a fully connected layer 2. The quantity of neurons in the fully connected layer 2 is D2. An output S2 of the fully connected layer 2 is calculated as follows:








S_2 = \phi_{\text{relu}}(W_2 \cdot \Gamma + \theta_2),






    • in which ϕrelu is an activation function relu, W2 is a weight of the fully connected layer 2, and θ2 is a bias vector of the fully connected layer 2.





The output S2 of the fully connected layer 2 is inputted into a fully connected layer 3 with an activation function softmax. The quantity of neurons in the fully connected layer 3 is K. An output S3 is calculated as follows:








S_3 = \phi_{\text{softmax}}(W_3 \cdot S_2 + \theta_3),






    • in which ϕsoftmax represents the activation function softmax, W3 is a weight of the fully connected layer 3, and θ3 is a bias vector of the fully connected layer 3.
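For illustration, the average pooling layer and the two fully connected layers of step 5) could be written as follows in PyTorch; the use of global average pooling and the treatment of D2 and K as constructor arguments are implementation choices made here.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Average pooling layer plus fully connected layers 2 and 3 of step 5)."""
    def __init__(self, in_channels: int, D2: int, K: int):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d(1)     # average pooling layer, output Gamma
        self.fc2 = nn.Linear(in_channels, D2)  # fully connected layer 2 (D2 neurons)
        self.fc3 = nn.Linear(D2, K)            # fully connected layer 3 (K neurons)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: linkage features of all projection combinations, connected by channel.
        gamma = self.avg(z).flatten(1)
        s2 = torch.relu(self.fc2(gamma))           # S2 = relu(W2 . Gamma + theta2)
        return torch.softmax(self.fc3(s2), dim=1)  # S3 = softmax(W3 . S2 + theta3)
```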





Preferably, an input of the depth video linkage feature-based behavior recognition network in step 6) is the depth video of the behavior sample, and an output thereof is a probability that a corresponding behavior sample belongs to a respective behavior category, i.e., an output of the fully connected layer 3 is Q3. A loss function L of the network is:







L = -\sum_{g=1}^{G} \sum_{p=1}^{K} [l_g]_p \log([Q_3^g]_p),






    • in which G is a total quantity of training behavior samples, K is the quantity of categories of the behavior samples, Q3g is a network output of a gth behavior sample, and lg is an expected output of the gth behavior sample. pth-dimension data of lg is defined as:










[l_g]_p = \begin{cases} 1, & \text{if } p = l_g \\ 0, & \text{else} \end{cases},








    • in which lg is a tag value of the gth behavior sample.
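The loss above is a standard cross-entropy over the softmax outputs Q3, summed over the training samples; a minimal PyTorch rendering is sketched below (the small epsilon is only a numerical safeguard added here, and labels are assumed to be zero-based).

```python
import torch
import torch.nn.functional as F

def behavior_loss(q3: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L = -sum_g sum_p [l_g]_p * log([Q3_g]_p).

    q3:     softmax outputs of shape (G, K).
    labels: tag values l_g of shape (G,), taking values 0..K-1 (assumption).
    """
    one_hot = F.one_hot(labels, num_classes=q3.shape[1]).float()  # expected outputs
    return -(one_hot * torch.log(q3 + 1e-12)).sum()
```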





Preferably, the behavior recognition in step 8) includes: inputting a depth video of each tested behavior sample into the trained depth video linkage feature-based behavior recognition network to obtain a predicted probability value of a current tested behavior video sample belonging to each behavior category, and taking the behavior category with the largest probability value as the finally predicted behavior category to which the current tested behavior video sample belongs, so as to implement the behavior recognition.


Preferably, the explicit linkage feature of each projection combination is calculated by the following steps:

    • 1) calculating an average value of features of each channel and an average value Qa of features of an ath channel of the combined feature Q according to a formula:









\bar{Q}_a = \frac{1}{H \times W} \sum_{h,w}^{H,W} Q_{a,h,w},






    • in which Qa,h,w represents an hth-row and wth-column element value of the ath channel of the combined feature Q;

    • 2) calculating a degree of explicit correlation P∈γJ×γJ of features between different channels of the combined feature Q, a degree of explicit correlation Pa,b of features between the ath channel and a bth channel being calculated according to a formula:











P_{a,b} = \frac{1}{H \times W} \sum_{h,w}^{H,W} (Q_{a,h,w} - \bar{Q}_a)(Q_{b,h,w} - \bar{Q}_b),






    • in which Qb,h,w represents an hth-row and wth-column element value of the bth channel of the combined feature Q, and Qb represents an average value of features of the bth channel of the combined feature Q;

    • 3) calculating a degree of normalized explicit correlation {circumflex over (P)}∈γJ×γJ of features between the different channels of the combined feature Q, a degree of normalized explicit correlation {circumflex over (P)}a,b of features between the ath channel and the bth channel being calculated according to a formula:












\hat{P}_{a,b} = \frac{e^{P_{a,b}}}{\sum_{b=1}^{\gamma J} e^{P_{a,b}}};




and

    • 4) calculating an explicit linkage feature Zα∈ℝ^{H×W×γJ} of the projection combination, a feature Zαa of the ath channel of Zα being calculated according to a formula:








Z_\alpha^a = \sum_{b=1}^{\gamma J} \hat{P}_{a,b} Q_b,






    • in which Qb represents a feature of the bth channel of the combined feature Q.





Preferably, the implicit linkage feature of each projection combination is calculated by the following steps:

    • 1) calculating an average value of each channel of the combined feature Q, and connecting the average values of all the channels into a vector Q̄=(Q̄1, Q̄2, . . . , Q̄γJ);
    • 2) inputting the vector Q̄ into the fully connected layer 1, the quantity of neurons of the fully connected layer 1 being γJ, an output of the fully connected layer 1 being S1=ϕsigmoid(W1·Q̄+θ1)∈ℝ^{γJ}, in which ϕsigmoid represents an activation function sigmoid, W1∈ℝ^{γJ×γJ} represents a weight of the fully connected layer 1, and θ1∈ℝ^{γJ} represents a bias vector of the fully connected layer 1; and
    • 3) calculating an implicit linkage feature Zβ∈H×W×γJ of the projection combination, a feature Zβa of an ath channel of Zβ being calculated according to a formula:








Z_\beta^a = S_1^a \cdot Q_a,






    • in which S1a represents a value of an ath element of the output S1 of the fully connected layer 1.
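Putting the explicit and implicit branches together, a compact PyTorch sketch of the linkage feature of one projection combination might look as follows; the batched einsum formulation is an implementation choice, and the fully connected layer 1 is realized as nn.Linear (which carries the bias θ1).

```python
import torch
import torch.nn as nn

class LinkageFeature(nn.Module):
    """Linkage feature Z = Z_alpha (+) Z_beta of one projection combination.

    The input Q has shape (batch, gamma*J, H, W), obtained by connecting the
    features of the combination by channel.
    """
    def __init__(self, channels: int):                  # channels = gamma * J
        super().__init__()
        self.fc1 = nn.Linear(channels, channels)         # fully connected layer 1

    def forward(self, Q: torch.Tensor) -> torch.Tensor:
        b, c, h, w = Q.shape
        q_mean = Q.mean(dim=(2, 3))                       # channel means Q_bar_a
        centered = Q - q_mean.view(b, c, 1, 1)
        # Explicit correlation P_{a,b} and its normalization (softmax over b).
        P = torch.einsum("bahw,bchw->bac", centered, centered) / (h * w)
        P_hat = torch.softmax(P, dim=2)
        Z_alpha = torch.einsum("bac,bchw->bahw", P_hat, Q)  # explicit linkage feature
        # Implicit linkage: per-channel gating by sigmoid(FC1(Q_bar)).
        S1 = torch.sigmoid(self.fc1(q_mean))
        Z_beta = Q * S1.view(b, c, 1, 1)
        return Z_alpha + Z_beta                           # element-wise sum Z
```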





The present invention has the following beneficial effects: 1) depth video-based behavior recognition does not capture appearance information of the human body, thereby protecting personal privacy; meanwhile, the depth video is less susceptible to illumination and thus provides more abundant three-dimensional information about a behavior; and

    • 2) information about the behavior in different dimensions can be acquired by projecting the depth video onto different planes, and these pieces of information can be combined to make it easier to recognize a human behavior; the learned linkage features of the depth video in different dimensions are more discriminative for behavior recognition.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a flow chart according to the present invention;



FIG. 2 is a flow chart of a feature extraction module;



FIG. 3 is a flow chart illustrating extraction of linkage features from each projection combination;



FIG. 4 is a flow chart of a depth video linkage feature-based behavior recognition network;



FIGS. 5A-5D are schematic diagrams of planar projections of a hand waving behavior according to an embodiment; and



FIG. 6 is a dynamic image of a front-side projection of a hand waving behavior according to an embodiment.





DETAILED DESCRIPTION

The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only part of embodiments of the present invention, rather than all of the embodiments. According to the described embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without any creative work fall within the protection scope of the present invention.


According to the embodiments of the present invention, referring to FIGS. 1-6, a depth video linkage feature-based behavior recognition method includes the following steps:

    • 1) projecting a depth video of each behavior sample onto a front side, a right side, a left side and a top side to obtain four projection sequences;
    • 2) obtaining four dynamic images of each behavior sample by calculating dynamic images of the four projection sequences of each behavior sample;
    • 3) inputting the four dynamic images into respective feature extraction modules and extracting features;
    • 4) inputting the features extracted from the dynamic images of the four projection sequences into a multi-projection linkage feature extraction module and extracting a linkage feature of each projection combination;
    • 5) connecting the extracted linkage features of all the projection combinations by channel, and inputting the connected features into an average pooling layer and two fully connected layers;
    • 6) constructing a depth video linkage feature-based behavior recognition network;
    • 7) inputting a depth video of each training behavior sample into the depth video linkage feature-based behavior recognition network, and training the network till convergence; and
    • 8) inputting a depth video of each tested behavior sample into the trained depth video linkage feature-based behavior recognition network to implement behavior recognition.


The dynamic image is obtained in step 2) as follows.


By taking a front-side projection sequence Vfront={Ft|t∈[1, N]} of a depth video V of the behavior sample as an example, the dynamic image is calculated as follows:

    • vectorizing Ft first, i.e., connecting a row vector of Ft into a new row vector it;
    • solving an arithmetic square root of each element in the row vector it to obtain a new vector wt, i.e.:








w_t = \sqrt{i_t},






    • in which √{square root over (it)} indicates to solve an arithmetic square root of each element in the row vector it, wt being denoted by a frame vector of a tth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample;

    • calculating a feature vector vt of the tth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample according to a formula:











v_t = \frac{1}{t} \sum_{\kappa=1}^{t} w_\kappa,






    • in which Σκ=1t wκ represents summation of frame vectors from a first-frame image to the tth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample;

    • calculating a score Bt of the tth-frame image Ft of the front-side projection sequence Vfront of the depth video V of the behavior sample according to a formula:











B_t = u^T \cdot v_t,






    • in which u is a vector in a dimension A, A=R×C, uT represents transposition of the vector u, and uT·vt indicates to calculate a dot product of the feature vector vt and a vector obtained by transposing the vector u;

    • calculating a value of u, such that frame images in the front-side projection sequence Vfront have higher and higher scores from front to back, i.e., the larger the t is, the higher the score Bt is, u being calculated by using RankSVM as follows:










u = \arg\min_u E(u), \quad E(u) = \frac{\lambda}{2} \lVert u \rVert^2 + \frac{2}{T(T-1)} \times \sum_{c>j} \max\{0, 1 - B_c + B_j\},






    • in which arg min_u E(u) represents the u that minimizes the value of E(u), λ is a constant, and ∥u∥2 indicates to calculate a sum of squares of all elements in the vector u; Bc and Bj respectively represent a score of a cth-frame image and a score of a jth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample, and max{0,1−Bc+Bj} indicates to choose a larger value of 0 and 1−Bc+Bj; and

    • in response to calculating the vector u by using RankSVM, arranging the vector u in an image form with the same size as Ft to obtain u′∈R×C, u′ being the dynamic image of the front-side projection sequence Vfront of the depth video V of the behavior sample.


The dynamic images of a right-side projection sequence, a left-side projection sequence and a top-side projection sequence of the depth video V of the behavior sample are calculated in the same way as the dynamic image of the front-side projection sequence.


A linkage feature of each projection combination is extracted in step 4) as follows.


As shown in FIG. 3, every two, every three and every four of the features extracted after the dynamic images of the four projection sequences are inputted into the respective feature extraction modules are combined to obtain a total of 11 projection combinations. A combination of the features extracted from the dynamic images of the front-side projection sequence and of the left-side projection sequence is denoted by a 1-2 projection combination. A combination of the features extracted from the dynamic images of the front-side projection sequence and of the right-side projection sequence is denoted by a 1-3 projection combination. A combination of the features extracted from the dynamic images of the front-side projection sequence and of the top-side projection sequence is denoted by a 1-4 projection combination. A combination of the features extracted from the dynamic images of the left-side projection sequence and of the right-side projection sequence is denoted by a 2-3 projection combination. A combination of the features extracted from the dynamic images of the left-side projection sequence and of the top-side projection sequence is denoted by a 2-4 projection combination. A combination of the features extracted from the dynamic images of the right-side projection sequence and of the top-side projection sequence is denoted by a 3-4 projection combination. A combination of the features extracted from the dynamic images of the front-side projection sequence, of the left-side projection sequence and of the right-side projection sequence is denoted by a 1-2-3 projection combination. A combination of the features extracted from the dynamic images of the front-side projection sequence, of the left-side projection sequence and of the top-side projection sequence is denoted by a 1-2-4 projection combination. A combination of the features extracted from the dynamic images of the front-side projection sequence, of the right-side projection sequence and of the top-side projection sequence is denoted by a 1-3-4 projection combination. A combination of the features extracted from the dynamic images of the left-side projection sequence, of the right-side projection sequence and of the top-side projection sequence is denoted by a 2-3-4 projection combination. A combination of the features extracted from the dynamic images of the front-side projection sequence, of the left-side projection sequence, of the right-side projection sequence and of the top-side projection sequence is denoted by a 1-2-3-4 projection combination.
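For reference, the 11 projection combinations enumerated above (with 1 = front, 2 = left, 3 = right and 4 = top, following the naming in this paragraph) can be generated as follows.

```python
from itertools import combinations

# Index convention used in the text: 1 = front, 2 = left, 3 = right, 4 = top.
views = (1, 2, 3, 4)

# Every pair, triple and quadruple of projections: C(4,2) + C(4,3) + C(4,4) = 11.
projection_combinations = [c for r in (2, 3, 4) for c in combinations(views, r)]

assert len(projection_combinations) == 11
print(["-".join(map(str, c)) for c in projection_combinations])
# ['1-2', '1-3', '1-4', '2-3', '2-4', '3-4',
#  '1-2-3', '1-2-4', '1-3-4', '2-3-4', '1-2-3-4']
```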


The linkage feature of each projection combination is calculated as follows:

    • connecting the features in the projection combination by channel to obtain a combined feature Q∈H×W×γJ, in which H and W represent a height and a width of each feature in the projection combination respectively, J represents the quantity of channels of each feature in the projection combination, and γ represents the quantity of features in the projection combination.


An explicit linkage feature of each projection combination is calculated first by the following steps:

    • 1) calculating an average value of features of each channel and an average value Qa of features of an ath channel of the combined feature Q according to a formula:









\bar{Q}_a = \frac{1}{H \times W} \sum_{h,w}^{H,W} Q_{a,h,w},






    • in which Qa,h,w represents an hth-row and wth-column element value of the ath channel of the combined feature Q;

    • 2) calculating a degree of explicit correlation P∈γJ×γJ of features between different channels of the combined feature Q, a degree of explicit correlation Pa,b of features between the ath channel and a bth channel being calculated according to a formula:











P_{a,b} = \frac{1}{H \times W} \sum_{h,w}^{H,W} (Q_{a,h,w} - \bar{Q}_a)(Q_{b,h,w} - \bar{Q}_b),






    • in which Qb,h,w represents an hth-row and wth-column element value of the bth channel of the combined feature Q, and Qb represents an average value of features of the bth channel of the combined feature Q;

    • 3) calculating a degree of normalized explicit correlation P∈γJ×γJ of features between the different channels of the combined feature Q, a degree of normalized explicit correlation {circumflex over (P)}a,b of features between the ath channel and the bth channel being calculated according to a formula:












\hat{P}_{a,b} = \frac{e^{P_{a,b}}}{\sum_{b=1}^{\gamma J} e^{P_{a,b}}};




and

    • 4) calculating an explicit linkage feature Zα∈H×W×γJ of the projection combination, a feature Zαa of the ath channel of Zα being calculated according to a formula:








Z_\alpha^a = \sum_{b=1}^{\gamma J} \hat{P}_{a,b} Q_b,






    • in which Qb represents a feature of the bth channel of the combined feature Q.





After that, an implicit linkage feature of each projection combination is calculated by the following steps:

    • 1) calculating an average value of each channel of the combined feature Q, and connecting the average values of all the channels into a vector Q̄=(Q̄1, Q̄2, . . . , Q̄γJ);
    • 2) inputting the vector Q̄ into a fully connected layer 1, the quantity of neurons of the fully connected layer 1 being γJ, an output of the fully connected layer 1 being S1=ϕsigmoid(W1·Q̄+θ1)∈ℝ^{γJ}, in which ϕsigmoid represents an activation function sigmoid, W1∈ℝ^{γJ×γJ} represents a weight of the fully connected layer 1, and θ1∈ℝ^{γJ} represents a bias vector of the fully connected layer 1; and
    • 3) calculating an implicit linkage feature Zβ∈H×W×γJ of the projection combination, a feature Zβa of an ath channel of Zβ being calculated according to a formula:








Z_\beta^a = S_1^a \cdot Q_a,






    • in which S1a represents a value of an ath element of the output S1 of the fully connected layer 1.





Finally, the linkage feature Z of each projection combination is calculated according to a formula:







Z = Z_\alpha \oplus Z_\beta,






    • in which ⊕ represents addition of elements in corresponding locations of matrices Zα and Zβ.





There are 11 projection combinations in total, and hence 11 linkage features may be obtained.


The depth video linkage feature-based behavior recognition network is constructed in step 6). As shown in FIG. 4, an input of the network is the depth video of the behavior sample, an output thereof is a probability that a corresponding behavior sample belongs to the respective behavior category, i.e., an output of a fully connected layer 3 is Q3. A loss function L of the network is:







L = -\sum_{g=1}^{G} \sum_{p=1}^{K} [l_g]_p \log([Q_3^g]_p),






    • in which G is the total number of training behavior samples, K is the quantity of categories of the behavior samples, Q3g is a network output of a gth behavior sample, lg is an expected output of the gth behavior sample, and pth-dimension data of lg is defined as:











[l_g]_p = \begin{cases} 1, & \text{if } p = l_g \\ 0, & \text{else} \end{cases},








    • in which lg is a tag value of the gth behavior sample.





In step 7), the depth video of each training behavior sample is inputted into the depth video linkage feature-based behavior recognition network, and the network is trained till convergence.


In step 8), the depth video of each tested behavior sample is inputted into the trained depth video linkage feature-based behavior recognition network to obtain a predicted probability value of a current tested behavior video sample belonging to the respective behavior category, and the behavior category with the largest probability value is the finally predicted behavior category to which the current tested behavior video sample belongs, so as to implement the behavior recognition.


Embodiments

As shown in FIGS. 5A-5D and 6:

    • 1) there are 2,400 samples in total in a behavior sample set, including 8 behavior categories, with 300 samples in each behavior category. Two thirds of the samples in each behavior category are randomly selected and assigned to a training set, and the remaining one third are assigned to a testing set, to obtain a total of 1,600 training samples and 800 testing samples (a sketch of this split is given after the notation below). Each behavior sample consists of all frames in a depth video of the sample. A depth video V of any behavior sample is taken as an example:







V = \{ I_t \mid t \in [1, 50] \},






    • in which t represents a time index. There are 50 frames in total in the behavior sample. I_t ∈ ℝ^{240×240} is a matrix representation of a tth-frame depth image of the depth video V of the behavior sample. The tth-frame depth image has 240 rows and 240 columns. ℝ indicates that the matrix is a real matrix. It(xi, yi)=di represents a depth value of a point pi with coordinates (xi, yi) on the tth-frame depth image, i.e., a distance between the point pi and a depth camera.
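A minimal sketch of the per-category training/testing split described in item 1) above is given below; the fixed random seed is only for reproducibility of the sketch and is not part of the embodiment.

```python
import random

def split_per_category(samples_by_category: dict, train_fraction: float = 2 / 3,
                       seed: int = 0):
    """Randomly assign two thirds of each category's samples to the training set
    and the remaining third to the testing set (e.g. 200/100 per category here)."""
    rng = random.Random(seed)
    train, test = [], []
    for samples in samples_by_category.values():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test
```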





The depth video V of the behavior sample is respectively projected onto four planes, including a front side, a right side, a left side and a top side. At this time, the depth video V of the behavior sample may be denoted by a set of four projection sequences, which is expressed by the following formula:







V = \{ V_{\text{front}}, V_{\text{right}}, V_{\text{left}}, V_{\text{top}} \},






    • in which Vfront represents a projection sequence obtained by projecting the depth video V of the behavior sample onto a front side, Vright represents a projection sequence obtained by projecting the depth video V of the behavior sample onto a right side, Vleft represents a projection sequence obtained by projecting the depth video V of the behavior sample onto a left side, and the Vtop represents a projection sequence obtained by projecting the depth video V of the behavior sample onto a top side.





Vfront={Ft|t∈[1,50]}, in which Ft∈240×240 represents a projection graph obtained by projecting the tth-frame depth image of the depth video V of the behavior sample onto the front side. An abscissa value xi, an ordinate value yi and a depth value di of the point pi in the depth image respectively determine an abscissa value xiƒ, an ordinate value yiƒ and a pixel value ziƒ of a point projected from the point pi onto the projection graph Ft, which are denoted by the formulas:









F_t(x_i^f, y_i^f) = z_i^f, \quad x_i^f = x_i, \quad y_i^f = y_i, \quad z_i^f = f_1(d_i),






    • in which ƒ1 is a linear function indicating that the depth value di is mapped to an interval [0,255], such that the smaller the depth value is, the larger the pixel value on the projection graph is, i.e., the closer the point is to the depth camera, the brighter the point is on a front-side projection graph.





Vright={Rt|t∈[1,50]}, in which Rt∈240×240 represents a projection graph obtained by projecting the tth-frame depth image onto the right side. There may be more than one point projected onto the same location on the projection graph when the depth image is projected onto the right side. A point closest to an observer, i.e., a point furthest from a projection plane, can be seen when a behavior is observed from the right side. Therefore, an abscissa value of the point furthest from the projection plane on the depth image should be reserved, and a pixel value of the point in this location of the projection graph is calculated according to the abscissa value. For this purpose, points in the depth image are traversed column by column from a column with the smallest abscissa x in the depth image in a direction in which x increases, and are projected onto the projection graph. An abscissa value xi, an ordinate value yi and a depth value di of the point pi in the depth image respectively determine a pixel value zir, an ordinate value yir and an abscissa value xir of a point in the projection graph Rt, which are denoted by the formulas:









R_t(x_i^r, y_i^r) = z_i^r, \quad x_i^r = d_i, \quad y_i^r = y_i, \quad z_i^r = f_2(x_i),






    • in which ƒ2 is a linear function indicating that the abscissa value xi is mapped to an interval [0,255]. In a case that x continues to increase, a new point is reserved if the new point and the previously projected point are projected onto the same location in the projection graph, i.e., a pixel value of this location in the projection graph is calculated by using the abscissa value of the point with the largest abscissa value, i.e., zir=ƒ2(xm), in which xm=max xi, xi∈XR, XR is a set of abscissas of all points with ordinate values yir and depth values xir in the depth image, and max xi, xi∈XR represents a maximum abscissa value in the set XR.





Vleft={Lt|t∈[1,50]}, in which Lt∈ℝ^{240×240} represents a projection graph obtained by projecting the tth-frame depth image onto the left side. Similar to acquisition of the right-side projection graph, in a case that multiple points are projected onto the same location on a left-side projection graph, a point furthest from a projection plane should be reserved. For this purpose, points in the depth image are traversed column by column from a column with the largest abscissa x in the depth image in a direction in which x decreases, and are projected onto the left-side projection graph. An abscissa value xi, an ordinate value yi and a depth value di of the point pi in the depth image respectively determine a pixel value zil, an ordinate value yil and an abscissa value xil of a point in the projection graph Lt. For a point projected onto the same coordinates (xil, yil) on the left-side projection graph, an abscissa value of the point with the smallest abscissa is selected to calculate a pixel value at the coordinates of the projection graph, which is denoted by a formula:









L_t(x_i^l, y_i^l) = z_i^l, \quad x_i^l = d_i, \quad y_i^l = y_i, \quad z_i^l = f_3(x_n),





    • in which ƒ3 is a linear function indicating that an abscissa value xn is mapped to an interval [0,255], xn=min xi, xi∈XL, in which XL is a set of abscissas of all points with ordinate values yil and depth values xil in the depth image, and min xi, xi∈XL represents a minimum abscissa value in the set XL.





Vtop={Ot|t∈[1,50]}, in which Ot∈ℝ^{240×240} represents a projection graph obtained by projecting the tth-frame depth image onto the top side. In a case that multiple points are projected onto the same location on a top-side projection graph, a point furthest from a projection plane is reserved. Points in the depth image are traversed column by column from a column with the smallest ordinate y on the depth image in a direction in which y increases, and are projected onto the top-side projection graph. An abscissa value xi, an ordinate value yi and a depth value di of the point pi in the depth image respectively determine an abscissa value xio, a pixel value zio and an ordinate value yio of a point projected from the point pi onto the projection graph Ot. For a point projected onto the same coordinates (xio, yio) on the projection graph, an ordinate value of the point with the largest ordinate is selected to calculate a pixel value at the coordinates of the projection graph, which is denoted by a formula:









O_t(x_i^o, y_i^o) = z_i^o, \quad x_i^o = x_i, \quad y_i^o = d_i, \quad z_i^o = f_4(y_q),





    • in which ƒ4 is a linear function indicating that an ordinate value yq is mapped to an interval [0,255]; yq=max yi, yi∈Yo, in which Yo is a set of ordinates of all points with abscissa values xio and depth values yio in the depth image, and max yi, yi∈Yo represents a maximum ordinate value in the set Yo.

    • 2) dynamic images of the four projection sequences of the depth video of each behavior sample are calculated to obtain four dynamic images of each behavior sample. By taking the front-side projection sequence Vfront={Ft|t∈[1,50]} of the depth video V of the behavior sample as an example, the dynamic image is calculated as follows.

    • Ft is vectorized first, i.e., a row vector of Ft is connected into a new row vector it.





An arithmetic square root of each element in the row vector it is solved to obtain a new vector Wt, i.e.:








w_t = \sqrt{i_t},






    • in which √{square root over (it)} indicates to solve an arithmetic square root of each element in the row vector it. wt is denoted as the frame vector of the tth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample.





A feature vector Vt of the tth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample is calculated according to a formula:








v_t = \frac{1}{t} \sum_{\kappa=1}^{t} w_\kappa,






    • in which Σκ=1t Wκ represents summation of frame vectors from a first-frame image to the tth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample.





A score Bt of the tth-frame image Ft of the front-side projection sequence Vfront of the depth video V of the behavior sample is calculated according to a formula:








B_t = u^T \cdot v_t,






    • in which u is a vector in a dimension of 57600, uT represents transposition of the vector u, and uT·vt indicates to calculate a dot product of the feature vector Vt and a vector obtained by transposing the vector u.





A value of u is calculated, such that frame images in the front-side projection sequence Vfront have higher and higher scores from front to back, i.e., the larger the t is, the higher the score Bt is. u is calculated by using RankSVM as follows:







u = \arg\min_u E(u), \quad E(u) = \frac{\lambda}{2} \lVert u \rVert^2 + \frac{1}{1225} \times \sum_{c>j} \max\{0, 1 - B_c + B_j\},







    • in which arg min_u E(u) represents the u that minimizes the value of E(u), λ is a constant, and ∥u∥2 indicates to calculate a sum of squares of each element in the vector u; Bc and Bj respectively represent a score of a cth-frame image and a score of a jth-frame image of the front-side projection sequence Vfront of the depth video V of the behavior sample, and max{0,1−Bc+Bj} indicates to choose a larger value of 0 and 1−Bc+Bj.


In response to calculating the vector u by using RankSVM, the vector u is arranged in an image form with the same size as Ft to obtain u′∈ℝ^{240×240}. u′ is a dynamic image of the front-side projection sequence Vfront of the depth video V of the behavior sample. FIG. 6 shows the dynamic image of the front-side projection of a hand waving behavior.


The dynamic images of the right-side projection sequence, the left-side projection sequence and the top-side projection sequence of the depth video V of the behavior sample are calculated in the same way as the dynamic images of the front-side projection sequence.


3) The dynamic images of the front-side projection sequence, the right-side projection sequence, the left-side projection sequence and the top-side projection sequence of the depth video of the behavior sample are inputted into their respective feature extraction modules for extracting features. The feature extraction module includes a convolution unit 1, a convolution unit 2, a convolution unit 3, a convolution unit 4, a convolution unit 5 and a multi-feature fusion unit.


The convolution unit 1 includes two convolution layers and one maximum pooling layer. Each convolution layer has 64 convolution kernels, and each convolution kernel has a size of 3×3. A pooling kernel of the maximum pooling layer has a size of 2×2. An output of the convolution unit 1 is C1.


The convolution unit 2 includes two convolution layers and one maximum pooling layer. Each convolution layer has 128 convolution kernels, and each convolution kernel has a size of 3×3. A pooling kernel of the maximum pooling layer has a size of 2×2. An input of the convolution unit 2 is C1 and an output thereof is C2.


The convolution unit 3 includes three convolution layers and one maximum pooling layer. Each convolution layer has 256 convolution kernels, and each convolution kernel has a size of 3×3. A pooling kernel of the maximum pooling layer has a size of 2×2. An input of the convolution unit 3 is C2 and an output thereof is C3.


The convolution unit 4 includes three convolution layers and one maximum pooling layer. Each convolution layer has 512 convolution kernels, and each convolution kernel has a size of 3×3. A pooling kernel of the maximum pooling layer has a size of 2×2. An input of the convolution unit 4 is C3 and an output thereof is C4.


The convolution unit 5 includes three convolution layers and one maximum pooling layer. Each convolution layer has 512 convolution kernels, and each convolution kernel has a size of 3×3. A pooling kernel of the maximum pooling layer has a size of 2×2. An input of the convolution unit 5 is C4 and an output thereof is C5.


Inputs of the multi-feature fusion unit are the output C1 of the convolution unit 1, the output C2 of the convolution unit 2, the output C3 of the convolution unit 3, the output C4 of the convolution unit 4 and the output C5 of the convolution unit 5. The output C1 of the convolution unit 1 is inputted into a maximum pooling layer 1 and a convolution layer 1 in the multi-feature fusion unit. A pooling kernel of the maximum pooling layer 1 has a size of 4×4. The convolution layer 1 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 1 is M1.


The output C2 of the convolution unit 2 is inputted into a maximum pooling layer 2 and a convolution layer 2 in the multi-feature fusion unit. A pooling kernel of the maximum pooling layer 2 has a size of 2×2. The convolution layer 2 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 2 is M2.


The output C3 of the convolution unit 3 is inputted into a convolution layer 3 in the multi-feature fusion unit. The convolution layer 3 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 3 is M3.


The output C4 of the convolution unit 4 is inputted into an up-sampling layer 1 and a convolution layer 4 in the multi-feature fusion unit. The convolution layer 4 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 4 is M4.


The output C5 of the convolution unit 5 is inputted into an up-sampling layer 2 and a convolution layer 5 in the multi-feature fusion unit. The convolution layer 5 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 5 is M5. The output M1 of the convolution layer 1, the output M2 of the convolution layer 2, the output M3 of the convolution layer 3, the output M4 of the convolution layer 4 and the output M5 of the convolution layer 5 are connected by channel and inputted into a convolution layer 6. The convolution layer 6 has 256 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 6 is M6. An output of the multi-feature fusion unit is the output M6 of the convolution layer 6.


Dynamic images of the front-side projection sequence, the right-side projection sequence, the left-side projection sequence and the top-side projection sequence of the depth video V of the behavior sample are respectively inputted into their respective feature extraction modules, namely, a front-side projection feature extraction module, a right-side projection feature extraction module, a left-side projection feature extraction module and a top-side projection feature extraction module. The four feature extraction modules are of the same structure. However, during network training, the four modules do not share parameters. The four feature extraction modules respectively output features Qƒ, Qr, Ql and Qt, which respectively represent a feature that is extracted when the dynamic image of the front-side projection sequence of the depth video V of the behavior sample is inputted into the front-side projection feature extraction module, a feature that is extracted when the dynamic image of the right-side projection sequence of the depth video V of the behavior sample is inputted into the right-side projection feature extraction module, a feature that is extracted when the dynamic image of the left-side projection sequence of the depth video V of the behavior sample is inputted into the left-side projection feature extraction module, and a feature that is extracted when the dynamic image of the top-side projection sequence of the depth video V of the behavior sample is inputted into the top-side projection feature extraction module.


4) The features extracted by all the feature extraction modules are inputted into the multi-projection linkage feature extraction module, and a linkage feature of each projection combination is extracted. Every two, every three and every four of the features extracted in response to the dynamic images of the four projection sequences being inputted into the respective feature extraction modules are combined to obtain a total of 11 projection combinations, denoted as follows (the digits 1, 2, 3 and 4 refer to the front-side, left-side, right-side and top-side projections, respectively):
    • 1-2 projection combination: the features extracted from the dynamic images of the front-side and left-side projection sequences;
    • 1-3 projection combination: the features extracted from the dynamic images of the front-side and right-side projection sequences;
    • 1-4 projection combination: the features extracted from the dynamic images of the front-side and top-side projection sequences;
    • 2-3 projection combination: the features extracted from the dynamic images of the left-side and right-side projection sequences;
    • 2-4 projection combination: the features extracted from the dynamic images of the left-side and top-side projection sequences;
    • 3-4 projection combination: the features extracted from the dynamic images of the right-side and top-side projection sequences;
    • 1-2-3 projection combination: the features extracted from the dynamic images of the front-side, left-side and right-side projection sequences;
    • 1-2-4 projection combination: the features extracted from the dynamic images of the front-side, left-side and top-side projection sequences;
    • 1-3-4 projection combination: the features extracted from the dynamic images of the front-side, right-side and top-side projection sequences;
    • 2-3-4 projection combination: the features extracted from the dynamic images of the left-side, right-side and top-side projection sequences;
    • 1-2-3-4 projection combination: the features extracted from the dynamic images of the front-side, left-side, right-side and top-side projection sequences.
A short code sketch of this enumeration follows the list.
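The sketch below enumerates the 11 combinations with the digit-to-view mapping given above; variable names are illustrative only.

```python
from itertools import combinations

# Digits as used in the text: 1 = front side, 2 = left side, 3 = right side, 4 = top side.
views = {1: "front", 2: "left", 3: "right", 4: "top"}

# Every two, every three and every four of the four view features: 6 + 4 + 1 = 11 combinations.
projection_combinations = [combo for r in (2, 3, 4) for combo in combinations(sorted(views), r)]
assert len(projection_combinations) == 11
# For example, (1, 2) is the 1-2 projection combination (front-side and left-side features).
```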


The linkage feature of each projection combination is calculated. By taking the 1-2 projection combination as an example, its linkage feature is calculated as follows:


The features Qƒ and Ql in the 1-2 projection combination are connected by channel to obtain a combined feature $Q \in \mathbb{R}^{H \times W \times 512}$, in which H and W represent a height and a width of Qƒ and Ql.


An explicit linkage feature of the projection combination is first calculated by the following steps:

    • (1) calculating an average value of the features of each channel of the combined feature Q, the average value $\bar{Q}_a$ of the features of the $a$th channel being calculated according to a formula:

$$\bar{Q}_a = \frac{1}{H \times W}\sum_{h,w}^{H,W} Q_{a,h,w},$$

    • in which $Q_{a,h,w}$ represents the $h$th-row and $w$th-column element value of the $a$th channel of the combined feature Q;

    • (2) calculating a degree of explicit correlation $P \in \mathbb{R}^{512 \times 512}$ of features between different channels of the combined feature Q, a degree of explicit correlation $P_{a,b}$ of features between the $a$th channel and a $b$th channel being calculated according to a formula:

$$P_{a,b} = \frac{1}{H \times W}\sum_{h,w}^{H,W}\left(Q_{a,h,w} - \bar{Q}_a\right)\left(Q_{b,h,w} - \bar{Q}_b\right),$$

    • in which $Q_{b,h,w}$ represents the $h$th-row and $w$th-column element value of the $b$th channel of the combined feature Q, and $\bar{Q}_b$ represents an average value of the features of the $b$th channel of the combined feature Q;

    • (3) calculating a degree of normalized explicit correlation $\hat{P} \in \mathbb{R}^{512 \times 512}$ of features between the different channels of the combined feature Q, a degree of normalized explicit correlation $\hat{P}_{a,b}$ of features between the $a$th channel and the $b$th channel being calculated according to a formula:

$$\hat{P}_{a,b} = \frac{e^{P_{a,b}}}{\sum_{b=1}^{512} e^{P_{a,b}}};$$

and

    • (4) calculating an explicit linkage feature $Z_\alpha \in \mathbb{R}^{H \times W \times 512}$ of the projection combination, a feature $Z_\alpha^a$ of the $a$th channel of $Z_\alpha$ being calculated according to a formula:

$$Z_\alpha^a = \sum_{b=1}^{512} \hat{P}_{a,b}\, Q_b,$$

    • in which $Q_b$ represents a feature of the $b$th channel of the combined feature Q. A code sketch of these four steps is given below.
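The four steps above amount to a channel-correlation attention over the combined feature. The following is a minimal sketch, assuming a channel-first tensor layout (C, H, W) rather than the H×W×C layout used in the text (C = 512 for the 1-2 projection combination); the function name is illustrative.

```python
import torch

def explicit_linkage_feature(q: torch.Tensor) -> torch.Tensor:
    """Explicit linkage feature Z_alpha for a combined feature q of shape (C, H, W)."""
    c, h, w = q.shape
    # (1) average value of the features of each channel
    q_mean = q.mean(dim=(1, 2))                          # shape (C,)
    # (2) explicit correlation between channels, averaged over the H x W locations
    centred = q - q_mean.view(c, 1, 1)
    flat = centred.reshape(c, h * w)
    p = flat @ flat.t() / (h * w)                        # shape (C, C), entry P[a, b]
    # (3) normalized explicit correlation: softmax over the second channel index
    p_hat = torch.softmax(p, dim=1)
    # (4) each channel of Z_alpha is a weighted sum of all channels of Q
    z_alpha = torch.einsum("ab,bhw->ahw", p_hat, q)      # shape (C, H, W)
    return z_alpha
```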





After that, an implicit linkage feature of each projection combination is calculated by the following steps:

    • (1) calculating an average value of each channel of the combined feature Q, and connecting the average values of all the channels into a vector $\bar{Q} = (\bar{Q}_1, \bar{Q}_2, \ldots, \bar{Q}_{512})$;
    • (2) inputting the vector $\bar{Q}$ into the fully connected layer 1, the fully connected layer 1 having 512 neurons, an output of the fully connected layer 1 being $S_1 = \phi_{\mathrm{sigmoid}}(W_1 \cdot \bar{Q} + \theta_1) \in \mathbb{R}^{512 \times 1}$, in which $\phi_{\mathrm{sigmoid}}$ represents an activation function sigmoid, $W_1 \in \mathbb{R}^{512 \times 512}$ represents a weight of the fully connected layer 1, and $\theta_1 \in \mathbb{R}^{512 \times 1}$ represents a bias vector of the fully connected layer 1; and
    • (3) calculating an implicit linkage feature $Z_\beta \in \mathbb{R}^{H \times W \times 512}$ of the projection combination, a feature $Z_\beta^a$ of an $a$th channel of $Z_\beta$ being calculated according to a formula:

$$Z_\beta^a = S_1^a \cdot Q_a,$$

    • in which $S_1^a$ represents a value of the $a$th element of the output $S_1$ of the fully connected layer 1. A code sketch of the implicit linkage feature is given below.
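The implicit linkage feature is, in effect, a squeeze-and-excitation-style channel re-weighting. A minimal sketch, again in channel-first layout and with illustrative names, follows; the linear layer plays the role of the fully connected layer 1.

```python
import torch
import torch.nn as nn

class ImplicitLinkage(nn.Module):
    """Implicit linkage feature Z_beta for a combined feature with `channels` channels."""

    def __init__(self, channels: int = 512):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels)     # fully connected layer 1

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: combined feature of shape (C, H, W)
        q_mean = q.mean(dim=(1, 2))                  # vector of channel averages
        s1 = torch.sigmoid(self.fc1(q_mean))         # S1 = sigmoid(W1 . Q_mean + theta1)
        return s1.view(-1, 1, 1) * q                 # Z_beta[a] = S1[a] * Q[a]
```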





Finally, a linkage feature Z of the 1-2 projection combination is calculated according to a formula:







$$Z = Z_\alpha \oplus Z_\beta,$$

    • in which $\oplus$ represents addition of elements in corresponding locations of the matrices $Z_\alpha$ and $Z_\beta$.





There are 11 projection combinations in total, and hence 11 linkage features may be obtained by the calculation method described above.
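Putting the pieces together, the 11 linkage features could be assembled as in the sketch below, which reuses the explicit_linkage_feature function and the ImplicitLinkage module sketched above; the 256-channel view features and the 28×28 spatial size are placeholder assumptions.

```python
import torch
from itertools import combinations

# Placeholder view features Qf, Ql, Qr, Qt (256 channels each, arbitrary 28x28 spatial size).
features = {view: torch.randn(256, 28, 28) for view in (1, 2, 3, 4)}

linkage_features = []
for combo in [c for r in (2, 3, 4) for c in combinations((1, 2, 3, 4), r)]:   # 11 combinations
    q = torch.cat([features[v] for v in combo], dim=0)       # combined feature (channel concat)
    z_alpha = explicit_linkage_feature(q)                    # explicit linkage (sketched above)
    z_beta = ImplicitLinkage(channels=q.shape[0])(q)         # implicit linkage (sketched above)
    linkage_features.append(z_alpha + z_beta)                # Z = Z_alpha (+) Z_beta
```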


5) The linkage features obtained for the 11 projection combinations are connected by channel and inputted into the average pooling layer. The output Γ of the average pooling layer is inputted into a fully connected layer 2. The fully connected layer 2 has 1,024 neurons. The output S2 of the fully connected layer 2 is calculated as follows:








$$S_2 = \phi_{\mathrm{relu}}\left(W_2 \cdot \Gamma + \theta_2\right),$$

    • in which $\phi_{\mathrm{relu}}$ is an activation function relu, $W_2$ is a weight of the fully connected layer 2, and $\theta_2$ is a bias vector of the fully connected layer 2.





The output S2 of the fully connected layer 2 is inputted into a fully connected layer 3 with an activation function softmax. The fully connected layer 3 has 8 neurons (one per behavior category). The output S3 of the fully connected layer 3 is calculated as follows:








$$S_3 = \phi_{\mathrm{softmax}}\left(W_3 \cdot S_2 + \theta_3\right),$$

    • in which $\phi_{\mathrm{softmax}}$ represents the activation function softmax, $W_3$ is a weight of the fully connected layer 3, and $\theta_3$ is a bias vector of the fully connected layer 3. A code sketch of this classification head (average pooling and the fully connected layers 2 and 3) is given below.
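The sketch below assumes that the average pooling is a global spatial average producing the vector Γ (the pooling size is not stated above) and leaves the total channel count of the concatenated linkage features as a parameter; names are illustrative.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch: average pooling followed by fully connected layers 2 and 3."""

    def __init__(self, in_channels: int, num_classes: int = 8):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)       # global average pooling -> Gamma
        self.fc2 = nn.Linear(in_channels, 1024)       # fully connected layer 2, 1,024 neurons
        self.fc3 = nn.Linear(1024, num_classes)       # fully connected layer 3, 8 neurons

    def forward(self, z_concat: torch.Tensor) -> torch.Tensor:
        # z_concat: (batch, in_channels, H, W), the 11 linkage features connected by channel
        gamma = self.avg_pool(z_concat).flatten(1)    # Gamma
        s2 = torch.relu(self.fc2(gamma))              # S2 = relu(W2 . Gamma + theta2)
        s3 = torch.softmax(self.fc3(s2), dim=1)       # S3 = softmax(W3 . S2 + theta3)
        return s3
```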





6) A depth video linkage feature-based behavior recognition network is constructed. An input of the network is the depth video of the behavior sample, and an output thereof is the probability that the behavior sample belongs to each behavior category, i.e., the output of the fully connected layer 3 (the output S3 described above, written as Q3 in the loss function below). A loss function L of the network is:







$$L = -\sum_{g=1}^{2400}\sum_{p=1}^{8} [l_g]_p \log\!\left([Q_3^g]_p\right),$$

in which $Q_3^g$ is the network output of a $g$th behavior sample, $l_g$ is an expected output of the $g$th behavior sample, and the $p$th-dimension data of $l_g$ is defined as:








$$[l_g]_p = \begin{cases} 1, & \text{if } p = l_g \\ 0, & \text{else} \end{cases}$$

    • in which $l_g$ is a tag value of the $g$th behavior sample. A code sketch of this loss function is given below.
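The loss L is the standard categorical cross-entropy with one-hot expected outputs. The following is a minimal sketch, assuming 0-based integer tag values and probabilities already produced by the softmax layer; in practice the softmax and the logarithm are often fused (for example with torch.nn.CrossEntropyLoss applied to the pre-softmax outputs) for numerical stability.

```python
import torch
import torch.nn.functional as F

def behavior_recognition_loss(q3: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sketch of the loss L.

    q3:     predicted probabilities, shape (num_samples, 8), e.g. the softmax outputs Q3.
    labels: integer tag values l_g in [0, 7], shape (num_samples,).
    """
    one_hot = F.one_hot(labels, num_classes=8).float()     # the one-hot expected outputs [l_g]_p
    eps = 1e-12                                            # avoids log(0)
    return -(one_hot * torch.log(q3 + eps)).sum()
```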





7) A depth video of each training behavior sample is inputted into the depth video linkage feature-based behavior recognition network, and the network is trained till convergence.


8) A depth video of each tested behavior sample is inputted into the trained depth video linkage feature-based behavior recognition network to obtain a predicted probability value of a current tested behavior video sample belonging to the respective behavior category. The behavior category with the largest probability value is taken as the finally predicted behavior category to which the current tested behavior video sample belongs, so as to implement the behavior recognition.
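Step 8 reduces to a forward pass followed by an argmax over the predicted probabilities. The sketch below assumes a hypothetical `network` object wrapping the whole trained pipeline (projection, dynamic images, feature extraction, linkage features and the classification head) and a preprocessed `depth_video` input; both names are placeholders.

```python
import torch

@torch.no_grad()
def predict_behavior(network, depth_video):
    """Sketch of step 8: predict the behavior category of one tested depth video."""
    network.eval()
    probabilities = network(depth_video)            # per-category probabilities (output of FC layer 3)
    return int(probabilities.argmax(dim=1).item())  # category with the largest predicted probability
```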


The activation function relu has a formula f(x)=max(0, x). An input of the function is x, and an output thereof is the larger one of x and 0.


The activation function softmax has a formula








$$S_i = \frac{e^{i}}{\sum_{j=1}^{n} e^{j}},$$

in which $i$ represents an output of an $i$th neuron in the fully connected layer, $j$ represents an output of a $j$th neuron in the fully connected layer, $n$ represents the quantity of neurons in the fully connected layer, and $S_i$ represents an output of the $i$th neuron in the fully connected layer according to the activation function softmax.


The activation function sigmoid has a formula







$$f(x) = \frac{1}{1 + e^{-x}}.$$

An input of the function is $x$, and an output thereof is $\frac{1}{1 + e^{-x}}$, in which $x$ represents the input of the activation function sigmoid and $f(x)$ represents the output of the activation function sigmoid.


It should be noted that, in this context, relational terms such as “first” and “second” are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or order between these entities or operations. The term “including”, “include” or any other variants thereof is intended to cover a non-exclusive inclusion, such that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not specifically listed, or further includes elements that are inherent to such a process, method, item or device.


Although the embodiments of the present invention have been shown and described, it should be understood by those of ordinary skill in the art that various changes, modifications, substitutions and variations of these embodiments may be made without departing from the principle and spirit of the present invention. The scope of the present invention is defined by the appended claims and equivalents thereof.

Claims
  • 1. A depth video linkage feature-based behavior recognition method, comprising the following steps:
    1) projecting a depth video of each behavior sample onto a front side, a right side, a left side and a top side to obtain corresponding projection sequences;
    2) obtaining a dynamic image of each behavior sample by calculating a dynamic image of each projection sequence;
    3) inputting the dynamic image of each behavior sample into a respective feature extraction module and extracting features;
    4) inputting the extracted features into a multi-projection linkage feature extraction module and extracting a linkage feature of each projection combination;
    5) connecting the extracted linkage features of all the projection combinations by channel, and inputting the connected features into an average pooling layer and a fully connected layer;
    6) constructing a depth video linkage feature-based behavior recognition network;
    7) inputting a depth video of each training behavior sample into the depth video linkage feature-based behavior recognition network, and training the network till convergence; and
    8) inputting a depth video of each behavior sample to be tested into the trained behavior recognition network to implement behavior recognition.
  • 2. The depth video linkage feature-based behavior recognition method according to claim 1, wherein the projection sequence is obtained in step 1) as follows: acquiring a depth video of any behavior sample, each behavior sample consisting of all frames in the depth video of the behavior sample,
  • 3. The depth video linkage feature-based behavior recognition method according to claim 1, wherein the dynamic image is calculated in step 2) as follows: by taking a front-side projection sequence Vfront={Ft∈[1, N]} of the depth video V of the behavior sample as an example, vectorizing Ft first, i.e., connecting a row vector of Ft into a new row vector it;solving an arithmetic square root of each element in the row vector it to obtain a new vector wt, i.e.:
  • 4. The depth video linkage feature-based behavior recognition method according to claim 1, wherein the feature extraction module comprises a convolution unit 1, a convolution unit 2, a convolution unit 3, a convolution unit 4, a convolution unit 5 and a multi-feature fusion unit; wherein outputs of the convolution unit 1, the convolution unit 2, the convolution unit 3, the convolution unit 4 and the convolution unit 5 are sequentially inputted into the multi-feature fusion unit, and a final output of the multi-feature fusion unit is M6; the convolution unit 1 comprises two convolution layers and one maximum pooling layer, each convolution layer has 64 convolution kernels, each convolution kernel has a size of 3×3, a pooling kernel of the maximum pooling layer has a size of 2×2, and an output of the convolution unit 1 is C1;the convolution unit 2 comprises two convolution layers and one maximum pooling layer, each convolution layer has 128 convolution kernels, each convolution kernel has a size of 3×3, a pooling kernel of the maximum pooling layer has a size of 2×2, and an input of the convolution unit 2 is C1 and an output thereof is C2;the convolution unit 3 comprises three convolution layers and one maximum pooling layer, each convolution layer has 256 convolution kernels, each convolution kernel has a size of 3×3, a pooling kernel of the maximum pooling layer has a size of 2×2, and an input of the convolution unit 3 is C2 and an output thereof is C3;the convolution unit 4 comprises three convolution layers and one maximum pooling layer, each convolution layer has 512 convolution kernels, each convolution kernel has a size of 3×3, a pooling kernel of the maximum pooling layer has a size of 2×2, and an input of the convolution unit 4 is C3 and an output thereof is C4;the convolution unit 5 comprises three convolution layers and one maximum pooling layer, each convolution layer has 512 convolution kernels, each convolution kernel has a size of 3×3, a pooling kernel of the maximum pooling layer has a size of 2×2, and an input of the convolution unit 5 is C4 and an output thereof is C5;inputs of the multi-feature fusion unit are the output C1 of the convolution unit 1, the output C2 of the convolution unit 2, the output C3 of the convolution unit 3, the output C4 of the convolution unit 4 and the output C5 of the convolution unit 5; the output C1 of the convolution unit 1 is inputted into a maximum pooling layer 1 and a convolution layer 1 in the multi-feature fusion unit, a pooling kernel of the maximum pooling layer 1 has a size of 4×4, the convolution layer 1 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 1 is M1;the output C2 of the convolution unit 2 is inputted into a maximum pooling layer 2 and a convolution layer 2 in the multi-feature fusion unit, a pooling kernel of the maximum pooling layer 2 has a size of 2×2, the convolution layer 2 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 2 is M2;the output C3 of the convolution unit 3 is inputted into a convolution layer 3 in the multi-feature fusion unit, the convolution layer 3 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 3 is M3;the output C4 of the convolution unit 4 is inputted into an up-sampling layer 1 and a convolution layer 4 in the multi-feature fusion unit, the convolution layer 4 has 512 convolution kernels, each convolution 
kernel has a size of 1×1, and an output of the convolution layer 4 is M4;the output C5 of the convolution unit 5 is inputted into an up-sampling layer 2 and a convolution layer 5 in the multi-feature fusion unit, the convolution layer 5 has 512 convolution kernels, each convolution kernel has a size of 1×1, and an output of the convolution layer 5 is M5; the output M1 of the convolution layer 1, the output M2 of the convolution layer 2, the output M3 of the convolution layer 3, the output M4 of the convolution layer 4 and the output M5 of the convolution layer 5 are connected by channel and inputted into a convolution layer 6; the convolution layer 6 has 256 convolution kernels, each convolution kernel has a size of 1×1, an output of the convolution layer 6 is M6, and an output of the multi-feature fusion unit is the output M6 of the convolution layer 6;dynamic images of the front-side projection sequence, the right-side projection sequence, the left-side projection sequence and the top-side projection sequence of the depth video V of the behavior sample are respectively inputted into respective feature extraction modules, namely, a front-side projection feature extraction module, a right-side projection feature extraction module, a left-side projection feature extraction module and a top-side projection feature extraction module, and during network training, the modules described above do not share parameters, and the feature extraction modules described above respectively output features Qƒ, Qr, Ql and Qt;Qƒ represents a feature that is extracted when the dynamic image of the front-side projection sequence of the depth video V of the behavior sample is inputted into the front-side projection feature extraction module, Qr represents a feature that is extracted when the dynamic image of the right-side projection sequence of the depth video V of the behavior sample is inputted into the right-side projection feature extraction module, Ql represents a feature that is extracted when the dynamic image of the left-side projection sequence of the depth video V of the behavior sample is inputted into the left-side projection feature extraction module, and Qt represents a feature that is extracted when the dynamic image of the top-side projection sequence of the depth video V of the behavior sample is inputted into the top-side projection feature extraction module.
  • 5. The depth video linkage feature-based behavior recognition method according to claim 1, wherein the linkage feature is extracted in step 4) by combining every two, every three and every four of the features extracted by each feature extraction module in step 3) to obtain multiple projection combinations; a linkage feature of each projection combination is calculated as follows:connecting the features in the projection combination by channel to obtain a combined feature Q∈H×W×γJ, in which H and W represent a height and a width of each feature in the projection combination respectively, J represents the number of channels of each feature in the projection combination, and γ represents the number of features in the projection combination; calculating an explicit linkage feature Zα of each projection combination and an implicit linkage feature Zβ of each projection combination; and calculating a linkage feature Z of the projection combination according to a formula:
  • 6. The depth video linkage feature-based behavior recognition method according to claim 1, wherein in step 5), the linkage features of all the projection combinations are connected by channel, and inputted into the average pooling layer, an output Γ of the average pooling layer is inputted into the fully connected layer 2, the quantity of neurons in the fully connected layer 2 is D2, and an output S2 of the fully connected layer 2 is calculated as follows:
  • 7. The depth video linkage feature-based behavior recognition method according to claim 1, wherein an input of the depth video linkage feature-based behavior recognition network in step 6) is the depth video of the behavior sample, an output thereof is a probability that a corresponding behavior sample belongs to the respective behavior category, i.e., the output Q3 of the fully connected layer 3, and a loss function L of the network is:
  • 8. The depth video linkage feature-based behavior recognition method according to claim 1, wherein the behavior recognition in step 8) comprises: inputting a depth video of each tested behavior sample into the trained depth video linkage feature-based behavior recognition network to obtain a predicted probability value of a current tested behavior video sample belonging to the respective behavior category, and taking the behavior category with the largest probability value as the finally predicted behavior category to which the current tested behavior video sample belongs, so as to implement the behavior recognition.
  • 9. The depth video linkage feature-based behavior recognition method according to claim 5, wherein the explicit linkage feature of each projection combination is calculated by the following steps: 1) calculating an average value of features of each channel and an average value Qa of features of an ath channel of the combined feature Q according to a formula:
  • 10. The depth video linkage feature-based behavior recognition method according to claim 5, wherein the implicit linkage feature of each projection combination is calculated by the following steps: 1) calculating an average value of each channel of the combined feature Q, and connecting the average values of all the channels into a vector Q=(Q1,Q2, . . . , QγJ);2) inputting the vector Q into the fully connected layer 1, the number of neurons of the fully connected layer 1 being γJ, an output of the fully connected layer 1 being S1=ϕsigmoid(W1·Q+θ1)∈, in which ϕsigmoid represents an activation function sigmoid, W1∈γJ×γJ represents a weight of the fully connected layer 1, and θ1∈represents a bias vector of the fully connected layer 1; and3) calculating an implicit linkage feature Zβ∈H×W×γJ of the projection combination, a feature Zβa of an ath channel of Zβ being calculated according to a formula:
Priority Claims (1)
    Number: 202110968288.1; Date: Aug 2021; Country: CN; Kind: national

PCT Information
    Filing Document: PCT/CN2022/098508; Filing Date: 6/14/2022; Country/Kind: WO