The present invention relates to an action recognition method, and belongs to the field of action recognition technologies.
Action recognition, as an important subject in the field of computer vision, is widely used in video surveillance, behavior analysis, human-computer interaction and other fields. Although skeleton-based action recognition methods have received increasing attention with the development of inexpensive depth cameras, these methods are limited by the accuracy of the depth cameras: when occlusion occurs or an action is relatively complex, the predicted position of a skeletal joint is often incorrect. Compared with depth cameras, RGB devices are more mature and more reliable. Therefore, many scholars study action recognition based on RGB videos.
Most existing methods implement action recognition by extracting image-level features from the frames of a video, but they do not focus on extracting the motion features of the actions in the video. For video analysis, however, acquiring dynamic information is very important, and motion features are important clues for distinguishing different actions.
Therefore, an action recognition method is provided to address the problems of the above action recognition algorithms.
The present invention aims to solve the problems in the prior art, and the technical solution is as follows.
An action recognition method includes the following steps:
Preferably, each action video sample is composed of all frames in the action video sample, and any action video sample A is expressed as
A = {I_t | t ∈ [1, T]}
where u* = arg min_u E(u) represents a u that allows a value of E(u) to be minimum, λ is a constant, ‖u‖² represents a sum of squares of all elements in the vector u, and B_i and B_j represent the scores of the ith and jth frame images of the video segment A_n, respectively
Further, in step 2, the feature extractor consists of a series of convolution layers and pooling layers; the dynamic image of each video segment in each action video sample is input into the feature extractor, and a feature map output by the feature extractor is FM ∈ ℝ^(K1×K2×D), where K1, K2 and D are the height, width and number of channels of the feature map, respectively
Further, inputting the acquired motion feature map and static feature map into the motion feature enhancer and extracting the motion feature vector of the dynamic image in step 3 particularly include:
Further, in step 4, the feature center group contains N_k feature centers in total, each feature center corresponds to a scaling coefficient, and initial values of each feature center and a scaling coefficient thereof are calculated by the following method:
Further, in step 5, acquiring the complete histogram expression of the action video sample particularly includes:
Further, in step 6, the complete histogram expression of the action video sample is input into a multilayer perceptron to form a motion feature quantization network, and the motion feature quantization network includes the feature extractor, the motion feature enhancer, the feature soft quantizer, the histogram connecting layer and the multilayer perceptron;
Further, in step 8, the dynamic image and the static image of each video segment in the training action video sample are input into the feature extractor in the trained motion feature quantization network to acquire a motion feature map and a static feature map; the motion feature map and the static feature map are input into the motion feature enhancer in the trained motion feature quantization network to acquire an enhanced motion feature map FM′ of the corresponding video segment of the training action video sample; the enhanced motion feature map FM′ contains a motion feature vector xy ∈ ℝ^D, and y = 1, 2, . . . , K1×K2; and the motion feature vector is input into the feature soft quantizer in the trained motion feature quantization network to acquire a corresponding histogram expression
Further, inputting the histogram expression into the salient motion feature extractor to acquire the salient motion feature map in step 9 particularly includes the following steps:
Further, in step 10, the action classifier is composed of the feature extractor, the motion feature enhancer and the feature soft quantizer in the trained motion feature quantization network as well as the salient motion feature extractor and the convolutional neural network;
Further, implementing the action recognition in step 12 particularly includes: segmenting, using a window with a length of l1, a test action video sample by a step length of l2, calculating a dynamic image and a static image of each video segment, then, inputting the dynamic image and the static image into the trained action classifier to acquire a predicted probability value representing that the current test action video sample belongs to each action category, adding the output probability values of all the video segments, and using an action category with a greatest probability value as a finally predicted action category to which the current test action video sample belongs.
The motion feature quantization network provided by the present invention can extract pure motion features from motion videos, ignore static information such as the background and objects, and use only the motion features for action recognition, so that the learned motion features are more discriminative for action recognition.
The technical solutions in embodiments of the present invention will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only part but not all of the embodiments of the present invention. All other embodiments obtained by those skilled in the art without creative efforts based on the embodiments in the present invention are within the protection scope of the present invention.
As shown in
1. The total number of samples in an action video sample set is 2,000, and there are 10 action categories, each of which has 200 action video samples. Three-fourths of the samples in each action category are randomly selected for a training set, and the remaining one-fourth of the samples are used as a test set, so that 1,500 training action video samples and 500 test action video samples are acquired. Each action video sample is composed of all frames in this action video sample. The first action video sample A is taken as an example:
A = {I_t | t ∈ [1, 40]},
Dynamic images of the five video segments A1, A2, A3, A4 and A5 of the action video sample A are calculated respectively. The video segment A2 = {I_t | t ∈ [7, 16]} = {I′_t1 | t1 ∈ [1, 10]} is taken as an example, where I′_t1 denotes the t1-th frame image of the video segment A2.
Each frame image I′_t1 of the video segment is flattened into a row vector i_t1. An arithmetic square root of each element in the row vector i_t1 is calculated to acquire a vector w_t1:
w_t1 = sqrt(i_t1) (element-wise).
A feature vector v_t1 is calculated as the average of the vectors w_1, w_2, . . . , w_t1:
v_t1 = (1/t1) Σ_{τ=1..t1} w_τ.
A score B_t1 of the t1-th frame image of the video segment is calculated as the inner product of a vector u and the feature vector v_t1:
B_t1 = ⟨u, v_t1⟩.
The value of u is calculated so that the later a frame image is ranked in the video segment, the higher its score is; that is, the greater t1 is, the higher the score B_t1 is. The vector u is solved by using a RankSVM, and the optimal vector u* = arg min_u E(u)
represents a u that allows a value of E(u) to be minimum, where λ is a constant, ‖u‖² represents a sum of squares of all elements in the vector u, and B_i and B_j represent the scores of the ith and jth frame images of the video segment A2, respectively.
The vector u calculated by the RankSVM is arranged into an image form with the same size as I′_t1, namely 240×320×3, and the resulting image u² is called the dynamic image of the second video segment A2 of the action video sample A.
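For illustration only, the following Python sketch follows the steps above to approximate the dynamic image of one video segment. It is not the patented implementation: the RankSVM solver is replaced by plain gradient descent on a pairwise hinge ranking objective with L2 regularization, and the regularization weight, learning rate and iteration count are arbitrary assumptions.

```python
import numpy as np

def dynamic_image(frames, lam=1e-3, lr=1e-3, iters=200):
    """Approximate the dynamic image of one video segment (illustrative sketch).

    frames: list of H x W x 3 frame images of the segment.
    Returns an H x W x 3 array u arranged like a frame (the dynamic image).
    """
    T = len(frames)
    shape = frames[0].shape
    # Flatten each frame into a row vector and take the element-wise square root.
    w = np.stack([np.sqrt(f.reshape(-1).astype(np.float64)) for f in frames])
    # Feature vector v_t1: running mean of w_1 .. w_t1.
    v = np.cumsum(w, axis=0) / np.arange(1, T + 1)[:, None]

    u = np.zeros(v.shape[1])
    for _ in range(iters):
        scores = v @ u                                   # B_t1 = <u, v_t1>
        grad = lam * u                                   # gradient of the L2 term
        for i in range(T):                               # later frames should score higher
            for j in range(i):
                if 1.0 - scores[i] + scores[j] > 0.0:    # hinge constraint violated
                    grad += v[j] - v[i]
        u -= lr * grad
    return u.reshape(shape)                              # arrange u into image form (e.g. 240x320x3)
```

For the video segment A2 above, frames would hold the ten images I′_1 to I′_10, and the returned array corresponds to the dynamic image u².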
2. Each dynamic image of the action video sample is input into a feature extractor to extract a motion feature in the dynamic image. The feature extractor consists of a series of convolution layers and pooling layers. As shown in
The residual convolution layer has 256 convolution kernels, each of which has a size of 1×1. An output of the residual convolution layer and an output of the third convolution layer are added together as an output of the fourth residual addition layer, which is also an output of the group convolution module 1. The group convolution modules 2 and 3 are similar to the group convolution module 1, as shown in
The feature map output by the feature extractor is FM ∈ ℝ^(30×40×256), wherein the height, width and number of channels of the feature map are 30, 40 and 256, respectively. The feature map FM is called a motion feature map.
3. For each video segment in each action video sample, an in-between frame of the video segment is extracted as a static image of the video segment of the action video sample. The static image is input into the feature extractor to acquire a feature map FS ∈ ℝ^(30×40×256), and FS is called a static feature map.
4. The motion feature map FM and the static feature map FS of each video segment of the action video sample are input into a motion feature enhancer. A motion feature vector of the dynamic image is extracted by the following particular steps.
The sum of pixel values of each channel in the motion feature map FM is calculated, wherein the sum μd of pixel values of the dth channel is calculated as follows:
The sum of pixel values of each channel in the static feature map FS is calculated, wherein the sum Sd of pixel values of the dth channel is calculated as follows:
A difference between the sum of the pixel values of each channel in the motion feature map FM and the sum of the pixel values of the corresponding channel in the static feature map FS is calculated, wherein a calculation equation of a difference βd between sums of the pixel values of the dth channels is:
βd=|Sd−μd|.
Since the motion feature map and the static feature map are outputs of the same feature extractor, the convolution kernels corresponding to the dth channels in the motion feature map and the static feature map are the same. If the difference βd is very small, it means that these convolution kernels mostly extract static features, such as background features; if the difference βd is relatively large, it means that these convolution kernels mostly extract motion features. Thus, the larger the difference βd is, the greater the weight given to the features extracted by the corresponding convolution kernels, so that the motion features are enhanced.
A weight rd of features extracted by the convolution kernels corresponding to the dth channels is calculated by the following equation:
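Since the weighting equation itself is not reproduced above, the following numpy sketch shows only one plausible reading of the motion feature enhancer: it assumes the weight rd is obtained by a softmax over the channel differences βd and applied channel-wise to the motion feature map. The function name and the softmax choice are assumptions.

```python
import numpy as np

def enhance_motion_features(FM, FS):
    """Channel-wise motion feature enhancement (illustrative sketch).

    FM, FS: motion / static feature maps of shape (H, W, D) produced by the
    same feature extractor. The exact weighting equation is not reproduced
    here; a softmax over the channel differences beta_d is assumed.
    """
    mu = FM.sum(axis=(0, 1))            # sum of pixel values per channel, mu_d
    s = FS.sum(axis=(0, 1))             # sum of pixel values per channel, S_d
    beta = np.abs(s - mu)               # beta_d = |S_d - mu_d|
    r = np.exp(beta - beta.max())       # assumed weight r_d: larger beta_d -> larger weight
    r = r / r.sum()
    return FM * r[None, None, :]        # enhanced motion feature map FM'
```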
5. A feature center group is constructed, which contains 64 feature centers in total. Each feature center corresponds to a scaling coefficient. The first feature center is taken as an example, and initial values of each feature center and a scaling coefficient thereof are calculated by the following method.
Motion feature vectors of dynamic images in video segments of all training action video samples are calculated, and all the motion feature vectors are clustered. The number of clusters is set to 64. Each cluster has a cluster center. The value of the cluster center of the first cluster is used as the initial value of the first feature center. The set of all feature vectors in the first cluster is recorded as E1, which contains 500 vectors:
E1={e1,e2, . . . ,e500}.
The Euclidean distance dq,τ between every two vectors eq and eτ in E1 is calculated:
According to the above method, the initial values of 64 feature centers and the initial values of the corresponding scaling coefficients can be acquired.
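A minimal sketch of this initialization is given below. It assumes k-means clustering from scikit-learn, and, because the scaling-coefficient equation is not reproduced above, it assumes σ_k is initialized as the mean Euclidean distance between the vectors of cluster k and its center; the function name is also an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_feature_centers(motion_vectors, n_centers=64):
    """Initialize feature centers and scaling coefficients (illustrative sketch).

    motion_vectors: (N, D) array of motion feature vectors gathered from the
    dynamic images of all training video segments.
    """
    km = KMeans(n_clusters=n_centers, n_init=10).fit(motion_vectors)
    centers = km.cluster_centers_                      # initial feature centers c_k
    sigmas = np.empty(n_centers)
    for k in range(n_centers):
        members = motion_vectors[km.labels_ == k]      # the set E_k of cluster k
        # Assumed rule: mean distance of cluster members to the cluster center.
        sigmas[k] = np.linalg.norm(members - centers[k], axis=1).mean()
    return centers, sigmas
```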
6. For a motion feature vector xy of a dynamic image, a distance from this motion feature vector to the kth feature center ck is calculated and used as the output of this motion feature vector at the kth feature center ck; it is calculated by the following equation:
Wk(xy)=exp(−∥xy−ck∥2/σk).
The output acquired by inputting the motion feature vector xy to the kth feature center is normalized:
7. All motion feature vectors of each dynamic image of the action video sample are respectively input to each feature center of the feature center group, and all outputs on each feature center of the feature center group are accumulated. The accumulated output h_k^n of the kth feature center for the dynamic image of the nth video segment is:
h_k^n = Σ_{y=1..1200} [ W_k(xy) / Σ_{k′=1..64} W_k′(xy) ],
that is, the normalized outputs of all motion feature vectors of the dynamic image at the kth feature center are summed.
The accumulated values of all the feature centers are connected together to acquire a histogram expression H^n of the dynamic image of the nth video segment:
H^n = (h_1^n, h_2^n, . . . , h_64^n).
For the dynamic image u² of the second video segment A2 of the action video sample A, the calculated histogram expression is H² = (h_1^2, h_2^2, . . . , h_64^2).
The feature center group and an accumulation layer that accumulates the outputs of the feature center group constitute the feature soft quantizer. The input of the feature soft quantizer is the motion feature vector of the dynamic image of each video segment in each action video sample, and an output of the feature soft quantizer is the histogram expression of the dynamic image of each video segment.
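A minimal numpy sketch of the feature soft quantizer described in steps 6 and 7 is given below, under the assumption that the outputs are normalized over the 64 feature centers before being accumulated; the function name is an assumption.

```python
import numpy as np

def soft_quantize(FM_enh, centers, sigmas):
    """Feature soft quantizer (illustrative sketch).

    FM_enh: enhanced motion feature map of shape (H, W, D); each pixel is a
    motion feature vector x_y.
    centers: (K, D) feature centers c_k; sigmas: (K,) scaling coefficients.
    Returns the histogram expression (h_1, ..., h_K) of the dynamic image.
    """
    X = FM_enh.reshape(-1, FM_enh.shape[-1])                     # (H*W, D) motion feature vectors
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)    # ||x_y - c_k||^2
    W = np.exp(-d2 / sigmas[None, :])                            # W_k(x_y)
    W = W / W.sum(axis=1, keepdims=True)                         # normalize over the K centers (assumed)
    return W.sum(axis=0)                                         # accumulate into the histogram h_k
```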
8. Each action video sample has a plurality of video segments. The histogram expression corresponding to the dynamic image of each video segment is acquired and input into the histogram connecting layer, and the histogram expressions are connected to acquire the complete histogram expression of the action video sample. The action video sample A is segmented into 5 video segments, and its complete histogram expression is:
H = (H^1, H^2, . . . , H^5) = (h_1^1, h_2^1, . . . , h_64^1, h_1^2, h_2^2, . . . , h_64^2, . . . , h_1^5, h_2^5, . . . , h_64^5).
9. The complete histogram expression of the action video sample is input into a multilayer perceptron to form a motion feature quantization network, as shown in
The multilayer perceptron includes an input layer, a hidden layer and an output layer. The input layer is connected with the output of the histogram connecting layer, and the output Input of the input layer is the same as the output H of the histogram connecting layer, namely, Input = H. The input layer has 320 neurons in total. The hidden layer has 128 neurons in total, which are fully connected with all output units of the input layer. The output layer of the multilayer perceptron has 10 neurons, each of which represents an action category. The weight matrix between the input layer and the hidden layer is expressed as W_1 ∈ ℝ^(320×128), and the weight matrix between the hidden layer and the output layer is expressed as W_2 ∈ ℝ^(128×10).
An output Q of the hidden layer is calculated as follows:
Q = ϕelu(W_1^T · Input + θ_Q),
where ϕelu is the ELU activation function and θ_Q ∈ ℝ^128 is a bias vector of the hidden layer.
An output O of the output layer of the multilayer perceptron is:
O = ϕsoftmax(W_2^T · Q + θ_O),
where ϕsoftmax is the softmax activation function, and θ_O ∈ ℝ^10 is a bias vector of the output layer.
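For concreteness, a minimal numpy sketch of this forward pass is shown below; the function name and the symbols W1, W2, theta_Q, theta_O follow the reconstruction above and are illustrative only.

```python
import numpy as np

def mlp_forward(H, W1, theta_Q, W2, theta_O):
    """Forward pass of the multilayer perceptron (illustrative sketch).

    H: complete histogram expression of one action video sample, shape (320,).
    W1: (320, 128) input-to-hidden weights; theta_Q: (128,) hidden-layer bias.
    W2: (128, 10) hidden-to-output weights; theta_O: (10,) output-layer bias.
    """
    z = H @ W1 + theta_Q
    Q = np.where(z > 0, z, np.exp(z) - 1.0)      # ELU activation of the hidden layer
    o = Q @ W2 + theta_O
    e = np.exp(o - o.max())                      # softmax over the 10 action categories
    return e / e.sum()
```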
A loss function L1 of the motion feature quantization network is:
10. The dynamic image and the static image of each video segment of the training action video sample are input into the feature extractor in the trained motion feature quantization network to acquire a motion feature map and a static feature map, respectively. The motion feature map and the static feature map are input into the motion feature enhancer in the trained motion feature quantization network to acquire an enhanced motion feature map of the corresponding video segment of the training action video sample. The enhanced motion feature map of the second video segment A2 of the video sample A is FM′, which contains the motion feature vector xy ∈ ℝ^256, y = 1, 2, . . . , 1200.
The motion feature vector is input into the feature soft quantizer in the trained motion feature quantization network to acquire a corresponding histogram expression
For the second segment A2 of the action video sample A, the acquired histogram expression is
11. The acquired histogram expression is input into the salient motion feature extractor, and five feature centers are selected according to the histogram expression.
For these five feature centers, the distance between the feature vector of each pixel in the enhanced motion feature map FM′ and each feature center is calculated. The distance between the feature vector xy and the feature center c2 is calculated by the following equation:
W2(xy)=exp(−∥xy−c2∥2/σ2).
By using the distance as a new pixel value of each pixel, each feature center can acquire an image which is called a salient motion feature image. Each pixel value of the image is the distance between the feature vector of the corresponding pixel and the feature center.
There are 5 feature centers in total, and 5 salient motion feature images can be acquired. The five acquired salient motion feature images are stacked together according to channels to acquire a salient motion feature map with 5 channels.
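A minimal sketch of this salient motion feature map construction is given below. Because the selection rule for the five feature centers is only implicit above, the sketch assumes they are the centers with the largest histogram values; the function name is also an assumption.

```python
import numpy as np

def salient_motion_feature_map(FM_enh, histogram, centers, sigmas, top=5):
    """Build the salient motion feature map (illustrative sketch).

    FM_enh: enhanced motion feature map (H, W, D); histogram: (K,) histogram
    expression; centers: (K, D) feature centers; sigmas: (K,) scaling
    coefficients. Each selected center yields one salient motion feature image
    whose pixel values are the kernel distances W_k(x_y); the images are
    stacked along the channel axis.
    """
    H, W, D = FM_enh.shape
    X = FM_enh.reshape(-1, D)                                    # (H*W, D)
    idx = np.argsort(histogram)[-top:]                           # assumed: centers with largest histogram values
    maps = []
    for k in idx:
        d2 = ((X - centers[k]) ** 2).sum(axis=1)                 # ||x_y - c_k||^2
        maps.append(np.exp(-d2 / sigmas[k]).reshape(H, W))       # salient motion feature image for center k
    return np.stack(maps, axis=-1)                               # (H, W, 5) salient motion feature map
```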
12. The salient motion feature map is input into the convolutional neural network to form an action classifier, as shown in
The convolution module 3 contains four group convolution modules. The first layer of the group convolution module 1 is a convolution layer, the second layer thereof is a group convolution layer, the third layer thereof is a convolution layer, and the fourth layer thereof is a residual addition layer. The first convolution layer has 256 convolution kernels, each of which has a size of 1×1. The second group convolution layer has 256 convolution kernels, each of which has a size of 3×3. In this group convolution layer, the input feature map with the size of W2×H2×256 is divided into 32 groups of feature maps according to channels, each of which has a size of W2×H2×8. The 256 convolution kernels are divided into 32 groups, each of which has 8 convolution kernels. Each group of the feature maps is convolved with each group of the convolution kernels, respectively. Finally, convolution results of all the groups are connected according to the channels to acquire an output of the group convolution layer. The third convolution layer has 512 convolution kernels, each of which has a size of 1×1. The fourth residual addition layer transfers the input of the first convolution layer into the residual convolution layer. The residual convolution layer has 512 convolution kernels, each of which has a size of 1×1. An output of the residual convolution layer and an output of the third convolution layer are added together as an output of the fourth residual addition layer, which is also an output of the group convolution module 1. The group convolution modules 2, 3 and 4 are similar to the group convolution module 1 only except that a fourth residual addition layer of each of the group convolution modules 2, 3 and 4 directly adds the input of the first convolution layer and the output of the third convolution layer, and there is no residual convolution layer.
The convolution module 4 contains six group convolution modules. The first layer of the group convolution module 1 is a convolution layer, the second layer thereof is a group convolution layer, the third layer thereof is a convolution layer, and the fourth layer thereof is a residual addition layer. The first convolution layer has 512 convolution kernels, each of which has a size of 1×1. The second group convolution layer has 512 convolution kernels, each of which has a size of 3×3. In this group convolution layer, the input feature map with the size of W3×H3×512 is divided into 32 groups of feature maps according to channels, each of which has a size of W3×H3×16. The 512 convolution kernels are divided into 32 groups, each of which has 16 convolution kernels. Each group of the feature maps is convolved with each group of the convolution kernels, respectively. Finally, convolution results of all the groups are connected according to the channels to acquire an output of the group convolution layer. The third convolution layer has 1024 convolution kernels, each of which has a size of 1×1. The fourth residual addition layer transfers the input of the first convolution layer into the residual convolution layer. The residual convolution layer has 1024 convolution kernels, each of which has a size of 1×1. An output of the residual convolution layer and an output of the third convolution layer are added together as an output of the fourth residual addition layer, which is also an output of the group convolution module 1. The group convolution modules 2 to 6 are similar to the group convolution module 1 only except that a fourth residual addition layer of each of the group convolution modules 2 to 6 directly adds the input of the first convolution layer and the output of the third convolution layer, and there is no residual convolution layer.
The convolution module 5 contains three group convolution modules. The first layer of the group convolution module 1 is a convolution layer, the second layer thereof is a group convolution layer, the third layer thereof is a convolution layer, and the fourth layer thereof is a residual addition layer. The first convolution layer has 1024 convolution kernels, each of which has a size of 1×1. The second group convolution layer has 1024 convolution kernels, each of which has a size of 3×3. In this group convolution layer, the input feature map with the size of W4×H4×1024 is divided into 32 groups of feature maps according to channels, each of which has a size of W4×H4×32. The 1024 convolution kernels are divided into 32 groups, each of which has 32 convolution kernels. Each group of the feature maps is convolved with each group of the convolution kernels, respectively. Finally, convolution results of all the groups are connected according to the channels to acquire an output of the group convolution layer. The third convolution layer has 2048 convolution kernels, each of which has a size of 1×1. The fourth residual addition layer transfers the input of the first convolution layer into the residual convolution layer. The residual convolution layer has 2048 convolution kernels, each of which has a size of 1×1. An output of the residual convolution layer and an output of the third convolution layer are added together as an output of the fourth residual addition layer, which is also an output of the group convolution module 1. The group convolution modules 2 and 3 are similar to the group convolution module 1 only except that a fourth residual addition layer of each of the group convolution modules 2 and 3 directly adds the input of the first convolution layer and the output of the third convolution layer, and there is no residual convolution layer.
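As an illustration of the group convolution modules described for convolution modules 3 to 5, the following PyTorch sketch parameterizes one module by its middle width. The class name, the ReLU activations and the padding are assumptions, since they are not specified above.

```python
import torch.nn as nn

class GroupConvModule(nn.Module):
    """One group convolution module as described above (illustrative sketch).

    mid: number of kernels in the first convolution and the group convolution
    (256 / 512 / 1024 for convolution modules 3 / 4 / 5); the third convolution
    has 2 * mid kernels. The first module of a stage uses a 1x1 residual
    convolution; later modules add the input directly (in_ch must then equal 2 * mid).
    """

    def __init__(self, in_ch, mid, groups=32, use_residual_conv=True):
        super().__init__()
        out = 2 * mid
        self.conv1 = nn.Conv2d(in_ch, mid, kernel_size=1)                         # first 1x1 convolution
        self.gconv = nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=groups)  # 32-group 3x3 convolution
        self.conv3 = nn.Conv2d(mid, out, kernel_size=1)                            # third 1x1 convolution
        self.res = nn.Conv2d(in_ch, out, kernel_size=1) if use_residual_conv else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.relu(self.gconv(y))        # each group of feature maps convolved with its group of kernels
        y = self.conv3(y)
        return self.relu(y + self.res(x))   # residual addition layer
```

Under these assumptions, the first block of convolution module 3 could be instantiated as GroupConvModule(in_ch, 256, use_residual_conv=True), and its remaining three blocks with use_residual_conv=False so that the input is added directly.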
The global average pooling layer calculates an average value of each channel of the feature map input in this layer as the output. An activation function used by the fully-connected layer is softmax.
The loss function L2 of the action classifier is:
The input of the action classifier is the dynamic image and the static image of each video segment of the action video sample, and the output thereof is a probability value representing that a current action video sample belongs to each action category. The output probability values of all the video segments are added, and an action category with the greatest probability value is used as a finally predicted action category to which the current action video sample belongs.
13. The action classifier is trained to converge. A window with a length of 10 is used to segment a test action video sample by a step length of 6. A dynamic image and a static image of each video segment are calculated, and then the dynamic image and the static image are input into the trained action classifier to acquire a predicted probability value representing that the current test action video sample belongs to each action category. The output probability values of all the video segments are added, and an action category with the greatest probability value is used as a finally predicted action category to which the current test action video sample belongs.
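A minimal sketch of this test procedure follows; dynamic_image refers to the earlier sketch, the middle frame of each segment is assumed as its static image, and classifier is a hypothetical callable wrapping the trained action classifier.

```python
import numpy as np

def predict_action(frames, classifier, win_len=10, stride=6, n_classes=10):
    """Sliding-window test-time prediction (illustrative sketch).

    frames: list of frame images of one test action video sample.
    classifier: assumed callable mapping a (dynamic_image, static_image) pair
    to a vector of n_classes predicted probabilities.
    """
    total = np.zeros(n_classes)
    for start in range(0, len(frames) - win_len + 1, stride):
        segment = frames[start:start + win_len]           # window of length 10, stride 6
        dyn = dynamic_image(segment)                      # see the earlier sketch
        stat = segment[len(segment) // 2]                 # assumed: middle frame as the static image
        total += classifier(dyn, stat)                    # per-segment class probabilities
    return int(np.argmax(total))                          # category with the greatest summed probability
```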
Although the present invention has been described in detail with reference to the foregoing embodiments, it is still possible for those skilled in the art to modify the technical solutions described in the foregoing embodiments, or equivalently replace some of the technical features therein. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principle of the present invention shall be embraced in the scope of protection of the present invention.