The present disclosure relates to a graph convolutional network (GCN) for human action recognition, and is particularly directed to a modified spatial-temporal GCN with a self-attention model.
Human action recognition has undergone active development in recent years, as it plays a significant role in video understanding. In general, human actions can be recognized from multiple modalities, such as appearance, depth, optical flows, and body skeletons. Among these modalities, dynamic human skeletons usually convey significant information that is complementary to the others. However, conventional approaches for modeling skeletons usually rely on hand-crafted parts or traversal rules, thus resulting in limited expressive power and difficulties in generalization and/or application.
There were many issues and problems associated with existing approaches for recognizing human actions by modeling skeletons, for example but not limited to, low recognition efficiency, slow recognition speed, and/or low recognition accuracy.
The present disclosure describes methods, devices, systems, and storage medium for recognizing a human action using an actional-structural self-attention graph convolutional network (GCN), which may overcome some of the challenges and drawbacks discussed above, improving overall performance and increasing recognition speed without sacrificing recognition accuracy.
Embodiments of the present disclosure include methods, devices, and computer readable medium for an actional-structural self-attention graph convolutional network (GCN) system for recognizing one or more action.
The present disclosure describes a method for recognizing a human action using a graph convolutional network (GCN). The method includes obtaining, by a device, a plurality of joint poses. The device includes a memory storing instructions and a processor in communication with the memory. The method also includes normalizing, by the device, the plurality of joint poses to obtain a plurality of normalized joint poses; extracting, by the device, a plurality of rough features using a modified spatial-temporal GCN (ST-GCN) from the plurality of normalized joint poses; reducing, by the device, a feature dimension of the plurality of rough features to obtain a plurality of dimension-shrunk features; refining, by the device, the plurality of dimension-shrunk features based on a self-attention model to obtain a plurality of refined features; and recognizing, by the device, a human action based on the plurality of refined features.
The present disclosure describes a device for recognizing a human action using a graph convolutional network (GCN). The device includes a memory storing instructions; and a processor in communication with the memory. When the processor executes the instructions, the processor is configured to cause the device to obtain a plurality of joint poses; normalize the plurality of joint poses to obtain a plurality of normalized joint poses; extract a plurality of rough features using a modified spatial-temporal GCN (ST-GCN) from the plurality of normalized joint poses; reduce a feature dimension of the plurality of rough features to obtain a plurality of dimension-shrunk features; refine the plurality of dimension-shrunk features based on a self-attention model to obtain a plurality of refined features; and recognize a human action based on the plurality of refined features.
The present disclosure describes a non-transitory computer readable storage medium storing instructions. The instructions, when executed by a processor, cause the processor to perform obtaining a plurality of joint poses; normalizing the plurality of joint poses to obtain a plurality of normalized joint poses; extracting a plurality of rough features using a modified spatial-temporal GCN (ST-GCN) from the plurality of normalized joint poses; reducing a feature dimension of the plurality of rough features to obtain a plurality of dimension-shrunk features; refining the plurality of dimension-shrunk features based on a self-attention model to obtain a plurality of refined features; and recognizing a human action based on the plurality of refined features.
The above and other aspects and their implementations are described in greater detail in the drawings, the descriptions, and the claims.
The system and method described below may be better understood with reference to the following drawings and description of non-limiting and non-exhaustive embodiments. The components in the drawings are not necessarily to scale. Emphasis instead is placed upon illustrating the principles of the disclosure.
The method will now be described with reference to the accompanying drawings, which show, by way of illustration, specific exemplary embodiments. The method may, however, be embodied in a variety of different forms and, therefore, covered or claimed subject matter is intended to be construed as not being limited to any exemplary embodiments set forth. The method may be embodied as methods, devices, components, or systems. Accordingly, embodiments may, for example, take the form of hardware, software, firmware or any combination thereof.
Throughout the specification and claims, terms may have nuanced meanings suggested or implied in context beyond an explicitly stated meaning. Likewise, the phrase “in one embodiment” or “in some embodiments” as used herein does not necessarily refer to the same embodiment and the phrase “in another embodiment” or “in other embodiments” as used herein does not necessarily refer to a different embodiment. The phrase “in one implementation” or “in some implementations” as used herein does not necessarily refer to the same implementation and the phrase “in another implementation” or “in other implementations” as used herein does not necessarily refer to a different implementation. It is intended, for example, that claimed subject matter includes combinations of exemplary embodiments or implementations in whole or in part.
In general, terminology may be understood at least in part from usage in context. For example, terms, such as “and”, “or”, or “and/or,” as used herein may include a variety of meanings that may depend at least in part upon the context in which such terms are used. Typically, “or” if used to associate a list, such as A, B or C, is intended to mean A, B, and C, here used in the inclusive sense, as well as A, B or C, here used in the exclusive sense. In addition, the term “one or more” or “at least one” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a”, “an”, or “the”, again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” or “determined by” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
The present disclosure describes methods, devices, systems, and storage medium for recognizing one or more human action using a modified spatial-temporal graph convolutional network (GCN) with a self-attention model.
Dynamics of human body skeletons may convey significant information for recognizing various human actions. For example, there may be scenarios, for example but not limited to, modeling dynamics of human body skeletons based on one or more video clip, and recognizing various human activities based on the dynamics of human body skeletons. The human activities may include, but are not limited to, walking, standing, running, jumping, turning, skiing, playing tai-chi, and the like.
Recognizing various human activities from one or more video clip may play an important role in understanding the content of the one or more video clip, and/or in monitoring one or more subject's behavior in a certain environment. Recently, machine learning and/or artificial intelligence (AI) has been applied to recognizing human activities. A big challenge remains for a machine to understand such content accurately and efficiently on real-time high-definition (HD) video.
Neural networks are among the most popular machine learning algorithms and have achieved some success in accuracy and speed. Neural networks include various variants, for example but not limited to, convolutional neural networks (CNN), recurrent neural networks (RNN), auto-encoders, and deep learning architectures.
Dynamics of human body skeletons may be represented by a skeleton sequence or a plurality of joint poses, which may be represented by two-dimensional or three-dimensional coordinates of more than one human joint in more than one frame. Each frame may represent the coordinates of the joint poses at a different time point, for example, a sequential time point during the time lapse of a video clip. It is a challenge for a computer to extract the meaning from image frames in videos. For example, for a video clip of a gymnastics competition, judges may watch a gymnast competing in the competition for further evaluation and/or assessment; it is a challenge to have a computer achieve a comparable efficiency, accuracy, and reliability.
A model of dynamic skeletons, called a spatial-temporal graph convolutional network (ST-GCN), automatically learns both the spatial and temporal patterns from data. This formulation not only leads to greater expressive power but also to stronger generalization capability.
For a standard ST-GCN model, pose estimation may be performed on videos to construct a spatial-temporal graph on the skeleton sequences. Multiple layers of the spatial-temporal graph convolutional network (ST-GCN) generate higher-level feature maps on the graph, which may then be classified into the corresponding action category. The ST-GCN model may work on action recognition with high accuracy, but its speed may be limited to a relatively low frame rate even with a relatively powerful computer, for example, around 10 frames per second (FPS) with a computer equipped with a GTX-1080Ti graphics processing unit (GPU). This may hinder its real-time applications, which may require about or more than 25 FPS.
It may be desired to design a simplified ST-GCN which can reach a higher speed (for example, about or more than 25 FPS) without sacrificing the accuracy of action recognition. The present disclosure describes various embodiments for recognizing a human action using the simplified ST-GCN without sacrificing the accuracy of action recognition, addressing some of the issues discussed above. The various embodiments may include an actional-structural self-attention GCN for recognizing one or more action.
The electronic communication environment 100 may also include a portion or all of the following: one or more databases 120, one or more two-dimension image/video acquisition servers 130, one or more user devices (or terminals, 140, 170, and 180) associated with one or more users (142, 172, and 182), one or more application servers 150, one or more three-dimension image/video acquisition servers 160.
Any one of the above components may be in direct communication with each other via public or private communication networks (for example, a local network or the Internet), or may be in indirect communication with each other via a third party. For example but not limited to, the database 120 may communicate with the two-dimension image/video acquisition server 130 (or the three-dimension image/video acquisition server 160) without going through the actional-structural self-attention GCN system 110; for example, the acquired two-dimension video may be sent directly via 123 from the two-dimension image/video acquisition server 130 to the database 120, so that the database 120 may store the acquired two-dimension video.
In one implementation, referring to
The user devices/terminals (140, 170, and 180) may be any form of mobile or fixed electronic devices including but not limited to desktop personal computers, laptop computers, tablets, mobile phones, personal digital assistants, and the like. The user devices/terminals may be installed with a user interface for accessing the actional-structural self-attention GCN system.
The database may be hosted in a central database server, a plurality of distributed database servers, or in cloud-based database hosts. The database 120 may be configured to store image/video data of one or more subject performing certain actions, the intermediate data, and/or final results for implementing the actional-structural self-attention GCN system.
The communication interfaces 202 may include wireless transmitters and receivers (“transceivers”) 212 and any antennas 214 used by the transmitting and receiving circuitry of the transceivers 212. The transceivers 212 and antennas 214 may support Wi-Fi network communications, for instance, under any version of IEEE 802.11, e.g., 802.11n or 802.11ac. The transceivers 212 and antennas 214 may support mobile network communications, for example, 3G, 4G, and 5G communications. The communication interfaces 202 may also include wireline transceivers 216, for example, Ethernet communications.
The storage 209 may be used to store various initial, intermediate, or final data or models for implementing the actional-structural self-attention GCN system. These data may alternatively be stored in the database 120 of
The system circuitry 204 may include hardware, software, firmware, or other circuitry in any combination. The system circuitry 204 may be implemented, for example, with one or more systems on a chip (SoC), application specific integrated circuits (ASIC), microprocessors, discrete analog and digital circuits, and other circuitry.
For example, the system circuitry 204 may be implemented as 220 for the actional-structural self-attention GCN system 110 of
Likewise, the system circuitry 204 may be implemented as 240 for the user devices 140, 170, and 180 of
Referring to
The actional-structural self-attention GCN 300 may receive an input 302, and may generate an output 362. The input 302 may include video data, and the output 362 may include one or more action prediction based on the video data. The pose estimator 310 may receive the input 302 and perform pose estimation to obtain and output a plurality of joint poses 312. The pose normalizer 320 may receive the plurality of joint poses 312 and perform pose normalization to obtain and output a plurality of normalized joint poses 322. The feature extractor 330 may receive the plurality of normalized joint poses 322 and perform feature extraction to obtain and output a plurality of rough features 332. The feature dimension reducer 340 may receive the plurality of rough features 332 and perform feature dimension reduction to obtain and output a plurality of dimension-shrunk features 342. The feature refiner 350 may receive the plurality of dimension-shrunk features 342 and perform feature refinement to obtain and output a plurality of refined features 352. The classifier 360 may receive the plurality of refined features 352 and perform classification and prediction to obtain and output the output 362 including the one or more action prediction.
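For illustration and not limitation, the following is a minimal sketch, in Python with PyTorch, of one possible way to chain the six modules of the actional-structural self-attention GCN 300; the class interface, argument names, and module implementations are assumptions made for this sketch rather than requirements of the embodiments.

# Illustrative only: chaining the modules of the actional-structural
# self-attention GCN 300. Module implementations are assumed to be supplied.
import torch.nn as nn

class ActionalStructuralSelfAttentionGCN(nn.Module):
    def __init__(self, pose_estimator, pose_normalizer, feature_extractor,
                 feature_dim_reducer, feature_refiner, classifier):
        super().__init__()
        self.pose_estimator = pose_estimator            # 310: video -> joint poses 312
        self.pose_normalizer = pose_normalizer          # 320: -> normalized joint poses 322
        self.feature_extractor = feature_extractor      # 330: modified ST-GCN -> rough features 332
        self.feature_dim_reducer = feature_dim_reducer  # 340: -> dimension-shrunk features 342
        self.feature_refiner = feature_refiner          # 350: self-attention -> refined features 352
        self.classifier = classifier                    # 360: -> action prediction 362

    def forward(self, video):
        poses = self.pose_estimator(video)
        poses = self.pose_normalizer(poses)
        features = self.feature_extractor(poses)
        features = self.feature_dim_reducer(features)
        features = self.feature_refiner(features)
        return self.classifier(features)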
The present disclosure also describes embodiments of a method 400 in
Referring to the step 410, obtaining a plurality of joint poses may be performed by a pose estimator 310 in
In one implementation, the plurality of joint poses may be obtained from one or more motion-capture image sensor, for example but not limited to, a depth sensor, camera, video recorder, and the like. In some other implementations, the plurality of joint poses may be obtained from videos according to pose estimation algorithms. The output from the motion-capture devices or the videos may include a sequence of frames. Each frame may correspond to a particular time point in the sequence, and each frame may be used to generate joint coordinates, forming the plurality of joint poses.
In one implementation, the plurality of joint poses may include joint coordinates in a form of two-dimension coordinates, for example (x, y) where x is the coordinate along x-axis and y is the coordinate along y-axis. A confidence score for each joint may be added into the two-dimension coordinates, so that each joint may be represented with a tuple of (x, y, c) wherein c is the confidence score for this joint's coordinates.
In another implementation, the plurality of joint poses may include joint coordinates in a form of three-dimension coordinates, for example (x, y, z) where x is the coordinate along x-axis, y is the coordinate along y-axis, and z is the coordinate along z-axis. A confidence score for each joint may be added into the three-dimension coordinates, so that each joint may be represented with a tuple of (x, y, z, c) wherein c is the confidence score for this joint's coordinates.
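For illustration and not limitation, the following sketch shows one possible in-memory representation of the plurality of joint poses as NumPy arrays; the frame count, joint count, and sample values are hypothetical.

# Illustrative only: representing a plurality of joint poses as arrays.
import numpy as np

T, N = 75, 25                    # hypothetical: T frames, N joints per skeleton

# Two-dimension case: each joint is a tuple (x, y, c).
poses_2d = np.zeros((T, N, 3), dtype=np.float32)
poses_2d[0, 0] = [120.5, 88.2, 0.93]        # frame 0, joint 0: x, y, confidence

# Three-dimension case: each joint is a tuple (x, y, z, c).
poses_3d = np.zeros((T, N, 4), dtype=np.float32)
poses_3d[0, 0] = [0.12, 0.46, 1.80, 0.97]   # frame 0, joint 0: x, y, z, confidence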
Referring to step 420, normalizing the plurality of joint poses to obtain a plurality of normalized joint poses may be performed by a pose normalizer 320 in
Referring to
The step 420 may include fixed torso length normalization, wherein all pose coordinates may be normalized relative to the torso length. Optionally and alternatively, if a torso length for one subject is not detected for an image frame, the method may discard this subject and not analyze the pose coordinates for this subject for this image frame, for example, when at least one of Joint No. 1 and Joint No. 8 for this subject is not in the image frame or not visible due to being blocked by another subject or object.
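For illustration and not limitation, the following sketch shows one possible fixed torso length normalization for a single subject in a single image frame, assuming that Joint No. 1 and Joint No. 8 are the two torso endpoints and that each pose is an (N, 3) array of (x, y, confidence) rows; the confidence threshold is hypothetical.

# Illustrative only: fixed torso length normalization for one subject in one frame.
import numpy as np

TORSO_JOINTS = (1, 8)        # Joint No. 1 and Joint No. 8, assumed torso endpoints
CONF_THRESHOLD = 0.1         # hypothetical visibility threshold

def normalize_pose(pose):
    """pose: (N, 3) array of (x, y, confidence) rows. Returns the normalized pose,
    or None when the torso is not detected and the subject is discarded."""
    j1, j8 = pose[TORSO_JOINTS[0]], pose[TORSO_JOINTS[1]]
    if j1[2] < CONF_THRESHOLD or j8[2] < CONF_THRESHOLD:
        return None                                    # torso not detected in this frame
    torso_length = np.linalg.norm(j1[:2] - j8[:2])
    if torso_length == 0.0:
        return None
    normalized = pose.copy()
    normalized[:, :2] = pose[:, :2] / torso_length     # scale coordinates by torso length
    return normalized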
Referring to step 430, extracting a plurality of rough features using a modified spatial-temporal GCN (ST-GCN) from the plurality of normalized joint poses may be performed by a feature extractor 330. The feature extractor may include a modified spatial-temporal GCN (ST-GCN).
In one implementation referring to
Each ST-GCN block contains a spatial graph convolution followed by a temporal graph convolution, which alternately extract spatial and temporal features. The spatial graph convolution is a key component in the ST-GCN block; it computes a weighted average of neighboring features for each joint. The ST-GCN block may have a main advantage of extracting spatial features, but may have a disadvantage in that it uses only a weight matrix to measure inter-frame attention (correlation), which is relatively ineffective.
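For illustration and not limitation, the following sketch shows one possible ST-GCN block in PyTorch, with a spatial graph convolution (a weighted average over neighboring joints through a normalized adjacency matrix modulated by a learnable weight matrix) followed by a temporal convolution; the kernel sizes and the use of a single adjacency matrix are assumptions made for this sketch.

# Illustrative only: one ST-GCN block. Kernel sizes and the single adjacency
# matrix are assumptions for this sketch.
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, A, stride=1, temporal_kernel=9):
        super().__init__()
        self.register_buffer("A", A)                          # (V, V) normalized adjacency
        self.edge_weight = nn.Parameter(torch.ones_like(A))   # learnable weight matrix on edges
        self.spatial_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        pad = (temporal_kernel - 1) // 2
        self.temporal_conv = nn.Sequential(
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels,
                      kernel_size=(temporal_kernel, 1),
                      stride=(stride, 1), padding=(pad, 0)),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                                      # x: (batch, C, T frames, V joints)
        # Spatial graph convolution: weighted average over neighboring joints.
        x = torch.einsum("nctv,vw->nctw", x, self.A * self.edge_weight)
        x = self.spatial_conv(x)
        # Temporal graph convolution over consecutive frames of the same joint.
        return self.temporal_conv(x)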
The number of ST-GCN blocks in a feature extractor model may be, for example but not limited to, 3, 5, 7, 10, or 13. The more ST-GCN blocks the feature extractor includes, the greater the total number of parameters in the model, the greater the computational complexity, and the longer the computing time required to complete the calculation. An ST-GCN including 10 ST-GCN blocks may be slower than an ST-GCN including 7 ST-GCN blocks due to the larger number of total parameters. For example, a standard ST-GCN may include 10 ST-GCN blocks, and the parameters for each corresponding ST-GCN block may include 3×64(1), 64×64(1), 64×64(1), 64×64(1), 64×128(2), 128×128(1), 128×128(1), 128×256(2), 256×256(1), and 256×256(1). A standard ST-GCN including 10 ST-GCN blocks may include a total of 3,098,832 parameters.
For one exemplary embodiment referring to
The feature extractor may, based on the plurality of normalized joint poses, construct a spatial-temporal graph with the joints as graph nodes and the natural connectivities in both human body structures and time as graph edges.
For one example in one implementation, an undirected spatial temporal graph G=(V, E) may be constructed based on the plurality of normalized joint poses.
V may be the node set including N joints over T frames; for example, V includes nodes v_ti, wherein t is a positive integer representing the frame No. from 1 to T, inclusive, and i is a positive integer representing the Joint No. from 1 to N, inclusive.
E may be the edge set including two edge subsets. The first edge subset may represent the intra-skeleton connections at each frame; for example, the first edge subset E_f includes edges (v_ti, v_tj), wherein t is a positive integer representing the frame No. from 1 to T, inclusive; i is a positive integer representing the first Joint No. of the intra-skeleton connection from 1 to N, inclusive; and j is a positive integer representing the second Joint No. of the intra-skeleton connection from 1 to N, inclusive.
The second edge subset may represent the inter-frame edges connecting the same joint in consecutive frames; for example, the second edge subset E_s includes edges (v_ti, v_(t+1)i), wherein t is a positive integer representing the frame No. from 1 to T, inclusive; t+1 represents the consecutive frame; and i is a positive integer representing the Joint No. of the inter-frame connection from 1 to N, inclusive.
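For illustration and not limitation, the following sketch shows one possible construction of the intra-skeleton part of the spatial-temporal graph G=(V, E) as a normalized adjacency matrix with self-connections; the bone list is a hypothetical partial skeleton, and the inter-frame edges are realized by the temporal convolution over the frame axis rather than stored explicitly.

# Illustrative only: building a normalized intra-skeleton adjacency matrix.
# The bone list below is a hypothetical partial skeleton.
import numpy as np

N_JOINTS = 25
BONES = [(1, 0), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7), (1, 8)]  # (i, j) joint pairs

def build_adjacency(n_joints=N_JOINTS, bones=BONES):
    A = np.eye(n_joints, dtype=np.float32)           # self-connections
    for i, j in bones:                               # intra-skeleton edges (v_ti, v_tj)
        A[i, j] = A[j, i] = 1.0
    degree = A.sum(axis=1)
    d_inv_sqrt = np.diag(degree ** -0.5)
    return d_inv_sqrt @ A @ d_inv_sqrt               # symmetric normalization

A = build_adjacency()   # inter-frame edges (v_ti, v_(t+1)i) are handled by the
                        # temporal convolution over the frame axis in each block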
Referring to step 440, reducing a feature dimension of the plurality of rough features to obtain a plurality of dimension-shrunk features may be performed by a feature dimension reducer. The step 440 may apply convolution on the joints to obtain key joints and reduce feature dimensions for further processing.
As shown in
In one implementation, the output from the feature extractor has a size of 75×25×256, and the feature dimension reducer may reduce it to 18×12×128, wherein 18×12=216 is the length of the sequence, and 128 is the vector dimension.
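For illustration and not limitation, the following sketch shows one convolutional configuration whose output matches the shapes given above; the kernel sizes and strides are assumptions chosen solely so that a 75×25×256 input yields an 18×12×128 output.

# Illustrative only: a convolutional feature dimension reducer mapping a
# 75 x 25 x 256 feature map to 18 x 12 x 128 and flattening it to 216 tokens.
import torch
import torch.nn as nn

reducer = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=(5, 3), stride=(4, 2)),  # (frames, joints): 75->18, 25->12
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)

rough = torch.randn(1, 256, 75, 25)           # (batch, channels, frames, joints)
shrunk = reducer(rough)                       # -> (1, 128, 18, 12)
tokens = shrunk.flatten(2).transpose(1, 2)    # -> (1, 216, 128): sequence of 216 vectors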
Referring to
Referring to step 450, refining the plurality of dimension-shrunk features based on a self-attention model to obtain a plurality of refined features may be performed by a feature refiner 350 in
Referring to
Referring to
An actional-structural self-attention GCN may use the transformer encoder-like self-attention model, instead of a mere weight matrix, to explicitly learn inter-frame attention (correlation). The transformer encoder-like self-attention mechanism may also serve to refine the features, so that the level of accuracy may be preserved in comparison with the original ST-GCN model. The actional-structural self-attention GCN in the present disclosure may use the transformer encoder-like self-attention model to achieve at least the same level of accuracy as a standard ST-GCN at at least twice the action-recognition speed.
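For illustration and not limitation, the following sketch shows one possible transformer encoder-like self-attention feature refiner operating on the 216×128 token sequence produced by the feature dimension reducer; the number of attention heads, layers, and feed-forward width are assumptions made for this sketch.

# Illustrative only: a transformer encoder-like self-attention feature refiner.
import torch
import torch.nn as nn

refiner = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=8, dim_feedforward=256,
                               batch_first=True),
    num_layers=2,
)

tokens = torch.randn(1, 216, 128)    # (batch, sequence length, vector dimension)
refined = refiner(tokens)            # inter-frame attention learned explicitly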
Referring to step 460, recognizing a human action based on the plurality of refined features may be performed by a classifier 360 in
Referring to
Referring to
Optionally, the method may further include overlaying the predicted human action on one or more image frame, and displaying the overlaid image frame. In one implementation, the predicted human action may be overlaid as text with a prominent font type, size, or color. Optionally and/or alternatively in another implementation, the joint poses in the overlaid image frame may be displayed as well.
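For illustration and not limitation, the following sketch shows one possible way to overlay the predicted human action and the joint poses on an image frame using OpenCV; the font, color, joint radius, and confidence threshold are hypothetical choices.

# Illustrative only: overlaying the predicted action and joint poses on a frame.
import cv2

def overlay_prediction(frame, action_label, pose=None, conf_threshold=0.1):
    cv2.putText(frame, action_label, (20, 40), cv2.FONT_HERSHEY_SIMPLEX,
                1.2, (0, 255, 0), 3)                      # prominent text overlay
    if pose is not None:                                  # optionally draw the joint poses
        for x, y, c in pose:
            if c >= conf_threshold:
                cv2.circle(frame, (int(x), int(y)), 4, (0, 0, 255), -1)
    return frame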
For example,
The embodiments described in the present disclosure may be trained according to a general ST-GCN and/or tested by using standard reference datasets, for example but not limited to, the action recognition NTU RGB+D Dataset (http://rose1.ntu.edu.sg/datasets/actionrecognition.asp), and the Kinetics Dataset (https://deepmind.com/research/open-source/kinetics).
The NTU-RGB+D Dataset contains 56,880 skeletal motion sequences completed by one or two performers, which are divided into 60 categories (i.e., 60 human action classes). The NTU-RGB+D Dataset is one of the largest datasets for skeleton-based action recognition. The NTU-RGB+D Dataset provides, for each person, the three-dimension spatial coordinates of 25 joints in one action. To evaluate the model, two protocols may be used: a first protocol of cross-subject, and a second protocol of cross-view. In the cross-subject protocol, the 40,320 samples performed by 20 subjects may be divided into the training set, and the rest belong to the test set. The cross-view protocol may allocate data based on camera views, where the training and test sets may include 37,920 and 18,960 samples, respectively.
The Kinetics Dataset is a large dataset for human behavior analysis, containing more than 240,000 video clips covering 400 actions. Since only red-green-blue (RGB) video is provided, the OpenPose toolbox may be used to obtain skeleton data by estimating joint positions on certain pixels. The toolbox may generate two-dimension pixel coordinates (x, y) and a confidence c for a total of 25 joints from the resized video with a resolution of 340 pixels×256 pixels. Each joint may be represented as a three-element feature vector: [x, y, c]. For the multi-person case, the body with the highest average joint confidence in each sequence may be chosen. Therefore, a clip with T frames is converted into a skeleton sequence with a size of 25×3×T.
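For illustration and not limitation, the following sketch shows one possible conversion of per-frame multi-person OpenPose output into a 25×3×T skeleton sequence; the input format is an assumption, and the per-frame selection of the highest-confidence body is a simplification of the per-sequence selection described above.

# Illustrative only: converting multi-person per-frame pose output into a
# 25 x 3 x T skeleton sequence by keeping the highest-confidence body.
import numpy as np

def clip_to_skeleton_sequence(frames):
    """frames: list of length T, each an array of shape (P, 25, 3) holding
    (x, y, c) for P detected people; returns an array of shape (25, 3, T)."""
    selected = []
    for people in frames:
        if len(people) == 0:
            selected.append(np.zeros((25, 3), dtype=np.float32))   # no person detected
            continue
        best = int(np.argmax(people[:, :, 2].mean(axis=1)))        # highest mean joint confidence
        selected.append(people[best])
    return np.stack(selected, axis=-1)                             # (25, 3, T)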
Chart 1010 in
Chart 1030 in
The present disclosure also describes various applications for the embodiments described above. For one example of the various applications, the embodiments in the present disclosure may be used in an elderly care center. With the help of the action recognition technology provided by the embodiments in the present disclosure, service personnel at the elderly care center may more accurately record the main activities of a group of the elderly, and then analyze these data to improve the lives of seniors, for example, while seniors are doing exercise in an elderly care center (see
For another example of the various applications, the embodiments in the present disclosure may be used in auto detection. On some occasions, people may need to carry out a lot of repetitive tasks; for example, car manufacturing plant workers may need to conduct multiple factory inspections on the cars that are about to leave the factory. Such work may often require a high degree of conscientiousness and professional work ethics. If workers fail to perform such duties, it may be difficult to detect. With action recognition technology, car manufacturing plant personnel may better assess the performance of such staff. The embodiments in the present disclosure may be used to detect whether the main work steps are fully finished by the staff, which may help ensure that staff members carry out all their required duties and that products are properly tested and quality assured.
For another example of the various applications, the embodiments in the present disclosure may be used in smart schools. The embodiments in the present disclosure may be installed in public places like primary and secondary school campuses, to help school administrators identify and address certain problems that may exist with a few primary and secondary school students. For example, there may be incidents of campus bullying and school fights in some elementary and middle schools. Such incidents may occur when teachers are not present or may occur in a secluded corner of the campus. If these matters are not identified and dealt with in good time, they may escalate, and it may also be difficult to trace back to the culprits after the event. Action recognition and behavior analysis may immediately alert teachers and/or administrators of such situations so that they can be dealt with in a timely manner.
For another example of the various applications, the embodiments in the present disclosure may be used in intelligent prison and detention. The embodiments in the present disclosure may be used to provide analysis of detainees' actions, which can measure the detainees' mood status more accurately. The embodiments in the present disclosure may also be used to help prison management detect suspicious behavior by inmates. The embodiments in the present disclosure may be used in detention rooms and prisons to look out for fights and suicide attempts, which can modernize a city's correctional facilities and provide intelligent prison and detention management.
Through the descriptions of the preceding embodiments, persons skilled in the art may understand that the methods according to the foregoing embodiments may be implemented by hardware only, or by software and a necessary universal hardware platform. However, in most cases, using software and a necessary universal hardware platform is preferred. Based on such an understanding, the technical solutions of the present disclosure essentially, or the part contributing to the prior art, may be implemented in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for instructing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present disclosure.
While the particular invention has been described with reference to illustrative embodiments, this description is not meant to be limiting. Various modifications of the illustrative embodiments and additional embodiments will be apparent to one of ordinary skill in the art from this description. Those skilled in the art will readily recognize these and various other modifications can be made to the exemplary embodiments, illustrated and described herein, without departing from the spirit and scope of the present invention. It is therefore contemplated that the appended claims will cover any such modifications and alternate embodiments. Certain proportions within the illustrations may be exaggerated, while other proportions may be minimized. Accordingly, the disclosure and the figures are to be regarded as illustrative rather than restrictive.