The following generally relates to systems and methods for video and image processing for activity and event recognition, in particular to group activity recognition in images and videos with self-attention mechanisms.
Group activity detection and recognition from visual data such as images and videos involves identifying what an entity (e.g., a person) does in a group of entities (e.g., people) and what the group is doing as a whole. As an example, in a sport such as volleyball, an individual player may jump, while the group is performing a spike. Besides sports, such group activity recognition has several applications including crowd monitoring, surveillance, and human behavior analysis. Common tactics for recognizing group activities exploit representations that model spatial graph relations between individual entities (e.g., references [1, 2]) and follow those entities and their movements over time (e.g., references [1, 3]). It is common in the prior art to explicitly model these spatial and temporal relationships based on the locations of the entities, which requires either explicitly defining or using a pre-defined structure for groups of entities in a scene in order to model and recognize group activities.
In the prior art, many action recognition techniques are based on a holistic approach, learning a global feature representation of the image or video by explicitly modelling the spatial and temporal relationships between the people and objects in the scene. State-of-the-art techniques for image recognition such as Convolutional Neural Networks (CNNs) have been used for action detection and have been extended from two-dimensional images to videos to capture temporal information, which is vital for action recognition. Earlier methods rely on extracting features from each video frame using two-dimensional (2D) CNNs and then fusing them using different fusion methods to include temporal information—see reference [4]. Some prior art methods have leveraged Long Short-Term Memory neural networks (LSTMs) to model long-term temporal dependencies across frames—see reference [5]. Other work has extended the 2D convolutional filters to three-dimensional (3D) filters by using time as the third dimension to extract features from videos for different video analysis tasks—see reference [6].
Several studies have explored attention mechanisms for video action recognition by incorporating attention via LSTM models (see reference [5]), pooling methods (see reference [7]) or graph models (see reference [8]).
Most individual human actions are closely related to the position and motion of the body joints and the pose of the human body. This has been extensively explored in the literature, including using hand-crafted pose features (see reference [9]), skeleton data (see reference [10]), body joint representations (see reference [11]) and attention guided by pose (see reference [12]). However, these approaches were only designed to recognize the action of one individual actor, which is not applicable to inferring group activities because they lack information about the interactions between the entities in the group.
Prior art methods for group activity recognition often relied on designing and using hand-crafted features to represent the visual data for further analysis, engineered explicitly to extract characteristic information about each individual in the scene, which were then processed by probabilistic graphical models (see reference [13]) for the final inference. Some of the more recent methods utilize artificial neural networks, and more specifically recurrent neural network (RNN)-type networks, to infer group activities from extracted image or video features—see references [3] and [14].
Rather than explicitly define and model the spatial and temporal relationships between the entities in the visual data based on the location of the entities to infer individual and group activities, the disclosed method uses an implicit spatio-temporal model which automatically learns the spatial and temporal configuration of the groups of entities (e.g., humans) from the visual data, using the visual appearance and spatial attributes of the entities (e.g. body skeleton or body pose information for humans) for recognizing group activities. The learning is done by applying machine learning and artificial intelligence techniques on the visual data, to extract spatial, temporal, and spatio-temporal information characterizing content of the visual data, also known as visual features. Visual features are numerical representations of the visual content, often coded as a vector of numbers. In this document the terms “numerical representation” and “features” are used interchangeably.
The following also discloses individual and group activity detection methods that use visual data to detect and recognize the activity of an individual and of the group that it belongs to. The methods are based on learning appearance characteristics from the images in the videos using machine learning and artificial intelligence techniques, together with the spatial attributes of the entities and persons, to selectively extract information relevant for individual and group activity recognition.
In an aspect, the following discloses a method for group and individual activity recognition from video data which is able to jointly use pixel-level video data, motion information, and the skeletal shape of the people and their spatial attributes in the scene, which model both static and dynamic representations of each individual subject (person), to automatically learn to recognize and localize the individual and group actions and the key actor in the scene. The method uses a self-attention mechanism that learns and selectively extracts the important representative features for individual and group activities, and learns to construct a model to understand and represent the relationships and interactions between multiple people and objects in a group setting. Those extracted representative features are represented by numerical values, which can further be used to recognize and detect individual and group activities.
As understood herein, a self-attention mechanism models dependencies and relations between the individuals in the scene, referred to herein as actors, and combines actor-level information for group activity recognition via a learning mechanism. Therefore, it does not require explicit and pre-defined spatial and temporal constraints to model those relationships.
Although certain aspects of the disclosed methods relate to group and individual activity recognition involving people and objects, the systems and methods described herein can also be used for activity recognition involving only objects without people, such as traffic monitoring, as long as the objects have some representative static and dynamic features and there is spatial and temporal structure between the objects in the scene.
In one aspect, there is provided a method for processing visual data for individual and group activities and interactions, the method comprising: receiving at least one image from a video of a scene showing one or more entities at a corresponding time; using a training set comprising at least one labeled individual or group activity; and applying at least one machine learning or artificial intelligence technique to learn from the training set to represent spatial, temporal or spatio-temporal content of the visual data and numerically model the visual data by assigning numerical representations.
In an implementation, the method further includes applying learnt machine learning and artificial models to the visual data; identifying individual and group activities by analyzing the numerical representation assigned to the spatial, temporal, or spatio-temporal content of the visual data; and outputting at least one label to categorize an individual or a group activity in the visual data.
In other aspects, systems, devices, and computer readable media configured to perform the above method are also provided.
Embodiments will now be described with reference to the appended drawings wherein:
An exemplary embodiment of the presently described system takes a visual input, such as an image or video of a scene with multiple entities including individuals and objects, to detect, recognize, identify, categorize, label, analyze and understand the individual actions, the group activities, and the key individual or entity that either performs the most important action in the group or carries out a main action characterizing the group activity, which is referred to as the “key actor”. The individual actions and group activities include human actions, human-human interactions, human-object interactions, or object-object interactions.
In the exemplary embodiment, a set of labeled videos or images containing at least one image or video of at least one individual or group activity is used as the “training set” to train the machine learning algorithms. Given the training set, the machine learning algorithms learn to process the visual data for individual and group activities and interactions by generating a numerical representation of the spatial, temporal or spatio-temporal content of the visual data. The numerical representations, which are sometimes referred to as “visual features” or “features”, either explicitly represent the labels and categories for the individual and group activities, or implicitly represent them for further processing. After training, the learnt models process an input image or video to generate the numerical representation of the visual content.
Referring to the drawings,
Turning now to
Further detail of the operation of the configurations shown in
In an exemplary embodiment, illustrated also in
In this exemplary embodiment, the feature vectors representing the appearance and the skeletal structure of the person are obtained by passing images through artificial neural networks. However, any suitable method can be used to extract intermediate features representing the images. Therefore, while examples are provided using artificial neural networks, the principles described herein should not be limited thereto.
All human actions involve the motion of body joints, such as the hands and legs. This applies not only to fine-grained actions performed in sports activities, e.g., a spike and a set in a volleyball game, but also to everyday actions such as walking and talking. This means that it is important to capture not only the position of the joints but their temporal dynamics as well. For this purpose, one can use both the position and the motion of individual body joints and of the actors themselves.
To obtain joint positions, a pose estimation model can be applied. This model receives as input a bounding box around the actor and predicts the locations of the key joints. This embodiment does not rely on a particular choice of pose estimation model; for example, a state-of-the-art body pose estimation network such as HRNet can be used—see reference [15]. One can use the features from the last layer of the pose estimation neural network, right before the final classification layer. To extract the temporal dynamics of each actor and model the motion data from the video frames, state-of-the-art 3D CNNs such as I3D models can be used. The dynamic feature extraction models can be applied to the sequence of detected body joints across the video, the raw video pixel data, or the optical flow video. The dynamic features are extracted from stacked frames F_t, t=1, . . . , T. RGB pixel data and optical flow representations are considered here, but as will be appreciated by those skilled in computer vision, the dynamic features can be extracted from multiple different sources using different techniques. The dynamic feature extractors can be applied either to the whole video frame or only to the spatio-temporal region in which an actor or entity of interest is present.
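By way of illustration only, the following is a minimal PyTorch-style sketch of how per-actor static and dynamic features could be assembled. The modules `pose_backbone` and `i3d_backbone` are hypothetical stand-ins for the pose estimation network (e.g., HRNet) and the 3D CNN (e.g., I3D) mentioned above, and the crop size, pooling choices and use of RoIAlign are assumptions rather than requirements of the disclosed method.

```python
from torchvision.ops import roi_align

def extract_actor_features(frames, boxes, pose_backbone, i3d_backbone):
    """Assemble per-actor static (pose) and dynamic (3D CNN) feature vectors.

    frames: (T, 3, H, W) tensor of RGB or optical-flow frames.
    boxes:  (N, 4) tensor of actor bounding boxes as (x1, y1, x2, y2).
    pose_backbone: maps (N, 3, 64, 64) actor crops to (N, D_static) features.
    i3d_backbone:  maps a (1, 3, T, H, W) clip to a (1, C, T', H', W') feature map.
    """
    T = frames.shape[0]

    # Static modality: crop each actor from the labelled middle frame and run
    # the pose backbone to obtain one static feature vector per actor.
    middle = frames[T // 2].unsqueeze(0)                       # (1, 3, H, W)
    crops = roi_align(middle, [boxes], output_size=(64, 64))   # (N, 3, 64, 64)
    static_feats = pose_backbone(crops)                        # (N, D_static)

    # Dynamic modality: run the 3D CNN over the stacked frames, then pool its
    # feature map over the spatial region occupied by each actor.
    clip = frames.permute(1, 0, 2, 3).unsqueeze(0)             # (1, 3, T, H, W)
    fmap = i3d_backbone(clip).mean(dim=2)                      # average over time
    scale = fmap.shape[-1] / frames.shape[-1]
    dynamic_feats = roi_align(fmap, [boxes], output_size=(1, 1),
                              spatial_scale=scale).flatten(1)  # (N, C)

    return static_feats, dynamic_feats
```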
Transformer networks can learn and select important information for a specific task. A transformer network includes two main parts, an encoder and a decoder. The encoder receives an input sequence of words (the source) that is processed by a stack of identical layers, each including a multi-head self-attention layer and a fully connected feed-forward network. A decoder then generates an output sequence (the target) from the representation generated by the encoder. The decoder is built in a similar way to the encoder and has access to the encoded sequence. The self-attention mechanism is the vital component of the transformer network, and it can also be successfully used to reason about actors' relations and interactions.
Attention A is a function that represents a weighted sum of the values V. The weights are computed by matching a query Q with the set of keys K. The matching function can have different forms, the most popular being the scaled dot-product. Formally, attention with the scaled dot-product matching function can be written as:

A(Q, K, V)=softmax(QK^T/√d)V

where d is the dimension of both queries and keys. In the self-attention module, all three representations (Q, K, V) are computed from the input sequence S via linear projections.
Since attention is a weighted sum of all values, it overcomes the problem of forgetfulness over time. This mechanism gives more importance to the most relevant observations, which is a required property for group activity recognition, because the system can enhance each actor's features based on the other actors in the scene without any spatial constraints. Multi-head attention A_h is an extension of attention with several parallel attention functions using separate linear projections h_i of (Q, K, V):

A_h(Q, K, V)=concat(h_1, . . . , h_m)W

h_i=A(QW_i^Q, KW_i^K, VW_i^V)
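As a concrete illustration of the equations above, the following sketch implements scaled dot-product and multi-head self-attention in PyTorch. The tensor shapes, the projection matrix names W_q, W_k, W_v, W_o and the number of heads are assumptions for illustration; splitting a single projection matrix across heads is equivalent to using separate per-head projections W_i^Q, W_i^K, W_i^V.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # A(Q, K, V) = softmax(QK^T / sqrt(d)) V
    d = Q.shape[-1]
    weights = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return weights @ V

def multi_head_self_attention(S, W_q, W_k, W_v, W_o, num_heads):
    """S: (N, D) set of actor features; W_q, W_k, W_v, W_o: (D, D) projections."""
    N, D = S.shape
    d_head = D // num_heads
    # Linear projections of the input sequence S, split into parallel heads h_i.
    Q = (S @ W_q).view(N, num_heads, d_head).transpose(0, 1)   # (heads, N, d_head)
    K = (S @ W_k).view(N, num_heads, d_head).transpose(0, 1)
    V = (S @ W_v).view(N, num_heads, d_head).transpose(0, 1)
    heads = scaled_dot_product_attention(Q, K, V)              # (heads, N, d_head)
    # A_h(Q, K, V) = concat(h_1, ..., h_m) W, with W_o playing the role of W.
    concat = heads.transpose(0, 1).reshape(N, D)
    return concat @ W_o
```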
Transformer encoder layer E includes a multi-head attention combined with a feed-forward neural network L:
L(X)=Linear(Dropout(ReLU(Linear(X))))

E′(S)=LayerNorm(S+Dropout(A_h(S)))

E(S)=LayerNorm(E′(S)+Dropout(L(E′(S))))
The transformer encoder can contain several such layers, which sequentially process an input S.
S is a set of actors' features S={s_i|i=1, . . . , N} obtained by the actor feature extractors and represented by numerical values. As the features s_i do not follow any particular order, the self-attention mechanism 18 is a more suitable model than an RNN or a CNN for refinement and aggregation of these features. An alternative approach is to incorporate a graph representation. However, a graph representation requires explicit modeling of the connections between nodes through appearance and position relations. The transformer encoder mitigates this requirement by relying solely on the self-attention mechanism 18. The transformer encoder also implicitly models spatial relations between actors via positional encoding of s_i. This can be done by representing the bounding box b_i of each actor's features s_i by its center point (x_i, y_i) and encoding that center point.
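A minimal sketch, assuming PyTorch's built-in transformer encoder layers, of how the set of actor features S could be refined with positional encoding of the bounding-box centers. The embedding dimension, the number of heads and layers, and the learned linear encoding of the center points are illustrative assumptions rather than the exact configuration of the embodiment.

```python
import torch
import torch.nn as nn

class ActorTransformer(nn.Module):
    """Refines a set of actor features with self-attention (no explicit graph)."""
    def __init__(self, dim=1024, heads=8, layers=1, dropout=0.1):
        super().__init__()
        self.center_embed = nn.Linear(2, dim)   # encodes box centers (x_i, y_i)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim, dropout=dropout)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, actor_feats, boxes):
        # actor_feats: (N, dim) features s_i; boxes: (N, 4) as (x1, y1, x2, y2).
        centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                               (boxes[:, 1] + boxes[:, 3]) / 2], dim=1)
        x = actor_feats + self.center_embed(centers)   # implicit spatial relations
        # nn.TransformerEncoder expects input shaped (sequence, batch, dim).
        refined = self.encoder(x.unsqueeze(1)).squeeze(1)
        return refined                                  # (N, dim) refined features
```

Each encoder layer applies exactly the multi-head attention, dropout, residual and layer-normalization pattern of the E′(S)/E(S) equations above.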
It is apparent that using information from different modalities, i.e., static, dynamic, spatial attribute, RGB pixel value, and optical flow modalities, improves the performance of activity recognition methods. In this embodiment, several modalities are incorporated for individual and group activity detection, referred to as the static and dynamic modalities. The static modality is represented by the pose model, which captures the static position of body joints or spatial attributes of the entities, while the dynamic modality is obtained by applying a temporal machine learning video processing technique such as I3D to a sequence of images in the video and is responsible for the temporal features of each actor in the scene. As RGB pixel values and optical flow capture different aspects of motion, both of them are used in this embodiment. To fuse the static and dynamic modalities, two fusion strategies can be used: early fusion of the actors' features before the transformer network, and late fusion, which aggregates the labels assigned to the actions after classification/categorization. Early fusion enables access to both static and dynamic features before inference of the group activity. Late fusion processes the static and dynamic features separately for group activity recognition and can concentrate on each modality on its own.
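The two fusion strategies could be sketched as follows, where `transformer`, `transformers`, `classifier` and `classifiers` are hypothetical callables (e.g., an actor transformer and a linear classification head), and the simple averaging over actors and over per-modality scores is an assumption for illustration.

```python
import torch

def early_fusion(static_feats, dynamic_feats, transformer, classifier):
    # Early fusion: combine modalities per actor before the transformer, so
    # self-attention sees both static and dynamic evidence at once.
    fused = torch.cat([static_feats, dynamic_feats], dim=-1)   # (N, D_s + D_d)
    refined = transformer(fused)
    return classifier(refined.mean(dim=0))                     # group-activity logits

def late_fusion(static_feats, dynamic_feats, transformers, classifiers):
    # Late fusion: each modality is refined and classified separately, and the
    # per-modality scores are aggregated afterwards.
    logits = []
    for feats, tr, clf in zip((static_feats, dynamic_feats), transformers, classifiers):
        refined = tr(feats)
        logits.append(clf(refined.mean(dim=0)))
    return torch.stack(logits).mean(dim=0)                     # averaged logits
```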
Training Objective
The parameters of all the components, namely the static and dynamic models, the self-attention mechanism 18, and the fusion mechanism, can be estimated either separately or jointly using standard machine learning techniques such as the gradient-based learning methods commonly used for artificial neural networks. In one setting, the parameters of all of those components can be estimated using a standard classification loss function, learnt from a set of available labelled examples. If the parameters of those components are learnt separately, each component can be estimated on its own and the learnt models can then be combined. To estimate all parameters together, the neural network models can be trained in an end-to-end fashion to simultaneously predict the individual actions of each actor and the group activity. For both tasks one can use a standard loss function such as cross-entropy loss and combine the two losses in a weighted sum:
L=λ_g L_g(y_g, ỹ_g)+λ_a L_a(y_a, ỹ_a)

where L_g and L_a are cross-entropy losses, y_g and y_a are ground truth labels, ỹ_g and ỹ_a are predictions for the group activity and the individual actions, respectively, and λ_g and λ_a are scalar weights of the two losses.
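A short sketch of the combined objective in PyTorch, assuming logits for the group activity and for each individual actor are available; the default weight values are placeholders.

```python
import torch.nn.functional as F

def combined_loss(group_logits, group_label, action_logits, action_labels,
                  lambda_g=1.0, lambda_a=1.0):
    # L = lambda_g * L_g(y_g, y~_g) + lambda_a * L_a(y_a, y~_a), both cross-entropy.
    loss_g = F.cross_entropy(group_logits, group_label)     # (1, C_g) vs (1,)
    loss_a = F.cross_entropy(action_logits, action_labels)  # (N, C_a) vs (N,)
    return lambda_g * loss_g + lambda_a * loss_a
```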
Experiments were carried out on publicly available group activity datasets, namely the volleyball dataset (see reference [3]) and the collective dataset (see reference [16]). The results were compared to the state-of-the-art.
For simplicity, in the next several paragraphs the static modality is called “Pose”, the dynamic modality that uses raw pixel data from video frames is called “RGB”, and the dynamic modality that uses optical flow frames is called “Flow”.
The volleyball dataset included clips from 55 videos of volleyball games, which are split into two sets: 39 training videos and 16 testing videos. There are 4830 clips in total: 3493 training clips and 1337 testing clips. Each clip is 41 frames in length. The available annotation includes the group activity label and the individual players' bounding boxes with their respective actions, which are provided only for the middle frame of the clip. The dataset has been extended with ground truth bounding boxes for the rest of the frames in the clips, which are also used in the experimental evaluation. The list of group activity labels contains four main activities (set, spike, pass, win point), each divided into two subgroups, left and right, giving eight group activity labels in total. Each player can perform one of nine individual actions: blocking, digging, falling, jumping, moving, setting, spiking, standing and waiting.
The collective dataset included 44 clips with lengths varying from 193 frames to around 1800 frames per clip. Every 10th frame has annotations of the persons' bounding boxes with one of five individual actions: crossing, waiting, queueing, walking and talking. The group activity is determined by the action that most people perform in the clip.
For experimental evaluation, T=10 frames are used as the input: the frame labeled for the individual actions and group activity as the middle frame, the 5 frames before it, and the 4 frames after it. During training, one frame F_tp out of the T input frames is randomly sampled for the pose modality to extract the relevant body pose features. The group activity recognition accuracy is used as the evaluation metric.
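For clarity, the frame selection described above might look like the following sketch; the helper name and the uniform random choice of the pose frame are assumptions.

```python
import random

def sample_clip(middle_idx):
    """T=10 frame indices: the 5 frames before the labelled middle frame, the middle
    frame itself, and the 4 frames after it, plus one randomly chosen index F_tp
    used for the pose (static) modality during training."""
    window = list(range(middle_idx - 5, middle_idx + 5))   # 10 indices, middle included
    pose_idx = random.choice(window)
    return window, pose_idx
```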
Using the static modality (human body pose) without the dynamic modality results in an average accuracy of 91% for group activity recognition on the volleyball dataset. Including the relative positions of all the people in the scene, referred to as “positional encoding”, increases the accuracy to 92.3%. Therefore, explicitly adding information about the actors' positions helps the transformer better reason about this aspect of the group activity. Using the static and dynamic modalities separately, without any information fusion, the results on the volleyball dataset are shown in
The results of combining dynamic and static modalities are presented in
Comparison with the state-of-the-art on the volleyball dataset is shown in
The static and dynamic modalities representing individual and group activities are used together to automatically learn the spatio-temporal context of the scene for group activities using a self-attention mechanism. In this particular embodiment, the human body pose is used as the static modality. However, any feature extraction technique can be applied to the images to extract other sorts of static representations instead of body pose. In addition, the static features extracted from images can be stacked together to be used as the dynamic modality. The same can be applied to the dynamic modality to generate static features. Another key component is the self-attention mechanism 18, which dynamically selects the most relevant representative features for activity recognition from each modality. This exemplary embodiment discloses the use of human pose information from one single image as one of the inputs to the method; however, various modifications to make use of a sequence of images instead of one image will be apparent to those skilled in the art. Those skilled in the art will also appreciate that a multitude of different feature extractors and optimization loss functions can be used instead of the exemplary ones in the current embodiment. Although the examples use videos as the input to the model, one single image can be used instead, and rather than using static and dynamic modalities, only the static modality can be used. In this case, the body pose and the features extracted from the raw image pixels are both considered static modalities.
The exemplary methods described herein are used to categorize the visual input and assign appropriate labels to the individual actions and group activities. However, similar techniques can detect those activities in a video sequence, meaning that the time at which the activities are happening in a video can be identified, as well as the spatial region in the video where the activities are happening. A sample method, which will be apparent to those skilled in the art, is to use a moving window over multiple video frames in time to detect and localize those activities, as sketched below.
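Such a moving-window approach might be sketched as follows, where `recognize_clip`, the window length and the stride are hypothetical choices rather than part of the disclosed method.

```python
def localize_activities(video_frames, recognize_clip, window=10, stride=5):
    """Slide a temporal window over the video and keep per-window predictions,
    giving a coarse temporal localization of when each group activity occurs."""
    detections = []
    for start in range(0, max(1, len(video_frames) - window + 1), stride):
        clip = video_frames[start:start + window]
        label, score = recognize_clip(clip)   # hypothetical clip-level recognizer
        detections.append((start, start + window, label, score))
    return detections
```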
To better understand the performance of the exemplary model, one can present the confusion matrices for group activity recognition on the volleyball dataset in
For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the examples described herein. However, it will be understood by those of ordinary skill in the art that the examples described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the examples described herein. Also, the description is not to be considered as limiting the scope of the examples described herein.
It will be appreciated that the examples and corresponding diagrams used herein are for illustrative purposes only. Different configurations and terminology can be used without departing from the principles expressed herein. For instance, components and modules can be added, deleted, modified, or arranged with differing connections without departing from these principles.
It will also be appreciated that any module or component exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the system 10, 20, 25, any component of or related to the system 10, 20, 25, etc., or accessible or connectable thereto. Any application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media.
The steps or operations in the flow charts and diagrams described herein are just for example. There may be many variations to these steps or operations without departing from the principles discussed above. For instance, the steps may be performed in a differing order, or steps may be added, deleted, or modified.
Although the above principles have been described with reference to certain specific examples, various modifications thereof will be apparent to those skilled in the art as outlined in the appended claims.
This application is a Continuation of PCT Application No. PCT/CA2021/050391 filed on Mar. 25, 2021, which claims priority to U.S. Provisional Patent Application No. 63/000,560 filed on Mar. 27, 2020, the entire contents of which are incorporated herein by reference.
Related U.S. Application Data: Provisional Application No. 63/000,560, filed March 2020 (US); Parent Application No. PCT/CA2021/050391, filed March 2021; Child Application No. 17/817,454 (US).