The present application claims priority to Swedish Patent Application No. 2150749-6, filed Jun. 11, 2021, the content of which is incorporated herein by reference in its entirety.
The present disclosure relates generally to training of neural networks and, in particular, to such training for action recognition in time-sequences of data samples that represent objects performing various actions.
Action recognition, classification and understanding in videos or other time-resolved reproductions of moving subjects (humans, animals, etc.) form a significant research domain in computer vision. Action recognition, also known as activity recognition, has many applications given the abundance of available moving visual media in today's society, including intelligent search and retrieval, surveillance, sports events analytics, health monitoring, human-computer interaction, etc. At the same time, action recognition is considered one of the most challenging tasks of computer vision.
Neural networks have shown great promise for use in systems for action recognition. Neural networks may be trained by use of recordings that are associated with accurate action annotations (labels). Conventionally, the process of annotating recordings is performed manually by experts in the field and is both time-consuming and expensive.
Self-supervised learning (SSL) aims to learn feature representations from large amounts of unlabeled data. It has been proposed to use self-supervised training to help supervised training, by pre-training a network by use of unlabeled data and then fine-tuning the pre-trained network by use of a small amount of labeled data. Such an approach applied on individual images is described in the article “Bootstrap your own latent—a new approach to self-supervised learning”, by Grill et al, arXiv:2006.07733v3 [cs.LG] 10 Sep. 2020. The pre-training relies on two neural networks that interact and learn from each other. The pre-training uses augmented views generated from input images by a single augmentation pipeline. From an augmented view of an image, a first network is trained to predict the representation generated by the second network for another augmented view of the same image. Concurrently, the second network is updated with a slow-moving average of the first network.
It is an objective to at least partly overcome one or more limitations of the prior art.
Another objective is to improve action recognition in input data by neural networks.
A further objective is to reduce the amount of labelled input data needed to train a neural network to perform action recognition at a given accuracy.
One or more of these objectives, as well as further objectives that may appear from the description below, are at least partly achieved by a system, a method, and a computer-readable medium according to the independent claims, embodiments thereof being defined by the dependent claims.
Still other objectives, as well as features, aspects and technical effects will appear from the following detailed description, from the attached claims as well as from the drawings.
Embodiments will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments are shown. Indeed, the subject of the present disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure may satisfy applicable legal requirements.
Also, it will be understood that, where possible, any of the advantages, features, functions, devices, and/or operational aspects of any of the embodiments described and/or contemplated herein may be included in any of the other embodiments described and/or contemplated herein, and/or vice versa. In addition, where possible, any terms expressed in the singular form herein are meant to also include the plural form and/or vice versa, unless explicitly stated otherwise. As used herein, “at least one” shall mean “one or more” and these phrases are intended to be interchangeable. Accordingly, the terms “a” and/or “an” shall mean “at least one” or “one or more”, even though the phrase “one or more” or “at least one” is also used herein. As used herein, except where the context requires otherwise owing to express language or necessary implication, the word “comprise” or variations such as “comprises” or “comprising” is used in an inclusive sense, that is, to specify the presence of the stated features but not to preclude the presence or addition of further features in various embodiments. The term “compute”, and derivatives thereof, is used in its conventional meaning and may be seen to involve performing a calculation involving one or more mathematical operations to produce a result, for example by use of a computer.
As used herein, the terms “multiple”, “plural” and “plurality” are intended to imply provision of two or more elements, whereas a “set” of elements is intended to imply provision of one or more elements. The term “and/or” includes any and all combinations of one or more of the associated listed elements.
It will furthermore be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure.
Well-known functions or constructions may not be described in detail for brevity and/or clarity. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
Like numbers refer to like elements throughout.
Before describing embodiments in more detail, a few definitions will be given.
As used herein, “keypoint” has its conventional meaning in the field of computer vision and is also known as an interest point. A keypoint is a spatial location or point in an image that defines what is interesting or what stands out in the image and may be defined to be invariant to image rotation, shrinkage, translation, distortion, etc. More generally, a keypoint may be denoted a “reference point” on an object to be detected in the image, with the reference point having a predefined placement on the object. Keypoints may be defined for a specific type of object, for example a human or animal body, a part of the human or animal body, or an inanimate object with a known structure or configuration. In the example of a human or animal body, keypoints may identify one or more joints and/or extremities. Keypoints may be detected by use of any existing feature detection algorithm(s), for example image processing techniques that are operable to detect one or more of edges, corners, blobs, ridges, etc. in digital images. Non-limiting examples of feature detection algorithms comprise SIFT (Scale-Invariant Feature Transform), SURF (Speeded Up Robust Feature), FAST (Features from Accelerated Segment Test), SUSAN (Smallest Univalue Segment Assimilating Nucleus), Harris affine region detector, and ORB (Oriented FAST and Rotated BRIEF). Further information about conventional keypoint detectors is found in the article “Local invariant feature detectors: a survey”, by Tuytelaars et al, published in Found. Trends. Comput. Graph. Vis. 3(3), 177-280 (2007). Further examples of feature detection algorithms are found in the articles “Simple Baselines for Human Pose Estimation and Tracking”, by Xiao et al, published at ECCV 2018, and “Deep High-Resolution Representation Learning for Human Pose Estimation”, by Sun et al, published at CVPR 2019. Correspondingly, objects may be detected in images by use of any existing object detection algorithm(s). Non-limiting examples include various machine learning-based approaches or deep learning-based approaches, such as the Viola-Jones object detection framework, SIFT, HOG (Histogram of Oriented Gradients), Region Proposals (RCNN, Fast-RCNN, Faster-RCNN), SSD (Single Shot MultiBox Detector), You Only Look Once (YOLO, YOLO9000, YOLOv3), and RefineDet (Single-Shot Refinement Neural Network for Object Detection).
As used herein, “pose” defines the posture of an object and comprises a collection of positions which may represent keypoints. The positions may be two-dimensional (2D) positions, for example in an image coordinate system, resulting in a 2D pose, or three-dimensional (3D) positions, for example in a scene coordinate system, resulting in a 3D pose. A 3D pose may be generated based on two or more images taken from different viewing angles, or by an imaging device capable of depth-sensing, for example as implemented by Microsoft Kinect™. A pose for a human or animal object is also referred to as a “skeleton” herein.
As used herein, “action sequence” refers to a time sequence of data samples that depict an object while the object performs an activity. The activity may or may not correspond to one or more actions among a group of predefined actions. If the activity corresponds to one or more predefined actions, the action sequence may be associated with one or more labels or tags indicative of the predefined action(s). Such an action sequence is “labeled” or “annotated”. An action sequence that is not associated with a label or tag is “unlabeled” or “non-annotated”. The action sequence may be a time sequence of images, or a time sequence of poses. When based on poses, the action sequence may be seen to comprise a time sequence of object representations, with each object representation comprising locations of predefined features on the object.
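By way of a non-limiting illustration, a pose-based action sequence as defined above may be held in memory as a numeric array indexed by time step, keypoint and spatial dimension. The following Python sketch is merely an assumed example; the array shapes, the choice of 17 keypoints and the data type are illustrative and not limiting:

```python
import numpy as np

# Illustrative representation of a pose-based action sequence:
#   T - number of time steps (data samples)
#   K - number of keypoints per object representation (pose/skeleton)
#   D - 2 for 2D poses, 3 for 3D poses
T, K, D = 50, 17, 3

# Each element [t, k, :] holds the location of keypoint k at time step t.
action_sequence = np.zeros((T, K, D), dtype=np.float32)

# An optional annotation for a labeled action sequence;
# None for an unlabeled action sequence.
label = None
```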
As used herein, “neural network” refers to a computational learning system that uses a network of functions to understand and translate a data input of one form into a desired output, usually in another form. The neural network comprises a plurality of interconnected layers of neurons. A neuron is an algorithm that receives inputs and aggregates them to produce an output, for example by applying a respective weight to the inputs, summing the weighted inputs and passing the sum through a non-linear function known as an activation function.
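As a minimal, non-limiting illustration of the neuron model described above, the following sketch applies a respective weight to the inputs, sums the weighted inputs and passes the sum through an activation function; the choice of ReLU as activation function is an assumption made only for the example:

```python
import numpy as np

def neuron(inputs: np.ndarray, weights: np.ndarray, bias: float) -> float:
    # Apply a respective weight to each input and sum the weighted inputs.
    weighted_sum = float(np.dot(weights, inputs)) + bias
    # Pass the sum through a non-linear activation function (ReLU as an example).
    return max(0.0, weighted_sum)
```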
Embodiments are related to training of neural networks for action recognition, in particular with limited access to labeled data. Some embodiments relate to techniques of training a neural network to generate labeled action sequences from unlabeled action sequences. Some embodiments belong to the field of self-supervised learning (SSL) and involve data augmentation of the unlabeled action sequences as a technique of priming a neural network with desirable invariances. By training the neural network to ignore variances among data samples that are irrelevant for action recognition, the trained neural network will be capable of creating meaningful representations of the unlabeled action sequences. Some embodiments relate to techniques for configuring the data augmentation to improve the performance of the trained neural network in terms of its representations of the unlabeled action sequences. Some embodiments relate to techniques for further improving the performance of such a trained neural network.
Action recognition involves processing action sequences to determine one or more actions performed by the object in the action sequences. It is a complex task subject to active research. Neural networks are well-suited for this task if properly trained.
It is currently believed that the use of skeleton data will improve the performance of action recognition by neural networks. The conversion from images to skeletons may be advantageous as it reduces the amount of data that has to be processed for action recognition. This enables use of light-weight and robust action recognition algorithms. Training such algorithms in a fully-supervised manner requires large datasets of skeleton data with accurate action annotation. However, skeleton-based training data is scarce and generating the required datasets is a time-consuming and expensive process requiring domain experts. Embodiments to be described in the following, with reference to
While the following description may refer to the use of action sequences in the form of skeleton sequences, it is equally applicable to other types of action sequences, for example comprising a respective time sequence of digital images. Similarly, although the object may be presented as a human or animal, it may be an inanimate object.
The system 1A in
The system 1A in
In accordance with embodiments, the sub-modules 21, 22 differ by at least one augmentation function. In one example, consistent with the above-mentioned aggressive-conservative configuration, sub-module 22 comprises the augmentation function(s) of sub-module 21, or a subset thereof, and one or more additional augmentation functions. In another example, which also may be consistent with the aggressive-conservative configuration, the augmentation function(s) in sub-module 22 differ from the augmentation function(s) in sub-module 21.
Reverting to
In some embodiments, the action sequences AS1, AS2 in at least some of the pairs are identical. For example, the augmentation module 20 may retrieve a single action sequence from the database 10 and duplicate it to form AS1, AS2. Alternatively, each action sequence may be stored in two copies in the database 10 for retrieval by the augmentation module 20. This type of action sequence is referred to as “duplicate AS” in the following.
In some embodiments, the action sequences AS1, AS2 in at least some of the pairs are taken from two different viewing angles onto the object while it performs an activity, referred to as “multiview AS” in the following. Thus, AS1 and AS2 may be recorded at the same time to depict the object from two different directions. For example, the multiview AS may originate from two imaging devices in different positions in relation to the object. The use of multiview AS in pre-training is currently believed to improve performance, for example by allowing the neural network to learn representations that are robust to changes of viewpoint and different camera properties. It may be noted that I1, I2 may be generated to include both duplicate AS and multiview AS, as well as other types of action sequences.
The first updating module 13 is configured to receive Q and Z′ and, based on Q and Z′ for a number of incoming action sequences, compute and update the values of control parameters of the first neural network 11, as indicated by an arrow 131. It is to be noted that the first updating module 13 does not update the control parameters of the second neural network 12.
In the illustrated example, the second updating module 14 comprises a first sub-module 141, which is configured to update control parameters of encoder 121 based on control parameters of encoder 111, and a second sub-module 142, which is configured to update control parameters of projector 122 based on control parameters of projector 112. The first and second sub-modules 141, 142 may use the same or different functions to update the control parameters.
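A minimal sketch of the first and second updating modules is given below in Python (PyTorch), under the illustrative assumptions that the difference between Q and Z′ is measured as a mean-squared error between L2-normalized representations and that the second updating module applies an exponential moving average with decay factor tau; neither choice is mandated by the embodiments, and the optimizer is assumed to hold only the parameters of the first neural network 11.

```python
import torch
import torch.nn.functional as F

def update_first_network(q, z_prime, optimizer):
    # First updating module (13): minimize a difference between the first
    # representation data Q and the second representation data Z'.
    # The difference is taken here as MSE between L2-normalized vectors
    # (an illustrative choice). Z' is detached so that the control
    # parameters of the second neural network (12) are not updated.
    loss = F.mse_loss(F.normalize(q, dim=-1),
                      F.normalize(z_prime.detach(), dim=-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # updates the control parameters of network 11 only
    return loss.item()

@torch.no_grad()
def update_second_network(net1, net2, tau=0.99):
    # Second updating module (14): update the control parameters of the
    # second neural network (12) as a function of the parameters of the
    # first neural network (11), here as an exponential moving average.
    # Iterating over encoder and projector parameters corresponds to the
    # sub-modules 141 and 142.
    for p1, p2 in zip(net1.parameters(), net2.parameters()):
        p2.mul_(tau).add_((1.0 - tau) * p1)
```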
In the illustrated example, step 601 comprises retrieving AS1 and AS2 from the database 10. As noted above, step 601 may retrieve AS1, AS2 as duplicate AS or multiview AS. Step 602 comprises generating augmented versions of AS1 and AS2. Step 602 may be seen to comprise sub-step 602A, in which MAS1 is generated by operating the first sub-module 21 on AS1, and sub-step 602B, in which MAS2 is generated by operating the second sub-module 22 on AS2. Step 603 comprises including MAS1 and MAS2 in I1 and I2. Step 603 may also comprise a normalization processing of MAS1 and MAS2 before they are included in I1 and I2. For example, the normalization processing may comprise rotating the poses to face in a predetermined direction, transforming the poses to be centered at the origin, etc. Step 603 may be performed by the output sub-module 23. As understood from the foregoing, MAS1 and MAS2 may be included in I1 to be processed concurrently with MAS2 and MAS1, respectively, in I2. Steps 601-603 are repeated, by step 604, a predefined number of times, to include a predefined number of pairs of augmented action sequences in the input data I1, I2.
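By way of illustration of steps 601-604, the following Python sketch builds the input data I1, I2 from a number of pairs of action sequences; the helper names database.sample_pair, submodule_21, submodule_22 and normalize are assumed placeholders for the database 10, the sub-modules 21, 22 and the normalization processing of step 603. Each pair of augmented versions is included in both orders, so that MAS1 in I1 is processed concurrently with MAS2 in I2 and vice versa:

```python
def build_batch(database, submodule_21, submodule_22, normalize, num_pairs):
    I1, I2 = [], []
    for _ in range(num_pairs):                 # step 604: repeat for a predefined number of pairs
        AS1, AS2 = database.sample_pair()      # step 601: duplicate AS or multiview AS
        MAS1 = normalize(submodule_21(AS1))    # sub-step 602A + normalization (step 603)
        MAS2 = normalize(submodule_22(AS2))    # sub-step 602B + normalization (step 603)
        # Step 603: include the pair in both orders so that each augmented
        # version is processed concurrently with its counterpart.
        I1 += [MAS1, MAS2]
        I2 += [MAS2, MAS1]
    return I1, I2
```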
The resulting input data I1, I2 is then processed in steps 605-608. Step 605 comprises operating NN1 on I1 to generate the first representation data Q (
Irrespective of implementation, when all action sequences in I1, I2 have been processed and [P1] and [P2] have been calculated, step 609 may return the method 600 to step 601 to generate a new batch of I1, I2. Each execution of steps 601 through 608 may be denoted an optimization step. Step 609 may be arranged to initiate a predefined number of optimization steps. Alternatively, step 609 may be arranged to initiate optimization steps until a convergence criterion is fulfilled.
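Tying the above together, the following sketch outlines one possible implementation of the repeated optimization steps, reusing the illustrative helpers sketched above (build_batch, submodule_21, submodule_22, update_first_network, update_second_network); the fixed number of optimization steps, the per-pair updates, the identity normalization and the use of all parameters of the first network as the parameter definition [P1] are assumptions made only for the example:

```python
import torch

def pretrain(database, net1, net2, optimizer, num_steps=1000, pairs_per_step=64):
    for _ in range(num_steps):                            # step 609: repeat optimization steps
        I1, I2 = build_batch(database, submodule_21, submodule_22,
                             normalize=lambda s: s,       # identity normalization assumed
                             num_pairs=pairs_per_step)
        for mas1, mas2 in zip(I1, I2):
            q = net1(torch.as_tensor(mas1))               # step 605: first representation data Q
            z_prime = net2(torch.as_tensor(mas2))         # step 606: second representation data Z'
            update_first_network(q, z_prime, optimizer)   # step 607
            update_second_network(net1, net2)             # step 608
    # Step 610: provide (a subset of) the parameters of the first network as [P1].
    return {name: p.detach().clone() for name, p in net1.named_parameters()}
```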
Reverting to
An augmentation function for resampling (“resampling function”) is operable to change the time distance between the skeletons in an action sequence, resulting in an increase or decrease in the speed of the action sequence. The resampling function may operate to increase/decrease the speed of one or more subsets of the action sequence. It is conceivable that different subsets of the action sequence are subjected to different resampling. The resampling function may be included to train the neural network to be robust to variations in speed between action sequences.
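A minimal sketch of a resampling function is given below, assuming an action sequence stored as a numpy array of shape (T, K, D) and using linear interpolation along the time axis; the randomized speed range is an assumed example of a control parameter:

```python
import numpy as np

def resample(seq: np.ndarray, speed_range=(0.8, 1.2)) -> np.ndarray:
    # seq has shape (T, K, D); a speed factor > 1 shortens the sequence
    # (faster action), a factor < 1 lengthens it (slower action).
    T = seq.shape[0]
    speed = np.random.uniform(*speed_range)
    new_T = max(2, int(round(T / speed)))
    old_t = np.arange(T)
    new_t = np.linspace(0, T - 1, new_T)
    # Linearly interpolate each keypoint coordinate along the time axis.
    out = np.empty((new_T,) + seq.shape[1:], dtype=seq.dtype)
    for k in range(seq.shape[1]):
        for d in range(seq.shape[2]):
            out[:, k, d] = np.interp(new_t, old_t, seq[:, k, d])
    return out
```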
An augmentation function for low-pass filtering (“LP function”) is operable to perform a temporal smoothing of the time sequence of skeletons in an action sequence (cf. 102 in
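A minimal sketch of an LP function, here assumed to be a moving-average filter of odd window length applied over the time axis to all keypoints; other low-pass filters and randomized keypoint subsets are equally possible:

```python
import numpy as np

def low_pass(seq: np.ndarray, window: int = 5) -> np.ndarray:
    # Temporal smoothing of a (T, K, D) skeleton sequence with a
    # moving-average kernel (window assumed odd); edges are edge-padded.
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(seq, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    out = np.empty_like(seq)
    for k in range(seq.shape[1]):
        for d in range(seq.shape[2]):
            out[:, k, d] = np.convolve(padded[:, k, d], kernel, mode="valid")
    return out
```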
An augmentation function for noise enhancement (“noise function”) is operable to introduce noise to the keypoint locations in the action sequence. The noise may be statistical noise of any suitable distribution, including but not limited to Gaussian noise. The noise function may be operated on a specific subset of keypoints or all keypoints of the skeleton. The noise function may be included to train the neural network to be robust to noise.
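A minimal sketch of a noise function, assuming zero-mean Gaussian noise added to a randomly selected subset of keypoints; the noise level sigma and the selection fraction are assumed control parameters:

```python
import numpy as np

def add_noise(seq: np.ndarray, sigma: float = 0.01,
              keypoint_fraction: float = 0.5) -> np.ndarray:
    # Select a random subset of keypoints and perturb their locations
    # with Gaussian noise of standard deviation sigma.
    T, K, D = seq.shape
    selected = np.random.rand(K) < keypoint_fraction
    out = seq.copy()
    out[:, selected, :] += np.random.normal(0.0, sigma, size=(T, selected.sum(), D))
    return out
```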
An augmentation function for spatial distortion (“distortion function”) is operable to distort one or more skeletons in an action sequence in a selected direction. The spatial distortion may comprise non-uniform scaling and/or shearing. As used herein, shearing refers to a linear transformation that slants the shape of an object in a given direction, for example by applying a shear transformation matrix to a respective skeleton in the action sequence. An example of a spatial distortion of a skeleton 111 is shown in
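A minimal sketch of a distortion function applying a random shear transformation matrix to every 3D skeleton of the sequence; the shear magnitude is an assumed control parameter, and non-uniform scaling could be added analogously:

```python
import numpy as np

def shear(seq: np.ndarray, max_factor: float = 0.3) -> np.ndarray:
    # Build a random 3x3 shear matrix (identity plus small off-diagonal
    # terms) and apply it to every keypoint in the (T, K, 3) sequence.
    S = np.eye(3)
    S[0, 1], S[0, 2] = np.random.uniform(-max_factor, max_factor, size=2)
    S[1, 0], S[1, 2] = np.random.uniform(-max_factor, max_factor, size=2)
    S[2, 0], S[2, 1] = np.random.uniform(-max_factor, max_factor, size=2)
    return seq @ S.T
```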
An augmentation function for keypoint removal (“removal function”) is operable to hide a subset of the keypoints of the skeletons in the action sequence. The respective keypoint is hidden by removal of its location in the definition of the action sequence. The removal function may operate on individual keypoints or a group of keypoints. One such group may be defined in relation to a geometric plane with a predefined arrangement through the object. In one example, the geometric plane is a vertical symmetry plane of the object, and all keypoints on the left or right side of the plane are removed. An example of left-sided keypoint removal in a skeleton 111 is shown in
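A minimal sketch of a removal function hiding all keypoints on one side of a vertical symmetry plane. Which keypoint indices lie on the left side depends on the skeleton definition, so the index list below is an assumed example, as is the use of NaN to encode a removed location:

```python
import numpy as np

# Assumed example: indices of left-side keypoints in the skeleton definition.
LEFT_SIDE = [1, 3, 5, 7, 9, 11, 13, 15]

def remove_keypoints(seq: np.ndarray, indices=LEFT_SIDE) -> np.ndarray:
    # Hide the selected keypoints by removing their locations
    # (here encoded as NaN) in every skeleton of the sequence.
    out = seq.copy()
    out[:, indices, :] = np.nan
    return out
```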
An augmentation function for mirroring (“mirror function”) is operable to flip the object through a predefined geometric plane. Each keypoint is thereby repositioned relative to the geometric plane, ending up an equal distance away, but on the opposite side. An example of a mirroring of a skeleton 111 through a vertical mirror plane is shown in
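A minimal sketch of a mirror function, assuming a normalized coordinate system in which the mirror plane passes through the origin perpendicular to one coordinate axis; if the skeleton definition distinguishes left-side and right-side keypoints, their indices would additionally be swapped (omitted here for brevity):

```python
import numpy as np

def mirror(seq: np.ndarray, axis: int = 0) -> np.ndarray:
    # Reposition every keypoint to the opposite side of the mirror plane,
    # at an equal distance, by negating the chosen coordinate axis.
    out = seq.copy()
    out[..., axis] = -out[..., axis]
    return out
```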
An augmentation function for temporal cropping (“cropping function”) is operable to extract a temporally coherent subset of the skeletons in an action sequence. An example is shown in
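A minimal sketch of a cropping function extracting a temporally coherent subset of the skeletons at a random starting position; the crop ratio is an assumed control parameter:

```python
import numpy as np

def temporal_crop(seq: np.ndarray, crop_ratio: float = 0.8) -> np.ndarray:
    # Extract a coherent block of consecutive skeletons at a random
    # starting position within the (T, K, D) sequence.
    T = seq.shape[0]
    crop_len = max(1, int(round(T * crop_ratio)))
    start = np.random.randint(0, T - crop_len + 1)
    return seq[start:start + crop_len]
```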
An augmentation function for start modification (“rearrangement function”) is operable to select a skeleton in the action sequence and rearrange the action sequence with the selected skeleton as starting point. The rearrangement function thereby results in a temporal shift of the action sequence. An example of a rearrangement function is shown in
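A minimal sketch of a rearrangement function, assuming a cyclic temporal shift so that a randomly selected skeleton becomes the starting point; truncating instead of wrapping around is an equally possible implementation:

```python
import numpy as np

def rearrange_start(seq: np.ndarray) -> np.ndarray:
    # Select a random skeleton and rearrange the sequence so that it
    # becomes the starting point (cyclic temporal shift).
    start = np.random.randint(0, seq.shape[0])
    return np.roll(seq, -start, axis=0)
```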
As noted above, an augmentation function may be controlled based on randomized control parameter value(s). The control parameter value(s) may thereby be changed for each action sequence to be processed. The resampling function may, for example, resample the respective action sequence to a random time interval. The LP function may, for example, operate the low-pass filter on a random subset of keypoints and/or randomly define the low-pass filter. The noise function may, for example, add noise to a random subset of keypoints. The distortion function may, for example, distort a random subset of keypoints and/or randomly define the distortion to be applied in terms of selected direction and/or type of distortion. The removal function may, for example, remove a random subset of keypoints and/or randomly select geometric plane.
The mirror function may, for example, randomly select mirror plane. The cropping function may, for example, randomly select the coherent subset to be extracted. The rearrangement function may, for example, randomly select the skeleton to be used as starting point.
In the above-mentioned aggressive-conservative configuration, the first sub-module 21 may comprise one or more augmentation functions designated as “conservative”. In some embodiments, at least one of the resampling function, the noise function, or the LP function is a conservative function. One or more of the other augmentation functions may be designated as “aggressive” and may be included in the second sub-module 22, which may or may not also include the conservative function(s). Aggressive functions are not included in the first sub-module 21. In some embodiments, an augmentation function may be switched from conservative to aggressive by the use of control parameters. For example, a removal function that removes one or a few keypoints may be designated as conservative, whereas a removal function that removes larger groups of keypoints may be designated as aggressive. In this way, all of the above examples of augmentation functions may be implemented as a conservative or aggressive function.
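As a non-limiting illustration of the aggressive-conservative configuration, the sketch below composes the first sub-module 21 from conservative augmentation functions only and the second sub-module 22 from the same conservative functions plus additional, aggressive ones; the assignment of functions to the two sub-modules reuses the illustrative functions sketched above and is an assumed example:

```python
def compose(*functions):
    # Apply the augmentation functions in order to an action sequence.
    def pipeline(seq):
        for f in functions:
            seq = f(seq)
        return seq
    return pipeline

# First sub-module (21): conservative augmentation only.
submodule_21 = compose(resample, low_pass, add_noise)

# Second sub-module (22): conservative functions plus aggressive ones.
submodule_22 = compose(resample, low_pass, add_noise,
                       shear, remove_keypoints, mirror,
                       temporal_crop, rearrange_start)
```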
Although augmentation functions have been exemplified with reference to skeleton sequences in the foregoing, the skilled person understands that corresponding augmentation functions may be defined to operate on image sequences (cf. 100 in
Reverting to the flow chart of
However, as also shown in
The structures and methods disclosed herein may be implemented by hardware or a combination of software and hardware. In some embodiments, such hardware comprises one or more software-controlled computer resources.
In the following, clauses are recited to summarize some aspects and embodiments of the invention as disclosed in the foregoing.
Clause 1. A system for training of a neural network, said system comprising: a first neural network (11), which is configured to operate on first input data (I1) to generate first representation data; a second neural network (12), which is configured to operate on second input data (I2) to generate second representation data; a first updating module (13), which is configured to update parameters of the first neural network (11) to minimize a difference between the first representation data and the second representation data; a second updating module (14), which is configured to update parameters of the second neural network (12) as a function of the parameters of the first neural network (11); and an augmentation module (20), which is configured to retrieve a plurality of corresponding first and second action sequences, each depicting a respective object performing a respective activity, generate the first and second input data (I1, I2) to include augmented versions of the first and second action sequences; wherein the system is configured to operate the first and second neural networks (11, 12) on one or more instances of the first and second input data (I1, I2) generated by the augmentation module (20) and provide at least a subset of the parameters of the first neural network (11) as a parameter definition ([P1]) of a pre-trained neural network, and wherein the augmentation module (20) comprises a first sub-module (21) which is configured to generate a first augmented version (MAS1) based on a respective first action sequence (AS1), and a second sub-module (22) which is configured to generate a second augmented version (MAS2) based on a respective second action sequence (AS2), wherein the second sub-module (22) differs from the first sub-module (21).
Clause 2. The system of clause 1, wherein the augmentation module (20) is configured to include the corresponding first and second augmented versions (MAS1, MAS2) in the first and second input data (I1, I2) such that the first and second networks (11, 12) operate concurrently on the corresponding first and second augmented versions (MAS1, MAS2).
Clause 3. The system of clause 1 or 2, wherein the first sub-module (21) comprises a first set of augmentation functions (F11, . . . , F1m) which are operable on the respective first action sequence (AS1) to generate the first augmented version (MAS1), wherein the second sub-module (22) comprises a second set of augmentation functions (F21, . . . , F2n) which are operable on the respective second action sequence (AS2) to generate the second augmented version (MAS2), wherein the first and second sets of augmentation functions differ by at least one augmentation function.
Clause 4. The system of any preceding clause, wherein the second sub-module (22) is operable to apply more augmentation than the first sub-module (21).
Clause 5. The system of any preceding clause, wherein each of the first and second action sequences comprises a time sequence of object representations (103), and wherein each of the object representations (103) comprises locations of predefined features (104) on the respective object.
Clause 6. The system of clause 5, wherein the second sub-module (22), to generate the second augmented version (MAS2), is operable to randomly select a coherent subset (102′) of the object representations (103) in the respective second action sequence (AS2).
Clause 7. The system of clause 5 or 6, wherein the second sub-module (22), to generate the second augmented version (MAS2) is operable to distort the object representations (103) in the respective second action sequence (AS2) in a selected direction.
Clause 8. The system of any one of clauses 5-7, wherein the second sub-module (22), to generate the second augmented version (MAS2), is operable to hide a subset of the respective object in the object representations (103) in the respective second action sequence (AS2).
Clause 9. The system of clause 8, wherein the subset corresponds to said predefined features (104) on one side of a geometric plane with a predefined arrangement through the respective object.
Clause 10. The system of any one of clauses 5-9, wherein the second sub-module (22), to generate the second augmented version (MAS2), is operable to perform a temporal smoothing of the object representations (103) in the respective second action sequence (AS2).
Clause 11. The system of any one of clauses 5-10, wherein the second sub-module (22), to generate the second augmented version (MAS2), is operable to randomly select an object representation (103A) in the respective second action sequence (AS2) and rearrange the respective second action sequence (AS2) with the selected object representation (103A) as starting point.
Clause 12. The system of any one of clauses 5-11, wherein the second sub-module (22), to generate the second augmented version (MAS2), is operable to flip the respective object in the object representations (103) in the respective second action sequence (AS2) through a mirror plane.
Clause 13. The system of any one of clauses 5-12, wherein the first sub-module (21), to generate the first augmented version (MAS1), is operable to change a time distance between the object representations (103) in the respective first action sequence (AS1).
Clause 14. The system of any preceding clause, wherein the augmentation module (20) is configured to retrieve the first and second action sequences so as to correspond to different viewing angles onto the respective object performing the respective activity.
Clause 15. The system of any preceding clause, which further comprises a training sub-system (1B), which comprises: a third neural network (15), which is configured to operate on third input data (I3) to generate third representation data, the third neural network (15) being initialized by use of the parameter definition ([P1]), and a third updating module (16), which is configured to update parameters of the third network (15) to minimize a difference between the third representation data and activity label data (L3) associated with the third input data (I3), wherein the training sub-system (1B) is configured to, by the third updating module (16), train the third network (15) to recognize one or more activities represented by the activity label data (L3).
Clause 16. The system of clause 15, wherein the training sub-system (1B) comprises a further augmentation module (25) which is configured to retrieve third action sequences of one or more objects performing one or more activities, generate the third input data (I3) to include third augmented versions of the third action sequences, wherein the further augmentation module (25) is configured in correspondence with the first sub-module (21).
Clause 17. The system of clause 15 or 16, which comprises a fourth neural network (17), which is configured to operate on fourth input data (I4) to generate fourth representation data, and a fourth updating module (18), which is configured to update parameters of the fourth network (17) to minimize a difference between the fourth representation data and fifth representation data, wherein the fifth representation data is generated by the third neural network (15), when trained, being operated on the fourth input data (I4).
Clause 18. The system of clause 17, wherein the fourth neural network (17) has a smaller number of channels than the third neural network (15).
Clause 19. A computer-implemented method for use in training of a neural network, said method comprising: retrieving (601) first and second action sequences of an object performing an activity; generating the first and second input data to include first and second augmented versions of the first and second action sequences; operating (605) a first neural network on the first input data to generate first representation data; operating (606) a second neural network on the second input data to generate second representation data; updating (607) parameters of the first neural network to minimize a difference between the first representation data and the second representation data; updating (608) parameters of the second neural network as a function of the parameters of the first neural network; and providing (610), after operating the first and second neural networks on one or more instances of the first and second input data, at least a subset of the parameters of the first neural network as a parameter definition of a pre-trained neural network, wherein said generating the first and second input data comprises operating (602A) a first sub-module on the first action sequence to generate the first augmented version, and operating (602B) a second sub-module, which differs from the first sub-module, on the second action sequence to generate the second augmented version.
Clause 20. A computer-readable medium comprising computer instructions (1002A) which, when executed by a processor (1001), cause the processor (1001) to perform the method of clause 19.