The present invention relates to a system and method for optimising a reinforcement learning model and in particular, for use with computer vision and image data. This may also be described as Localised Machine Learning Optimisation.
The success of deep learning in computer vision and other fields in recent years has relied heavily upon the availability of large quantities of labelled training data. However, there are two emerging fundamental challenges to deep learning: (1) How to scale up model training on large quantities of unlabelled data from a previously unseen application domain (target domain) given a previously trained model from a different domain (source domain); and (2) How to scale up model training when different target domain application data are no longer available to a centralised data labelling and model training process due to privacy concerns and data protection requirements. For deep learning on person re-identification (Re-ID) tasks in particular, most existing person Re-ID techniques are based on the assumption that a large amount of pre-labelled data is available and can be used for model training all at once in batch. However, this assumption is not applicable to most real-world deployment of a Re-ID system.
For example, different systems or organisations may be unwilling to share their data, whereas successful and improved model training relies on larger training sets. In some situations, supervised learning can improve matters, but this relies on human users to confirm results provided by the trained model. This is time consuming and can be unfeasible for larger data sets.
Therefore, there is required a method and system that provides an improved, more efficient and more effective way to carry out localised model training without overburdening human users or requiring larger labelled data sets.
The following describes machine learning methods and mechanisms that implement two complementary aspects of distributed AI deep learning at-the-edge (at each private user-site, e.g. a target application domain, without requiring the sharing of data, or on an AI device, e.g. an AI chip). These two aspects may be used independently or in combination.
Locally, for each user-site application (application target domain), deep reinforcement learning is implemented based on a human-in-the-loop data mining model to remove the need for a strong model trained on globally collected labelled training data of a large size. Instead, a weak model, pre-trained on independent, small-sized labelled data (non-target domain), is activated at each user-site for deployment (user-usage) and simultaneously performs local (per user-site) online model optimisation by cumulatively collecting informative samples while using the pre-trained weak model, without exhaustively labelling all the data at every user-site to collect a large global training data pool. This model reduces human annotation by machine-guided selective data sampling for locally (distributed at-the-edge) optimised models at each different application target domain according to its unique environmental context. This avoids the need for globally sharing training data across different application target domains to learn a strong model, so as to comply with data protection and privacy preservation at each individual application domain.
In an example implementation, a framework is iteratively updated by refining a Reinforcement Learning (RL) policy and Convolutional Neural Network (CNN) parameters alternately. In particular, a Deep Reinforcement Active Learning (DRAL) method is formulated to guide an agent (a model in a reinforcement learning process) in selecting training samples to be reviewed by a human user, who can provide “weak” feedback by confirming model-generated predictions according to a ranked likelihood. The reinforcement learning reward is the uncertainty value of each human confirmation for each selected sample. A binary feedback (positive or negative) is given by the human annotator and used to select the samples, which are then used to optimise iteratively (multiple times) a pre-trained CNN Re-ID model locally at each user-site by cumulative model fine-tuning against collections of newly sampled data (unlabelled) using reinforcement deep learning. This distributed AI reinforcement model may be described as optimisation at-the-edge.
Globally, a mechanism enables distributed AI reinforcement model optimisation at-the-edge to also share global knowledge from multiple application target domains by knowledge ensemble and distillation through multi-model representation alignment and cumulation, without sharing global training data. In particular, a knowledge distillation mechanism cumulates knowledge from distributed model learning at multiple domains. This results in a strong teacher model for knowledge ensemble and distillation by constructing a multi-branch deep network model, where each model branch captures a pre-learned model representation from a different user-domain with different training data, while simultaneously learning the strong teacher model and providing enhanced model representation to each target domain. This may be described as global AI knowledge ensemble and distillation through model representation without sharing different target domain (user-site) training data.
Overall, this approach to distributed AI deep model learning at-the-edge is designed to facilitate distributed model optimisation given partial (local) relatively small data that only requires limited computing resources (e.g. without hyperscale data centres), of which an extreme case is deep learning on embedded AI chips built into a new generation of body-worn smart cameras and mobile devices, e.g. ARM ML Processor and OD Processor, Nvidia Jetson TX2 GPU, and Google Edge TPU. This distributed AI deep model learning mechanism facilitates privacy-preserving AI for user-centred services whilst simultaneously cumulating global knowledge from distributed AI model learning without global data sharing. This has become essential for empowering the rapid emergence of new AI chip technologies for large scale distributed user-centred applications, with user-centred data ownership and privacy protection being essential to such distributed AI systems.
In accordance with a first aspect there is provided a method for optimising a reinforcement learning model comprising the steps of:
receiving a labelled data set;
receiving an unlabelled data set;
generating model parameters to form an initial reinforcement learning model using the labelled data set as a training data set;
finding a plurality of matches for one or more target within the unlabelled data set using the initial reinforcement learning model;
ranking the plurality of matches;
presenting a subset of the ranked matches and corresponding one or more target, wherein the subset of ranked matches includes the highest ranked matches;
receiving a signal indicating that one or more presented match of the highest ranked matches is an incorrect match;
adding information describing the indicated incorrect one or more match and corresponding target to the labelled data set to form a new training data set; and
updating the model parameters of the initial reinforcement learning model to form an updated reinforcement learning model using the new training data set. Therefore, the reinforcement learning model can be improved more efficiently while improving the effectiveness of human review. This localised model training improves the overall performance of the method and system. The method may be implemented as a system or distributed system, for example.
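By way of a non-limiting illustration, the following Python sketch shows one possible realisation of these steps, including the optional review of the lowest ranked matches described further below. The helper functions train_model, score_match and ask_human are hypothetical placeholders standing in for the model training, match scoring and human-review operations; they are not part of the claimed method itself.

import random

# Hypothetical placeholder helpers: in a real system these would wrap the CNN
# training, feature matching and human-review user interface described herein.
def train_model(labelled_set):
    return {"n_labelled": len(labelled_set)}  # stand-in for trained model parameters

def score_match(model, target, candidate):
    return random.random()  # stand-in for a model similarity score

def ask_human(target, candidate):
    return random.random() > 0.5  # stand-in for the binary human feedback signal

def optimise_locally(labelled_set, unlabelled_set, targets, top_k=5, bottom_k=5):
    model = train_model(labelled_set)  # initial model from the labelled data set
    for target in targets:
        ranked = sorted(unlabelled_set,
                        key=lambda c: score_match(model, target, c), reverse=True)
        # Present the highest ranked matches; keep those the human flags as incorrect.
        for candidate in ranked[:top_k]:
            if not ask_human(target, candidate):
                labelled_set.append((target, candidate, "negative"))
        # Optionally also present the lowest ranked matches; keep confirmed matches.
        for candidate in ranked[-bottom_k:]:
            if ask_human(target, candidate):
                labelled_set.append((target, candidate, "positive"))
        model = train_model(labelled_set)  # update the model with the new training set
    return model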
Advantageously, the subset of ranked matches further includes the lowest ranked matches, and before updating the model parameters of the initial reinforcement model, the method further comprising the steps of:
receiving a signal indicating that one or more presented match of the lowest ranked matches is a correct match; and
adding information describing the indicated correct one or more match and corresponding target to the new training data set. Whilst limiting the matches to the best matches provides an improvement (especially when incorrect matches amongst this group are detected and incorporated into the training set), alternatively or additionally, matches from the lower or lowest rankings may be passed for review by the human user. Receiving confirmation that such lower-ranked matches are not actual matches can go some way to improving the model, but receiving information confirming a match where it is not expected, amongst the lowest ranked matches, provides a significant boost to the training of the model when such information is included in the training data set. Doing both is especially useful and effective.
Optionally, the unlabelled data set is larger than the labelled data set.
Optionally, the method may further comprise the steps of:
finding a plurality of new matches for one or more new target within the unlabelled data set using the updated reinforcement learning model;
ranking the plurality of new matches;
presenting a subset of the ranked new matches and corresponding one or more target, wherein the subset of ranked matches includes the highest ranked matches;
receiving a signal indicating that one or more presented match of the highest ranked new matches is an incorrect match;
adding information describing the indicated one or more incorrect new match and corresponding new target to the labelled data set to form a further new training data set; and
updating the model parameters of the updated reinforcement learning model to form a further updated reinforcement learning model using the further new training data set. This defines a first iteration.
Optionally, the subset of ranked new matches may further include the lowest ranked new matches, and before updating the model parameters of the updated reinforcement model, the method may further comprise the steps of:
receiving a signal indicating that one or more presented new match of the lowest ranked new matches is a correct match; and
adding information describing the indicated correct one or more new match and corresponding target to the further new training data set. This may be done as part of the first iteration.
Optionally, the method may further comprise iterating the finding, ranking, presenting, receiving and updating steps for one or more further targets to further update the reinforcement learning model at each iteration. Such iterations may continue until a criterion is reached (e.g. time, number of iterations, etc.).
Optionally, the one or more new target is a different target to an earlier one or more target. The matches presented to the human user may be for a single target or for several different targets. The target or targets may change between iterations or may stay the same.
Optionally, the step of updating the model parameters of the reinforcement learning model may further comprise:
finding a maximised reward applied to an action sequence used to update the model parameters of the initial reinforcement learning model.
Preferably, the reward, R, may be defined by:
where Xpt, Xnt are positive and negative sample batches obtained until time t, dg
Preferably, the method may further comprise the step of maximising Q* according to:
for all future rewards (Rt+1, Rt+2, . . . ) discounted by a factor γ to find an optimal policy π* used to update the model parameters of the reinforcement learning model. Other techniques may be used.
Optionally, the method may further comprise the step of forming a new reinforcement learning model by combining model parameters of the updated reinforcement learning model with a different updated reinforcement learning model that was generated using a different unlabelled data set. Therefore, models that are trained from different (private) data sets may be fused without having to merge the data.
Optionally, the labelled data set and the unlabelled data set are image data sets, natural language data sets, or geo-location data sets. Other data sets and types may be used.
Optionally, presenting the subset of the matches and corresponding one or more target and receiving the signal may further comprise presenting to a user an image of the target and an image matched with the target, and receiving a true response from the user when the user determines a match and a false response when the user determines that the images do not match.
Preferably, the initial and new reinforcement learning models may be generated using a convolutional neural network architecture.
Advantageously, ranking the plurality of matches may be based on:
a softmax Cross Entropy loss function:
where nb is a batch size and pi(y) is a predicted probability on a ground-truth class y of an input target and a triplet loss is defined by:
where m is a margin parameter for positive and negative pairs for triplet samples xa being an anchor point, xp being a hardest positive sample, and xn being a negative sample of a different class to xa, where the loss is calculated from:
Ltotal=Lcross+Ltri.
Optionally, the method according to any of the above may further comprise the step of selecting matches to present as the subset of matches.
Preferably, the subset of matches may be selected by building a sparse similarity graph based on a similarity value Sim(i,j) between two samples i, j calculated from
where q is the target and g={g1, g2, . . . , gng} is the corresponding gallery candidate set.
Optionally, the method may further comprise the step of executing a κ-reciprocal operation to build the sparse similarity matrix having nodes niϵ(q, g), where the κ-nearest neighbours are defined as N(ni,κ), and the κ-reciprocal neighbours R(ni,κ) of ni are obtained by:
R(ni,κ)={xj|(niϵN(xj,κ))∧(xjϵN(ni,κ))}.
Optionally, the method may further comprise the step of merging the parameters of the updated reinforcement learning model with parameters of a different updated reinforcement learning model trained using a different unlabelled training data set, to form a further cumulation of distributed reinforcement learning models.
In accordance with a second aspect, there is provided a method for optimising a reinforcement learning model comprising the steps of:
receiving from a first node, first model parameters of a first reinforcement learning model, the first reinforcement learning model trained using a first labelled data set and a first unlabelled data set as training data sets;
receiving from a second node, second model parameters of a second reinforcement learning model, the second reinforcement learning model trained using a second labelled data set and a second unlabelled data set as training data sets; and
merging the first and second model parameters to define a further reinforcement learning model. This allows models to be fused or merged without requiring access to different data sets at the same time. This aspect can be used with any of the above aspects or used with models trained in different ways.
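By way of a non-limiting illustration, the sketch below merges two parameter sets received from two nodes by simple element-wise averaging. This averaging function is an assumption made purely for illustration; the detailed description later accumulates knowledge through a gated ensemble and distillation instead, and other merging functions may be used.

from typing import Dict
import numpy as np

def merge_parameters(first: Dict[str, np.ndarray],
                     second: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """Element-wise average of identically named parameter tensors from two nodes."""
    assert first.keys() == second.keys(), "both models must share one architecture"
    return {name: (first[name] + second[name]) / 2.0 for name in first}

# The merged parameters define a further model that may be sent back to both nodes.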
Optionally, the first labelled data set is the same as the second labelled data set.
Optionally, the method may further comprise the steps of:
receiving from one or more further nodes, one or more further model parameters of one or more further reinforcement learning models, the one or more further reinforcement learning models trained using one or more further labelled data sets and one or more further unlabelled data sets as training data sets; and
merging the first, second and one or more further model parameters to define a further cumulation of distributed reinforcement learning models. Accumulating reinforcement learning models in this way provides an improved and more efficient result.
Optionally, the method may further comprise the step of sending the merged first and second model parameters to the first and second nodes. Two or more nodes may be used or benefit in this way.
Optionally, the method may further comprise the step of the first and second nodes using the further reinforcement model defined by the merged first and second model parameters to identify target matches within unlabelled data sets.
Preferably, the first and second model parameters may be merged by computing a soft probability distribution at a temperature T according to:
where i denotes a branch index, i=0, . . . , m, and θi and θe are the parameters of a branch and the teacher model, respectively. Other merging functions may be used.
Preferably, the method may further comprise the step of aligning model representations between branches using a Kullback Leibler divergence defined by:
In accordance with a third aspect, there is provided a data processing apparatus, computer or computer system comprising one or more processors adapted to perform the steps of any of the above methods.
In accordance with a fourth aspect, there is provided a computer program comprising instructions, which when executed by a computer, cause the computer to carry out any of the above methods.
In accordance with a fifth aspect, there is provided a computer-readable medium comprising instructions, which when executed by a computer, cause the computer to carry out any of the above methods.
The methods described above may be implemented as a computer program comprising program instructions to operate a computer. The computer program may be stored on a computer-readable medium.
The computer system may include a processor or processors (e.g. local, virtual or cloud-based) such as a Central Processing unit (CPU), and/or a single or a collection of Graphics Processing Units (GPUs). The processor may execute logic in the form of a software program. The computer system may include a memory including volatile and non-volatile storage medium. A computer-readable medium may be included to store the logic or program instructions. The different parts of the system may be connected using a network (e.g. wireless networks and wired networks). The computer system may include one or more interfaces. The computer system may contain a suitable operating system such as UNIX, Windows® or Linux, for example.
It should be noted that any feature described above may be used with any particular aspect or embodiment of the invention.
The present invention may be put into practice in a number of ways and embodiments will now be described by way of example only and with reference to the accompanying drawings, in which:
It should be noted that the figures are illustrated for simplicity and are not necessarily drawn to scale. Like features are provided with the same reference numerals.
Large-scale visual object recognition (in particular people and vehicles) in urban spaces has become a major focus for Artificial Intelligence (AI) research and technology development with rapid growth in commercial applications. There is a fundamental technological challenge and market opportunity driven by economic needs to develop scalable machine learning algorithms and software for large-scale visual recognition in urban spaces by exploring the huge quantity of video data using deep learning, critical for smart city, public safety, intelligent transport, urban planning and design, e.g. Alibaba's City Brain; smart shopping, e.g. Amazon Go; and the fast-emerging self-driving cars. People and vehicle visual identification and search on urban streets at city-wide scales is a difficult task but potentially can revolutionise future smart city design and management, a technology that was not considered scalable until the recent emergence and rapid adoption of deep learning, enabled by two advances in recent years: (1) the availability of very large-sized and labelled imagery data for model training, and (2) the rise of cheap, widely accessible and powerful Graphics Processing Units (GPUs) for AI model learning, originally designed for the computer games industry, most notably the Nvidia GPUs. Over the last decade, there has been a huge amount of video data captured from 24/7 urban camera infrastructures (camera networks on the roads, transport hubs, shopping malls), social media (e.g. YouTube, Flickr), and increasingly more from mobile platforms (mobile phones, cameras on vehicle dashboards and body-worn cameras). However, the vast majority of visual data are unstructured and unlabelled.
The following examples describe image and video data sets in which individual people within such images are targets. The aim is to identify the same people in different locations obtained by separate video and image feeds. However, the described system and method may also be applied to different data sets, especially where targets are identified from separate sources.
The incredible success of deep learning in computer vision, text analysis, speech recognition, and natural language processing in recent years relies heavily upon the availability of large quantities of labelled training data. Deep neural network learning assumes fundamentally that (1) a large volume of data can be collected from multi-source domains (diversity), stored on a centralised database for model training (quantity), (2) human resources are available for exhaustive manual labelling of this large pool of shared training data (human knowledge distillation).
However, there are two emerging fundamental challenges to deep learning: (1) How to scale up model training on large quantities of unlabelled data from a previously unseen application domain (target domain) given a previously trained model from a different domain (source domain); (2) How to scale up model training when different target domain user application data are no longer available to a centralised data labelling and model training process due to privacy concerns and data protection requirements, e.g. the EU-wide adoption of the General Data Protection Regulation (GDPR) in 2018. Despite the current significant focus on centralised data centres to facilitate big data machine learning drawing from shared data collection interfaces (multiple users), e.g. cloud-based robotics, the world is moving increasingly towards localised and private (not-shared) distributed data analysis at-the-edge, which differs inherently from the current assumption of ever-increasing availability of centralised big data and shared data analysis. The existing centralised and shared big data learning paradigm faces significant challenges when privacy concerns become critical, e.g. large-scale public domain people recognition for public safety and smart city, healthcare patient data analysis for personalised healthcare. This requires fundamentally a new kind of deep learning paradigm, what may be called user-ensuite (privacy-preserving) human-in-the-loop distributed data mining for deep learning at-the-edge. This new type of deep learning at-the-edge protects user data privacy whilst increasing model capacity cumulatively so to benefit all users without sharing data, by assembling user knowledge distributed through localised deep learning from user-ensuite data mining. This emerging need for distributed deep learning by knowledge ensemble at each user site without global data sharing poses new and fundamental challenges to current algorithm and software designs. Deep learning at-the-edge requires a model design that can facilitate effective model adaptation to partial (local) relatively small data sets (compared with deep learning principles) on limited computing resources (without hyperscale data centres). In an extreme case, this may be deep learning using embedded AI chips built into a new generation of body-worn smart cameras and mobile devices, e.g. ARM ML Processor and OD Processor, Nvidia Jetson TX2 GPU, and Google Edge TPU. Currently, there is very little if any research and development for methods and processes to enable such an AI deep learning at-the-edge paradigm.
Mechanisms for distributed AI deep learning at-the-edge are provided by exploring human-in-the-loop reinforcement data mining at a user site, with a particular focus on optimising person re-identification tasks, although the underlying methodology and processes are readily applicable to wider deep learning at-the-edge applications and system deployments, especially for other data sources.
In one example, person re-identification (Re-ID) matches people across non-overlapping camera views distributed at distinct locations. Most existing supervised person Re-ID approaches employ a train-once-and-deploy scheme. That is, pairwise training data are collected and annotated manually for every pair of cameras before learning a model. Based on this assumption, supervised deep learning based Re-ID methods have made significant progress in recent years [27, 80, 53, 75, 41].
However, in practice this assumption is difficult to satisfy for several reasons: Firstly, pairwise pedestrian data is difficult to collect since it is unlikely that a large number of pedestrians reappear in other camera views. Secondly, the increasing number of camera views amplifies the difficulties in searching for the same person among multiple camera views. Thirdly, and perhaps most critically, increasingly less user data will be made available for a global training data collection, limiting the availability of a centralised manual labelling process which is essential for enabling deep learning, due to privacy and data protection concerns. To address these difficulties, one solution is to design unsupervised learning algorithms where centralised manual labelling of training data is not required. Some work has been focussed on transfer learning or domain adaptation techniques for unsupervised Re-ID [16, 64, 44]. However, unsupervised learning based Re-ID models are inherently weaker compared to supervised learning based models, compromising Re-ID effectiveness in any practical deployment.
Another possible solution is to follow a semi-supervised learning scheme that reduces the data annotation requirement. Successful research has been done on either dictionary learning [43] or self-paced learning [18] based methods. However, these models are still based on the strong assumption that part of the identities (e.g. one third of the training set) are fully labelled for every camera view. This remains impractical for a Re-ID task with hundreds of cameras in 24/7 operation, which is typical in urban applications.
Both unsupervised and semi-supervised model training still assume the accessibility of a large quantity of raw (unlabelled) data from diverse user sites. This has become increasingly less plausible due to privacy concerns. To achieve effective Re-ID given a limited budget for annotation (data labelling) and limited data access in the first place, the present method focusses on human-in-the-loop person Re-ID with selective labelling by human feedback online [63]. This approach differs from the common once-and-done model learning approach. Instead, a step-by-step sequential active learning process is adopted by exploring human selective annotations on a much smaller pool of samples for model learning. These cumulatively human-labelled data (binary verification) are used to update model training for improved Re-ID performance. Such an approach to model learning is naturally suited to reinforcement learning combined with active learning.
Active learning is a technique for online human data annotation that aims to actively sample the more informative training data for optimising model learning without exhaustive data labelling. Therefore, the benefit from human involvement is increased without requiring significantly more manual review time. This involves selecting, from an unlabelled set, matches that are generated using an initially trained model. These potential matches are then annotated by a human oracle (user), and the label information provided by the user is then employed for further model training. Preferably, these operations repeat many times until a termination criterion is satisfied, e.g. the annotation budget is exhausted. An important part of this process is the sample selection strategy. Some samples and annotations have a greater (positive) effect on model training than others. Ideally, more informative samples are reviewed, requiring less human annotation cost, which improves the overall performance of the system. Rather than a hand-designed strategy, the present system provides a reinforcement learning-based criterion.
At step 60 a subset of these matches are presented to the human user. The matches comprise a target image and one or more possible matches. Not all of the matches are required and the subset includes the higher or highest ranked results. These results are those with the greatest confidence that the matches are correct. However, they may still contain incorrect matches. In some implementations, lower or the lowest ranked matches are also presented. These are typically the matches with the lowest reliability or confidence. Therefore, the system considers these to be incorrect matches. Thresholds may also be used to determine which matches to include in the subset.
At step 70 the human user reviews the presented matches (to particular targets) and either confirms the match or indicates an incorrect match. This can be a binary signal obtained by a suitable user interface (e.g. mouse click, keystroke, etc.). These results relate to the originally unlabelled data, which have now been annotated by the human user. These (reviewed) unlabelled data, together with the indications of matches to particular targets, are added to the labelled data to provide a new training data set at step 80. This updated training data set is used to update the model parameters of the reinforcement learning model at step 90. Whilst this method 10 provides an enhanced model, iterating the steps one or more times provides additional enhancements. The loop may end when a particular criterion is met.
In particular embodiments, it is the indications of incorrect matches amongst the higher or highest ranked matches and/or the indications of correct matches amongst the lower or lowest ranked matches that are the most informative. Therefore, in some implementations, only these data are added to form the new training data set. In any case, restricting the matches to the highest and/or lowest ranked matches improves model training, as there will be proportionally more of these types of results, whilst reducing the amount of work or time required by a human user 110.
An AI knowledge ensemble and distillation method is also provided. This is not only more efficient (lower training cost) but also more effective (higher model generalisation improvement). In knowledge ensemble, this method constructs a multi-branch strong model consisting of multiple weak target models of the same model architecture (therefore a shared model representation) with different model representation instances (e.g. different deep neural network instances of the same architecture initialised by different pre-training on different data from different target domains). This creates a knowledge ensemble “teacher model” from all of the branches, and enhances/improves simultaneously each branch together with the teacher model. Therefore, separate data sets can be used to enhance a model used by different systems without having to share data.
Each branch is trained with two objective loss terms: a conventional softmax cross-entropy loss which matches the ground-truth label distributions, and a distillation loss which aligns the model representation of each branch to the teacher's prediction distributions, and vice versa. An overview of our knowledge ensemble teacher model architecture 200 is illustrated in the accompanying drawings.
A person Re-ID task may be used to search for the same people among multiple camera views, for example. Recently, most person Re-ID approaches [72, 65, 12, 14, 49, 56, 11, 76, 25, 9, 73, 74, 13, 57, 54] try to solve this problem under the supervised learning framework, where the training data is fully annotated. Despite the high performance of these methods, their large annotation cost presents difficulties. To address the high labelling cost problem, some earlier techniques propose to learn the model with only a few labelled samples or without any label information. Representative algorithms [48, 70, 4, 79, 39, 64, 45, 66] include domain transfer schemes, group association approaches, and some label estimation methods.
Besides the above-mentioned approaches, some earlier techniques aim to reduce the annotation cost in a human-in-the-loop (HITL) model learning process. When there are only a few annotated image samples, HITL model learning can be expected to improve the model performance by directly involving human interaction in the loop of model training, tuning or testing. When a human population is used to correct inaccuracies that occur in machine learning predictions, the model may be efficiently corrected and improved, thereby leading to better results. This is similar to the situation of a person Re-ID task, where pre-labelled information is hard to obtain and the gallery candidate set is far larger than the set of query anchors. Wang et al. [63] formulate a Human Verification Incremental Learning (HVIL) model which aims to optimize the distance metric with flexible human feedback continuously in real-time. The flexible human feedback (true, false, false but similar) employed by this model involves more information and boosts the performance in a progressive manner. However, this technique still has increased time and resource costs.
Active Learning may be compared against Reinforcement Learning. Active Learning (AL) has been popular in the field of Natural Language Processing (NLP), data annotation and image classification tasks [59, 10, 6, 47]. Its procedure can be thought of as a human-in-the-loop setting, which allows an algorithm to interactively query the human annotator with instances recognized as the most informative samples among the entire unlabelled data pool. This is usually done by using heuristic selection methods, but these have been met with limited effectiveness. Therefore, an aim is to address the shortcomings of the heuristic selection approaches by framing the active learning as a reinforcement learning (RL) problem to explicitly optimize a selection policy. In [20], rather than adopting a fixed heuristic selection strategy, Fang et al. attempt to learn a deep Q-network as an adaptive policy to select the data instances for labelling. Woodward et al. [67] try to solve the one-shot classification task by formulating an active learning approach which incorporates meta-learning with deep reinforcement learning. An agent 120 learned via this approach may be enabled to decide how and when to request a label.
Knowledge transfer may be attempted between varying-capacity network models [8, 28, 3, 51]. Hinton et al. [28] distilled knowledge from a large pre-trained teacher model to improve a small target net. The rationale behind this is to take advantage of the extra supervision provided by the teacher model during training of the target model, beyond a conventional supervised learning objective such as the cross-entropy loss subject to the training data labels. Extra supervision may be extracted from a pre-trained powerful teacher model in the form of class posterior probabilities [28], feature representations [3, 51], or inter-layer flow (the inner product of feature maps) [69]. Knowledge distillation may be exploited to distil easy-to-train large networks into harder-to-train small networks [28], to transfer knowledge within the same network [37, 21], and to transfer high-level semantics across layers [36]. Earlier distillation methods often take an offline learning strategy, requiring at least two phases of training. The more recently proposed deep mutual learning [75] overcomes this limitation by conducting an online distillation in one-phase training between two peer student models. Anil et al. [2] further extended this idea to accelerate the training of large scale distributed neural networks.
However, the existing online distillation methods lack a strong “teacher” model, which limits the efficacy of knowledge discovery. In the offline counterpart, multiple networks need to be trained, which is computationally expensive. The present system and methods overcome these limitations by providing an online distillation training algorithm characterised by simultaneously learning a teacher online and the target net, as well as performing batch-wise knowledge transfer in a one-phase training procedure.
Multi-branch Architectures may be based on neural networks and these can be exploited in computer vision tasks [60, 61, 26]. For example, ResNet [26] can be thought of as a category of two-branch network where one branch is an identity mapping. Recently, "grouped convolution" [68, 31] has been used as a replacement for standard convolution in constructing multi-branch net architectures. These building blocks may be utilised as templates to build deeper networks to gain stronger model capacities. Despite sharing the multi-branch principle, the present method is fundamentally different from such existing methods since the objective is to improve the training quality of any target network, not to introduce a new multi-branch building block. In other words, the present method may be described as a meta network learning algorithm, independent of the network architecture design.
Distributed Cumulative Model Optimisation On-Site
The following describes a base CNN Network. Initially, a generic deep Convolutional Neural Network (CNN) architecture may be provided as the base network with ImageNet pre-training, e.g. either Resnet-50 [26] or ResNet-110 [26]. It may be straightforward to apply any other network architectures as alternatives. To effectively learn the ID discriminative feature embedding, the present system and method may use both cross entropy loss for classification and triplet loss for similarity learning synchronously.
The softmax Cross Entropy loss function may be defined as:
where nb denotes the batch size and pi(y) is the predicted probability on the ground-truth class y of an input image.
Given triplet samples xa, xp, xn: xa is an anchor point, xp is the hardest positive sample in the same class as xa, and xn is the hardest negative sample of a different class from xa. The triplet loss is then defined as follows:
where m is a margin parameter for the positive and negative pairs.
Finally, the total loss can be calculated by:
Ltotal=Lcross+Ltri (3)
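As an illustration, the combined objective of equation (3) may be computed as in the minimal PyTorch sketch below, assuming that the hardest positive and negative samples for each anchor have already been mined from the batch.

import torch.nn.functional as F

def total_loss(logits, labels, anchor, positive, negative, margin=0.2):
    """Ltotal = Lcross + Ltri (Eq. 3)."""
    # Softmax cross-entropy on the ground-truth identity class (Eq. 1).
    l_cross = F.cross_entropy(logits, labels)
    # Triplet loss with margin m on the mined hard triplets (Eq. 2).
    l_tri = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return l_cross + l_tri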
A Deep Reinforced Active Learner—An Agent
The framework of the present DRAL is presented in the accompanying drawings. At each time step t, the agent 120 takes an action At, observes a state St and receives a reward Rt. The Action, State and Reward of the Deep Reinforcement Active Learning (DRAL) framework are defined as follows.
Action. The Action set defines the selection of an instance from the unlabelled gallery pool; hence its size is the same as that of the pool. At each time step t, when presented with the current state St, the agent 120 decides the action to be taken based on its policy π(At|St). The instance At of the unlabelled gallery pool will therefore be selected for querying by the human oracle 110. Once the action At=gk is performed, the agent 120 may be prevented from choosing it again in subsequent steps. The termination criterion of this process depends on a pre-defined Kmax, which restricts the maximum annotation amount for each query anchor.
State. Graph similarity may be employed for data selection in an active learning framework [22, 46] by mining the structural relationships among data points. Typically, a sparse graph may be adopted which connects each data point only to a few of its most similar neighbours in order to exploit their contextual information. In an example implementation, a sparse similarity graph is constructed among the query and gallery samples and this is taken as the state value. With a queried anchor q and its corresponding gallery candidate set g={g1, g2, . . . , gng}, the similarity value Sim(i,j) between two samples i, j is calculated as:
where dij is the Mahalanobis distance between i and j. A κ-reciprocal operation is executed to build the sparse similarity matrix. For any node niϵ(q, g) of the similarity matrix Sim, its top κ-nearest neighbours are defined as N(ni, κ). The κ-reciprocal neighbours R(ni, κ) of ni are then obtained through:
R(ni,κ)={xj|(niϵN(xj,κ))∧(xjϵN(ni,κ))} (5)
Compared with the plain nearest neighbours, the κ-reciprocal nearest neighbours are more strongly related to the node ni; their similarity values are retained, while all other entries are assigned as zero. This sparse similarity matrix is then taken as the initial state and imported into the policy network for action selection. Once an action is taken, the state value may be adjusted accordingly to better reveal the sample relations.
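A minimal sketch of this sparse graph construction is given below, assuming a pre-computed pairwise similarity matrix over the query and gallery samples; the function names and structure are illustrative only.

import numpy as np

def k_nearest(sim, i, k):
    """Indices of the k most similar nodes to node i (excluding i itself)."""
    order = np.argsort(-sim[i])
    return [j for j in order if j != i][:k]

def k_reciprocal(sim, i, k):
    """k-reciprocal neighbours of node i (Eq. 5): keep j only if i and j are
    mutually within each other's k-nearest neighbours."""
    return [j for j in k_nearest(sim, i, k) if i in k_nearest(sim, j, k)]

def sparse_state(sim, k):
    """Zero out every entry except the k-reciprocal neighbours of each node,
    giving the sparse similarity matrix used as the initial state."""
    sparse = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        for j in k_reciprocal(sim, i, k):
            sparse[i, j] = sim[i, j]
    return sparse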
To better understand the update of the state value, an example is provided in the accompanying drawings.
For a state St at time t, the optimal action At=gk may be selected via the policy network, which indicates that the gallery candidate gk will be selected for querying by the human annotator 110. A binary feedback is then provided as ykt={1, −1}, which indicates whether gk is a positive or negative pair of the query instance. Therefore the similarity Sim(q, gk) between q and gk will be set as:
The similarities between the remaining gallery samples gi, i≠k, and the query sample may also be re-computed, which aims to pull the positives closer and push the negatives further away. Therefore, with positive feedback, the similarity Sim(q, gi) becomes the average of the scores between gi and each of (q, gk), where:
Otherwise, the similarity Sim(q, gi) will only be updated when the similarity between gk and gi is larger than a threshold thred, where:
Sim(q,gi)=max(Sim(q,gi)−Sim(gk,gi),0) (8)
The κ-reciprocal operation will also be adopted afterwards, and a renewed state St+1 is then obtained.
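The sketch below illustrates one possible instantiation of this state update. The values written to the queried entry itself are assumptions made for illustration, since equation (6) is not reproduced above; the updates of the remaining candidates follow equations (7) and (8).

import numpy as np

def update_state(sim_q, sim_g, k, feedback, thred=0.4):
    """Update the query-to-gallery similarities after human feedback on gallery
    sample k. sim_q[i] holds Sim(q, g_i), sim_g[i, j] holds Sim(g_i, g_j) and
    feedback is +1 (confirmed match) or -1 (confirmed non-match)."""
    sim_q = sim_q.copy()
    sim_q[k] = 1.0 if feedback == 1 else 0.0  # assumed instantiation of Eq. (6)
    for i in range(len(sim_q)):
        if i == k:
            continue
        if feedback == 1:
            # Eq. (7): pull candidates similar to the confirmed positive closer.
            sim_q[i] = (sim_q[i] + sim_g[k, i]) / 2.0
        elif sim_g[k, i] > thred:
            # Eq. (8): push candidates similar to the confirmed negative away.
            sim_q[i] = max(sim_q[i] - sim_g[k, i], 0.0)
    return sim_q

# The k-reciprocal operation is then re-applied to obtain the renewed state S_{t+1}.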
Reward. The reward function defines the agent's task objective which, in the specific case of active sample selection for person Re-ID, is to pick out more true positive matches and hard-to-differentiate negative samples for each query within a fixed annotation budget.
Standard active learning methods adopt an uncertainty measurement, hypotheses disagreement or information density as the selection function for classification [7, 24, 81, 71]. A data uncertainty may be adopted as the objective function of the reinforcement learning policy.
For data uncertainty measurement, higher uncertainty indicates that the sample is harder to distinguish. Following the same principle [62] which extends a triplet loss formulation to model heteroscedastic uncertainty in a retrieval task, a similar hard triplet loss [27] may be performed to measure the uncertainty of data. Let Xpt, Xnt indicate the positive and negative sample batch obtained until time t, dg
where [•]+ is the soft margin (hinge) function with a margin m. Therefore, all of the future rewards (Rt+1, Rt+2, . . . ) discounted by a factor γ at time t can be calculated as:
Once Q* is learned, the optimal policy π* can be directly inferred by selecting the action with the maximum Q value.
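By way of illustration, the sketch below shows a generic greedy action selection and a one-step discounted Q-learning target; the actual policy network architecture, discount factor and training procedure used here may differ.

import torch

def select_action(q_network, state, already_selected):
    """Greedily select the unlabelled gallery instance with the maximum Q value,
    excluding instances that have already been queried."""
    with torch.no_grad():
        q_values = q_network(state).clone()  # one Q value per gallery instance
    if already_selected:
        q_values[list(already_selected)] = float("-inf")  # never re-query an instance
    return int(torch.argmax(q_values).item())

def q_learning_target(reward, next_q_values, gamma=0.9, terminal=False):
    """One-step target R_t + gamma * max_a Q(S_{t+1}, a) for updating Q towards Q*."""
    if terminal:
        return reward
    return reward + gamma * float(torch.max(next_q_values).item())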
CNN Network Updating. For each query anchor, several samples may be actively selected via the proposed DRAL agent 120, which are then manually annotated by the human oracle 110. These pairwise data will be added to an updated training data pool (e.g. a training data set). The CNN network may then be updated gradually using fine-tuning. The triplet loss may be used as the objective function, and when more labelled data is involved, the model becomes more robust and smarter. The renewed network is employed for Re-ID feature extraction, which in turn helps to improve the state initialization. This iterative training scheme may be stopped when a fixed annotation budget is reached or when each image in the training data pool has been browsed once by our DRAL agent 120.
Simultaneous Knowledge Ensemble and Distillation
An online knowledge distillation training method may be based on the idea of simultaneous knowledge ensemble and distillation (SKED). A base network architecture may be either a CNN ResNet-50 or ResNet-110. Other network architectures may be adopted. For model construction, there are n labelled training samples D={(xi, yi)}, i=1, . . . , n, with each sample belonging to one of C classes yiϵ{1, 2, . . . , C}.
The network θ outputs a probabilistic class posterior p(c|x, θ) for a sample x over a class c as:
where z is the logits or unnormalised log probability outputted by the network θ. To train a multi-class classification model, the Cross-Entropy (CE) measurement may be employed between a predicted and a ground-truth label distribution as the objective loss function:
where δc,y is the Dirac delta which returns 1 if c is the ground-truth label, and 0 otherwise. With the CE loss, the network may be trained to predict the correct class label in a principle of maximum likelihood. To further enhance the model generalisation, extra knowledge may be distilled from an online native ensemble teacher to each branch in training.
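For illustration, the class posterior and the CE loss described above may be computed as follows in a minimal PyTorch sketch.

import torch.nn.functional as F

def class_posterior(logits):
    """p(c|x, theta) = exp(z_c) / sum_j exp(z_j), computed from the logits z."""
    return F.softmax(logits, dim=1)

def cross_entropy_loss(logits, labels):
    """CE loss: -log p(y|x, theta) for the ground-truth class y, averaged over the
    batch (the Dirac delta selects only the ground-truth class term)."""
    return F.cross_entropy(logits, labels)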
Multi-Branch Teacher Model Ensemble. An overview of a global knowledge ensemble model is illustrated in the accompanying drawings.
To construct a model network, the model may be reconfigured by adding a separate CE loss cei to each branch, which simultaneously learns to predict the same ground-truth class label of a training sample. While sharing most of the layers, each branch can be considered as an independent multi-class classifier in that all of them independently learn high-level semantic representations. Consequently, taking the ensemble of all branches (classifiers) makes a stronger teacher model. One common way of ensembling models is to average individual predictions, but this may ignore the diversity and the varying importance of the member models of an ensemble. Whilst averaging may be used, an improved technique is to learn the ensemble via a gating component as:
where gi is the importance score of the i-th branch's logits zi, and ze are the logits of the teacher. In particular, the original branch may be denoted as i=0 for indexing convenience. The teacher model may be trained with the CE loss cee (Eq (12)), which may be the same as the branches.
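A minimal PyTorch sketch of such a gated ensemble is shown below. The gating input (a shared trunk feature) and the linear-plus-softmax gate are assumptions made for illustration; other gating components may be used.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEnsemble(nn.Module):
    """Learn an importance score g_i for each branch's logits z_i and combine them
    into the teacher logits z_e."""
    def __init__(self, feature_dim, num_branches):
        super().__init__()
        self.gate = nn.Linear(feature_dim, num_branches)  # one score per branch

    def forward(self, shared_feature, branch_logits):
        # branch_logits: list of (batch, num_classes) tensors, one per branch.
        g = F.softmax(self.gate(shared_feature), dim=1)   # (batch, num_branches)
        stacked = torch.stack(branch_logits, dim=1)       # (batch, branches, classes)
        teacher_logits = (g.unsqueeze(2) * stacked).sum(dim=1)
        return teacher_logits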
Knowledge Distillation. Given the teacher's logits of each training sample, this knowledge may be distilled back into all branches in a closed-loop form. For facilitating knowledge transfer, soft probability distributions may be computed at a temperature of T for individual branches and the teacher as:
where i denotes the branch index, i=0, . . . , m, and θi and θe are the parameters of the branch and teacher models, respectively. Higher values of T lead to more softened distributions.
To quantify the alignment of model representations between individual branches and the teacher ensemble in their predictions, we use the Kullback Leibler divergence from branches to the teacher, defined as
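By way of illustration, the following minimal PyTorch sketch computes the softened distributions at temperature T and a KL-based alignment of a branch to the teacher's prediction; the temperature value chosen is an assumed example.

import torch.nn.functional as F

def distillation_loss(branch_logits, teacher_logits, T=3.0):
    """KL divergence aligning a branch's softened prediction with the teacher's
    softened prediction at temperature T."""
    log_p_branch = F.log_softmax(branch_logits / T, dim=1)  # softened branch (log)
    p_teacher = F.softmax(teacher_logits / T, dim=1)        # softened teacher
    return F.kl_div(log_p_branch, p_teacher, reduction="batchmean")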
Overall Loss Function. An overall loss function is obtained for simultaneous knowledge ensemble and distillation (SKED) training as:
Where cei and cee are the conventional CE loss terms associated with the i-th branch and the teacher, respectively. The gradient magnitudes produced by the soft targets {tilde over (p)} are scaled by
so the distillation loss term is multiplied by a factor T2 to ensure that the relative contributions of ground-truth and teacher probability distributions remain roughly unchanged. Note, the overall objective function of this model is not an ensemble learning since (1) these loss functions corresponding to the models with different roles, and (2) the conventional ensemble learning often takes independent training from member models.
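For illustration, the loss terms may be combined as in the sketch below, which reuses the distillation_loss helper from the previous sketch; the exact form of the teacher CE term and any weighting beyond the T² factor follow the description above.

import torch.nn.functional as F

def sked_loss(branch_logits_list, teacher_logits, labels, T=3.0):
    """Per-branch CE + teacher CE + T^2-scaled distillation from the teacher to
    every branch (a sketch of the overall SKED objective)."""
    loss = F.cross_entropy(teacher_logits, labels)  # teacher CE term
    for branch_logits in branch_logits_list:
        loss = loss + F.cross_entropy(branch_logits, labels)  # branch CE term
        loss = loss + (T ** 2) * distillation_loss(branch_logits, teacher_logits, T)
    return loss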
Model Update and Deployment. Unlike a two-phase offline distillation training, the enhancement/update of a target network and the global teacher model may be performed simultaneously and collaboratively, with the knowledge distillation from the teacher to the target being conducted in each mini-batch and throughout the whole training procedure. Since there is one multi-branch network rather than multiple networks, it is only necessary to carry out the same stochastic gradient descent through the (m+1) branches and to train the whole network until convergence, as in standard single-model incremental batch-wise training. There is no additional complexity from asynchronous updating among different networks, which may be required in deep mutual learning [75]. Once the model is trained, all the auxiliary branches may be removed in order to obtain the original network architecture for deployment. Hence, the present method does not generally increase the test-time cost. Moreover, if the target application domain has no limitation on resources and access, then an ensemble model with all branches can be more easily deployed.
Experiment 1—Distributed Optimisation On-Site
Datasets. The following describes the results of various experiments used to evaluate the present system and method. For experimental evaluations, results on both large-scale and small-scale person re-identification benchmarks are reported for robust analysis. Market-1501 [77] is a widely adopted large-scale re-id dataset that contains 1,501 identities obtained by a Deformable Part Model pedestrian detector. It includes 32,668 images obtained from 6 non-overlapping camera views on a campus. CUHK01 [40] is a notable small-scale re-id dataset, which consists of 971 identities from two camera views, where each identity has two images per camera view; it thus includes 3,884 manually cropped images. Duke [50] is one of the most popular large-scale re-id datasets and consists of 36,411 pedestrian images captured from 8 different camera views. Among them, 16,522 images (702 identities) are adopted for training, and 2,228 images (702 identities) are taken as queries to be retrieved from the remaining 17,661 images.
Evaluation Protocols. The detailed information about the training/testing splits of these three datasets is shown in Table 2.
For Market-1501 [77], the protocol of [78] is followed with a 750 training/751 test split under single-query evaluation settings. For Duke [50], a 702 training/702 test split is evaluated. A 485 training/486 test split is used for the CUHK01 dataset [40]. Two evaluation metrics are adopted to evaluate the Re-ID performance: the first is the Cumulated Matching Characteristics (CMC), and the second is the mean average precision (mAP), which treats the person Re-ID task as an object retrieval problem.
Implementation Details. The proposed DRAL method is implemented using the Pytorch framework. A ResNet-50 multi-class identity discrimination network is re-trained with a combination of triplet loss and cross entropy loss for 60 epochs (pre-trained on Duke for Market1501 and CUHK01, and pre-trained on Market1501 for Duke), at a learning rate of 5E-4 using the Adam optimizer. The final FC layer output feature vector (2,048-D) is extracted as the re-id feature vector in the present model, with all training images resized to 256×128. The policy network in this method consists of three FC layers, each of size 256. The DRAL model is randomly initialized and then optimized with the learning rate at 2E-2, and (Kmax, ns, K) are set as (10, 30, 15) by default. The κ-reciprocal number for sparse similarity construction is set as 15 in this work. The balance parameters thred and m are set as 0.4 and 0.2, respectively. Each time a further 25% of the training data has been added to the labelled pairwise data pool, the CNN network is fine-tuned with a learning rate of 5E-6.
Performance Evaluation. Human-in-the-loop person re-identification does not require pre-labelled data, but receives user feedback for the input query little by little. It is feasible to label many of the gallery instances, but to cut down the human annotation cost, an active learning technique is used for sample selection. Therefore, the proposed DRAL method (the present method and system) is compared with active learning based approaches and unsupervised/transfer based methods. The results are shown in Table 3, in which the terminology 'uns/trans' and 'active' indicates the training style under investigation. Moreover, baseline results are computed by directly employing the pre-trained CNN model, and the upper bound result indicates that the model is fine-tuned on the dataset with fully supervised training data.
For the unsupervised/transfer learning setting, thirteen state-of-the-art approaches are selected for comparison, including UMDL [48], PUL [19], SPGAN [16], Tfusion [44], TL-AIDL [64], ARN [42], TAUDL [39], CAMEL [70], SSDAL [58].
Tables 3, 4 and 6 show the rank-1, 5 and 10 matching accuracy and mAP (%) performance on the Market1501 [77], Duke [50] and CUHK01 [40] datasets, with the results of the present approach in bold. The present method achieves 84.32% and 66.07% at rank-1 and mAP, which outperforms the second best unsupervised/transfer approaches by 14.02% and 24.87% on the Market1501 [77] benchmark. For the Duke [50] and CUHK01 [40] datasets, DRAL also achieves fairly good performance, with rank-1 matching rates of 75.31% and 76.95%.
These results demonstrate clearly the effectiveness of the present active sample selection strategy implemented by the DRAL method, and show that, without exhaustively and unselectively annotating large quantities of training data, an improved re-identification model can be built effectively by DRAL.
Comparisons with Active Learning. Besides the approaches mentioned above, some active learning based approaches that involve human-machine interaction during training are compared. Four active learning strategies are chosen as comparisons, in which the model is trained through the same framework as the present method: an iterative procedure of active sample selection and CNN parameter updating is executed until the annotation budget is reached. Here, 20% of the entire training samples are selected via the reported active learning approaches, which means that 388, 2588 and 3304 are set as the annotation budgets for termination on the CUHK01 [40], Market1501 [77] and Duke [50] datasets, respectively. Beside these active learning methods, the performance is also compared with another active learning approach, HVIL [63], which runs experiments under a human-in-the-loop setting. The details of these approaches are as follows: (1) Random: as a baseline active learning approach, samples are randomly picked for querying; (2) Query Instance Uncertainty [15] (QIU): the QIU strategy selects the samples with the highest uncertainty for querying; (3) Query By Committee [1] (QBC): QBC is a very effective active learning approach which learns an ensemble of hypotheses and queries the instances that cause maximum disagreement among the committee; (4) Graph Density [17] (GD): active learning by GD is an algorithm which constructs a graph structure to identify highly connected nodes and determine the most representative data for querying; (5) Human Verification Incremental Learning [63] (HVIL): HVIL is trained in the human-in-the-loop setting, receiving soft user feedback (true, false, false but similar) during model training and requiring the annotator to label the top-50 candidates of each query instance.
Tables 3, 4 and 6 compare the rank-1, 5, 10 and mAP rates of the active learning models against DRAL, where the baseline model result is obtained by directly employing the pre-trained CNN model. We can observe from these results that: (1) all the active learning methods perform better than the random picking strategy, which validates that active sample selection does benefit person Re-ID performance; (2) DRAL outperforms the other active learning methods, with its rank-1 matching rate exceeding the second best models QBC, HVIL and GD by 19.85%, 6.32% and 14.18% on the CUHK01 [40], Market1501 [77] and Duke [50] datasets respectively, with a much lower annotation cost. This suggests that DRAL (the present method) is more effective than other active learning methods for person Re-ID by introducing the policy as a sample selection strategy.
Comparisons on Different Sizes of Labelled Data. We further compare the performance of the proposed DRAL approach, with a varying amount of labelled data (indicated by Kmax), against fully supervised learning (UpperBound) on the three reported datasets. The rank-1, 5, 10 accuracies, mAP (%) and annotation costs are compared, where the cost is counted as the number of pairwise labelling operations. Therefore, with training sample number n, the cost for the fully supervised setting will be n². As the training data size grows, the cost of annotating all of the data therefore increases quadratically. Among the results, the baseline is obtained by directly employing the pre-trained CNN for testing. For the fully supervised setting, with all the training data annotated, the CNN parameters can be fine-tuned with both the triplet loss and the cross-entropy loss, seeking better performance. For the present DRAL method, we present the performance with Kmax set to 3, 5 and 10 in Table 6. As can be observed: (1) with more data annotated, the model becomes stronger at the cost of increased annotation; with the annotation number for each query increasing from 3 to 10, the rank-1 matching rate improves by 14.4%, 9.47% and 19.23% on the Duke [50], Market1501 [77] and CUHK01 [40] benchmarks. (2) Compared to the fully supervised setting, the proposed active learning approach shows only around a 3% drop in rank-1 accuracy on each dataset. However, the annotation cost of DRAL is far below that of the supervised setting.
Effects from Cumulative Model Optimisation. These results demonstrate that, by iteratively increasing the size of the labelled data, the model performance is enhanced gradually. For each input query, we only associate labels with the gallery candidates derived from DRAL, and adopt these pairwise labelled data for CNN parameter updating. We fixed the number of iterations at 4 in these experiments on all datasets. With 25% of the overall training data used for active learning, the CNN model is fine-tuned and achieves improved performance.
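A minimal sketch of this cumulative loop, under stated assumptions, is given below. The `policy` interface (`select_candidates`, `update`), the `human_confirms` binary-feedback oracle and the `finetune_cnn` update step are hypothetical placeholders standing in for the components described above, not the actual implementation.

```python
def cumulative_optimisation(cnn, policy, queries, gallery,
                            human_confirms, finetune_cnn,
                            k_max=10, iterations=4):
    """Iteratively grow a pool of human-confirmed pairs and fine-tune the CNN."""
    labelled_pairs = []
    for _ in range(iterations):
        for query in queries:
            # The policy proposes up to k_max gallery candidates for this query.
            candidates = policy.select_candidates(cnn, query, gallery, k_max)
            for cand in candidates:
                # Binary (true/false) confirmation acts as the reward signal.
                is_match = human_confirms(query, cand)
                labelled_pairs.append((query, cand, is_match))
                policy.update(reward=1.0 if is_match else -1.0)
        # Cumulatively fine-tune the CNN on all pairs confirmed so far.
        cnn = finetune_cnn(cnn, labelled_pairs)
    return cnn, policy
```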
Experiment 2—Knowledge Ensemble & Distillation
Datasets. We used four multi-class categorisation benchmark datasets in our evaluations.
Performance Metrics. We adopted the common top-n (n=1, 5) classification error rate. To measure the computational cost of model training and testing, we used floating point operations (FLOPs). For any network trained by our method, we reported the average performance of all branch outputs with the standard deviation.
Experiment Setup. We implemented all networks and model training procedures in PyTorch, using an NVIDIA Tesla P100 GPU. For all datasets, we adopted the same experimental settings as [34, 68] to make fair comparisons. We used SGD with Nesterov momentum and set the momentum to 0.9. We deployed a standard learning rate schedule that drops the rate from 0.1 to 0.01 halfway (50%) through training, and to 0.001 at 75%. For the training budget, we set 300/40/90 epochs for CIFAR/SVHN/ImageNet, respectively. We adopted a 3-branch model (m=2) design unless stated otherwise. We separated the last block of each backbone net from parameter sharing (except on ImageNet, where we separated the last 2 blocks to give more learning capacity to the branches) without extra structural optimisation (see, for example, ResNet-110).
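As a hedged illustration of this training configuration, the following PyTorch snippet sets up SGD with Nesterov momentum and the described step schedule for the 300-epoch CIFAR budget; the backbone, weight decay value and training-loop details are assumptions for illustration, not values stated in the text.

```python
import torch
from torchvision.models import resnet18  # stand-in backbone for illustration only

model = resnet18(num_classes=100)
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                # initial rate as described
    momentum=0.9,          # Nesterov momentum of 0.9
    nesterov=True,
    weight_decay=5e-4,     # assumed value, not stated in the text
)
# Drop the rate to 0.01 at 50% and to 0.001 at 75% of the 300 CIFAR epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 225], gamma=0.1
)

for epoch in range(300):
    # ... one training epoch (forward, loss, optimizer.step()) would go here ...
    scheduler.step()
```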
Performance Evaluation. Results on CIFAR and SVHN. Table 7 compares the top-1 error rate performances of four varying-capacity state-of-the-art network models trained by the conventional and our SKED learning algorithms. We make the following observations: (1) all the different networks benefit from the SKED training algorithm, with small models in particular achieving larger performance gains; this suggests a generic superiority of our method for online knowledge distillation from the online teacher to the target student model. (2) All individual branches have similar performances, indicating that they have reached sufficient agreement and exchanged their respective knowledge well through the proposed SKED teacher model during training.
Results on ImageNet. Table 8 shows the comparative performances on the 1000-classes ImageNet. It is shown that the proposed SKED learning algorithm again yields more effective training and more generalisable models in comparison to the vanilla SGD. This indicates that our method is generically applicable in large scale image classification settings.
Comparisons with Distillation Methods. We compared our SKED method with two representative alternative distillation methods: Knowledge Distillation (KD) [28] and Deep Mutual Learning (DML) [75]. The teacher model provides a constant uniform target distribution. For the offline competitor KD, we used a large network ResNet-110 as the teacher and a small network ResNet-32 as the student. For the online methods DML and SKED, we evaluated their performances using either ResNet-32 or ResNet-110 as the target student model. We observe from Table 9 that: (1) SKED outperforms both KD (offline) and DML (online) distillation methods in error rate, validating the performance advantages of our method over alternative algorithms when applied to different CNN models. (2) SKED takes the least model training cost and the same test cost as others, therefore giving the most cost-effective solution.
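For reference, the standard offline Knowledge Distillation objective used by the KD baseline combines a hard-label cross-entropy term with a temperature-softened KL term between teacher and student predictions. The sketch below is a generic formulation of that loss; the temperature and weighting values are chosen purely for illustration and are not the settings used in the reported experiments.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=3.0, alpha=0.9):
    """Generic knowledge distillation loss: CE on hard labels plus softened KL."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)  # rescale so gradients match the hard-label term
    return alpha * soft + (1.0 - alpha) * hard
```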
Comparisons with Ensembling Methods. Table 10 compares the performance of our multi-branch (3 branches) based model SKED-E with standard ensembling methods. It is shown that SKED-E yields not only the best test error but also enables the most efficient deployment with the lowest test cost. These advantages are achieved at the second-lowest training cost. Whilst Snapshot Ensemble takes the least training cost, its generalisation capability is unsatisfactory and it has the drawback of a much higher deployment cost.
It is worth noting that SKED (without branch ensemble) already outperforms a 2-Net Ensemble comprehensively in terms of error rate, training cost and test cost. Compared with a 3-Net Ensemble, SKED approaches its generalisation capability whilst retaining substantial advantages in model training and test efficiency.
The present methods and systems provide distributed AI deep learning for model optimisation on-site together with simultaneous knowledge ensemble and distillation. The present method and mechanisms avoid globally centralised human labelling of large-sized training data by performing distributed, target-application-domain-specific model optimisation, and the present method is demonstrated on the task of person re-identification.
First, we introduced a deep reinforcement active learning approach to human-in-the-loop selective sample feedback confirmation for incremental distributed model optimisation at each user site. Given the lack of a large quantity of pre-labelled training data, the present system and method improves the effectiveness of localised and distributed Re-ID model optimisation from a small number of selected samples and performs deep learning at-the-edge (distributed AI learning on-site). A key task for model design becomes how to select fewer and more informative data samples for model optimisation by the user, using an existing weak model at-the-edge (user usage per user site). A Deep Reinforcement Active Learning (DRAL) method provides a flexible reinforcement learning policy to select informative samples (a ranked list) for a given input query. Those samples are then presented to a human annotator 110 so that the model can receive binary feedback (true or false) as the reinforcement learning reward for DRAL model updating. Both this concept and the detailed processes for deep learning at-the-edge from distributed small data with human-in-the-loop reinforcement data mining deliver a performance advantage over current methods, including the previous non-deep-learning human-in-the-loop model. An iterative model learning mechanism is implemented for simultaneously looped model optimisation updates from both Deep Reinforcement Active Learning and Convolutional Neural Network training, to achieve deep-learning-at-the-edge data mining for distributed Re-ID optimisation at each user site. Extensive performance evaluations were conducted on both large-scale and small-scale Re-ID benchmarks to demonstrate these improvements. The present system and method (DRAL) shows clear Re-ID performance advantages over current systems, including supervised learning, unsupervised/transfer learning, and human-in-the-loop relevance feedback learning based Re-ID methods.
Second, we further developed a multi-branch strong teacher ensemble model for simultaneous knowledge ensemble (from multiple model representations) and distillation (to target models). This approach can learn both small and large deep network models discriminatively at a lower computational cost, beyond the conventional offline methods for learning small models alone. The present method is also superior to existing online learning methods owing to a very strong teacher ensemble model built simultaneously from multiple branches/models. Extensive performance evaluations on four image classification benchmarks show that a wide range of deep neural networks can benefit from the present multi-branch model ensemble and knowledge distillation mechanism. Significantly, smaller target models obtain performance gains, making the present method especially well suited to disseminating shared knowledge to distributed, resource-limited and/or training-data-constrained target application domains.
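To illustrate the general shape of such an online, multi-branch ensemble-teacher objective, the sketch below trains each branch on the ground-truth labels while also distilling it towards an ensemble of the branches' softened predictions. The simple averaging of branches, the temperature and the equal loss weighting are assumptions made for this sketch only; they are not the specific aggregation or weighting of the present SKED model.

```python
import torch
import torch.nn.functional as F

def multi_branch_distillation_loss(branch_logits, labels, temperature=3.0):
    """Sketch of an online ensemble-teacher objective over a list of branch logits."""
    # Ensemble teacher: average of the branches' softened predictions (assumed aggregation).
    teacher = torch.stack(
        [F.softmax(z / temperature, dim=1) for z in branch_logits]
    ).mean(dim=0).detach()

    loss = 0.0
    for z in branch_logits:
        ce = F.cross_entropy(z, labels)                  # per-branch supervision
        kl = F.kl_div(                                   # distil each branch from the ensemble
            F.log_softmax(z / temperature, dim=1),
            teacher,
            reduction="batchmean",
        ) * (temperature ** 2)
        loss = loss + ce + kl
    return loss / len(branch_logits)
```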
[25] Y. Guo and N.-M. Cheung. Efficient and deep person re-identification using multi-level similarity. In CVPR, 2018.
[42] Y. Li, F. Yang, Y. Liu, Y. Yeh, X. Du, and Y. F. Wang. Adaptation and re-identification network: An unsupervised deep transfer learning approach to person re-identification. In CVPR, pages 172-178, 2018.
[44] J. Lv, W. Chen, Q. Li, and C. Yang. Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns. In CVPR, 2018.
[72] L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In CVPR, 2016.
As will be appreciated by the skilled person, details of the above embodiment may be varied without departing from the scope of the present invention, as defined by the appended claims.
For example, different data types may be used. Different reward functions may be used.
Many combinations, modifications, or alterations to the features of the above embodiments will be readily apparent to the skilled person and are intended to form part of the invention. Any of the features described specifically relating to one embodiment or example may be used in any other embodiment by making the appropriate changes.
Priority application: 1908574.5, Jun 2019, GB (national).
International filing: PCT/GB2020/051420, filed 6/12/2020 (WO).