This application is based on and claims priority to Great Britain Patent Application No. GB2116074.2, filed on Nov. 9, 2021, and Great Britain Patent Application No. GB2205999.2, filed on Apr. 25, 2022, in the Great Britain Intellectual Property Office, the disclosures of which are incorporated by reference herein in their entireties.
The present application generally relates to a method for training a machine learning, ML, model using class incremental learning, and to a computer-implemented method and apparatus for using the trained machine learning, ML, model.
Continual learning (also called lifelong learning) refers to the ability to continuously learn and adapt to new environments, exploiting knowledge gained from the past when solving novel tasks. Though being a common human trait, the lifelong learning paradigm hardly applies to artificial intelligence systems. Nevertheless, a classification model can be progressively trained on a constantly changing set of new classes, and this may be termed Class-Incremental Learning (CIL). Continual learning has been extensively studied in a class incremental fashion [7, 22, 23]. CIL may be used in several technical fields, for example image classification.
When training a model such as the one shown in
In the next time step t, several new classified images (i.e. several images belonging to new classes) are introduced and are indicated by squares and circles respectively. For example, the squares may be images of zebras and the circles images of horses. As indicated, some of the horse and zebra images are within the decision boundary 110 of the feature space for dogs and potentially would be misclassified as dogs. The original classified images are shown as ghost triangles because they are not included when training the model at time step t. In other words, for each incremental training task and step t, the training set is composed of images belonging to the current class set $\mathcal{C}_t$ (e.g. horses and zebras), whereas past semantic categories $\mathcal{C}_{old} \triangleq \{\mathcal{C}_{t'}\}_{t'=1}^{t-1}$ (e.g. dogs) lack any training sample. The final objective for the model is to maximize the generalization (classification) accuracy on all the categories observed up to the end of the stream.
As shown in the final section of
The catastrophic forgetting phenomenon has also been described as the stability-plasticity dilemma, since there exists a trade-off between the preservation of old-task knowledge (i.e., stability) and the need to accommodate the information obtained from the experience of new tasks (i.e., plasticity). To preserve past knowledge, many state-of-the-art methods store and rehearse training exemplars of previously seen classes, which might be a source of privacy or storage issues. Alternatively, some recent approaches resort to prototypical feature representations to inject information of former classes in later incremental steps. However, if not updated, those representations become increasingly stale as the incremental learning progresses and new classes are learned.
Most of the successful known CIL methods use exemplars of old classes Cold to rehearse past knowledge [1, 4, 5, 8, 9, 12, 18, 20, 25, 27, 32, 36]. However, storing samples belonging to all classes might be impractical due to limited resource availability or privacy requirements. To address CIL without storing exemplars, regularisation methods have been proposed [2, 6, 16, 39, 41]; the common goal is to identify key model parameters to solve old tasks, and prevent their change when learning a new task. Alternatively, knowledge distillation has been proposed [11, 19], where representations of new classes are forced to only slightly deviate from their original version computed at the beginning of the incremental step for learning the current task. Yet, those methods usually underperform when compared to the state-of-the-art (SotA) solutions.
In a more recent work [43], class prototypes were used to inject past knowledge. Although showing promising results, this method fails to capture the representation drift that is present while incrementally training the model and that is illustrated in {Ft′}t′=1t and the old classes by Cold. The old classes may have been computed when the corresponding data is available and then kept fixed for the rest of the training.
A different work [40] proposes to estimate the change of prototypes (e.g. templates of average dogs) of old classes (e.g. dogs) when learning new classes (e.g. horses and zebras). The old templates may be represented by Πold{t′}t′=1t and the old classes by Cold and the new classes by Ct. However, this method is limited in scope to embedding learning and devises a deterministic non-learnable module to estimate prototype shift.
There are also examples from the patent literature. For example, US2020151619 describes a system and method for accounting for the impact of concept drift in selecting machine learning training methods to address the identified impact. US2021224696 describes a computer system for detecting a concept drift of a machine learning model in a production environment for adaptive optimization of the concept drift detection. US2021157704 describes a system which can monitor applications and analyze the metrics to determine if one or more of the applications are regressing or performing as expected. US2021073627 describes a method for detecting degradation of the machine learning model based on key performance indicators.
The present applicant has recognised the need for an improved system and method that addresses these drawbacks.
In a first approach of the present techniques, there is provided a computer-implemented method for training a classifier using class incremental learning, CIL, wherein the classifier is a machine learning, ML, model comprising a feature extraction model and a classification model and wherein at each incremental time step in the CIL a new training dataset having a set of samples with class labels is obtained and an optimisation over a plurality of stages is applied. The training method may comprise for each incremental step:
extracting, using the feature extraction model, a first set of features representing the samples within the new training set at the beginning of the incremental step and a second set of features representing the samples within the training set at a subsequent stage within the optimisation;
learning a feature drift model using the extracted first and second sets of features, wherein the feature drift model estimates the drift between the features representing each sample as the classification model is updated;
obtaining a first set of old class prototypes for the beginning of the incremental step, wherein each class prototype is a semantic representation of the sets of features of the samples which have the same class label and wherein each old class prototype represents old samples which are samples that are not present in the new training set;
learning a semantic drift model using the set of old class prototypes and first set of features, wherein the semantic drift model estimates the drift between the class prototypes for each class as the classification model is updated;
inferring at least one of a second set of old class prototypes and a set of old features using the feature drift model and the semantic drift model, wherein the second set of prototypes are semantic representations of the old samples at a subsequent stage within the optimisation and the set of old features are the feature representations of revived old samples at the subsequent stage within the optimisation; and
using at least one of the inferred second set of old class prototypes and set of old features to update the ML model.
When the final incremental step is completed, the method may further comprise outputting the trained classification model.
In other words, there is provided a computer-implemented method for training a machine learning, ML, model. More specifically, there may be provided a computer-implemented method for performing class incremental learning of at least one or both of feature and semantic representation using a machine learning, ML, model. More specifically, there may be provided a computer-implemented method for performing class incremental learning using a machine learning, ML, model, the method comprising: inferring a set of prototypes of old classes using at least one of a feature drift model which is used to represent feature drift and a semantic drift model which is used to represent semantic drift; and using the inferred set of prototypes to update the ML model.
The method may learn how to update semantic representations (also termed class prototypes or templates) of old concepts (classes) by modelling drift of semantic representations. The method may include modelling the relationship with the new classes currently being learned for which data samples are available. The method may also learn how to update feature representations of old concepts (classes) by modelling drift of feature representations. The method may include modelling the trajectory of samples in the feature space spanned by parameters of the task network. These two models may be used separately or in combination to re-estimate continually evolving feature distributions of old concepts.
Implementing the modelling described above avoids the need to store samples of old classes in order to access the evolving data distributions of old classes. The modelling also avoids using fixed representations of old classes, which typically become staler at each new incremental phase. In other words, the techniques described provide a ML framework capable of modelling inter-class relationships and estimating the evolution of semantic and feature representations. The techniques described also provide a framework to achieve high performance on class-incremental learning, without requiring data samples to be stored.
When compared to prior art techniques, e.g. US2020151619, it is noted that our proposed method focuses on representation drift instead of concept drift, that is, we consider change of learned representations of ML models instead of only change of data. Our method may thus be considered as aiming to model change of representation drifts instead of determining their impact on model performance. There is an aim of improving the new representations together with old representations by updating a model, instead of selecting a model from multiple models. For example, considering US2021224696 as a contrast, our proposed method may be described as using a single ML model which is continually updated using novel data and is never retrained on past data. Thus, there is no need to store and/or access past data. For example, considering US2021157704, our method by contrast aims to directly address a continual learning problem where the training data is continually changing and data from the past cannot be accessed. We assume that the training data is accessible for a limited period of time and cannot be stored and reused. When compared to US2021073627, our method aims at directly addressing the performance degradation of an ML model caused by catastrophic forgetting by modelling semantic and/or feature representation drifts. Only new data which is currently available is leveraged.
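Each new training dataset may be represented by $\mathcal{D}_t$ and thus the complete dataset for the CIL may be represented as $\{\mathcal{D}_t\}_{t=0}^{T}$ with $\mathcal{D}_t = (\mathcal{X}_t, \mathcal{Y}_t)$, $t = 0, \ldots, T$, where $\mathcal{X}_t = \{x_{t,j}\}_{j=1}^{N_t}$ is the set of samples, $\mathcal{Y}_t = \{y_{t,j} \in \mathcal{C}_t\}_{j=1}^{N_t}$ is the set of corresponding labels, $\mathcal{C}_t$ is the set of class labels observed at this step and $N_t$ is the number of samples per step. Each sample may be an image, text data or audio data. When images are used as data and the ML task is image classification, the ML model may be an image classifier.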
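The structure of the stream of per-step datasets can be illustrated with a minimal Python sketch. The helper name `split_into_steps`, the toy one-dimensional samples and the class names are assumptions made purely for illustration and do not come from the described method.

```python
def split_into_steps(samples, classes_per_step):
    """Group (x, y) pairs into per-step datasets D_0..D_T.

    `classes_per_step` is a list of label sets C_t (hypothetical input).
    Each returned D_t = (X_t, Y_t) holds only samples whose label is in C_t,
    so earlier classes have no samples in later steps.
    """
    steps = []
    for class_set in classes_per_step:
        xs = [x for x, y in samples if y in class_set]
        ys = [y for x, y in samples if y in class_set]
        steps.append((xs, ys))
    return steps


# Toy data: step 0 sees only dogs, step 1 sees only horses and zebras.
samples = [([0.1], "dog"), ([0.9], "horse"), ([0.8], "zebra"), ([0.2], "dog")]
stream = split_into_steps(samples, [{"dog"}, {"horse", "zebra"}])
```

Here `stream[1]` contains no dog samples, which mirrors the setting in which old classes lack training samples at later steps.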
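The feature drift model may be represented by $\Gamma_{\gamma_t^n}$ and is applied to the features observed at each incremental step. The semantic drift model may be represented by $\Psi_{\psi_t^n}$ and relates the set of old class labels $\mathcal{C}_{old}$ to the set of class labels $\mathcal{C}_t$ observed at each incremental step.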
The first set of features representing each sample within the new training set at the beginning of the incremental step may be represented by $\mathcal{F}_t^0$. The second set of features representing each sample within the training set at a subsequent stage within the optimisation (i.e. the nth optimisation stage) may be represented by $\mathcal{F}_t^n$. The first set of old class prototypes for the beginning of the incremental step may be represented by $\Pi_{old}^{t,0}$. The second set of old class prototypes for the subsequent stage within the optimisation may be represented by $\Pi_{old}^{t,n}$. From the old class prototypes at the subsequent stage, we can estimate the set of features $\mathcal{F}_{old}^n$ for the old samples which are not part of the training dataset $\mathcal{D}_t$. These old samples which are estimated may be termed revived old samples or revived evanescent representations because the representations were no longer appearing in the dataset.
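As an illustration of how a set of class prototypes might be obtained from extracted features, the following sketch takes each prototype to be the mean feature vector of its class. The use of the class mean and the function name `class_prototypes` are assumptions; the text only requires that a prototype is a semantic representation of the features sharing a class label.

```python
def class_prototypes(features, labels):
    """Return {label: mean feature vector} (class mean is an assumption)."""
    sums, counts = {}, {}
    for feature, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feature)
            counts[label] = 0
        # Accumulate the per-dimension sum for this class.
        sums[label] = [s + v for s, v in zip(sums[label], feature)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


protos = class_prototypes(
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
    ["dog", "dog", "horse"],
)
```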
The extracting, learning, obtaining, inferring and using steps may be repeated for each optimisation stage within each incremental step. When learning the feature drift and semantic drift models, the feature extraction and classification models are kept fixed. These learning steps may be termed a learning phase. When using the inferred information to update the ML model (i.e. to update the feature extraction and classification models), the feature drift and semantic drift models are kept fixed. This updating step and the preceding inferring step may be termed an inference phase. At each learning and using step, the models may be trained until convergence, where the convergence criterion for a model is early stopping, i.e. the optimisation of the model parameters is stopped if the training loss does not change for a fixed number of steps. There may be an initialisation stage for the first optimisation stage n=0. In the initialisation stage, both the feature extraction model $f_{\theta_0}$ and the classification model are trained. The set of class prototypes $\Pi_{new}^0$ can then be computed, and the first set of old class prototypes $\Pi_{old}^{t=1,0}$ is initialised as $\Pi_{new}^0$.
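The alternation between the learning phase and the inference phase, together with the early-stopping convergence criterion, can be sketched as follows. The function names, the `patience` and `tol` parameters and the step callables are hypothetical; freezing of the models that are not being trained is assumed to happen inside the callables.

```python
def train_until_converged(step_fn, patience=3, tol=1e-6, max_steps=1000):
    """Call step_fn() (which returns the training loss) until the loss has
    not changed by more than `tol` for `patience` consecutive steps."""
    previous, unchanged, steps = None, 0, 0
    while steps < max_steps:
        loss = step_fn()
        steps += 1
        if previous is not None and abs(loss - previous) <= tol:
            unchanged += 1
            if unchanged >= patience:
                break
        else:
            unchanged = 0
        previous = loss
    return steps


def run_optimisation_stage(drift_step, task_step):
    # Learning phase: the drift models are trained; the feature extraction
    # and classification models are assumed frozen inside drift_step.
    learn_steps = train_until_converged(drift_step)
    # Inference phase: the task network is updated; the drift models are
    # assumed frozen inside task_step.
    update_steps = train_until_converged(task_step)
    return learn_steps, update_steps


# Toy loss curve that plateaus at 0.25.
losses = iter([1.0, 0.5, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25])
steps_taken = train_until_converged(lambda: next(losses))
```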
Inferring a set of prototypes of old classes using the feature drift model may comprise training the feature drift model using the available set of features; and using the trained feature drift model to estimate the feature drift on past data whereby the set of prototypes of old classes is inferred. Training may be done using any suitable technique and depends on the architecture of the machine learning model.
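The feature drift model may use a linear deep neural network (DNN), e.g. a multilayer perceptron (MLP). The feature drift model may be learned by using a feature drift loss function which minimises the error between the extracted second set of features and a third set of features which are the estimate of these features from the first set of features using the feature drift model. Merely as an example, the loss function $L_f^t$ may be expressed using the mean squared error function:
$$L_f^t(\mathcal{F}_t^0, \mathcal{F}_t^n; \gamma_t^n) = L_{mse,f}^t \triangleq \left\| \Gamma_{\gamma_t^n}(\mathcal{F}_t^0) - \mathcal{F}_t^n \right\|_2^2$$
where $\mathcal{F}_t^0$ is the extracted first set of features, $\mathcal{F}_t^n$ is the extracted second set of features and $\Gamma_{\gamma_t^n}(\mathcal{F}_t^0)$ is the third set of features, i.e. the estimate of the second set obtained from the first set using the feature drift model.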
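A minimal sketch of a mean squared error feature drift loss is given below, assuming a single linear layer as the drift model. The single-layer form, the averaging over samples and the names `apply_linear` and `feature_drift_mse` are illustrative assumptions rather than the described implementation.

```python
def apply_linear(weights, bias, feature):
    """Apply one linear layer: weights is a list of rows, bias a list."""
    return [
        sum(w * v for w, v in zip(row, feature)) + b
        for row, b in zip(weights, bias)
    ]


def feature_drift_mse(weights, bias, feats_start, feats_now):
    """Squared error between drift-model estimates computed from the
    start-of-step features and the features extracted at the current
    optimisation stage, averaged over samples."""
    total = 0.0
    for f0, fn in zip(feats_start, feats_now):
        estimate = apply_linear(weights, bias, f0)
        total += sum((e - t) ** 2 for e, t in zip(estimate, fn))
    return total / len(feats_start)


feats_start = [[1.0, 0.0], [0.0, 1.0]]
feats_now = [[2.0, 0.0], [0.0, 2.0]]  # toy drift: every feature doubled
identity = [[1.0, 0.0], [0.0, 1.0]]
doubling = [[2.0, 0.0], [0.0, 2.0]]
loss_identity = feature_drift_mse(identity, [0.0, 0.0], feats_start, feats_now)
loss_doubling = feature_drift_mse(doubling, [0.0, 0.0], feats_start, feats_now)
```

In this toy case the doubling map reproduces the drift exactly, so its loss is lower than that of the identity map.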
The feature drift model may use a variational model such as a variational auto-encoder (VAE). Training the feature drift model may comprise updating the weights γ of the model to maximize the likelihood $p(F \in \mathcal{F}_t^n \mid F \in \mathcal{F}_t^0; \gamma)$ where $\mathcal{F}_t^0$ denotes the set of features extracted at the optimization step 0 (i.e. the extracted first set of features) and $\mathcal{F}_t^n$ denotes the set of features extracted using the feature extractor $f_\theta$ at the nth optimisation stage (i.e. the extracted second set of features). Using the expression $p(F \in \mathcal{F}_t^n \mid F \in \mathcal{F}_t^0; \gamma_t^n)$, we can estimate drifts more accurately at each incremental step and optimisation stage. In this arrangement, the loss function $L_f^t$ for feature drift may be expressed as:
$$L_f^t(\mathcal{F}_t^0, \mathcal{F}_t^n; \gamma_t^n) = \beta L_{rec,f}^t + (1 - \alpha) L_{kl,f}^t + (\alpha - \lambda_{info} - 1) L_{info,f}^t$$
where $L_{rec,f}^t$ is the reconstruction loss, $L_{kl,f}^t$ is the KL divergence loss, $L_{info,f}^t$ is the loss of the InfoVAE, and α, β and $\lambda_{info}$ are constants.
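The KL divergence term has a well-known closed form in the common case where the encoder outputs a diagonal Gaussian and the prior is a standard normal. The text does not fix either choice, so both are assumptions of this sketch, as is the function name `kl_to_standard_normal`.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mean, log_var)
    )


kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

The divergence vanishes when the posterior equals the prior and grows as the posterior mean moves away from zero.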
Inferring a set of prototypes of old classes using the semantic drift model may comprise training the semantic drift model using the available set of features; and using the trained semantic drift model to estimate the semantic drift on past data whereby the set of prototypes of old classes is inferred. Training may be done using any suitable technique and depends on the architecture of the machine learning model.
The semantic drift model may use a linear deep neural network (DNN), e.g. a multilayer perceptron (MLP). The semantic drift model may be learned by using a semantic drift loss function which minimises the error between the first set of old class prototypes and a third set of old class prototypes which are the estimate of these old class prototypes from the extracted first set of features using the semantic drift model. Merely as an example, the loss function $L_s^t$ for semantic drift may be expressed as:
$$L_s^t(\mathcal{F}_t^0, \Pi_{old}^{t,0}; \psi_t^n) = L_{mse,s}^t \triangleq \left\| \Psi_{\psi_t^n}(\mathcal{F}_t^0) - \Pi_{old}^{t,0} \right\|_2^2$$
where $\mathcal{F}_t^0$ is the extracted first set of features, $\Pi_{old}^{t,0}$ is the obtained first set of old class prototypes and $\Psi_{\psi_t^n}(\mathcal{F}_t^0)$ is the third set of old class prototypes, i.e. the estimate obtained from the first set of features using the semantic drift model.
The semantic drift model may use a variational model such as a variational auto-encoder (VAE). Training the semantic drift model may comprise updating the weights ψ of the model to maximize the likelihood $p(\pi \in \Pi_{old}^{t,n=0} \mid F \in \mathcal{F}_t^0; \psi)$ where $\mathcal{F}_t^0$ denotes the set of features extracted at the optimization step 0 (i.e. the extracted first set of features), $\Pi_{old}^{t,0}$ is the set of prototypes of old classes at the optimization step 0 (i.e. the first set of old class prototypes), F is the up-to-date feature representation and π is the up-to-date semantic representation. Using the expression $p(\pi \in \Pi_{old}^{t,n=0} \mid F \in \mathcal{F}_t^0; \psi_t^n)$, we can more accurately capture the drift at incremental steps and optimisation stages. In this arrangement, the loss function $L_s^t$ for semantic drift may be expressed as:
$$L_s^t(\mathcal{F}_t^0, \Pi_{old}^{t,0}; \psi_t^n) = L_{rec,s}^t + \lambda_{kld} L_{kld,s}^t$$
where $L_{rec,s}^t$ is the reconstruction loss, $L_{kld,s}^t$ is the KL divergence loss and $\lambda_{kld}$ is a constant set to 1.
The feature drift and semantic drift models may be jointly trained, for example by fusing them to estimate the distribution of revived evanescent representations $p(F \in \mathcal{F}_{old}^n)$ using either architecture. For this purpose, we optimise the model parameters by minimising a fusion loss which measures the discrepancy between the distributions estimated by the two models, $p(F \in \mathcal{F}_{old}^n; \gamma_t^n)$ and $p(F \in \mathcal{F}_{old}^n; \psi_t^n)$. A suitable fusion loss may be expressed as
$$L_{fus}^t = \left\| \Pi_{old,s}^{t,n} - \Pi_{old,f}^{t,n} \right\|_2^2 + \lambda_{corr} \left\| \rho(\Pi_{old,s}^{t,n}) - \rho(\Pi_{old,f}^{t,n}) \right\|_2^2$$
where the subscripts s and f denote the updated prototypes of old classes estimated by the semantic and feature drift models, respectively, $\|\cdot\|_2^2$ is the squared $\ell_2$ norm, $\lambda_{corr} > 0$ is the regularisation parameter and $\rho(\Pi)$ is the normalised correlation matrix of Π [3].
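A fusion loss of this kind can be sketched in a few lines. The sketch assumes that the normalised correlation matrix is the Pearson correlation between prototype vectors; the precise definition is the one given in reference [3] and may differ. All function names are hypothetical.

```python
import math


def correlation_matrix(protos):
    """Pearson correlation between prototype vectors (one per row).

    Assumes no prototype is constant across its dimensions, since that
    would make its norm zero."""
    centred = []
    for p in protos:
        mean = sum(p) / len(p)
        centred.append([v - mean for v in p])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centred]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centred, norms)]
        for ci, ni in zip(centred, norms)
    ]


def squared_l2(a, b):
    """Sum of squared element-wise differences between two matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def fusion_loss(protos_s, protos_f, lambda_corr):
    """Prototype discrepancy plus weighted correlation discrepancy."""
    return squared_l2(protos_s, protos_f) + lambda_corr * squared_l2(
        correlation_matrix(protos_s), correlation_matrix(protos_f)
    )


protos_s = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
protos_f = [[2.0, 3.0, 4.0], [4.0, 3.0, 2.0]]  # same shape, shifted by 1
loss_same = fusion_loss(protos_s, protos_s, lambda_corr=0.5)
loss_shifted = fusion_loss(protos_s, protos_f, lambda_corr=0.5)
```

In the shifted example only the first term contributes, because a constant shift leaves the correlation structure unchanged.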
Inferring at least one of a second set of old class prototypes and a set of old features using the feature drift model and the semantic drift model may comprise using the feature drift model and the semantic drift model jointly or individually. The feature drift model Γγ
As set out above, the feature drift model may use a linear deep neural network (DNN). In this example, the feature drift model enables direct tracking of the trajectory of the old class prototypes $\Pi_{old}^{t,n}$ as the optimisation proceeds. Thus, the second set of old class prototypes can be mapped from the first set of old class prototypes $\Pi_{old}^{t,0}$ at the start of the incremental step. The distribution of the features of the revived evanescent representations may then be approximated by a Gaussian distribution from the second set of old class prototypes.
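Approximating the feature distribution of old classes by a Gaussian around each updated prototype can be sketched as follows. The isotropic covariance, the fixed standard deviation `sigma` and the function name `revive_features` are assumptions; the text does not specify how the covariance is chosen.

```python
import random


def revive_features(prototypes, samples_per_class, sigma, seed=0):
    """Sample revived features from an isotropic Gaussian centred on each
    updated old-class prototype (isotropy and sigma are assumptions)."""
    rng = random.Random(seed)
    features, labels = [], []
    for label, proto in prototypes.items():
        for _ in range(samples_per_class):
            features.append([rng.gauss(m, sigma) for m in proto])
            labels.append(label)
    return features, labels


old_protos = {"dog": [2.0, 3.0], "cat": [-1.0, 0.5]}
revived, revived_labels = revive_features(old_protos, samples_per_class=2, sigma=0.1)
```

The revived features and their labels can then be mixed with the samples of the current step when the task network is updated.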
As set out above, the feature drift model may use a variational model and inferring the second set of old class prototypes may comprise using the trained feature drift model to compute the posterior probability $p(\pi \in \Pi_{old}^{t,n>0} \mid \pi \in \Pi_{old}^{t,n=0}; \gamma)$ where $\Pi_{old}^{t,0}$ is the first set of prototypes of old classes, $\Pi_{old}^{t,n}$ is the second set of prototypes of old classes at the step t, and γ are the weights of the feature drift model. In other words, the trained feature drift model may be used to approximate the probability that the set of features is a member of the set of features for the revived evanescent representations at the nth optimisation stage given the set of features is a member of the set of features for the revived evanescent representations at the start of the time step, i.e. $p(F \in \mathcal{F}_{old}^n \mid F \in \mathcal{F}_{old}^0)$. This distribution of the features of the revived evanescent representations may also be approximated by a Gaussian distribution from the second set of old class prototypes.
The semantic drift model Ψψt
As set out above, the semantic drift model may use a linear deep neural network (DNN). In this example, the semantic drift model enables direct tracking of the trajectory of the old class prototypes $\Pi_{old}^{t,n}$ as the optimisation proceeds. Thus, the second set of old class prototypes can be mapped from the extracted second set of features $\mathcal{F}_t^n$. The distribution of the features of the revived evanescent representations may then be approximated by a Gaussian distribution from the second set of old class prototypes.
As set out above, the semantic drift model may use a variational model and inferring the second set of old class prototypes may comprise using the trained semantic drift model to compute the posterior probability $p(\pi \in \Pi_{old}^{t,n>0} \mid F \in \mathcal{F}_t^n; \psi)$, where $\Pi_{old}^{t,n}$ is the set of prototypes updated at the nth epoch at the step t, $\mathcal{F}_t^n$ denotes the set of features extracted using the feature extractor $f_\theta$ at the nth optimisation stage, and ψ are the weights of the semantic drift model. In other words, the trained semantic drift model may be used to approximate $p(F \in \mathcal{F}_{old}^n \mid F \in \mathcal{F}_t^n)$. This distribution of the features of the revived evanescent representations may also be approximated by a Gaussian distribution from the second set of old class prototypes.
In other words, a linear model may be used for at least one of the feature drift model and the semantic drift model. Similarly, a non-linear model may be used for at least one of the feature drift model and the semantic drift model.
Using the inferred second set of old class prototypes to update the ML model may comprise training the feature extraction model and the classification model based on at least one of the inferred second set of old class prototypes and the inferred set of old features. The inferred second set of old class prototypes and the inferred set of old features may be used to generate the revived evanescent representation, i.e. the revived old samples. The training dataset for the updating step may include the data in the new dataset so that the feature extraction model and the classification model are trained on the new dataset with the revived evanescent representations. The training may be done by any suitable means but when training the feature extraction model and the classification model, the feature drift and semantic drift models are kept fixed.
The feature extraction model may be trained using a distillation loss which approximates the difference between the overall classification objective $L_{cc}^t$ and the loss $L_{pc}^t$ of the previous model $f_{\theta_t}$. The distillation loss may be expressed as the $\ell_2$ distance between the representations (i.e. sets of features) extracted from $\mathcal{D}_t$ using the current feature extraction model $f_\theta$ and those extracted using the previous feature extraction model.
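A feature distillation loss based on the ℓ2 distance between features from two feature extractors can be sketched as below. The averaging over samples, the use of a frozen copy of the earlier extractor and the name `feature_distillation_loss` are assumptions made for illustration.

```python
import math


def feature_distillation_loss(samples, current_extractor, previous_extractor):
    """Mean l2 distance between features of the same samples computed by
    the current extractor and by a frozen earlier copy (an assumption)."""
    total = 0.0
    for x in samples:
        cur = current_extractor(x)
        prev = previous_extractor(x)
        total += math.sqrt(sum((c - p) ** 2 for c, p in zip(cur, prev)))
    return total / len(samples)


# Toy extractors: the current one has drifted by (3, 4) from the earlier one.
def previous(x):
    return list(x)


def current(x):
    return [x[0] + 3.0, x[1] + 4.0]


batch = [[0.0, 0.0], [1.0, 1.0]]
loss_fkd = feature_distillation_loss(batch, current, previous)
```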
The classification model may be trained using a cross-entropy loss. The classification model may also be trained using a representation drift loss, for example a representation drift loss function Lrdt may be calculated using
where $y_F$ is the one-hot label vector of the class $c_F \in \mathcal{C}_{old}$ and $\mathcal{F}_{old}^{n,t}$ is the set of revived evanescent representations sampled from the estimated distribution using the updated prototypes of $\mathcal{C}_{old}$.
The feature extraction and classification models may be trained together. For example, the overall classification objective $L_{cc}^t$ computed at each step t may be calculated from:
$$L_{cc}^t(\mathcal{D}_t, \mathcal{F}_{old}^n; \theta_t, \phi_t) = L_{ce}^t + \lambda_{rd} L_{rd}^t + \lambda_{fkd} L_{fkd}^t,$$
where $L_{ce}^t$ is the cross-entropy loss, and $L_{rd}^t$ and $L_{fkd}^t$ are the representation drift loss and the distillation loss respectively, which are used only for t>0 with loss balancing parameters $\lambda_{rd} > 0$ and $\lambda_{fkd} > 0$.
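The way the three loss terms are combined, with the two additional terms active only after the first incremental step, can be sketched as follows. The softmax cross-entropy helper and both function names are illustrative assumptions, and the loss values passed in are placeholders.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


def overall_objective(step, ce, rd, fkd, lambda_rd, lambda_fkd):
    """Cross-entropy plus weighted representation drift and distillation
    terms; the two extra terms are used only for steps after the first."""
    if step == 0:
        return ce
    return ce + lambda_rd * rd + lambda_fkd * fkd


ce_uniform = cross_entropy([0.0, 0.0], 0)
first_step = overall_objective(0, 1.0, 5.0, 5.0, lambda_rd=0.5, lambda_fkd=0.5)
later_step = overall_objective(1, 1.0, 5.0, 5.0, lambda_rd=0.5, lambda_fkd=0.5)
```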
In another approach of the present techniques, there is provided a method of classifying a sample, the method comprising inputting a sample, e.g. an image; extracting a plurality of features from the sample using a feature extraction model of a classifier which has been trained according to the method described above; classifying the sample based on the extracted plurality of features using a classification model of a classifier which has been trained according to the method described above; and outputting the result of the classifying step to give a classification of the sample.
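The classification method amounts to a two-stage pipeline, which the sketch below illustrates with toy stand-ins. The callables `toy_extractor` and `toy_classifier`, the class names and the function `classify` are hypothetical placeholders for a trained feature extraction model and classification model.

```python
def classify(sample, feature_extractor, classifier, class_names):
    """Extract features, score each class, and return the best class name."""
    features = feature_extractor(sample)
    scores = classifier(features)
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best]


# Toy stand-ins: the "image" is a list of pixel intensities in [0, 1].
def toy_extractor(image):
    return [sum(image) / len(image)]  # single feature: mean intensity


def toy_classifier(features):
    return [1.0 - features[0], features[0]]  # scores for ["dark", "bright"]


label = classify([0.9, 0.8], toy_extractor, toy_classifier, ["dark", "bright"])
```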
In another approach of the present techniques, there is provided an apparatus for performing or modelling drift of learned representations for class-incremental learning using a machine learning, ML, model. The apparatus may further comprise at least one image capture device for capturing images or videos to be processed or classified by the trained ML model. The apparatus may further comprise at least one interface for providing a result of the processing by the ML model to a user of the apparatus.
The apparatus may be any one of: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a vehicle, a drone, an autonomous vehicle, a robot or robotic device, a robotic assistant, image capture system or device, an augmented reality system or device, a virtual reality system or device, a gaming system, an Internet of Things device, or a smart consumer device (such as a smart fridge). It will be understood that this is a non-exhaustive and non-limiting list of example apparatus.
In a related approach of the present techniques, there is provided a non-transitory data carrier carrying processor control code to implement the methods described herein.
As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
Furthermore, the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog® or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
In an embodiment, the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
The methods described above may be wholly or partly performed on an apparatus, i.e. an electronic device, using a machine learning or artificial intelligence model. The model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing. The artificial intelligence model may be obtained by training. Here, “obtained by training” means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
As mentioned above, the present techniques may be implemented using an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning. Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through calculation of a previous layer and an operation of a plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), transformers and deep Q-networks.
The learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
The application file contains at least one drawing executed in color. Copies of this patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Implementations of the present techniques will now be described, by way of example only, with reference to the accompanying drawings, in which:
As background, it is useful to understand the terms concept drift and representation drift.
Concept drift refers to the change of the input-output relationship (statistics) undergone by data seen by a ML framework. When the ML model is trained on a dataset, it might suffer from performance degradation if applied to data with distribution evolving over time. Concept drift does not necessarily involve a change in model parameters but focuses on how data distribution has drifted from each training phase of the ML model.
Representation drift indicates how feature representations learned by a ML model are constantly evolving throughout its training procedure. When optimizing a ML model with training samples, model parameters are updated to provide feature representations that better help to solve the target task on the available samples. In a continual learning framework, the constantly changing feature representations used to better model the current training data distribution lead to catastrophic forgetting as illustrated in
As explained in more detail below, the proposed new framework may be considered to learn the relationship and drift between old and new classes in order to re-estimate the semantic representations (e.g. prototypes) of old classes using the continually evolving feature representations. We show that our learnable framework modeling representation drift provides models with higher capacity, and leads to overall improved classification performance. For example, experiments on multiple benchmarks show that the proposed approach achieves SotA results for the CIL problem. The proposed new framework uses CIL and more background on the learning representations for CIL is set out below.
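At each step $t \in [T] = \{0, 1, \ldots, T\}$ of CIL, we are given a dataset $\mathcal{D}_t = (\mathcal{X}_t, \mathcal{Y}_t)$, where $\mathcal{X}_t = \{x_{t,j}\}_{j=1}^{N_t}$ and $\mathcal{Y}_t = \{y_{t,j} \in \mathcal{C}_t\}_{j=1}^{N_t}$ are the input samples and their labels, $\mathcal{C}_t$ is the set of class labels observed at this step, and $\mathcal{C}_t \cap \mathcal{C}_{t'} = \emptyset$, $\forall t \neq t'$.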
As in other popularly employed CIL models [43], the proposed framework is composed of a feature extraction model fθ∈ and a classification model hϕ∈
with parameters θ∈Θ and ϕ∈Φ. An example of such a CIL model is shown in
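where $L_{cc}^{t} \triangleq L_{\mu}(\mathcal{D}_t; \theta_t, \phi_t)$ is the expected loss of the model $g_t \triangleq h_{\phi_t} \circ f_{\theta_t}$ on $\mathcal{D}_t$ sampled from a distribution $\mu_t$ at step $t$. The slack variable $\epsilon_{t'} \geq L_{cp,t'} - L_{pp,t'}$ measures the difference between the loss of the current model on the previous datasets $\{\mathcal{D}_{t'}\}_{t'=0}^{t-1}$, denoted by $L_{cp,t'} \triangleq l(g_t(\mathcal{D}_{t'}))$, and the loss of the previous models $\{g_{t'}\}_{t'=0}^{t-1}$ on the previous datasets $\{\mathcal{D}_{t'}\}_{t'=0}^{t-1}$, denoted by $L_{pp,t'} \triangleq l(g_{t'}(\mathcal{D}_{t'}))$. The variable $\epsilon_{t'}$ controls forgetting of representations of old classes.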
Generative classifiers, such as Bayesian networks or supervised variational autoencoders (VAEs), implementing hϕ∈ optimize equation (1) to model joint distribution of classes and features p(C, F;
) where
=Θ∪Φ is the set of parameters of the model. Discriminative classifiers, such as a softmax classifier, optimize equation (1) to model p(C|F;
).
In CIL, we do not have access to samples {t′}t′=0t−1 at time t. CIL methods aim to model p(C, F|
t) without using {
t′}t′=0t−1, where
In order to elucidate the dynamics of models used for CIL in this setting, we factorize the class posterior probability p(C∈|F∈
) by
where PC, PA: =P1+P2+P3+P4, and PB: =P12+P13+P14+P24+P34 are expressed in Table 1 shown in t is dropped from the statements when dependency on
t is trivial.
Deep learning models have been employed to estimate and model these probabilities implicitly at different parts of the models by learning feature representations of old and new classes, and make predictions for the new classes as follows:
p(C∈|F∈
)∝PA−PB. (3)
As explained in more detail below, to address this problem and bring the evanescent representations to life, we employ class prototypes (templates or prototypical representations; the terms can be used interchangeably) $\pi \in \Pi_{old}$. These class prototypes can be obtained using any standard technique, for example as taught in [43], and may be regarded as semantic representations of $\mathcal{C}_{old}$. We leverage prototypes at the beginning of an incremental step to model their distribution $p(F \in \mathcal{F}_{old} \mid \pi_c \in \Pi_{old})$, where $\pi_c$ is the class prototype for the class $c$. We estimate the distribution as described in more detail below, then we update $p(F \in \mathcal{F}_{old})$ throughout the incremental step by modeling the representation drift to revive evanescent representations.
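In a first step, the datasets $\{\mathcal{D}_t = (\mathcal{X}_t, \mathcal{Y}_t)\}_{t=0}^{T}$ are obtained, where $\mathcal{X}_t = \{x_{t,j}\}_{j=1}^{N_t}$, $\mathcal{Y}_t = \{y_{t,j} \in \mathcal{C}_t\}_{j=1}^{N_t}$, $\mathcal{C}_t$ is the set of class labels observed at this step, $N_t$ is the number of samples per step and $\mathcal{C}_t \cap \mathcal{C}_{t'} = \emptyset$, $\forall t \neq t'$. This first step is shown as S200 in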
and a classification model hϕ∈
.
As shown in the algorithm and also in step S202, the next step is to train both the feature extraction model fθ0. Once the model has been trained for time step 0, the class prototype(s) Πnew0 can be computed and then we initialise Πoldt=1,0 as Πnew0. The loss function and the computation of the class prototypes are described in more detail below. Steps S202 and S204 may thus be considered an initialisation phase.
The following processes are then repeated for each step $t > 0$. As shown at step S206, the set of features at the beginning of the step, $\mathcal{F}_t^0$, and the set of features at the nth optimisation stage of the step, $\mathcal{F}_t^n$, are extracted or computed using the trained feature extractor model and the new class data $\mathcal{D}_t$ for that time step. As shown in steps S208 and S210, the relationship between old and new semantic and feature representations is modeled using two models. A feature drift model $\Gamma_\gamma$ is parameterized by $\gamma$ and is used to represent feature drift. A semantic drift model $\Psi_\psi$ is parameterized by $\psi$ and is used to represent semantic drift. The parameters $\gamma$ and $\psi$ may be optimized in n epochs (or stages) using an optimization algorithm, while training models at each incremental step to estimate the relationship among representations. This first phase may be termed a learning phase, and the models Γγt
Once the models $\Gamma_\gamma$ and $\Psi_\psi$ have optimized parameters, they are used to estimate the revived evanescent representations (RERs) as shown at step S212. The method of estimating the revived evanescent representations depends on the architecture as explained in more detail below. For example, the revived evanescent representations may be estimated by first estimating $\mathcal{F}_{old}^{n}$, which is the set of features for the revived evanescent representations at the nth optimisation stage. As explained in more detail below, $\mathcal{F}_{old}^{n}$ may be estimated by using the semantic drift model and feature drift model individually or fused. Then, the old class prototypes $\Pi_{old}^{t,n}$ may be computed by class-wise averaging features sampled from $p(F \in \mathcal{F}_{old}^{n})$. This phase may be termed the inference phase.
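The class-wise averaging used in the inference phase can be illustrated with a minimal Python sketch. It assumes an isotropic Gaussian per old class (a scalar standard deviation per class) and omits the drift models entirely, so it shows only the sampling and averaging steps. The function names `sample_old_features` and `classwise_prototypes`, and the toy prototypes, are hypothetical and are not taken from the description.

```python
import random


def sample_old_features(prototypes, sigmas, n_per_class, seed=0):
    """Draw n_per_class feature vectors per old class from N(pi_c, sigma_c).

    Assumption: sigma_c is a scalar applied to every feature channel.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for c, pi_c in prototypes.items():
        for _ in range(n_per_class):
            features.append([rng.gauss(m, sigmas[c]) for m in pi_c])
            labels.append(c)
    return features, labels


def classwise_prototypes(features, labels):
    """Prototype of a class = mean of the feature vectors with that label."""
    sums, counts = {}, {}
    for f, c in zip(features, labels):
        if c not in sums:
            sums[c] = [0.0] * len(f)
            counts[c] = 0
        sums[c] = [s + v for s, v in zip(sums[c], f)]
        counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


# Toy old-class prototypes (2 feature channels) and their standard deviations.
prototypes = {"dog": [1.0, -2.0], "cat": [0.5, 3.0]}
sigmas = {"dog": 0.1, "cat": 0.2}

feats, labs = sample_old_features(prototypes, sigmas, n_per_class=500)
updated = classwise_prototypes(feats, labs)
```

In the described method the sampled features would first pass through the trained drift model(s); here the averaged prototypes simply land close to the prototypes they were sampled around.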
The next step S214 is then to train the feature extraction model fθt=
t∪
oldn. As for the semantic and feature drift models, the parameters of the feature extraction and classification models may be optimized in n epochs (or stages) using an optimization algorithm. The training may be done using any suitable technique as described in more detail below.
As shown in steps S208, S210 and S214, there is alternating training of the semantic drift and feature drift models, and the feature extraction and classification models for each optimisation stage. As shown in step S216, there is an inner loop within the algorithm to include an optional optimisation at each incremental time step t. The optimisation may be over N stages (also termed epochs). If n<N, there is an iteration through steps S206 to S214 to train the semantic drift and feature drift models Γγt
In other words, at the end of each optimization stage n, the feature extraction model fθ
Once there are no more stages, i.e. for n=N, as shown at step S216, there is a final learning phase for the incremental step in which the semantic drift and feature drift models Γγtt} by class-wise average of feature representations fθ
t) of input samples and re-initialise the complete set of class prototypes for the next time step as Πoldt+1,0=Πoldt,n∪Πnewt, where Πoldt,n=Ø for t=0. In other words, there is another inference phase to obtain the set of class prototypes which contains the newly obtained class prototypes and the previously obtained set of old class prototypes. As shown at step S224, if there are more time steps there is an iteration through steps S206 to S220 to train the semantic drift and feature drift models Ψψt
As shown in
In other words, the feature learning model may be represented by fnew,ti which is an evolving feature representation on a sample Xnewi from a new class. Although the example given above is in the field of image classification, the sample may be an image, text data or audio data. As shown in
{Fnew,0i}i↔{Fnew,ti}
An updated prototype (also termed semantic representation) for the revived evanescent representations which is learnt from this modelling may be expressed as:
{πold,tj}jtr j
The semantic drift is learnt from the available prototypes as denoted by:
{Fnew,0i}i↔{πold,0j}j
The updated prototype for the revived evanescent representations which is learnt from this modelling may be expressed as:
{πold,tj}jcls
The updated prototype for the revived evanescent representations which is learnt from both types of modelling may converge to a unique representation which is expressed as:
{πold,0j}j
The following subsections present the method proposed for modelling drifts. To identify and train the feature drift and semantic drift models Γγ and Ψψ, two different core architectures are proposed. A first core architecture which may be used is a Gaussian model (GM) in which the class conditional feature distribution is modelled as a parametric Gaussian curve. The second core architecture which may be used is a variational model (VM) which comprises a conditional variational autoencoder (VAE) having an encoder and a decoder separated by a latent space. Such GM and VM models are parametrized by deep neural networks (DNNs). Although Gaussian processes can be used for VMs [14], we consider GMs and VMs individually to explicate the variational structure of VMs.
t0→
tn. As shown in
t (i.e., the available training set at step t). The architecture may be summarised in the table below.
In a second variation shown in tn), conditioned on the representations of the same samples at the beginning of the current incremental step (
t0). The encoder and decoders are composed of two FC layers each. We perform conditioning in input and latent spaces by concatenation along the channel dimension. The input, output and conditioning variables of Γγt
t and C to the cardinality of Coldt. The architecture may be summarised in the table below.
t as representations revive and evolve throughout the step t>0. At stage n>0, we extract the set of features at the beginning of the time step
t0 and the set of features extracted after n>0 optimisation stages to train the model Γγt
t0 and
tn. Both
t0 and
tn are thus part of the available information on the new class features. The model weights (also known as parameters γ) for the first model Γγ are updated in this first phase.
When using a GM as shown in t0→
tn by the MLP. Then, Γγt
When using a VM as shown in tn|F∈
t0; γtn). Thereby, we can statistically model the FD across different stages [n] at a given step t.
t. In this second phase, we exploit the trained model to infer the feature drift undergone by features of
old and distribution p(F∈
oldn) of revived evanescent representations.
As shown in oldn) of revived evanescent representations at stage n by a Gaussian distribution which is schematically illustrated in
oldn: πc∈Πold,ft,n)˜N(πc,σc), where N is the normal distribution and σc is the standard deviation estimated at step t′ when c∈Ct′. As illustrated in
As shown in oldn|F∈
$\mathcal{F}_{old}^{0}$). At stage $n = 0$, we resort to $p(F \in \mathcal{F}_{old}^{0})$ because no feature drift has to be estimated, and we model the distribution of revived evanescent representations by $p(F \in \mathcal{F}_{old}^{0}) \propto p(F \in \mathcal{F}_{old}^{n} : \pi_c \in \Pi_{old}^{t,0}) \sim \mathcal{N}(\pi_c, \sigma_c)$. That is, the distribution $p(F \in \mathcal{F}_{old}^{0})$ is approximated by the empirical model $p(F \in \mathcal{F}_{old}^{n} : \pi_c \in \Pi_{old}^{t,0})$ identified by the prototypes $\pi_c \in \Pi_{old}^{t,0}$, which is estimated by a Gaussian model $\mathcal{N}(\pi_c, \sigma_c)$ with mean set to the prototype $\pi_c$ and with standard deviation $\sigma_c$. At $n > 0$, the training features are sampled from the posterior probability $p(F \in \mathcal{F}_{old}^{n} \mid F \in \mathcal{F}_{old}^{0}; \gamma_t^n) \cdot p(F \in \mathcal{F}_{old}^{0})$. The standard deviation is kept fixed at every step $t'' > t'$. The input of the model is the prototype set $\Pi_{old}^{t,0}$ and noise $z \sim \mathcal{N}(0, I)$ drawn from the normal distribution with zero mean and unit variance. The output of the model is the set of features for the revived evanescent representations, i.e. $\mathcal{F}_{old}^{n}$.
In other words, in the second phase the prototypes of the old classes are modelled using the model learnt in the first phase. The model $\Gamma_\gamma$ is thus implemented in the second phase, which uses a probabilistic framework to estimate feature drift on the unavailable past data. The model weights for the first model $\Gamma_\gamma$ are kept fixed in phase 2. In this example, in the first phase, the inputs of the model are the feature set $\mathcal{F}_t^0$ and the feature set $\mathcal{F}_t^n$, and both of these inputs are available. The output of the model is the set of reconstructed features for the nth stage, $\mathcal{F}_t^n$. In the second phase, the input of the model is the prototype set $\Pi_{old}^{t,n=0}$, which is available data, and the output of the model is the feature set for the revived evanescent representations and/or the old prototypes. Both of these outputs are inferred by the model.
n0→Πoldt,0. As shown in
oldt feature vectors of dimension D both arranged in a D×B and a D×
oldt matrix, respectively.
oldt denotes the number of past classes present at the current incremental step t>0 and C is set equal to the cardinality of
oldt. D is the number of feature channels at the output of the feature extractor (which is set to 512). In addition, the number of output channels of the first FC layer is set to 2*B and B is set equal to the cardinality of
t (i.e., the available training set at step t). The architecture may be summarised in the table below.
In a second variation shown in old0) whose distribution is approximated as p(
old0; Πoldt,0)˜N(πc,σc). The function is conditioned on the new classes
t0 which can be extracted from the dataset. The encoder and decoder are composed of two FC layers each. We perform conditioning in input and latent spaces by concatenation along the channel dimension. The input, output and conditioning variables of Ψψt
t and C to the cardinality of Coldt. The architecture may be summarised in the table below.
t0 and train a network or semantic drift model Ψψo
t0 and the set of features of the evanescent representations at the beginning of the time step
$\mathcal{F}_{old}^{0}$. We employ the prototypes $\pi \in \Pi_{old}^{t,0}$ to model $p(F \in \mathcal{F}_{old}^{0}) \propto p(F \in \mathcal{F}_{old}^{n} : \pi_c \in \Pi_{old}^{t,0}) \sim \mathcal{N}(\pi_c, \sigma_c)$. The old class prototypes $\Pi_{old}^{t,0}$ at stage 0 are known. As opposed to the feature drift model, the semantic drift model captures the semantic drift observed at each new step, and thus an individual model $\Psi_\psi$ is optimised at the start and fixed for the rest of the step (i.e. $\psi_t^n = \psi_t^0$). As explained in more detail below, when model fusion is adopted, the semantic drift model $\Psi_\psi$ is instead re-trained once per stage, to account for the feature drift estimated by the feature drift model.
When using a GM as shown in t0 to the old prototypes at the beginning of the incremental step. In other words, we identify Ψψt
t0→Πoldt,0. Then the semantic drift model is trained to model the semantic drift between representations for classes available at the current step
t and those experienced in the past
old. The model is learned using any suitable technique as explained in more detail below.
In the non-linear implementation shown in oldn) of evanescent representations revived at stage n=0 by p(F∈
oldn: πc∈Πoldt,0)˜N(πc,σc). Then a conditional VM (e.g. a VAE) is trained by adjusting the weights to maximise the probability that a set of features are within the features of evanescent representations revived at stage n>0 given that the set of features are within the features of the new class features at stage 0, i.e. maximising the likelihood p(F∈
oldn|F∈
t0; γ). The training can be done using any suitable technique.
old to the representations for the current classes
t. The drift is captured at the beginning of the current step t, when up-to-date representations of both sets are available. We now exploit the trained semantic drift model Ψψt
oldn).
tn→Πoldt,n is trained to estimate the relationship between feature and prototypical representations at stage n. In a similar manner to that described for
oldn) of revived evanescent representations at stage n by a Gaussian distribution p(F∈
oldn: πc∈Πold,st,0)˜N(πc,σc) where N is the normal distribution and σc is the standard deviation estimated at step t′ when c∈Ct′.
As shown in oldn|F∈
tn)∀n≥0. To generate training feature samples, we perform inference using p(F∈
oldn|F∈
old0; ψtn)·p(F∈
tn), where the feature set
tn is provided by the feature extraction model fθ
t.
Thus, as shown in t0 and the output of the model is the prototype set Πoldt,n=0. In the first phase, the model weights for the second model Ψψ⋅ are updated using this input and output. In other words, the semantic drift is learnt. In the second phase, the model Ψψ is implemented. The input of the model is the feature set
tn and the output of the model is the prototype set Πoldt,n>0 which is inferred using the model. The model weights for the first model Γγ are fixed during the second phase.
Thus, as shown in tn=0 and the prototype set Πoldt,n=0. The output of the model is the set of reconstructed prototypes
tn and noise z˜N(0, I) is drawn from the Normal distribution with zero mean and unit variance. The output of the model is the old class features
oldn which are inferred by the second model. The model weights for the second model Ψψ are fixed during the second phase. For this semantic branch we use a conditional variational auto-encoder but it is noted that in
old0 are not known but the old class prototypes at stage 0 are known and denoted by Πoldt,0.
old0 at beginning of time step t. These are used to train the semantic drift model as described below.
A new class dataset t is obtained or input at step S502. In step S504, the trained feature extraction model fθ
t0. Similarly, in step S506, the feature representations at the nth optimisation stage of the time step, i.e.
tn, are extracted using the feature extraction model fθ
As shown in step S508, the feature drift model Γγttn. The loss function Lft for feature drift may be expressed as:
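$$L_f^t(\mathcal{F}_t^0, \mathcal{F}_t^n; \gamma_t^n) = L_{mse,f}^t = \lVert \Gamma_{\gamma_t^n}(\mathcal{F}_t^0) - \mathcal{F}_t^n \rVert_2^2$$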
Similarly, the training of the VM in
$$L_f^t(\mathcal{F}_t^0, \mathcal{F}_t^n; \gamma_t^n) = \beta L_{rec,f}^t + (1 - \alpha) L_{kl,f}^t + (\alpha - \lambda_{info} - 1) L_{info,f}^t$$
where $L_{rec,f}^t$ is the reconstruction loss, $L_{kl,f}^t$ is the KL divergence loss and $L_{info,f}^t$ is the loss of the InfoVAE. $\alpha$, $\beta$ and $\lambda_{info}$ are constants, with $\beta$ set to 1e1 in all experiments, $\lambda_{info}$ set to 1e1 and $\alpha$ set to −1e1 for the CIFAR100 dataset, and $\lambda_{info}$ set to 1e2 and $\alpha$ set to −1e2 for the TinyImageNet and CUB200 datasets.
As shown in step S510, the semantic drift model Ψψt
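$$L_s^t(\mathcal{F}_t^0, \Pi_{old}^{t,0}; \psi_t^n) = L_{mse,s}^t = \lVert \Psi_{\psi_t^n}(\mathcal{F}_t^0) - \Pi_{old}^{t,0} \rVert_2^2$$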
In the non-linear implementation shown in
$$L_s^t(\mathcal{F}_t^0, \Pi_{old}^{t,0}; \psi_t^n) = L_{rec,s}^t + \lambda_{kld} L_{kld,s}^t$$
where $L_{rec,s}^t$ is the reconstruction loss, $L_{kld,s}^t$ is the KL divergence loss and $\lambda_{kld}$ is a constant set to 1 in all experiments, i.e. for the CIFAR100, TinyImageNet and CUB200 datasets.
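As an illustration of how a reconstruction term and a KL divergence term combine in a loss of this kind, the following Python sketch may help. It assumes a mean squared error reconstruction term and a diagonal Gaussian posterior compared against a standard normal prior; neither choice is stated in the description, and the function names are hypothetical.

```python
import math


def reconstruction_loss(target, output):
    """Mean squared error between target and reconstructed vectors (assumed form)."""
    return sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def semantic_vm_loss(target, output, mu, log_var, lambda_kld=1.0):
    """Weighted sum: reconstruction term plus lambda_kld times the KL term."""
    return reconstruction_loss(target, output) + lambda_kld * kl_to_standard_normal(mu, log_var)


# Perfect reconstruction with a posterior equal to the prior gives zero loss.
zero_loss = semantic_vm_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [0.0, 0.0])

# An imperfect reconstruction with a shifted posterior mean gives a positive loss.
some_loss = semantic_vm_loss([1.0, 2.0], [1.0, 4.0], [1.0, 0.0], [0.0, 0.0])
```

With the weight set to 1, as in the experiments described above, both terms contribute equally to the objective.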
At step S512, there is an optional step of fusing the feature and semantic drift models. The fusion may be done by joint training or as explained in relation to the inference step by fusing the output of separately trained models. In the example of Lf(Ft0, Ftn; γtn) and Lst
Ls(Ft0, Πoldt,0; ψtn) may be used to denote the objectives used to individually train feature and semantic drift models Γγt
The outputs of Γγt
$$L_{fus}^t = \lVert \Pi_{old,s}^{t,n} - \Pi_{old,f}^{t,n} \rVert_2^2 + \lambda_{corr} \lVert \rho(\Pi_{old,s}^{t,n}) - \rho(\Pi_{old,f}^{t,n}) \rVert_2^2$$
where the subscripts $s$ and $f$ denote the updated prototypes of old classes estimated by the semantic and feature drift models, respectively, $\lVert \cdot \rVert_2^2$ is the squared $\ell_2$ norm, $\lambda_{corr} > 0$ is the regularisation parameter and $\rho(\Pi)$ is the normalised correlation matrix of $\Pi$ [3]. Finally, the renovated distributions $p(F \in \mathcal{F}_{old}^{n}; \gamma_t^n)$ and $p(F \in \mathcal{F}_{old}^{n}; \psi_t^n)$ are linearly combined with equal weights to obtain $p(F \in \mathcal{F}_{old}^{n})$.
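The two terms of the fusion loss can be sketched in plain Python as follows. The exact definition of the normalised correlation matrix is given in [3] and is not reproduced here; the sketch assumes, purely for illustration, that it is the matrix of pairwise cosine similarities between prototype rows. The function names are hypothetical.

```python
import math


def sq_l2(a, b):
    """Squared l2 (Frobenius) distance between two equally shaped matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def normalised_correlation(protos):
    """Assumed form of rho: cosine similarity between every pair of prototypes."""
    norms = [math.sqrt(sum(v * v for v in p)) or 1.0 for p in protos]
    return [
        [sum(x * y for x, y in zip(p, q)) / (norms[i] * norms[j])
         for j, q in enumerate(protos)]
        for i, p in enumerate(protos)
    ]


def fusion_loss(protos_s, protos_f, lambda_corr):
    """Prototype agreement term plus a weighted correlation-structure term."""
    agreement = sq_l2(protos_s, protos_f)
    structure = sq_l2(normalised_correlation(protos_s),
                      normalised_correlation(protos_f))
    return agreement + lambda_corr * structure


# One row per old class: estimates from the semantic and feature drift models.
protos_s = [[1.0, 0.0], [0.0, 1.0]]
protos_f = [[1.0, 0.0], [0.0, 2.0]]
loss = fusion_loss(protos_s, protos_f, lambda_corr=0.1)
```

In this toy case the second prototype differs only in scale, so the correlation structures agree and only the first term contributes.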
Then, the overall objective used to learn representation drift may be defined by:
$$L_{drift}^t = L_s^t + L_f^t + \lambda_{fus} L_{fus}^t$$
where $\lambda_{fus} > 0$ is the loss balancing parameter. The values of $\lambda_{fus}$ and $\lambda_{corr}$ may be experimentally finetuned for each drift model configuration, dataset and incremental set-up as explained in more detail below. In particular, we perform a grid search such that $\lambda_{fus}, \lambda_{corr} \in \{$1e2, 1e1, 1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5$\}$ and select the best value combination. We note that $L_{drift}^t$ measures the loss of current models on inferred representations of old classes. Thereby, by training models that optimise $L_{drift}^t$, we aim at reducing forgetting ($\epsilon_t$), i.e. the discrepancy between RERs as estimated by drift models and their evanescent (unavailable) counterparts.
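The grid search over the two balancing parameters can be sketched as below. The description does not state the selection criterion, so the sketch assumes a caller-supplied scoring function where higher is better (for example a validation accuracy); `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a stand-in that does not train anything.

```python
import itertools

# Candidate values for both lambda_fus and lambda_corr.
GRID = [1e2, 1e1, 1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]


def grid_search(evaluate, grid=GRID):
    """Return the (lambda_fus, lambda_corr) pair with the highest score."""
    return max(itertools.product(grid, repeat=2),
               key=lambda pair: evaluate(*pair))


def toy_score(lambda_fus, lambda_corr):
    """Stand-in score peaking at lambda_fus = 1e0 and lambda_corr = 1e-2."""
    return -(abs(lambda_fus - 1e0) + abs(lambda_corr - 1e-2))


best_fus, best_corr = grid_search(toy_score)
```

The search evaluates all 64 combinations, which is practical here because the drift models are lightweight.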
In all the aforementioned setups, we may employ the Adam optimiser, for example as described in [4], with a fixed learning rate η and train until convergence by performing early stopping; that is, the model is trained until the loss function does not change for a predefined constant number of steps τ=25. We may also experimentally finetune the value of the learning rate η∈{1e-3, 1e-4, 1e-5} for each drift model configuration, dataset and incremental setup. Finally, we may also apply weight normalisation, for example as described in [10], to Ψψt
old0 are not known but the old class prototypes at stage 0 are known and denoted by Πoldt,0. Like
old0 at beginning of time step t. These are used to infer the new features using the feature drift model as described below.
The new class dataset t is obtained or input at step S602. In step S604, the trained feature extraction model fθ
t0. In step S606, the feature extraction model fθ
tn. We can enhance the training objective by a distillation loss Lfkdt
Lfkd(
t), for example as described in [8], to reduce the entity of representation drift across incremental tasks. The distillation loss Lfkdt is defined by the
2 distance between representations extracted from
t using the current feature extraction model fθ
t. The distillation loss may be expressed as:
Lpct is not explicitly mentioned in the equation above but provides information regarding shareability of representations among consecutive steps t−1 and t. Therefore, models optimising Lfkdt can make use of the feature shareability for learning drifts.
The new class features tn can be used in step S610 to infer the distribution of revived evanescent representations p(F∈
oldn) using the trained model Ψψt
At step S616, the classification model hϕLce(
t) may be used. As an example, the cross entropy loss may be a standard function such as that defined in the book “Deep Learning” by Goodfellow et al. published by MIT. When t>0, to mitigate forgetting of previous tasks, we generate features of the representations of the old classes
old, by modeling the drift and estimating the distribution p(F∈
old). The representation drift loss function Lrdt may be then calculated using
where yF is the one-hot label vector of the class cF∈old and
oldn,t is the set of revived evanescent representations sampled from the estimated distribution using the updated prototypes of
old. In other words, F∈Foldn,t is estimated by modelling representation drift as described above. The loss Lrdt approximates Lμ(
t; θt, ϕt)≤ϵt′ of gt on the previous datasets {
t′}t′=0t−1 using their inferred representations.
In the flowchart, the feature extraction model and the classification model are shown as being trained at separate steps but it will be appreciated that the models may be trained together. Then, the overall classification objective Lcct computed at each step t is:
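$$L_{cc}^t(\mathcal{D}_t, \mathcal{F}_{old}^{n}; \theta_t, \phi_t) = L_{ce}^t + \lambda_{rd} L_{rd}^t + \lambda_{fkd} L_{fkd}^t,$$
where $L_{rd}^t$ and $L_{fkd}^t$ are used only for $t > 0$ with loss balancing parameters $\lambda_{rd} > 0$ and $\lambda_{fkd} > 0$. An alternative expression is: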
where θt,ϕt are the parameters of the current feature extraction model and classification model, respectively.
In summary, the first phase shown in
where Ldriftt is the loss of the current model on inferred representations of old classes, t0 denotes the set of features extracted using a feature extractor fΘ
tn denotes the set of features extracted using fΘ
t, Πoldt,0 is the set of prototypes of old classes Cold at the optimization step 0, Πoldt,n is the set of prototypes updated at the nth epoch at the step t, γtn denotes the parameters of the feature drift model at time step t updated with n>0 optimisation stages and ψtn denotes the parameters of the semantic drift model at time step t updated with n>0 optimisation stages. It is noted that
$\mathcal{F}_t^0$ and $\mathcal{F}_t^n$ contain only representations of new classes $\mathcal{C}_t$ because only $\mathcal{D}_t$ is available at the step $t$. Gradient backpropagation may be used for training both models. It is noted that the convergence criterion for a model is early stopping, i.e. the optimisation of model parameters is stopped if the training loss does not change for a fixed/set number of steps.
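The convergence criterion can be sketched as a small training loop wrapper. The tolerance `tol`, the safety cap `max_steps` and the callback interface `step_fn` are assumptions introduced for illustration; the description only states that training stops once the loss has not changed for a set number of steps.

```python
def train_until_converged(step_fn, patience=25, tol=1e-6, max_steps=10_000):
    """Call step_fn() until the loss is unchanged for `patience` steps in a row.

    step_fn performs one optimisation step and returns the training loss.
    Returns the number of steps taken and the last loss value.
    """
    previous, loss, unchanged, steps = None, None, 0, 0
    while steps < max_steps:
        loss = step_fn()
        steps += 1
        if previous is not None and abs(loss - previous) <= tol:
            unchanged += 1
            if unchanged >= patience:
                break
        else:
            unchanged = 0
        previous = loss
    return steps, loss


def make_toy_step():
    """Stand-in for an optimiser step: the loss falls from 10 to 0, then stays flat."""
    state = {"k": 0}

    def step():
        loss = float(max(10 - state["k"], 0))
        state["k"] += 1
        return loss

    return step


steps_taken, final_loss = train_until_converged(make_toy_step(), patience=25)
```

The counter is reset whenever the loss moves by more than the tolerance, so only a sustained plateau ends training.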
The second phase shown in =
t∪Foldn for example by a loss function which may be represented as:
The apparatus comprises at least one processor 102 coupled to memory 104. The at least one processor 102 may comprise one or more of: a microprocessor, a microcontroller, and an integrated circuit. The memory 104 may comprise volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.
The apparatus may further comprise at least one image capture device 108 for capturing images or videos to be processed by the ML model 106. The ML model may be an image classification model such as that shown in
The apparatus may further comprise at least one interface 110 for providing a result of the processing by the ML model to a user of the apparatus. For example, the apparatus 100 may comprise a display screen to receive user inputs and to display the results of implementing the ML model 106.
We evaluate our approach on multiple standard CIL benchmarks, including CIFAR100 [17], TinyImageNet [24] and CUB200-2011 [33] datasets.
We devise 3 class-incremental setups, where first the framework is trained on half of the available semantic classes (except for one setup on CIFAR 100, where only 40 classes are selected as the first task), and then the remaining class set is evenly divided into respectively 5, 10 or 20 incremental steps (or phases—the terms may be used interchangeably). Class order is selected randomly and then fixed for every class split.
ResNet-18 [10] is used as a backbone. The model is trained for 100 epochs (i.e. N=100 and each stage corresponds to one epoch over $\mathcal{D}_t$) at each incremental step with the Adam optimizer. The learning rate is initially set to 0.001 on CIFAR100 and TinyImageNet and to 0.0001 on CUB200-2011. It is decreased by a factor of 0.1 after 45 and 90 epochs, for example as suggested in [43]. Images are cropped to 32×32, 64×64 and 256×256 for CIFAR100, TinyImageNet and CUB200-2011, respectively, and randomly flipped. We apply input and label augmentation, for example as described in [43]. We set the batch size to 64, and λfkd=10 and λrd=10 in all experiments.
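The step-wise learning rate schedule described above can be written as a short function. The sketch assumes zero-based epoch indices, with each reduction taking effect from epoch index 45 and 90 onwards; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(45, 90), factor=0.1):
    """Learning rate at a given epoch: base_lr scaled by factor per milestone passed."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Schedule over the 100 epochs of one incremental step (CIFAR100 / TinyImageNet setting).
schedule = [learning_rate(e) for e in range(100)]

# CUB200-2011 setting with the smaller initial learning rate.
schedule_cub = [learning_rate(e, base_lr=1e-4) for e in range(100)]
```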
We employed lightweight DNNs to identify networks of Γγ and Ψψ for modeling feature and representation drifts. In particular, we investigate both the use of GMs with a multi-layer perceptron (MLP), which may have only two layers, and generative models in the form of a conditional VAE [30, 42] to implement VMs, whereby hyperparameters are experimentally tuned. In the following sections we will denote a multi-layer perceptron by MLP and a generative model by VAE. The VAE may have encoder and decoder modules each composed of a couple of FC layers.
The fusion loss Lfust defined above may be implemented by two methods for an ablation:
We compare our approach with many methods storing exemplars of old classes (EEIL [4], iCarl [25], UCIR [12]) and other state-of-the-art approaches that instead avoid their use (EWC [16], LwF [19], LwM [8], PASS [43], SDC [40]). All the methods are evaluated with the ResNet18 image classification model and a batch size of 64 as described in [15]. We perform a grid search over key hyperparameters of [2, 1, 9, 3, 6, 7]. As for the exemplar-based methods, we store 20 samples with herd selection, for example as described in [12,25]. In addition, we use the original code of PASS [43]. We evaluate the SDC [40] method by employing the prototype drift compensation proposed in [40] to update prototypes of past classes, and model the old-class feature distribution by Gaussians as discussed above. In particular, we employ the original code of [40] to evaluate and compensate for the feature drift of old-class prototypes (i.e. in place of the proposed semantic and feature representation drift models), and we use the up-to-date representations $\Pi_{old}^{t,n}$ estimated by SDC [40] to approximate the feature distribution of old classes with a parametrized Gaussian model, i.e. $p(F \in \mathcal{F}_{old}^{n} : \pi_c \in \Pi_{old}^{t,n}) \sim \mathcal{N}(\pi_c, \sigma_c)$, where $\mathcal{N}$ is the normal distribution and $\sigma_c$ is the standard deviation estimated at step $t'$. In the following sections we will show how our CIL method outperforms SOTA non-exemplar frameworks, while also surpassing some approaches using exemplars.
To evaluate and compare different approaches we resort to the accuracy metric [25], defined as the average top-1 classification accuracy over all classes up to the current incremental step.
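One reading of this metric is sketched below: the top-1 accuracy is computed per class and then averaged over every class seen up to the current step. This coincides with the plain sample-level accuracy when the test set is class-balanced. The function name and the toy labels are hypothetical.

```python
def average_top1_accuracy(y_true, y_pred, seen_classes):
    """Mean of per-class top-1 accuracies over the classes seen so far."""
    correct = {c: 0 for c in seen_classes}
    total = {c: 0 for c in seen_classes}
    for t, p in zip(y_true, y_pred):
        if t in total:  # ignore samples of classes not yet introduced
            total[t] += 1
            correct[t] += int(t == p)
    per_class = [correct[c] / total[c] for c in seen_classes if total[c] > 0]
    return sum(per_class) / len(per_class)


# Three classes seen so far, two test samples each.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy = average_top1_accuracy(y_true, y_pred, seen_classes=[0, 1, 2])
```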
In
We observe that our framework outperforms exemplar-based competitors, while also providing an accuracy boost over state-of-the-art methods not using exemplars [43, 40]. In particular, it is possible to notice that our representation drift modeling yields superior performance with respect to that proposed in [40]. This is especially true when semantic and feature representation shifts are jointly taken into account, showing that they both individually model crucial and complementary information, not fully captured by the framework proposed by [40]. Finally, we remark that employing a probabilistic generative modeling can further boost the final accuracy, especially when multiple incremental steps are performed. Nonetheless, a simple MLP architecture already provides satisfactory results.
In
In
Here non-exemplar methods provide quite low results, especially when the number of incremental steps is increased. Adopting the method proposed in [40] to compensate for the shift of prototypes using a softmax classifier seems to have no beneficial effect, showing that it fails to adequately model semantic drift in a fine-grained classification set-up with high semantic similarity among classes. On the other hand, our framework successfully captures and models representation drift; by injecting up-to-date knowledge of old classes, we mitigate catastrophic forgetting much more effectively.
where akc denotes the top-1 accuracy for class c attained at the step k<t. We then compute the class-wise average of forgetting measures over all past classes at each step. This class-average measure of forgetting (%) is plotted in
It will be appreciated that the measures of forgetting may also be calculated for the other datasets, but for brevity only the CUB100 results are shown here. Similar improvements are seen with the proposed approach. Only the known method iCARL [9] shows some improvement over the proposed approach for the CIFAR100 and TinyImageNet datasets. However, we note that iCARL [9] yields a lower or comparable classification accuracy when compared to the proposed approach. This suggests that the focus of iCARL [9] on preserving past knowledge (i.e. stability) is accompanied by less efficient learning of novel classes (i.e. plasticity).
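For illustration, a class-average forgetting measure can be sketched as below. The defining equation is not reproduced in this section, so the sketch assumes a commonly used definition: for each past class c, the forgetting at step t is the best top-1 accuracy attained at any earlier step k<t minus the accuracy at step t. The function name `average_forgetting` and the accuracy-table layout (steps by classes, with NaN for classes not yet introduced) are assumptions made for this example.

```python
import numpy as np


def average_forgetting(acc, t):
    """Class-average forgetting at step t (assumed definition).

    acc: (T, C) array where acc[k, c] is the top-1 accuracy for class c
         after step k, or NaN if class c has not been introduced by step k.
    t: index of the current step (must be >= 1).
    """
    if t < 1:
        raise ValueError("forgetting is only defined for t >= 1")
    acc = np.asarray(acc, dtype=float)
    past = acc[:t]  # accuracies at steps k < t
    # Past classes are those with at least one accuracy value before step t.
    seen_before = ~np.all(np.isnan(past), axis=0)
    best_past = np.nanmax(past[:, seen_before], axis=0)
    forgetting = best_past - acc[t, seen_before]
    return float(forgetting.mean())
```

Multiplying the returned value by 100 gives the measure as a percentage.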
As explained above, our framework enables implementation of drift models using different GMs and VMs. We studied the accuracy of a GM (MLP) and VM (VAE) for modeling different drifts and their fusion in the results section above. The results suggest that the accuracy of GMs and VMs depends on statistical sufficiency of data which affects capacity of fθ and learned representations as follows:
The methods described above are elucidated further with an ablation study and related analyses detailed below.
The first ablation study analyses how well semantic drift is modeled and employed to update prototypical class representations. We explore how well representation drifts are modeled and how well prototypes are updated in view of those drifts during incremental learning. For this purpose, we compute low-dimensional embeddings of the high-dimensional representation vectors using Isomap [31], so that 2D embeddings of the feature vectors can be visualised. The results are shown in
In the next set of analyses, we investigate how the method we propose for modeling representation drifts captures and preserves semantic relationships between feature representations of old and new classes. For this purpose, we compute prototypical representations (or prototypes) of novel classes on the training data available at an incremental step t, and estimate the revived prototypes of old categories by modeling semantic and feature drifts (employed individually or fused). Then, we express inter-class relationships in the form of Euclidean distances between prototypes of past and new classes, and observe the evolution of these distances throughout an incremental step t. In particular, we compute distance values between pairs of class prototypes at the beginning and at the end of the same incremental step t, and measure their change across the optimisation interval (i.e. as the absolute value of the difference between the estimates made at the beginning and at the end of step t).
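The inter-class relationship analysis described above can be sketched as follows, under the assumption that prototypes are stored as rows of NumPy arrays. The function names `pairwise_euclidean` and `relationship_change` are illustrative and not part of the described method.

```python
import numpy as np


def pairwise_euclidean(old_protos, new_protos):
    """Euclidean distance between every old-class and new-class prototype.

    old_protos: (C_old, D) array; new_protos: (C_new, D) array.
    Returns a (C_old, C_new) distance matrix.
    """
    old_protos = np.asarray(old_protos, dtype=float)
    new_protos = np.asarray(new_protos, dtype=float)
    diff = old_protos[:, None, :] - new_protos[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def relationship_change(old_start, new_start, old_end, new_end):
    """Absolute change of each old/new prototype distance across one step."""
    d_start = pairwise_euclidean(old_start, new_start)
    d_end = pairwise_euclidean(old_end, new_end)
    return np.abs(d_end - d_start)
```

Smaller entries in the returned matrix indicate that the relationship between the corresponding old and new class was better preserved over the step.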
We compute the Euclidean and cosine distances between estimated (revived) prototypes of old classes (i.e. computed over training data and fixed [43], or updated by [40] or by drift models) and their reference (i.e. evanescent) representations (computed over the test set) at each incremental step. In other words, we measure and analyze representation drift with geometric distance functions. We provide per-step class-wise average distance values (main curves), as well as the maximum and minimum class values that identify, respectively, the upper and lower bound of the shaded regions for each setup. Feature prototypes of old classes are estimated by computing class wise averages of feature representations over training data when they are available at an incremental step, which can then be simply fixed for the rest of the training (as done in PASS [43]), or can be updated by SDC [40] or by the proposed drift models. Evanescent prototypes of the same old classes are instead computed over the test set (unavailable during training).
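A minimal sketch of the per-step statistics described above is given below. It assumes that revived and reference (evanescent) prototypes are stored row-wise in the same class order, and that cosine distance is taken as one minus cosine similarity; the function name `prototype_drift_stats` and the small `eps` stabiliser are assumptions of this example.

```python
import numpy as np


def prototype_drift_stats(revived, reference, eps=1e-12):
    """Class-wise mean, min and max of Euclidean and cosine distances.

    revived: (C, D) estimated prototypes of old classes.
    reference: (C, D) prototypes of the same classes computed on the test set.
    """
    revived = np.asarray(revived, dtype=float)
    reference = np.asarray(reference, dtype=float)
    euclid = np.linalg.norm(revived - reference, axis=1)
    norms = np.linalg.norm(revived, axis=1) * np.linalg.norm(reference, axis=1)
    cos_sim = np.sum(revived * reference, axis=1) / (norms + eps)
    cos_dist = 1.0 - cos_sim  # assumed definition of cosine distance

    def summarize(values):
        # Mean gives the main curve; min and max bound the shaded region.
        return {
            "mean": float(values.mean()),
            "min": float(values.min()),
            "max": float(values.max()),
        }

    return {"euclidean": summarize(euclid), "cosine": summarize(cos_dist)}
```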
We replicate the analysis on the CIFAR100 shown in
The results show that our proposed methods can update the prototypes to be closer and better aligned with the reference prototypes compared to the state of the art PASS and SDC methods. Furthermore, we observe that the improved accuracy with respect to the known methods is shared among the three datasets chosen for evaluation. In particular, we notice the remarkable improvement experienced on the CUB200 dataset. Our method, in fact, provides much lower Euclidean distance between revived and evanescent representations, whose average value is kept almost constant as incremental training progresses and new classes are introduced. The same trend can be observed for the cosine similarity of the estimated and evanescent representations, whose value tends to be steady through the incremental training and much closer to the upper bound when our method is adopted. Thereby, our proposed methods reduce the drift between semantic representations in terms of geometric distances (i.e. Euclidean and cosine distances).
In
We focus on analyzing the change of semantic relationships during the first incremental step (i.e. t=1). Prototypes estimated by leveraging the modeled semantic representation drift tend to preserve inter-class relationships with respect to novel classes more effectively. When feature drift alone is used, class representations tend to modify their interconnections. Nonetheless, the proposed model fusion allows inter-class relationships to be better identified and retained, whereas keeping prototypes fixed (as in PASS) causes a greater impairment of inter-class relationships as new representations are learned.
In the next set of analyses, we examine the relationship between semantic drift and model accuracy. Specifically, we examine the relationship between the accuracy of the classification model and the normalised distance between the revived and evanescent prototypes of old classes. Normalisation is performed by dividing the individual distances computed for single past classes by the average distance among all the past classes. We then provide the average of the normalised distance values at each incremental step. We evaluate the accuracy of the proposed method when fusing semantic and feature drift models and adopting a GM to identify representation drift, alongside that of SDC and PASS.
We observe that classification accuracy measured at each incremental step and distance between the estimated and evanescent (also known as ground-truth) prototypes are negatively correlated, with similar trends shared by the different methods being analysed across the CIFAR100, TinyImageNet and CUB200 datasets. It is worth noting that our method and SDC display very similar correlation patterns, whilst the latter reaches lower accuracy and higher distance values at the final incremental step.
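The negative correlation noted above could be quantified, for example, with a Pearson correlation coefficient computed over the incremental steps. The sketch below assumes one accuracy value and one average prototype distance per step; the choice of Pearson correlation and the function name `accuracy_distance_correlation` are assumptions of this example, not a statement of how the reported analysis was carried out.

```python
import numpy as np


def accuracy_distance_correlation(accuracies, distances):
    """Pearson correlation between per-step accuracy and prototype distance.

    accuracies, distances: 1D sequences with one value per incremental step.
    A value close to -1 indicates a strong negative correlation.
    """
    accuracies = np.asarray(accuracies, dtype=float)
    distances = np.asarray(distances, dtype=float)
    if accuracies.shape != distances.shape or accuracies.size < 2:
        raise ValueError("need two equally long sequences with >= 2 steps")
    return float(np.corrcoef(accuracies, distances)[0, 1])
```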
The results show that the state-of-the-art PASS and SDC models overfit to the training data as the models are trained incrementally. For instance, the accuracy of PASS and SDC decreases below 50%, and the prototypical representations they produce diverge from the reference prototypes as the number of incremental steps increases. In contrast, the proposed methods keep the distance between prototypical representations to at most 0.3 and the accuracy to at least 50%. Therefore, we argue that our proposed method yields superior performance compared to the state-of-the-art PASS and SDC methods by more accurately tracking and modeling evanescent old-class prototypes.
In the final set of analyses we use statistical analyses of representations. We explore how well the estimated revived prototypes of old classes exemplify feature representations of samples of such categories (which are unavailable during training at incremental steps). To this end, we first compute probability distributions over the set of old classes based on the Euclidean distance between feature representation and class prototypes by:
$$p_F(c) = \frac{\exp\left(-\lVert F - \pi_c \rVert_2 / \zeta\right)}{\sum_j \exp\left(-\lVert F - \pi_j \rVert_2 / \zeta\right)}$$
where F∈Fold is the representation of a test sample of the old classes as extracted by the current feature extractor, and {πj}j are the estimated prototypes of the old classes. It is noted that the expression above is equivalent to the Softmax function. In this experiment ζ is set to 0.1, although 0.01 may also be used in different experiments.
We analyse the change of entropy (H) and cross-entropy (CE) of pF across incremental steps. For the old classes, we compute the H and CE of pF for each F, and then average over all F corresponding to the same class c.
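The distance-based class distribution and the per-class averaging of H and CE described above can be sketched as follows. The sketch assumes an unsquared Euclidean distance, natural logarithms, and integer labels that index the rows of the prototype array; the function names `class_posteriors` and `per_class_entropy_and_ce` and the `eps` stabiliser are assumptions of this example.

```python
import numpy as np


def class_posteriors(features, prototypes, zeta=0.1):
    """Softmax over negative Euclidean distances to the class prototypes.

    features: (N, D) test features of old classes.
    prototypes: (C, D) estimated old-class prototypes.
    Returns an (N, C) array whose rows sum to one.
    """
    features = np.asarray(features, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    diff = features[:, None, :] - prototypes[None, :, :]
    logits = -np.linalg.norm(diff, axis=-1) / zeta
    # Subtract the row maximum for numerical stability.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def per_class_entropy_and_ce(features, labels, prototypes, zeta=0.1, eps=1e-12):
    """Mean entropy and cross-entropy of p_F over the samples of each class."""
    probs = class_posteriors(features, prototypes, zeta)
    labels = np.asarray(labels)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    ce = -np.log(probs[np.arange(len(labels)), labels] + eps)
    classes = np.unique(labels)
    mean_h = {int(c): float(entropy[labels == c].mean()) for c in classes}
    mean_ce = {int(c): float(ce[labels == c].mean()) for c in classes}
    return mean_h, mean_ce
```

The two dictionaries map each old class to its mean H and mean CE at the current step; repeating the computation after every incremental step gives the evolution of both quantities over training.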
We observe that our method provides higher H and smaller CE compared to PASS, and this trend is shared across CIFAR100, TinyImageNet and CUB200. This result suggests that information capacity of representations learned by our methods increases along with classification accuracy more rapidly compared to the SotA as models are incrementally trained.
To further validate this claim, for each old class c, we compute the probability distribution over feature representations of input samples of all past categories, defined by
where F∈Fold are feature representations of test samples and {πj}j are revived old-class prototypes. In addition, τ is set to 0.1. In
Once more, we observe that our method causes entropy of pc to reach higher values compared to PASS, indicating that prototypical representations revived by our method are more informative than their fixed counterparts, while still being representative of the corresponding class, as suggested by the lower CE of pF.
We also analyse the learning curves of semantic and feature drift models.
For semantic drift models, we show the results for the total loss (denoted by Fusion (VAE/MLP)) comprising Ls and Lfus. As for feature drift models, we present the individual behaviour of the loss Lf denoted by Only Feat Drift (VAE/MLP) (without aggregating the loss Lfus), along with combination of the loss functions Lf and Lfus denoted by Fusion (VAE/MLP).
In the results, we observe that the convergence point stabilises in all the analysed setups as learning progresses during each incremental step. This suggests that, as the classification model learns representations of new classes throughout an incremental step, drift models converge to a stable configuration, in which they can provide a robust estimate of representation drift. Moreover, we observe that the VMs (VAE) converge faster than GMs (MLP). In the analyses, we first observe that our methods provide distributions with higher entropy and smaller cross-entropy compared to the state of the art PASS and SDC methods. This result suggests that information capacity of representations learned by our methods increases more as the models are incrementally learned. In addition, our methods provide better semantic representations which represent categories with higher confidence (measured by class-wise cross-entropy) compared to PASS and SDC.
Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and where appropriate other modes of performing present techniques, the present techniques should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that present techniques have a broad range of applications, and that the embodiments may take a wide range of modifications without departing from any inventive concept as defined in the appended claims.
Contributions of the approach proposed above to overcome all the aforementioned limitations and address the CIL problem can be summarized as follows:
| Number | Date | Country | Kind |
|---|---|---|---|
| 2116074.2 | Nov 2021 | GB | national |
| 2205999.2 | Apr 2022 | GB | national |