This application claims priority to and the benefit of Netherlands Patent Application No. 2034236, titled “METHOD AND SYSTEM FOR CONTINUAL LEARNING IN ARTIFICIAL NEURAL NETWORKS BY IMPLICIT-EXPLICIT REGULARIZATION IN THE FUNCTION”, filed on Feb. 28, 2023, and the specification and claims thereof are incorporated herein by reference.
The invention relates to a computer-implemented method and system for continual learning in artificial neural networks by implicitly and explicitly regularizing the function space of said artificial neural networks.
Continual learning on a sequence of tasks with nonstationary data distributions results in catastrophic forgetting of older tasks, as training the CL model with new information interferes with previously consolidated knowledge [1,2]. Experience-Rehearsal (ER) [9] is one of the first works to address catastrophic forgetting by explicitly maintaining a memory buffer and interleaving previous task samples from the memory with the current task samples. Several works, such as GEM [13] and iCaRL [15], build on top of ER to further reduce catastrophic forgetting in CL. More recently, under low buffer regimes, Deep Retrieval and Imagination (DRI) [14] uses a generative model to produce additional (imaginary) data based on limited memory. ER-ACE [16] focuses on preserving learned representations from drastic adaptations by combating representation drift under low buffer regimes. To leverage learning across tasks in a resource-efficient manner, Gradient Coreset Replay (GCR) [17] proposes maintaining a core-set to select and update the memory buffer. Although rehearsal-based methods are fairly effective in challenging CL scenarios, they suffer from overfitting, exacerbated representation drift and prior information loss in low-buffer regimes, thereby hurting the generalizability of the model.
Regularization, whether implicit or explicit, is an important component in reducing the generalization error in DNNs. Although the parameter norm penalty is one way to regularize the CL model, parameter sharing using multitask learning [11] can lead to better generalization and generalization error bounds if there exists a valid statistical relationship between tasks. Contrastive representation learning [12], which solves pretext prediction tasks to learn generalizable representations across a multitude of downstream tasks, is an ideal candidate as an auxiliary task for implicit regularization. In CL, Task Agnostic Representation Consolidation (TARC) [7] proposes a two-stage learning paradigm in which the model first learns generalizable representations using the Supervised Contrastive loss (SupCon) [12], followed by a modified supervised learning stage. Similarly, Co2L [18] first learns representations using a modified SupCon loss and then trains a classifier only on the last task samples and buffer data. OCDNet [19] employs a student model and distills relational and adaptive knowledge using a modified SupCon objective. However, OCDNet does not leverage the generic information captured within the projection head to further reduce the overfitting of the classifier.
Explicit regularization in the function space imposes soft constraints on the parameters and optimizes the learning goal to converge upon a function that maps inputs to outputs [20]. Therefore, several methods opt to directly limit how much the input/output function changes between tasks to promote generalization [3, 4, 8, 20]. Function Distance Regularization (FDR) [20] and Dark Experience Replay (DER++) [4] save the model responses at task boundaries and apply consistency regularization while replaying data from the memory buffer. Instead of storing the responses in the buffer, Complementary Learning System-ER (CLS-ER) [3] maintains dual semantic memories to enforce consistency regularization.
However, complementing the consistency regularization in these approaches with multitask learning and explicit classifier regularization might enable further generalization in CL.
Rehearsal-based approaches that maintain a bounded memory buffer to store and replay samples from previous tasks have been fairly successful in mitigating catastrophic forgetting. However, these methods show strong performance only in the presence of large buffer sizes and fail to perform well under low-buffer regimes and longer task sequences due to overfitting, prior information loss and representation drift. The method of the current invention comprises the step of intertwining implicit and explicit regularization to instill robust inductive biases and improve the generalization of the continual learning model, especially in low-buffer regimes.
It is an object of the current invention to correct the shortcomings of the prior art and to provide a solution for instilling robust inductive biases and for improving the generalization of the continual learning model, especially in low-buffer regimes. This and other objects which will become apparent from the following disclosure, are provided with a computer-implemented method for continual learning in artificial neural networks, a computer-readable medium, and an autonomous vehicle comprising a data processing system, having the features of one or more of the appended claims.
In a first aspect of the invention, a computer-implemented method is provided for learning of an artificial neural network on an input of a continual stream of tasks, wherein said method comprises a continual learning model and comprises the steps of:
The input of the continual stream of tasks can be obtained from images captured by a video recorder, a scene recorder or any other type of image capturing device, in particular but not exclusively as mounted on a vehicle to continually adapt and acquire knowledge from an environment surrounding said vehicle.
The step of learning generalizable features through an auxiliary task preferably comprises the steps of:
The method of the current invention preferably comprises the step of creating positive and negative embedding pairs of input samples using label information, wherein input samples belonging to the same class as an anchor are labelled as positives, and wherein input samples belonging to a different class than the class of the anchor are labelled as negatives.
The method of the current invention preferably comprises the step of learning visual representations by maximizing a cosine similarity between positive pairs of said correlated views while simultaneously minimizing a cosine similarity between negative pairs of said correlated views, wherein
The method of the current invention preferably comprises the step of using a mapping function for connecting geometric relationships between samples in the unit hypersphere of the classifier projection and in the unit hypersphere of the projection head.
The method of the current invention preferably comprises the step of regularizing the output activations of the classifier projection by capturing mean element-wise squared differences in correlations of l2-normalized output activations of the projection head and correlations of l2-normalized output activations of the classifier projection.
In a second embodiment of the invention, the computer-readable medium is provided with a computer program, wherein, when said computer program is loaded and executed by a computer, said computer program causes the computer to carry out the steps of the computer-implemented method according to any one of the aforementioned steps.
In a third embodiment of the invention, the autonomous vehicle comprises a data processing system loaded with a computer program, wherein said program is arranged for causing the data processing system to carry out the steps of the computer-implemented method according to any one of the aforementioned steps for enabling said autonomous vehicle to continually adapt and acquire knowledge from an environment surrounding said autonomous vehicle.
The input of the continual stream of tasks comprises images captured by any image capturing device. When said images are fed into a system controlled according to the computer-implemented method of the invention, the system will have robust inductive biases and an improved learning of general features, especially in low-buffer regimes. This may improve the swift adaptation of said autonomous vehicle to the environment.
The invention will hereinafter be further elucidated with reference to the drawing of an exemplary embodiment of a computer-implemented method, a computer program and an autonomous vehicle comprising a data processing system according to the invention that is not limiting as to the appended claims.
The accompanying drawing, which is incorporated into and forms a part of the specification, illustrates one or more embodiments of the present invention and, together with the description, serves to explain the principles of the invention. The drawing is only for the purpose of illustrating one or more embodiments of the invention and is not to be construed as limiting the invention. In the drawing,
Whenever in the FIGURES the same reference numerals are applied, these numerals refer to the same parts.
Deep neural networks (DNNs) deployed in the real world often encounter dynamic data streams and need to learn sequentially as data becomes progressively available over time [1]. However, continual learning (CL) over a sequence of tasks causes catastrophic forgetting (CF) [2,9], a phenomenon in which acquiring new information disrupts consolidated knowledge and, in the worst case, previously acquired information is completely forgotten. Humans, on the other hand, excel at continual learning by incrementally acquiring, consolidating and transferring knowledge across a multitude of tasks [6]. Humans use robust inductive biases to generalize better while reusing consolidated knowledge [10].
Regularization is a form of inductive bias that has been traditionally used in training DNNs for improving generalization.
Implicit regularization biases the learning objective without enforcing any explicit constraints on the objective. Multitask learning (MTL), which entails learning auxiliary tasks, acts as an implicit regularizer by sharing representations between related tasks [11]. Contrastive representation learning (CRL), where representations of similar samples are pulled closer to each other while dissimilar ones are pushed away, is a good candidate for the auxiliary tasks [12].
Explicit regularization optimizes the learning objective by imposing additional constraints. Consistency regularization is one such approach, where consistency in predictions is enforced between a source model and a target model using soft targets [5]. Different choices of source model can distill knowledge to induce biases that improve generalization [26,28], robustness [27,30] or other desirable properties [29].
Thus, intertwining proper implicit and explicit regularization can instill inductive biases that improve model performance.
CL typically comprises a sequence of tasks t∈{1, 2, . . . , T}, with the model learning one task at a time. Each task is specified by a task-specific data distribution D_t with pairs {(x_i, y_i)}_{i=1}^{N}. The CL model Φθ = {f, g, g′, h} consists of a shared backbone f, a linear classifier g, a classifier projection MLP g′ and a projection head h. The classifier g represents all classes belonging to all tasks and the projection head h captures the l2-normalized representation embeddings. The classifier's embeddings are further projected onto a unit hypersphere using another projection MLP g′. CL is especially challenging when data pertaining to previous tasks vanish as the CL model progresses on to the next task. Therefore, to approximate the previously seen task-specific data distributions, the method of the current invention comprises the step of maintaining a memory buffer D_m using reservoir sampling [21]. To restrict the empirical risk on all tasks seen so far, ER minimizes the following objective:
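The equation itself is not reproduced in the text above; a plausible reconstruction of this objective, consistent with the surrounding definitions and with standard experience-rehearsal formulations (the exact form of the original Eqn. 1 may differ), is:

L_{er} = \mathbb{E}_{(x,y)\sim D_t}\big[L_{ce}(\sigma(\Phi_\theta(x)),\, y)\big] \;+\; \mathbb{E}_{(x,y)\sim D_m}\big[L_{ce}(\sigma(\Phi_\theta(x)),\, y)\big]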
where B is a training batch, L_ce is a cross-entropy loss, t is the index of the current task, and σ(·) is the softmax function. When the buffer size is limited, the CL model learns sample-specific features rather than class-wide/task-wide representative features, resulting in poor performance. Therefore, the method of the current invention comprises an implicit regularization step using parameter sharing and multitask learning, and an explicit regularization step in the function space, to guide the optimization of the CL model towards generalization.
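Purely as an illustration of the buffer maintenance step, a minimal reservoir-sampling sketch in Python follows; the class and method names are hypothetical and do not form part of the invention as claimed.

import random

class ReservoirBuffer:
    # Memory buffer D_m maintained with reservoir sampling [21].

    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = []      # stored (x, y) pairs
        self.num_seen = 0      # number of stream samples observed so far

    def add(self, x, y):
        # Keep every stream sample with equal probability capacity / num_seen.
        self.num_seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append((x, y))
        else:
            j = random.randint(0, self.num_seen - 1)
            if j < self.capacity:
                self.samples[j] = (x, y)

    def sample(self, batch_size):
        # Draw a rehearsal mini-batch from the buffer.
        k = min(batch_size, len(self.samples))
        return random.sample(self.samples, k) if k > 0 else []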
The method of the current invention preferably comprises the step of learning an auxiliary task that complements continual supervised learning by accumulating generalizable representations in shared parameters. To this end, the method of the current invention preferably comprises the step of using supervised contrastive representation learning (CRL) [12] for learning shared representations. CRL involves multiple highly correlated augmented views of the same sample, which are propagated forward through the encoder f and the projection head h.
To learn visual representations, the CL model should learn to maximize the cosine similarity (l2-normalized dot product) between the positive pairs from the multiple views while simultaneously pushing away the negative embeddings from the rest of the batch. To this end, the method of the current invention preferably comprises the step of using label information to create positive and negative embedding pairs in a training batch. Specifically, samples belonging to the same class as the anchor are considered positive, while the rest of the training batch samples are considered negative. The loss takes the following form:
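The loss is not reproduced in the text above; given the variables defined immediately below, it plausibly takes the standard supervised contrastive (SupCon) form [12] (the original Eqn. 2 may differ in detail):

L_{scl} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}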
where z=h(f(·)) is any arbitrary 128-dimensional l2-normalized projection, τ is a temperature parameter, I is the set of batch indices, A(i)≡I\{i} is the set of all indices other than the anchor, P(i)≡{p∈A(i): y_p=y_i} is the set of projection indices that belong to the same class as the anchor z_i, and |P(i)| is its cardinality. The use of multiple positives and negatives for each anchor based on class membership in Eqn. 2 implicitly encourages learning from hard positives and hard negatives without actually requiring hard negative mining.
Theorem 1 (Wen & Li (2021) [22]) (Feature similarity): Features learned by f through CRL are similar to those learned via cross-entropy as long as: (i) the augmentations in CRL do not corrupt the semantic information, and (ii) the labels in cross-entropy rely mostly on this semantic information.
Let x_{p+} and x_{p++} be two augmented positive samples of x_p such that y_{p+}=y_{p++}. Furthermore, it is assumed that the raw data samples are generated in the following form: x_p = ζ_p + ξ_p, where ζ_p represents the semantic information in the image while ξ_p ∼ D_ξ = N(0, σ) represents spurious noise. Given semantic-preserving augmentations, Wen & Li (2021) [22] state that contrastive learning learns similar discriminative features as cross-entropy. Similarly, as the CRL in Equation 2 employs both semantic-preserving augmentations and labels to create positive pairs, it can be assumed that the inner product between positive projections is dominated by the shared semantic information ζ rather than by the spurious noise ξ, so that the auxiliary task learns features that are similar to, and complement, those learned via cross-entropy.
The CL model equipped with multitask learning implicitly encourages the shared encoder f to learn generalizable features. However, the classifier g that decides the final predictions is still prone to overfitting under low buffer regimes. Therefore, the method of the current invention aims to explicitly regularize the learning trajectory of the CL model in the function space defined by the classifier g. To this end, the output activation of the encoder f is denoted as F∈R^{B×D_f}, that of the projection head h as Z∈R^{B×D_h}, and that of the classifier projection g′ as C∈R^{B×D_g′}. In addition, an exponential moving average (EMA) of the CL model with weights θ_e is maintained and updated stochastically as follows:
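The update rule is not reproduced in the text above; a plausible reconstruction of the stochastic EMA update (the exact form of the original Eqn. 3 may differ) is:

\theta_e \leftarrow \eta\, \theta_e + (1 - \eta)\, \theta,

applied at a given training iteration with probability equal to the update rate γ.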
where η is a decay parameter and γ is an update rate. The EMA of the model can be considered to form a self-ensemble of intermediate model states that leads to a better internal representation [3]. Therefore, the method of the current invention comprises the step of using the soft targets (predictions) of the EMA model to regularize the learning trajectory of the CL model in the function spaces of the projection head h and the classifier g:
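The regularization objective is not reproduced in the text above; given the terms defined immediately below, a plausible reconstruction (the original Eqn. 4 may differ in detail) is:

L_{cr} = \mathbb{E}_{x \sim D_m}\Big[\, \big\| z - z_e \big\|_F^2 \;+\; \big\| \hat{y} - \hat{y}_e \big\|_F^2 \,\Big]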
where ∥·∥_F is the Frobenius norm, z and ŷ are the projection head and classifier responses of the CL model, respectively, and z_e and ŷ_e are those of the EMA model. As soft targets carry more information per training sample than ground-truth labels, knowledge of the previous tasks can be better preserved by ensuring consistency in predictions, thereby leading to drastic reductions in overfitting.
It is pertinent to note that restricting the output space to a unit hypersphere can improve training stability in representation learning [23]. Moreover, well-clustered projections in the hypersphere are linearly separable from the rest of the samples. Therefore, regularizing the classifier using representations learned on a unit hypersphere can considerably reduce the generalization error, as semantically similar inputs tend to elicit similar responses. To this end, the method of the current invention preferably comprises the step of aligning geometric structures within the classifier's hypersphere with those of the projection head's hypersphere to further leverage the global relationships between samples established through the instance discrimination task. It is assumed that there exists a mapping function ψ: S^{D_h−1} → S^{D_g′−1} and its inverse ψ^{−1}: S^{D_g′−1} → S^{D_h−1} that establish a connection between the geometric relationships between the points in both hyperspheres. Therefore, to guide the classifier towards the activation correlations in the unit hypersphere of the projection head, the method of the current invention comprises the step of regularizing the differences in the outer products of Z and C, i.e.,
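The outer products and the corresponding regularization term are not reproduced in the text above; given the description immediately below, a plausible reconstruction of Eqns. 5 and 6 (the exact form in the original may differ) is:

G_h = Z Z^\top, \qquad G_g = C C^\top

L_p = \frac{1}{B^2}\, \big\| \mathrm{stopgrad}(G_h) - G_g \big\|_F^2,

i.e., the mean element-wise squared difference between the two B×B correlation matrices, with the stop-gradient placed on the projection-head side so that gradients flow only through the classifier.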
where G_h and G_g are the outer products, and stopgrad(·) ensures that the backpropagation of gradients occurs only through the classifier. Equation 4 regularizes both the classifier and the projection head using the EMA of the CL model, while L_p in Equation 6 captures the mean element-wise squared difference between the G_h and G_g matrices of the CL model.
Theorem 2 (Johnson-Lindenstrauss Lemma): Let ε∈(0,1) and D_g>0 be such that for any integer n, D_g ≥ 4(ε²/2 − ε³/3)^{−1} ln n. Then for any set Z of n points in R^{D_h}, there exists a map ψ: R^{D_h} → R^{D_g} such that for all z_i, z_j ∈ Z, (1−ε)∥z_i − z_j∥² ≤ ∥ψ(z_i) − ψ(z_j)∥² ≤ (1+ε)∥z_i − z_j∥².
Fundamentally, the Johnson-Lindenstrauss Lemma [24] proves that any n points in a high-dimensional space can be embedded into a space of much lower dimension while nearly preserving the pairwise distances between the points. This supports the assumption that a mapping function connecting the unit hyperspheres of the projection head and the classifier projection exists, so that aligning their geometric structures as in Equation 6 is well founded.
During CL training, batches of the current task are forward propagated through Φθ to obtain classification and projection embeddings. The method of the current invention preferably comprises the step of employing a two-pronged approach aimed at implicit regularization using hard parameter sharing and multitask learning, and a novel explicit regularization in the function space, to guide the optimization of the CL model towards generalization under low buffer regimes.
Specifically, Φθ learns generalizable features through Eqn. 2 and task-specific features through Eqn. 1. To better consolidate the information pertaining to previous tasks, the method of the current invention preferably comprises the step of maintaining a memory buffer and an EMA of the CL model, which also serves as an inference model for evaluation. The method of the current invention preferably comprises the step of enforcing consistency in predictions on rehearsal data using Eqn. 4. To further reduce overfitting and discourage label bias in the classifier, the method of the current invention preferably comprises the step of emulating geometric structures using Eqn. 6. During each training iteration, the method of the current invention preferably comprises the step of updating the memory buffer using reservoir sampling [21] and stochastically updating the EMA using Eqn. 3. The overall learning objective is as follows:
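The objective equation is not reproduced in the text above; one plausible form, with the assignment of the weights α, β and γ to the individual terms being an assumption here, is:

L_{total} = L_{er} + \gamma\, L_{scl} + \alpha\, L_{cr} + \beta\, L_p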
where α, β and γ are hyperparameters.
The method of the current invention, called ImEx-Reg, is illustrated in the accompanying drawing.
Embodiments of the present invention can include every combination of features that are disclosed herein independently from each other.
[Algorithm listing of the ImEx-Reg training procedure; only fragments survive extraction. The recoverable details are: the inputs include the data stream and the model Φθ = {f, g, g′, h}; the memory buffer is initialized as an empty set and the loss to zero; when the buffer is non-empty, a bracketed pair of loss terms weighted by λ is added; further loss terms weighted by α and β are then accumulated before the parameters, the buffer and the EMA model are updated.]
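For illustration only, the per-iteration loss computation of ImEx-Reg may be sketched in Python as follows. The function and parameter names are hypothetical, the model is assumed to return the classifier logits together with the l2-normalized projection-head and classifier-projection outputs, and the weighting of the terms is an assumption rather than the exact original objective.

import torch
import torch.nn.functional as F

def supcon_loss(z, y, tau=0.1):
    # Minimal supervised contrastive loss over l2-normalized projections z with labels y (cf. Eqn. 2).
    sim = z @ z.t() / tau
    self_mask = torch.eye(len(y), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))           # exclude the anchor itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask      # positives share the anchor's class
    denom = pos.sum(1).clamp(min=1)
    return -(log_prob.masked_fill(~pos, 0.0).sum(1) / denom).mean()

def imex_reg_loss(model, ema_model, x, y, x_buf=None, y_buf=None,
                  alpha=0.1, beta=0.01, gamma=1.0):
    # model(x) is assumed to return (logits, z, c): classifier logits, projection-head
    # output z and classifier-projection output c, both l2-normalized.
    logits, z, c = model(x)
    loss = F.cross_entropy(logits, y)                          # current-task term (cf. Eqn. 1)
    loss = loss + gamma * supcon_loss(z, y)                    # auxiliary contrastive term (cf. Eqn. 2)
    if x_buf is not None:                                      # rehearsal from the memory buffer
        logits_b, z_b, _ = model(x_buf)
        loss = loss + F.cross_entropy(logits_b, y_buf)
        with torch.no_grad():
            logits_e, z_e, _ = ema_model(x_buf)                # soft targets from the EMA model
        loss = loss + alpha * ((z_b - z_e).pow(2).sum(1).mean()
                               + (logits_b.softmax(1) - logits_e.softmax(1)).pow(2).sum(1).mean())  # cf. Eqn. 4
    g_h = z @ z.t()                                            # projection-head correlations (cf. Eqn. 5)
    g_g = c @ c.t()                                            # classifier-projection correlations
    loss = loss + beta * (g_h.detach() - g_g).pow(2).mean()    # geometric regularizer, stop-gradient on g_h (cf. Eqn. 6)
    return loss

In such a sketch, the memory buffer would be updated with reservoir sampling and the EMA weights stochastically updated after each optimization step, and the EMA model would serve as the inference model, as described above.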
Typical application areas of the invention include, but are not limited to:
Although the invention has been discussed in the foregoing with reference to an exemplary embodiment of the method of the invention, the invention is not restricted to this particular embodiment which can be varied in many ways without departing from the invention. The discussed exemplary embodiment shall therefore not be used to construe the appended claims strictly in accordance therewith. On the contrary the embodiment is merely intended to explain the wording of the appended claims without intent to limit the claims to this exemplary embodiment. The scope of protection of the invention shall therefore be construed in accordance with the appended claims only, wherein a possible ambiguity in the wording of the claims shall be resolved using this exemplary embodiment.
Variations and modifications of the present invention will be obvious to those skilled in the art and it is intended to cover in the appended claims all such modifications and equivalents. The entire disclosures of all references, applications, patents, and publications cited above are hereby incorporated by reference. Unless specifically stated as being “essential” above, none of the various components or the interrelationship thereof are essential to the operation of the invention. Rather, desirable results can be achieved by substituting various components and/or reconfiguration of their relationships with one another.
Optionally, embodiments of the present invention can include a general or specific purpose computer or distributed system programmed with computer software implementing steps described above, which computer software may be in any appropriate computer language, including but not limited to C++, FORTRAN, ALGOL, BASIC, Java, Python, Linux, assembly language, microcode, distributed programming languages, etc. The apparatus may also include a plurality of such computers/distributed systems (e.g., connected over the Internet and/or one or more intranets) in a variety of hardware implementations. For example, data processing can be performed by an appropriately programmed microprocessor, computing cloud, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), or the like, in conjunction with appropriate memory, network, and bus elements. One or more processors and/or microcontrollers can operate via instructions of the computer code and the software is preferably stored on one or more tangible non-transitory memory-storage devices.