Method and System for Continual Learning in Artificial Neural Networks by Implicit-Explicit Regularization in the Function

Information

  • Patent Application
  • Publication Number
    20240296321
  • Date Filed
    February 28, 2023
  • Date Published
    September 05, 2024
Abstract
A computer-implemented method for continual learning in deep neural networks that introduces robust inductive biases by intertwining implicit regularization, using a projection head through auxiliary contrastive representation learning, and explicit consistency regularization on the soft targets using exponential moving average. To further leverage the global relationship between representations learned, the method of the current invention comprises a regularization strategy of guiding the classifier towards the activation correlations in the unit hypersphere of the projection head. These implicit and explicit regularizations encourage the model to learn generalizable representations, thereby reducing task interference and catastrophic forgetting.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to and the benefit of Netherlands Patent Application No. 2034236, titled “METHOD AND SYSTEM FOR CONTINUAL LEARNING IN ARTIFICIAL NEURAL NETWORKS BY IMPLICIT-EXPLICIT REGULARIZATION IN THE FUNCTION”, filed on Feb. 28, 2023, and the specification and claims thereof are incorporated herein by reference.


BACKGROUND OF THE INVENTION
Field of the Invention

The invention relates to a computer-implemented method and system for continual learning in artificial neural networks by implicitly and explicitly regularizing the function space of said artificial neural networks.


Background Art

Continual learning (CL) on a sequence of tasks with nonstationary data distributions results in catastrophic forgetting of older tasks, as training the CL model with new information interferes with previously consolidated knowledge [1,2]. Experience-Rehearsal (ER) [9] is one of the first works to address catastrophic forgetting by explicitly maintaining a memory buffer and interleaving previous task samples from the memory with the current task samples. Several works, such as GEM [13] and iCaRL [15], build on top of ER to further reduce catastrophic forgetting in CL. More recently, under low buffer regimes, Deep Retrieval and Imagination (DRI) [14] uses a generative model to produce additional (imaginary) data based on limited memory. ER-ACE [16] focuses on preserving learned representations from drastic adaptations by combating representation drift under low buffer regimes. To leverage learning across tasks in a resource-efficient manner, Gradient Coreset Replay (GCR) [17] proposes maintaining a core-set to select and update the memory buffer. Although rehearsal-based methods are fairly effective in challenging CL scenarios, they suffer from overfitting, exacerbated representation drift and prior information loss in low-buffer regimes, thereby hurting the generalizability of the model.


Regularization, whether implicit or explicit, is an important component in reducing the generalization error in DNNs. Although the parameter norm penalty is one way to regularize the CL model, parameter sharing using multitask learning [11] can lead to better generalization and generalization error bounds if there exists a valid statistical relationship between tasks. Contrastive representation learning [12] that solves pretext prediction tasks to learn generalizable representations across a multitude of downstream tasks is an ideal candidate as an auxiliary task for implicit regularization. In CL, Task Agnostic Representation Consolidation (TARC) [7] proposes a two-stage learning paradigm in which the model learns generalizable representations first using Supervised Contrastive loss (SupCon) [12] followed by a modified supervised learning stage. Similarly, Co2L [18] first learns representations using modified SupCon loss and then trains a classifier only on the last task samples and buffer data. OCDNet [19] employs a student model, and distills relational and adaptive knowledge using modified SupCon objective. However, OCDNet does not leverage the generic information captured within the projection head to further reduce the overfitting of the classifier.


Explicit regularization in the function space imposes soft constraints on the parameters and optimizes the learning goal to converge upon a function that maps inputs to outputs [20]. Therefore, several methods opt to directly limit how much the input/output function changes between tasks to promote generalization [3, 4, 8, 20]. Function Distance Regularization (FDR) [20] and Dark Experience Replay (DER++) [4] save the model responses at task boundaries and apply consistency regularization while replaying data from the memory buffer. Instead of storing the responses in the buffer, Complementary Learning System-ER (CLS-ER) [3] maintains dual semantic memories to enforce consistency regularization.


However, multitasking and explicit classifier regularization in addition to consistency regularization in these approaches might enable further generalization in CL.


Rehearsal-based approaches that maintain a bounded memory buffer to store and replay samples from previous tasks have been fairly successful in mitigating catastrophic forgetting. However, these methods show strong performance only when the buffer is large, and perform poorly under low-buffer regimes and longer task sequences due to overfitting, prior information loss and representation drift. The method of the current invention comprises the step of intertwining implicit and explicit regularization to instill robust inductive biases and improve the generalization of the continual learning model, especially in low-buffer regimes.


BRIEF SUMMARY OF THE INVENTION

It is an object of the current invention to correct the shortcomings of the prior art and to provide a solution for instilling robust inductive biases and for improving the generalization of the continual learning model, especially in low-buffer regimes. This and other objects which will become apparent from the following disclosure, are provided with a computer-implemented method for continual learning in artificial neural networks, a computer-readable medium, and an autonomous vehicle comprising a data processing system, having the features of one or more of the appended claims.


In a first aspect of the invention, the computer-implemented method for learning of an artificial neural network on an input of a continual stream of tasks, wherein said method comprises a continual learning model comprising the steps of:

    • maintaining a fixed-size memory buffer for storing data distributions of previous tasks; and
    • providing the network with:
      • an encoder (backbone);
      • a linear classifier;
      • a classifier projection with a multi-layer perceptron;
      • a projection head,


        wherein the method comprises the steps of:
    • implicitly regularizing the continual learning model by learning generalizable features through an auxiliary task such as supervised contrastive learning; and
    • explicitly regularizing the learning model by:
      • calculating an exponentially moving average of parameters of the continual learning model; and
      • using predictions of said exponentially moving average of the model for regularizing the continual learning model in a function space of the linear classifier and in a function space of the projection head; and
      • aligning geometric structures within a unit hypersphere of the linear classifier with a unit hypersphere of the projection head.


The input of the continual stream of tasks can be obtained from images captured by a video recorder, a scene recorder or any other type of image capturing device, in particular but not exclusively as mounted on a vehicle to continually adapt and acquire knowledge from an environment surrounding said vehicle.


The step of learning generalizable features through an auxiliary task preferably comprises the steps of:

    • augmenting multiple correlated views of an input sample; and
    • propagating said correlated views forward through an encoder and a projection head of the network.


The method of the current invention preferably comprises the step of creating positive and negative embedding pairs of input samples using label information, wherein input samples belonging to a same class of an anchor are labelled as positives, and wherein input samples belonging to a different class than the class of the anchor are labelled as negatives.


The method of the current invention preferably comprises the step of learning visual representations by maximizing a cosine similarity between positive pairs of said correlated views while simultaneously minimizing a cosine similarity between negative pairs of said correlated views, wherein


The method of the current invention preferably comprises the step of using a mapping function for connecting geometric relationships between samples in the unit hypersphere of the classifier projection and in the unit hypersphere of the projection head.


The method of the current invention preferably comprises the step of regularizing the output activations of the classifier projection by capturing mean element-wise squared differences in correlations of l2-normalized output activations of the projection head and correlations of l2-normalized output activations of the classifier projection.


In a second embodiment of the invention, the computer-readable medium is provided with a computer program wherein when said computer program is loaded and executed by a computer, said computer program causes the computer to carry out the steps of the computer-implemented method according to any one of aforementioned steps.


In a third embodiment of the invention, the autonomous vehicle comprising a data processing system loaded with a computer program wherein said program is arranged for causing the data processing system to carry out the steps of the computer-implemented method according to any one of aforementioned steps for enabling said autonomous vehicle to continually adapt and acquire knowledge from an environment surrounding said autonomous vehicle.


The input of the continual stream of tasks comprises images captured by any image capturing device. When said images are fed into a system controlled according to the computer-implemented method of the invention, the system will have robust inductive biases and an improved learning of general features, especially in low-buffer regimes. This may improve the swift adaptation of said autonomous vehicle to the environment.


The invention will hereinafter be further elucidated with reference to the drawing of an exemplary embodiment of a computer-implemented method, a computer program and an autonomous vehicle comprising a data processing system according to the invention that is not limiting as to the appended claims.





BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The accompanying drawing, which is incorporated into and forms a part of the specification, illustrates one or more embodiments of the present invention and, together with the description, serves to explain the principles of the invention. The drawing is only for the purpose of illustrating one or more embodiments of the invention and is not to be construed as limiting the invention. In the drawing,



FIG. 1 shows a schematic diagram for the computer-implemented method according to an embodiment of the invention.





DETAILED DESCRIPTION OF THE INVENTION

Whenever in the FIGURES the same reference numerals are applied, these numerals refer to the same parts.


Deep neural networks (DNNs) deployed in the real world often encounter dynamic data streams and need to learn sequentially, with data becoming progressively available over time [1]. However, continual learning (CL) over a sequence of tasks causes catastrophic forgetting (CF) [2,9], a phenomenon in which acquiring new information disrupts the consolidated knowledge, and, in the worst case, previously acquired information is completely forgotten. Humans, on the other hand, excel at continual learning by incrementally acquiring, consolidating and transferring knowledge across a multitude of tasks [6]. Humans use robust inductive biases to generalize better while reusing the consolidated knowledge [10].


Regularization is a form of inductive bias that has been traditionally used in training DNNs for improving generalization.


Implicit regularization biases the learning objective without enforcing any explicit constraints on the objective. Multitask learning (MTL), which entails learning auxiliary tasks, acts as an implicit regularizer by sharing representations between related tasks [11]. Contrastive representation learning (CRL), where representations of similar samples are pulled closer to each other while dissimilar ones are pushed away, is a good candidate for the auxiliary tasks [12].


Explicit regularization optimizes the learning objective by imposing additional constraints. Consistency regularization is one such approach, where consistency in predictions is enforced between a source model and a target model using soft targets [5]. Different choices of source model can distill knowledge to induce biases that improve generalization [26,28], robustness [27,30] or other desirable properties [29].


Thus intertwining proper implicit and explicit regularization can instill inductive biases for improving the model performance.


Preliminary

CL typically comprises a sequence of tasks t∈{1, 2, . . . , T}, with the model learning one task at a time. Each task is specified by a task-specific data distribution Dt with pairs {(xi, yi)}, i=1, . . . , N. The CL model Φθ={f, g, g′, h} consists of a shared backbone f, a linear classifier g, a classifier projection MLP g′ and a projection head h. The classifier g represents all classes belonging to all tasks and the projection head h captures the l2-normalized representation embeddings. The classifier's embeddings are further projected onto a unit hypersphere using another projection MLP g′. CL is especially challenging when data pertaining to previous tasks vanish as the CL model progresses to the next task. Therefore, to approximate the previously seen task-specific data distributions, the method of the current invention comprises the step of maintaining a memory buffer Dm using reservoir sampling [21]. To restrict the empirical risk on all tasks seen so far, ER minimizes the following objective:

$$\mathcal{L}_{er} = \frac{1}{|\mathcal{B}|}\sum_{(x,y)\sim\mathcal{D}_t\cup\mathcal{D}_m}\mathcal{L}_{ce}\big(\sigma(g(f(x))),\,y\big) \qquad (1)$$

where B is a training batch, Lce is a cross-entropy loss, t is the index of the current task, and σ(·) is the softmax function. When the buffer size is limited, the CL model learns sample-specific features rather than class-wide/task-wide representative features, resulting in poor performance. Therefore, the method of the current invention comprises an implicit regularization step using parameter sharing and multitask learning, and an explicit regularization step in the function space to guide the optimization of the CL model towards generalization.
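The reservoir sampling used to maintain the memory buffer Dm can be sketched as follows. This is a minimal illustration of the classic single-pass scheme described in [21], not the implementation of the invention; the class name `ReservoirBuffer`, its attributes and the toy data stream are hypothetical.

```python
import random


class ReservoirBuffer:
    """Fixed-size memory buffer filled by reservoir sampling (hypothetical helper)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []       # stored (x, y) pairs
        self.num_seen = 0     # number of samples offered to the buffer so far
        self.rng = random.Random(seed)

    def add(self, x, y):
        """Offer one sample; every sample seen so far is kept with equal probability."""
        self.num_seen += 1
        if len(self.items) < self.capacity:
            self.items.append((x, y))
            return
        # Draw a slot in [0, num_seen); the sample is stored only if the slot
        # falls inside the buffer, i.e. with probability capacity / num_seen.
        slot = self.rng.randrange(self.num_seen)
        if slot < self.capacity:
            self.items[slot] = (x, y)

    def sample(self, batch_size):
        """Draw a replay minibatch without replacement."""
        k = min(batch_size, len(self.items))
        return self.rng.sample(self.items, k)


# Toy stream of 100 samples with 10 classes (assumed data, for illustration only).
buffer = ReservoirBuffer(capacity=5, seed=0)
for i in range(100):
    buffer.add(x=i, y=i % 10)
replay = buffer.sample(3)
```

Because the buffer never needs to know the length of the stream or the task boundaries, this scheme fits the setting in which data of previous tasks are no longer available.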


Implicit Regularization

The method of the current invention preferably comprises the step of learning an auxiliary task that complements continual supervised learning by accumulating generalizable representations in shared parameters. To this end, the method of the current invention preferably comprises the step of using a supervised contrastive loss (CRL) [12] for learning shared representations. CRL involves multiple highly correlated augmented views of the same sample, which are then propagated forward through the encoder f and the projection head h.


To learn visual representations, the CL model should learn to maximize the cosine similarity (l2-normalized dot product) between the positive pairs from the multiple views while simultaneously pushing away the negative embeddings from the rest of the batch. To this end, the method of the current invention preferably comprises the step of using label information to create positive and negative embedding pairs in a training batch. Specifically, samples belonging to the same class as the anchor are considered positive, while the rest of the training batch samples are considered negative. The loss takes the following form:

$$\mathcal{L}_{rep} = \sum_{i\in I}\frac{-1}{|P(i)|}\sum_{p\in P(i)}\left[\frac{z_i\cdot z_p}{\tau} - \log\sum_{n\in N(i)}\exp\left(\frac{z_i\cdot z_n}{\tau}\right)\right] \qquad (2)$$

where z=h(f(·)) is any arbitrary 128-dimensional l2-normalized projection, τ is a temperature parameter, I is the set of indices of the training batch B, N(i)≡I\{i} is the set of all indices other than the anchor, against which the anchor is contrasted, P(i)≡{p∈N(i): yp=yi} is the set of projection indices that belong to the same class as the anchor zi, and |P(i)| is its cardinality. The use of multiple positives and negatives for each anchor based on the class membership in Eqn. 2 implicitly encourages learning from hard positives and hard negatives without actually requiring hard negative mining.
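The loss in Eqn. 2 can be illustrated with a short numerical sketch. It assumes that the denominator runs over all other samples in the batch, as in the supervised contrastive loss of [12], and that anchors without any positive are skipped; the function names, the temperature value and the toy embeddings are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def supervised_contrastive_loss(embeddings, labels, tau=0.1):
    """Sum over anchors of the averaged negative log-likelihood of their positives."""
    z = [l2_normalize(v) for v in embeddings]
    total = 0.0
    for i in range(len(z)):
        others = [k for k in range(len(z)) if k != i]               # N(i)
        positives = [p for p in others if labels[p] == labels[i]]   # P(i)
        if not positives:
            continue  # assumption: an anchor without positives adds nothing
        # Temperature-scaled cosine similarities between the anchor and the rest.
        sims = {k: sum(a * b for a, b in zip(z[i], z[k])) / tau for k in others}
        # Numerically stable log-sum-exp for the denominator.
        m = max(sims.values())
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denominator for p in positives) / len(positives)
    return total


# Two tight clusters: labels that follow the clusters give a much lower loss
# than labels that cut across them.
points = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
loss_matching = supervised_contrastive_loss(points, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(points, [0, 1, 0, 1])
```

In the toy example, pulling same-class projections together is already satisfied for the matching labels, so the loss is close to zero, whereas the mismatched labels are heavily penalized.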


Theorem 1 (Wen & Li (2021) [22]): (Feature similarity) Features learned by f through CRL are similar to those learned via cross-entropy as long as: (i) the augmentations in CRL do not corrupt semantic information, and (ii) the labels in cross-entropy rely mostly on this semantic information.


Let xp+ and xp++ be two augmented positive samples such that yp+=yp++. Furthermore, it is assumed that the raw data samples are generated in the following form: xp=ζp+ξp, where ζp represents the semantic information in the image while ξp˜Dξ=N(0, σ) represents spurious noise. Given semantic-preserving augmentations, Wen & Li (2021) [22] state that contrastive learning learns discriminative features similar to those learned by cross-entropy. Similarly, as CRL in Equation 2 employs both semantic-preserving augmentations and labels to create positive pairs, it can be assumed that the inner product from semantic information <zζp+, zζp++> will overwhelm that from the noisy signal <zξp+, zξp++>. Therefore, Equation 2 exerts a form of implicit regularization during optimization to focus on semantic information contained in the images (such as from the environment) rather than spurious correlations. As labels in cross-entropy are expected to focus on semantic features to learn classification, it is hypothesized that both CRL and cross-entropy share a common hypothesis class and that sharing representations across these tasks especially benefits CL under low-buffer regimes.


Explicit Regularization

The CL model equipped with multitask learning implicitly encourages the shared encoder f to learn generalizable features. However, the classifier g that decides the final predictions is still prone to overfitting under low buffer regimes. Therefore, the method of the current invention aims to explicitly regularize the learning trajectory of the CL model in the function space defined by the classifier g. To this end, the output activation of encoder f is denoted as F∈ℝ^{B×Df}, that of projection head h as Z∈ℝ^{B×Dh}, and that of classifier projection g′ as C∈ℝ^{B×Dg}, where B here denotes the number of samples in the batch and Df, Dg, Dh denote the dimensions of the output Euclidean spaces. Let 𝒢⊆ℝ^{Dg} and ℋ⊆ℝ^{Dh} be the function spaces represented by the classifier g and the projection head h. Let θ and θEMA be the parameters of the CL model and its corresponding exponential moving average (EMA). The EMA model is then stochastically updated as follows:

$$\theta_{EMA} = \begin{cases}\theta_{EMA}, & \text{if } \gamma \le \mathcal{U}(0,1)\\ \eta\,\theta_{EMA} + (1-\eta)\,\theta, & \text{otherwise}\end{cases} \qquad (3)$$

where η is a decay parameter and γ is an update rate. The EMA of a model can be considered to form a self-ensemble of intermediate model states that leads to a better internal representation [3]. Therefore, the method of the current invention comprises the step of using the soft targets (predictions) of the EMA model to regularize the learning trajectory in the function spaces 𝒢 and ℋ of the classifier and the projection head of the CL model:

$$\mathcal{L}_{cr}^{h} = \mathbb{E}_{(x_j,y_j)\sim\mathcal{D}_m}\,\big\|z - z_e\big\|_F^2 \qquad (4)$$

$$\mathcal{L}_{cr}^{g} = \mathbb{E}_{(x_j,y_j)\sim\mathcal{D}_m}\,\big\|\hat{y} - \hat{y}_e\big\|_F^2$$

where ∥·∥F is the Frobenius norm, z and ŷ are the projection head and classifier responses of the CL model, respectively, and ze and ŷe are those of the EMA model. As soft targets carry more information per training sample than ground truth labels, knowledge of the previous tasks can be better preserved by ensuring consistency in predictions, thereby reducing overfitting.
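The stochastic EMA update of Eqn. 3 and the consistency terms of Eqn. 4 can be sketched as follows. The sketch assumes that the EMA is updated with probability γ (it is kept unchanged whenever γ is not larger than a uniform draw) and treats parameters and responses as plain lists of floats; the function names and toy values are hypothetical.

```python
import random


def stochastic_ema_update(theta_ema, theta, eta, gamma, rng):
    """Return the new EMA parameters after one stochastic update step."""
    u = rng.random()  # uniform draw from [0, 1)
    if gamma <= u:
        # No update at this iteration: the EMA parameters are kept as they are.
        return list(theta_ema)
    # Blend the EMA parameters with the current parameters of the CL model.
    return [eta * e + (1.0 - eta) * w for e, w in zip(theta_ema, theta)]


def consistency_loss(responses, ema_responses):
    """Squared Frobenius norm of the difference between two batches of responses."""
    return sum(
        (a - b) ** 2
        for row, ema_row in zip(responses, ema_responses)
        for a, b in zip(row, ema_row)
    )


rng = random.Random(0)
theta_ema, theta = [1.0, 2.0], [0.0, 0.0]
# gamma = 1.0 always updates, gamma = 0.0 never updates.
updated = stochastic_ema_update(theta_ema, theta, eta=0.9, gamma=1.0, rng=rng)
kept = stochastic_ema_update(theta_ema, theta, eta=0.9, gamma=0.0, rng=rng)

# Responses of the CL model and of the EMA model on one buffered sample.
penalty = consistency_loss([[1.0, 0.0]], [[0.0, 0.0]])
```

The same `consistency_loss` helper applies to both the projection head responses and the classifier responses, since Eqn. 4 uses the same form for the two.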


It is pertinent to note that restricting the output space to a unit hypersphere can improve training stability in representation learning [23]. Moreover, well-clustered projections in the hypersphere are linearly separable from the rest of the samples. Therefore, regularizing the classifier using representations learned on a unit hypersphere can considerably reduce the generalization error. As semantically similar inputs tend to elicit similar responses, the method of the current invention preferably comprises the step of aligning geometric structures within the classifier's hypersphere with those of the projection head's hypersphere to further leverage the global relationship between samples established using the instance discrimination task. It is assumed that there exists a mapping function ℳ: ℋ→𝒢 from the function space ℋ of the projection head to the function space 𝒢 of the classifier, and its inverse ℳ⁻¹: 𝒢→ℋ, that establish a connection between the geometric relationships between the points in both hyperspheres. Therefore, to guide the classifier towards the activation correlations in the unit hypersphere of the projection head, the method of the current invention comprises the step of regularizing the differences in the outer products of Z and C, i.e.,

$$G_h = ZZ^{T} \in \mathbb{R}^{B\times B} \qquad (5)$$

$$G_g = CC^{T} \in \mathbb{R}^{B\times B}$$

$$\mathcal{L}_p = \frac{1}{|G_h|}\,\big\|\operatorname{stopgrad}(G_h) - G_g\big\|_F^2 \qquad (6)$$

where Gh and Gg are the outer products of the l2-normalized activations Z and C, |Gh| is the number of elements of Gh, and stopgrad(·) ensures that the backpropagation of gradients occurs only through the classifier. Since Z and C hold one row per sample, both outer products have one entry per pair of samples and can be compared even when Dh and Dg differ. Equation 4 regularizes both the classifier and the projection head using the EMA of the CL model, while Lp in Equation 6 captures the mean element-wise squared difference between the Gh and Gg matrices of the CL model.
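A small numerical sketch of the alignment term in Eqn. 6 is given below. It assumes that Z and C hold one l2-normalized row per sample, so that both outer products contain one entry per pair of samples; in plain Python there is no gradient, so the stop-gradient is only indicated in a comment. All function names and toy activations are hypothetical.

```python
import math


def l2_normalize_rows(matrix):
    """Scale every (non-zero) row to unit Euclidean length."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        normalized.append([v / norm for v in row])
    return normalized


def gram(matrix):
    """Pairwise dot products between rows: one entry per pair of samples."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]


def alignment_loss(Z, C):
    """Mean element-wise squared difference between the two correlation matrices."""
    g_h = gram(l2_normalize_rows(Z))  # target; would be wrapped in stopgrad in training
    g_g = gram(l2_normalize_rows(C))
    b = len(g_h)
    squared = sum((g_h[i][j] - g_g[i][j]) ** 2 for i in range(b) for j in range(b))
    return squared / (b * b)


# Projection head activations (Dh = 3) for two orthogonal samples.
Z = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Classifier projections (Dg = 2): one preserves the orthogonality, one collapses it.
C_aligned = [[0.0, 2.0], [3.0, 0.0]]
C_collapsed = [[1.0, 0.0], [1.0, 0.0]]
loss_aligned = alignment_loss(Z, C_aligned)
loss_collapsed = alignment_loss(Z, C_collapsed)
```

The aligned case gives zero loss although the two spaces have different dimensions, because only the pairwise relations between samples are compared.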


Theorem 2 (Johnson-Lindenstrauss Lemma): Let ϵ∈(0,1) and Dg>0 be such that, for any integer n, Dg≥4(ϵ²/2−ϵ³/3)⁻¹ ln n. Then for any set P of n points in ℝ^{Dh}, there exists a mapping function

$$\mathcal{M}:\mathbb{R}^{D_h}\rightarrow\mathbb{R}^{D_g}\quad\text{s.t.}\quad\forall\,p,q\in P,\quad(1-\epsilon)\,\|p-q\|^2\;\le\;\|\mathcal{M}(p)-\mathcal{M}(q)\|^2\;\le\;(1+\epsilon)\,\|p-q\|^2$$

Fundamentally, the Johnson-Lindenstrauss Lemma [24] proves that one can reduce the dimension of any n points in a Euclidean subspace of ℝ^{Dh} to Dg=O(log n/ϵ²) dimensions without distorting the distances between the n points by more than a factor of 1±ϵ. Magen (2002) [25] further observed that the Johnson-Lindenstrauss Lemma preserves any ‘Dg dimensional angle’ by projecting down to dimension O(Dgϵ⁻² log n). Therefore, it can be assumed that it is possible to transfer geometric structures in the projection head to the low-dimensional hypersphere of the classifier without loss of generality. To this end, the method of the current invention preferably comprises the step of introducing Lp as a mapping function to emulate rich geometric structures learnt within the projection head's hypersphere to compensate for the weak supervision in the classifier under low-buffer regimes.
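As a worked example of the bound in Theorem 2, the smallest target dimension that satisfies Dg≥4(ϵ²/2−ϵ³/3)⁻¹ ln n can be computed directly. The function name `jl_min_dim` and the chosen values of n and ϵ are hypothetical and serve only to show the order of magnitude.

```python
import math


def jl_min_dim(n, eps):
    """Smallest integer dimension allowed by the Johnson-Lindenstrauss bound."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    if n < 2:
        raise ValueError("at least two points are needed")
    return math.ceil(4.0 * math.log(n) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))


# 1000 points with a tolerated distortion of 50% on squared distances.
dim_loose = jl_min_dim(1000, 0.5)   # -> 332
# A tighter distortion requires a higher target dimension.
dim_tight = jl_min_dim(1000, 0.1)
```

The bound depends only on the number of points n and the distortion ϵ, not on the original dimension Dh, and it is a sufficient condition rather than a necessary one.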


Putting it all Together

During CL training, batches of current task are forward propagated through Φθ to obtain classification and projection embeddings. The method of the current invention preferably comprises the step of employing a two-pronged approach aimed at implicit regularization using hard parameter sharing and multitask learning, and a novel explicit regularization in the function space to guide the optimization of the CL model towards generalization under low buffer regimes.


Specifically, Φθ learns generalizable features through Eqn. 2 and task-specific features through Eqn. 1. To consolidate the information pertaining to previous tasks better, the method of the current invention preferably comprises the step of maintaining a memory buffer Dm and an EMA of the CL model, which also serves as the inference model for evaluation. The method of the current invention preferably comprises the step of enforcing consistency in predictions on rehearsal data using Eqn. 4. To further reduce overfitting and discourage label bias in the classifier, the method of the current invention preferably comprises the step of emulating geometric structures using Eqn. 6. During each training iteration, the method of the current invention preferably comprises the step of updating the memory buffer using reservoir sampling [21] and stochastically updating the EMA using Eqn. 3. The overall learning objective is as follows:

$$\mathcal{L} = \mathbb{E}_{(x,y)\sim\mathcal{D}_t\cup\mathcal{D}_m}\big[\mathcal{L}_{ce}+\alpha\,\mathcal{L}_{rep}+\beta\,\mathcal{L}_{p}\big]+\mathbb{E}_{(x,y)\sim\mathcal{D}_m}\,\lambda\big[\mathcal{L}_{cr}^{h}+\mathcal{L}_{cr}^{g}\big] \qquad (7)$$

where α, β and λ are hyperparameters, Lce is the cross-entropy loss of Eqn. 1, Lrep is the supervised contrastive loss of Eqn. 2, Lp is the alignment loss of Eqn. 6, and Lcr^h and Lcr^g are the consistency losses of Eqn. 4 on the projection head and classifier responses, respectively.
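How the individual terms are combined in Eqn. 7 can be sketched with scalar loss values. The sketch assumes that the consistency terms are only added once the memory buffer is non-empty; the function name, argument names and numbers are hypothetical.

```python
def total_loss(l_ce, l_rep, l_p, alpha, beta, lam, l_cr_h=None, l_cr_g=None):
    """Combine the cross-entropy, contrastive, alignment and consistency terms."""
    loss = l_ce + alpha * l_rep + beta * l_p
    # The consistency terms need rehearsal data, so they are skipped
    # while the memory buffer is still empty (assumption of this sketch).
    if l_cr_h is not None and l_cr_g is not None:
        loss += lam * (l_cr_h + l_cr_g)
    return loss


# First iterations: no buffered samples yet.
loss_without_buffer = total_loss(1.0, 2.0, 3.0, alpha=0.5, beta=0.1, lam=0.2)
# Later iterations: consistency terms on rehearsal data are included.
loss_with_buffer = total_loss(
    1.0, 2.0, 3.0, alpha=0.5, beta=0.1, lam=0.2, l_cr_h=1.0, l_cr_g=1.0
)
```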


The method of the current invention, called ImEx-Reg, is illustrated in FIG. 1 and is detailed in Algorithm 1.


Embodiments of the present invention can include every combination of features that are disclosed herein independently from each other.












Algorithm 1 Proposed Method

Input: Data streams Dt, Model Φθ = {f, g, g′, h}, Hyperparameters α, β and λ, Memory buffer Dm ← { }

for all tasks t ∈ {1, 2, ..., T} do
  for all iterations e ∈ {1, 2, ..., E} do
    L = 0
    Sample a minibatch (Xt, Yt) ∈ Dt
    Ft = f(Xt)
    Ŷt, Zt, Ct = g(Ft), h(Ft), g′(g(Ft))
    if Dm ≠ ∅ then
      Sample a minibatch (Xm, Ym) ∈ Dm
      Fm = f(Xm)
      Ŷ, Z, C = g(Fm), h(Fm), g′(g(Fm))
      Fe = fe(Xm)
      Ŷe, Ze, Ce = ge(Fe), he(Fe), g′e(ge(Fe))
      L += λ [ Lcr^h + Lcr^g ]    (consistency losses, Eqn. 4)
    L += Lce + α Lrep + β Lp    (Eqns. 1, 2 and 6)
    Update Φθ and Dm
return model Φθ










Typical application areas of the invention include, but are not limited to:

    • Road condition monitoring
    • Road signs detection
    • Parking occupancy detection
    • Defect inspection in manufacturing
    • Insect detection in agriculture
    • Aerial survey and imaging


Although the invention has been discussed in the foregoing with reference to an exemplary embodiment of the method of the invention, the invention is not restricted to this particular embodiment which can be varied in many ways without departing from the invention. The discussed exemplary embodiment shall therefore not be used to construe the appended claims strictly in accordance therewith. On the contrary the embodiment is merely intended to explain the wording of the appended claims without intent to limit the claims to this exemplary embodiment. The scope of protection of the invention shall therefore be construed in accordance with the appended claims only, wherein a possible ambiguity in the wording of the claims shall be resolved using this exemplary embodiment.


Variations and modifications of the present invention will be obvious to those skilled in the art and it is intended to cover in the appended claims all such modifications and equivalents. The entire disclosures of all references, applications, patents, and publications cited above are hereby incorporated by reference. Unless specifically stated as being “essential” above, none of the various components or the interrelationship thereof are essential to the operation of the invention. Rather, desirable results can be achieved by substituting various components and/or reconfiguration of their relationships with one another.


Optionally, embodiments of the present invention can include a general or specific purpose computer or distributed system programmed with computer software implementing steps described above, which computer software may be in any appropriate computer language, including but not limited to C++, FORTRAN, ALGOL, BASIC, Java, Python, Linux, assembly language, microcode, distributed programming languages, etc. The apparatus may also include a plurality of such computers/distributed systems (e.g., connected over the Internet and/or one or more intranets) in a variety of hardware implementations. For example, data processing can be performed by an appropriately programmed microprocessor, computing cloud, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), or the like, in conjunction with appropriate memory, network, and bus elements. One or more processors and/or microcontrollers can operate via instructions of the computer code and the software is preferably stored on one or more tangible non-transitive memory-storage devices.


REFERENCES



  • 1. Parisi, German I., et al. “Continual lifelong learning with neural networks: A review.” Neural Networks 113 (2019): 54-71.

  • 2. Michael McCloskey and Neal J Cohen. “Catastrophic interference in connectionist networks: The sequential learning problem.” In Psychology of learning and motivation, volume 24, pages 109-165. Elsevier, 1989.

  • 3. Arani, Elahe, Fahad Sarfraz, and Bahram Zonooz. “Learning fast, learning slow: A general continual learning method based on complementary learning system.” arXiv preprint arXiv:2201.12604 (2022).

  • 4. Buzzega, Pietro, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. “Dark experience for general continual learning: a strong, simple baseline.” Advances in neural information processing systems 33 (2020): 15920-15930.

  • 5. Bhat, Prashant, Bahram Zonooz, and Elahe Arani. “Consistency is the key to further mitigating catastrophic forgetting in continual learning.” arXiv preprint arXiv:2207.04998 (2022).

  • 6. Li, Zhizhong, and Derek Hoiem. “Learning without forgetting.” IEEE transactions on pattern analysis and machine intelligence 40, no. 12 (2017): 2935-2947.

  • 7. Bhat, Prashant, Bahram Zonooz, and Elahe Arani. “Task Agnostic Representation Consolidation: a Self-supervised based Continual Learning Approach.” arXiv preprint arXiv:2207.06267 (2022).

  • 8. Sarfraz, Fahad, Elahe Arani, and Bahram Zonooz. “SYNERgy between SYNaptic consolidation and Experience Replay for general continual learning.” arXiv preprint arXiv:2206.04016 (2022).

  • 9. Ratcliff, Roger. “Connectionist models of recognition memory: constraints imposed by learning and forgetting functions.” Psychological review 97.2 (1990): 285.

  • 10. Goyal, Anirudh, and Yoshua Bengio. “Inductive biases for deep learning of higher-level cognition.” Proceedings of the Royal Society A 478.2266 (2022): 20210068.

  • 11. Sebastian Ruder. “An overview of multi-task learning in deep neural networks”. arXiv preprint arXiv:1706.05098, 2017.

  • 12. Khosla, Prannay, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. “Supervised contrastive learning.” Advances in Neural Information Processing Systems 33 (2020): 18661-18673.

  • 13. Lopez-Paz, David, and Marc'Aurelio Ranzato. “Gradient episodic memory for continual learning.” Advances in neural information processing systems 30 (2017).

  • 14. Wang, Zhen, Liu Liu, Yiqun Duan, and Dacheng Tao. “Continual learning through retrieval and imagination.” In AAAI Conference on Artificial Intelligence, vol. 8. 2022.

  • 15. Rebuffi, Sylvestre-Alvise, et al. “iCaRL: Incremental classifier and representation learning.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

  • 16. Caccia, Lucas, Rahaf Aljundi, Nader Asadi, Tinne Tuytelaars, Joelle Pineau, and Eugene Belilovsky. “New insights on reducing abrupt representation change in online continual learning.” arXiv preprint arXiv:2203.03798 (2022).

  • 17. Tiwari, Rishabh, Krishnateja Killamsetty, Rishabh Iyer, and Pradeep Shenoy. “GCR: Gradient Coreset Based Replay Buffer Selection For Continual Learning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 99-108. 2022.

  • 18. Cha, Hyuntak, Jaeho Lee, and Jinwoo Shin. “Co2L: Contrastive continual learning.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9516-9525. 2021.

  • 19. Li, Jin, Zhong Ji, Gang Wang, Qiang Wang, and Feng Gao. “Learning from Students: Online Contrastive Distillation Network for General Continual Learning.” In Proceedings of the International Joint Conference on Artificial Intelligence, vol. 7. 2022.

  • 20. Benjamin, Ari S., David Rolnick, and Konrad Kording. “Measuring and regularizing networks in function space.” arXiv preprint arXiv:1805.08289 (2018).

  • 21. Vitter, Jeffrey S. “Random sampling with a reservoir.” ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57.

  • 22. Wen, Zixin, and Yuanzhi Li. “Toward understanding the feature learning process of self-supervised contrastive learning.” In International Conference on Machine Learning, pp. 11112-11122. PMLR, 2021.

  • 23. Wang, Tongzhou, and Phillip Isola. “Understanding contrastive representation learning through alignment and uniformity on the hypersphere.” In International Conference on Machine Learning, pp. 9929-9939. PMLR, 2020.

  • 24. Dasgupta, Sanjoy, and Anupam Gupta. “An elementary proof of a theorem of Johnson and Lindenstrauss.” Random Structures & Algorithms 22, no. 1 (2003): 60-65.

  • 25. Magen, Avner. “Dimensionality reductions that preserve volumes and distance to affine spaces, and their algorithmic applications.” In International Workshop on Randomization and Approximation Techniques in Computer Science, pp. 239-253. Springer, Berlin, Heidelberg, 2002.

  • 26. Gowda, Shruthi, Bahram Zonooz, and Elahe Arani. “InBiaseD: Inductive Bias Distillation to Improve Generalization and Robustness through Shape-awareness.” In Proceedings of The 1st Conference on Lifelong Learning Agents, Proceedings of Machine Learning Research 199, pp. 1026-1042. 2022.

  • 27. Gowda, Shruthi, Bahram Zonooz, and Elahe Arani. “Does Thermal data make the detection systems more reliable?.” arXiv preprint arXiv:2111.05191 (2021).

  • 28. Bhat, Prashant, Elahe Arani, and Bahram Zonooz. “Distill on the Go: Online knowledge distillation in self-supervised learning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2678-2687. 2021.

  • 29. Arani, Elahe, Fahad Sarfraz, and Bahram Zonooz. “Noise as a resource for learning in knowledge distillation.” In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3129-3138. 2021.

  • 30. Arani, Elahe, Fahad Sarfraz, and Bahram Zonooz. “Computer-Implemented Method of Training a Computer-Implemented Deep Neural Network and Such a Network.” U.S. patent application Ser. No. 17/382,121, filed Feb. 10, 2022.


Claims
  • 1. A computer-implemented method for learning of an artificial neural network on an input of a continual stream of tasks, the method comprising a continual learning model comprising the steps of: maintaining a fixed-size memory buffer using reservoir sampling for storing data distributions of previous tasks; providing the network with an encoder, a linear classifier, a classifier projection with a multi-layer perceptron, and a projection head; implicitly regularizing the continual learning model by learning generalizable features through an auxiliary task such as supervised contrastive learning; and explicitly regularizing the learning model by: calculating an exponential moving average of parameters of the continual learning model; using predictions of said exponential moving average for regularizing the continual learning model in a function space of the linear classifier and in a function space of the projection head; and aligning geometric structures within a unit hypersphere of the linear classifier with a unit hypersphere of the projection head.
  • 2. The method according to claim 1, wherein the step of learning generalizable features through an auxiliary task comprises the steps of: augmenting multiple correlated views of an input sample; and propagating said correlated views forward through the encoder and a projection head of the network.
  • 3. The method according to claim 1 further comprising the step of creating positive and negative embedding pairs of input samples using label information, wherein input samples belonging to a same class of an anchor are labelled as positives, and wherein input samples belonging to a different class than the class of the anchor are labelled as negatives.
  • 4. The method according to claim 1 further comprising the step of learning visual representations by maximizing a cosine similarity between positive pairs of said correlated views while simultaneously minimizing a cosine similarity between negative pairs of said correlated views.
  • 5. The method according to claim 1 further comprising the step of using a mapping function for connecting geometric relationships between points of the unit hypersphere of the classifier and points of the unit hypersphere of the projection head.
  • 6. The method according to claim 1 further comprising the step of regularizing the output activations of the classifier projection by capturing mean element-wise squared differences in the correlations of ℓ2-normalized output activations of the projection head and the correlations of ℓ2-normalized output activations of the classifier projection.
  • 7. A computer-readable medium provided with a computer program, wherein when the computer program is loaded and executed by a computer, the computer program causes the computer to carry out the steps of the computer-implemented method according to claim 1.
  • 8. An autonomous vehicle comprising a data processing system loaded with a computer program, wherein the program is arranged for causing the data processing system to carry out the steps of the computer-implemented method according to claim 1 for enabling the autonomous vehicle to continually adapt and acquire knowledge from an environment surrounding the autonomous vehicle.
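By way of illustration only, three of the steps recited in claims 1 and 6 can be sketched in plain Python: the reservoir-sampled memory buffer, the exponential moving average of the parameters, and the mean element-wise squared difference between correlations of ℓ2-normalized activations. This is a minimal sketch under stated assumptions, not the claimed implementation. All function names, the decay value, the representation of parameters as flat lists of floats, and the reading of "correlations" as pairwise dot products between the samples of a batch are assumptions introduced here.

```python
import math
import random


def reservoir_update(buffer, sample, num_seen, capacity, rng=random):
    """Reservoir sampling [21]: after the call, every sample seen so far is
    in the buffer with equal probability. num_seen counts earlier samples."""
    if len(buffer) < capacity:
        buffer.append(sample)
    else:
        slot = rng.randint(0, num_seen)  # bounds are inclusive
        if slot < capacity:
            buffer[slot] = sample
    return num_seen + 1


def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average of parameters (flat lists of floats).
    The decay value is an assumption, not taken from the claims."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


def l2_normalize(rows):
    """Scale each row to unit Euclidean norm (a point on the unit hypersphere)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard zero rows
        out.append([v / norm for v in row])
    return out


def correlations(rows):
    """Pairwise dot products between rows (cosine similarities if normalized)."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


def correlation_alignment_loss(proj_head_out, clf_proj_out):
    """Mean element-wise squared difference between the batch correlation
    matrices of the two l2-normalized outputs (assumed reading of claim 6)."""
    c_head = correlations(l2_normalize(proj_head_out))
    c_clf = correlations(l2_normalize(clf_proj_out))
    n = len(c_head)
    total = sum(
        (c_head[i][j] - c_clf[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)
```

Because both correlation matrices have one row and one column per sample in the batch, the two outputs being compared may have different feature dimensions under this reading.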
Priority Claims (1)
Number Date Country Kind
2034236 Feb 2023 NL national