This application claims priority to and the benefit of Netherlands Patent Application No. 2033155, titled “METHOD AND SYSTEM FOR RELATIONAL GENERAL CONTINUAL LEARNING WITH MULTIPLE MEMORIES IN ARTIFICIAL NEURAL NETWORKS”, filed on Sep. 27, 2022, and Netherlands Patent Application No. 2034291, titled “METHOD AND SYSTEM FOR RELATIONAL GENERAL CONTINUAL LEARNING WITH MULTIPLE MEMORIES IN ARTIFICIAL NEURAL NETWORKS”, filed on Mar. 8, 2023, and the specification and claims thereof are incorporated herein by reference.
The invention relates to a computer-implemented method and system for relational general continual learning with multiple memories in artificial neural networks.
Deep Neural Networks undergo catastrophic forgetting of previous information when trying to learn continually from data streams [1]. Research in continual learning has approached the challenge of mitigating forgetting from multiple perspectives. First, regularization-based approaches [2, 3, 4] constrain updates to parameters important to previously seen information. Parameter-isolation based methods [5, 6], meanwhile, assign different sets of parameters to different tasks, preventing interference between tasks. Finally, rehearsal-based methods [7, 8] retrain the network on previously seen information from a memory buffer. Of these, rehearsal-based methods have been proven to work even under complex and more realistic evaluation protocols [9, 10]. However, the performance of rehearsal-based systems still lags behind regular training, where all data is available at once, indicating a gap in the construction of these learning systems, which might be deployed in self-driving cars, home robots, translators, recommender systems, and the like.
The human brain is one of the most successful learning systems we know. Humans are capable of continuously learning complex new tasks without forgetting previous tasks, while also successfully transferring information between previous and new tasks [11]. This capability involves intricate interactions between multiple complementary learning systems (CLS theory) learning at different rates. One can distinguish between plastic learners and stable learners. The fast, so-called “plastic” learners are responsible for quick adaptation to new information, whereas the slow “stable” learners are responsible for consolidating information from the fast learners into more generalizable forms [12]. These learners interact to form cognitive representations that involve both elemental and relational similarities [13]. For example, humans not only maintain knowledge of the heights of objects across time, but also knowledge of relative heights, such as A being taller than B. Relational similarities lead to structure-consistent mappings that enable abstraction, serial higher cognition, and reasoning. Therefore, inculcating these aspects of human learning systems could boost continual learning [13].
Motivated by the CLS theory of human learning, recent research has augmented rehearsal-based methods with multiple interacting memories, further improving their performance [14, 15, 16]. In one mode of interaction, stable models slowly consolidate the information that plastic models quickly acquire from new data into more generalizable forms [14, 15]. Additionally, to mitigate forgetting, the plastic models are constrained to maintain element similarity with the stable models for previous experiences through instance-wise knowledge distillation, forming another mode of interaction [14, 15, 16]. Enforcing relational similarities between models has been attempted through relational knowledge distillation, which constrains models to maintain higher-order relations between representations of the data points [17].
Drawing from linguistic structuralism, relational knowledge distillation [17] suggests that the relational similarities among samples are vital knowledge to take into account when distilling knowledge from a teacher to a student. To this end, it introduces losses that aim to maintain pairwise distances and triplet-wise angles between samples in the stable teacher's and learning student's subspaces. Similarity-preserving knowledge distillation [18], meanwhile, attempts to integrate knowledge of pairwise similarities of activations in the teacher into the student. Variants of these losses have also been proposed in which class-wise relations are further taken into account using class prototypes. Following the success of these methods, subsequent work has extended them to applications such as medical image classification and image translation [20]. However, the approaches discussed so far focus exclusively on regular training protocols, where the complete data is available at all times, and not on continual learning, where the data streams in with bounded access to previous data.
While a few continual learning methods, such as [24], [25], and [26], have attempted to capture relational information, they either work only with innately relational data in natural language processing, or capture only instance-wise relations using approaches such as self-supervised training, and not the higher-order relational similarity at the data level that can be captured by relational knowledge distillation.
Among the continual learning methods that do employ higher-order relational similarity is [21], which attempts to stabilize the angle relationships between triplets in exemplar graphs for few-shot continual learning, where the base task has many classes and the remaining classes are divided across multiple tasks with few examples per class. Nevertheless, this is a much simpler setting, as the majority of the information is learnt early on, reducing the impact of forgetting. Furthermore, this approach is not applicable to general continual learning, as it cannot deal with blurry task boundaries. Finally, Relation-Guided Representation Learning for Data-Free Class Incremental Learning (R-DFCIL) applied an angle-wise distillation loss on the samples of the new task in a data-free scenario, with a model trained on the previous tasks acting as a teacher. As with the previous methods, R-DFCIL is also incapable of dealing with blurry task boundaries. Moreover, it is concerned with the data-free setting, which is complex to tune as it requires training a generator for previously seen images, can introduce a distribution shift on past data, and is also unnecessary for practical deployment, since keeping a small memory buffer for rehearsal incurs only a small memory overhead.
So far, no prior art has disclosed the use of higher-order relational similarities at the data level for multi-memory continual learning with rehearsal that can deal with blurry task boundaries. It is an object of the current invention to correct the shortcomings of the prior art. This and other objects, which will become apparent from the following disclosure, are provided by a computer-implemented method for continual learning in artificial neural networks, a computer-readable medium, and an autonomous vehicle comprising a data processing system, having the features of one or more of the appended claims.
Embodiments of the present invention are directed to a computer-implemented method for learning of artificial neural networks on a continual stream of tasks comprising the steps of:
The computer-implemented method preferably comprises the step of training the at least one plastic model by calculating a task loss, such as a cross-entropy loss, on samples selected from a current stream of tasks and on samples stored in the memory buffer.
The computer-implemented method preferably comprises the step of calculating the elemental knowledge distillation loss on samples selected from the memory buffer.
The computer-implemented method preferably comprises the step of calculating the relational similarity loss on samples selected from a current stream of tasks and on samples stored in the memory buffer.
The computer-implemented method of the current invention preferably comprises the step of calculating a first total loss by:
The computer-implemented method of the current invention preferably comprises the steps of:
The computer-implemented method of the current invention preferably comprises the step of transferring relational similarities, on both the memory samples and the current samples, from the at least one stable model to the at least one plastic model, using a relational similarity loss such as a cross-correlation-based relational similarity loss.
The computer-implemented method of the current invention preferably comprises the step of calculating a second total loss by:
In another embodiment of the invention, the computer-readable medium is provided with a computer program wherein when said computer program is loaded and executed by a computer, said computer program causes the computer to carry out the steps of the computer-implemented method according to any one of aforementioned steps.
In another embodiment of the invention, an autonomous vehicle is proposed comprising a data processing system loaded with a computer program wherein said program is arranged for causing the data processing system to carry out the steps of the computer-implemented method according to any one of aforementioned steps for enabling said autonomous vehicle to continually adapt and acquire knowledge from an environment surrounding said autonomous vehicle.
The invention will hereinafter be further elucidated with reference to the drawing of an exemplary embodiment of a computer-implemented method, a computer program and an autonomous vehicle comprising a data processing system according to the invention that is not limiting as to the appended claims. Objects, advantages and novel features, and further scope of applicability of the present invention will be set forth in part in the detailed description to follow, taken in conjunction with the accompanying drawings, and in part will become apparent to those skilled in the art upon examination of the following, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more embodiments of the present invention and, together with the description, serve to explain the principles of the invention. The drawings are only for the purpose of illustrating one or more embodiments of the invention and are not to be construed as limiting the invention. In the drawings,
Whenever in the FIGURES the same reference numerals are applied, these numerals refer to the same parts.
The method of the current invention can be divided into the following components:
The method of the current invention comprises the step of formulating a dual-memory setup for continual learning. Concretely, consider a plastic model P parameterized by WP and a stable model S parameterized by WS. The plastic model is learnable, whereas the stable model is maintained as an exponential moving average (EMA) of the plastic model [14]. This allows the method of the current invention to deal with blurry task boundaries without having to explicitly rely on task boundaries for building “teacher” models. Additionally, the method of the current invention comprises the step of employing a bounded memory buffer M, updated using reservoir sampling, which aids the buffer in approximating the distribution of samples seen by the models [23]. For an update coefficient α and an update frequency v ∈ (0, 1), the method of the current invention comprises the step of updating the stable model with probability v at training iteration n:
αn = min(1 − 1/(n+1), α)
WS = αn·WS + (1 − αn)·WP   (Equation 1)
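By way of illustration, the dual-memory setup with a stochastically updated EMA stable model (Equation 1) and a reservoir-sampled memory buffer may be sketched in PyTorch-style Python as below. This is a minimal, non-limiting sketch: the names update_stable_model and ReservoirBuffer, the default values of α and v, and the assumption that samples are stored as individual tensors are choices of this illustration rather than features of the invention.

```python
import random

import torch


def update_stable_model(plastic, stable, n, alpha=0.999, v=0.1):
    # Stochastic EMA update of the stable model (Equation 1):
    # with probability v, set WS = alpha_n * WS + (1 - alpha_n) * WP.
    if random.random() > v:
        return
    alpha_n = min(1.0 - 1.0 / (n + 1), alpha)
    with torch.no_grad():
        for w_s, w_p in zip(stable.parameters(), plastic.parameters()):
            w_s.mul_(alpha_n).add_(w_p, alpha=1.0 - alpha_n)


class ReservoirBuffer:
    # Bounded memory buffer M updated with reservoir sampling, so the
    # stored samples approximate the distribution of the stream seen so far.

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []   # list of (x, y) tensor pairs
        self.seen = 0    # number of stream samples observed so far

    def add(self, x, y):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = (x, y)

    def sample(self, batch_size):
        batch = random.sample(self.data, min(batch_size, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)
```

The stable model may be initialized as a copy of the plastic model (e.g. via copy.deepcopy), after which only the plastic model receives gradient updates.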
However, this only helps in dealing with forgetting in the stable model. Ideally, the plastic model should also have a mechanism for remembering earlier knowledge, i.e., maintaining element similarity of representations across time, and for forward transfer of this knowledge, i.e., reasoning based on past experience. Moreover, if the plastic model undergoes catastrophic forgetting, it would hamper the stable model as well (see Equation 1), which further necessitates such a mechanism. Consequently, the method of the current invention comprises the step of distilling the knowledge of representations of memory samples from the stable teacher model back to the plastic student model. Specifically, at any given point in time, the method of the current invention comprises the step of sampling a batch each from the “current” or “task” stream (XB, YB) and from the memory (XM, YM). The elemental similarity loss LES is then calculated on the memory samples, and the resulting multi-memory loss is defined as:
Lmulti-mem((XB, YB), (XM, YM)) = LT(XB ∪ XM, YB ∪ YM) + β·LES(XM)   (Equation 3)
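Given the two models and the sampled batches, the multi-memory loss of Equation 3 may be sketched as follows. The task loss LT is taken as a cross-entropy loss, as indicated above; the precise form of the elemental similarity loss LES is not reproduced here, so its instantiation as a mean-squared error between the stable (teacher) and plastic (student) outputs on the memory samples is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F


def multi_memory_loss(plastic, stable, x_b, y_b, x_m, y_m, beta=1.0):
    # L_T: task loss (here cross-entropy) on the union of the current
    # batch (x_b, y_b) and the memory batch (x_m, y_m).
    x = torch.cat([x_b, x_m])
    y = torch.cat([y_b, y_m])
    task_loss = F.cross_entropy(plastic(x), y)

    # L_ES: elemental similarity loss on the memory samples only; the
    # stable teacher receives no gradient from this loss.
    with torch.no_grad():
        teacher_out = stable(x_m)
    elemental_loss = F.mse_loss(plastic(x_m), teacher_out)

    return task_loss + beta * elemental_loss  # Equation 3
```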
Knowledge distillation transfers element similarities, i.e., individual representations, from a teacher model to a student model. However, there is further knowledge embedded in the relations between representations, which is important to the structure of the representation space as a whole. For example, examples from similar classes may lead to similar representations, whereas examples from highly dissimilar classes may lead to highly divergent representations. Relational similarity aids the learning of structurally consistent mappings that enable higher cognition and reasoning. Therefore, the method of the current invention comprises the step of additionally instilling relational similarity between the teacher stable model and the student plastic model on both the memory and the current samples.
Specifically, the corresponding representations from the pre-final layers of the stable and plastic models are batch-normalized, represented by Ai(S), Ai(P) ∈ ℝb, where b is the batch size and i indexes the representation dimensions, and the cross-correlation-based relational similarity loss LRS is calculated between them (Equation 4).
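One possible instantiation of a cross-correlation-based relational similarity loss is sketched below. The exact formulation of Equation 4 is not reproduced here; this sketch assumes a Barlow-Twins-style loss in which the cross-correlation matrix of the batch-normalized stable and plastic representations is driven toward the identity matrix, with a hypothetical weight lam on the off-diagonal term.

```python
import torch


def relational_similarity_loss(a_stable, a_plastic, lam=5e-3):
    # a_stable, a_plastic: pre-final-layer representations of shape (b, d)
    # for the same samples; a_stable should be computed without gradients.
    b, d = a_plastic.shape

    # Batch-normalize each representation dimension over the batch.
    a_s = (a_stable - a_stable.mean(0)) / (a_stable.std(0) + 1e-5)
    a_p = (a_plastic - a_plastic.mean(0)) / (a_plastic.std(0) + 1e-5)

    # Cross-correlation matrix between stable and plastic representations.
    c = (a_s.T @ a_p) / b  # shape (d, d)

    on_diag = (torch.diagonal(c) - 1.0).pow(2).sum()              # align matched dimensions
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()   # decorrelate the rest
    return on_diag + lam * off_diag
```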
Putting everything together (Equations 3 and 4), the total loss for continual learning becomes:
Ltotal = Lmulti-mem((XB, YB), (XM, YM)) + γ·LRS(XM, XB)   (Equation 5)
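Combining the above, a single training iteration implementing Equation 5 could look as follows, reusing the hypothetical helpers sketched earlier. The features() call standing in for the pre-final-layer representation, the hyperparameter defaults, and the assumption of a non-empty memory buffer are choices of this illustration, not a reference implementation of the invention.

```python
import torch


def training_step(plastic, stable, buffer, x_b, y_b, optimizer, step,
                  beta=1.0, gamma=1.0):
    # Sample a memory batch of the same size as the current batch.
    x_m, y_m = buffer.sample(x_b.size(0))

    # Equation 3: task loss plus elemental similarity loss.
    loss = multi_memory_loss(plastic, stable, x_b, y_b, x_m, y_m, beta=beta)

    # Equation 5: relational similarity on both memory and current samples.
    x_all = torch.cat([x_m, x_b])
    with torch.no_grad():
        a_stable = stable.features(x_all)    # features(): assumed hook returning the
    a_plastic = plastic.features(x_all)      # pre-final-layer representation
    loss = loss + gamma * relational_similarity_loss(a_stable, a_plastic)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    update_stable_model(plastic, stable, step)   # EMA update with probability v
    for x, y in zip(x_b, y_b):                   # reservoir update of the buffer
        buffer.add(x, y)
    return loss.item()
```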
Embodiments of the present invention can include every combination of features that are disclosed herein independently from each other.
Typical application areas of the invention include, but are not limited to:
Although the invention has been discussed in the foregoing with reference to an exemplary embodiment of the method of the invention, the invention is not restricted to this particular embodiment which can be varied in many ways without departing from the invention. The discussed exemplary embodiment shall therefore not be used to construe the appended claims strictly in accordance therewith. On the contrary the embodiment is merely intended to explain the wording of the appended claims without intent to limit the claims to this exemplary embodiment. The scope of protection of the invention shall therefore be construed in accordance with the appended claims only, wherein a possible ambiguity in the wording of the claims shall be resolved using this exemplary embodiment.
Variations and modifications of the present invention will be obvious to those skilled in the art and it is intended to cover in the appended claims all such modifications and equivalents. The entire disclosures of all references, applications, patents, and publications cited above are hereby incorporated by reference. Unless specifically stated as being “essential” above, none of the various components or the interrelationship thereof are essential to the operation of the invention. Rather, desirable results can be achieved by substituting various components and/or reconfiguration of their relationships with one another.
Optionally, embodiments of the present invention can include a general or specific purpose computer or distributed system programmed with computer software implementing steps described above, which computer software may be in any appropriate computer language, including but not limited to C++, FORTRAN, ALGOL, BASIC, Java, Python, Linux, assembly language, microcode, distributed programming languages, etc. The apparatus may also include a plurality of such computers/distributed systems (e.g., connected over the Internet and/or one or more intranets) in a variety of hardware implementations. For example, data processing can be performed by an appropriately programmed microprocessor, computing cloud, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), or the like, in conjunction with appropriate memory, network, and bus elements. One or more processors and/or microcontrollers can operate via instructions of the computer code, and the software is preferably stored on one or more tangible, non-transitory memory-storage devices.