This application claims priority to and the benefit of Netherlands Patent Application No. 2032721, titled “A Framework for Continual Learning Method in Vision Transformers with Representation Replay”, filed on Aug. 10, 2022, and the specification and claims thereof are incorporated herein by reference.
The present invention relates to a method for continual task learning, a data processing apparatus comprising means for carrying out said method, a computer program for carrying out said method, and an at least partially autonomous driving system comprising a neural network that has been trained using said method.
Globally speaking, research in the field of artificial intelligence and deep learning has resulted in deep neural networks (DNNs) that achieve compelling performance in various domains [1, 2, 3]. Different types of architectures have been proposed in the deep learning literature to solve various tasks in computer vision and natural language processing [9, 10]. Transformers have been the dominant choice of architecture in natural language processing [11], and the recent breakthrough of Transformers in image recognition [47] has motivated the community to adapt them to other vision tasks, including object detection [12] and depth prediction [13]. Transformers in the computer vision domain are dubbed “Vision Transformers” [47]. It has also been shown that Transformers are more robust than CNNs and make more reliable predictions [23]. They employ repeated blocks of Multi-Head Self-Attention (MHSA) to learn relationships between different patches of an image at every block.
Most of the deep learning literature focuses on learning a model on a fixed dataset sampled from a single distribution [4]; such models are incapable of learning from sequential data over time. Continual learning [6, 7, 8, 15] is a research topic that studies the capability of deep neural networks to constantly adapt to data from new distributions while retaining the information learned from old data (consolidation) [7]. In an ideal continual learning setting, the model should sequentially learn from data belonging to new tasks (either new domains or new sets of classes) without forgetting the information learned from previous tasks. Such a model would be more suitable for deployment in real-life scenarios such as robotics and autonomous driving, where learning, adapting and making decisions continuously is a primary requirement.
In order to elucidate the concepts of (i) continual task learning and (ii) transformers in a continual task learning framework, both concepts are first discussed in relation to the prior art below.
The Complementary Learning Systems (CLS) theory posits that the ability to continually acquire and assimilate knowledge over time in the brain is mediated by multiple memory systems [33]. Inspired by the CLS theory, CLS-ER [22] proposed a dual-memory method which maintains short-term and long-term semantic memories that interact with the episodic memory. On the other hand, Dual-Net [32] endowed a fast learner with label supervision and a slow learner with unsupervised representation learning, thereby decoupling representation learning from supervised learning. Although these approaches show remarkable improvements over vanilla-ER, they replay raw pixels of past experiences, which is inconsistent with how humans continually learn [19]. In addition, replaying raw pixels can have other ramifications, including a large memory footprint and data privacy and security concerns [34]. According to the hippocampal indexing theory [35], the hippocampus stores non-veridical [36, 37], high-level representations of neocortical activity patterns while awake. Several works [38, 39] mimic abstract representation rehearsal in the brain by storing and replaying representations from intermediate layers in DNNs. Although high-level representation replay can potentially mitigate the memory overhead, replaying representations as-is over and over leads to overfitting and reduced noise tolerance. Overfitting is one of the main causes of catastrophic forgetting in neural networks.
Presently, known training methods suffer from what is known as “catastrophic forgetting”, the main issue in continual learning, where the model overfits to the data from the new task or forgets the information learned from old tasks (when the learning algorithm overwrites weights important to the old tasks while learning new tasks) [8]. Despite recent research in this field, continually learning an accurate model that performs well on previously seen classes without catastrophic forgetting is still an open research problem. Furthermore, Vision Transformers are still at a nascent stage in computer vision, and representation replay in Transformers has not been explored in the field of continual learning. Different approaches have tried to solve the problem of catastrophic forgetting in neural networks, where a model should retain its learned knowledge about past tasks while learning new tasks. Catastrophic forgetting occurs due to the rewriting of weights in the network which are important to the old tasks while updating the network to learn the new task, or due to overfitting the network on the new task samples. Though experience replay has been found to help mitigate catastrophic forgetting in continual learning, replaying raw pixels can have other ramifications, including a large memory footprint and data privacy and security concerns [34].
This application refers to published references. Such published references are given for a more complete background and are not to be construed as an admission that such publications are prior art for purposes of determining patentability.
Accordingly, embodiments of the present invention aim to combat the persistent problem of “catastrophic forgetting” in deep neural networks that are trained on the task of classifying images or recognizing objects or situations within such images. Embodiments of the present invention reduce the persistence of said problem, without causing privacy concerns and while reducing the memory footprint, by the method according to claim 1, which represents a first aspect of the invention. To improve training accuracy, the second and third functions are provided as student and teacher respectively. In such a setup, the step of providing memory-stored representations of task samples to the third function occurs without providing generated representations of task samples to the third function.
It is noted that changing the first function from adaptable to fixed occurs after the first neural network learns a first task. The fixing of the first function would here relate to parameters of the first function such as weights. This reduction in plasticity after having been fitted to a task prevents forgetting, while still allowing a part of the first neural network to adjust to other, subsequent tasks that are being learned. In a sense, also separate from any other features, the method preferably comprises teaching the first neural network a plurality of tasks, wherein the first function becomes fixed after having been taught a first task of the plurality of tasks.
Beneficially, and for all embodiments, first and second neural networks are preferably designed as vision transformers so as to allow training based on large data volumes, such as a continuous stream of high-resolution images.
To ensure that consistency in intermediate representations is learned by the first neural network during the first task and fixed before the onset of future tasks, the first few layers, such as 1-2 layers, or 1-3 layers, of the first function may be used to process veridical inputs, wherein its output, i.e. the generated representations of task samples, is stored to the memory along with a ground truth label.
Optionally, consolidating task knowledge is performed during intermittent periods of inactivity, such as exclusively during intermittent periods of inactivity. In one example, this is after learning a, or each, task. In an at least partially autonomous vehicle with a camera for generating the training images and a computer for implementing the method according to the invention, these periods of inactivity could for example be the periods in which the vehicle is parked or otherwise goes unused for more than a predefined period of time, such as an hour. This prevents the expected task performance from changing during use, which would result in unexpected behavior for the user of such a vehicle.
It is further possible for the step of consolidating task knowledge across multiple tasks in the third function to comprise aggregating the weights of the second function by exponential moving average to form the weights of the third function. This allows the third function, and thereby the second neural network, to effectively act as a teacher to the first neural network when updating the first neural network.
Optionally, the memory is populated during task training or at a task boundary. Task boundary here means after learning of a task or before the learning of a new task. This could be any task of a plurality of tasks if the method comprises teaching the first neural network to perform a plurality of tasks. In one example generated representations of task samples may be provided to and stored in the memory. Alternatively, the memory is populated prior to task training and gradually replaces the representations with newly generated representations. This allows the method to initially build on semantic memory collected prior to said task training.
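The population of the memory described above can be sketched as follows. This is a minimal illustration only: the fixed-capacity buffer, the reservoir-sampling replacement policy and all names are assumptions of this sketch, not features mandated by the embodiments, which leave the replacement scheme open.

```python
import random

class EpisodicMemory:
    """Hypothetical fixed-size buffer of (representation, label) pairs.
    Reservoir sampling is one possible way to gradually replace stored
    representations with newly generated ones; other policies are possible."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []          # stored (representation, label) pairs
        self.seen = 0           # total number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, representation, label):
        """Offer one generated representation; keep it with probability capacity/seen."""
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((representation, label))
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = (representation, label)

    def sample(self, k):
        """Draw up to k stored pairs for replay."""
        return self.rng.sample(self.data, min(k, len(self.data)))

memory = EpisodicMemory(capacity=3)
for i in range(10):
    memory.add([float(i)], i)   # representation, ground-truth label
# len(memory.data) == 3: the buffer never exceeds its capacity
```

Populating at a task boundary instead of during training would simply mean calling `add` only once a task has finished.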
Beneficially, stored representations from the memory are provided to the second function together with representations of task samples generated by the first function, so that both are learned jointly. This allows replay from memory to counterbalance the new learning experience. Additionally, and further beneficially, the representations stored in the episodic memory are synchronously processed by the second and third functions.
To provide the first neural network with optimized learning goals and to further prevent overfitting, the method may comprise determining a loss function comprising a representation rehearsal loss and a loss representing an expected Minkowski distance between corresponding pairs of predictions by the second and third functions, wherein the loss function balances both losses using a balancing parameter, and wherein the first neural network is updated using said loss function.
According to a second aspect of the invention there is provided a data processing apparatus comprising means for carrying out the method according to the first aspect of the invention.
According to a third aspect of the invention there is provided a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method according to the first aspect of the invention.
According to a fourth aspect of the invention there is provided an at least partially autonomous driving system comprising at least one camera designed for providing a feed of input images, and a computer designed for classifying and/or detecting objects using the first neural network, wherein said first neural network has been trained using the method according to the first aspect of the invention.
Objects, advantages and novel features, and further scope of applicability of the present invention will be set forth in part in the detailed description to follow, taken in conjunction with the accompanying drawings, and in part will become apparent to those skilled in the art upon examination of the following, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more embodiments of the present invention and, together with the description, serve to explain the principles of the invention. The drawings are only for the purpose of illustrating one or more embodiments of the invention and are not to be construed as limiting the invention. In the drawings:
The invention is further elucidated by
The first block can in this example be seen as a combination of a patch embedding and a self-attention block, and the second block can be seen as the brain-inspired components proposed in the method. Input images for training are received by the first block comprising the first function Gw via patch embedding. The first function generates representations of task samples and feeds these to the second block. In the second block these generated representations are provided to the second function Fw, and optionally also to the memory Dm; optionally, because the memory may be pre-populated with representations. In any case, generated and stored task representations are provided to the second function together. Optionally, stored task representations are also provided to the third function Fs. Task knowledge is consolidated in the third function Fs using the exponential moving average of the weights of the second function Fw. After the second function has learned a first task, the first function Gw is changed from adaptable to fixed, in that its weights and structure become unalterable during the learning of subsequent tasks. The first and second neural networks are respectively referred to as the working model and the stable model. The term stable model is used interchangeably with the terms teacher and teacher model.
Embodiments of the present invention are thus shown to replay intermediate representations instead of veridical/raw inputs and to utilize an exponential moving average of the working model, herein also called the first neural network, as a teacher, namely the second neural network, to distill the knowledge of previous tasks.
A methodology is here proposed wherein internal representations of Vision Transformers are replayed instead of the raw pixels of an image. This effectively mitigates catastrophic forgetting in continual learning by maintaining an exponential moving average of the first neural network, hereinafter also called the working model, to distill learned knowledge from past tasks. Replaying internal representations saves memory in applications with large input resolutions and furthermore eliminates any privacy concerns.
In further detail, it should be mentioned that the continual learning paradigm normally consists of T sequential tasks, with data becoming progressively available over time. During each task t ∈ {1, 2, . . . , T}, the task samples and corresponding labels (xi, yi), i = 1 to N, are drawn from the task-specific distribution Dt. The continual learning Vision Transformer model fθ is sequentially optimized on one task at a time, and inference is carried out on all tasks seen so far. For DNNs, continual learning is especially challenging, as data becoming progressively available over time violates the i.i.d. assumption, leading to overfitting on the current task and catastrophic forgetting of previous tasks.
Strong empirical evidence suggests an important role for experience rehearsal in consolidating memory in the brain [19]. Likewise, in DNNs, ER stores and replays a subset of previous task samples alongside current task samples. By mimicking the association of past and present experiences in the brain, ER partially addresses the problem of catastrophic forgetting. The learning objective is as follows:
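The equation itself does not survive in this text. A plausible reconstruction, assuming the conventional ER objective and using the symbols defined in the following paragraph (predictions ŷ, balancing parameter α, memory buffer Dm, cross-entropy loss Lce), is:

```latex
\mathcal{L}_{er} =
  \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_t}\!\left[\mathcal{L}_{ce}(\hat{y}_i, y_i)\right]
  + \alpha \, \mathbb{E}_{(x_j, y_j) \sim \mathcal{D}_m}\!\left[\mathcal{L}_{ce}(\hat{y}_j, y_j)\right]
  \tag{1}
```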
In Equation 1, ŷi and ŷj are the CL model predictions, α represents a balancing parameter, Dm is the memory buffer and Lce is the cross-entropy (CE) loss. To further augment ER in mitigating catastrophic forgetting, we employ two complementary learning systems based on abstract, high-level representation replay. The stable and the working model in our proposed method interact with the episodic memory and consolidate information about previous tasks better than vanilla-ER. In the following sections, we elaborate on the working of each component of our proposed method.
In a more detailed example of the invention, two Transformer-based complementary systems, here the first and second neural networks, are proposed that acquire and assimilate knowledge over short and long periods of time. The first neural network, here also known as the working model, is reminiscent of the hippocampus in a human brain, which encounters new tasks and consolidates knowledge over short periods of time. As the knowledge of the learned tasks is encoded in the weights of the DNNs, the weights of the working model are adapted to achieve maximum performance on the current task. However, abrupt changes to the weights cause catastrophic forgetting of older tasks. To consolidate knowledge across tasks, the working model gradually aggregates its weights into the stable model, here the second neural network, during intermittent stages of inactivity, akin to knowledge consolidation in the neocortex of a human brain.
Knowledge consolidation in the stable model can be done in several ways: keeping a copy of the working model at the end of each task, weight aggregation through exponential moving average (EMA), or leveraging self-supervision. To reduce computational complexity, weight aggregation through exponential moving average (EMA) is preferred.
The design of the stable model as an exponential moving average of the working model is as follows:
θs = γθs + (1 − γ)θw Equation 2
where θw and θs are the weights of the working and stable models respectively, and γ is a decay parameter. As the working model focuses on specializing on the current task, the copy of the working model at each training step can be considered specialized on a particular task. Therefore, the aggregation of weights throughout CL training can be deemed an ensemble of specialized models that consolidates knowledge across tasks, resulting in smoother decision boundaries.
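The weight aggregation of Equation 2 can be sketched as follows; the helper name and the flat-list weight representation are assumptions of this illustration.

```python
def ema_update(stable_weights, working_weights, gamma=0.999):
    """Equation 2: theta_s = gamma * theta_s + (1 - gamma) * theta_w,
    applied element-wise to flat lists of model weights."""
    return [gamma * s + (1.0 - gamma) * w
            for s, w in zip(stable_weights, working_weights)]

theta_s = ema_update([0.0, 1.0], [1.0, 0.0], gamma=0.9)
# theta_s ≈ [0.1, 0.9]: the stable model drifts slowly toward the working model
```

With γ close to 1, each individual training step of the working model perturbs the stable model only slightly, which is what makes the aggregation behave like an ensemble of task-specialized snapshots.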
In line with non-veridical rehearsal in the brain, the invention proposes an abstract, high-level representation rehearsal for transformers. The working model comprises two nested functions: Gw and Fw. The first few layers, such as 1-3 layers, of the transformer, Gw, process veridical inputs, and their output (r), along with the ground truth label, is stored into an episodic memory Dm. To ensure consistency in intermediate representations, Gw is learned during the first task and fixed before the onset of future tasks. The later layers, Fw, process abstract, high-level representations and remain learnable throughout CL training. During intermittent stages of inactivity, the stable counterpart Fs(.) is updated as per Equation 2.
The episodic memory can either be populated during the task training or at the task boundary. The representations stored in the episodic memory are taken together with the current task representations, and are synchronously processed by Fw(.) and Fs(.). The learning objective for representation rehearsal can thus be obtained by adapting Equation 1 as follows:
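The adapted equation is elided in this text. A plausible reconstruction, obtained from Equation 1 by replacing replayed raw inputs with buffered representations r processed by Fw (symbols as defined in the surrounding text), is:

```latex
\mathcal{L}_{repr} =
  \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_t}\!\left[\mathcal{L}_{ce}\big(F_w(G_w(x_i)), y_i\big)\right]
  + \alpha \, \mathbb{E}_{(r_j, y_j) \sim \mathcal{D}_m}\!\left[\mathcal{L}_{ce}\big(F_w(r_j), y_j\big)\right]
  \tag{3}
```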
The method in the framework as shown in
The representations which were taken together are then processed by Fw(.) while only the representations of previous task samples are processed by Fs(.).
During intermittent stages of inactivity, the knowledge in the working model is consolidated into the stable model through Equation 2. Although the knowledge of the previous tasks is encoded in the weights of stable model, the weights collectively represent a function (Fs(.)) that maps representations to the outputs [43]. Therefore, to retrieve the structural knowledge encoded in the stable model, we propose to regularize the function learnt by the working model by enforcing consistency in predictions of the working model with respect to the stable model i.e.
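The equation referenced here is elided. A plausible reconstruction, assuming the consistency loss is the expected Minkowski distance between the two models' predictions on buffered representations, as described in the following paragraph, is:

```latex
\mathcal{L}_{cr} =
  \mathbb{E}_{r \sim \mathcal{D}_m}\!\left[ \left\lVert F_s(r) - F_w(r) \right\rVert_p \right]
  \tag{4}
```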
Here Lcr represents the expected Minkowski distance between the corresponding pairs of predictions, with p ∈ {1, 2, . . . , ∞}. Consistency regularization [48] enables the working model to retrieve structural semantics from the stable model, which account for the knowledge pertaining to previous tasks. Consequently, the working model adapts the decision boundary for new tasks without catastrophically forgetting previous tasks.
The final learning objective for the working model is as follows:
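The equation itself is elided here. Consistent with the balancing parameter β described immediately below and with the rehearsal and consistency losses of Equations 3 and 4, Equation 5 plausibly reads:

```latex
\mathcal{L} = \mathcal{L}_{repr} + \beta \, \mathcal{L}_{cr}
\tag{5}
```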
where β is a balancing parameter.
Algorithm 1: Continual learning with representation rehearsal
Input: ∀ t ∈ {1, . . . , T}: task data Dt; working model θw = Fw(Gw(.)); teacher model Fs(.); buffer Dm = { }
for each task t do
  for each batch (x, y) drawn from Dt do
    if Dm ≠ ∅ then
      Sample representations and labels (r′, y′) from buffer Dm
      ŷ′w = Fw(r′)  ▷ Feed representations to working and teacher models
      ŷ′s = Fs(r′)
      Compute Lcr (Eq. 4)  ▷ Distillation loss for buffered samples
    end if
    ŷw = Fw(Gw(x))  ▷ Feed images to working and teacher models
    ŷs = Fs(Gs(x))
    Compute Lrepr (Eq. 3)  ▷ Cross-entropy loss
    L = Lrepr + β Lcr (Eq. 5)
    Update working model parameters θw
    θs ← γθs + (1 − γ)θw  ▷ EMA update for teacher model
    Dm ← Dm ∪ {(r, y)}  ▷ Store representations and labels in the buffer
  end for
end for
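As an illustrative toy sketch of this training loop: each "model" below is a single scalar weight, squared error stands in for the cross-entropy loss, and the consistency term uses the p = 2 Minkowski distance. All names, the gradient derivation and the hyperparameter values are assumptions of this sketch, not the embodiments' actual transformer implementation.

```python
import random

def g_w(x, g_weight):
    """Stand-in for the frozen early layers Gw: map an input to a representation r."""
    return g_weight * x

def f(r, weight):
    """Stand-in for the later layers (working Fw or teacher Fs): representation -> prediction."""
    return weight * r

def train(tasks, gamma=0.9, beta=1.0, lr=0.01, capacity=20, seed=0):
    rng = random.Random(seed)
    g_weight = 1.0              # Gw fixed (learned during the first task in the full method)
    theta_w, theta_s = 0.5, 0.5 # working and stable (teacher) weights
    buffer = []                 # episodic memory Dm of (representation, label) pairs
    for task in tasks:
        for x, y in task:
            cr_grad = 0.0
            if buffer:          # replay a stored representation
                r_old, _y_old = rng.choice(buffer)
                # consistency between working and stable predictions (Eq. 4, p = 2)
                diff = f(r_old, theta_w) - f(r_old, theta_s)
                cr_grad = 2.0 * diff * r_old
            r = g_w(x, g_weight)
            pred = f(r, theta_w)
            repr_grad = 2.0 * (pred - y) * r              # squared-error stand-in for Eq. 3
            theta_w -= lr * (repr_grad + beta * cr_grad)  # step on L = Lrepr + beta*Lcr (Eq. 5)
            theta_s = gamma * theta_s + (1.0 - gamma) * theta_w  # EMA consolidation (Eq. 2)
            if len(buffer) < capacity:
                buffer.append((r, y))                     # store representation and label
    return theta_w, theta_s, buffer

theta_w, theta_s, memory = train([[(1.0, 2.0)] * 200])
# theta_w and theta_s both approach the target weight 2.0
```

Note that the teacher is never updated by gradients: it only tracks the working model through the EMA, while the consistency term pulls the working model's predictions on buffered representations toward the teacher's.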
Typical application areas of the invention include, but are not limited to:
Embodiments of the present invention can include every combination of features that are disclosed herein independently from each other. Although the invention has been discussed in the foregoing with reference to an exemplary embodiment of the method of the invention, the invention is not restricted to this particular embodiment, which can be varied in many ways without departing from the invention. The discussed exemplary embodiment shall therefore not be used to construe the appended claims strictly in accordance therewith. On the contrary, the embodiment is merely intended to explain the wording of the appended claims without intent to limit the claims to this exemplary embodiment. The scope of protection of the invention shall therefore be construed in accordance with the appended claims only, wherein a possible ambiguity in the wording of the claims shall be resolved using this exemplary embodiment.
Variations and modifications of the present invention will be obvious to those skilled in the art and it is intended to cover in the appended claims all such modifications and equivalents. The entire disclosures of all references, applications, patents, and publications cited above are hereby incorporated by reference. Unless specifically stated as being “essential” above, none of the various components or the interrelationship thereof are essential to the operation of the invention. Rather, desirable results can be achieved by substituting various components and/or reconfiguration of their relationships with one another.
Optionally, embodiments of the present invention can include a general or specific purpose computer or distributed system programmed with computer software implementing steps described above, which computer software may be in any appropriate computer language, including but not limited to C++, FORTRAN, ALGOL, BASIC, Java, Python, Linux, assembly language, microcode, distributed programming languages, etc. The apparatus may also include a plurality of such computers/distributed systems (e.g., connected over the Internet and/or one or more intranets) in a variety of hardware implementations. For example, data processing can be performed by an appropriately programmed microprocessor, computing cloud, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), or the like, in conjunction with appropriate memory, network, and bus elements. One or more processors and/or microcontrollers can operate via instructions of the computer code and the software is preferably stored on one or more tangible non-transitory memory-storage devices.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2032721 | Aug 2022 | NL | national |