This application claims priority to and the benefit of Netherlands Patent Application No. 2032027, titled “METHOD AND SYSTEM FOR DYNAMIC COMPOSITIONAL GENERAL CONTINUAL LEARNING”, filed on May 31, 2022, and the specification and claims thereof are incorporated herein by reference.
The invention relates to a computer-implemented method and a system for dynamic compositional general continual learning of deep neural networks.
In recent years, deep neural networks (DNNs) have achieved human-level performance in several applications [1][2]. These networks are trained on multiple tasks within an application with the data being received under an independent and identically distributed (i.i.d.) assumption. This assumption is satisfied by shuffling the data from all tasks, and balancing and normalizing the samples from each task in the application [3]. Consequently, DNNs can achieve human-level performance on all tasks in these applications by modelling the joint distribution of the data as a stationary process. Humans, on the other hand, can model the world from inherently non-stationary and sequential observations [4]. Learning continually from the more realistic sequential and non-stationary data is crucial for many applications such as lifelong learning robots [5] and self-driving cars [6]. However, vanilla gradient-based training for such continual learning setups with a continuous stream of tasks and data leads to task interference in the DNN's parameters, and consequently, catastrophic forgetting on old tasks [7]. Therefore, there is a need for methods to alleviate task interference and catastrophic forgetting in continual learning.
Lately, some works have aimed to address these challenges in continual learning. These can be broadly classified into the following categories:
While these methods mitigate catastrophic forgetting to some extent, regularization-based and parameter isolation-based methods often fail on one or more of general continual learning (GCL) desiderata [17, 18] such as use of task-boundaries, requirement of task-identity of example at test time, and unconstrained growth or capacity depletion [17] of networks over a long sequence of tasks. Recent rehearsal-based methods [19, 20], however, adhere to the GCL desiderata, and outperform previous state-of-the-art methods.
Though rehearsal-based methods improve over other categories, they still suffer from catastrophic forgetting through task interference in the DNN parameters, as all parameters respond to all examples and tasks. This could be resolved by inculcating parameter-isolation in the rehearsal-based methods. However, it is worth noting that unlike parameter-isolation methods, compositionality and sparsity in the brain are not “static”. There is evidence that the brain responds to stimuli in a dynamic and compositional manner, with different “modules” or subsets of neurons responding “dynamically” to different stimuli, often reusing many previously learnt components [21].
The advantages of dynamic and compositional response to stimuli have also been explored in deep learning in stationary settings through mechanisms such as gating of modules, early-exit, and dynamic routing, along with training losses that incentivize sparsity and consequent compositionality of neural activations. These works observed that DNNs trained to predict dynamically also learn to respond differently to different inputs. Furthermore, the learned DNNs demonstrate clustering of parameters in terms of “tasks” such as similarity, difficulty, and scale of inputs [22, 23, 24], indicating dynamic modularity and compositionality.
Network pruning [35], a popular method for compressing DNNs, can be seen as an indirect attempt at mimicking modularity and sparsity in the human brain, by extracting a sub-network of the DNN that is primarily responsible for the task at hand. Pruning is generally achieved either by removing unimportant connections such as weights with low magnitudes (unstructured pruning), or by removing unimportant structures such as channels, filters, or layers (structured pruning) [35]. These approaches have achieved success in both i.i.d. and continual learning set-ups [37, 38]. However, the nature of modularity achieved through pruning is static, where all neurons react to every stimulus. Therefore, continual learning approaches which introduce “dynamic” sparsity still lead to a static non-modular network where the network doesn't drop, reuse, and recompose different modules, instead using all parameters to respond to every input. Some methods [8, 36] try to control the amount of learning in important parameters, but this too ultimately results in a static non-modular network and falls directly or indirectly under the regularization-based approaches.
Recently, a few works have introduced modularity and compositionality to continual learning setups. SG-F [25, 26], MNTDP [27], LMC [26], and MoE [28] follow approaches that expand the network in response to new tasks or outlier examples. To this end, they propose methods to initialize and project new modules in the existing feature spaces and accumulate and freeze old and consolidated information. However, these methods either use task-identities at test time or fail to perform convincingly on multiple complex datasets without the use of task-identities at test time, and therefore cannot be considered as general continual learning algorithms. Furthermore, they require theoretically unconstrained network growth for continual learning over long sequences, which is shown to be unnecessary by the experimental results detailed later in this document, as even standard networks like ResNets [29] can learn complex datasets compositionally and modularly in the i.i.d. settings [22, 23, 24]. Abati et al. [30] start with a standard ResNet but try to remove convolutional filters dynamically by growing task-specific units at each convolutional layer (for each task), resulting in large growth of network size over long sequences. Additionally, they employ task-boundaries to freeze a few units based on the validation set, and thus require a much larger memory buffer for previous samples. Finally, Chen et al. [31] employ a constant-capacity network for online continual learning but start continual training with a network that employs multiple residual blocks at every single layer, which is equivalent to training multiple ResNets, which, as argued earlier, is not necessary.
Note that this application refers to a number of publications. Discussion of such publications is given for more complete background and is not to be construed as an admission that such publications are prior art for patentability determination purposes.
Embodiments of the present invention correct the shortcomings of the prior art and provide a solution for dynamically compositional continual learning in deep neural networks. This will become apparent from the following disclosure directed to a computer-implemented method for general continual learning in deep neural networks, a data processing system, and a computer-readable medium, having the features of one or more of the appended claims.
In a first aspect of the invention, the computer-implemented method for general continual learning in deep neural networks maintains a standard network through training and adheres to general continual learning desiderata i.e. it can work without knowledge of task-identities at test-time, it does not employ task-boundaries and it has bounded memory even when training on longer sequences. Said computer-implemented method for general continual learning in deep neural networks, comprises the steps of:
For mitigating catastrophic forgetting, the method comprises the step of maintaining a constant-size memory buffer by updating said memory buffer using reservoir sampling. In particular, the step of updating said memory buffer is applied exclusively when the network predictions are correct.
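The buffer update described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation of the invention. The class name `ReservoirBuffer`, its method names, and the choice to count only correctly predicted examples as reservoir candidates are assumptions made for the sketch.

```python
import random


class ReservoirBuffer:
    """Constant-size memory buffer filled by reservoir sampling (illustrative)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []        # stored (example, label) pairs
        self.num_seen = 0      # number of candidates offered so far
        self.rng = random.Random(seed)

    def offer(self, example, label, predicted_label):
        """Try to store an example; returns True if the buffer changed."""
        # Assumption: misclassified examples are never candidates for storage.
        if predicted_label != label:
            return False
        self.num_seen += 1
        if len(self.items) < self.capacity:
            self.items.append((example, label))
            return True
        # Standard reservoir step: keep the new candidate with
        # probability capacity / num_seen, replacing a uniformly chosen slot.
        slot = self.rng.randrange(self.num_seen)
        if slot < self.capacity:
            self.items[slot] = (example, label)
            return True
        return False

    def sample(self, batch_size):
        """Draw a rehearsal batch without replacement."""
        k = min(batch_size, len(self.items))
        return self.rng.sample(self.items, k)


# Usage: a stream of 100 examples where every odd-indexed one is mispredicted.
buffer = ReservoirBuffer(capacity=5)
for i in range(100):
    label = i % 10
    predicted = label if i % 2 == 0 else (label + 1) % 10
    buffer.offer(example=i, label=label, predicted_label=predicted)
```

No task boundary appears anywhere in the update, so the buffer size stays bounded regardless of the length of the task sequence.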
In order to ensure that all activation information at the location is removed when the neural network structure comprises a batch normalization layer, the method comprises the step of applying actions after said batch normalization layer.
Advantageously, the method comprises the step of providing a neural network wherein the structure of said neural network comprises a ResNet architecture. In particular, the method comprises the step of providing a neural network wherein the structure of said neural network comprises four blocks wherein the agents of the self-attention network are linked to the convolutional layers of the last three of said blocks, wherein each block comprises two residual blocks and wherein each residual block comprises two convolutional layers. Additionally, the method comprises the step of removing channels from the outputs of the convolutional layers.
In order to reach the self-attention, the method comprises the steps of:
The size of the hidden layer is preferably between channels/8 and channels/64, more preferably, the size of the hidden layer is channels/16.
Preferably, the method comprises the step of using a Sigmoid with a temperature. The temperature serves the purpose of tuning the range of outputs of the self-attention layers, ensuring that the probabilities being sampled from to pick the action aren't too small and that enough activations are picked to enable learning.
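The effect of the temperature can be illustrated with a short sketch. The source does not state how the temperature enters the Sigmoid; the sketch assumes it divides the input, so that a temperature above 1 pulls probabilities toward 0.5. The function name `sigmoid_with_temperature` is hypothetical.

```python
import math


def sigmoid_with_temperature(x, temperature=1.0):
    """Numerically stable sigmoid of x / temperature.

    Assumption: the temperature divides the input; the source only states
    that a temperature is used to tune the output range.
    """
    z = x / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# A strongly negative attention output gives a very small keep probability
# at temperature 1, and a noticeably larger one at temperature 4.
p_plain = sigmoid_with_temperature(-4.0, temperature=1.0)
p_tempered = sigmoid_with_temperature(-4.0, temperature=4.0)
```

Under this assumption, raising the temperature keeps more activations likely to be sampled early in training, which matches the stated purpose of preventing the probabilities from becoming too small.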
The method comprises the step of calculating at least one task loss (LT) wherein a cross-entropy loss is minimized and applied on current data and on data stored in the memory buffer. Task losses seek to enhance performance on the task or application, which is the primary objective of training the network. Image classification is a preferred application of the current invention; therefore, the current invention is preferably embodied such that a cross-entropy loss is minimized. These losses are applied on both current as well as memory samples.
$$L_T(\text{inputs}) = \mathrm{CE}(X, Y)$$
where, CE refers to a standard cross-entropy function, and X and Y refer to the input image and corresponding label respectively.
Furthermore, aiming at providing a good sparsity-accuracy trade-off, the method comprises the step of calculating, for each agent, at least one agent loss comprising a reward function and a corresponding policy gradient loss, wherein the reward function comprises the steps of:
The method comprises the step of calculating at least two consistency losses applied on final representations of the network and on sub-networks of the agent, wherein for each consistency loss a mean squared error loss is minimized for enforcing consistency. Consistency losses seek to impose consistency between replayed and memory/saved representations, thereby mitigating forgetting of soft knowledge.
The method comprises the step of calculating at least one prototype loss wherein a ratio of pairwise mean squared errors between representations of same classes to pairwise mean squared errors between representations from different classes is minimized, and wherein said prototype loss is applied on current data and on data stored in the memory buffer. Prototype losses incentivize the learning of input-adaptive class prototypes by pulling final representations from agent subnetworks together when they are from the same class and pushing them away from each other when they are from different classes.
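The ratio described above can be sketched in plain Python. This is an illustrative reading only: averaging over pairs (rather than summing), the small constant `eps`, and returning zero when one of the two pair sets is empty are assumptions, and the function names are hypothetical.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def prototype_loss(representations, labels, eps=1e-8):
    """Ratio of same-class to different-class pairwise MSE (illustrative)."""
    same, different = [], []
    n = len(representations)
    for a in range(n):
        for b in range(a + 1, n):
            distance = mse(representations[a], representations[b])
            if labels[a] == labels[b]:
                same.append(distance)
            else:
                different.append(distance)
    # Assumption: the loss is undefined without both kinds of pairs; return 0.
    if not same or not different:
        return 0.0
    mean_same = sum(same) / len(same)
    mean_different = sum(different) / len(different)
    return mean_same / (mean_different + eps)


# Two tight clusters: the loss is small when labels follow the clusters
# and large when labels cut across them.
reps = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]]
loss_clustered = prototype_loss(reps, [0, 0, 1, 1])
loss_mixed = prototype_loss(reps, [0, 1, 0, 1])
```

Minimizing the ratio lowers the numerator (same-class representations move together) and raises the denominator (different-class representations move apart), which is the pull and push behaviour described in the text.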
The method comprises the step of calculating at least one exploration loss wherein, for each agent, an entropy of action probabilities is maximized and wherein said exploration loss is applied on current data. Exploration losses seek to “explore” the solution space, and therein avoid activating the same units repeatedly.
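A possible form of the exploration loss is sketched below. It assumes each action probability parameterizes an independent Bernoulli variable, and that the entropy is averaged over activations and over agents; neither detail is stated in the source, and the function names are hypothetical.

```python
import math


def bernoulli_entropy(p, eps=1e-8):
    """Entropy (in nats) of a Bernoulli variable with keep probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def exploration_loss(action_probs_per_agent):
    """Negative mean entropy, so that minimizing it maximizes entropy."""
    per_agent = [
        sum(bernoulli_entropy(p) for p in probs) / len(probs)
        for probs in action_probs_per_agent
    ]
    return -sum(per_agent) / len(per_agent)


# Undecided agents (p = 0.5) give the lowest loss; saturated agents give a
# loss close to zero, which the optimizer would try to push down.
loss_undecided = exploration_loss([[0.5, 0.5], [0.5, 0.5]])
loss_saturated = exploration_loss([[0.99, 0.01], [0.99, 0.99]])
```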
The method comprises the step of calculating a total loss function for achieving continual learning by providing a weighted sum of the at least one task loss, the at least one agent loss, the at least one consistency loss, the at least one prototype loss, and the at least one exploration loss.
The method comprises the step of multiplying the at least one exploration loss with a weight smaller than a weight of the at least one task loss, a weight of the at least one agent loss, a weight of the at least two consistency losses, and a weight of the at least one prototype loss. The exploration losses always use a small weight as too much exploration can hinder learning itself.
In order to give the agents a better search space when they start searching for a solution, the method comprises the step of establishing a warmup stage of training for a plurality of initial epochs of a first task, wherein the at least one task loss is exclusively applied on current data and wherein remaining losses are excluded.
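The warmup rule can be expressed as a small schedule function. This is a sketch under assumptions: the loss names in the returned sets and the zero-based task and epoch indices are hypothetical, and the number of warmup epochs is left as a parameter because the source does not fix it.

```python
ALL_LOSSES = frozenset({
    "task_current", "task_memory", "agent",
    "consistency", "prototype", "exploration",
})


def active_losses(task_index, epoch, warmup_epochs):
    """Return the set of loss terms used at a given point in training.

    Assumption: tasks and epochs are counted from zero, and warmup covers
    the first `warmup_epochs` epochs of the first task only.
    """
    in_warmup = task_index == 0 and epoch < warmup_epochs
    if in_warmup:
        # Only the task loss on current samples is applied during warmup.
        return frozenset({"task_current"})
    return ALL_LOSSES


# With three warmup epochs, epoch 2 of the first task is still warmup,
# while epoch 3 of the first task and every epoch of later tasks are not.
during = active_losses(task_index=0, epoch=2, warmup_epochs=3)
after = active_losses(task_index=0, epoch=3, warmup_epochs=3)
later_task = active_losses(task_index=1, epoch=0, warmup_epochs=3)
```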
In a second embodiment of the invention, the computer-readable medium is provided with a computer program wherein when said computer program is loaded and executed by a computer, said computer program causes the computer to carry out the steps of the computer-implemented method according to any one of aforementioned steps.
In a third embodiment of the invention, the data processing system comprises a computer loaded with a computer program wherein said program is arranged for causing the computer to carry out the steps of the computer-implemented method according to any one of aforementioned steps.
Objects, advantages and novel features, and further scope of applicability of the present invention will be set forth in part in the detailed description to follow, taken in conjunction with the accompanying drawings, and in part will become apparent to those skilled in the art upon examination of the following, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more embodiments of the present invention and, together with the description, serve to explain the principles of the invention. The drawings are only for the purpose of illustrating one or more embodiments of the invention and are not to be construed as limiting the invention. In the drawings:
Whenever in the figures the same reference numerals are applied, these numerals refer to the same parts.
The proposed computer-implemented method for general continual learning combines rehearsal-based methods with dynamic modularity and compositionality. Concretely, the method aims at achieving three objectives:
To achieve dynamic and compositional response to inputs, multiple agent subnetworks are defined in the DNN, each responsible for zeroing out activations of a layer, based on the input to that layer. The agents are rewarded for choosing actions that lower parameter utilization (sparse and compositional responses) if the network predictions are accurate but are penalized heavily for choosing actions that lead to inaccurate predictions. Furthermore, the representations that the agents sample actions from are incentivized to be pushed together for inputs from same classes, and pulled away for inputs from different classes, resulting in the learning of dynamic class prototypes in the agent subnetworks. To reduce forgetting and achieve competent task performance, we maintain a constant-size memory-buffer in which we store previously seen examples. The network is trained on current examples alongside previous examples to both maintain performance on current and previous tasks, as well as to make multi-scale associations of current and previous soft-knowledge.
The approach according to the current invention is divided into two components:
Agents are built into the network structure such that the network tries to perform competently on the application while showcasing sparse responses for any given example. For any layer, the agent is a self-attention network which processes the input to that layer to emit as many outputs as activations in the layer. These outputs are converted into probabilities and sampled from as a Bernoulli distribution to decide corresponding binary actions, where 0 means dropping the corresponding activation and 1 means using the corresponding activation. Therefore, across the network, there are multiple agents, each of which tries to induce sparsity and modularity locally by zeroing out activations, while they all co-operate to achieve competent application-performance globally. In practice, the actions are applied after the batch normalization layer, if any, to ensure all activation information at the location is removed.
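The sampling and masking step described above can be sketched as follows, using nested Python lists in place of tensors. The function names and the list-based layout (channels, rows, columns) are assumptions for illustration; an actual implementation would operate on batched tensors inside the network.

```python
import random


def sample_actions(probabilities, rng):
    """Draw one binary action per activation: 1 keeps it, 0 drops it."""
    return [1 if rng.random() < p else 0 for p in probabilities]


def apply_actions(feature_map, actions):
    """Zero out every channel whose action is 0.

    feature_map is a list of channels, each a list of rows of floats.
    """
    return [
        [[value * action for value in row] for row in channel]
        for channel, action in zip(feature_map, actions)
    ]


# Two channels of size 1x2; probabilities 0.0 and 1.0 make the draw
# deterministic, so the first channel is dropped and the second kept.
rng = random.Random(0)
actions = sample_actions([0.0, 1.0], rng)
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
masked = apply_actions(features, actions)
```

Because each agent masks only its own layer, sparsity is decided locally per input, while the task loss on the final prediction couples all agents globally.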
Embodiments of the present invention are preferably embodied such that the network is a ResNet-18 with agents corresponding to the convolutional layers of the last 3 of the 4 blocks, with each block containing 2 residual blocks with 2 convolutional layers each, resulting in 12 agents in total. While the agent can be used for any or all convolutional layers, no agent is used in the first block, as it has been noted that earlier layers undergo minimal forgetting [34], are highly transferable [33], and get used for most examples even when learned with dynamic modularity [30]. These agents then remove activations (channels) from the convolutional layer's outputs as discussed earlier. However, as the residual connections in ResNet retain the activation information removed by the agents, the agent for the second (i.e. final) convolutional layer in the block is applied to the residual activations as well. The agent subnetworks (i.e. self-attention networks) use pointwise convolution, batch normalization, and global average pooling to get a channel-length representation, which is sent through a multilayer perceptron (MLP) with one hidden layer of size channels/16 and Sigmoid activation, and then multiplied with the original channel-length representation to get the self-attention. The action probabilities are computed from the result of this self-attention operation using Sigmoid with a temperature. The temperature serves the purpose of tuning the range of outputs of the self-attention layers, ensuring that the probabilities being sampled from to pick the action aren't too small and that enough activations are picked to enable learning. The general structure of the agent at any given layer is shown in
A variety of losses are used to achieve the objectives of sparsity and compositionality, competent application performance, and mitigating forgetting. A constant-sized memory buffer is maintained and is updated using reservoir sampling [32], which helps approximate the global data distribution in the buffer, without the use of task-boundaries. Note that in this embodiment, the memory buffer is only updated if the predictions were made correctly. At any given time during training, data is sampled once from the current data stream as well as from the memory buffer.
In detail, the following losses may be used separately or in combination:
$$L_T(\text{inputs}) = \mathrm{CE}(X, Y)$$
where, CE refers to a standard cross-entropy function, and X and Y refer to the input image and corresponding label respectively.
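As a worked illustration of the task loss, the sketch below computes a standard softmax cross-entropy from raw class scores. It assumes the network has already mapped each input image X to a list of logits; the function names and the averaging over the batch are assumptions for the sketch.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def task_loss(batch_logits, batch_labels):
    """Mean cross-entropy over a batch (current or memory samples)."""
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


# Uniform logits over two classes cost log(2); a confident correct
# prediction costs less than a confident wrong one.
loss_uniform = cross_entropy([0.0, 0.0], 0)
loss_correct = cross_entropy([5.0, 0.0], 0)
loss_wrong = cross_entropy([5.0, 0.0], 1)
loss_batch = task_loss([[5.0, 0.0], [5.0, 0.0]], [0, 1])
```

The same function would be applied once to the current batch and once to the batch drawn from the memory buffer, since the task loss is used on both.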
where $a_{l,i}$ refers to the use/drop binary action taken at the i-th activation of the l-th layer, λ>0 is a penalty imposed on incorrect predictions, and kr is the keep ratio, i.e. the ratio of activations to be retained at each layer.
This results in the following policy-gradient loss to be minimized:
where $p_{l,i}$ is the i-th probability released by the l-th probability layer (see
$$L_C(Y'_S, Y'_R) = \mathrm{MSE}(Y'_S, Y'_R)$$
where, the subscripts S and R refer to saved (i.e. from memory buffer) and replayed (i.e. passed through the network again at current state) predictions, respectively.
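The consistency term can be illustrated with a short sketch that compares predictions stored in the memory buffer with predictions recomputed by the current network for the same samples. The function name and the list-of-lists layout are assumptions for illustration.

```python
def consistency_loss(saved_predictions, replayed_predictions):
    """Mean squared error between saved and replayed prediction vectors."""
    total = 0.0
    count = 0
    for saved, replayed in zip(saved_predictions, replayed_predictions):
        for s, r in zip(saved, replayed):
            total += (s - r) ** 2
            count += 1
    return total / count


# If the current network reproduces the stored outputs exactly, the loss is
# zero; any drift in the outputs is penalized quadratically.
saved = [[2.0, -1.0], [0.5, 0.5]]
loss_no_drift = consistency_loss(saved, [[2.0, -1.0], [0.5, 0.5]])
loss_drift = consistency_loss(saved, [[1.0, -1.0], [0.5, 1.5]])
```

In the drifted case two of the four entries differ by 1, so the loss is (1 + 1) / 4 = 0.5.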
where, Y′ is a set of predictions, and a and b are subscripts referring to samples from this set.
where L is the total number of layers on which agents act, i.e. the total number of agents.
These losses are preferably used with a weighted sum to get the total loss for achieving continual learning. Note that the exploration losses always use a small weight as too much exploration can hinder learning itself.
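$$L_{total} = L_T(X_B, Y_B) + w_e L_E(X_B) + \gamma L_r(X_B) + \beta \left[ L_T(X_M, Y_M) + \gamma L_r(X_M) \right] + \alpha L_C(Y'_S, Y'_R) + \alpha_p L_C(Y'_{A,S}, Y'_{A,B}) + w_p \left[ L_P(Y'_B) + L_P(Y'_R) \right]$$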
where the subscripts M and B refer to samples from the memory buffer and from the current batch, respectively. Subscripts S, A, and R refer to saved (i.e. from memory), agent (i.e. from the agent subnetworks), and replayed (i.e. when a memory sample is sent through the network again) predictions. X is a batch of images, Y is a batch of labels, and Y′ is a batch of predictions. Agent predictions refer to the channel-attention vector on which the probability layer is applied.
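The weighted sum can be checked numerically with a small sketch. The component values are assumed to have been computed already; the dictionary keys, the `LossWeights` container, and the weight values used in the example are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    w_e: float      # exploration weight (kept small)
    gamma: float    # agent (policy-gradient) loss weight
    beta: float     # weight of the losses computed on memory samples
    alpha: float    # consistency weight on final predictions
    alpha_p: float  # consistency weight on agent predictions
    w_p: float      # prototype loss weight


def total_loss(c, w):
    """Combine already-computed loss components following the weighted sum."""
    return (
        c["task_batch"]
        + w.w_e * c["exploration_batch"]
        + w.gamma * c["agent_batch"]
        + w.beta * (c["task_memory"] + w.gamma * c["agent_memory"])
        + w.alpha * c["consistency_final"]
        + w.alpha_p * c["consistency_agent"]
        + w.w_p * (c["prototype_batch"] + c["prototype_replayed"])
    )


# Hypothetical weights; every component is set to 1.0 so that the result
# shows the effective contribution of each weight.
weights = LossWeights(w_e=0.01, gamma=0.5, beta=0.5,
                      alpha=0.2, alpha_p=0.1, w_p=0.3)
components = {name: 1.0 for name in (
    "task_batch", "exploration_batch", "agent_batch", "task_memory",
    "agent_memory", "consistency_final", "consistency_agent",
    "prototype_batch", "prototype_replayed",
)}
value = total_loss(components, weights)
```

With these illustrative numbers the sum is 1 + 0.01 + 0.5 + 0.75 + 0.2 + 0.1 + 0.6 = 3.16, and the exploration term contributes the least, in line with the note that it always uses a small weight.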
Additionally, the computer-implemented method according to the invention employs a warmup stage of training for the first few epochs of the first task, where only task losses on current samples are used for learning, to give the agents a better search space when they start searching for a solution (after warmup stage).
Induced sparsity at a convolutional layer can be seen in
Results on Sequential-CIFAR10 can be seen in Table 1, where the computer-implemented method according to the invention outperforms several state-of-the-art methods and performs close to DER++ while only using part of the capacity for each input, as evidenced by
Typical application areas of the invention include, but are not limited to:
Embodiments of the present invention can include every combination of features that are disclosed herein independently from each other. Although the invention has been discussed in the foregoing with reference to an exemplary embodiment of the method of the invention, the invention is not restricted to this particular embodiment which can be varied in many ways without departing from the invention. The discussed exemplary embodiment shall therefore not be used to construe the appended claims strictly in accordance therewith. On the contrary the embodiment is merely intended to explain the wording of the appended claims without intent to limit the claims to this exemplary embodiment. The scope of protection of the invention shall therefore be construed in accordance with the appended claims only, wherein a possible ambiguity in the wording of the claims shall be resolved using this exemplary embodiment.
Variations and modifications of the present invention will be obvious to those skilled in the art and it is intended to cover in the appended claims all such modifications and equivalents. The entire disclosures of all references, applications, patents, and publications cited above are hereby incorporated by reference. Unless specifically stated as being “essential” above, none of the various components or the interrelationship thereof are essential to the operation of the invention. Rather, desirable results can be achieved by substituting various components and/or reconfiguration of their relationships with one another.
Optionally, embodiments of the present invention can include a general or specific purpose computer or distributed system programmed with computer software implementing steps described above, which computer software may be in any appropriate computer language, including but not limited to C++, FORTRAN, ALGOL, BASIC, Java, Python, Linux, assembly language, microcode, distributed programming languages, etc. The apparatus may also include a plurality of such computers/distributed systems (e.g., connected over the Internet and/or one or more intranets) in a variety of hardware implementations. For example, data processing can be performed by an appropriately programmed microprocessor, computing cloud, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), or the like, in conjunction with appropriate memory, network, and bus elements. One or more processors and/or microcontrollers can operate via instructions of the computer code and the software is preferably stored on one or more tangible non-transitive memory-storage devices.
Number | Date | Country | Kind |
---|---|---|---|
2032027 | May 2022 | NL | national |