This application claims priority to and the benefit of Netherlands Patent Application No. 2032686, titled “A Computer-implemented Method and a System for a Biologically Plausible Framework for Continual Learning in Artificial Neural Network”, filed on Aug. 4, 2022, and the specification and claims thereof are incorporated herein by reference.
The invention relates to a computer-implemented method and a system for a Biologically Plausible Framework for continual learning in an artificial neural network.
Catastrophic forgetting is the tendency of an artificial neural network to completely and abruptly forget previously learned information upon learning new information. Continual Learning (also known as Incremental Learning or Life-long Learning) refers to learning a model for a large number of tasks sequentially without forgetting knowledge obtained from the preceding tasks, where the data of the old tasks are no longer available when training new ones.
The human brain excels at continually learning from a dynamically changing environment, whereas standard artificial neural networks (ANNs) are inherently designed for training on stationary data. The sequential learning of tasks in continual learning (CL) violates this assumption of stationarity, resulting in catastrophic forgetting. While ANNs are inspired by biological neurons [14], they omit numerous details of the design principles and learning mechanisms of the brain. These fundamental differences may account for the mismatch in performance and behavior.
The ability to continuously learn and adapt to an ever-changing environment is essential for any learning agent (deep neural network) deployed in the real world. For instance, an autonomous car needs to continually adapt to different road, weather, and lighting conditions, and learn new traffic signs and lane markings as it moves from one place to another.
Biological neural networks are characterized by considerably more complex synapses and dynamic, context-dependent processing of information, where each individual neuron has a specific role. Each presynaptic neuron has an exclusively excitatory or inhibitory impact on its postsynaptic partners, as postulated by Dale's principle [37]. Furthermore, the distal dendritic segments in pyramidal neurons, which account for most excitatory cells in the neocortex, receive additional context information and enable context-dependent processing of information. This, in conjunction with inhibition, allows the network to learn task-specific patterns and avoid catastrophic forgetting [5, 23, 42]. Additionally, the replay of sparse non-overlapping neural activities of past experiences in the neocortex and hippocampus is considered to play a critical role in memory formation, consolidation, and retrieval [30, 41]. To protect information from erasure, the brain employs synaptic consolidation, whereby the rates of plasticity are selectively decreased in proportion to strengthened synapses [10].
Standard ANNs, however, lack adherence to Dale's principle, as neurons contain both positive and negative output weights, and the signs can change while learning. Furthermore, standard ANNs are based on a point neuron model, which is an oversimplified model of biological computations and lacks the sophisticated nonlinear and context-dependent behavior of pyramidal cells. While studies have attempted to address these shortcomings individually, there is a lack of a biologically plausible framework which incorporates all of these components and enables studying the effect and interactions of different mechanisms inspired by the brain.
This application refers to and cites a number of published references. Discussion of such references is given for a more complete background and is not to be construed as an admission that such references are prior art for purposes of determining patentability.
It is an object of the current invention to correct the shortcomings of the prior art and to mitigate catastrophic forgetting in deep neural networks (DNNs), whereby the network forgets previously learned information when learning a new task; mitigating it requires a delicate balance between the stability (the ability to retain previous information) and the plasticity (the flexibility to learn new information) of the model. This and other objects, which will become apparent from the following disclosure, are provided with a computer-implemented method for general continual learning in artificial neural networks, a data processing system, and a computer-readable medium, having the features of one or more of the appended claims.
In biological neural networks, dendritic segments are tree-like extensions at the periphery of a neuron that help increase the surface area of the neuron body. These tiny protrusions receive information from other neurons and transmit electrical stimulation to the neuron body. They can integrate postsynaptic signals nonlinearly and filter out insignificant background information. Similarly, in an artificial neural network, dendritic segments of artificial neurons are elements to funnel weighted synaptic inputs to the artificial neurons. Accordingly, they have the potential to mimic the integrative properties of their biological counterparts.
In one embodiment of the present invention, a computer-implemented method for learning in an artificial neural network comprises the step of providing a network comprising a plurality of layers, wherein each layer comprises a population of exclusively excitatory neurons and a population of exclusively inhibitory neurons, wherein the population of exclusively excitatory neurons is larger than the population of exclusively inhibitory neurons, and wherein all synaptic weights of said network are exclusively positive, i.e., the signs of the output weights of said neurons do not change while learning. In this method of the invention, which is applied for general continual learning in an artificial neural network, the method comprises the steps of:
Furthermore, the method comprises the step of providing excitatory connections between the layers, excitatory projections to the inhibitory neurons, and inhibitory projections within the layers, as synaptic weights of the network.
These features help avoid catastrophic forgetting and provide a biologically plausible framework where, as in biological networks, the feedforward neurons adhere to Dale's principle and the excitatory neurons mimic the integrative properties of active dendrites for context-dependent processing of stimuli.
To enable context-dependent processing of information, one instantiation of the context signal to the dendrites needs to be evaluated; therefore, the method comprises the step of evaluating a prototype vector for each task by taking the element-wise mean of the task's samples at the beginning of the task, and providing said prototype vector as context to the dendritic segments during training.
Alternatively, the method comprises the steps of providing a learnable context network, and evaluating the context signal provided to the dendritic segments using said learnable context network.
The learnable context network can be a Multi-Layer Perceptron (MLP) or a convolutional neural network (ConvNet) and it has the advantage of being able to provide different signals as context to the dendritic segments depending on the task to be solved.
Furthermore, the method comprises the step of selecting, during inference, the closest prototype vector to each test sample as the context vector using Euclidean distance among all task prototypes stored in memory.
To provide an efficient mechanism for controlling the sparsity in activations, the method comprises the step of applying a k-Winners-Take-All function to the activations modulated by the dendritic segment with the highest response to the context vector.
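By way of non-limiting illustration, the k-Winners-Take-All activation may be sketched as follows in Python with the PyTorch library; the function name and the handling of ties at the threshold are illustrative assumptions of this sketch, not part of the claimed method:

```python
import torch

def k_winners_take_all(x: torch.Tensor, k: int) -> torch.Tensor:
    """Propagate the k largest activations per sample; set the rest to zero.

    x: activations of shape (batch, n_units); k: number of winners to keep.
    """
    topk_values, _ = torch.topk(x, k, dim=1)
    threshold = topk_values[:, -1:]        # k-th largest value per sample
    mask = (x >= threshold).float()        # ties at the threshold are all kept
    return x * mask
```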
Additionally, the method comprises the step of maintaining a constant sparsity in connections by randomly setting a percentage of weights to zero at initialization, wherein said percentage of weights is between 0 and 100%.
The context-dependent processing of information in conjunction with sparse activation patterns can effectively reduce the overlap of representations which leads to less interference between the tasks and thereby less forgetting. Therefore, the method comprises the steps of:
These features encourage the model to learn the new task by utilizing neurons that have been less active for previous tasks.
For a biologically plausible ANN, it is important to not only incorporate the design elements of biological neurons, but also the learning mechanisms it employs. Lifetime plasticity in the brain generally follows the Hebbian principle: a neuron that consistently contributes to making another neuron fire will build a stronger connection to that neuron. Therefore, the method of the current invention comprises the step of strengthening connections between a context input and a dendritic segment corresponding to said context input, by applying a Hebbian update on said dendritic segments for each supervised parameter update with backpropagation.
Advantageously, the method comprises the step of using Oja's rule for adding weight decay to the Hebbian update.
Additionally, the method comprises the step of employing synaptic consolidation comprising the steps of:
In addition to their integrative properties, dendrites also play a key role in retaining information and providing protection from erasure. The new spines that are formed on different sets of dendritic branches in response to learning different tasks are protected from being eliminated through mediation in synaptic plasticity and structural changes which persist when learning a new task. Hence, the method comprises the step of adjusting an importance estimate of each synapse to account for disparities, caused by the population of inhibitory neurons, in the degree to which updates to different parameters affect an output of a layer.
Additionally, the method comprises the steps of:
The replay mechanism in the hippocampus has inspired a series of rehearsal-based approaches which have proven to be effective in challenging continual learning scenarios. Therefore, to replay samples from the previous tasks, the method comprises the step of maintaining an episodic memory buffer by using reservoir sampling.
Suitably, the method comprises the step of matching a distribution of an incoming stream by assigning to each new sample equal probabilities for being represented in the episodic memory buffer.
More suitably, the method comprises the steps of:
In a second embodiment of the invention, a computer-readable medium is provided with a computer program, wherein, when said computer program is loaded and executed by a computer, it causes the computer to carry out the steps of the computer-implemented method according to any one of the aforementioned steps.
In a third embodiment of the invention, a data processing system comprises a computer loaded with a computer program, wherein said program is arranged for causing the computer to carry out the steps of the computer-implemented method according to any one of the aforementioned steps.
Objects, advantages and novel features, and further scope of applicability of the present invention will be set forth in part in the detailed description to follow, taken in conjunction with the accompanying drawings, and in part will become apparent to those skilled in the art upon examination of the following, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more embodiments of the present invention and, together with the description, serve to explain the principles of the invention. The drawings are only for the purpose of illustrating one or more embodiments of the invention and are not to be construed as limiting the invention. In the drawings:
Whenever in the figures the same reference numerals are applied, these numerals refer to the same parts.
Biological neural networks differ from their artificial counterparts in the complexity of the synapses and the role of individual units. Notably, most neurons in the brain adhere to Dale's principle, which posits that presynaptic neurons can only have an exclusively excitatory or exclusively inhibitory impact on their postsynaptic partners [37]. Several studies show that the balanced dynamics [32, 39] of excitatory and inhibitory populations provide functional advantages, including efficient predictive coding [8] and pattern learning [22]. Furthermore, inhibition is hypothesized to play a role in alleviating catastrophic forgetting [5]. Standard ANNs, however, lack adherence to Dale's principle as neurons contain both positive and negative output weights, and the signs can change while learning.
Cornford et al. incorporate Dale's principle into ANNs (referred to as DANNs), which take into account the distinct connectivity patterns of the excitatory and inhibitory neurons [11] and perform comparably to standard ANNs in benchmark object recognition tasks. Each layer $l$ comprises a separate population of excitatory, $h_e^l \in \mathbb{R}_+^{n_e}$, and inhibitory, $h_i^l \in \mathbb{R}_+^{n_i}$, neurons, where $n_e \gg n_i$ and synaptic weights are strictly non-negative. Similar to biological networks, while both populations receive excitatory projections from the previous layer ($h_e^{l-1}$), only the excitatory neurons project between layers, whereas the inhibitory neurons inhibit the activity of the excitatory units of the same layer. Cornford et al. characterized these properties by three sets of strictly positive weights: the excitatory connections between layers, $W_{ee}^l$; the excitatory projections to the inhibitory units, $W_{ie}^l$; and the inhibitory projections within the layer, $W_{ei}^l$.
The output of the excitatory units is impacted by the subtractive inhibition from the inhibitory units:
$$z^l = (W_{ee}^l - W_{ei}^l W_{ie}^l)\, h_e^{l-1} + b^l \quad (1)$$

where $b^l \in \mathbb{R}^{n_e}$ is the bias term.
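By way of non-limiting illustration, a layer according to Equation (1) may be sketched in Python with the PyTorch library as follows; the initialization scheme and the clamping used to keep weights non-negative are simplifying assumptions of this sketch, and the exact parameterization of Cornford et al. [11] may differ:

```python
import torch
import torch.nn as nn

class DaleLayer(nn.Module):
    """Feedforward layer with separate excitatory/inhibitory populations (Eq. 1).

    All weights are constrained to be non-negative; inhibition is subtractive.
    """
    def __init__(self, n_in: int, n_e: int, n_i: int):
        super().__init__()
        self.W_ee = nn.Parameter(torch.rand(n_e, n_in) * 0.1)  # excitatory -> excitatory
        self.W_ie = nn.Parameter(torch.rand(n_i, n_in) * 0.1)  # excitatory -> inhibitory
        self.W_ei = nn.Parameter(torch.rand(n_e, n_i) * 0.1)   # inhibitory -> excitatory
        self.bias = nn.Parameter(torch.zeros(n_e))

    def forward(self, h_e: torch.Tensor) -> torch.Tensor:
        # z^l = (W_ee - W_ei W_ie) h_e^{l-1} + b^l  (subtractive inhibition)
        effective_w = self.W_ee - self.W_ei @ self.W_ie
        return h_e @ effective_w.t() + self.bias

    @torch.no_grad()
    def clamp_weights(self) -> None:
        # Enforce Dale's principle: weight signs never change during learning.
        for w in (self.W_ee, self.W_ie, self.W_ei):
            w.clamp_(min=0.0)
```

In this sketch, clamp_weights() would be called after every optimizer step so that all synaptic weights remain exclusively positive.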
The method of the current invention employs DANNs as the feedforward neurons, which perform comparably to standard ANNs in the challenging CL setting, and provides a biologically plausible framework for further studying the role of inhibition in alleviating catastrophic forgetting.
The brain employs specific structures and mechanisms for context-dependent processing and routing of information. The prefrontal cortex, which plays an important role in cognitive control [31], receives sensory inputs as well as contextual information, which enables it to select the sensory features most relevant to the present task to guide actions [15, 29, 36, 44]. Of particular interest are the pyramidal cells, which represent the most populous members of the excitatory family of neurons in the brain [7]. The dendritic spines in pyramidal cells exhibit highly non-linear integrative properties which are considered important for learning task-specific patterns [42]. Pyramidal cells integrate a range of diverse inputs on multiple independent dendritic segments, whereby contextual inputs on active dendrites can modulate a neuron's response, making it more likely to fire. Standard ANNs, however, are based on a point neuron model, which is an oversimplified model of biological computations and lacks the sophisticated nonlinear and context-dependent behavior of pyramidal cells. Iyer et al. model these integrative properties of dendrites by augmenting each neuron with a set of dendritic segments. Multiple dendritic segments receive additional contextual information, which is processed using a separate set of weights. The resultant dendritic output modulates the feedforward activation, which is computed by a linear weighted sum of the feedforward inputs. This computation results in a neuron whose magnitude of response to a given stimulus is highly context-dependent. To enable task-specific processing of information, the prototype vector $c_\tau$ for task $\tau$ is evaluated by taking the element-wise mean of the task samples, $D_\tau$, at the beginning of the task, and this prototype vector is subsequently provided as context during training.
During inference, the closest prototype vector to each test sample, x′, is selected as the context using Euclidean distance among all the task prototypes, C, stored in memory.
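A minimal sketch of the prototype-based context in Python with the PyTorch library, assuming inputs are flattened to vectors (the function names are illustrative):

```python
import torch

def task_prototype(task_samples: torch.Tensor) -> torch.Tensor:
    """Context vector for a task: element-wise mean of its (flattened) samples."""
    return task_samples.flatten(start_dim=1).mean(dim=0)

def select_context(x: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """At inference, task identity is unknown: pick, for each test sample,
    the stored prototype closest in Euclidean distance."""
    dists = torch.cdist(x.flatten(start_dim=1), prototypes)  # (batch, n_tasks)
    nearest = dists.argmin(dim=1)
    return prototypes[nearest]
```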
The method of the current invention comprises the step of augmenting the excitatory units in each layer with dendritic segments, whereby the feedforward activation is modulated by the dendritic segment that responds most strongly to the context vector:

$$h_e^l = \text{k-WTA}\big(z^l \times \sigma(\max_j u_j^\top c)\big)$$

where $\sigma$ is the sigmoid function and k-WTA(.) is the k-Winners-Take-All activation function [2], which propagates only the top $k$ neurons and sets the rest to zero. This provides a biologically plausible framework where, as in biological networks, the feedforward neurons adhere to Dale's principle and the excitatory neurons mimic the integrative properties of active dendrites for context-dependent processing of stimuli.
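A non-limiting sketch of this dendritic modulation, reusing the k_winners_take_all helper sketched earlier; the tensor layout of the dendritic weights U is an assumption of this sketch:

```python
import torch

def dendritic_modulation(z: torch.Tensor, U: torch.Tensor,
                         c: torch.Tensor, k: int) -> torch.Tensor:
    """Modulate feedforward excitatory activations with dendritic segments.

    z: feedforward activations, shape (batch, n_e)
    U: dendritic weights, shape (n_e, n_segments, d_context)
    c: context vectors, shape (batch, d_context)
    """
    # Response of every dendritic segment to the context signal.
    responses = torch.einsum('nsd,bd->bns', U, c)   # (batch, n_e, n_segments)
    strongest = responses.max(dim=2).values         # winning segment per neuron
    modulated = z * torch.sigmoid(strongest)        # gate the feedforward output
    return k_winners_take_all(modulated, k)         # sparsify the activations
```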
Neocortical circuits are characterized by high levels of sparsity in neural connectivity and activations [6, 16]. This is in stark contrast to the dense and highly entangled connectivity in the standard ANNs. Particularly for continual learning, sparsity provides several advantages: sparse non-overlapping representations can reduce interference between tasks [1, 3, 23], can lead to the natural emergence of task-specific modules [17], and sparse connectivity can further ensure fewer task-specific parameters [28].
The method according to the invention provides an efficient mechanism for setting different levels of activation sparsity by varying the ratio of active neurons in the k-Winners-Take-All (k-WTA) activations [2], and constant sparsity in connections by setting a percentage of weights at random to zero at initialization. Sparsity in activations effectively reduces interference by reducing the overlap in representations. Furthermore, it allows having different levels of sparsity in different layers, which can further improve performance. As the earlier layers learn general features, having a higher ratio of active neurons there can enable higher reusability and forward transfer. For the later layers, a smaller ratio of active neurons can reduce the interference between task-specific features.
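Sparsity in connections may, by way of non-limiting example, be realized with a fixed random binary mask created at initialization; the following sketch assumes a PyTorch nn.Linear layer, and the mask would be re-applied after each update to keep connectivity constant:

```python
import torch
import torch.nn as nn

def make_sparse(layer: nn.Linear, sparsity: float) -> torch.Tensor:
    """Zero a random fraction (`sparsity` in [0, 1]) of the weights at
    initialization and return the binary mask for later re-application."""
    mask = (torch.rand_like(layer.weight) >= sparsity).float()
    with torch.no_grad():
        layer.weight.mul_(mask)  # prune the randomly selected connections
    return mask
```

After each optimizer step, multiplying the weights by the stored mask again keeps the pruned connections at zero throughout training.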
The context-dependent processing of information in conjunction with sparse activation patterns can effectively reduce the overlap of representations, which leads to less interference between the tasks and thereby less forgetting. To further encourage the model to learn non-overlapping representations, the method of the current invention employs Heterogeneous dropout [1]. During training, the frequency of activations for each neuron in a layer for a given task is tracked, and in the subsequent tasks, the probability of a neuron being dropped is set to be inversely proportional to its activation counts. This encourages the model to learn the new task by utilizing neurons that have been less active for previous tasks.
Concretely, denoting by $[a_l]_j$ the number of times neuron $j$ in layer $l$ has been activated during previous tasks, the probability of retaining this neuron can be set to

$$[p_l]_j = \exp\!\left(-\frac{[a_l]_j}{\max_i\, [a_l]_i}\,\rho\right)$$

where $\rho$ controls the strength of enforcement of non-overlapping representations, with larger values leading to less overlap. This provides us with an efficient mechanism for controlling the degree of overlap between the representations of different tasks, and hence the degree of forward transfer and interference, based on the task similarities. It also allows having a different dropout $\rho$ for each layer (with lower $\rho$ for earlier layers to encourage reusability and higher $\rho$ for later layers to reduce interference between task representations). Heterogeneous dropout provides a simple mechanism for balancing the reusability and interference of features depending on the similarity of tasks.
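A minimal sketch of heterogeneous dropout under the above formulation; the absence of the customary 1/p rescaling is a simplification of this sketch:

```python
import torch

def heterogeneous_dropout_probs(activation_counts: torch.Tensor,
                                rho: float) -> torch.Tensor:
    """Retention probabilities that are lower for neurons that fired often
    on previous tasks (larger rho -> less overlap between tasks)."""
    normalized = activation_counts / activation_counts.max().clamp(min=1e-8)
    return torch.exp(-rho * normalized)

def apply_heterogeneous_dropout(h: torch.Tensor,
                                retain_p: torch.Tensor) -> torch.Tensor:
    """During training, drop each neuron with probability (1 - retain_p)."""
    mask = torch.bernoulli(retain_p.expand_as(h))
    return h * mask
```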
For a biologically plausible ANN, it is important to not only incorporate the design elements of biological neurons, but also the learning mechanisms it employs. Lifetime plasticity in the brain generally follows the Hebbian principle: a neuron that consistently contributes to making another neuron fire will build a stronger connection to that neuron [21].
Therefore, the method of the current invention proposes to complement error-based learning with a Hebbian update to strengthen the connections between the contextual information and the winning dendritic segments. For each supervised parameter update with backpropagation, the winning segment is additionally updated following Oja's rule, which adds weight decay to the plain Hebbian update:

$$u_\kappa \leftarrow u_\kappa + \eta_h\, d\, (c - d\, u_\kappa)$$

where $\eta_h$ is the learning rate, $\kappa$ is the index of the winning dendritic segment with weights $u_\kappa$, and $d = u_\kappa^\top c$ is the modulating signal for context signal $c$.
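A non-limiting sketch of this Hebbian update with Oja's weight decay, applied after each backpropagation step; the per-neuron loop and tensor layout are illustrative:

```python
import torch

@torch.no_grad()
def hebbian_update(U: torch.Tensor, winners: torch.Tensor,
                   c: torch.Tensor, eta_h: float) -> None:
    """Oja-style Hebbian update on the winning dendritic segment of each neuron.

    U: dendritic weights (n_e, n_segments, d); winners: (n_e,) indices of
    the winning segment per neuron; c: context vector (d,).
    """
    for n in range(U.shape[0]):
        u = U[n, int(winners[n])]     # weights of the winning segment (view)
        d = torch.dot(u, c)           # modulating signal d = u^T c
        u += eta_h * d * (c - d * u)  # Hebbian term with Oja's weight decay
```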
In addition to their integrative properties, dendrites also play a key role in retaining information and providing protection from erasure [10, 43]. The new spines that are formed on different sets of dendritic branches in response to learning different tasks are protected from being eliminated through mediation in synaptic plasticity and structural changes which persist when learning a new task [43].
The method of the invention employs synaptic consolidation by incorporating Synaptic Intelligence, which maintains an importance estimate of each synapse in an online manner during training and subsequently reduces the plasticity of synapses which are considered important for the learned tasks. Notably, the method of the invention comprises the step of adjusting the importance estimate to account for the disparity in the degree to which updates to different parameters affect the layer's output, which arises because of the inhibitory interneuron architecture in DANN layers [11]. The importance estimates of the excitatory connections to the inhibitory units and of the intra-layer inhibitory connections are upscaled to further penalize changes to these weights.
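By way of non-limiting illustration, Synaptic Intelligence-style importance tracking with an optional upscaling factor for the inhibitory-related weights may be sketched as follows; the class interface and the per-parameter scale argument are assumptions of this sketch:

```python
import torch

class SynapticIntelligence:
    """Online per-parameter importance estimates with a quadratic penalty."""

    def __init__(self, params, xi: float = 0.1):
        self.params = list(params)
        self.xi = xi
        self.omega = [torch.zeros_like(p) for p in self.params]       # path integrals
        self.importance = [torch.zeros_like(p) for p in self.params]  # consolidated
        self.anchors = [p.detach().clone() for p in self.params]

    @torch.no_grad()
    def accumulate(self, prev_values):
        # Call after each optimizer step with the pre-step parameter values.
        for om, p, prev in zip(self.omega, self.params, prev_values):
            if p.grad is not None:
                om -= p.grad * (p - prev)  # contribution to loss reduction

    @torch.no_grad()
    def consolidate(self, scale=None):
        # At task end, turn path integrals into importances; `scale` can
        # upweight e.g. the W_ie / W_ei connections of DANN layers.
        for i, (om, p, anchor) in enumerate(
                zip(self.omega, self.params, self.anchors)):
            delta = p - anchor
            omega_n = om / (delta.pow(2) + self.xi)
            if scale is not None:
                omega_n = omega_n * scale[i]
            self.importance[i] += omega_n.clamp(min=0)
            om.zero_()
            self.anchors[i] = p.detach().clone()

    def penalty(self) -> torch.Tensor:
        # Surrogate loss that slows changes to synapses deemed important.
        return sum((imp * (p - anchor).pow(2)).sum()
                   for imp, p, anchor
                   in zip(self.importance, self.params, self.anchors))
```

In this sketch, the caller snapshots the parameters before each optimizer step, calls accumulate() afterwards, and adds penalty() (scaled by a regularization strength) to the training loss.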
Replay of past neural activation patterns in the brain is considered to play a critical role in memory formation, consolidation, and retrieval [30, 41]. The replay mechanism in the hippocampus has inspired a series of rehearsal-based approaches [4, 9, 26, 27] which have proven to be effective in challenging continual learning scenarios [12, 17]. Therefore, to replay samples from the previous tasks, the computer-implemented method according to the current invention comprises the step of utilizing a small episodic memory buffer which is maintained through reservoir sampling [40]. The method further comprises the step of approximately matching the distribution of the incoming stream by assigning equal probabilities to each new sample for being represented in the buffer. While training, samples from the current task, $(x_b, y_b) \sim D_\tau$, are interleaved with the memory buffer samples, $(x_m, y_m) \sim M$, to approximate the joint distribution of the tasks seen so far. Furthermore, to mimic the replay of activation patterns that accompanied the learning event in the brain, the output logits, $z_m$, are saved across the training trajectory and a consistency loss is enforced when replaying the samples from the episodic memory. Concretely, the loss is given by:
$$\mathcal{L} = \mathcal{L}_{cls}(f(x_b;\theta),\, y_b) + \alpha\, \mathcal{L}_{cls}(f(x_m;\theta),\, y_m) + \beta\, \big(f(x_m;\theta) - z_m\big)^2 \quad (7)$$
where $f(\cdot;\theta)$ is the model parameterized by $\theta$, $\mathcal{L}_{cls}$ is the standard cross-entropy loss, and $\alpha$ and $\beta$ control the strength of the interleaved training and the consistency constraint, respectively.
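A minimal sketch of the episodic memory maintained by reservoir sampling and of the loss of Equation (7); buffer entries are assumed to be (x, y, z) tuples holding the input, label, and stored logits, and F.mse_loss implements the consistency term as a mean over elements:

```python
import random
import torch
import torch.nn.functional as F

def reservoir_update(buffer: list, sample, max_size: int, n_seen: int) -> None:
    """Reservoir sampling: every sample in the stream gets an equal
    probability of residing in the fixed-size episodic memory.

    n_seen: 0-based index of the current sample within the stream.
    """
    if len(buffer) < max_size:
        buffer.append(sample)
    else:
        j = random.randint(0, n_seen)  # uniform over all samples seen so far
        if j < max_size:
            buffer[j] = sample

def replay_loss(model, xb, yb, xm, ym, zm, alpha: float, beta: float):
    """Eq. (7): task loss + interleaved replay loss + consistency to stored logits."""
    out_b, out_m = model(xb), model(xm)
    return (F.cross_entropy(out_b, yb)
            + alpha * F.cross_entropy(out_m, ym)
            + beta * F.mse_loss(out_m, zm))
```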
A computer-implemented method according to an embodiment of the present invention preferably comprises the step of incorporating the aforementioned aspects into a biologically plausible framework for CL, referred to as Bio-ANN. Table 1 shows that the different components complement each other and consistently improve the performance of the model. The empirical results suggest that employing multiple complementary components and learning mechanisms, like the brain, can be an effective approach to enable continual learning in ANNs.
Embodiments of the present invention can include every combination of features that are disclosed herein independently from each other.
Typical application areas of the invention include, but are not limited to:
Although the invention has been discussed in the foregoing with reference to an exemplary embodiment of the method of the invention, the invention is not restricted to this particular embodiment which can be varied in many ways without departing from the invention. The discussed exemplary embodiment shall therefore not be used to construe the appended claims strictly in accordance therewith. On the contrary the embodiment is merely intended to explain the wording of the appended claims without intent to limit the claims to this exemplary embodiment. The scope of protection of the invention shall therefore be construed in accordance with the appended claims only, wherein a possible ambiguity in the wording of the claims shall be resolved using this exemplary embodiment.
Variations and modifications of the present invention will be obvious to those skilled in the art and it is intended to cover in the appended claims all such modifications and equivalents. The entire disclosures of all references, applications, patents, and publications cited above are hereby incorporated by reference. Unless specifically stated as being “essential” above, none of the various components or the interrelationship thereof are essential to the operation of the invention. Rather, desirable results can be achieved by substituting various components and/or reconfiguration of their relationships with one another.
Optionally, embodiments of the present invention can include a general or specific purpose computer or distributed system programmed with computer software implementing steps described above, which computer software may be in any appropriate computer language, including but not limited to C++, FORTRAN, ALGOL, BASIC, Java, Python, Linux, assembly language, microcode, distributed programming languages, etc. The apparatus may also include a plurality of such computers/distributed systems (e.g., connected over the Internet and/or one or more intranets) in a variety of hardware implementations. For example, data processing can be performed by an appropriately programmed microprocessor, computing cloud, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), or the like, in conjunction with appropriate memory, network, and bus elements. One or more processors and/or microcontrollers can operate via instructions of the computer code and the software is preferably stored on one or more tangible non-transitory memory-storage devices.
Number | Date | Country | Kind
---|---|---|---
2032686 | Aug 2022 | NL | national