This application claims priority to and the benefit of Netherlands Patent Application No. 2032686, titled “A Computer-implemented Method and a System for a Biologically Plausible Framework for Continual Learning in Artificial Neural Network”, filed on Aug. 4, 2022, and the specification and claims thereof are incorporated herein by reference.
The invention relates to a computer-implemented method and a system for a Biologically Plausible Framework for continual learning in an artificial neural network.
Catastrophic forgetting is the tendency of an artificial neural network to completely and abruptly forget previously learned information upon learning new information. Continual Learning (also known as Incremental Learning or Life-long Learning) refers to learning a model for a large number of tasks sequentially without forgetting knowledge obtained from the preceding tasks, where the data of the old tasks are no longer available when training new ones.
The human brain excels at continually learning from a dynamically changing environment, whereas standard artificial neural networks (ANNs) are inherently designed for training on stationary data. The sequential learning of tasks in continual learning (CL) violates this assumption of stationarity, resulting in catastrophic forgetting. While ANNs are inspired by biological neurons [14], they omit numerous details of the design principles and learning mechanisms of the brain. These fundamental differences may account for the mismatch in performance and behavior.
The ability to continuously learn and adapt to an ever-changing environment is essential for any learning agent (deep neural network) deployed in the real world. For instance, an autonomous car needs to continually adapt to different road, weather, and lighting conditions, and learn new traffic signs and lane markings as it moves from one place to another.
Biological neural networks are characterized by considerably more complex synapses and dynamic, context-dependent processing of information, where each individual neuron has a specific role. Each presynaptic neuron has an exclusively excitatory or inhibitory impact on its postsynaptic partners, as postulated by Dale's principle [37]. Furthermore, the distal dendritic segments in pyramidal neurons, which account for most excitatory cells in the neocortex, receive additional context information and enable context-dependent processing of information. This, in conjunction with inhibition, allows the network to learn task-specific patterns and avoid catastrophic forgetting [5, 23, 42]. Additionally, the replay of sparse non-overlapping neural activities of past experiences in the neocortex and hippocampus is considered to play a critical role in memory formation, consolidation, and retrieval [30, 41]. To protect information from erasure, the brain employs synaptic consolidation, whereby the rates of plasticity are selectively decreased in proportion to strengthened synapses [10].
Standard ANNs, however, lack adherence to Dale's principle, as neurons contain both positive and negative output weights, and the signs can change while learning. Furthermore, standard ANNs are based on a point neuron model, which is an oversimplified model of biological computations and lacks the sophisticated nonlinear and context-dependent behavior of pyramidal cells. While studies have attempted to address these shortcomings individually, there is a lack of a biologically plausible framework which incorporates all of these components and enables studying the effect and interactions of different mechanisms inspired by the brain.
This application refers to and cites a number of published references. Discussion of such references is given for a more complete background and is not to be construed as an admission that such references are prior art for purposes of determining patentability.
It is an object of the current invention to correct the shortcomings of the prior art and to mitigate catastrophic forgetting in deep neural networks (DNNs), whereby the network forgets previously learned information when learning a new task; mitigating it requires a delicate balance between the stability (the ability to retain previous information) and the plasticity (the flexibility to learn new information) of the model. This and other objects, which will become apparent from the following disclosure, are provided with a computer-implemented method for general continual learning in artificial neural networks, a data processing system, and a computer-readable medium, having the features of one or more of the appended claims.
In biological neural networks, dendritic segments are tree-like extensions at the periphery of a neuron that help increase the surface area of the neuron body. These tiny protrusions receive information from other neurons and transmit electrical stimulation to the neuron body. They can integrate postsynaptic signals nonlinearly and filter out insignificant background information. Similarly, in an artificial neural network, dendritic segments of artificial neurons are elements to funnel weighted synaptic inputs to the artificial neurons. Accordingly, they have the potential to mimic the integrative properties of their biological counterparts.
In one embodiment of the present invention, a computer-implemented method for learning in an artificial neural network comprises the step of providing a network comprising a plurality of layers, wherein each layer comprises a population of exclusively excitatory neurons and a population of exclusively inhibitory neurons, wherein the population of exclusively excitatory neurons is larger than the population of exclusively inhibitory neurons, and wherein all synaptic weights of said network are exclusively positive, i.e., the signs of the output weights of said neurons do not change while learning. In this method of the invention, which is applied for general continual learning in an artificial neural network, the method comprises the steps of:
Furthermore, the method comprises the step of providing excitatory connections between the layers, excitatory projections to the inhibitory neurons, and inhibitory projections within the layers, as synaptic weights of the network.
These features help avoid catastrophic forgetting and provide a biologically plausible framework where, as in biological networks, the feedforward neurons adhere to Dale's principle and the excitatory neurons mimic the integrative properties of active dendrites for context-dependent processing of stimuli.
To enable context-dependent processing of information, one instantiation of the context signal to the dendrites needs to be evaluated; therefore, the method comprises the step of evaluating a prototype vector for each task by taking the element-wise mean of the task's samples at the beginning of the task, and providing said prototype vector as context to the dendritic segments during training.
Alternatively, the method comprises the steps of providing a learnable context network, and evaluating the context signal provided to the dendritic segments using said learnable context network.
The learnable context network can be a Multi-Layer Perceptron (MLP) or a convolutional neural network (ConvNet) and it has the advantage of being able to provide different signals as context to the dendritic segments depending on the task to be solved.
Furthermore, the method comprises the step of selecting, during inference, the closest prototype vector to each test sample as the context vector using Euclidean distance among all task prototypes stored in memory.
To provide an efficient mechanism for controlling the sparsity in activations, the method comprises the step of applying a k-Winners-Take-All function to the activations modulated by the dendritic segment with the highest response to the context vector.
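By way of non-limiting illustration, the k-Winners-Take-All activation may be sketched as follows in Python with the PyTorch library; the function name and the handling of ties at the threshold are illustrative assumptions of this sketch, not part of the claimed method:

```python
import torch

def k_winners_take_all(x: torch.Tensor, k: int) -> torch.Tensor:
    """Propagate the k largest activations per sample; set the rest to zero.

    x: activations of shape (batch, n_units); k: number of winners to keep.
    """
    topk_values, _ = torch.topk(x, k, dim=1)
    threshold = topk_values[:, -1:]        # k-th largest value per sample
    mask = (x >= threshold).float()        # ties at the threshold are all kept
    return x * mask
```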
Additionally, the method comprises the step of maintaining a constant sparsity in connections by randomly setting a percentage of weights to zero at initialization, wherein said percentage of weights is between 0 and 100%.
The context-dependent processing of information in conjunction with sparse activation patterns can effectively reduce the overlap of representations which leads to less interference between the tasks and thereby less forgetting. Therefore, the method comprises the steps of:
These features encourage the model to learn the new task by utilizing neurons that have been less active for previous tasks.
For a biologically plausible ANN, it is important to not only incorporate the design elements of biological neurons, but also the learning mechanisms it employs. Lifetime plasticity in the brain generally follows the Hebbian principle: a neuron that consistently contributes to making another neuron fire will build a stronger connection to that neuron. Therefore, the method of the current invention comprises the step of strengthening connections between a context input and a dendritic segment corresponding to said context input, by applying a Hebbian update on said dendritic segments for each supervised parameter update with backpropagation.
Advantageously, the method comprises the step of using Oja's rule for adding weight decay to the Hebbian update.
Additionally, the method comprises the step of employing synaptic consolidation comprising the steps of:
In addition to their integrative properties, dendrites also play a key role in retaining information and providing protection from erasure. The new spines that are formed on different sets of dendritic branches in response to learning different tasks are protected from being eliminated through mediation in synaptic plasticity and structural changes which persist when learning a new task. Hence, the method comprises the step of adjusting an importance estimate of each synapse to account for disparities, caused by the population of inhibitory neurons, in the degree to which updates to different parameters affect an output of a layer.
Additionally, the method comprises the steps of:
The replay mechanism in the hippocampus has inspired a series of rehearsal-based approaches which have proven to be effective in challenging continual learning scenarios. Therefore, to replay samples from the previous tasks, the method comprises the step of maintaining an episodic memory buffer by using reservoir sampling.
Suitably, the method comprises the step of matching a distribution of an incoming stream by assigning to each new sample equal probabilities for being represented in the episodic memory buffer.
More suitably, the method comprises the steps of:
In a second embodiment of the invention, a computer-readable medium is provided with a computer program, wherein, when said computer program is loaded and executed by a computer, it causes the computer to carry out the steps of the computer-implemented method according to any one of the aforementioned steps.
In a third embodiment of the invention, a data processing system comprises a computer loaded with a computer program, wherein said program is arranged for causing the computer to carry out the steps of the computer-implemented method according to any one of the aforementioned steps.
Objects, advantages and novel features, and further scope of applicability of the present invention will be set forth in part in the detailed description to follow, taken in conjunction with the accompanying drawings, and in part will become apparent to those skilled in the art upon examination of the following, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more embodiments of the present invention and, together with the description, serve to explain the principles of the invention. The drawings are only for the purpose of illustrating one or more embodiments of the invention and are not to be construed as limiting the invention. In the drawings:
Whenever in the figures the same reference numerals are applied, these numerals refer to the same parts.
Biological neural networks differ from their artificial counterparts in the complexity of the synapses and the role of individual units. Notably, most neurons in the brain adhere to Dale's principle, which posits that presynaptic neurons can only have an exclusively excitatory or exclusively inhibitory impact on their postsynaptic partners [37]. Several studies show that the balanced dynamics [32, 39] of excitatory and inhibitory populations provide functional advantages, including efficient predictive coding [8] and pattern learning [22]. Furthermore, inhibition is hypothesized to play a role in alleviating catastrophic forgetting [5]. Standard ANNs, however, lack adherence to Dale's principle as neurons contain both positive and negative output weights, and the signs can change while learning.
Cornford et al. incorporate Dale's principle into ANNs (referred to as DANNs), which take into account the distinct connectivity patterns of the excitatory and inhibitory neurons [11] and perform comparably to standard ANNs in benchmark object recognition tasks. Each layer $l$ comprises a separate population of excitatory, $h_e^l \in \mathbb{R}_+^{n_e}$, and inhibitory, $h_i^l \in \mathbb{R}_+^{n_i}$, neurons, where $n_e \gg n_i$ and synaptic weights are strictly non-negative. Similar to biological networks, while both populations receive excitatory projections from the previous layer ($h_e^{l-1}$), only the excitatory neurons project between layers, whereas the inhibitory neurons inhibit the activity of the excitatory units of the same layer. Cornford et al. characterized these properties by three sets of strictly positive weights: the excitatory connections between layers, $W_{ee}^l$; the excitatory projections to the inhibitory units, $W_{ie}^l$; and the inhibitory projections within the layer, $W_{ei}^l$.
The output of the excitatory units is impacted by the subtractive inhibition from the inhibitory units:
$$z^l = (W_{ee}^l - W_{ei}^l W_{ie}^l)\, h_e^{l-1} + b^l \quad (1)$$

where $b^l \in \mathbb{R}^{n_e}$ is the bias term.
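By way of non-limiting illustration, a layer according to Equation (1) may be sketched in Python with the PyTorch library as follows; the initialization scheme and the clamping used to keep weights non-negative are simplifying assumptions of this sketch, and the exact parameterization of Cornford et al. [11] may differ:

```python
import torch
import torch.nn as nn

class DaleLayer(nn.Module):
    """Feedforward layer with separate excitatory/inhibitory populations (Eq. 1).

    All weights are constrained to be non-negative; inhibition is subtractive.
    """
    def __init__(self, n_in: int, n_e: int, n_i: int):
        super().__init__()
        self.W_ee = nn.Parameter(torch.rand(n_e, n_in) * 0.1)  # excitatory -> excitatory
        self.W_ie = nn.Parameter(torch.rand(n_i, n_in) * 0.1)  # excitatory -> inhibitory
        self.W_ei = nn.Parameter(torch.rand(n_e, n_i) * 0.1)   # inhibitory -> excitatory
        self.bias = nn.Parameter(torch.zeros(n_e))

    def forward(self, h_e: torch.Tensor) -> torch.Tensor:
        # z^l = (W_ee - W_ei W_ie) h_e^{l-1} + b^l  (subtractive inhibition)
        effective_w = self.W_ee - self.W_ei @ self.W_ie
        return h_e @ effective_w.t() + self.bias

    @torch.no_grad()
    def clamp_weights(self) -> None:
        # Enforce Dale's principle: weight signs never change during learning.
        for w in (self.W_ee, self.W_ie, self.W_ei):
            w.clamp_(min=0.0)
```

In this sketch, clamp_weights() would be called after every optimizer step so that all synaptic weights remain exclusively positive.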
The method of the current invention employs DANNs as the feedforward neurons, which perform comparably to standard ANNs in the challenging CL setting, and provides a biologically plausible framework for further studying the role of inhibition in alleviating catastrophic forgetting.
The brain employs specific structures and mechanisms for context-dependent processing and routing of information. The prefrontal cortex, which plays an important role in cognitive control [31], receives sensory inputs as well as contextual information, which enables it to select the sensory features most relevant to the present task to guide actions [15, 29, 36, 44]. Of particular interest are the pyramidal cells, which represent the most populous members of the excitatory family of neurons in the brain [7]. The dendritic spines in pyramidal cells exhibit highly non-linear integrative properties which are considered important for learning task-specific patterns [42]. Pyramidal cells integrate a range of diverse inputs on multiple independent dendritic segments, whereby contextual inputs on active dendrites can modulate a neuron's response, making it more likely to fire. Standard ANNs, however, are based on a point neuron model, which is an oversimplified model of biological computations and lacks the sophisticated nonlinear and context-dependent behavior of pyramidal cells. Iyer et al. model these integrative properties of dendrites by augmenting each neuron with a set of dendritic segments. Multiple dendritic segments receive additional contextual information, which is processed using a separate set of weights. The resultant dendritic output modulates the feedforward activation, which is computed by a linear weighted sum of the feedforward inputs. This computation results in a neuron whose magnitude of response to a given stimulus is highly context-dependent. To enable task-specific processing of information, the prototype vector $c_\tau$ for task $\tau$ is evaluated by taking the element-wise mean of the task samples, $D_\tau$, at the beginning of the task, and this prototype vector is subsequently provided as context during training.
During inference, the closest prototype vector to each test sample, x′, is selected as the context using Euclidean distance among all the task prototypes, C, stored in memory.
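A minimal sketch of the prototype-based context in Python with the PyTorch library, assuming inputs are flattened to vectors (the function names are illustrative):

```python
import torch

def task_prototype(task_samples: torch.Tensor) -> torch.Tensor:
    """Context vector for a task: element-wise mean of its (flattened) samples."""
    return task_samples.flatten(start_dim=1).mean(dim=0)

def select_context(x: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """At inference, task identity is unknown: pick, for each test sample,
    the stored prototype closest in Euclidean distance."""
    dists = torch.cdist(x.flatten(start_dim=1), prototypes)  # (batch, n_tasks)
    nearest = dists.argmin(dim=1)
    return prototypes[nearest]
```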
The method of the current invention comprises the step of augmenting the excitatory units in each layer with dendritic segments, whereby the feedforward activation is modulated by the dendritic segment that responds most strongly to the context vector:

$$h_e^l = \text{k-WTA}\big(z^l \times \sigma(\max_j u_j^\top c)\big)$$

where $\sigma$ is the sigmoid function and k-WTA(.) is the k-Winners-Take-All activation function [2], which propagates only the top $k$ neurons and sets the rest to zero. This provides a biologically plausible framework where, as in biological networks, the feedforward neurons adhere to Dale's principle and the excitatory neurons mimic the integrative properties of active dendrites for context-dependent processing of stimuli.
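A non-limiting sketch of this dendritic modulation, reusing the k_winners_take_all helper sketched earlier; the tensor layout of the dendritic weights U is an assumption of this sketch:

```python
import torch

def dendritic_modulation(z: torch.Tensor, U: torch.Tensor,
                         c: torch.Tensor, k: int) -> torch.Tensor:
    """Modulate feedforward excitatory activations with dendritic segments.

    z: feedforward activations, shape (batch, n_e)
    U: dendritic weights, shape (n_e, n_segments, d_context)
    c: context vectors, shape (batch, d_context)
    """
    # Response of every dendritic segment to the context signal.
    responses = torch.einsum('nsd,bd->bns', U, c)   # (batch, n_e, n_segments)
    strongest = responses.max(dim=2).values         # winning segment per neuron
    modulated = z * torch.sigmoid(strongest)        # gate the feedforward output
    return k_winners_take_all(modulated, k)         # sparsify the activations
```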
Neocortical circuits are characterized by high levels of sparsity in neural connectivity and activations [6, 16]. This is in stark contrast to the dense and highly entangled connectivity in the standard ANNs. Particularly for continual learning, sparsity provides several advantages: sparse non-overlapping representations can reduce interference between tasks [1, 3, 23], can lead to the natural emergence of task-specific modules [17], and sparse connectivity can further ensure fewer task-specific parameters [28].
The method according to the invention provides an efficient mechanism for setting different levels of activation sparsity by varying the ratio of active neurons in the k-Winners-Take-All (k-WTA) activations [2], and constant sparsity in connections by setting a percentage of weights at random to zero at initialization. Sparsity in activations effectively reduces interference by reducing the overlap in representations. Furthermore, it allows having different levels of sparsity in different layers, which can further improve performance. As the earlier layers learn general features, having a higher ratio of active neurons there can enable higher reusability and forward transfer. For the later layers, a smaller ratio of active neurons can reduce the interference between task-specific features.
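Sparsity in connections may, by way of non-limiting example, be realized with a fixed random binary mask created at initialization; the following sketch assumes a PyTorch nn.Linear layer, and the mask would be re-applied after each update to keep connectivity constant:

```python
import torch
import torch.nn as nn

def make_sparse(layer: nn.Linear, sparsity: float) -> torch.Tensor:
    """Zero a random fraction (`sparsity` in [0, 1]) of the weights at
    initialization and return the binary mask for later re-application."""
    mask = (torch.rand_like(layer.weight) >= sparsity).float()
    with torch.no_grad():
        layer.weight.mul_(mask)  # prune the randomly selected connections
    return mask
```

After each optimizer step, multiplying the weights by the stored mask again keeps the pruned connections at zero throughout training.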
The context-dependent processing of information in conjunction with sparse activation patterns can effectively reduce the overlap of representations, which leads to less interference between the tasks and thereby less forgetting. To further encourage the model to learn non-overlapping representations, the method of the current invention employs Heterogeneous dropout [1]. During training, the frequency of activations for each neuron in a layer for a given task is tracked, and in the subsequent tasks, the probability of a neuron being dropped is set to be inversely proportional to its activation counts. This encourages the model to learn the new task by utilizing neurons that have been less active for previous tasks.
Concretely, denoting by $[a_l]_j$ the number of times neuron $j$ in layer $l$ has been activated during previous tasks, the probability of retaining this neuron can be set to

$$[p_l]_j = \exp\!\left(-\frac{[a_l]_j}{\max_i\, [a_l]_i}\,\rho\right)$$

where $\rho$ controls the strength of enforcement of non-overlapping representations, with larger values leading to less overlap. This provides us with an efficient mechanism for controlling the degree of overlap between the representations of different tasks, and hence the degree of forward transfer and interference, based on the task similarities. It also allows having a different dropout $\rho$ for each layer (with lower $\rho$ for earlier layers to encourage reusability and higher $\rho$ for later layers to reduce interference between task representations). Heterogeneous dropout provides a simple mechanism for balancing the reusability and interference of features depending on the similarity of tasks.
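A minimal sketch of heterogeneous dropout under the above formulation; the absence of the customary 1/p rescaling is a simplification of this sketch:

```python
import torch

def heterogeneous_dropout_probs(activation_counts: torch.Tensor,
                                rho: float) -> torch.Tensor:
    """Retention probabilities that are lower for neurons that fired often
    on previous tasks (larger rho -> less overlap between tasks)."""
    normalized = activation_counts / activation_counts.max().clamp(min=1e-8)
    return torch.exp(-rho * normalized)

def apply_heterogeneous_dropout(h: torch.Tensor,
                                retain_p: torch.Tensor) -> torch.Tensor:
    """During training, drop each neuron with probability (1 - retain_p)."""
    mask = torch.bernoulli(retain_p.expand_as(h))
    return h * mask
```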
For a biologically plausible ANN, it is important to not only incorporate the design elements of biological neurons, but also the learning mechanisms it employs. Lifetime plasticity in the brain generally follows the Hebbian principle: a neuron that consistently contributes to making another neuron fire will build a stronger connection to that neuron [21].
Therefore, the method of the current invention proposes to complement error-based learning with a Hebbian update to strengthen the connections between the contextual information and the winning dendritic segments. For each supervised parameter update with backpropagation, the winning segment is additionally updated following Oja's rule, which adds weight decay to the plain Hebbian update:

$$u_\kappa \leftarrow u_\kappa + \eta_h\, d\, (c - d\, u_\kappa)$$

where $\eta_h$ is the learning rate, $\kappa$ is the index of the winning dendritic segment with weights $u_\kappa$, and $d = u_\kappa^\top c$ is the modulating signal for context signal $c$.
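A non-limiting sketch of this Hebbian update with Oja's weight decay, applied after each backpropagation step; the per-neuron loop and tensor layout are illustrative:

```python
import torch

@torch.no_grad()
def hebbian_update(U: torch.Tensor, winners: torch.Tensor,
                   c: torch.Tensor, eta_h: float) -> None:
    """Oja-style Hebbian update on the winning dendritic segment of each neuron.

    U: dendritic weights (n_e, n_segments, d); winners: (n_e,) indices of
    the winning segment per neuron; c: context vector (d,).
    """
    for n in range(U.shape[0]):
        u = U[n, int(winners[n])]     # weights of the winning segment (view)
        d = torch.dot(u, c)           # modulating signal d = u^T c
        u += eta_h * d * (c - d * u)  # Hebbian term with Oja's weight decay
```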
In addition to their integrative properties, dendrites also play a key role in retaining information and providing protection from erasure [10, 43]. The new spines that are formed on different sets of dendritic branches in response to learning different tasks are protected from being eliminated through mediation in synaptic plasticity and structural changes which persist when learning a new task [43].
The method of the invention employs synaptic consolidation by incorporating Synaptic Intelligence, which maintains an importance estimate of each synapse in an online manner during training and subsequently reduces the plasticity of synapses which are considered important for the learned tasks. Notably, the method of the invention comprises the step of adjusting the importance estimate to account for the disparity in the degree to which updates to different parameters affect the layer's output, which arises because of the inhibitory interneuron architecture in DANN layers [11]. The importance estimates of the excitatory connections to the inhibitory units and of the intra-layer inhibitory connections are upscaled to further penalize changes to these weights.
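By way of non-limiting illustration, Synaptic Intelligence-style importance tracking with an optional upscaling factor for the inhibitory-related weights may be sketched as follows; the class interface and the per-parameter scale argument are assumptions of this sketch:

```python
import torch

class SynapticIntelligence:
    """Online per-parameter importance estimates with a quadratic penalty."""

    def __init__(self, params, xi: float = 0.1):
        self.params = list(params)
        self.xi = xi
        self.omega = [torch.zeros_like(p) for p in self.params]       # path integrals
        self.importance = [torch.zeros_like(p) for p in self.params]  # consolidated
        self.anchors = [p.detach().clone() for p in self.params]

    @torch.no_grad()
    def accumulate(self, prev_values):
        # Call after each optimizer step with the pre-step parameter values.
        for om, p, prev in zip(self.omega, self.params, prev_values):
            if p.grad is not None:
                om -= p.grad * (p - prev)  # contribution to loss reduction

    @torch.no_grad()
    def consolidate(self, scale=None):
        # At task end, turn path integrals into importances; `scale` can
        # upweight e.g. the W_ie / W_ei connections of DANN layers.
        for i, (om, p, anchor) in enumerate(
                zip(self.omega, self.params, self.anchors)):
            delta = p - anchor
            omega_n = om / (delta.pow(2) + self.xi)
            if scale is not None:
                omega_n = omega_n * scale[i]
            self.importance[i] += omega_n.clamp(min=0)
            om.zero_()
            self.anchors[i] = p.detach().clone()

    def penalty(self) -> torch.Tensor:
        # Surrogate loss that slows changes to synapses deemed important.
        return sum((imp * (p - anchor).pow(2)).sum()
                   for imp, p, anchor
                   in zip(self.importance, self.params, self.anchors))
```

In this sketch, the caller snapshots the parameters before each optimizer step, calls accumulate() afterwards, and adds penalty() (scaled by a regularization strength) to the training loss.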
Replay of past neural activation patterns in the brain is considered to play a critical role in memory formation, consolidation, and retrieval [30, 41]. The replay mechanism in the hippocampus has inspired a series of rehearsal-based approaches [4, 9, 26, 27] which have proven to be effective in challenging continual learning scenarios [12, 17]. Therefore, to replay samples from the previous tasks, the computer-implemented method according to the current invention comprises the step of utilizing a small episodic memory buffer which is maintained through reservoir sampling [40]. The method further comprises the step of approximately matching the distribution of the incoming stream by assigning equal probabilities to each new sample for being represented in the buffer. While training, samples from the current task, $(x_b, y_b) \sim D_\tau$, are interleaved with the memory buffer samples, $(x_m, y_m) \sim M$, to approximate the joint distribution of the tasks seen so far. Furthermore, to mimic the replay of activation patterns that accompanied the learning event in the brain, the output logits, $z_m$, are saved across the training trajectory and a consistency loss is enforced when replaying the samples from the episodic memory. Concretely, the loss is given by:
$$\mathcal{L} = \mathcal{L}_{cls}(f(x_b;\theta),\, y_b) + \alpha\, \mathcal{L}_{cls}(f(x_m;\theta),\, y_m) + \beta\, \big(f(x_m;\theta) - z_m\big)^2 \quad (7)$$
where $f(\cdot;\theta)$ is the model parameterized by $\theta$, $\mathcal{L}_{cls}$ is the standard cross-entropy loss, and $\alpha$ and $\beta$ control the strength of the interleaved training and the consistency constraint, respectively.
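A minimal sketch of the episodic memory maintained by reservoir sampling and of the loss of Equation (7); buffer entries are assumed to be (x, y, z) tuples holding the input, label, and stored logits, and F.mse_loss implements the consistency term as a mean over elements:

```python
import random
import torch
import torch.nn.functional as F

def reservoir_update(buffer: list, sample, max_size: int, n_seen: int) -> None:
    """Reservoir sampling: every sample in the stream gets an equal
    probability of residing in the fixed-size episodic memory.

    n_seen: 0-based index of the current sample within the stream.
    """
    if len(buffer) < max_size:
        buffer.append(sample)
    else:
        j = random.randint(0, n_seen)  # uniform over all samples seen so far
        if j < max_size:
            buffer[j] = sample

def replay_loss(model, xb, yb, xm, ym, zm, alpha: float, beta: float):
    """Eq. (7): task loss + interleaved replay loss + consistency to stored logits."""
    out_b, out_m = model(xb), model(xm)
    return (F.cross_entropy(out_b, yb)
            + alpha * F.cross_entropy(out_m, ym)
            + beta * F.mse_loss(out_m, zm))
```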
A computer-implemented method according to an embodiment of the present invention preferably comprises the step of incorporating the aforementioned aspects into a biologically plausible framework for CL, referred to as Bio-ANN. Table 1 shows that the different components complement each other and consistently improve the performance of the model. The empirical results suggest that employing multiple complementary components and learning mechanisms, like the brain, can be an effective approach to enable continual learning in ANNs.
Embodiments of the present invention can include every combination of features that are disclosed herein independently from each other.
Typical application areas of the invention include, but are not limited to:
Although the invention has been discussed in the foregoing with reference to an exemplary embodiment of the method of the invention, the invention is not restricted to this particular embodiment which can be varied in many ways without departing from the invention. The discussed exemplary embodiment shall therefore not be used to construe the appended claims strictly in accordance therewith. On the contrary the embodiment is merely intended to explain the wording of the appended claims without intent to limit the claims to this exemplary embodiment. The scope of protection of the invention shall therefore be construed in accordance with the appended claims only, wherein a possible ambiguity in the wording of the claims shall be resolved using this exemplary embodiment.
Variations and modifications of the present invention will be obvious to those skilled in the art and it is intended to cover in the appended claims all such modifications and equivalents. The entire disclosures of all references, applications, patents, and publications cited above are hereby incorporated by reference. Unless specifically stated as being “essential” above, none of the various components or the interrelationship thereof are essential to the operation of the invention. Rather, desirable results can be achieved by substituting various components and/or reconfiguration of their relationships with one another.
Optionally, embodiments of the present invention can include a general or specific purpose computer or distributed system programmed with computer software implementing steps described above, which computer software may be in any appropriate computer language, including but not limited to C++, FORTRAN, ALGOL, BASIC, Java, Python, Linux, assembly language, microcode, distributed programming languages, etc. The apparatus may also include a plurality of such computers/distributed systems (e.g., connected over the Internet and/or one or more intranets) in a variety of hardware implementations. For example, data processing can be performed by an appropriately programmed microprocessor, computing cloud, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), or the like, in conjunction with appropriate memory, network, and bus elements. One or more processors and/or microcontrollers can operate via instructions of the computer code and the software is preferably stored on one or more tangible non-transitory memory-storage devices.
Number | Date | Country | Kind
---|---|---|---
2032686 | Aug 2022 | NL | national