The present disclosure is directed at methods, systems, and techniques for kernel continual learning.
Dataset classification can be performed by computer-implemented machine learning models, such as artificial neural networks. For example, when a dataset comprises images of objects, a computer may be used to implement an object classifier configured to classify each of the objects using an artificial neural network by determining which of several different types of objects each of the depicted objects most closely resembles. The artificial neural network performs feature extraction and classification of the depicted objects.
A classifier can be used to perform multiple tasks, such as classifying different types of objects. Using the same classifier for multiple tasks can, in some circumstances, result in a problem known as “catastrophic forgetting”, which generally refers to the catastrophic loss of previously learned information. For example, this may include the tendency of an artificial neural network to forget past knowledge and previously learned information upon learning new information.
According to a first aspect, there is provided a method comprising: obtaining a dataset corresponding to a classification task; performing feature extraction on the dataset using an artificial neural network; and constructing a kernel using features extracted during the feature extraction for use in performing the classification task.
The dataset may be a current task dataset and the classification task may be a current classification task, and the method may further comprise selecting a coreset dataset from the current task dataset, wherein the feature extraction is performed on the coreset dataset, and wherein the kernel is constructed using the features extracted from the coreset dataset.
The method may further comprise performing the current classification task by applying the kernel to features extracted from the current task dataset.
The feature extraction may also be performed on elements of the current task dataset other than the coreset dataset, and performing the current classification task may comprise applying the kernel to features extracted from elements of the current task dataset other than the coreset.
The coreset dataset may be selected uniformly between existing classes of the current task dataset.
The dataset may be an input query dataset, and the method may further comprise: obtaining a task identifier that corresponds to the input query dataset; retrieving, using the task identifier, a coreset dataset corresponding to a classification task to be performed on the input query dataset, wherein the feature extraction is performed on the coreset dataset and on the input query dataset, and wherein the kernel is constructed using the features extracted from the coreset dataset; and classifying the input query dataset by applying the kernel to the features extracted from the input query dataset.
The dataset may comprise an image.
Constructing the kernel may comprise applying kernel ridge regression.
The artificial neural network may comprise at least one of a convolutional neural network and a multilayer perceptron.
The method may further comprise determining random Fourier features from the coreset dataset, and the kernel may be constructed using the random Fourier features.
The coreset dataset may be selected uniformly between existing classes of the input query dataset.
The feature extraction may be performed using a backbone network shared across multiple classification tasks.
According to another aspect, there is provided a system comprising: a processor; a non-transitory computer readable medium communicatively coupled to the processor and having stored thereon computer program code that is executable by the processor and that, when executed by the processor, causes the processor to perform the method of any of the foregoing aspects or suitable combinations thereof.
The system may also comprise a memory communicatively coupled to the processor for storing the coreset dataset.
According to another aspect, there is provided a non-transitory computer readable medium having stored thereon computer program code that is executable by a processor and that, when executed by the processor, causes the processor to perform the method of any of the foregoing aspects or suitable combinations thereof.
This summary does not necessarily describe the entire scope of all aspects. Other aspects, features and advantages will be apparent to those of ordinary skill in the art upon review of the following description of specific embodiments.
In the accompanying drawings, which illustrate one or more example embodiments:
Artificial intelligence agents are known to suffer from catastrophic forgetting when learning over non-stationary data distributions. Continual learning, also known as life-long learning, was introduced to deal with catastrophic forgetting. It refers to an agent able to continually learn to solve a sequence of non-stationary tasks by accommodating new information, while remaining able to complete past experienced tasks with minimal performance reduction. The fundamental challenge in continual learning is catastrophic forgetting, which is caused by the interference among tasks from heterogeneous data distributions.
Task interference is almost unavoidable when model parameters, like the feature extractor and the classifier, are shared by all tasks. At the same time, it is practically infeasible to keep separate sets of model parameters for each individual task when learning with an arbitrarily long sequence of tasks. Moreover, in deep neural networks knowledge tends to be shared and transferred across tasks more in the lower layers than in the higher layers. This motivates non-parametric classifiers that automatically avoid task interference without sharing any parameters across tasks. Kernel methods, which have proven to be a powerful technique in the machine learning toolbox, provide a well-suited tool due to their non-parametric nature. Kernels were shown to be effective in the scenarios of incremental and multi-task learning with support vector machines. Recently, they have also been demonstrated to be strong learners in tandem with deep neural networks, especially when learning from limited data. Inspired by the success of kernels in machine learning, in at least some example embodiments herein, there are provided methods and systems that decouple the feature extractor from the classifier and introduce task-specific classifiers based on kernels for continual learning.
“Kernel continual learning” is used herein to deal with catastrophic forgetting in continual learning. Specifically, non-parametric classifiers are learned based on kernel ridge regression. To do so, an episodic memory is deployed to store a subset of samples from the training data per task, the “coreset dataset” (hereinafter simply referred to as the “coreset”), and to learn the classifier based on kernel ridge regression. Using kernels in this fashion may, in at least some embodiments, be beneficial for several reasons. The direct interference of classifiers is naturally avoided as kernels are established in a non-parametric way per task and no classifier parameters are shared across tasks. Moreover, in contrast to conventional memory replay methods, kernel continual learning does not need to replay data from previous tasks for training the current task, which averts task interference while enabling more efficient optimization. In order to achieve adaptive kernels per task, random Fourier features are used to learn kernels in a data-driven manner. To be more specific, kernel continual learning with random Fourier features is formalized as a variational inference problem, where the random Fourier basis is treated as a latent variable and inferred from the coreset of each task. The variational inference formulation naturally induces a regularization term that encourages the model to learn adaptive kernels per task from the coreset only. Consequently, a more compact memory is achieved, which alleviates the storage overhead.
The technical problem solved by at least some embodiments of kernel continual learning herein is catastrophic forgetting in classifiers due to task interference. In continual learning for visual object recognition tasks, the classifier parameters of different tasks interfere with one another over the course of learning, causing knowledge of previous tasks to be forgotten. At least some embodiments of kernel continual learning herein are directed at non-parametric classifiers based on kernels. No classifier parameters are shared among tasks, thereby avoiding interference between classifiers. This enables kernel continual learning to continually solve recognition tasks while remaining able to solve previously learned tasks without a significant performance drop.
As described further below, experiments in accordance with at least some embodiments are performed on four benchmark datasets: Rotated MNIST, Permuted MNIST, Split CIFAR100 and miniImageNet. The results demonstrate the effectiveness of kernel continual learning.
Conventional methods differ in the way they deal with catastrophic forgetting; they are briefly reviewed below in terms of regularization, dynamic architectures, and experience replay.
Regularization methods determine the importance of each of a model's parameters per task, which is used to prevent important parameters from being updated for new tasks. For example, each weight's importance may be specified with the Fisher information matrix. Alternatively, parameter importance may be determined by gradient magnitude. These methods can be explored through the lens of Bayesian optimization. For instance, a regularization technique inspired by variational inference may be used to protect against forgetting. Bayesian or not, regularization methods address catastrophic forgetting by adding a regularization term to the main loss function. The penalty terms proposed in such methods are unable to prevent drift in the loss landscape of previous tasks. While alleviating forgetting, the penalty also limits the plasticity needed to absorb new information from future tasks learned over a long timescale.
Dynamic architectures allocate a subset of the model parameters per task. This is achieved by a gating mechanism, or by incrementally adding new parameters to the model. Incremental learning and pruning is another possibility. Given an over-parameterized model with the ability to learn quite a few tasks, model expansion can also be achieved by pruning the parameters that do not contribute to the performance of the current task, while keeping them available for future tasks. These methods are preferred when there is no memory usage constraint and final model performance is prioritized. They offer an effective way to avoid task interference and catastrophic forgetting, at the expense of suffering from potentially unbounded model expansion and of preventing positive knowledge transfer across tasks.
Experience replay methods assume it is possible to access data from previous tasks by having a fixed-size memory or a generative model able to produce samples from old tasks. A model may be augmented with a fixed-size memory, which accumulates samples in the proximity of each class center. Alternatively, another memory-based model may be implemented by exploiting a reservoir sampling strategy in the raw input data selection phase. Rather than storing the original samples, certain other models accumulate the parameter gradients during task learning. Certain other models incorporate a generative model into a continual learning model to alleviate catastrophic forgetting by producing samples from previous tasks and retraining the model using data from previous tasks and the current task. These methods assume an extra neural network, such as a generative model or a memory, is available; otherwise, they cannot be exploited. These replay-based methods benefit from a memory to retrain their model over previous tasks. In contrast, in at least some example embodiments, kernel continual learning only uses memory to store data as a task identifier proxy at inference time without the need of replay for training, which mitigates the optimization cost of memory-based methods.
In a traditional supervised learning setting, a model or agent f is learned to map input data from the input space to its target in the corresponding output space, f: 𝒳 → 𝒴, where samples (X, Y) ∈ 𝒳 × 𝒴 are assumed to be drawn from the same data distribution. In the case of an image classification problem, X are the images and Y are the associated class labels. Instead of solving a single task, continual learning aims to solve a sequence of different tasks, T1, T2, . . . , Tn, from non-stationary data distributions, where n stands for the number of tasks and each task is an individual classification problem. A continual learner is required to continually solve each of those tasks once trained on its labeled data, while remaining able to solve previous tasks with no or limited access to their data.
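To make the protocol concrete, the following minimal Python sketch iterates over a task sequence and records the accuracies a_{t,i} used by the evaluation metrics discussed further below. The helper names (train_on_task, evaluate) are illustrative placeholders and are not part of the disclosed system.

```python
from typing import Callable, List, Tuple

def continual_learning_protocol(
    tasks: List[Tuple[object, object]],          # [(train_data_t, test_data_t), ...]
    train_on_task: Callable[[object], None],     # updates the learner on one task
    evaluate: Callable[[object], float],         # returns accuracy on one test set
) -> List[List[float]]:
    """Train on tasks sequentially; after each task, evaluate on every task seen so far."""
    accuracy_matrix: List[List[float]] = []
    seen_test_sets: List[object] = []
    for train_data, test_data in tasks:
        train_on_task(train_data)                # only the current task's labeled data is available
        seen_test_sets.append(test_data)
        accuracy_matrix.append([evaluate(ts) for ts in seen_test_sets])
    return accuracy_matrix                       # row t, column i holds a_{t,i}
```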
Generally, a continual learning model based on a neural network can be regarded as a feature extractor hθ and a classifier fc. The feature extractor is a convolutional architecture before the last fully-connected layer that is shared across tasks. The classifier is the last fully-connected layer. In at least some embodiments herein, a task-specific, non-parametric classifier is implemented based on kernel ridge regression.
The model is trained on the current task t. Given its training data D_t, a subset of data is uniformly chosen between existing classes in the current task t, which is called the “coreset” and denoted as C_t = {(x_i, y_i)}_{i=1}^{N_C}. The task-specific classifier f_c is learned on the coreset by kernel ridge regression, which minimizes the objective:

ℒ_krr = Σ_{(x_i, y_i) ∈ C_t} ∥y_i − f_c(ψ(x_i))∥² + λ∥f_c∥²,  (1)

where λ is the weight decay parameter and ψ(·) denotes the feature maps produced by the feature extractor. Based on the Representer theorem:

f_c(ψ(x)) = Σ_{i=1}^{N_C} α_i k(x, x_i),  (2)

where k(·,·) is the kernel function. Then α_t can be calculated in closed form:

α_t = Y(λI + K)^{−1},  (3)

where α_t = [α_1, . . . , α_i, . . . , α_{N_C}], Y denotes the labels of the coreset samples, and K is the kernel matrix computed on the coreset, with K_{ij} = k(x_i, x_j) = ψ(x_i)ψ(x_j)^⊤.
To jointly learn the feature extractor h_θ, the total loss function is minimized over samples from the remaining set D_t\C_t:

ℒ_t = Σ_{(x′, y′) ∈ D_t\C_t} ℒ(y′, ỹ′),  (4)

Here ℒ(·) is the cross-entropy loss function and the predicted output ỹ′ is computed by

ỹ′ = f_{c_α}(ψ(x′)) = Softmax(α_t K̃),  (5)

where K̃ = ψ(X)ψ(x′)^⊤, ψ(X) denotes the feature maps of samples in the coreset, and Softmax(·) is the softmax function applied to the output of the kernel ridge regression.
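For illustration, the closed-form classifier of Eqs. (3) and (5) may be implemented along the following lines in PyTorch. The function name and the solve-based formulation (the transpose of Eq. (3), which is equivalent because K is symmetric) are choices made for this sketch rather than a definitive implementation.

```python
import torch

def kernel_ridge_classifier(psi_coreset, y_onehot, psi_query, lam=0.1):
    """Task-specific classifier via kernel ridge regression (cf. Eqs. (3) and (5)).

    psi_coreset: (N, D) feature maps psi(X) of the coreset samples
    y_onehot:    (N, C) one-hot labels Y of the coreset samples
    psi_query:   (M, D) feature maps psi(x') of query samples
    lam:         weight decay parameter lambda
    """
    n = psi_coreset.shape[0]
    K = psi_coreset @ psi_coreset.T                          # (N, N) kernel matrix on the coreset
    eye = torch.eye(n, dtype=K.dtype, device=K.device)
    # Solve (lam*I + K) alpha = Y, the transposed but equivalent form of Eq. (3).
    alpha = torch.linalg.solve(lam * eye + K, y_onehot)      # (N, C)
    K_tilde = psi_query @ psi_coreset.T                      # (M, N) kernel between queries and coreset
    return torch.softmax(K_tilde @ alpha, dim=-1)            # Eq. (5): softmax over the KRR output
```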
Any semi-positive definite kernel, e.g., a radial basis function (RBF) kernel or a dot product linear kernel, may be used to construct the classifier. In at least some example embodiments, random Fourier features are introduced to train data-driven kernels, which have previously demonstrated success in regular learning tasks. Data-driven kernels by random Fourier features provide an appealing technique to train strong classifiers with a relatively small memory footprint for continual learning based on episodic memory.
One of the key ingredients when finding a mapping function in non-parametric approaches, such as kernel ridge regression, is the kernel function. Translation-invariant kernels may be approximated using explicit feature maps; this approach is underpinned by Bochner's theorem, in which a continuous, real-valued, symmetric and shift-invariant function k(x, x′) = k(x − x′) on ℝ^d is a positive definite kernel if and only if it is the Fourier transform of a positive finite measure p(ω), such that:

k(x, x′) = ∫_{ℝ^d} p(ω) e^{iω^⊤(x − x′)} dω = E_ω[ζ_ω(x) ζ_ω(x′)*],  (6)

where ζ_ω(x) = e^{iω^⊤x}. With a sufficient number of samples ω drawn from p(ω), an unbiased estimate of k(x, x′) is ζ_ω(x) ζ_ω(x′)*.
Based on Eq. (6), D samples {ω_i}_{i=1}^D and {b_i}_{i=1}^D are drawn from a normal distribution and a uniform distribution over [0, 2π], respectively, and the random Fourier features (RFFs) are constructed for each data point x as:

ψ(x) = √(2/D) [cos(ω_1^⊤x + b_1), . . . , cos(ω_D^⊤x + b_D)],  (7)

Having the random Fourier features, the kernel matrix is approximated as k(x, x′) = ψ(x)ψ(x′)^⊤.
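A minimal sketch of this construction is given below, assuming the standard cosine form of random Fourier features with Gaussian frequencies and uniform phases; the dimensions shown are examples only.

```python
import math
import torch

def random_fourier_features(x, omega, b):
    """Map inputs x to D-dimensional random Fourier features (cf. Eq. (7))."""
    D = omega.shape[1]
    return math.sqrt(2.0 / D) * torch.cos(x @ omega + b)

# Usage: draw the bases once, then approximate k(x, x') as psi(x) psi(x')^T.
d, D = 256, 1024
omega = torch.randn(d, D)              # frequencies drawn from a normal distribution
b = 2 * math.pi * torch.rand(D)        # phases drawn uniformly from [0, 2*pi]
x = torch.randn(8, d)
psi = random_fourier_features(x, omega, b)   # shape (8, 1024)
```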
Traditionally, the shift-invariant kernel is constructed based on random Fourier features, where the Fourier basis is drawn from a Gaussian distribution transformed from a pre-defined kernel. This results in kernels that are agnostic to the task. In continual learning, however, tasks are provided sequentially from non-stationary data distributions, which makes it sub-optimal to share the same kernel function across tasks. To address this problem, task-specific kernels are trained in a data-driven manner. This is suitable for continual learning as it is desirable to train informative kernels by using a coreset of minimum size. This is formulated as a variational inference problem, where the random Fourier basis ω is treated as a latent variable.
From a probabilistic perspective, it is desirable to maximize the following conditional predictive log-likelihood for the current task t:

max Σ_{(x, y) ∈ D_t\C_t} log p(y | x, D_t\C_t),  (8)

which amounts to making maximally accurate predictions on x based on D_t\C_t.
Introducing the random Fourier basis ω in Eq. (8), treated as a latent variable, results in:

log p(y | x, D_t\C_t) = log ∫ p(y | x, ω) p(ω | D_t\C_t) dω.  (9)
Data is used to infer the distribution over the latent variable ω, whose prior is conditioned on the data. The data and ω are combined to generate kernels to classify x based on kernel ridge regression. An uninformative prior of a standard Gaussian distribution can be placed over the latent variable ω, as described further below in respect of the experiments.
It is intractable to directly solve for the true posterior p(ω | x, y, D_t\C_t) over ω; therefore a variational posterior q_φ(ω | C_t) is introduced and conditioned solely on the coreset C_t, because the coreset will be stored as episodic memory for the inference of each corresponding task.
By incorporating the variational posterior into Eq. (9) and applying Jensen's inequality, the evidence lower bound (ELBO) is established as follows:

log p(y | x, D_t\C_t) ≥ E_{q_φ(ω|C_t)}[log p(y | x, ω)] − D_KL(q_φ(ω|C_t) ∥ p_γ(ω|D_t\C_t)).  (10)
Therefore, maximizing the ELBO amounts to maximizing the conditional log-likelihood in Eq. (8).
In the continual learning setting, the model must be able to make predictions based solely on the coreset C_t that is stored in memory. That is, the conditional log-likelihood is conditioned on the coreset only. Based on the ELBO in Eq. (10), the following empirical objective function is established and minimized by the overall training procedure:

ℒ = Σ_{(x, y) ∈ D_t\C_t} [ −(1/L) Σ_{ℓ=1}^{L} log p(y | x, ω^{(ℓ)}) ] + D_KL(q_φ(ω|C_t) ∥ p_γ(ω|D_t\C_t)),  with ω^{(ℓ)} ∼ q_φ(ω|C_t),  (11)
where in the first term, the Monte Carlo method is used to draw samples from the variational posterior q_φ(ω|C_t) to estimate the log-likelihood, and L is the number of Monte Carlo samples. In the second term, the conditional prior serves as a regularizer that ensures the inferred random Fourier basis is always relevant to the current task. Minimizing the KL divergence encourages the distribution of random Fourier bases inferred from the coreset to be close to the one inferred from the training set. Moreover, the KL term enables generation of informative kernels adapted to each task while using a relatively small memory.
In practice, the conditional distributions q_φ(ω|C_t) and p_γ(ω|D_t\C_t) are assumed to be Gaussian and may be implemented using the amortization technique. That is, multilayer perceptrons are used to generate the distribution parameters, μ and σ, by taking the conditions as input. In the experiments, two separate amortization networks are deployed: the inference network f_φ for the variational posterior and the prior network f_γ for the prior. In addition, to demonstrate the effectiveness of data-driven kernels, a variant of variational random features is implemented by replacing the conditional prior in Eq. (11) with an uninformative one, i.e., an isotropic Gaussian distribution 𝒩(0, I). In this case, kernels are still learned in a data-driven way from the coreset, but without being regularized by the training data from the task.
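As one hedged illustration of the amortization technique, the inference and prior networks may be sketched as follows. The mean pooling over the conditioning set, the layer sizes, and the helper names are assumptions of this example rather than details fixed by the disclosure.

```python
import torch
import torch.nn as nn

class AmortizedGaussian(nn.Module):
    """Amortization network producing mu and sigma of a factorized Gaussian over omega.

    Sketch of an inference network f_phi (or prior network f_gamma): it consumes a set
    of features (coreset or remaining data) and outputs the Gaussian parameters of the
    (d x D)-dimensional random Fourier basis.
    """
    def __init__(self, in_dim, hidden_dim, d, D):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ELU(),
        )
        self.mu = nn.Linear(hidden_dim, d * D)
        self.log_sigma = nn.Linear(hidden_dim, d * D)
        self.d, self.D = d, D

    def forward(self, feats):                       # feats: (N, in_dim)
        h = self.net(feats).mean(dim=0)             # permutation-invariant pooling over the set
        mu = self.mu(h).view(self.d, self.D)
        sigma = self.log_sigma(h).view(self.d, self.D).exp()
        return mu, sigma

def sample_omega(mu, sigma):
    """Reparameterized sample omega = mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + sigma * torch.randn_like(sigma)

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for factorized Gaussians (the Eq. (11) regularizer)."""
    return (torch.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2) - 0.5).sum()
```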
Referring now to
The method 100 selects a representative subset of the current task's dataset at block 104. More particularly, at block 104, a subset of one of the datasets, Dt, obtained from the training database 102, is randomly and uniformly chosen. This subset of the dataset Dt is the coreset, Ct. The coreset is stored in memory 114 for subsequent use at inference time and is excluded from Dt; the resulting dataset is denoted as Dt\Ct herein.
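A simple sketch of such a per-class uniform selection is shown below; the function and variable names are illustrative.

```python
import random
from collections import defaultdict

def select_coreset(dataset, per_class):
    """Uniformly select `per_class` samples from each class to form the coreset Ct.

    dataset: iterable of (x, y) pairs for the current task Dt.
    Returns (coreset, remainder), where remainder plays the role of the remaining set.
    """
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append((x, y))
    coreset, remainder = [], []
    for y, items in by_class.items():
        random.shuffle(items)                  # uniform random choice within each class
        coreset.extend(items[:per_class])
        remainder.extend(items[per_class:])
    return coreset, remainder
```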
After selecting the coreset, the method 100 comprises performing feature extraction at block 106 on the coreset Ct and dataset Dt\Ct to respectively map features from the coreset Ct and dataset Dt\Ct to a feature space. An artificial neural network, such as a convolutional neural network or multilayer perceptron, may be used to perform this feature extraction.
Extracted features are mapped to a Hilbert space at block 108. Advantageously, and in at least some implementations, mapping features from a Cartesian space to a Hilbert space allows a kernel for a task to be computed in a more efficient manner. Also at block 108, random Fourier features are computed using the features extracted from the coreset Ct at block 106. At block 110, a task-specific kernel is determined using the random Fourier features determined at block 108 over Ct. At block 112, the features determined over Dt\Ct at block 106 are classified and their corresponding labels are predicted. After label prediction, the predicted labels are compared with the ground truth and the cross-entropy loss is determined; in this way, the performance of the method is evaluated and penalized accordingly. By backpropagating the loss, the feature extractor used at block 106 and/or the random feature generation performed at block 108 may be improved and ideally optimized. Thus, at a high level and in some implementations, the training phase includes receiving a task dataset for the current task; representing the current task using a representative dataset that is a subset of the task dataset (the coreset dataset); extracting representative features; and using the extracted features to perform random feature generation, with the generated random features used to compute the kernel and thereby construct a classifier, as sketched in the example below. Notably, for each task observed, a kernel is computed to represent that task and may be viewed in the Hilbert space.
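The example referred to above is sketched here: it combines the shared feature extractor, the random Fourier feature map, the closed-form kernel ridge regression classifier, and the cross-entropy loss into one training step. The helper names (backbone, rff_map) and the optimizer handling are assumptions of the sketch, not limitations of the method.

```python
import torch
import torch.nn.functional as F

def train_step_current_task(backbone, rff_map, coreset_x, coreset_y,
                            batch_x, batch_y, lam=0.1, optimizer=None):
    """One training step for the current task, tying blocks 106-112 together.

    backbone is the shared feature extractor h_theta; rff_map maps backbone features
    to random Fourier features psi(.). coreset_* hold the coreset Ct, batch_* a
    mini-batch from the remaining set. All names here are illustrative.
    """
    psi_core = rff_map(backbone(coreset_x))                  # feature extraction + RFFs on Ct (blocks 106/108)
    psi_batch = rff_map(backbone(batch_x))                   # feature extraction + RFFs on the remaining set
    n = psi_core.shape[0]
    K = psi_core @ psi_core.T                                # task-specific kernel (block 110)
    eye = torch.eye(n, dtype=K.dtype, device=K.device)
    y_onehot = F.one_hot(coreset_y).to(K.dtype)
    alpha = torch.linalg.solve(lam * eye + K, y_onehot)      # closed-form KRR coefficients
    logits = (psi_batch @ psi_core.T) @ alpha                # classify the remaining samples (block 112)
    loss = F.cross_entropy(logits, batch_y)                  # softmax + cross-entropy against ground truth
    if optimizer is not None:
        optimizer.zero_grad()
        loss.backward()                                      # backpropagate to improve h_theta and the RFF map
        optimizer.step()
    return loss
```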
Referring now to
Referring now to
Experiments are conducted on four benchmark datasets for continual learning. Ablation studies are performed to demonstrate the effectiveness of kernels for continual learning as well as the benefit of variational random features in learning data-driven kernels. Four different datasets are used: Permuted MNIST, Rotated MNIST, Split CIFAR100, and Split miniImageNet.
Permuted MNIST: Following Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences, 114(13):3521-3526, 2017, 20 different MNIST datasets are generated. Each dataset is created by a specific pixel permutation of the input images, without changing their corresponding labels. Each dataset has its own permutation, determined by its own random seed.
Rotated MNIST: Similar to Permuted MNIST, Rotated MNIST has 20 tasks, as in Mirzadeh, S. I., Farajtabar, M., Pascanu, R., and Ghasemzadeh, H., Understanding the role of training regimes in continual learning, arXiv preprint arXiv:2006.06958, 2020. Each task's dataset is a specific rotation of the original MNIST dataset (e.g., task 1, task 2, and task 3 are the original MNIST dataset, a 10-degree rotation, and a 20-degree rotation, respectively). Each task's dataset is accordingly a ten-degree rotation of the previous task's dataset.
Split CIFAR100: As described in Zenke, F., Poole, B., and Ganguli, S., Continual learning through synaptic intelligence, Proceedings of Machine Learning Research, 70:3987, 2017, this benchmark is generated by dividing the CIFAR100 dataset into 20 sections. Each section comprises 5 out of the 100 labels (without replacement) from CIFAR100. Hence, the benchmark contains 20 tasks, and each task is a 5-way classification problem.
Split miniImageNet: Similar to Split CIFAR100, the miniImageNet benchmark, as described in Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D., Matching Networks for One Shot Learning, arXiv:1606.04080v2 [cs.LG], 2017, contains 100 classes, a subset of the original ImageNet dataset described in Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A., and Fei-Fei, L., ImageNet Large Scale Visual Recognition Challenge, arXiv:1409.0575v3 [cs.CV]. It has 20 disjoint tasks, and each task contains 5 classes.
The “average accuracy” and “average forgetting” metrics are used to evaluate performance, as described below.
Average Accuracy: This score reports the model's accuracy after training on t consecutive tasks is finished. That is:

A_t = (1/t) Σ_{i=1}^{t} a_{t,i},

where a_{t,i} refers to the model's performance on task i after it has been trained on task t.
Average Forgetting: This metric measures the per-task decline in accuracy between the highest accuracy attained for that task and the final accuracy reached after model training is finished.
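For clarity, both metrics may be computed from the accuracy matrix produced by the earlier protocol sketch; the indexing convention (0-based rows, row t holding the accuracies recorded after training on task t+1) is an assumption of this example.

```python
def average_accuracy(acc_matrix, t):
    """A_t = (1/t) * sum_i a_{t,i}: mean accuracy over tasks 1..t after training on task t."""
    return sum(acc_matrix[t - 1][:t]) / t

def average_forgetting(acc_matrix):
    """Mean drop per task between its best accuracy ever reached and its final accuracy."""
    T = len(acc_matrix)
    drops = []
    for i in range(T - 1):                                  # the final task cannot yet be forgotten
        best = max(acc_matrix[t][i] for t in range(i, T))
        drops.append(best - acc_matrix[T - 1][i])
    return sum(drops) / len(drops) if drops else 0.0
```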
Taken together, the two metrics allow the assessment of how well a continual learner achieves its classification target while overcoming forgetting.
In at least some example embodiments, the system for kernel continual learning comprises three networks: a shared backbone hθ, a posterior network fϕ, and a prior network fγ. An overview of the system 1100 for kernel continual learning is depicted in
On the left of
In at least some embodiments, the system 1100 of
For the Permuted MNIST and Rotated MNIST benchmarks, hθ contains only two hidden layers, each with 256 neurons, followed by a ReLU activation function. For Split CIFAR100, a ResNet18 architecture similar to that of Mirzadeh, S. I., Farajtabar, M., Pascanu, R., and Ghasemzadeh, H., Understanding the role of training regimes in continual learning, arXiv preprint arXiv:2006.06958, 2020 is used, and for miniImageNet, a ResNet18 architecture similar to that of Chaudhry, A., Khan, N., Dokania, P., and Torr, P. H. S., Continual Learning in Low-rank Orthogonal Subspaces, arXiv:2010.11635v2 [cs.LG] is used. With regard to the fγ and fϕ networks, three hidden layers followed by an ELU activation function are used. The number of neurons in each layer depends on the benchmark: on Permuted MNIST and Rotated MNIST, there are 256 neurons per layer, while 160 and 512 neurons per layer are used for Split CIFAR100 and miniImageNet, respectively. To make fair comparisons, the model is trained for only one epoch per task, namely, each sample in the dataset is observed only once, and the batch size is set to 10. Other optimization settings such as weight decay, learning rate decay, and dropout are set to the same values as in Mirzadeh, S. I., Farajtabar, M., Pascanu, R., and Ghasemzadeh, H., Understanding the role of training regimes in continual learning, arXiv preprint arXiv:2006.06958, 2020. The model is implemented in PyTorch.
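As a hedged illustration only, the MNIST-benchmark backbone described above may look like the following; the input dimension of 784 assumes flattened 28x28 MNIST images, and the ResNet18 backbones used for the other benchmarks are not reproduced here.

```python
import torch.nn as nn

class MLPBackbone(nn.Module):
    """Two hidden layers of 256 units with ReLU, matching the MNIST-benchmark description."""
    def __init__(self, in_dim=784, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x.flatten(1))     # flatten 28x28 images to 784-dimensional vectors
```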
To demonstrate the effectiveness of kernels for continual learning, classifiers based on kernel ridge regression using commonly-used linear, polynomial, radial basis function (RBF) kernels, and the above-described variational random Fourier features are established. Results are reported on Split CIFAR100, where 5 different random seeds are sampled. For each random seed, the model is trained over different kernels. Finally, the result for each kernel is estimated by averaging over their corresponding random seeds. For fair comparison, all kernels are computed using the same coreset of size 20.
The results are shown in Table 1, below. All kernels perform well: the radial basis function (RBF) kernel obtains a modest average accuracy in comparison to the other basic kernels, such as the linear and polynomial kernels. The linear and polynomial kernels perform similarly. The kernels obtained from variational random features (VRF) achieve the best performance in comparison to the other kernels, and the data-driven VRF kernel works better than its uninformative counterpart. This emphasizes that the prior incorporated in VRF is more informative because it is data-driven.
Regarding VRF,
More particularly, the robustness of kernel continual learning when the number of tasks increases is considered in
To further demonstrate the memory benefit of data-driven kernel learning, variational random features are compared with a predefined RBF kernel in
Since kernel continual learning does not need to replay and only uses memory for inference, the coreset size plays a crucial role. Its influence is therefore ablated on Rotated MNIST, Permuted MNIST, and Split CIFAR100 by varying the coreset size over 1, 2, 5, 10, 20, 30, 40, and 50 samples. Here, the number of random bases is set to 1024 for Rotated MNIST and Permuted MNIST, and 2048 for Split CIFAR100. The results in
When approximating VRF kernels, the number of random Fourier bases is a hyperparameter. In principle, a larger number of random Fourier bases achieves a better approximation of the kernel, leading to better classification accuracy. Here its effect on continual learning accuracy is investigated. Results with different numbers of bases are shown in
Kernel continual learning is compared with alternative methods on four benchmarks. The accuracy and forgetting scores in Table 5, below, for Rotated MNIST, Permuted MNIST, and Split CIFAR100 are all adopted from Mirzadeh, S. I., Farajtabar, M., Pascanu, R., and Ghasemzadeh, H., Understanding the role of training regimes in continual learning, arXiv preprint arXiv:2006.06958, 2020, and results for miniImageNet are from Chaudhry, A., Khan, N., Dokania, P., and Torr, P. H. S., Continual Learning in Low-rank Orthogonal Subspaces, arXiv:2010.11635v2 [cs.LG]. The column “if” indicates whether a model utilizes a memory, and if so, the column “when” denotes whether the memory data are used at training time or at test time. Kernel continual learning achieves better performance in terms of both average accuracy and average forgetting. Moreover, as compared to memory-based methods such as A-GEM and ER-Reservoir, which replay over previous tasks (when=Train), kernel continual learning does not require replay, enabling kernel continual learning of at least some embodiments to be efficient during training time. Also, for the most challenging miniImageNet dataset, kernel continual learning performs better than the other methods, both in terms of accuracy and forgetting. In
As another example application, a kernel-based classifier as described herein may be applied to perform recognition of hand-written digits at different rotation angles. Each rotation angle corresponds to a task, and those tasks are analyzed sequentially. Once trained on the current task of a certain angle, the kernel-based classifier recognizes digits at the various angles on which it has previously been trained, without a need to retrain the model.
As described herein, kernel continual learning is a simple but effective variation of continual learning with kernel-based classifiers. To mitigate catastrophic forgetting, instead of using shared classifiers across tasks, task-specific classifiers are trained based on kernel ridge regression. Specifically, an episodic memory is used to store a subset of training samples for each task, which is referred to as the coreset. Kernel learning is formulated as a variational inference problem by treating random Fourier bases as the latent variable to be inferred from the coreset. By doing so, an adaptive kernel is generated for each task while requiring a relatively small memory size.
The processor used in the foregoing embodiments may comprise, for example, a processing unit (such as a processor, microprocessor, or programmable logic controller) or a microcontroller (which comprises both a processing unit and a non-transitory computer readable medium). Examples of computer readable media that are non-transitory include disc-based media such as CD-ROMs and DVDs, magnetic media such as hard drives and other forms of magnetic disk storage, semiconductor based media such as flash media, random access memory (including DRAM and SRAM), and read only memory. As an alternative to an implementation that relies on processor-executed computer program code, a hardware-based implementation may be used. For example, an application-specific integrated circuit (ASIC), field programmable gate array (FPGA), system-on-a-chip (SoC), or other suitable type of hardware implementation may be used as an alternative to or to supplement an implementation that relies primarily on a processor executing computer program code stored on a computer medium.
The embodiments have been described above with reference to flow, sequence, and block diagrams of methods, apparatuses, systems, and computer program products. In this regard, the depicted flow, sequence, and block diagrams illustrate the architecture, functionality, and operation of implementations of various embodiments. For instance, each block of the flow and block diagrams and operation in the sequence diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified action(s). In some alternative embodiments, the action(s) noted in that block or operation may occur out of the order noted in those figures. For example, two blocks or operations shown in succession may, in some embodiments, be executed substantially concurrently, or the blocks or operations may sometimes be executed in the reverse order, depending upon the functionality involved. Some specific examples of the foregoing have been noted above but those noted examples are not necessarily the only examples. Each block of the flow and block diagrams and operation of the sequence diagrams, and combinations of those blocks and operations, may be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. Accordingly, as used herein, the singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise (e.g., a reference in the claims to “a challenge” or “the challenge” does not exclude embodiments in which multiple challenges are used). It will be further understood that the terms “comprises” and “comprising”, when used in this specification, specify the presence of one or more stated features, integers, steps, operations, elements, and components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and groups. Directional terms such as “top”, “bottom”, “upwards”, “downwards”, “vertically”, and “laterally” are used in the following description for the purpose of providing relative reference only, and are not intended to suggest any limitations on how any article is to be positioned during use, or to be mounted in an assembly or relative to an environment. Additionally, the term “connect” and variants of it such as “connected”, “connects”, and “connecting” as used in this description are intended to include indirect and direct connections unless otherwise indicated. For example, if a first device is connected to a second device, that coupling may be through a direct connection or through an indirect connection via other devices and connections. Similarly, if the first device is communicatively connected to the second device, communication may be through a direct connection or through an indirect connection via other devices and connections. The term “and/or” as used herein in conjunction with a list means any one or more items from that list. For example, “A, B, and/or C” means “any one or more of A, B, and C”.
It is contemplated that any part of any aspect or embodiment discussed in this specification can be implemented or combined with any part of any other aspect or embodiment discussed in this specification.
The scope of the claims should not be limited by the embodiments set forth in the above examples, but should be given the broadest interpretation consistent with the description as a whole.
It should be recognized that features and aspects of the various examples provided above can be combined into further examples that also fall within the scope of the present disclosure. In addition, the figures are not to scale and may have size and shape exaggerated for illustrative purposes.