The invention relates to the field of deep neural networks, more particularly to a self-supervision-based continual learning (CL) approach for mitigating catastrophic forgetting.
Autonomous agents interacting with the real world are exposed to continuous streams of information and are thus required to learn and remember multiple tasks sequentially. Continual Learning (CL) [17, 18] deals with learning from a continuous stream of data with the goal of gradually extending the acquired knowledge to solve multiple tasks sequentially. CL is also referred to as lifelong learning, sequential learning or incremental learning [17]. CL over continuous streams of data remains one of the long-standing challenges in deep neural networks (DNNs), as they are prone to catastrophic forgetting, i.e. the tendency of a deep neural network to lose knowledge pertaining to previous tasks when information relevant to the current task is incorporated. Catastrophic forgetting often leads to an abrupt drop in performance or, in the worst case, old knowledge being completely overwritten by new information [18].
Several approaches have been proposed in the literature to address the problem of catastrophic forgetting in CL. Replay-based methods [2, 3] store and replay a subset of samples belonging to previous tasks along with the current batch of samples. The performance of replay-based methods is commensurate with the buffer size; therefore, these methods leave a large memory footprint. Regularization-based methods [4, 5] add a regularization term to consolidate previous knowledge when training on new tasks. These methods avoid using a memory buffer altogether, alleviating the memory requirements. Parameter-isolation methods [6] allocate a distinct set of parameters to each task, thereby minimizing interference.
Although the aforementioned approaches have been partially successful in mitigating catastrophic forgetting, they still suffer from several shortcomings. Since CL methods rely extensively on the cross-entropy loss for classification tasks, they are prone to a lack of robustness to noisy labels [25] and to poor margins [26], affecting their ability to generalize across tasks. Furthermore, the optimization objective of the cross-entropy loss encourages learning representations optimal for the current task while sidelining representations that might be necessary for future tasks, resulting in a loss of prior information [1]. Moreover, the representations of the observed tasks drift when new tasks appear in the incoming data stream, exacerbating backward interference. The inventors posit that purely task-specific learning is the root cause of several of these problems and is ill-equipped to deal with catastrophic forgetting.
There have been efforts to incorporate self-supervised learning (SSL) into CL. Gallardo et al. (2021) [7] empirically showed that self-supervised pre-training yields representations that generalize better across tasks than supervised pre-training in CL. In many real-world CL scenarios, however, the data distribution of future tasks is not known beforehand. Pre-training on a different data distribution often leads to domain shift, subsequently reducing the generalizability of the learned representations. Furthermore, longer task sequences diminish the effect of self-supervised pre-training, as the learned representations are repeatedly overwritten to maximize performance on the current task.
Owing to the additional computational effort, some approaches (e.g. [8, 9]) relinquish pre-training altogether and employ an auxiliary pretext task to boost task-agnostic learning. However, these approaches show only a marginal improvement over baseline methods. To further mitigate catastrophic forgetting, it is pertinent to learn task-agnostic representations. However, an effective approach integrating SSL into CL is still missing.
Discussion of any references or publications herein is given for more complete background and is not to be construed as an admission that such references or publications are prior art for patentability determination purposes.
It is an object of the current invention to correct the shortcomings of the prior art by improving forward facilitation while reducing backward interference in continual learning. This and other objects, which will become apparent from the following disclosure, are achieved with a self-supervised learning method for continual learning in deep neural networks having the features of one or more of the appended claims.
According to a first aspect of the invention, the proposed method for continual learning over non-stationary data streams comprises learning a number of sequential tasks, wherein a training budget is allocated for each task and wherein said training budget is divided into a task-agnostic learning phase and a task-specific learning phase.
Suitably, the task-agnostic learning phase is followed by the task-specific learning phase. These features help bridge the aforementioned shortcomings without the need for SSL pre-training.
The task-agnostic learning phase comprises a self-supervised learning phase. Self-supervised learning offers the advantage of learning task-agnostic, robust representations that generalize across multiple tasks.
Solving pretext tasks created from known information can help in learning representations useful for downstream tasks. The task-agnostic learning phase is, therefore, cast as an instance-level discrimination task in which augmented views of the input samples are contrasted in a latent embedding space, as detailed below with reference to Eqs. (3) and (4).
In order to align the task-agnostic representations to the current task, the task-specific learning phase comprises a supervised learning phase.
The interplay between task-agnostic and task-specific learning objectives can lead to sharp drift in the feature space and erode the generic representations learned during task-agnostic learning. Multi-objective learning offers a viable solution to address this trade-off. Multi-objective learning can be thought of as a form of inductive transfer and is known to improve generalization. It is also data efficient as multiple objectives are learned simultaneously using shared representations.
The task-specific learning phase comprises the step of training a classification head with a cross-entropy objective. To further restrict the deviation from the representations learned in the self-supervised phase, the supervised learning phase comprises rotation prediction as a task-agnostic auxiliary loss function. Employing rotation prediction as an auxiliary loss preserves the task-agnostic features.
Objects, advantages and novel features, and further scope of applicability of the present invention will be set forth in part in the detailed description to follow, taken in conjunction with the accompanying drawings, and in part will become apparent to those skilled in the art upon examination of the following, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more embodiments of the present invention and, together with the description, serve to explain the principles of the invention. The drawings are only for the purpose of illustrating one or more embodiments of the invention and are not to be construed as limiting the invention. In the drawings:
Continual learning normally consists of T sequential tasks. During each task, the input samples and the corresponding labels (x_t, y_t) are drawn from the task-specific data distribution D_t. The continual learning model consists of a backbone network f_θ and three heads (h_θ^ssl, h_θ^cls, h_θ^rot). The continual learning model g_θ = {f_θ, h_θ^ssl, h_θ^cls, h_θ^rot} is sequentially optimized on one task at a time up to the current task t ∈ {1, ..., T_c}. The objective function is therefore as follows:
\mathcal{L}_{T} = \sum_{t=1}^{T} \mathbb{E}_{(x_t,\,y_t)\sim D_t}\!\left[\,\ell_{ce}\!\left(\sigma(g_\theta(x_t)),\, y_t\right)\right] \quad (1)
where σ is a softmax function and ℓ_ce is a classification loss, generally a cross-entropy loss. Continual learning is especially challenging since the data from previous tasks are unavailable, i.e. at any point during training, the model g_θ has access to the current data distribution D_t alone. As the objective function in Eq. (1) is solely optimized for the current task, it leads to overfitting on the current task and catastrophic forgetting of older tasks. Replay-based methods seek to address this problem by storing a subset of training data from previous tasks and replaying it alongside current-task samples. For replay-based methods, Eq. (1) can thus be rewritten as:
\mathcal{L}_{cls} = \mathcal{L}_{T} + \mathbb{E}_{(x_r,\,y_r)\sim D_r}\!\left[\,\ell_{ce}\!\left(\sigma(g_\theta(x_r)),\, y_r\right)\right] \quad (2)
where D_r represents the distribution of samples stored in the buffer. Although the cross-entropy loss is widely used for classification tasks in continual learning, it suffers from several shortcomings, such as a lack of robustness to noisy labels and the possibility of poor margins, affecting the ability to generalize across tasks. Self-supervised learning offers an alternative by learning task-agnostic, robust, and generalizable representations. Therefore, a two-stage training scheme consisting of task-agnostic learning followed by task-specific learning can help bridge the aforementioned shortcomings without the need for pre-training.
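By way of illustration only, a possible PyTorch realization of the model g_θ with its three heads and of the replay-augmented objective of Eq. (2) could look as follows; the ResNet-18 backbone, layer sizes and all function and attribute names are assumptions introduced for this sketch and are not part of the disclosure.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class ContinualModel(nn.Module):
    """Illustrative backbone f_theta with three heads: SSL projection,
    classification, and rotation prediction (names are assumptions)."""
    def __init__(self, num_classes: int, proj_dim: int = 128):
        super().__init__()
        backbone = resnet18()
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()          # f_theta: feature extractor
        self.backbone = backbone
        self.head_ssl = nn.Sequential(       # h_theta^ssl: projection head
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim))
        self.head_cls = nn.Linear(feat_dim, num_classes)  # h_theta^cls
        self.head_rot = nn.Linear(feat_dim, 4)            # h_theta^rot: 0/90/180/270

    def forward(self, x):
        feats = self.backbone(x)
        return feats, self.head_cls(feats)

def replay_ce_loss(model, x_t, y_t, x_r, y_r):
    """Eq. (2): cross-entropy on current-task samples plus buffered samples."""
    _, logits_t = model(x_t)
    _, logits_r = model(x_r)
    return F.cross_entropy(logits_t, y_t) + F.cross_entropy(logits_r, y_r)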
The task-agnostic learning phase of the disclosed method, as depicted in the accompanying drawings, is cast as an instance-level discrimination task and is optimized with a contrastive objective, Eq. (3).
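The exact form of Eq. (3) is not reproduced in this text; consistent with the surrounding description, it presumably corresponds to the standard NT-Xent/InfoNCE instance-discrimination loss, for example:

\mathcal{L}_{ssl} = -\sum_{i\in I} \log \frac{\exp\!\left(z_i \cdot z_{j(i)} / \tau\right)}{\sum_{a\in A(i)} \exp\!\left(z_i \cdot z_a / \tau\right)} \quad (3)

Here, under this assumed notation, i indexes the augmented samples in a minibatch, z_{j(i)} denotes the embedding of the other augmented view of sample i, A(i) is the set of all other indices in the minibatch, and τ is a temperature hyperparameter.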
Contrastive learning in Eq. (3) learns visual representations by contrasting semantically similar (positive) and dissimilar (negative) pairs of data samples such that similar pairs have the maximum agreement via a contrastive loss in the latent space through Noise Contrastive Estimation (NCE). Given a limited training time for each task, it is pertinent to learn task-agnostic features that are in line with the class boundaries to avoid interference in the downstream tasks. Eq. 3 is then adapted to leverage label information. Within each minibatch, normalized embeddings belonging to the same class are pulled together while those belonging to other classes are pushed away in the latent space.
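The label-aware adaptation of Eq. (3) presumably corresponds to a supervised contrastive (SupCon-style) objective, for example:

\mathcal{L}_{sup} = \sum_{i\in I} \frac{-1}{|P(i)|} \sum_{p\in P(i)} \log \frac{\exp\!\left(z_i \cdot z_p / \tau\right)}{\sum_{a\in A(i)} \exp\!\left(z_i \cdot z_a / \tau\right)} \quad (4)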
where P(i) is a set of indices of samples belonging to the same class as the positive pair and |P(i)| is its cardinality. While Eq. 4 is a simple extension to contrastive loss, it eliminates the need for hard negative mining.
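A minimal PyTorch sketch of this supervised contrastive objective, assuming L2-normalized projection embeddings and a temperature hyperparameter (the function and variable names are illustrative):

import torch
import torch.nn.functional as F

def sup_con_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over a minibatch of embeddings.
    embeddings: (N, D) projections from h_theta^ssl; labels: (N,) class ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                      # pairwise similarities
    # Exclude self-similarity from numerator and denominator.
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # P(i): indices of samples sharing the anchor's label (anchor excluded).
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)       # |P(i)|
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return loss.mean()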
To align the task-agnostic representations to the current task, the classification head h_θ^cls is trained with the cross-entropy objective defined in Eq. (2). However, the interplay between task-agnostic and task-specific learning objectives can lead to sharp drift in the feature space and erode the generic representations learned during task-agnostic learning.
Multi-objective learning offers a viable solution to address this trade-off. Multi-objective learning can be thought of as a form of inductive transfer and is known to improve generalization. It is also data efficient as multiple objectives are learned simultaneously using shared representations.
Furthermore, a rotation prediction is employed as an auxiliary loss to preserve the task-agnostic features. During the task-specific stage of each task, input samples x ∈ D_t ∪ D_r are rotated by a fixed angle in addition to other transformations. The learning objective is to match the task-specific ground truths y ∈ D_t ∪ D_r as well as the auxiliary ground truths y_a ∈ {0°, 90°, 180°, 270°}, i.e.
\mathcal{L}_{mo} = \alpha\, \mathcal{L}_{cls} + \beta\, \mathbb{E}_{x\sim D_t \cup D_r}\!\left[\,\ell_{ce}\!\left(\sigma\!\left(h_\theta^{rot}(f_\theta(x))\right),\, y_a\right)\right] \quad (5)
where α and β are hyperparameters for adjusting the magnitudes of the two losses.
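A minimal sketch of this multi-objective, task-specific step, building on the illustrative ContinualModel above and assuming each sample is rotated by a randomly selected multiple of 90° (the helper names and the per-sample random rotation are assumptions):

import torch
import torch.nn.functional as F

def rotate_batch(x: torch.Tensor):
    """Rotate each image by a randomly chosen multiple of 90 degrees and
    return the rotated batch with its auxiliary rotation labels y_a."""
    y_a = torch.randint(0, 4, (x.size(0),), device=x.device)
    x_rot = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                         for img, k in zip(x, y_a)])
    return x_rot, y_a

def task_specific_loss(model, x, y, alpha: float = 1.0, beta: float = 1.0):
    """L_mo = alpha * L_cls + beta * rotation-prediction auxiliary loss, Eq. (5)."""
    feats = model.backbone(x)
    loss_cls = F.cross_entropy(model.head_cls(feats), y)
    x_rot, y_a = rotate_batch(x)
    loss_rot = F.cross_entropy(model.head_rot(model.backbone(x_rot)), y_a)
    return alpha * loss_cls + beta * loss_rot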
Algorithm 1 summarizes the proposed method in detail.
Input: task data distributions D_t for t = 1, ..., T, replay buffer D_r, a per-task training budget, a budget-splitting ratio strictly between 0 and 1, and the model g_θ = {f_θ, h_θ^ssl, h_θ^cls, h_θ^rot}
for each task t ∈ {1, ..., T} do
    Task-agnostic Learning
    for each epoch in the task-agnostic portion of the training budget do
        for each minibatch drawn from D_t ∪ D_r do
            update f_θ and h_θ^ssl by minimizing the supervised contrastive loss of Eq. (4)
    Task-specific Learning
    for each epoch in the remaining portion of the training budget do
        for each minibatch drawn from D_t ∪ D_r do
            update g_θ by minimizing the multi-objective loss of Eq. (5)
    update the replay buffer D_r with samples from D_t
end for
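By way of illustration, a compact per-task training loop consistent with Algorithm 1, reusing the illustrative sup_con_loss and task_specific_loss sketches above; the data-loader handling, epoch count and split ratio are assumptions made for this sketch:

import torch

def train_task(model, optimizer, current_loader, buffer_loader,
               epochs: int = 50, split_ratio: float = 0.4):
    """Two-stage training on one task: task-agnostic epochs first,
    then task-specific epochs, both over current and buffered samples."""
    agnostic_epochs = int(split_ratio * epochs)
    for epoch in range(epochs):
        for (x_t, y_t), (x_r, y_r) in zip(current_loader, buffer_loader):
            x = torch.cat([x_t, x_r])
            y = torch.cat([y_t, y_r])
            if epoch < agnostic_epochs:
                # Task-agnostic phase: supervised contrastive objective, Eq. (4)
                z = model.head_ssl(model.backbone(x))
                loss = sup_con_loss(z, y)
            else:
                # Task-specific phase: multi-objective loss, Eq. (5)
                loss = task_specific_loss(model, x, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()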
Embodiments of the present invention can include every combination of features that are disclosed herein independently from each other. Although the invention has been discussed in the foregoing with reference to an exemplary embodiment of the method of the invention, the invention is not restricted to this particular embodiment which can be varied in many ways without departing from the invention. The discussed exemplary embodiment shall therefore not be used to construe the appended claims strictly in accordance therewith. On the contrary the embodiment is merely intended to explain the wording of the appended claims without intent to limit the claims to this exemplary embodiment. The scope of protection of the invention shall therefore be construed in accordance with the appended claims only, wherein a possible ambiguity in the wording of the claims shall be resolved using this exemplary embodiment.
Variations and modifications of the present invention will be obvious to those skilled in the art and it is intended to cover in the appended claims all such modifications and equivalents. The entire disclosures of all references, applications, patents, and publications cited above are hereby incorporated by reference. Unless specifically stated as being “essential” above, none of the various components or the interrelationship thereof are essential to the operation of the invention. Rather, desirable results can be achieved by substituting various components and/or reconfiguration of their relationships with one another.
Optionally, embodiments of the present invention can include a general or specific purpose computer or distributed system programmed with computer software implementing the steps described above, which computer software may be in any appropriate computer language, including but not limited to C++, FORTRAN, ALGOL, BASIC, Java, Python, assembly language, microcode, distributed programming languages, etc. The apparatus may also include a plurality of such computers/distributed systems (e.g., connected over the Internet and/or one or more intranets) in a variety of hardware implementations. For example, data processing can be performed by an appropriately programmed microprocessor, computing cloud, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), or the like, in conjunction with appropriate memory, network, and bus elements. One or more processors and/or microcontrollers can operate via instructions of the computer code, and the software is preferably stored on one or more tangible non-transitory memory-storage devices.