The field is generally continuous learning algorithms. More specifically, the field of the embodiments is continuous learning (CL) algorithms that combine the ability of self-supervised features to generalize to new classes with a recursive, classifier-centric approach to incremental learning.
Continuous learning (CL) is a machine learning (ML) field concerned with training methods that can learn from a never-ending stream of data. AI agents need CL algorithms to be able to learn new classes and improve upon their old classes when given additional information. This paper examines methods to improve CL algorithms so that AI agents can learn in real time and adapt to their dynamic environments.
One of the main objectives in CL is to develop methods that can learn new tasks sequentially. Standard optimizers like stochastic gradient descent (SGD) require access to an entire dataset upfront so that they can stochastically shuffle the dataset's classes into training batches that never continually favor one class. This is necessary because these standard optimizers have no long-term memory and update their weights based only on the information in a batch. For incremental learning applications where only the new task data is available, this leads these optimizers to overfit to the current task, often with significant performance degradation on the previously learned tasks.
This degradation is known as catastrophic forgetting (CF) and is the chief obstacle facing incremental learning algorithms. The lack of an accepted solution to this sequential learning problem typically results in models having to be retrained over all their classes whenever a new class needs to be added or the statistics of the data change. This retraining approach adds considerable recomputation time and data storage requirements, since all the previous classes have to be stored for future class updates. This approach is not sustainable for AI agents in the field that need to continuously update their weights with new training data in real time and with limited memory and compute resources.
Incremental learning is a subfield of CL that focuses on the task of sequential learning while trying to mitigate catastrophic forgetting (CF). Most incremental learning methods assume the solution requires finetuning both a network's feature extraction backbone and its new classifier on the new class data, while simultaneously trying to limit the performance degradation on previously learned classes. This is very challenging since the optimizer is updating the feature weights upon which previous classifiers were trained.
Many different incremental learning approaches have been explored, and most involve some method of incorporating the memory of prior classes into the network's backbone update for the new class to reduce CF on previously learned class features. For example, rehearsal methods like iCARL store exemplars from previous classes and replay them during new class training to preserve the memory of the prior classes and mitigate CF. Parametric isolation methods like PathNet mitigate CF by training models or sub-models on one task, freezing those models' weights or sub-weights, and then training a new model or sub-model on the new task. Still other approaches use regularization methods like Elastic Weight Consolidation (EWC) to penalize weight updates that interfere with important features from previous classes, where the importance of class features is determined using a Fisher Information matrix. Other regularization approaches like 'Learning without Forgetting' use the soft labels of previously trained models run on the new class data to preserve memory of the prior classes during the model's update on the new class data.
All these methods make simplifying assumptions that preclude other desirable CL training objectives. For example, all these methods assume that when a new class is added, all the training data for that new class is available. This means that as each new class is added there is enough information to train the class until it is completely learned. This assumption allows replay methods to be assured of finding the most representative exemplars for a given class. It allows ensemble methods to assume that the class is learned to completion so that its weights can be frozen during the training of future classes. It allows the regularization methods to assume that prior classes have been sufficiently learned so that the statistically important features of those classes can be determined for new class training penalties.
However, the assumption that all new class training data is available at the onset of new class training is incompatible with the CL goal of learning from a stream of increasing information content with mixed classes. It also ignores the very desirable CL goal of being able to revisit tasks with additional training data, as well as the CL objective of training on data that is not clearly segmented into one-class increments, that is, training without clear task boundaries. This assumption also glosses over the goal of rapid online learning, thereby allowing solutions that require offline, multi-episodic passes over all of a class's training examples.
Another common approach in these incremental learning techniques is the use of multiple single-class prediction heads as opposed to one multi-class prediction head. The use of separate single-task prediction heads prevents these techniques from jointly optimizing their weights across the entire set of classes and increases their training and inference times. Incremental learning methods, such as rehearsal and replay, whose memory banks grow as additional data and classes are added, also ignore the CL goal of having a constant memory component for learning over long streams of data. Perhaps the most important limitation of all these incremental learning methods is that they can neither justify theoretically nor consistently demonstrate the ability to eliminate CF during sequential training.
In a first exemplary embodiment, a system for incrementally training a classifier to continuously learn new classes and classify incoming data includes: a processing component for formatting incoming data for feature extraction; a feature extraction backbone for (i) multiplying formatted data with a positional embedding; (ii) transforming, by an encoder, formatted data with positional embedding to produce an encoded vector; (iii) appending a class token to the encoded vector; and a single-head incremental classifier trained on one or more known classes for (iv) receiving, by a classification weight matrix wk, where k denotes the kth training update, the encoded vector and determining that the incoming data is in a new class; (v) augmenting the classification weight matrix with a new null-class weight vector Δwk; and (vi) training the incremental classifier on training data having feature samples corresponding to the incoming data directed to the new class.
In a second exemplary embodiment, a non-transitory computer-readable storage medium having computer-executable instructions stored thereon for classifying incoming data, which when executed by one or more processors, cause the one or more processors to perform operations including: formatting incoming data for feature extraction; multiplying formatted data with a positional embedding; encoding formatted data with a positional embedding to produce an encoded vector; appending a class token to the encoded vector; receiving the encoded vector at an incremental classifier and determining by a classification weight matrix that the incoming data is in a new class; augmenting the classification weight matrix with a new null-class weight vector Δwk; and training the incremental classifier on training data having feature samples corresponding to the incoming data directed to the new class.
Example embodiments will become more fully understood from the detailed description given herein below and the accompanying drawings, wherein like elements are represented by like reference characters, which are given by way of illustration only and thus are not limitative of the example embodiments herein.
The present embodiments lie at the intersection of self-supervision and incremental learning technologies. The embodied approach directly transfers the frozen feature weights from a pretrained, self-supervised model to a downstream classifier for incremental learning. The self-supervised backbones produce class-agnostic features that generalize well to a wide array of classes, simplifying CL algorithm development so that it can focus on a classifier-centric approach. With this approach, it is no longer necessary to incrementally finetune a network's backbone for each new class.
Extending Rapid Class Augmentation ("XRCA") is based on a recursive least-squares (RLS) solution that has been modified for incremental learning. It exploits the inherent RLS memory, which incorporates information about all the past training examples into each optimization step by recursively updating an inverse feature covariance matrix (IFCM) using the Matrix Inversion Lemma. This IFCM acts as XRCA's primary memory component and effectively resolves the typical forgetting issues that arise for standard optimization techniques like SGD, which retain no memory or informational context outside of the current batch.
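For reference, the rank-one form of the Matrix Inversion Lemma (the Sherman–Morrison identity) that enables this kind of recursive update of an inverse covariance matrix can be written as:

$$
\left(A + x\, x^{\mathsf{T}}\right)^{-1} \;=\; A^{-1} \;-\; \frac{A^{-1} x\, x^{\mathsf{T}} A^{-1}}{1 + x^{\mathsf{T}} A^{-1} x}.
$$

This standard identity is shown only to illustrate why the IFCM can be refreshed at each training step without re-inverting a full feature-dimension matrix; it is not reproduced from the XRCA equations themselves.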
Its recursive approach to memory means that the order in which the samples arrive, or the mixture of samples within a batch, is unimportant, and its incremental solution will have the same jointly optimal accuracy across all trained classes as its non-incremental solution. In other words, this recursively computed solution eliminates the CF that occurs with standard batch-based optimization methods.
This makes XRCA distinct from other incremental learning approaches. For example, XRCA does not require the replay method's storage of prior class features or the ensemble method's storage of multiple class models or sub-models. Nor does it use the optimization constraints and cost penalties that characterize the regularization approaches.
This recursive approach allows an XRCA classifier to optimally learn in an online manner. It allows seamless class updates and task revisit capabilities. It also allows training on incrementally mixed class batches with no requirement for distinct task boundaries. In addition, XRCA's IFCM memory does not grow as additional training samples are added. This gives it a constant memory component which enables learning over very long data streams with limited memory.
A key algorithmic distinction of XRCA is that, instead of initializing the new class weights randomly, it recursively computes and maintains a null-class weight vector that is used to initialize any new class. This null-class weight vector is updated recursively on all the training samples using a negative label. The idea is to view a new class, at initialization, as simply a class for which there have been no prior positive training examples. In other words, by training a null-class weight vector over all the previously seen training data using a negative label, we have the optimal least-squares initialization for any new class.
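As a hedged illustration of this idea (the exact expression appears later as part of Eq. (1)), the null-class weight vector can be viewed as the regularized least-squares solution that maps every previously seen feature vector to a negative (−1) label:

$$
\Delta w_k \;=\; \arg\min_{w}\; \left\lVert X_{1:k}\, w - T_{\mathrm{Neg}} \right\rVert^2 + \lambda \lVert w \rVert^2,
\qquad T_{\mathrm{Neg}} = [-1, \dots, -1]^{\mathsf{T}},
$$

where X_{1:k} denotes all features seen through update k and λ is a regularization constant assumed here for illustration.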
This allows the XRCA method to incrementally train a multi-class classifier so that it is jointly optimized over both old and new classes. To our knowledge, XRCA is the only incremental learning method that eliminates CF while jointly optimizing over the set of old and new classes. It also means XRCA learns new classes orders of magnitude faster, and with greatly reduced memory and compute resources, compared to a standard SGD classifier that needs to store and retrain over all of the prior classes to perform an update and achieve optimal performance.
The transformer encoder 20 uses a series of transformer blocks, each transformer block including a multi-head self-attention layer and a feed-forward layer. Each multi-head self-attention layer contains multi-head attention (MHA) modules used to find the important correlations between the layer's separate token embeddings. The MHA outputs from each layer are summarized using a multi-layer perceptron (MLP) module and the MLP outputs serve as the new token embeddings for the next layer. The final layer's class token is used as the feature vector 22 that summarizes the image and is fed into the network's classifier 24.
In the image classification domain, each self-attention layer calculates attention weights for each token (image patch) based on its relationship with all other tokens, while the feed-forward layer applies a non-linear transformation to the output of the self-attention layer.
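A minimal PyTorch sketch of one such encoder block is shown below; the layer sizes are illustrative placeholders rather than parameters taken from the embodiments, and real VIT backbones include additional details (dropout, stochastic depth, etc.) omitted here for brevity.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer block: multi-head self-attention followed by an MLP."""
    def __init__(self, dim=384, num_heads=6, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, tokens):
        # Self-attention correlates the class token and patch token embeddings.
        x = self.norm1(tokens)
        attn_out, _ = self.attn(x, x, x)
        tokens = tokens + attn_out
        # Feed-forward (MLP) layer produces the token embeddings for the next block.
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens
```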
The resulting class-token embedding 22 is the feature vector upon which the XRCA or R3CA classifier 24 operates to generate output scores 26 and assign a class prediction 28. During incremental training, when a new image class becomes available, the R3CA classification matrix is augmented with a new null-class vector and then trained on feature samples corresponding to the new image and label. During inference, the R3CA classifier operates on a feature embedding and outputs a classification score 26 that is jointly optimized over all classes in the training set.
Additional details describing the use of multi-head attention layers in transformer encoders can be found in Vaswani et al., Attention Is All You Need, NIPS (2017), which is incorporated herein by reference. Additional details describing the use of transformer architectures for computer vision can be found in Dosovitskiy et al., An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, arXiv:2010.11929v2 [cs.CV] 3 Jun. 2021, which is incorporated herein by reference.
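As an illustrative sketch (assuming, for example, the publicly available DINO ViT-S/16 weights via torch.hub; the exact backbones used in the experiments may differ), the frozen class-token feature 22 can be extracted as follows:

```python
import torch
from PIL import Image
from torchvision import transforms

# Load a self-supervised ViT backbone via torch.hub (DINO ViT-S/16 assumed here).
backbone = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
backbone.eval()  # weights stay frozen; no incremental finetuning of the backbone

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open('example.jpg')).unsqueeze(0)  # 'example.jpg' is a placeholder path
with torch.no_grad():
    feature = backbone(img)  # final-layer class-token embedding (384-dim for ViT-S/16)
# 'feature' is the vector upon which the incremental classifier operates.
```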
In the example shown in
The XRCA method is based on a modified RLS to address the incremental learning task and is described in detail in co-owned U.S. application Ser. No. 17/083,969. An XRCA model consists of three core elements: 1) a classification weight matrix, wk; 2) an inverse feature covariance matrix (IFCM), Mk; and 3) a null-class vector, Δwk, where the subscript k denotes the kth training update. Its linear classification weight matrix wk maps a network's backbone features to its classes. Its IFCM Mk stores previously seen feature correlation data and acts as the model's primary memory. The null-class vector Δwk is used to initialize a new class using all the prior class information of what it is not. It is computed in a recursive manner by updating its weights using feature data with all negative labels. This is seen to provide the optimal least-squares (LS) initialization for any new class.
The components of an XRCA model are initialized using a feature matrix containing the base model's classes, X₀ ∈ ℝ^(N×F), of size N samples × F feature dimensions, and a label matrix T₀ ∈ ℝ^(N×C) composed of stacked signed, one-hot labels, where C is the number of classes. Eq. (1) shows XRCA's initialization equations. These three regularized elements, M₀ ∈ ℝ^(F×F), w₀ ∈ ℝ^(F×C), and Δw₀ ∈ ℝ^(F×1), make up the components of an XRCA base model and will be recursively updated as additional data is presented. The TNeg term appearing in the null-class Δw equation consists of a vector of −1's of dimension N×1, representing negative labels for all base model examples.
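Because Eq. (1) is not reproduced in this text, the following NumPy sketch shows one assumed ridge-regularized form of the initialization that is consistent with the description above; the regularization constant lam is an illustrative placeholder, not a value taken from the embodiments.

```python
import numpy as np

def xrca_init(X0, T0, lam=1e-3):
    """X0: N x F base-model features; T0: N x C signed one-hot labels."""
    N, F = X0.shape
    M0 = np.linalg.inv(X0.T @ X0 + lam * np.eye(F))  # IFCM, F x F
    w0 = M0 @ X0.T @ T0                              # classification weights, F x C
    T_neg = -np.ones((N, 1))                         # TNeg: negative labels for the null class
    dw0 = M0 @ X0.T @ T_neg                          # null-class vector, F x 1
    return M0, w0, dw0
```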
For each new batch of additional training data, the XRCA algorithm first checks whether any samples in the batch contain new class labels. If a batch contains only existing class data (i.e., all class labels are less than the number of columns in the current classifier), the XRCA algorithm computes the RLS updates and the new update for the null-class initialization vector Δwk, as seen below in Eq. (2).
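Since Eq. (2) is likewise not reproduced here, the sketch below assumes a standard regularized RLS batch update via the Matrix Inversion Lemma; it matches the recursive behavior described above but is not asserted to be the exact claimed form.

```python
import numpy as np

def xrca_update(Mk, wk, dwk, X, T):
    """X: B x F batch features; T: B x C signed one-hot labels for existing classes."""
    B = X.shape[0]
    # Matrix Inversion Lemma: refresh the IFCM without re-inverting an F x F matrix.
    G = np.linalg.inv(np.eye(B) + X @ Mk @ X.T)  # B x B gain term
    Mk1 = Mk - Mk @ X.T @ G @ X @ Mk             # updated IFCM
    wk1 = wk + Mk1 @ X.T @ (T - X @ wk)          # classifier update over all classes
    T_neg = -np.ones((B, 1))
    dwk1 = dwk + Mk1 @ X.T @ (T_neg - X @ dwk)   # null-class initialization vector update
    return Mk1, wk1, dwk1
```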
If a batch contains a new class label (i.e., the label is greater than the number of columns in the current classifier), the XRCA algorithm first augments the existing classification matrix wk, with a null-class vector Δwk as shown in Eq. (3).
The new augmented classifier wk is then passed to the update equations in Eq. (2) and trained with the new xk+1 training data sample or batch.
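A minimal sketch of this augmentation step is given below; appending the null-class vector as the last column is an assumption about ordering, and xrca_update refers to the hypothetical update sketch above.

```python
import numpy as np

def xrca_augment(wk, dwk):
    """Append the null-class vector as the initial weight column of the new class."""
    return np.hstack([wk, dwk])  # F x (C+1) augmented classifier

# Example usage (hypothetical batch for the new class):
# wk = xrca_augment(wk, dwk)
# Mk, wk, dwk = xrca_update(Mk, wk, dwk, X_new, T_new)  # T_new now has C+1 label columns
```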
The experimental results summarized herein show that, given the advent of self-supervised vision methods and their excellent ability to generalize to a wide variety of downstream tasks, incremental learning algorithms no longer need to individually finetune their feature extraction backbones on new class data. Specifically, the experiments use the XRCA incremental learning classifier with features from various VITs pretrained using supervised, self-supervised, finetuned, and domain-adapted techniques. XRCA's incremental learning ability is compared with prior art SGD in terms of CF and augmentation training times to exemplify the improvements.
In a first experiment, the XRCA system and process of
The utility of the differently trained features is measured using an all-class accuracy performance metric, which measures the accuracy over the entire test dataset, in this case containing 10 classes. By applying this metric after each successive new class augmentation, we can observe the growing classification capacity of the model. The experiment measures the all-class accuracy of these pretrained models on the ImageNette dataset, which is a 10-class subset of the original ImageNet data.
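A brief sketch of this metric, under the assumption that the classifier is a weight matrix applied to precomputed test features, is:

```python
import numpy as np

def all_class_accuracy(w, X_test, y_test):
    """w: F x C classifier; X_test: N x F features; y_test: N integer labels over all 10 classes."""
    predictions = np.argmax(X_test @ w, axis=1)
    return float(np.mean(predictions == y_test))
```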
XRCA's all-class accuracy using the features from the two pretrained backbones is shown in
The results empirically confirm the theory that an optimal least-squares solution can be recursively obtained using the Matrix Inversion Lemma and XRCA's null-class vector for new class initialization.
Experiment 2 explores the central theme of this paper by examining the potential of self-supervised, pretrained vision backbones to transfer directly without finetuning to previously unseen downstream tasks.
Experiment 2 reuses the S-VIT and SS-VIT from Experiment 1 above and directly transfers their features to the new downstream task of car classification. In this experiment, the XRCA classifier is initialized on the first 50 cars in the Stanford CARS196 dataset and then incrementally trained over the remaining 146 cars using just the new class data.
In addition to these metrics, a new train-class metric is added. The train-class metric measures the classifier's accuracy over previously trained classes. It provides a clearer measurement of an incremental learning algorithm's ability to avoid significant degradation on previous classes as new classes are added. The train-class metric shows that the classification accuracy on previously learned classes stays roughly constant and does not experience any significant degradation. Notably, it converges to the ideal non-incremental learning accuracy as the final classes are incorporated.
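A corresponding sketch of the train-class metric, assuming class indices below the current classifier width are the previously trained classes, is:

```python
import numpy as np

def train_class_accuracy(w, X_test, y_test):
    """Accuracy restricted to test samples whose classes have already been trained."""
    n_trained = w.shape[1]                 # number of classes trained so far
    mask = y_test < n_trained
    predictions = np.argmax(X_test[mask] @ w, axis=1)
    return float(np.mean(predictions == y_test[mask]))
```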
These results show an approximately 25% increase in the overall classification performance for an XRCA classifier using the self-supervised features as opposed to the supervised features. The strong separation between the self-supervised features and the supervised features supports our hypothesis that the self-supervised features can generalize better to new classes than the supervised features that were trained explicitly on a different set of tasks.
On the one hand, the supervised features' relatively poor incremental learning classification accuracy seems to motivate the strategy, adopted by most incremental learning approaches, of improving performance by sequentially finetuning the backbone on the additional classes. On the other hand, the relatively good incremental learning accuracy of the self-supervised features raises the question of whether this finetuning approach is still necessary, especially given the potential for even better generalization once future self-supervised models are pretrained over larger and more abundant data sources.
In other words, given the success of self-supervision in producing features that generalize well to downstream tasks, the sequential finetuning approaches pursued by most incremental learning algorithms might become irrelevant. These findings increase the utility of classifier-focused incremental learning methods, like XRCA, that are well-positioned to leverage these self-supervised techniques for continuous learning objectives.
These findings are further highlighted in
Experiment 3 seeks to quantify the upper bound for the classification task if we first could finetune the VIT backbone on the downstream CARS196 classes and then incrementally learn those classes using XRCA. This sort of finetuning approach is often not possible for many CL applications that do not have the luxury of pretraining their backbones over known future classes. This limitation is probably what originally motivated the other incremental learning techniques to take on the difficult challenge of finetuning incrementally on the new class data.
Experiment 4 uses a simple form of domain adaptation to see if it is advantageous to finetune a pretrained backbone over similar but different car classes. The hope is that it will help the feature extraction backbone learn important car features that will generalize well to future car classes. This scenario represents a more realistic use case where the genre of future classes is more likely known and there are a limited set of representative genre training examples available but the exact future classes are unknown.
The experiment is divided into two parts. First, a VIT model is finetuned over the first 50 cars in the CARS196 dataset to learn the domain adaptation to the CARS genre. Second, that domain-adapted backbone is used to extract the features over the remaining set of 146 classes, which it has not previously seen. The performance of an incremental learning XRCA classifier operating on these 146 classes of domain-adapted features will also be compared to its supervised and self-supervised counterparts.
In this example, the XRCA classifier is initialized on the first 100 of these 146 classes and then incrementally trained on the remaining 46 classes. The results show that all the performance rankings are now flipped, with the self-supervised features doing much better than the supervised features, which in turn perform better than the domain-adapted features. These results illustrate the difficulty of using finetuning over similar classes for domain adaptation to a given genre. They indicate that the finetuned backbone can easily overfit to the smaller domain-relevant dataset and in the process lose its ability to transfer well to unseen new classes.
This result also highlights the challenge for conventional incremental learning algorithms to avoid overfitting on the current new class while finetuning the model on the incrementally available new class data. It supports the idea that pretraining a self-supervised model on a large dataset may have a better chance of producing features that generalize to unseen future classes than trying to finetune a model on a limited data set to improve domain relevance. These results once again support the use of incremental learning algorithms that can leverage these pretrained, self-supervised features for downstream classification.
Experiment 5 compares XRCA's incremental learning with a standard SGD optimizer in an incremental learning application to highlight XRCA's ability to eliminate CF. Both classifiers are trained with the features produced by a small vision transformer (SS-VIT) architecture that has a feature dimension of 386 and is pretrained using DINO on ImageNet and transferred to the CARS196 dataset. Both classifiers are initialized with the same base model over the first 195 cars and have the same train-class performance.
The experiment begins by augmenting both classifiers with the final 196th CAR class and updating each classifier's weights using just that class data.
The final experiment conducts a timing comparison that demonstrates the significantly faster augmentation rates enabled by the XRCA incremental learning algorithm. We use the same FT-VIT feature extraction backbone with a feature dimension of 768 previously pretrained on CARS196 for both the XRCA and SGD classifier. Each classifier was initially trained on 195 car classes, and both achieved classification accuracies of approximately 83% on these classes. Each method was then timed for the period it took to augment and train its classifier with a new class.
The XRCA classifier was augmented using just the new class training data. It learned the new class with a test accuracy of 100% in 0.007 seconds while preserving its old class accuracy. The SGD classifier took 13.878 s to learn its new class while training over all the classes to maintain previous class performance. This is approximately 1900× slower than XRCA. These times are shown in Table 1.
The reason SGD is so much slower is that it must be trained using batches containing roughly equal mixtures of all its 196 classes. This means that the new class data is lightly sprinkled into its training batches, which significantly increases the amount of time necessary to update the SGD classifier. Furthermore, as the number of previously learned classes grows, it takes SGD and other non-incremental methods longer to learn the new class while also avoiding CF.
Note this time does not include the time to extract the features from the images and so in some sense represents the best-case time for SGD. If feature extraction time is included, the XRCA classifier can be updated in under 1.5 seconds, while the SGD model takes over 8000 seconds to extract all the image features over the 50 training epochs.
The embodiments herein describe the advantages of using the XRCA incremental learning technique with a self-supervised VIT backbone. The results of the disclosed experiments show that a self-supervised backbone produces features that generalize significantly better to new classes and ultimately yield improved incremental learning over the pretrained supervised features. The results also show that care must be taken when using supervision to domain-adapt a backbone to similar but different classes. XRCA's rapid learning and ability to eliminate catastrophic forgetting were compared with an SGD progressive learning implementation and shown to be over 1000× faster.
Exemplary hardware (also referenced herein as “chip(s)”) and hardware functions for implementing the embodiments described herein are well known and understood to those skilled in the art. Chips for use with the present embodiments include logic functionality implemented through semiconductor devices, e.g., millions or billions of transistors (MOSFET) (also called “nodes”) and electrical interconnects, for creating basic logic gates to perform basic logical operations. These basic logic gates are combined to perform complex high volume, parallel computing required for the training and inference of the models of the embodiments described herein. Chips may also include memory capabilities for storing the data on which the logic functionality is implemented. Exemplary memory capabilities include dynamic random-access memory (DRAM), NAND flash memory and solid-state hard drives.
As referenced above, the training and inference examples described herein were run on an NVIDIA Tesla T4 GPU with 16 GB of GPU memory. Specifications for the NVIDIA Turing GPU architecture can be found in the “NVIDIA Turing GPU Architecture” white paper WP-09183-001_v01 (2018) available on-line which is incorporated herein by reference in its entirety.
One skilled in the art will appreciate that this is but one specific example of a chip which may implement the training and inference embodiments described herein. Exemplary chip types include graphics processing units (GPUs), field programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). FPGAs include logic blocks (i.e., modules that each contain a set of transistors) whose interconnections can be reconfigured by a programmer after fabrication to suit specific algorithms, while ASICs include hardwired circuitry customized to specific algorithms. The selection of particular hardware includes factors such as computational power, energy efficiency, cost, compatibility with existing hardware and software, scalability, and task (e.g., optimized for training or inference). For a detailed description of AI chip technology, see Khan et al., "AI Chips: What They Are and Why They Matter And AI Chips Reference," CSET Center for Security and Emerging Technology (April 2020), which is incorporated herein by reference in its entirety.
Certain embodiments are directed to a computer program product (e.g., nonvolatile memory device), which includes a machine or computer-readable medium having stored thereon instructions which may be executed by a computer (or other electronic device) to perform these operations/activities.
Although several embodiments have been described above with a certain degree of particularity, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the spirit of the present disclosure. It is intended that all matter contained in the above description or shown in the accompanying drawings shall be interpreted as illustrative only and not limiting. Changes in detail or structure may be made without departing from the present teachings. The foregoing description and following claims are intended to cover all such modifications and variations.
Various embodiments are described herein of various apparatuses, systems, and methods. Numerous specific details are set forth to provide a thorough understanding of the overall structure, function, manufacture, and use of the embodiments as described in the specification and illustrated in the accompanying drawings. It will be understood by those skilled in the art, however, that the embodiments may be practiced without such specific details. In other instances, well known operations, components, and elements have not been described in detail so as not to obscure the embodiments described in the specification. Those of ordinary skill in the art will understand that the embodiments described and illustrated herein are non-limiting examples, and thus it can be appreciated that the specific structural and functional details disclosed herein may be representative and do not necessarily limit the scope of the embodiments, the scope of which is defined solely by the appended claims.
Reference throughout the specification to "various embodiments," "some embodiments," "one embodiment," "an embodiment," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments," "in some embodiments," "in one embodiment," "in an embodiment," or the like, in places throughout the specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, the particular features, structures, or characteristics illustrated or described in connection with one embodiment may be combined, in whole or in part, with the features, structures, or characteristics of one or more other embodiments without limitation.
Any patent, publication, or other disclosure material, in whole or in part, which is said to be incorporated by reference herein is incorporated herein only to the extent that the incorporated materials do not conflict with existing definitions, statements, or other disclosure material set forth in this disclosure. As such, and to the extent necessary, the disclosure as explicitly set forth herein supersedes any conflicting material incorporated herein by reference. Any material, or portion thereof, that is said to be incorporated by reference herein, but which conflicts with existing definitions, statements, or other disclosure material set forth herein will only be incorporated to the extent that no conflict arises between that incorporated material and the existing disclosure material.
The present application claims the benefit of priority to U.S. Provisional Patent Application No. 63/579,144 entitled JOINTLY OPTIMAL INCREMENTAL LEARNING WITH SELF-SUPERVISED VISION TRANSFORMERS filed Aug. 28, 2023, which is incorporated herein by reference in its entirety. Cross-reference is made to commonly-owned U.S. application Ser. No. 17/083,969 entitled DEEP RAPID CLASS AUGMENTATION filed Oct. 29, 2020, U.S. application Ser. No. 17/840,238 entitled METHOD AND SYSTEM FOR ACCELERATING RAPID CLASS AUGMENTATION FOR OBJECT DETECTION IN DEEP NEURAL NETWORKS filed Jun. 14, 2022 and U.S. Provisional Patent Application No. 63/579,151 entitled RIDGE REGRESSION FOR RAPID CLASS AUGMENTATION filed Aug. 28, 2023, which are incorporated herein by reference in their entirety.