This disclosure relates generally to machine learning and, more particularly, to methods and apparatus for source-free active adaptation to distributional shifts for machine learning.
Machine learning is a subfield of artificial intelligence. In machine learning, instead of providing explicit instructions, programmers supply data to a model in a process called training. Training allows a machine learning model to infer outputs that were previously unknown to the model.
Training data is supplied to the model to adapt, test, and validate the machine learning model. Training data can be statistically described by its data distribution. A data distribution specifies possible output values for a random variable and the relative frequency with which the output values occur. Machine learning models trained on an underlying distribution may accurately infer output when provided new input data samples from the same underlying distribution.
Machine learning models are often trained and executed with data of different distributions. Furthermore, when a trained machine learning model is deployed, it may be provided data from a continuously shifting data distribution. Accordingly, adapting machine learning models to shifting data distributions is an area of intense industrial and research interest.
In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts. The figures are not to scale. Instead, the thickness of the layers or regions may be enlarged in the drawings. Although the figures show layers and regions with clean lines and boundaries, some or all of these lines and/or boundaries may be idealized. In reality, the boundaries and/or lines may be unobservable, blended, and/or irregular.
Unless specifically stated otherwise, descriptors such as “first,” “second,” “third,” etc., are used herein without imputing or otherwise indicating any meaning of priority, physical order, arrangement in a list, and/or ordering in any way, but are merely used as labels and/or arbitrary names to distinguish elements for ease of understanding the disclosed examples. In some examples, the descriptor “first” may be used to refer to an element in the detailed description, while the same element may be referred to in a claim with a different descriptor such as “second” or “third.” In such instances, it should be understood that such descriptors are used merely for identifying those elements distinctly that might, for example, otherwise share a same name.
As used herein, “approximately” and “about” modify their subjects/values to recognize the potential presence of variations that occur in real world applications. For example, “approximately” and “about” may modify dimensions that may not be exact due to real world imperfections as will be understood by persons of ordinary skill in the art. For example, “approximately” and “about” may indicate a tolerance range of +/- 10% unless otherwise specified in the below description. As used herein “substantially real time” refers to occurrence in a near instantaneous manner recognizing there may be real world delays for computing time, transmission, etc. Thus, unless otherwise specified, “substantially real time” refers to real time +/- 1 second.
As used herein, the phrase “in communication,” including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.
As used herein, “processor circuitry” is defined to include (i) one or more special purpose electrical circuits structured to perform specific operation(s) and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors), and/or (ii) one or more general purpose semiconductor-based electrical circuits programmable with instructions to perform specific operations and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors). Examples of processor circuitry include programmable microprocessors, Field Programmable Gate Arrays (FPGAs) that may instantiate instructions, Central Processor Units (CPUs), Graphics Processor Units (GPUs), Digital Signal Processors (DSPs), XPUs, or microcontrollers and integrated circuits such as Application Specific Integrated Circuits (ASICs). For example, an XPU may be implemented by a heterogeneous computing system including multiple types of processor circuitry (e.g., one or more FPGAs, one or more CPUs, one or more GPUs, one or more DSPs, etc., and/or a combination thereof) and application programming interface(s) (API(s)) that may assign computing task(s) to whichever one(s) of the multiple types of processor circuitry is/are best suited to execute the computing task(s).
Artificial intelligence (AI), including machine learning (ML), deep learning (DL), and/or other artificial machine-driven logic, enables machines (e.g., computers, logic circuits, etc.) to use a model to process input data to generate an output based on patterns and/or associations previously learned by the model via a training process. For instance, the model may be trained with data to recognize patterns and/or associations and follow such patterns and/or associations when processing input data such that other input(s) result in output(s) consistent with the recognized patterns and/or associations.
In general, implementing a ML/AI system involves two phases, a learning/training phase and an inference phase. In the learning/training phase, a training algorithm is used to train a model to operate in accordance with patterns and/or associations based on, for example, training data. In general, the model includes internal parameters that guide how input data is transformed into output data, such as through a series of nodes and connections within the model to transform input data into output data. Additionally, hyperparameters are used as part of the training process to control how the learning is performed (e.g., a learning rate, a number of layers to be used in the machine learning model, etc.). Hyperparameters are defined to be training parameters that are determined prior to initiating the training process.
Diverse types of training may be performed based on the type of ML/AI model and/or the expected output. For example, supervised training uses inputs and corresponding expected (e.g., labeled) outputs to select parameters (e.g., by iterating over combinations of select parameters) for the ML/AI model that reduce model error. As used herein, labeling refers to an expected output of the machine learning model (e.g., a classification, an expected output value, etc.). Alternatively, unsupervised training (e.g., used in deep learning, a subset of machine learning, etc.) involves inferring patterns from inputs to select parameters for the ML/AI model (e.g., without the benefit of expected (e.g., labeled) outputs).
Deep learning is a ML method that is based on learning data representations, as opposed to task-specific procedures. Deep learning models attempt to define a relationship between input data and associated output data. Deep learning is computationally intensive, requiring significant processing capabilities and resources. Deep learning models are often trained on data in a static environment that does not replicate the dynamic nature of real-world data. Due to the differences between a live environment and a controlled training environment, accuracy of the machine learning model may degrade.
Distributional shift refers to a shift in data that is provided to a model over time. Also called covariate shift, distributional shift often occurs when the distribution of input data shifts between the training environment and the deployed environment (e.g., real-world, post-training, production, etc.). Catastrophic interference (e.g., catastrophic forgetting) is a tendency of a neural network to forget previously learned information when the network is trained on new information.
Solutions to reduce catastrophic forgetting often use memory-based approaches and/or experience-replay-based approaches. Some memory-based approaches store a subset of source data that is used to constrain optimization of a machine learning model. In this way, loss associated with the subset of source data is reduced. Experience replay describes techniques in which a subset of samples from a source data set is stored and used for retraining the model. Prior solutions based on memory and/or experience replay use a subset of past data during fine-tuning (e.g., retraining, partial retraining, etc.) of the model. However, saving past samples may negatively affect user privacy.
Examples disclosed herein include uncertainty-based approaches to select a subset of samples of a shifted data set which can be used for fine-tuning of a machine learning model. Some examples disclosed herein include neural network models and techniques for training (e.g., adapting, altering, modifying) the models to continually adapt to evolving data distributions while reducing and/or eliminating catastrophic forgetting.
Disclosed examples enable a machine learning model to detect shifts/drift in a data distribution based on uncertainty-aware techniques. Uncertainty-aware techniques identify a subset of data samples that can be used to fine-tune the machine learning model (e.g., adapt to an evolving distributional shift of underlying data). Therefore, examples disclosed herein provide a computer-implemented technical solution to the problem of continually evolving covariate shift in data. Some examples include a source-free active adaptation method to adjust (e.g., fine-tune, modify, adapt, etc.) a neural network to evolving data (e.g., domain drift or distributional shift) while avoiding catastrophic forgetting.
Pseudo-labeling involves training a model on a batch of labeled data. In pseudo-labeling, a trained model is used to obtain labels for unlabeled data based on model predictions. In examples disclosed herein, source-free active adaptation circuitry can select a subset of a shifted data set, and then pseudo-label the selected subset. Some examples disclosed herein provide the subset of data to a human annotator (e.g., a domain expert) to label the samples.
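By way of non-limiting illustration, the following Python sketch shows one way a trained model could assign pseudo-labels to unlabeled samples. The names (e.g., predict_proba, unlabeled_batch) and the confidence cutoff are hypothetical and are not part of the disclosed examples; the sketch merely assumes a classifier that returns per-class probabilities for a NumPy batch of samples.

    def pseudo_label(predict_proba, unlabeled_batch, confidence=0.9):
        # predict_proba: callable returning an (N, K) array of class probabilities
        # for the N samples in unlabeled_batch (a NumPy array).
        probs = predict_proba(unlabeled_batch)
        labels = probs.argmax(axis=1)
        # Retain only the samples for which the model is sufficiently confident.
        keep = probs.max(axis=1) >= confidence
        return unlabeled_batch[keep], labels[keep]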
Examples disclosed herein may improve machine learning efficiency for a variety of fields by making machine learning models robust to shifts in data distribution. With techniques described herein, edge computing platforms may perform, for example, computer vision tasks in adverse weather conditions such as rain, fog, etc.
Disclosed examples are source-free. As described herein, source-free means that some or all of past data (e.g., data of a baseline data set, source data, etc.) is not stored. Source-free machine learning techniques may, for example, enhance data privacy. That is, after a first training with source data (e.g., a baseline data set), a second training (e.g., fine-tuning, adaptation, etc.) may be performed on a second data set (e.g., a shifted data set), without access to the first (e.g., source, baseline, etc.) data set. In some examples, the second training (e.g., adaptation, fine-tuning) may be performed in a way that reduces and/or minimizes catastrophic forgetting of the first (e.g., source, baseline, etc.) data set.
Some examples utilize uncertainty estimation techniques. The uncertainty estimation techniques may identify a subset of informative samples for model adaptation (e.g., further training, fine-tuning). This reduces the compute burden of training the models and improves the training of the models. Thus, examples disclosed herein adapt a model to new data distributions (e.g., representative of data drift/shift) while not forgetting (e.g., maintaining performance on) past distributions using uncertainty estimation techniques.
In some examples, a cloud and/or a remote server performs a first training of a model and a subsequent fine-tuning of the model for shifted data. Communicating only the batch-norm statistics (or, in some examples, the full set of model parameters) to the connected devices keeps the models up to date. Thus, in examples that include edge devices, the edge devices can be updated without having to perform computationally intensive training at the edge device. Furthermore, as such examples are source-free, less storage is used when compared to prior solutions.
In some examples, a model may be trained on a cloud server and the cloud server may transmit the new parameters to the connected devices. Server-side training with transmission of parameters (e.g., batch-norm parameters) to edge devices helps edge devices adapt quickly while pushing computational workloads off the edge devices.
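As a minimal sketch of such server-to-edge parameter transmission, and assuming a PyTorch-style model (an illustrative assumption, not a requirement of the disclosed examples), the batch normalization parameters and running statistics could be collected as follows before being sent to the connected devices. An edge device receiving the collected state could then apply it with model.load_state_dict(state, strict=False).

    import torch.nn as nn

    _BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

    def batch_norm_state(model):
        # Collect only the learnable parameters and running statistics of the
        # batch normalization layers, keyed by their full module path.
        state = {}
        for name, module in model.named_modules():
            if isinstance(module, _BN_TYPES):
                for key, tensor in list(module.named_parameters()) + list(module.named_buffers()):
                    state[name + "." + key] = tensor.detach().cpu()
        return state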
Examples disclosed herein address the technical problems above while enhancing data privacy for model adaptation. Some examples perform a first training of a neural network on a first data set (e.g., a baseline data set, a source data set, etc.) associated with a first data distribution (e.g., a distribution of the baseline data, a distribution of the source data, etc.), and then compare data of a second data set (e.g., a shifted data set) to a threshold uncertainty value. Data that satisfies the threshold uncertainty value is associated with a distributional shift between the first data set and the second (e.g., shifted) data set. By selecting data that satisfies the threshold uncertainty value, a third data set (e.g., a subset of data, a shifted data subset, etc.) can be generated that includes at least one item of the second (e.g., shifted) data set that satisfies the threshold uncertainty value. A second training of the neural network may then be performed on the third data set (e.g., the shifted data subset).
Turning to the figures,
The neural network 104a is trained at the server 108 with the first training data 106a (e.g., baseline data, source data, etc.). An instance of the trained neural network is then transmitted to the example cellular phone 110, the example vehicle 112, and/or the example medical environment 114 for real-world inference and fine-tuning without use of source data.
The model 104a is a model that is trained on an initial dataset (e.g., a baseline data set, a source data set) and then deployed on various devices. After deployment, each device is capable of identifying shifted data using uncertainty estimation techniques, such as with the uncertainty estimation circuitry that will be described in association with
Information including neural network parameter values may be communicated from the environments 110, 112, and 114 to the server 108 via the network 116. At the server, the model 104a can be adapted (e.g., update model parameters) once a new (e.g., shifted) distribution is detected. The updated model parameters can be communicated to the models 104b-104d. Therefore, in some examples, a model may be distributed over a grouping of devices.
In other examples, the machine learning models 104a-d may all be on mobile devices and connected to the server 108. As any of the models 104b-104d may encounter data of a shifted data distribution, the models 104b-104d can be updated to adjust to the new shifted distribution. Such examples do not require storage of past samples, reducing the storage needs of deploying the model 104.
In the example system 100, only the batch normalization parameters of the models 104b-d are adjusted in fine-tuning for the shifted data. By only adjusting the batch normalization parameters, the computational workload is reduced when compared to updating the entirety of the models 104b-d. Communicating only the updated batch normalization parameters to the models 104b-d ensures the fine-tuning process is performant and requires less compute and storage.
In the illustration of
The autonomous vehicle 112 may perform image classification, object detection, object trajectory projection, and/or any other neural network based classification on data obtained at the autonomous vehicle 112. The autonomous vehicle 112 may be updated and fine-tuned with the source-free active adaptation circuitry 102 without access to the source data (e.g., baseline data) and while reducing and/or eliminating catastrophic forgetting.
The source-free active adaptation circuitry 102 is also shown in the medical environment 114. The medical environment 114 includes the source-free active adaptation circuitry 102 to perform fine-tuning on one or more neural networks. Such neural networks may identify diseases, assist in medical imaging, predict patient outcomes, etc.
The examples of the cellular phone 110, the autonomous vehicle 112, and the medical environment 114 are examples of environments in which the source-free active adaptation circuitry 102 may operate. However, source-free active adaptation circuitry 102 may be disposed in any edge device to improve deep learning performance on datasets for which the underlying distribution may shift (e.g., smart homes, internet of things devices, etc.).
The server 108, the cellular phone 110, the autonomous vehicle 112, and/or the medical environment 114 may execute an instance of the source-free active adaptation circuitry 102. The source-free active adaptation circuitry 102 continually adapts to evolving data distributions while reducing and/or eliminating catastrophic forgetting. The structure and function of the source-free active adaptation circuitry 102 will be described in association with
The server 108, the cellular phone 110, the autonomous vehicle 112, and the medical environment 114 are connected by the network 116. In some examples, the server 108 may train the neural network 104a on the first dataset 106a (e.g., the baseline dataset 106a, the source dataset 106a), and then transmit the trained neural network 104a to each of the cellular phone 110, the autonomous vehicle 112, and the medical environment 114. Then, at each of the cellular phone 110, the autonomous vehicle 112, and the medical environment 114, a respective instance of the trained neural network may be fine-tuned with new datasets (e.g., the shifted dataset 106b, the shifted dataset 106c, etc.).
In the example of
The source-free active adaptation circuitry 102 includes the uncertainty estimation circuitry 202. The uncertainty estimation circuitry 202 performs one or more uncertainty estimation techniques that identify informative samples of data obtained for training of a machine learning model. In general, there are two types of uncertainty: 1) aleatoric uncertainty (e.g., input uncertainty) that is inherent within the input data samples (e.g., sensor noise or occlusion) and 2) epistemic uncertainty (e.g., model uncertainty). Aleatoric uncertainty cannot be reduced even if more data is provided. Epistemic uncertainty corresponds to inadequate knowledge of a model to explain certain data. The uncertainty estimation circuitry 202 may estimate one or both of aleatoric uncertainty and/or epistemic uncertainty.
The uncertainty estimation circuitry 202 determines an uncertainty estimation based on uncertainty values (e.g., an uncertainty estimate) for elements of a data set. The uncertainty estimation circuitry 202 may provide results of the uncertainty estimation to the data drift detection circuitry 204.
The uncertainty estimation circuitry 202 may calculate uncertainty estimates based on, for example, predictive entropy and/or distance-based uncertainty score in a feature space. For example, the uncertainty estimation circuitry 202 may perform a predictive entropy analysis to quantify the uncertainty in the prediction of the model output according to Equation 1 below:
In Equation 1 above, D corresponds to data the model has been trained on, K corresponds to the total number of classes, and p(y = ck|x,w) corresponds to output from the neural network (e.g., the classifier) with weights w. The predictive entropy captures a combination of aleatoric and epistemic uncertainty. In general, the greater the entropy value, the more uncertain the model is about which class the data belongs to. In some examples, a subset of data may be ranked based on an uncertainty value, with elements of the subset that have the greatest uncertainty value ranked highest. Then, elements may be selected in descending order to generate a shifted data subset for subsequent fine-tuning of a machine learning model.
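Because Equation 1 is not reproduced in the text above, the following Python sketch illustrates a standard predictive entropy computation consistent with the surrounding description (the negated sum over the K classes of probability times log-probability), together with the entropy-based ranking described above. The function names are illustrative only.

    import numpy as np

    def predictive_entropy(probs):
        # probs: (N, K) array of class probabilities p(y = c_k | x, w).
        eps = 1e-12  # numerical guard against log(0)
        return -np.sum(probs * np.log(probs + eps), axis=1)

    def rank_by_entropy(probs):
        # Indices of samples ordered from most uncertain to least uncertain.
        return np.argsort(-predictive_entropy(probs))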
A second method that may be used alternatively and/or in addition to predictive entropy is a distance-based uncertainty score in the feature space, which explicitly captures the epistemic uncertainty (e.g., model uncertainty). An uncertainty score may be determined according to Equation 2 below:
The uncertainty estimation circuitry 202 may perform the operations of Equation 2 to determine an uncertainty score (e.g., Udissim) that measures a dissimilarity of an observed feature vector from the neural network (e.g., the neural network 104b) with respect to a training feature embedding. That is, the uncertainty score measures a distance between features of samples observed after the model has been trained (z(x)) and a corresponding training feature embedding (z(xt)). In some examples, features may be extracted from a penultimate layer of the machine learning model.
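Equation 2 likewise is not reproduced above. The following sketch shows one common distance-based formulation, in which the dissimilarity score is the distance from an observed feature vector z(x) to the nearest stored training feature representation, here taken to be per-class feature centroids. The centroid representation is an assumption made for illustration and may differ from the embedding used in Equation 2.

    import numpy as np

    def dissimilarity_score(z_x, train_centroids):
        # z_x: (N, D) features of samples observed after training.
        # train_centroids: (K, D) mean training features per class, standing in
        # for the training feature embedding z(x_t). Larger scores indicate
        # higher epistemic uncertainty.
        dists = np.linalg.norm(z_x[:, None, :] - train_centroids[None, :, :], axis=-1)
        return dists.min(axis=1)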
Predictive entropy captures both aleatoric and epistemic uncertainty. The uncertainty estimation circuitry 202 captures the aleatoric and epistemic uncertainty for each input, which helps to detect shifts in the data distribution. Samples far from the learned distribution in feature space are associated with high epistemic uncertainty. Therefore, choosing a subset of samples that have high epistemic uncertainty and low aleatoric uncertainty helps to improve the model's knowledge. Furthermore, enforcing bi-Lipschitz constraints on the model may improve model sensitivity to changes in input. In some examples, the uncertainty estimation circuitry 202 is instantiated by processor circuitry executing uncertainty estimation instructions and/or configured to perform operations such as those represented by the flowcharts of
In some examples, the source-free active adaptation circuitry 102 includes means for performing uncertainty estimation. For example, the means for performing uncertainty estimation may be implemented by the uncertainty estimation circuitry 202. In some examples, the uncertainty estimation circuitry 202 may be instantiated by processor circuitry such as the example processor circuitry 1112 of
The data drift detection circuitry 204 receives data elements of a production data set (e.g., a shifted data set) and associated uncertainty estimates. The data drift detection circuitry 204 also determines an uncertainty threshold. The data drift detection circuitry 204 may determine the uncertainty threshold based on a baseline data set (e.g., source data, in-distribution data, etc.) by calculating uncertainty (e.g., calculating predictive entropy) for elements of the baseline data set (e.g., the training data set). The uncertainty threshold may then be set to a value at a tail end of the distribution, such that a significant portion of the baseline data (e.g., 95% of the data) has an uncertainty value less than the threshold. A tail end of the distribution refers to an area of a distribution (e.g., a normal distribution) that deviates significantly from a mean value of the distribution. For example, the tail end of the distribution may include values that do not lie within three standard deviations of the mean of the distribution.
In some examples, to determine the threshold uncertainty value, the data drift detection circuitry 204 may assign uncertainty values to items of the baseline data set and/or obtain uncertainty values of items of the baseline data set from the uncertainty estimation circuitry 202. The data drift detection circuitry 204 may then set the threshold uncertainty value to be greater than a majority (e.g., approximately 75%, approximately 95%, etc.) of the assigned uncertainty values of the items of the baseline data set.
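A minimal sketch of such a threshold selection, assuming uncertainty values have already been computed for the items of the baseline data set, is shown below. The 95th percentile is one example cutoff and is not limiting.

    import numpy as np

    def uncertainty_threshold(baseline_uncertainties, quantile=0.95):
        # Return a threshold exceeded by only the upper tail of the baseline
        # (in-distribution) uncertainty values, e.g., the 95th percentile.
        return float(np.quantile(np.asarray(baseline_uncertainties), quantile))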
In some examples, the uncertainty threshold corresponds to a predictive entropy value (e.g., a predictive entropy on the histogram of
In some examples, the data drift detection circuitry 204 performs an ordering of samples based on decreasing order of entropy to identify informative samples (e.g., samples with higher uncertainty values) that can be chosen for active labeling. Such a ranking may assist in detecting distributional shift. An example of determination of a distributional shift is shown in the density histogram of
In some examples, the data drift detection circuitry 204 compares data of a shifted data set to a threshold uncertainty value. The data drift detection circuitry 204 may then generate a shifted data subset including items of the shifted data set that satisfy the threshold uncertainty value. That is, the data drift detection circuitry 204 may obtain items of a data set (e.g., items of a shifted data set) wherein each item has been assigned an uncertainty value (e.g., by the uncertainty estimation circuitry 202). The data drift detection circuitry 204 may then determine if one or more values of the shifted data set satisfy the threshold uncertainty value. In some examples described herein, an item (e.g., an element of the data set, data of the data set, etc.) of a data set (e.g., the shifted data set) satisfies a threshold uncertainty value when an uncertainty value associated with the item is greater than or equal to the threshold uncertainty value. In some examples, the threshold uncertainty value may be satisfied when a difference between an uncertainty value associated with the item and the threshold uncertainty value is less than a threshold difference. In other examples, the threshold uncertainty value may be satisfied when the uncertainty value assigned to the item is within a specified range of the threshold uncertainty value (e.g., within 5% of the threshold uncertainty value).
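The subset generation described above might be sketched as follows, using the first-described criterion (an item satisfies the threshold when its assigned uncertainty value is greater than or equal to the threshold) and assuming uncertainty values have already been assigned to the items of the shifted data set. The optional labeling budget is a hypothetical parameter added for illustration.

    import numpy as np

    def select_shifted_subset(samples, uncertainties, threshold, budget=None):
        # Keep items whose uncertainty meets or exceeds the threshold,
        # ordered most uncertain first, optionally capped at a labeling budget.
        idx = np.where(uncertainties >= threshold)[0]
        idx = idx[np.argsort(-uncertainties[idx])]
        if budget is not None:
            idx = idx[:budget]
        return samples[idx], idx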
In some examples, the data drift detection circuitry 204 is instantiated by processor circuitry executing data drift detection instructions and/or configured to perform data drift detection operations such as those represented by the flowcharts of
In some examples, the source-free active adaptation circuitry 102 includes means for detecting data drift. For example, the means for detecting data drift may be implemented by data drift detection circuitry 204. In some examples, the data drift detection circuitry 204 may be instantiated by processor circuitry such as the example processor circuitry 1112 of
The batch normalization circuitry 206 may take information from the uncertainty estimation circuitry 202 and/or the data drift detection circuitry 204 and update a model (e.g., adapt a model) to data of a distribution that is different from that of an original data set, without catastrophic forgetting.
In some examples, weights of the layers of a neural network are frozen after the initial training (e.g., frozen by the server 108 of
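As a non-limiting sketch of freezing the network weights while leaving only the batch normalization parameters trainable, and assuming a PyTorch-style model (an illustrative assumption, not a requirement of the disclosed examples), the adaptation could proceed as follows.

    import torch.nn as nn

    def freeze_all_but_batch_norm(model):
        # Freeze every parameter, then re-enable gradients only for the
        # learnable scale (gamma) and shift (beta) of batch norm layers.
        for param in model.parameters():
            param.requires_grad = False
        for module in model.modules():
            if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
                for param in module.parameters():
                    param.requires_grad = True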
In some examples, the batch normalization circuitry 206 is instantiated by processor circuitry executing batch normalization instructions and/or configured to perform batch normalization operations such as those represented by the flowcharts of
In some examples, additionally or alternatively to updating batch normalization parameters, the loss calculation circuitry 208 may ensure that the model does not forget (e.g., catastrophically forget) the past distribution information. The loss calculation circuitry 208 may calculate a first loss term and a second loss term to facilitate fine-tuning the model. In some examples, neural network circuitry 210 and/or the neural network training circuitry 212 are updated by the loss calculation circuitry 208 using the following loss functions of Equations 3-5 below:
In Equation 3, z corresponds to features from a penultimate layer of the model. The L1 loss improves learning of the model for samples from a new distribution. The L2 loss helps prevent catastrophic forgetting by ensuring that the feature embeddings of the model do not deviate too greatly during fine-tuning. In Equation 4, zb(x) corresponds to features obtained from the model before adaptation and za(x) corresponds to features obtained from the model after adaptation.
Equation 5 is another method for computing the L2 loss, based on Kullback-Leibler divergence (e.g., KL divergence) instead of (or in addition to) cosine similarity. In Equation 5, DKL is the KL divergence, p(x) represents features after adaptation, and q(x) represents features before adaptation.
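Because Equations 3-5 are not reproduced in the text above, the following sketch shows one plausible combination of the two loss terms as described: a cross entropy term on the selected subset plus either a cosine-similarity term or a KL-divergence term between feature embeddings obtained before and after adaptation. The PyTorch usage, the softmax over feature dimensions, and the weighting factor lam are assumptions made for illustration and are not asserted to match the exact form of Equations 3-5.

    import torch.nn.functional as F

    def adaptation_loss(logits, labels, z_before, z_after, lam=1.0, use_kl=False):
        # First term (the L1 loss): cross entropy on the selected subset.
        l1 = F.cross_entropy(logits, labels)
        if use_kl:
            # Alternative second term: KL divergence between softened feature
            # distributions before (q) and after (p) adaptation (cf. Equation 5).
            p = F.log_softmax(z_after, dim=-1)
            q = F.softmax(z_before, dim=-1)
            l2 = F.kl_div(p, q, reduction="batchmean")
        else:
            # Second term (the L2 loss): penalize loss of cosine similarity between
            # feature embeddings before and after adaptation (cf. Equation 4).
            l2 = 1.0 - F.cosine_similarity(z_before, z_after, dim=-1).mean()
        return l1 + lam * l2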
In some examples, the loss calculation circuitry 208 is instantiated by processor circuitry executing loss calculation instructions and/or configured to perform loss calculation operations such as those represented by the flowcharts of
The example source-free active adaptation circuitry 102 includes the example neural network circuitry 210. The neural network circuitry 210 implements a convolutional neural network (e.g., a deep neural network) that may include various convolutional layers, max pooling layers, fixed embedding layers, global averaging layers, etc. In some examples, the example neural network circuitry 210 may include additional and/or alternative machine learning models to predict a class label for a given example input data. For example, the neural network circuitry 210 may interoperate with any other classification algorithm (e.g., logistic regression, naive Bayes, k-nearest neighbors, decision tree, support vector machine) to provide improved classification results.
In some examples, neural network circuitry 210 is instantiated by processor circuitry executing neural network instructions and/or configured to perform neural network operations such as those represented by the flowcharts of
The example source-free active adaptation circuitry 102 includes neural network training circuitry 212. In some examples, the neural network training circuitry 212 may initialize the neural network circuitry 210 with random weights. The neural network training circuitry 212 may then retrieve training data (e.g., labeled test data) and adjust the weights to produce results consistent with the labeled test data (e.g., minimizing a loss function determined by the loss calculation circuitry 208). The weights of the neural network circuitry 210 are adjusted by the neural network training circuitry 212 based on gradient descent. However, the neural network circuitry 210 may be adjusted based on any other suitable optimization algorithm.
The example neural network training circuitry 212 may retrieve training data from the example data storage 216 and use the retrieved data to train the example neural network circuitry 210. In some examples, the neural network circuitry 210 may perform pre-processing on the training data. In some examples, the neural network circuitry 210 may deduplicate elements of the training set before training.
In some examples, neural network training circuitry 212 is instantiated by processor circuitry executing neural network training instructions and/or configured to perform neural network training operations such as those represented by the flowcharts of
The example source-free active adaptation circuitry 102 includes example communication circuitry 214. The example communication circuitry 214 transmits and/or receives information associated with the example source-free active adaptation circuitry 102. For example, a plurality of devices (e.g., the server 108, the cellular phone 110, the vehicle 112, and the medical environment 114 of
The example communication circuitry 214 additionally may coordinate communication between the uncertainty estimation circuitry 202, the data drift detection circuitry 204, the batch normalization circuitry 206, the loss calculation circuitry 208, the neural network circuitry 210, the neural network training circuitry 212, and/or a cloud server. Such communication may occur via the bus 218, for example. The source-free active adaptation circuitry 102 further includes a data storage 216 to store any data to facilitate the operations of the source-free active adaptation circuitry 102.
In some examples, communication circuitry 214 is instantiated by processor circuitry executing communication instructions and/or configured to perform communication operations such as those represented by the flowcharts of
At arrow 312, the neural network circuitry 210 is deployed as a deployed model 304. Then, at arrow 314, the neural network circuitry 210 obtains samples from an evolving data distribution 302 (e.g., a shifted data distribution). At arrow 316, the uncertainty estimation circuitry 202 of
The batch normalization circuitry 206 may obtain an activation 414 (e.g., an input), and then, based on the first learnable parameters 404 and/or the second learnable parameter 406, generate an output 416. For example, the batch normalization circuitry 206 may be a batch normalization layer in a neural network that receives input from a hidden layer of the neural network and generates a normalized output based on the input. In some examples, the output 416 (e.g., a normalized output) may be generated based on Equation 6 below:
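Equation 6 is not reproduced in the text above; however, the output of such a layer is commonly computed with the standard batch-normalization transform sketched below, in which gamma and beta correspond to the learnable scale and shift parameters (e.g., the learnable parameters 404, 406). The NumPy form is illustrative only and is not asserted to be the exact form of Equation 6.

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        # Normalize the activation x over the batch dimension, then apply the
        # learnable scale (gamma) and shift (beta).
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta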
For example, the uncertainty estimation circuitry 202 may determine predictive entropy values for data in a dataset. Then, the data drift detection circuitry 204 may identify in-distribution and out-of-distribution data of the dataset. In some examples, the data drift detection circuitry 204 determines the uncertainty threshold 502 by identifying a threshold value at which an accuracy versus uncertainty metric is greatest on in-distribution data. In some examples, the threshold 502 can be determined based on validation data drawn from the same distribution as the data on which the model was trained, for improved generalization.
Therefore, the illustration 600 is an example of improvements associated with the source-free active adaptation circuitry 102. The source-free active adaptation circuitry 102 allows a machine learning model to improve performance on a newly learned distribution as well as maintain performance on previously learned data distributions. In the illustration 600, initial training data (e.g., sunny day) is used for training. Then, as time passes and weather conditions change, the model adapts to the changing conditions using the techniques described herein. With the source-free active adaptation circuitry 102 of
Although the example illustration 600 of
The second illustration 704 illustrates accuracy of various methods on a clean CIFAR test data set, after the source-free active adaptation circuitry 102 of
The fourth graph 804 presents a comparison of accuracy of the methods after adapting to each corruption in the continually evolving setup for corrupted data. The fourth graph 804 also indicates the number of samples corresponding to the shifted data (e.g., out of the 1000 samples chosen for updating the model). In the fourth graph 804, the source-free active adaptation circuitry 102 improved accuracy by 24.9% on average from the baseline.
While an example manner of implementing the source-free active adaptation circuitry 102 of
Flowcharts representative of example machine readable instructions, which may be executed to configure processor circuitry to implement the source-free active adaptation circuitry 102 of
The machine readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data or a data structure (e.g., as portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc., in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and/or stored on separate computing devices, wherein the parts when decrypted, decompressed, and/or combined form a set of machine executable instructions that implement one or more operations that may together form a program such as that described herein.
In another example, the machine readable instructions may be stored in a state in which they may be read by processor circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc., in order to execute the machine readable instructions on a particular computing device or other device. In another example, the machine readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine readable media, as used herein, may include machine readable instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s) when stored or otherwise at rest or in transit.
The machine readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
As mentioned above, the example operations of
“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc., may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the term “comprising” and “including” are open ended. The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, or (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B.
As used herein, singular references (e.g., “a”, “an”, “first”, “second”, etc.) do not exclude a plurality. The term “a” or “an” object, as used herein, refers to one or more of that object. The terms “a” (or “an”), “one or more”, and “at least one” are used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements or method actions may be implemented by, e.g., the same entity or object. Additionally, although individual features may be included in different examples or claims, these may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.
If so (Block 906: YES), then at block 908 the uncertainty estimation circuitry 202 of
At block 912, the data drift detection circuitry 204 of
If so (Block 914: YES), then at block 916 the batch normalization circuitry 206 of
At block 1002, the neural network training circuitry 212 of
At block 1010, the uncertainty estimation circuitry 202 of
At block 1011, the example uncertainty estimation circuitry 202 of
The processor platform 1100 of the illustrated example includes processor circuitry 1112. The processor circuitry 1112 of the illustrated example is hardware. For example, the processor circuitry 1112 can be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The processor circuitry 1112 may be implemented by one or more semiconductor based (e.g., silicon based) devices. In this example, the processor circuitry 1112 implements the uncertainty estimation circuitry 202, the data drift detection circuitry 204, the batch normalization circuitry 206, the loss calculation circuitry 208, the neural network circuitry 210, the neural network training circuitry 212, the communication circuitry 214, and the data storage circuitry 216.
The processor circuitry 1112 of the illustrated example includes a local memory 1113 (e.g., a cache, registers, etc.). The processor circuitry 1112 of the illustrated example is in communication with a main memory including a volatile memory 1114 and a non-volatile memory 1116 by a bus 1118. The volatile memory 1114 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®), and/or any other type of RAM device. The non-volatile memory 1116 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 1114, 1116 of the illustrated example is controlled by a memory controller 1117.
The processor platform 1100 of the illustrated example also includes interface circuitry 1120. The interface circuitry 1120 may be implemented by hardware in accordance with any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) interface, a Bluetooth® interface, a near field communication (NFC) interface, a Peripheral Component Interconnect (PCI) interface, and/or a Peripheral Component Interconnect Express (PCIe) interface.
In the illustrated example, one or more input devices 1122 are connected to the interface circuitry 1120. The input device(s) 1122 permit(s) a user to enter data and/or commands into the processor circuitry 1112. The input device(s) 1122 can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, an isopoint device, and/or a voice recognition system.
One or more output devices 1124 are also connected to the interface circuitry 1120 of the illustrated example. The output device(s) 1124 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer, and/or speaker. The interface circuitry 1120 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip, and/or graphics processor circuitry such as a GPU.
The interface circuitry 1120 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) by a network 1126. The communication can be by, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-site wireless system, a cellular telephone system, an optical connection, etc.
The processor platform 1100 of the illustrated example also includes one or more mass storage devices 1128 to store software and/or data. Examples of such mass storage devices 1128 include magnetic storage devices, optical storage devices, floppy disk drives, HDDs, CDs, Blu-ray disk drives, redundant array of independent disks (RAID) systems, solid state storage devices such as flash memory devices and/or SSDs, and DVD drives.
The machine readable instructions 1132, which may be implemented by the machine readable instructions of
The cores 1202 may communicate by a first example bus 1204. In some examples, the first bus 1204 may be implemented by a communication bus to effectuate communication associated with one(s) of the cores 1202. For example, the first bus 1204 may be implemented by at least one of an Inter-Integrated Circuit (I2C) bus, a Serial Peripheral Interface (SPI) bus, a PCI bus, or a PCIe bus. Additionally or alternatively, the first bus 1204 may be implemented by any other type of computing or electrical bus. The cores 1202 may obtain data, instructions, and/or signals from one or more external devices by example interface circuitry 1206. The cores 1202 may output data, instructions, and/or signals to the one or more external devices by the interface circuitry 1206. Although the cores 1202 of this example include example local memory 1220 (e.g., Level 1 (L1) cache that may be split into an L1 data cache and an L1 instruction cache), the microprocessor 1200 also includes example shared memory 1210 that may be shared by the cores (e.g., Level 2 (L2 cache)) for high-speed access to data and/or instructions. Data and/or instructions may be transferred (e.g., shared) by writing to and/or reading from the shared memory 1210. The local memory 1220 of each of the cores 1202 and the shared memory 1210 may be part of a hierarchy of storage devices including multiple levels of cache memory and the main memory (e.g., the main memory 1114, 1116 of
Each core 1202 may be referred to as a CPU, DSP, GPU, etc., or any other type of hardware circuitry. Each core 1202 includes control unit circuitry 1214, arithmetic and logic (AL) circuitry (sometimes referred to as an ALU) 1216, a plurality of registers 1218, the local memory 1220, and a second example bus 1222. Other structures may be present. For example, each core 1202 may include vector unit circuitry, single instruction multiple data (SIMD) unit circuitry, load/store unit (LSU) circuitry, branch/jump unit circuitry, floating-point unit (FPU) circuitry, etc. The control unit circuitry 1214 includes semiconductor-based circuits structured to control (e.g., coordinate) data movement within the corresponding core 1202. The AL circuitry 1216 includes semiconductor-based circuits structured to perform one or more mathematic and/or logic operations on the data within the corresponding core 1202. The AL circuitry 1216 of some examples performs integer based operations. In other examples, the AL circuitry 1216 also performs floating point operations. In yet other examples, the AL circuitry 1216 may include first AL circuitry that performs integer based operations and second AL circuitry that performs floating point operations. In some examples, the AL circuitry 1216 may be referred to as an Arithmetic Logic Unit (ALU). The registers 1218 are semiconductor-based structures to store data and/or instructions such as results of one or more of the operations performed by the AL circuitry 1216 of the corresponding core 1202. For example, the registers 1218 may include vector register(s), SIMD register(s), general purpose register(s), flag register(s), segment register(s), machine specific register(s), instruction pointer register(s), control register(s), debug register(s), memory management register(s), machine check register(s), etc. The registers 1218 may be arranged in a bank as shown in
Each core 1202 and/or, more generally, the microprocessor 1200 may include additional and/or alternate structures to those shown and described above. For example, one or more clock circuits, one or more power supplies, one or more power gates, one or more cache home agents (CHAs), one or more converged/common mesh stops (CMSs), one or more shifters (e.g., barrel shifter(s)) and/or other circuitry may be present. The microprocessor 1200 is a semiconductor device fabricated to include many transistors interconnected to implement the structures described above in one or more integrated circuits (ICs) contained in one or more packages. The processor circuitry may include and/or cooperate with one or more accelerators. In some examples, accelerators are implemented by logic circuitry to perform certain tasks more quickly and/or efficiently than can be done by a general purpose processor. Examples of accelerators include ASICs and FPGAs such as those discussed herein. A GPU or other programmable device can also be an accelerator. Accelerators may be on-board the processor circuitry, in the same chip package as the processor circuitry and/or in one or more separate packages from the processor circuitry.
More specifically, in contrast to the microprocessor 1200 of
In the example of
The configurable interconnections 1310 of the illustrated example are conductive pathways, traces, vias, or the like that may include electrically controllable switches (e.g., transistors) whose state can be changed by programming (e.g., using an HDL instruction language) to activate or deactivate one or more connections between one or more of the logic gate circuitry 1308 to program desired logic circuits.
The storage circuitry 1312 of the illustrated example is structured to store result(s) of the one or more of the operations performed by corresponding logic gates. The storage circuitry 1312 may be implemented by registers or the like. In the illustrated example, the storage circuitry 1312 is distributed amongst the logic gate circuitry 1308 to facilitate access and increase execution speed.
The example FPGA circuitry 1300 of
Although
In some examples, the processor circuitry 1112 of
A block diagram illustrating an example software distribution platform 1405 to distribute software such as the example machine readable instructions 1132 of
From the foregoing, it will be appreciated that example systems, methods, apparatus, and articles of manufacture have been disclosed that perform source-free active adaptation to distributional shifts. Disclosed systems, methods, apparatus, and articles of manufacture improve the efficiency of using a computing device by reducing the processing required for a given training workload (e.g., achieving improved training results with fewer data samples). Disclosed systems, methods, apparatus, and articles of manufacture are accordingly directed to one or more improvement(s) in the operation of a machine such as a computer or other electronic and/or mechanical device.
Example methods, apparatus, systems, and articles of manufacture to perform source-free active adaptation to distributional shifts for machine learning are disclosed herein. Further examples and combinations thereof include the following:
Example 1 includes a system comprising interface circuitry, programmable circuitry, and instructions to cause the programmable circuitry to perform a first training of a neural network based on a baseline data set associated with a first data distribution, compare data of a shifted data set to a threshold uncertainty value, wherein the threshold uncertainty value is associated with a distributional shift between the baseline data set and the shifted data set, generate a shifted data subset including items of the shifted data set that satisfy the threshold uncertainty value, and perform a second training of the neural network based on the shifted data subset.
Example 2 includes the system of example 1, wherein the programmable circuitry is to perform the second training based on a first loss term and based on a second loss term.
Example 3 includes the system of example 1, wherein the programmable circuitry is to update a batch normalization layer of the neural network based on the shifted data subset.
Example 4 includes the system of example 1, wherein to determine the threshold uncertainty value, the programmable circuitry is to assign uncertainty values to items of the baseline data set, and set the threshold uncertainty value to be greater than a majority of the assigned uncertainty values of the items of the baseline data set.
Example 5 includes the system of example 4, wherein the uncertainty values assigned to the items of the baseline data set are predictive entropy values.
Example 6 includes the system of example 2, wherein the first loss term is a cross entropy loss and the second loss term is a cosine similarity of feature embeddings of the neural network before and after model adaptation.
Example 7 includes the system of example 2, wherein the first loss term is a cross entropy loss and the second loss term is a Kullback-Leibler divergence of feature embeddings of the neural network before and after model adaptation.
Example 8 includes the system of example 1, wherein the programmable circuitry is to update at least one of a scale parameter or a shift parameter of a batch normalization layer of the neural network.
Example 9 includes the system of example 1, wherein the threshold uncertainty value is determined based on predictive entropy.
Example 10 includes the system of example 1, wherein samples of the shifted data subset are ranked based on entropy to identify samples for active labeling.
Example 11 includes the system of example 1, wherein the threshold uncertainty value is an epistemic uncertainty value based on feature dissimilarity.
Example 12 includes a non-transitory computer readable medium comprising instructions which, when executed by programmable circuitry, cause the programmable circuitry to perform a first training of a neural network on a baseline data set associated with a first data distribution, compare data of a shifted data set to a threshold uncertainty value, wherein the threshold uncertainty value is associated with a distributional shift between the baseline data set and the shifted data set, generate a shifted data subset including items of the shifted data set that satisfy the threshold uncertainty value, and perform a second training of the neural network based on the shifted data subset.
Example 13 includes the non-transitory computer readable medium of example 12, wherein the instructions, when executed, cause the programmable circuitry to perform the second training based on a first loss term and based on a second loss term.
Example 14 includes the non-transitory computer readable medium of example 13, wherein the instructions, when executed, cause the programmable circuitry to update a batch normalization layer of the neural network based on the shifted data subset.
Example 15 includes the non-transitory computer readable medium of example 13, wherein to determine the threshold uncertainty value, the programmable circuitry is to assign uncertainty values to items of the baseline data set, and set the threshold uncertainty value to be greater than a majority of the assigned uncertainty values of the items of the baseline data set.
Example 16 includes the non-transitory computer readable medium of example 15, wherein the uncertainty values assigned to the items of the baseline data set are predictive entropy values.
Example 17 includes the non-transitory computer readable medium of example 13, wherein the first loss term is a cross entropy loss and the second loss term is a cosine similarity of feature embeddings of the neural network before and after model adaptation.
Example 18 includes the non-transitory computer readable medium of example 13, wherein the first loss term is a cross entropy loss and the second loss term is a Kullback-Leibler divergence of feature embeddings of the neural network before and after model adaptation.
Example 19 includes the non-transitory computer readable medium of example 12, wherein the instructions, when executed, cause the programmable circuitry to update at least one of a scale parameter or a shift parameter of a batch normalization layer of the neural network.
Example 20 includes the non-transitory computer readable medium of example 12, wherein the threshold uncertainty value is determined based on predictive entropy.
Example 21 includes the non-transitory computer readable medium of example 12, wherein samples of the shifted data subset are ranked based on entropy to identify samples for active labeling.
Example 22 includes the non-transitory computer readable medium of example 12, wherein the threshold uncertainty value is an epistemic uncertainty value determined based on feature dissimilarity.
Example 23 includes a method comprising performing, by executing an instruction with processor circuitry, a first training of a neural network on a baseline data set associated with a first data distribution, comparing, by executing an instruction with the processor circuitry, data of a shifted data set to a threshold uncertainty value, wherein the threshold uncertainty value is associated with a distributional shift between the baseline data set and the shifted data set, generating, by executing an instruction with the processor circuitry, a shifted data subset including at least one item of the shifted data set that satisfies the threshold uncertainty value, and performing, by executing an instruction with the processor circuitry, a second training of the neural network based on the shifted data subset.
Example 24 includes the method of example 23, further including performing the second training based on a first loss term and based on a second loss term.
Example 25 includes the method of example 23, further including updating a batch normalization layer of the neural network based on the shifted data subset.
Example 26 includes the method of example 23, further including assigning uncertainty values to items of the baseline data set, and setting the threshold uncertainty value to be greater than a majority of the assigned uncertainty values of the items of the baseline data set.
Example 27 includes the method of example 26, wherein the uncertainty values assigned to the items of the baseline data set are predictive entropy values.
Example 28 includes the method of example 24, wherein the first loss term is a cross entropy loss and the second loss term is a cosine similarity of feature embeddings of the neural network before and after model adaptation.
Example 29 includes the method of example 24, wherein the first loss term is a cross entropy loss and the second loss term is a Kullback-Leibler divergence of feature embeddings of the neural network before and after model adaptation.
Example 30 includes the method of example 23, further including updating at least one of a scale parameter or a shift parameter of a batch normalization layer of the neural network.
Example 31 includes the method of example 23, wherein the threshold uncertainty value is determined based on predictive entropy.
Example 32 includes the method of example 23, wherein samples of the shifted data subset are ranked based on entropy to identify samples for active labeling.
Example 33 includes the method of example 23, wherein the threshold uncertainty value is an epistemic uncertainty value determined based on feature dissimilarity.
The following claims are hereby incorporated into this Detailed Description by this reference. Although certain example systems, methods, apparatus, and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all systems, methods, apparatus, and articles of manufacture fairly falling within the scope of the claims of this patent.