Embodiments of the present disclosure relate to systems, methods, and computer-readable media for analyzing underlying relationships in data to efficiently train a machine learning model.
Despite the surge in self-supervised methods for learning representations that are then used to solve traditionally supervised tasks (e.g., classifying a picture as a cat or a dog, or segmenting objects in an image) without the need for labeled data, a large number of tasks remain for which supervised learning is essential. These are cases in which the task involves labels that cannot be inferred from the available input data, notwithstanding that joint learning across media types (e.g., image and text), where such data is available, can sometimes be used to avoid explicitly labeling data. One task where supervised learning cannot be entirely avoided is detecting the relation between two phrases in a sentence: many cases can be handled in a self-supervised manner, but a supervised model is still required to learn complex relationships between the phrases.
In one aspect, a method includes selecting a first plurality of input candidates from a corpus of data; mapping the first plurality of input candidates onto a pretrained vector space of a pretrained model; clustering the first plurality of input candidates in the pretrained vector space; adding the first plurality of input candidates to a plurality of queues for labelling; and labelling the first plurality of input candidates.
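For illustration purposes only, the candidate-selection flow recited above can be sketched as follows. The `embed` function here is a toy bag-of-characters stand-in for any pretrained space mapper, and the greedy cosine clustering is one non-limiting choice of clustering algorithm; neither is part of the claimed method.

```python
import math
from collections import defaultdict

def embed(text):
    # Stand-in for a pretrained mapper: a normalized bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))

def select_and_queue(corpus, n_candidates=6, threshold=0.8):
    """Select candidates, map them onto the (stubbed) pretrained space,
    greedily cluster them, and place them on per-cluster labelling queues."""
    candidates = corpus[:n_candidates]          # selection policy is a stub
    centroids, queues = [], defaultdict(list)
    for cand in candidates:
        v = embed(cand)
        best, best_sim = None, threshold
        for idx, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim >= best_sim:                 # join the most similar cluster
                best, best_sim = idx, sim
        if best is None:                        # start a new cluster (singleton)
            best = len(centroids)
            centroids.append(v)
        queues[best].append(cand)
    return queues
```

Each queue can then be handed to a human or algorithmic labeller, one queue per cluster, so that similar inputs are labelled together.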
In some embodiments, labelling the first plurality of input candidates is performed by humans.
In some embodiments, labelling the first plurality of input candidates is performed algorithmically.
In some embodiments, labelling includes identifying cluster centroids in the pretrained vector space.
In some embodiments, the pretrained vector space is created by mapping input to sparse/dense distributed representations.
In some embodiments, the pretrained vector space includes learned parameters of a probability distribution.
In some embodiments, the pretrained vector space is learned by performing density estimation.
In some embodiments, the pretrained model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the method further includes partitioning the labeled first plurality of input candidates into a train set, a development set, a test set, and an out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids in the pretrained vector space from the first plurality of input candidates to the train set; adding labeled cluster children in the pretrained vector space from the first plurality of input candidates to one of the development set and the test set; and adding labeled singletons in the pretrained vector space from the first plurality of input candidates to one of the train set and the out-of-distribution set.
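A minimal, non-limiting sketch of the partitioning step described above follows. It treats the first member of each cluster as the centroid and, of the two options the text allows for singletons (train set or out-of-distribution set), routes singletons to the out-of-distribution set.

```python
def partition(clusters):
    """Partition labelled clusters into train/dev/test/OOD sets:
    centroids go to train, cluster children alternate between dev and
    test, and singletons go to the out-of-distribution set."""
    splits = {"train": [], "dev": [], "test": [], "ood": []}
    for members in clusters:
        if len(members) == 1:
            splits["ood"].append(members[0])    # singleton
            continue
        splits["train"].append(members[0])      # first member stands in for centroid
        for i, child in enumerate(members[1:]):
            splits["dev" if i % 2 == 0 else "test"].append(child)
    return splits
```

Routing centroids to the train set is what makes the train set maximally span the labeled space, since every region of the clustered space contributes at least one training example.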
In some embodiments, the method further includes creating a fine tuned model.
In some embodiments, creating the fine tuned model includes using the pretrained model to create the fine tuned model.
In some embodiments, the method further includes assigning a first plurality of outputs using the fine tuned model.
In some embodiments, the fine tuned model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the method further includes evaluating performance of the fine tuned model on the test set, wherein evaluating the performance of the fine tuned model includes: mapping the test set onto a fine tuned vector space; clustering the test set in the fine tuned vector space; quantifying heterogeneity of test set clusters in the fine tuned vector space; and providing a confidence score for the fine tuned model.
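One illustrative way to quantify cluster heterogeneity and turn it into a confidence score, as recited above, is mean label purity over the test set clusters; this particular metric is an assumption of the sketch, not the only embodiment.

```python
from collections import Counter

def cluster_purity(cluster_labels):
    """Purity of one cluster: the fraction of members carrying the
    majority label (1.0 means perfectly homogeneous)."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / sum(counts.values())

def confidence_score(clusters):
    """Confidence for the fine tuned model: mean purity over all
    clusters of the test set in the fine tuned vector space."""
    purities = [cluster_purity(c) for c in clusters]
    return sum(purities) / len(purities)
```

A heterogeneous cluster (mixed labels mapped close together in the fine tuned space) lowers the score, flagging regions the model has difficulty separating.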
In some embodiments, the method further includes labelling a second plurality of input candidates from the corpus of data, wherein labelling the second plurality of input candidates includes: mapping the train set and development set onto the pretrained vector space and the fine tuned vector space; clustering the train set and development set in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; selecting the second plurality of input candidates such that the second plurality of input candidates are near to at least one of the heterogeneous clusters and singletons in the fine tuned vector space; and labelling the second plurality of input candidates.
In some embodiments, the method further includes partitioning the labeled second plurality of input candidates into the train set, the development set, the test set, and the out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids from the second plurality of input candidates to the train set; adding labeled cluster children from the second plurality of input candidates to one of the development set and the test set; and adding labeled singletons from the second plurality of input candidates to one of the train set and the out-of-distribution set.
In some embodiments, labelling of the second plurality of input candidates includes algorithmically labelling the second plurality of input candidates.
In some embodiments, the method further includes assigning the confidence score for the labelling of the second plurality of input candidates using a bipartite graph of the pretrained vector space and the fine tuned vector space.
In some embodiments, the method further includes evaluating performance of an ensemble of two or more fine tuned models on the test set, wherein evaluating the performance of the ensemble of two or more fine tuned models includes determining whether the ensemble of two or more fine tuned models concur on an output; mapping the train, development, and test sets onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using a bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the method further includes selecting a third plurality of input candidates from the corpus of data; labeling the third plurality of input candidates using the fine tuned model; mapping the third plurality of input candidates onto the pretrained vector space and a fine tuned vector space; clustering the third plurality of input candidates in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; and assigning a confidence score for the labelling of the third plurality of input candidates using a bipartite graph of the pretrained vector space and the fine tuned vector space.
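The bipartite-graph confidence recited above can be illustrated as follows. In this hedged sketch, each pretrained-space cluster is connected to the fine tuned-space clusters its members land in, and an input's confidence is the fraction of its pretrained-cluster peers that land in the same fine tuned cluster; the exact edge-weight statistic is an assumption of the example.

```python
from collections import Counter, defaultdict

def bipartite_confidence(pre_assign, fine_assign):
    """Confidence scores from the bipartite graph between pretrained and
    fine tuned cluster assignments. pre_assign[i] and fine_assign[i] are
    the cluster ids of input i in the two spaces."""
    edges = defaultdict(Counter)                # pretrained cluster -> fine clusters
    for p, f in zip(pre_assign, fine_assign):
        edges[p][f] += 1
    scores = []
    for p, f in zip(pre_assign, fine_assign):
        # Concentration of this input's edge: how many of its pretrained
        # peers agree with its fine tuned cluster.
        scores.append(edges[p][f] / sum(edges[p].values()))
    return scores
```

Inputs whose pretrained neighborhood is split across many fine tuned clusters receive low scores and become candidates for the "failed inputs" examination described below.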
In some embodiments, the method further includes labeling a third plurality of input candidates using an ensemble of two or more fine tuned models on the third plurality of input candidates; determining whether the ensemble of two or more fine tuned models concur on labeling of the third plurality of input candidates; mapping the third plurality of input candidates onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using a bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the method further includes selecting a plurality of failed inputs for examination, wherein the plurality of failed inputs are inputs of the third plurality of inputs that have a low confidence score; selecting a plurality of neighbors of each of the plurality of failed inputs; labelling the plurality of neighbors; and partitioning the plurality of neighbors into the train set, the development set, the test set, and the out-of-distribution set.
In some embodiments, partitioning the plurality of neighbors includes adding labeled cluster centroids from the plurality of neighbors to the train set; adding labeled cluster children from the plurality of neighbors to one of the development set and the test set; and adding labeled singletons from the plurality of neighbors to one of the train set and the out-of-distribution set.
In one aspect, a system includes a non-transitory memory and one or more hardware processors configured to read instructions from the non-transitory memory that, when executed, cause the one or more hardware processors to perform operations including: selecting a first plurality of input candidates from a corpus of data; mapping the first plurality of input candidates onto a pretrained vector space of a pretrained model; clustering the first plurality of input candidates in the pretrained vector space; adding the first plurality of input candidates to a plurality of queues for labelling; and labelling the first plurality of input candidates.
In some embodiments, labelling the first plurality of input candidates is performed by humans.
In some embodiments, labelling the first plurality of input candidates is performed algorithmically.
In some embodiments, labelling includes identifying cluster centroids in the pretrained vector space.
In some embodiments, the pretrained vector space is created by mapping input to sparse/dense distributed representations.
In some embodiments, the pretrained vector space includes learned parameters of a probability distribution.
In some embodiments, the pretrained vector space is learned by performing density estimation.
In some embodiments, the pretrained model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the operations further include partitioning the labeled first plurality of input candidates into a train set, a development set, a test set, and an out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids in the pretrained vector space from the first plurality of input candidates to the train set; adding labeled cluster children in the pretrained vector space from the first plurality of input candidates to one of the development set and the test set; and adding labeled singletons in the pretrained vector space from the first plurality of input candidates to one of the train set and the out-of-distribution set.
In some embodiments, the operations further include creating a fine tuned model.
In some embodiments, creating the fine tuned model includes using the pretrained model to create the fine tuned model.
In some embodiments, the operations further include assigning a first plurality of outputs using the fine tuned model.
In some embodiments, the fine tuned model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the operations further include evaluating performance of the fine tuned model on the test set, wherein evaluating the performance of the fine tuned model includes: mapping the test set onto a fine tuned vector space; clustering the test set in the fine tuned vector space; quantifying heterogeneity of test set clusters in the fine tuned vector space; and providing a confidence score for the fine tuned model.
In some embodiments, the operations further include labelling a second plurality of input candidates from the corpus of data, wherein labelling the second plurality of input candidates includes: mapping the train set and development set onto the pretrained vector space and the fine tuned vector space; clustering the train set and development set in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; selecting the second plurality of input candidates such that the second plurality of input candidates are near to at least one of the heterogeneous clusters and singletons in the fine tuned vector space; and labelling the second plurality of input candidates.
In some embodiments, the operations further include partitioning the labeled second plurality of input candidates into the train set, the development set, the test set, and the out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids from the second plurality of input candidates to the train set; adding labeled cluster children from the second plurality of input candidates to one of the development set and the test set; and adding labeled singletons from the second plurality of input candidates to one of the train set and the out-of-distribution set.
In some embodiments, labelling of the second plurality of input candidates includes algorithmically labelling the second plurality of input candidates.
In some embodiments, the operations further include assigning the confidence score for the labelling of the second plurality of input candidates using a bipartite graph of the pretrained vector space and the fine tuned vector space.
In some embodiments, the operations further include evaluating performance of an ensemble of two or more fine tuned models on the test set, wherein evaluating the performance of the ensemble of two or more fine tuned models includes determining whether the ensemble of two or more fine tuned models concur on an output; mapping the train, development, and test sets onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using a bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the operations further include selecting a third plurality of input candidates from the corpus of data; labeling the third plurality of input candidates using the fine tuned model; mapping the third plurality of input candidates onto the pretrained vector space and a fine tuned vector space; clustering the third plurality of input candidates in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; and assigning a confidence score for the labelling of the third plurality of input candidates using a bipartite graph of the pretrained vector space and the fine tuned vector space.
In some embodiments, the operations further include labeling a third plurality of input candidates using an ensemble of two or more fine tuned models on the third plurality of input candidates; determining whether the ensemble of two or more fine tuned models concur on labeling of the third plurality of input candidates; mapping the third plurality of input candidates onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using a bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the operations further include selecting a plurality of failed inputs for examination, wherein the plurality of failed inputs are inputs of the third plurality of inputs that have a low confidence score; selecting a plurality of neighbors of each of the plurality of failed inputs; labelling the plurality of neighbors; and partitioning the plurality of neighbors into the train set, the development set, the test set, and the out-of-distribution set.
In some embodiments, partitioning the plurality of neighbors includes: adding labeled cluster centroids from the plurality of neighbors to the train set; adding labeled cluster children from the plurality of neighbors to one of the development set and the test set; and adding labeled singletons from the plurality of neighbors to one of the train set and the out-of-distribution set.
In one aspect, a non-transitory computer-readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to perform operations including: selecting a first plurality of input candidates from a corpus of data; mapping the first plurality of input candidates onto a pretrained vector space of a pretrained model; clustering the first plurality of input candidates in the pretrained vector space; adding the first plurality of input candidates to a plurality of queues for labelling; and labelling the first plurality of input candidates.
In some embodiments, labelling the first plurality of input candidates is performed by humans.
In some embodiments, labelling the first plurality of input candidates is performed algorithmically.
In some embodiments, labelling includes identifying cluster centroids in the pretrained vector space.
In some embodiments, the pretrained vector space is created by mapping input to sparse/dense distributed representations.
In some embodiments, the pretrained vector space includes learned parameters of a probability distribution.
In some embodiments, the pretrained vector space is learned by performing density estimation.
In some embodiments, the pretrained model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the operations further include partitioning the labeled first plurality of input candidates into a train set, a development set, a test set, and an out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids in the pretrained vector space from the first plurality of input candidates to the train set; adding labeled cluster children in the pretrained vector space from the first plurality of input candidates to one of the development set and the test set; and adding labeled singletons in the pretrained vector space from the first plurality of input candidates to one of the train set and the out-of-distribution set.
In some embodiments, the operations further include creating a fine tuned model.
In some embodiments, creating the fine tuned model includes using the pretrained model to create the fine tuned model.
In some embodiments, the operations further include assigning a first plurality of outputs using the fine tuned model.
In some embodiments, the fine tuned model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the operations further include evaluating performance of the fine tuned model on the test set, wherein evaluating the performance of the fine tuned model includes: mapping the test set onto a fine tuned vector space; clustering the test set in the fine tuned vector space; quantifying heterogeneity of test set clusters in the fine tuned vector space; and providing a confidence score for the fine tuned model.
In some embodiments, the operations further include labelling a second plurality of input candidates from the corpus of data, wherein labelling the second plurality of input candidates includes: mapping the train set and development set onto the pretrained vector space and the fine tuned vector space; clustering the train set and development set in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; selecting the second plurality of input candidates such that the second plurality of input candidates are near to at least one of the heterogeneous clusters and singletons in the fine tuned vector space; and labelling the second plurality of input candidates.
In some embodiments, the operations further include partitioning the labeled second plurality of input candidates into the train set, the development set, the test set, and the out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids from the second plurality of input candidates to the train set; adding labeled cluster children from the second plurality of input candidates to one of the development set and the test set; and adding labeled singletons from the second plurality of input candidates to one of the train set and the out-of-distribution set.
In some embodiments, labelling of the second plurality of input candidates includes algorithmically labelling the second plurality of input candidates.
In some embodiments, the operations further include assigning the confidence score for the labelling of the second plurality of input candidates using a bipartite graph of the pretrained vector space and the fine tuned vector space.
In some embodiments, the operations further include evaluating performance of an ensemble of two or more fine tuned models on the test set, wherein evaluating the performance of the ensemble of two or more fine tuned models includes determining whether the ensemble of two or more fine tuned models concur on an output; mapping the train, development, and test sets onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using a bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the operations further include selecting a third plurality of input candidates from the corpus of data; labeling the third plurality of input candidates using the fine tuned model; mapping the third plurality of input candidates onto the pretrained vector space and a fine tuned vector space; clustering the third plurality of input candidates in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; and assigning a confidence score for the labelling of the third plurality of input candidates using the bipartite graph of the pretrained vector space and the fine tuned vector space.
In some embodiments, the operations further include labeling a third plurality of input candidates using an ensemble of two or more fine tuned models on the third plurality of input candidates; determining whether the ensemble of two or more fine tuned models concur on labeling of the third plurality of input candidates; mapping the third plurality of input candidates onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using the bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the operations further include selecting a plurality of failed inputs for examination, wherein the plurality of failed inputs are inputs of the third plurality of inputs that have a low confidence score; selecting a plurality of neighbors of each of the plurality of failed inputs; labelling the plurality of neighbors; and partitioning the plurality of neighbors into the train set, the development set, the test set, and the out-of-distribution set.
In some embodiments, the operations further include adding labeled cluster centroids from the plurality of neighbors to the train set; adding labeled cluster children from the plurality of neighbors to one of the development set and the test set; and adding labeled singletons from the plurality of neighbors to one of the train set and the out-of-distribution set.
Any one of the embodiments disclosed herein may be properly combined with any other embodiment disclosed herein. The combination of any one of the embodiments disclosed herein with any other embodiments disclosed herein is expressly contemplated.
The objects and advantages will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:
It has been discovered that curation of labeled data sets for training a supervised model on tasks involving supervised learning is one of the practical challenges of machine learning, both in terms of the cost and the time involved in curating those labeled data sets. Because there is, to date, no well-defined methodology for optimal curation of data sets, a variety of methods are used in practice to reduce the labeling effort, such as weakly supervised learning or attempts to automate the labeled-data creation process with human assistance. However, while models are then evaluated on such data sets to assess their performance, there is no quantitative metric for assessing the labeled data set itself, particularly the breadth of coverage of the training set or the methodology used to partition the data set into train and dev/test splits. This ad hoc training data set creation process has a direct bearing on model performance when the labeled data set is small, a situation that is not infrequent in practice given the cost of creating labeled data sets. For instance, a suboptimal split of the labeled data set can not only prevent models from scoring high on the test set but also prevent models from learning maximally from the train set. Moreover, labeled data sets are used purely for training and testing the model; they are not used once the model is deployed to score the confidence of model outputs or to determine out-of-distribution cases.
In addition to the challenge of optimally curating labeled data to train supervised models, supervised models, particularly neural networks trained on high-dimensional data, suffer from two problems during inference: they often wrongly choose a particular class over other classes with a high score ("confidently wrong"), and they fail on out-of-distribution cases. A model's output on a "near-OOD" input poses the additional challenge of determining whether the model is successfully generalizing from its train set learning or is simply wrong.
Out-of-distribution cases are inevitable in supervised learning, since a supervised model learns only the conditional distribution of the labels given the input, which is typically drawn from a small subset of the available data, as opposed to learning the underlying distribution the input data comes from by making use of the entirety of the available data. One of the primary reasons models with strong performance on popular benchmarks exhibit poor production performance is inadequate coverage of the input space by the training set, which can in part be addressed by leveraging a diverse labeled data set at inference time for ensembling or out-of-distribution estimation.
Using models that perform density estimation of the input space to address out-of-distribution scenarios, particularly using their generative capacity (density estimation models are often directly generative or, in most cases, can be made generative with minor enhancements or modifications), can yield anomalous results even when the model has learned the training set: the model can generate samples that appear representative of the training set yet are far from the actual training set.
Furthermore, with models that learn rich representations in a self-supervised manner (e.g., an autoencoder model such as BERT, or contrastive and non-contrastive learning methods for images using transformers, ResNets, etc.), explicit density estimation is often unnecessary in practice, particularly since such models can learn rich representations without density estimation and can be leveraged both to train a supervised model and to assist in estimating out-of-distribution cases as described herein. However, the disclosed systems and methods do not preclude the use of models that can both perform density estimation and learn rich representations (e.g., autoregressive models).
The systems and methods described herein offer a working solution to all of the problems listed above, including (1) optimal labeling and partitioning of a data set for training a supervised model, (2) making model outputs interpretable to some degree, and (3) improving model performance by reducing the "confidently wrong" cases and increasing a model's out-of-distribution (OOD) performance, both of which are done in an interpretable manner, in contrast to the opaque process by which neural networks map input to output. The methods described herein are made possible largely by self-supervised models (also referred to as "foundation models" for their utility both for fine tuning and for direct use as is, without fine tuning, as illustrated in the embodiments below) that can learn from an entire corpus, as opposed to supervised models whose learning is constrained to labeled data, which is limited given the need for labeling by humans.
In one aspect, a system and method are disclosed to improve the efficacy of supervised learning. Specifically, the system and method are described according to the following embodiments.
In some embodiments, the system and method outline a procedure to find candidates for labeling in an optimal manner to reduce the labeling effort which involves humans.
In some embodiments, the system and method outline a procedure to quantitatively assess the labeled data that is created with human involvement.
In some embodiments, the system and method outline a procedure to quantify the uncertainty in labeling an input when multiple humans or autonomous agents label the input, and to use that measure at test time to measure the model's uncertainty on the same input.
In some embodiments, the system and method outline a procedure to partition the labeled data set from the previous step into training and dev/test sets such that the training set maximally spans the labeled space. This serves both to improve model performance and to reduce out-of-distribution cases, with respect to the labeled data set, at inference time.
In some embodiments, the system and method outline a procedure to quantitatively assess model output at inference time by leveraging what the model was trained on. This offers a means to disentangle model failures from cases where the model has truly generalized from the training set.
In some embodiments, the system and method outline a procedure to test a model on out-of-distribution (OOD) input separate from the test set.
In some embodiments, the system and method outline a procedure to leverage a human labeled data set for algorithmic labeling to expand the human labeled data set with minimal human effort (verifying the automatic labeling as opposed to actually labeling).
In some embodiments, the system and method outline a procedure to leverage the entire labeled data set during model deployment to improve model performance. In some embodiments, the system and method determine one or more of the following: how to utilize the labeled data set as a reference to quantify model uncertainty for a single input; how to use the labeled data set to create an interpretable output label for a given input, in contrast to its opaque counterpart where the learning from a subset of the labeled data (the typical use of the training set) is incorporated into the model parameters; and how to effectively multiply the labeled data set by using an ensemble of models, which both improves model performance and quantifies model uncertainty.
In some embodiments, the system and method leverage the procedure described above for creating the labeled data set after model deployment as well, to continue retraining the model on out-of-distribution cases encountered during production usage.
In some embodiments, an implementation of the system and method described above uses an embedding space that the input is mapped to. This mapping could be performed by a variety of means, including but not limited to self-supervised methods. The input could be in one or more modalities such as text, image, or audio, and the embedding spaces these modalities are mapped to could be distinct or jointly learned. For the purposes of this document, this embedding space or spaces will be collectively referred to as the pre-trained space. The model or models that map input to this space are referred to henceforth as pre-trained space mappers. In some embodiments, the pre-trained space is used in one or more of the following ways: to find candidates for labeling, optionally removing noise; to partition the input into training and dev/test sets; to predict out-of-distribution inputs at inference time; to add a level of interpretability to model output; and, after model deployment, to further optimally retrain the model.
In some embodiments, an implementation of the system and method described above uses the embedding space of the supervised models (optionally trained by the methods described herein) that the input is mapped to. The input could comprise one or more modalities, such as text, image, or audio, and the embedding spaces these modalities are mapped to could be distinct or jointly learned. For the purpose of this document, this embedding space or spaces will be collectively referred to as the fine tuned space. The model or models that map input to this space are referred to henceforth as fine tuned space mappers. The fine tuned space is used in one or more of the following ways: to partition the input space once a fine tuned model has been bootstrapped into creation, optionally using the pretrained space; to create an interpretable output label for an input as a counterpart to the typically opaque model output; to ascribe confidence to model output and identify those inputs that the model has difficulty separating into distinct classes; to predict out-of-distribution inputs at inference time; to increase the chances of identifying cases where the model is confidently wrong in its output; after model deployment, to create weakly labeled data and use it for subsequent input classification; and, after model deployment, to further optimally retrain the fine tuned model.
In some embodiments, the pretrained space is a vector space created by mapping input to sparse/dense distributed representations.
In some embodiments, the pretrained space includes learned parameters of a probability distribution.
In some embodiments, the pretrained space is learned by a model performing density estimation (e.g. autoregressive models, BiGANs, Flow models, diffusion models).
In some embodiments, the pretrained space is learned by a model that does not perform density estimation but only learns representations (e.g. autoencoding models like BERT).
In some embodiments, the pretrained space mappers are trained in a supervised manner to create fine tuned space mappers.
In some embodiments, pretrained space mappers are distinct and decoupled from fine tuned space mappers.
In some embodiments, the pretrained and fine tuned space mappers are deep learning architectures, including but not limited to transformers, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and graph neural networks (GNNs).
In one embodiment, the fine tuned space mapper is a supervised model that is created from fine tuning a self-supervised model which was used to create the pretrained space (pre-trained space mapper).
In one embodiment, a self-supervised model is trained on the entire available input space of interest to map input to the pretrained space, and the pretrained space is used as follows: mapping a training candidate set (optionally a subset of the entire available input space that serves as the candidate set for labeling to train the supervised model) to the pretrained space using the pretrained space mapper; clustering the training candidate set in the pretrained space to identify cluster centroids that are then chosen for labeling by humans or by some automated method; optionally culling noise from the training candidate set using the pretrained space; choosing candidates for labeling from each cluster, where the number of candidates is determined by the number of classes the supervised model is trained to output; choosing candidates for labeling from those inputs that did not form clusters or belong to one (singletons); partitioning the labeled data set into train, dev, and test sets such that the train set contains the centroids and at least one representative sample of each class or quantized range (if such additional representative samples exist in a cluster), the dev/test sets contain at least one of the other items in each cluster (with the split between dev and test determined purely by the desired dev/test ratio), and the train set contains all the labeled singletons except for a desired number that is added to an out-of-distribution (OOD) set; and bootstrapping the fine tuning of the supervised model using the labeled data set from the previous step.
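The selection and partitioning steps above can be sketched in a few lines. In this sketch, the greedy `leader_cluster` routine, the cosine-similarity threshold, and the toy two-dimensional embeddings are illustrative assumptions and not part of the disclosure; any clustering method over the pretrained space could be substituted.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def leader_cluster(embeddings, threshold=0.9):
    """Greedy one-pass clustering: each item joins the first cluster whose
    leader (a stand-in for the centroid) is within the similarity threshold,
    otherwise it starts a new cluster."""
    clusters = []  # list of lists of indices; clusters[i][0] is the leader
    for idx, emb in enumerate(embeddings):
        for members in clusters:
            if cosine(embeddings[members[0]], emb) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

def partition(clusters):
    """Leaders (centroid proxies) and singletons go to the train set;
    remaining cluster children are held out for dev/test, per the
    partitioning scheme described above."""
    train, dev_test = [], []
    for members in clusters:
        train.append(members[0])      # centroid / leader
        if len(members) == 1:
            continue                  # singleton stays in train
        dev_test.extend(members[1:])  # children held out
    return train, dev_test

embs = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99), (-1.0, 0.0)]
clusters = leader_cluster(embs, threshold=0.95)
train, dev_test = partition(clusters)
```

Here the two near-duplicate pairs form clusters, the lone opposite-direction vector remains a singleton, and the train set receives the two leaders plus the singleton.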
In some embodiments, for a supervised model whose output is continuous, candidates for labeling from each cluster could be representative samples of quantized ranges that span the continuous range of interest. In some embodiments, choosing candidates includes applying the clustering process iteratively on the identified clusters with relaxed clustering constraints if the number of clusters is too large for labeling.
In some embodiments, choosing candidates for labeling where the candidates did not form clusters or belong to one includes applying the clustering process iteratively with relaxed or tight clustering constraints on the identified singletons if the singleton set is too small or large for labeling.
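The iterative relax/tighten loop of the two preceding embodiments might be sketched as follows. The `grid_cluster` stand-in (1-D binning) and the integer threshold scale are purely illustrative assumptions; any clustering routine parameterized by a similarity constraint could be dropped in.

```python
def grid_cluster(points, threshold):
    """Toy 1-D clustering stand-in: bin points into cells whose width grows
    as the threshold is relaxed."""
    width = max((100 - threshold) / 10.0, 1e-6)
    bins = {}
    for p in points:
        bins.setdefault(int(p / width), []).append(p)
    return list(bins.values())

def fit_to_budget(points, cluster_fn, budget, threshold=95, step=5, max_iters=20):
    """Relax the clustering constraint while there are too many clusters for
    the labeling budget; tighten it while there are too few."""
    clusters = cluster_fn(points, threshold)
    for _ in range(max_iters):
        clusters = cluster_fn(points, threshold)
        if len(clusters) > budget:
            threshold -= step   # relax: merge more items together
        elif len(clusters) < budget:
            threshold += step   # tighten: split clusters apart
        else:
            break
    return clusters, threshold

points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9]
clusters, threshold = fit_to_budget(points, grid_cluster, budget=2)
```

Starting from a tight constraint that yields three clusters, the loop relaxes the threshold until the two-cluster labeling budget is met.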
In some embodiments, the fine tuned space of the bootstrapped model is used for subsequent fine tuning of the supervised model as follows: passing the entire labeled data set through the fine tuned space mapper and clustering the entire bootstrapped labeled set in the fine tuned space as well; examining the mapping characteristics of the labeled data between the pretrained space and the fine tuned space, e.g., by treating the mapping from pretrained space to fine tuned space as a bipartite graph and quantifying the mapping of input from pretrained space clusters to fine tuned space clusters; and using these mapping characteristics to further inform the choice of candidates to label to improve model performance, in addition to the use of the pretrained space to choose labeling candidates as described earlier.
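One possible way to quantify the bipartite mapping just described is to count, for each (pretrained cluster, fine tuned cluster) pair, how many inputs make that transition. The dict-based data layout and the `fan_out` helper below are illustrative assumptions.

```python
from collections import defaultdict

def bipartite_edges(pre_assign, fine_assign):
    """Map (pretrained cluster, fine tuned cluster) -> number of inputs that
    made that transition. High fan-out flags pretrained clusters whose
    members scatter in the fine tuned space and may merit more labeling."""
    edges = defaultdict(int)
    for item, pre_c in pre_assign.items():
        edges[(pre_c, fine_assign[item])] += 1
    return dict(edges)

def fan_out(edges, pre_cluster):
    """Number of distinct fine tuned clusters a pretrained cluster spreads into."""
    return len({f for (p, f) in edges if p == pre_cluster})

# Hypothetical assignments: input id -> cluster id in each space.
pre = {"a": 0, "b": 0, "c": 0, "d": 1}
fine = {"a": 10, "b": 10, "c": 11, "d": 12}
edges = bipartite_edges(pre, fine)
```

In this toy example, pretrained cluster 0 splits into two fine tuned clusters, which under the approach above would make its members candidates for further labeling.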
In some embodiments, the clusters in pretrained and fine tuned space, as well as the mapping characteristics from pre-trained to fine tuned space, are utilized to estimate out-of-distribution candidates at inference time.
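A minimal sketch of one such out-of-distribution estimate at inference time: treat the labeled set's cluster centroids as the in-distribution reference and flag inputs whose embedding lies outside every cluster's radius. The radius threshold and the Euclidean metric are illustrative assumptions.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_centroid_distance(x, centroids):
    return min(euclidean(x, c) for c in centroids)

def is_out_of_distribution(x, centroids, radius=1.0):
    """Flag inputs whose embedding is farther than `radius` from every
    labeled cluster centroid."""
    return nearest_centroid_distance(x, centroids) > radius

centroids = [(0.0, 0.0), (5.0, 5.0)]
near = is_out_of_distribution((0.5, 0.0), centroids)   # inside a cluster radius
far = is_out_of_distribution((10.0, 10.0), centroids)  # outside every radius
```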
In some embodiments, the heterogeneity measure of the clusters in the fine tuned space is utilized to ascribe confidence scores to model output (this could lead to an additional category of “can't say” in the model output). This helps reduce cases where the model is confidently wrong.
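One simple heterogeneity measure consistent with the above is label purity: the fraction of a cluster's members carrying the majority label. The purity cutoff and the helper names below are illustrative assumptions, not the disclosed metric.

```python
from collections import Counter

def cluster_purity(labels):
    """Fraction of cluster members carrying the majority label (1.0 = homogeneous)."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def classify_with_confidence(model_label, cluster_labels, min_purity=0.8):
    """Downgrade the model output to "can't say" when the input lands in a
    heterogeneous cluster, reducing confidently-wrong outputs."""
    purity = cluster_purity(cluster_labels)
    if purity < min_purity:
        return "can't say", purity
    return model_label, purity

# Input lands in a mixed cluster: output is withheld.
out, conf = classify_with_confidence("cat", ["cat", "cat", "cat", "dog"])
```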
In some embodiments, an ensemble of models optionally trained by the above described methods is utilized to determine model output for each input, particularly using the confidence scores of the model outputs as a metric to weight the model outputs.
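Confidence-weighted ensembling as described above can be sketched as a weighted vote; the (label, confidence) pair format is an assumed interface.

```python
from collections import defaultdict

def ensemble_vote(outputs):
    """outputs: list of (label, confidence) pairs, one per ensemble member.
    Each model's vote is weighted by its confidence score; the label with
    the largest total weight wins."""
    weights = defaultdict(float)
    for label, confidence in outputs:
        weights[label] += confidence
    return max(weights, key=weights.get)

# One confident model outweighs two hesitant ones only if their total is lower.
label = ensemble_vote([("cat", 0.9), ("dog", 0.4), ("cat", 0.3)])
```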
In some embodiments, an ensemble of models optionally trained by the above described methods is utilized to estimate out-of-distribution candidates with respect to the train set. This ensemble is not limited to supervised models trained by the method described above—it could include the self-supervised model itself that could potentially be adapted to classify without labeled data, even if only with limited capabilities, and supervised models created by other means.
In some embodiments, an ensemble of models is utilized to output a confidence score that captures model uncertainty where the ensemble is unable to decide on the output with enough certainty. This score reflects both out-of-distribution cases and cases where the input falls into heterogeneous clusters for most models.
In some embodiments, the methods described above are utilized to create additional training data to further train the supervised models on “low confidence” samples it encounters during production use. In this case each “low confidence” sample, regardless of the model classifying it correctly, is treated analogous to a cluster centroid encountered in pretrained space and representative samples of the different classes (or quantized ranges) are picked, if available, to train the supervised model and improve future model performance.
In some embodiments, the utility of the train/dev/test data sets, as well as the OOD set, is extended beyond its traditional pre-deployment role to continuous lifelong learning: the sets are updated with new data encountered in production that a model struggles with or fails on, and they are used to ascribe confidence to model outputs based on the mapping of input from the pretrained space to the fine tuned space and on the characteristics of the set the input belongs to in each space.
In some embodiments, the methods and systems include use of the embedding space created by a model that is trained by some means (e.g. self-supervised, or even supervised) to cluster the unlabeled data candidates, a subset of which is used to train a supervised model.
In some embodiments, the methods and systems include use of the clusters to choose candidates for labeling to train the supervised model.
In some embodiments, the methods and systems include using clustering characteristics of input in the pretrained space as a means to partition the labeled input into train and dev/test sets.
In some embodiments, the methods and systems include training the supervised model with those candidates.
In some embodiments, the methods and systems include use of the fine tuned embedding space of the supervised model to further choose candidates to create newer versions of the fine tuned model.
In some embodiments, the methods and systems include use of one or more embeddings space learned independently or jointly to choose candidates for fine tuning a model.
In some embodiments, the methods and systems include use of two embedding spaces and the mapping characteristics of input between those two spaces as a means to choose candidates for labeling as well as to detect out-of-distribution cases.
In some embodiments, the methods and systems include using clustering characteristics of input in the pretrained and/or fine tuned space as a means to partition the labeled input into train and dev/test sets.
In some embodiments, the methods and systems include use of one or more embedding spaces to generate interpretable outputs.
In some embodiments, the methods and systems include use of one or more embedding spaces to identify out-of-distribution candidates.
In some embodiments, the methods and systems include use of one or more embedding spaces to retrain a model on out-of-distribution cases encountered at inference time.
In some embodiments, the methods and systems include use of one or more embedding spaces to retrain a model on cases it failed at inference time.
In an embodiment, shown in
In some embodiments, a fine tuned model is created by starting with a pretrained model (e.g., a trained self-supervised model) and then adding an additional layer (typically called a head) to the pretrained model. In some embodiments, the choice of additional layer is specific to the fine tuning task. For example, when fine tuning a model for a classification task, an additional classification layer is added on top of the pretrained layers. In some embodiments, the weights of the additional layer are updated during the fine tuning task. In other embodiments, the weights of the pretrained model may also be updated during fine tuning. In some embodiments, the choice of which weights to update is driven by the amount of data available for fine tuning. For example, when there is a large amount of fine tuning data, the pretrained model weights are often updated in addition to the weights of the layer added specifically for fine tuning. When the amount of fine tuning data is smaller, typically the pretrained model weights are frozen and only the weights of the additional layer (the head) are updated. In some embodiments, the number of pretrained model layers that are frozen or updated during fine tuning is determined by practitioners based on the amount of data available for fine tuning.
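The head-on-frozen-encoder pattern above can be illustrated framework-independently with numpy: the "pretrained" encoder weights are held fixed while only a small logistic head is trained by gradient descent. The random encoder, the head architecture, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(4, 3))        # stand-in "pretrained" encoder, frozen
W_enc_before = W_enc.copy()            # kept to verify the encoder never changes

def encode(x):
    return np.tanh(x @ W_enc)          # frozen pretrained mapping

def train_head(X, y, lr=0.5, steps=200):
    """Train only the classification head; W_enc is never touched."""
    w = np.zeros(3)                    # head weights: the only trained part
    b = 0.0
    H = encode(X)                      # encoder run once, as its weights are frozen
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))   # sigmoid head
        grad = p - y                             # logistic-loss gradient
        w -= lr * H.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

X = rng.normal(size=(20, 4))
y = (X[:, 0] > 0).astype(float)
w, b = train_head(X, y)
preds = (1.0 / (1.0 + np.exp(-(encode(X) @ w + b))) > 0.5).astype(float)
```

Unfreezing some encoder layers would correspond to also applying gradient updates to `W_enc`, which, per the discussion above, is typically only done when ample fine tuning data is available.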
The approach described above can be used to pretrain and fine tune one or more models, with additional selective labeling being performed iteratively if required to improve a fine tuned model performance.
In an embodiment, additional labeling is also done algorithmically (by autonomous agents), and a confidence score is assigned to the algorithmically labeled data, leveraging the labeled data humans have already created and described in
In an embodiment, when multiple agents, human or autonomous, label a particular input, the disagreement between agents is captured in an uncertainty measure that is used at test time to check a model's uncertainty (or an ensemble uncertainty) on the same input. This approach of using a model's ensemble confidence score for an input is in effect a soft analogue to creating a separate “can't say” class distinct from all classification classes. Such inputs are then logged to continually retrain the model for its full life cycle as described below.
The present disclosure also describes a means to quantify versions of a data set as it is being created, as well as to compare any two data sets by a comparison metric at the data set level. For instance, two data sets are deemed similar if their comparison metric is 1, orthogonal if it is 0, and dissimilar if it is −1. Two similar data sets A and B would have a comparison score greater than the comparison score between data sets A and C, where A and C are more dissimilar than A and B. In some embodiments, the comparison score is determined using the average cosine distance between sentences in the data sets. For example, if there are M sentences in A and N in B, the pairwise dot products yield M*N values. In some embodiments, the average of these dot products is the comparison score.
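The comparison score above can be sketched as the mean of the M*N pairwise dot products between unit-normalized sentence embeddings, which lands in [−1, 1]. The toy two-dimensional "embeddings" are illustrative assumptions.

```python
import math

def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def comparison_score(set_a, set_b):
    """Mean of the M*N pairwise dot products of unit-normalized embeddings:
    1 for similar sets, 0 for orthogonal sets, -1 for dissimilar sets."""
    a = [normalize(v) for v in set_a]
    b = [normalize(v) for v in set_b]
    total = sum(sum(x * y for x, y in zip(u, v)) for u in a for v in b)
    return total / (len(a) * len(b))

A = [(1.0, 0.0), (1.0, 0.0)]
B = [(1.0, 0.0)]
C = [(0.0, 1.0)]
D = [(-1.0, 0.0)]
score_ab = comparison_score(A, B)   # identical direction: similar
score_ac = comparison_score(A, C)   # orthogonal
score_ad = comparison_score(A, D)   # opposite direction: dissimilar
```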
Additionally, the individual model performance of two ensemble models trained by the methods described herein was 94.5% and 95.5% (F1-score), and the ensemble score was 97%. The OOD performance for a second use case, also a binary classifier with an ensemble of two models, was (93%, 94%).
Table 1 shows utilization of pretrained and fine tuned spaces in the different stages. Pretrained models are used from the instance they are created through to the full lifespan of the ensemble deployment. Pretrained models are retrained too, as the corpus changes, though this frequency is typically less than fine tuning model retraining. Fine tuned models are retrained at a higher frequency and are also used for the full lifespan of ensemble deployment, until better performing models potentially replace them.
Table 2 illustrates the labeling priority of input candidates, serving as a guideline for human labelers. The separation of unlabeled data into distinct categories (clusters, cluster children, singletons), as well as the ordering of unlabeled data within these categories, addresses some of the inefficiencies and problems associated with manual labeling, which is expensive in most cases. For instance, it addresses the problem of humans labeling near duplicates, if not exact duplicates. It also offers a means to quantify the labeling work not just in terms of raw counts of labeled data, but also in terms of the quality of the labeling, particularly its breadth. For instance, singletons, particularly ones that are farther from each other, are more valuable than those close to each other. Cluster children are necessary for dev and test sets, but one only needs to pick a few distant ones in each cluster (algorithmic determination of near and far items makes this choice easy). Despite all the algorithmic support, humans play a role in picking diverse as well as representative candidates for each output class of interest; the queues and the ordering of unlabeled candidates purely assist the user in this labeling process.
Tables 3 and 4 are examples of binary classifier performance for various partitionings of train/dev/test sets. When the train, dev, and test sets were partitioned algorithmically, the input space was clustered, and then cluster centroids and singletons were added to the train set, while cluster children were added to the dev/test sets. Performance tests include movement tests that determine the effect on model performance of moving certain types of inputs into different sets, compared to the algorithmic partitioning. For example, as shown in Table 3 and Table 4, when singletons were moved from the train set to the test set, such that only centroids remained in the train set, there was a decrease in model performance compared to algorithmic partitioning. A similar drop in model performance relative to algorithmic partitioning was observed when centroids were moved from the train set to the test set, such that only singletons remained in the train set. As shown in Table 4, moving centroid children from the test set to the train set did not improve model performance, because the centroids are already present in the train set. While the examples use binary classifiers to illustrate the methods described herein, these methods apply to any supervised learning problem, including but not limited to multi class classification, multi class multilabel classification, and continuous output models. Also, while the embodiments described herein are examples of treating each input as a whole for classification, the methods described herein do not preclude their use to classify parts of an input, such as tagging terms/phrases in text, classifying objects in an image, or segmenting objects in an image. These diverse sets of problems require an appropriate choice of models for the pretrained and fine tuned spaces to accomplish these tasks.
In some embodiments, models for pretrained space and fine tuned space are chosen such that those models yield representations that have good clustering properties, including number of clusters, size of clusters, and heterogeneity of clusters.
This causes a single input to map to multiple clusters in the fine tuned space, more than it does in the pretrained space, as can be seen by comparing 1202 and 1204. Charts 1202 and 1204 show histograms of how many clusters each input maps to. For example, in 1202, 2067 inputs map to singletons, 4530 inputs map to one cluster, and 868 inputs map to two clusters. In 1202, each input maps to up to seven clusters in the pretrained space, while, in 1204, each input maps to up to seventeen clusters in the fine tuned space. These clusters are created by passing the entire labeled input through both the pretrained and fine tuned models. At deployment time, a new input may remain a singleton or fall into one or more of these clusters, as seen in 1204. If an input falls into only one cluster, the input's label and uncertainty are determined from that single cluster. If an input falls into more than one cluster, each cluster provides information about the class type and the certainty of labeling for that input: both the heterogeneity of each individual cluster and the heterogeneity across the multiple clusters it falls into are signals of the class type. For example, if an input falls into one or more heterogeneous clusters, or into clusters of opposite sense, that is an indication of uncertainty in the labeling of that input. This is a key distinction of the method described herein: it utilizes the entire labeled data set as a baseline reference in vector spaces both to label new data and to complement an ensemble that includes supervised models in determining how to classify a new input.
When a supervised model produces an output for a given input, despite its known performance on a test set, it is not possible to know whether the model is generating a correct output by generalizing beyond the train set or whether it is confidently wrong (here, confidently wrong means the model's classification score for a particular class is unambiguously high). Vector spaces, particularly those of self-supervised pretrained models, tend to have predictable outcomes if used correctly. While the vector spaces of fine tuned models are subject to the limited learning from the train set, utilizing the entire train set as a reference to help classify an input tends to reduce the vagaries and opacity of model learning as applied to a single input, and adds a level of interpretability to model output through the clustering properties of the entire labeled data set, which can be studied offline and selected for by the right choice of clustering hyperparameters. Leveraging the entire data set by mapping it to the pretrained and fine tuned vector spaces infuses predictability and interpretability into an input-to-output mapping that is otherwise opaque.
Table 5 illustrates the movement of input between the bipartite graph of clusters in pretrained space and fine tuned space.
Table 6 shows an example where two ensemble models of a binary classification use case disagree in their output. For each model, the table shows the output of the model for each input, the confidence ratio for each output, the number of clusters that input maps to that are predominantly Label0, and the number of clusters that input maps to that are predominantly Label1. The confidence ratio is calculated by dividing the number of clusters that input maps to that are predominantly the output label by the number of clusters that input maps to that are predominantly the other label. The clusters mirror the disagreement in output in their cluster counts in this case, although this need not be the case. For example, for input data_5105, Model 1 assigns the output Label2, and Model 2 assigns the output Label1. Model 1 maps this input to 59 clusters that are predominantly Label1 and to 332 clusters that are predominantly Label2, while Model 2 maps this input to 58 clusters that are predominantly Label1 and no clusters that are predominantly Label2. Since Model 1 maps the input to both clusters that are predominantly Label1 and clusters that are predominantly Label2, Model 1 has some uncertainty in its labeling of this input, shown by an uncertainty of 0.18 (calculated by dividing 59 by 332). In contrast, Model 2 maps this input only to clusters that are predominantly Label1 and therefore has an uncertainty of zero for its labeling of this input. The ensemble uncertainty for an input is captured in general by disagreements between an individual model's output and the corresponding cluster counts in its fine tuned space, as well as by disagreements across model results, as illustrated in Table 6.
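The per-model uncertainty in the data_5105 example can be computed directly as the ratio of disagreeing to agreeing cluster counts. The function name below is an illustrative assumption; the numbers follow the example above.

```python
def uncertainty_ratio(agreeing, disagreeing):
    """Ratio of clusters contradicting the model's output to clusters backing
    it; 0 means every cluster the input maps to supports the output."""
    if agreeing == 0:
        return float("inf")  # every mapped cluster contradicts the output
    return disagreeing / agreeing

# Model 1: 332 clusters back its output, 59 contradict it -> uncertainty ~0.18.
model1_uncertainty = uncertainty_ratio(agreeing=332, disagreeing=59)
# Model 2: all 58 mapped clusters back its output -> uncertainty 0.
model2_uncertainty = uncertainty_ratio(agreeing=58, disagreeing=0)
```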
When the uncertainty is beyond a certain threshold the output could be classified as “can't say”—an additional class to the existing classification classes, or alternatively the uncertainty can be used in conjunction with the predicted class.
Table 6 illustrates a couple of aspects unique to the methods disclosed herein. The heterogeneity measure for each model, in addition to capturing model uncertainty and OOD, also serves as a predictor of the result. So, given n classification models in an ensemble, there are effectively twice the number of model results to ensemble when determining the final result.
As shown in
Finally, as shown in
As shown in
Those of skill in the art would appreciate that the various illustrations in the specification and drawings described herein can be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware, software, or a combination depends upon the particular application and design constraints imposed on the overall system. Skilled artisans can implement the described functionality in varying ways for each particular application. Various components and blocks can be arranged differently (for example, arranged in a different order, or partitioned in a different way) all without departing from the scope of the subject technology.
Furthermore, an implementation of the communication protocol can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system, or other apparatus adapted for carrying out the methods described herein, is suited to perform the functions described herein.
A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein. The methods for the communications protocol can also be embedded in a non-transitory computer-readable medium or computer program product, which includes all the features enabling the implementation of the methods described herein, and which, when loaded in a computer system is able to carry out these methods. Input to any part of the disclosed systems and methods is not limited to a text input interface. For example, they can work with any form of user input including text and speech.
Computer program or application in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form. Significantly, this communications protocol can be embodied in other specific forms without departing from the spirit or essential attributes thereof, and accordingly, reference should be had to the following claims, rather than to the foregoing specification, as indicating the scope of the invention.
The communications protocol has been described in detail with specific reference to these illustrated embodiments. It will be apparent, however, that various modifications and changes can be made within the spirit and scope of the disclosure as described in the foregoing specification, and such modifications and changes are to be considered equivalents and part of this disclosure.
It is to be understood that the disclosed subject matter is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, systems, methods and media for carrying out the several purposes of the disclosed subject matter. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the disclosed subject matter.
It will be appreciated that while one or more particular materials or steps have been shown and described for purposes of explanation, the materials or steps may be varied in certain respects, or materials or steps may be combined, while still obtaining the desired outcome. Additionally, modifications to the disclosed embodiment and the invention as claimed are possible and within the scope of this disclosed invention.
This application claims priority to U.S. Provisional Application No. 63/270,243, filed Oct. 21, 2021, which is incorporated by reference herein in its entirety.