Embodiments of the present disclosure relate to systems, methods, and computer-readable media for analyzing underlying relationships in data to efficiently train a machine learning model.
Despite the surge in self-supervised methods for learning representations that are then used to solve traditionally supervised tasks (e.g., classifying a picture as a cat or a dog, or segmenting objects in an image) without the need for labeled data, a large number of tasks remain for which supervised learning is essential. These are cases in which the task involves labels that cannot be inferred from the available input data, notwithstanding that joint learning across media types (e.g., image and text), where such data is available, can sometimes be used to avoid explicitly labeling data. One task where supervised learning cannot be entirely avoided is detecting the relation between two phrases in a sentence: many cases can be handled in a self-supervised manner, but a supervised model is still required to learn complex relationships between the phrases.
In one aspect, a method includes selecting a first plurality of input candidates from a corpus of data; mapping the first plurality of input candidates onto a pretrained vector space of a pretrained model; clustering the first plurality of input candidates in the pretrained vector space; adding the first plurality of input candidates to a plurality of queues for labelling; and labelling the first plurality of input candidates.
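For illustration purposes only, the candidate-selection flow recited above can be sketched as follows. The `embed` function here is a toy bag-of-characters stand-in for any pretrained space mapper, and the greedy cosine clustering is one non-limiting choice of clustering algorithm; neither is part of the claimed method.

```python
import math
from collections import defaultdict

def embed(text):
    # Stand-in for a pretrained mapper: a normalized bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))

def select_and_queue(corpus, n_candidates=6, threshold=0.8):
    """Select candidates, map them onto the (stubbed) pretrained space,
    greedily cluster them, and place them on per-cluster labelling queues."""
    candidates = corpus[:n_candidates]          # selection policy is a stub
    centroids, queues = [], defaultdict(list)
    for cand in candidates:
        v = embed(cand)
        best, best_sim = None, threshold
        for idx, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim >= best_sim:                 # join the most similar cluster
                best, best_sim = idx, sim
        if best is None:                        # start a new cluster (singleton)
            best = len(centroids)
            centroids.append(v)
        queues[best].append(cand)
    return queues
```

Each queue can then be handed to a human or algorithmic labeller, one queue per cluster, so that similar inputs are labelled together.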
In some embodiments, labelling the first plurality of input candidates is performed by humans.
In some embodiments, labelling the first plurality of input candidates is performed algorithmically.
In some embodiments, labelling includes identifying cluster centroids in the pretrained vector space.
In some embodiments, the pretrained vector space is created by mapping input to sparse/dense distributed representations.
In some embodiments, the pretrained vector space includes learned parameters of a probability distribution.
In some embodiments, the pretrained vector space is learned by performing density estimation.
In some embodiments, the pretrained model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the method further includes partitioning the labeled first plurality of input candidates into a train set, a development set, a test set, and an out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids in the pretrained vector space from the first plurality of input candidates to the train set; adding labeled cluster children in the pretrained vector space from the first plurality of input candidates to one of the development set and the test set; and adding labeled singletons in the pretrained vector space from the first plurality of input candidates to one of the train set and the out-of-distribution set.
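A minimal, non-limiting sketch of the partitioning step described above follows. It treats the first member of each cluster as the centroid and, of the two options the text allows for singletons (train set or out-of-distribution set), routes singletons to the out-of-distribution set.

```python
def partition(clusters):
    """Partition labelled clusters into train/dev/test/OOD sets:
    centroids go to train, cluster children alternate between dev and
    test, and singletons go to the out-of-distribution set."""
    splits = {"train": [], "dev": [], "test": [], "ood": []}
    for members in clusters:
        if len(members) == 1:
            splits["ood"].append(members[0])    # singleton
            continue
        splits["train"].append(members[0])      # first member stands in for centroid
        for i, child in enumerate(members[1:]):
            splits["dev" if i % 2 == 0 else "test"].append(child)
    return splits
```

Routing centroids to the train set is what makes the train set maximally span the labeled space, since every region of the clustered space contributes at least one training example.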
In some embodiments, the method further includes creating a fine tuned model.
In some embodiments, creating the fine tuned model includes using the pretrained model to create the fine tuned model.
In some embodiments, the method further includes assigning a first plurality of outputs using the fine tuned model.
In some embodiments, the fine tuned model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the method further includes evaluating performance of the fine tuned model on the test set, wherein evaluating the performance of the fine tuned model includes: mapping the test set onto a fine tuned vector space; clustering the test set in the fine tuned vector space; quantifying heterogeneity of test set clusters in the fine tuned vector space; and providing a confidence score for the fine tuned model.
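One illustrative way to quantify cluster heterogeneity and turn it into a confidence score, as recited above, is mean label purity over the test set clusters; this particular metric is an assumption of the sketch, not the only embodiment.

```python
from collections import Counter

def cluster_purity(cluster_labels):
    """Purity of one cluster: the fraction of members carrying the
    majority label (1.0 means perfectly homogeneous)."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / sum(counts.values())

def confidence_score(clusters):
    """Confidence for the fine tuned model: mean purity over all
    clusters of the test set in the fine tuned vector space."""
    purities = [cluster_purity(c) for c in clusters]
    return sum(purities) / len(purities)
```

A heterogeneous cluster (mixed labels mapped close together in the fine tuned space) lowers the score, flagging regions the model has difficulty separating.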
In some embodiments, the method further includes labelling a second plurality of input candidates from the corpus of data, wherein labelling the second plurality of input candidates includes: mapping the train set and development set onto the pretrained vector space and the fine tuned vector space; clustering the train set and development set in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; selecting the second plurality of input candidates such that the second plurality of input candidates are near to at least one of the heterogeneous clusters and singletons in the fine tuned vector space; and labelling the second plurality of input candidates.
In some embodiments, the method further includes partitioning the labeled second plurality of input candidates into the train set, the development set, the test set, and the out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids from the second plurality of input candidates to the train set; adding labeled cluster children from the second plurality of input candidates to one of the development set and the test set; and adding labeled singletons from the second plurality of input candidates to one of the train set and the out-of-distribution set.
In some embodiments, labelling of the second plurality of input candidates includes algorithmically labelling the second plurality of input candidates.
In some embodiments, the method further includes assigning the confidence score for the labelling of the second plurality of input candidates using a bipartite graph of the pretrained vector space and the fine tuned vector space.
In some embodiments, the method further includes evaluating performance of an ensemble of two or more fine tuned models on the test set, wherein evaluating the performance of the ensemble of two or more fine tuned models includes determining whether the ensemble of two or more fine tuned models concur on an output; mapping the train, development, and test sets onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using a bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the method further includes selecting a third plurality of input candidates from the corpus of data; labeling the third plurality of input candidates using the fine tuned model; mapping the third plurality of input candidates onto the pretrained vector space and a fine tuned vector space; clustering the third plurality of input candidates in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; and assigning a confidence score for the labelling of the third plurality of input candidates using a bipartite graph of the pretrained vector space and the fine tuned vector space.
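The bipartite-graph confidence recited above can be illustrated as follows. In this hedged sketch, each pretrained-space cluster is connected to the fine tuned-space clusters its members land in, and an input's confidence is the fraction of its pretrained-cluster peers that land in the same fine tuned cluster; the exact edge-weight statistic is an assumption of the example.

```python
from collections import Counter, defaultdict

def bipartite_confidence(pre_assign, fine_assign):
    """Confidence scores from the bipartite graph between pretrained and
    fine tuned cluster assignments. pre_assign[i] and fine_assign[i] are
    the cluster ids of input i in the two spaces."""
    edges = defaultdict(Counter)                # pretrained cluster -> fine clusters
    for p, f in zip(pre_assign, fine_assign):
        edges[p][f] += 1
    scores = []
    for p, f in zip(pre_assign, fine_assign):
        # Concentration of this input's edge: how many of its pretrained
        # peers agree with its fine tuned cluster.
        scores.append(edges[p][f] / sum(edges[p].values()))
    return scores
```

Inputs whose pretrained neighborhood is split across many fine tuned clusters receive low scores and become candidates for the "failed inputs" examination described below.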
In some embodiments, the method further includes labeling a third plurality of input candidates using an ensemble of two or more fine tuned models on the third plurality of input candidates; determining whether the ensemble of two or more fine tuned models concur on labeling of the third plurality of input candidates; mapping the third plurality of input candidates onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using a bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the method further includes selecting a plurality of failed inputs for examination, wherein the plurality of failed inputs are inputs of the third plurality of inputs that have a low confidence score; selecting a plurality of neighbors of each of the plurality of failed inputs; labelling the plurality of neighbors; and partitioning the plurality of neighbors into the train set, the development set, the test set, and the out-of-distribution set.
In some embodiments, partitioning the plurality of neighbors includes adding labeled cluster centroids from the plurality of neighbors to the train set; adding labeled cluster children from the plurality of neighbors to one of the development set and the test set; and adding labeled singletons from the plurality of neighbors to one of the train set and the out-of-distribution set.
In one aspect, a system includes a non-transitory memory and one or more hardware processors configured to read instructions from the non-transitory memory that, when executed, cause the one or more hardware processors to perform operations including: selecting a first plurality of input candidates from a corpus of data; mapping the first plurality of input candidates onto a pretrained vector space of a pretrained model; clustering the first plurality of input candidates in the pretrained vector space; adding the first plurality of input candidates to a plurality of queues for labelling; and labelling the first plurality of input candidates.
In some embodiments, labelling the first plurality of input candidates is performed by humans.
In some embodiments, labelling the first plurality of input candidates is performed algorithmically.
In some embodiments, labelling includes identifying cluster centroids in the pretrained vector space.
In some embodiments, the pretrained vector space is created by mapping input to sparse/dense distributed representations.
In some embodiments, the pretrained vector space includes learned parameters of a probability distribution.
In some embodiments, the pretrained vector space is learned by performing density estimation.
In some embodiments, the pretrained model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the operations further include partitioning the labeled first plurality of input candidates into a train set, a development set, a test set, and an out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids in the pretrained vector space from the first plurality of input candidates to the train set; adding labeled cluster children in the pretrained vector space from the first plurality of input candidates to one of the development set and the test set; and adding labeled singletons in the pretrained vector space from the first plurality of input candidates to one of the train set and the out-of-distribution set.
In some embodiments, the operations further include creating a fine tuned model.
In some embodiments, creating the fine tuned model includes using the pretrained model to create the fine tuned model.
In some embodiments, the operations further include assigning a first plurality of outputs using the fine tuned model.
In some embodiments, the fine tuned model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the operations further include evaluating performance of the fine tuned model on the test set, wherein evaluating the performance of the fine tuned model includes: mapping the test set onto a fine tuned vector space; clustering the test set in the fine tuned vector space; quantifying heterogeneity of test set clusters in the fine tuned vector space; and providing a confidence score for the fine tuned model.
In some embodiments, the operations further include labelling a second plurality of input candidates from the corpus of data, wherein labelling the second plurality of input candidates includes: mapping the train set and development set onto the pretrained vector space and the fine tuned vector space; clustering the train set and development set in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; selecting the second plurality of input candidates such that the second plurality of input candidates are near to at least one of the heterogeneous clusters and singletons in the fine tuned vector space; and labelling the second plurality of input candidates.
In some embodiments, the operations further include partitioning the labeled second plurality of input candidates into the train set, the development set, the test set, and the out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids from the second plurality of input candidates to the train set; adding labeled cluster children from the second plurality of input candidates to one of the development set and the test set; and adding labeled singletons from the second plurality of input candidates to one of the train set and the out-of-distribution set.
In some embodiments, labelling of the second plurality of input candidates includes algorithmically labelling the second plurality of input candidates.
In some embodiments, the operations further include assigning the confidence score for the labelling of the second plurality of input candidates using a bipartite graph of the pretrained vector space and the fine tuned vector space.
In some embodiments, the operations further include evaluating performance of an ensemble of two or more fine tuned models on the test set, wherein evaluating the performance of the ensemble of two or more fine tuned models includes determining whether the ensemble of two or more fine tuned models concur on an output; mapping the train, development, and test sets onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using a bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the operations further include selecting a third plurality of input candidates from the corpus of data; labeling the third plurality of input candidates using the fine tuned model; mapping the third plurality of input candidates onto the pretrained vector space and a fine tuned vector space; clustering the third plurality of input candidates in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; and assigning a confidence score for the labelling of the third plurality of input candidates using a bipartite graph of the pretrained vector space and the fine tuned vector space.
In some embodiments, the operations further include labeling a third plurality of input candidates using an ensemble of two or more fine tuned models on the third plurality of input candidates; determining whether the ensemble of two or more fine tuned models concur on labeling of the third plurality of input candidates; mapping the third plurality of input candidates onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using a bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the operations further include selecting a plurality of failed inputs for examination, wherein the plurality of failed inputs are inputs of the third plurality of inputs that have a low confidence score; selecting a plurality of neighbors of each of the plurality of failed inputs; labelling the plurality of neighbors; and partitioning the plurality of neighbors into the train set, the development set, the test set, and the out-of-distribution set.
In some embodiments, partitioning the plurality of neighbors includes: adding labeled cluster centroids from the plurality of neighbors to the train set; adding labeled cluster children from the plurality of neighbors to one of the development set and the test set; and adding labeled singletons from the plurality of neighbors to one of the train set and the out-of-distribution set.
In one aspect, a non-transitory computer-readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to perform operations including: selecting a first plurality of input candidates from a corpus of data; mapping the first plurality of input candidates onto a pretrained vector space of a pretrained model; clustering the first plurality of input candidates in the pretrained vector space; adding the first plurality of input candidates to a plurality of queues for labelling; and labelling the first plurality of input candidates.
In some embodiments, labelling the first plurality of input candidates is performed by humans.
In some embodiments, labelling the first plurality of input candidates is performed algorithmically.
In some embodiments, labelling includes identifying cluster centroids in the pretrained vector space.
In some embodiments, the pretrained vector space is created by mapping input to sparse/dense distributed representations.
In some embodiments, the pretrained vector space includes learned parameters of a probability distribution.
In some embodiments, the pretrained vector space is learned by performing density estimation.
In some embodiments, the pretrained model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the operations further include partitioning the labeled first plurality of input candidates into a train set, a development set, a test set, and an out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids in the pretrained vector space from the first plurality of input candidates to the train set; adding labeled cluster children in the pretrained vector space from the first plurality of input candidates to one of the development set and the test set; and adding labeled singletons in the pretrained vector space from the first plurality of input candidates to one of the train set and the out-of-distribution set.
In some embodiments, the operations further include creating a fine tuned model.
In some embodiments, creating the fine tuned model includes using the pretrained model to create the fine tuned model.
In some embodiments, the operations further include assigning a first plurality of outputs using the fine tuned model.
In some embodiments, the fine tuned model is selected from a group consisting of transformers, convolutional neural networks, recurrent neural networks, graph neural networks, and combinations thereof.
In some embodiments, the operations further include evaluating performance of the fine tuned model on the test set, wherein evaluating the performance of the fine tuned model includes: mapping the test set onto a fine tuned vector space; clustering the test set in the fine tuned vector space; quantifying heterogeneity of test set clusters in the fine tuned vector space; and providing a confidence score for the fine tuned model.
In some embodiments, the operations further include labelling a second plurality of input candidates from the corpus of data, wherein labelling the second plurality of input candidates includes: mapping the train set and development set onto the pretrained vector space and the fine tuned vector space; clustering the train set and development set in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; selecting the second plurality of input candidates such that the second plurality of input candidates are near to at least one of the heterogeneous clusters and singletons in the fine tuned vector space; and labelling the second plurality of input candidates.
In some embodiments, the operations further include partitioning the labeled second plurality of input candidates into the train set, the development set, the test set, and the out-of-distribution set, wherein partitioning includes: adding labeled cluster centroids from the second plurality of input candidates to the train set; adding labeled cluster children from the second plurality of input candidates to one of the development set and the test set; and adding labeled singletons from the second plurality of input candidates to one of the train set and the out-of-distribution set.
In some embodiments, labelling of the second plurality of input candidates includes algorithmically labelling the second plurality of input candidates.
In some embodiments, the operations further include assigning the confidence score for the labelling of the second plurality of input candidates using a bipartite graph of the pretrained vector space and the fine tuned vector space.
In some embodiments, the operations further include evaluating performance of an ensemble of two or more fine tuned models on the test set, wherein evaluating the performance of the ensemble of two or more fine tuned models includes determining whether the ensemble of two or more fine tuned models concur on an output; mapping the train, development, and test sets onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using a bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the operations further include selecting a third plurality of input candidates from the corpus of data; labeling the third plurality of input candidates using the fine tuned model; mapping the third plurality of input candidates onto the pretrained vector space and a fine tuned vector space; clustering the third plurality of input candidates in the pretrained vector space and the fine tuned vector space; identifying heterogeneous clusters and singletons in the fine tuned vector space; and assigning a confidence score for the labelling of the third plurality of input candidates using the bipartite graph of the pretrained vector space and the fine tuned vector space.
In some embodiments, the operations further include labeling a third plurality of input candidates using an ensemble of two or more fine tuned models on the third plurality of input candidates; determining whether the ensemble of two or more fine tuned models concur on labeling of the third plurality of input candidates; mapping the third plurality of input candidates onto one or more pairs of pretrained vector spaces and fine tuned vector spaces; and assigning a confidence score for each of the two or more fine tuned models using the bipartite graph for each of the one or more pairs of pretrained vector spaces and fine tuned vector spaces.
In some embodiments, the operations further include selecting a plurality of failed inputs for examination, wherein the plurality of failed inputs are inputs of the third plurality of inputs that have a low confidence score; selecting a plurality of neighbors of each of the plurality of failed inputs; labelling the plurality of neighbors; and partitioning the plurality of neighbors into the train set, the development set, the test set, and the out-of-distribution set.
In some embodiments, the operations further include adding labeled cluster centroids from the plurality of neighbors to the train set; adding labeled cluster children from the plurality of neighbors to one of the development set and the test set; and adding labeled singletons from the plurality of neighbors to one of the train set and the out-of-distribution set.
Any one of the embodiments disclosed herein may be properly combined with any other embodiment disclosed herein. The combination of any one of the embodiments disclosed herein with any other embodiments disclosed herein is expressly contemplated.
The objects and advantages will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:
It has been discovered that curation of labeled data sets for training a supervised model on tasks involving supervised learning is one of the practical challenges of machine learning, both in terms of the cost and the time involved in curating those labeled data sets. Because there is, to date, no well-defined methodology for optimal curation of data sets, a variety of methods are used in practice to reduce the labeling effort, such as weakly supervised learning or attempts to automate the labeled-data creation process with human assistance. However, while models are then evaluated on such data sets to assess their performance, there is no quantitative metric for assessing the labeled data set itself, particularly the breadth of coverage of the training set or the methodology used to partition the data set into train and dev/test splits. This ad hoc training data set creation process has a direct bearing on model performance when the labeled data set is small, a situation that is not infrequent in practice given the cost of creating labeled data sets. For instance, a suboptimal split of the labeled data set can not only prevent models from scoring high on the test set but also prevent models from learning maximally from the train set. Moreover, labeled data sets are used purely for training and testing the model; they are not used once the model is deployed to score the confidence of model outputs or to determine out-of-distribution cases.
In addition to the challenge of optimally curating labeled data to train supervised models, supervised models, particularly neural networks trained on high-dimensional data, suffer from two problems during inference: they often wrongly choose a particular class over other classes with a high score ("confidently wrong"), and they fail on out-of-distribution cases. A model's output on a "near-OOD" input poses the additional challenge of determining whether the model is successfully generalizing from its train set learning or is simply wrong.
Out-of-distribution cases are inevitable in supervised learning, since a supervised model learns only the conditional distribution of the labels given the input, which is typically drawn from a small subset of the available data, as opposed to learning the underlying distribution the input data comes from by making use of the entirety of the available data. One of the primary reasons models with strong performance on popular benchmarks exhibit poor production performance is inadequate coverage of the input space by the training set, which can in part be addressed by leveraging a diverse labeled data set at inference time for ensembling or out-of-distribution estimation.
Using models that perform density estimation of the input space to address out-of-distribution scenarios, particularly using their generative capacity (density estimation models are often directly generative or, in most cases, can be made generative with minor enhancements or modifications), can yield anomalous results even when the model has learned the training set: the model can generate samples that appear representative of the training set yet are far from the actual training set.
Furthermore, with models that learn rich representations in a self-supervised manner (e.g., an autoencoder model such as BERT, or contrastive and non-contrastive learning methods for images using transformers, ResNets, etc.), explicit density estimation is often unnecessary in practice, particularly since such models can learn rich representations without density estimation and can be leveraged both to train a supervised model and to assist in estimating out-of-distribution cases as described herein. However, the disclosed systems and methods do not preclude the use of models that can both perform density estimation and learn rich representations (e.g., autoregressive models).
The systems and methods described herein offer a working solution to all of the problems listed above, including (1) optimal labeling and partitioning of a data set for training a supervised model, (2) making model outputs interpretable to some degree, and (3) improving model performance by reducing the "confidently wrong" cases and increasing a model's out-of-distribution (OOD) performance, both of which are done in an interpretable manner, in contrast to the opaque process by which neural networks map input to output. The methods described herein are made possible largely by self-supervised models (also referred to as "foundation models" for their utility both for fine tuning and for direct use as is, without fine tuning, as illustrated in the embodiments below) that can learn from an entire corpus, as opposed to supervised models whose learning is constrained to labeled data, which is limited given the need for labeling by humans.
In one aspect, a system and method are disclosed to improve the efficacy of supervised learning. Specifically, the system and method are described according to the following embodiments.
In some embodiments, the system and method outline a procedure to find candidates for labeling in an optimal manner to reduce the labeling effort which involves humans.
In some embodiments, the system and method outline a procedure to quantitatively assess the labeled data that is created with human involvement.
In some embodiments, the system and method outline a procedure to quantify the uncertainty in labeling an input when multiple humans or autonomous agents label the input, and to use that measure at test time to measure the model's uncertainty on the same input.
In some embodiments, the system and method outline a procedure to partition the labeled data set from the previous step into training and dev/test sets such that the training set maximally spans the labeled space. This serves both to improve model performance and to reduce out-of-distribution cases, with respect to the labeled data set, at inference time.
In some embodiments, the system and method outline a procedure to quantitatively assess model output at inference time by leveraging what the model was trained on. This offers a means to disentangle model failures from cases where the model has truly generalized from the training set.
In some embodiments, the system and method outline a procedure to test a model on out-of-distribution (OOD) input separate from the test set.
In some embodiments, the system and method outline a procedure to leverage a human labeled data set for algorithmic labeling to expand the human labeled data set with minimal human effort (verifying the automatic labeling as opposed to actually labeling).
In some embodiments, the system and method outline a procedure to leverage the entire labeled data set during model deployment to improve model performance. In some embodiments, the system and method determine one or more of the following: how to utilize the labeled data set as a reference to quantify model uncertainty for a single input; how to use the labeled data set to create an interpretable output label for a given input, in contrast to its opaque counterpart where the learning from a subset of the labeled data (the typical use of the training set) is incorporated into the model parameters; and how to effectively multiply the labeled data set by using an ensemble of models, which both improves model performance and quantifies model uncertainty.
In some embodiments, the system and method leverage the procedure described above for creating the labeled data set after model deployment as well, to continue retraining the model on out-of-distribution cases encountered during production usage.
In some embodiments, an implementation of the system and method described above uses an embedding space that the input is mapped to. This mapping could be performed by a variety of means, including but not limited to self-supervised methods. The input could be in one or more modalities such as text, image, or audio, and the embedding spaces these modalities are mapped to could be distinct or jointly learned. For the purposes of this document, this embedding space or spaces will be collectively referred to as the pre-trained space. The model or models that map input to this space are referred to henceforth as pre-trained space mappers. In some embodiments, the pre-trained space is used in one or more of the following ways: to find candidates for labeling, optionally removing noise; to partition the input into training and dev/test sets; to predict out-of-distribution inputs at inference time; to add a level of interpretability to model output; and, after model deployment, to further optimally retrain the model.
In some embodiments, an implementation of the system and method described above uses the embedding space of the supervised models (optionally trained by the methods described herein) that the input is mapped to. The input could comprise one or more modalities, such as text, image, or audio, and the embedding spaces these modalities are mapped to could be distinct or jointly learned. For the purpose of this document, this embedding space or spaces will be collectively referred to as the fine tuned space. The model or models that map input to this space are referred to henceforth as fine tuned space mappers. The fine tuned space is used in one or more of the following ways: to partition the input space once a fine tuned model has been bootstrapped into creation, optionally using the pretrained space; to create an interpretable output label for an input as a counterpart to the typically opaque model output; to ascribe confidence to model output and identify those inputs that the model has difficulty separating into distinct classes; to predict out-of-distribution inputs at inference time; to increase the chances of identifying cases where the model is confidently wrong in its output; after model deployment, to create weakly labeled data and use it for subsequent input classification; and, after model deployment, to further optimally retrain the fine tuned model.
In some embodiments, the pretrained space is a vector space created by mapping input to sparse/dense distributed representations.
In some embodiments, the pretrained space includes learned parameters of a probability distribution.
In some embodiments, the pretrained space is learned by a model performing density estimation (e.g. autoregressive models, BiGANs, Flow models, diffusion models).
In some embodiments, the pretrained space is learned by a model that does not perform density estimation but only learns representations (e.g. autoencoding models like BERT).
In some embodiments, the pretrained space mappers are trained in a supervised manner to create fine tuned space mappers.
In some embodiments, pretrained space mappers are distinct and decoupled from fine tuned space mappers.
In some embodiments, the pretrained and fine tuned space mappers are deep learning architectures, including but not limited to transformers, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and graph neural networks (GNNs).
In one embodiment, the fine tuned space mapper is a supervised model that is created from fine tuning a self-supervised model which was used to create the pretrained space (pre-trained space mapper).
In one embodiment, a self-supervised model is trained on the entire available input space of interest to map input to the pretrained space, and the pretrained space is used as follows: mapping a training candidate set (optionally a subset of the entire available input space that serves as the candidate set for labeling to train the supervised model) to the pretrained space using the pretrained space mapper; clustering the training candidate set in the pretrained space to identify cluster centroids that are then chosen for labeling by humans or by some automated method; optionally culling noise from the training candidate set using the pretrained space; choosing candidates for labeling from each cluster, where the number of candidates is determined by the number of classes the supervised model is trained to output; choosing candidates for labeling from those inputs that did not form clusters or belong to one (singletons); partitioning the labeled data set into train, dev, and test sets such that the train set contains the centroids and at least one representative sample of each class or quantized range (if such additional representative samples exist in a cluster), the dev/test sets contain at least one of the other items in each cluster (with the split between dev and test determined purely by the desired dev/test ratio), and the train set contains all the labeled singletons except for a desired number that is added to an out-of-distribution (OOD) set; and bootstrapping the fine tuning of the supervised model using the labeled data set from the previous step.
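The selection and partitioning steps above can be sketched in a few lines. In this sketch, the greedy `leader_cluster` routine, the cosine-similarity threshold, and the toy two-dimensional embeddings are illustrative assumptions and not part of the disclosure; any clustering method over the pretrained space could be substituted.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def leader_cluster(embeddings, threshold=0.9):
    """Greedy one-pass clustering: each item joins the first cluster whose
    leader (a stand-in for the centroid) is within the similarity threshold,
    otherwise it starts a new cluster."""
    clusters = []  # list of lists of indices; clusters[i][0] is the leader
    for idx, emb in enumerate(embeddings):
        for members in clusters:
            if cosine(embeddings[members[0]], emb) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

def partition(clusters):
    """Leaders (centroid proxies) and singletons go to the train set;
    remaining cluster children are held out for dev/test, per the
    partitioning scheme described above."""
    train, dev_test = [], []
    for members in clusters:
        train.append(members[0])      # centroid / leader
        if len(members) == 1:
            continue                  # singleton stays in train
        dev_test.extend(members[1:])  # children held out
    return train, dev_test

embs = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99), (-1.0, 0.0)]
clusters = leader_cluster(embs, threshold=0.95)
train, dev_test = partition(clusters)
```

Here the two near-duplicate pairs form clusters, the lone opposite-direction vector remains a singleton, and the train set receives the two leaders plus the singleton.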
In some embodiments, for a supervised model whose output is continuous, candidates for labeling from each cluster could be representative samples of quantized ranges that span the continuous range of interest. In some embodiments, choosing candidates includes applying the clustering process iteratively on the identified clusters with relaxed clustering constraints if the number of clusters is too large for labeling.
In some embodiments, choosing candidates for labeling where the candidates did not form clusters or belong to one includes applying the clustering process iteratively with relaxed or tight clustering constraints on the identified singletons if the singleton set is too small or large for labeling.
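The iterative relax/tighten loop of the two preceding embodiments might be sketched as follows. The `grid_cluster` stand-in (1-D binning) and the integer threshold scale are purely illustrative assumptions; any clustering routine parameterized by a similarity constraint could be dropped in.

```python
def grid_cluster(points, threshold):
    """Toy 1-D clustering stand-in: bin points into cells whose width grows
    as the threshold is relaxed."""
    width = max((100 - threshold) / 10.0, 1e-6)
    bins = {}
    for p in points:
        bins.setdefault(int(p / width), []).append(p)
    return list(bins.values())

def fit_to_budget(points, cluster_fn, budget, threshold=95, step=5, max_iters=20):
    """Relax the clustering constraint while there are too many clusters for
    the labeling budget; tighten it while there are too few."""
    clusters = cluster_fn(points, threshold)
    for _ in range(max_iters):
        clusters = cluster_fn(points, threshold)
        if len(clusters) > budget:
            threshold -= step   # relax: merge more items together
        elif len(clusters) < budget:
            threshold += step   # tighten: split clusters apart
        else:
            break
    return clusters, threshold

points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9]
clusters, threshold = fit_to_budget(points, grid_cluster, budget=2)
```

Starting from a tight constraint that yields three clusters, the loop relaxes the threshold until the two-cluster labeling budget is met.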
In some embodiments, the fine tuned space of the bootstrapped model is used for subsequent fine tuning of the supervised model as follows: passing the entire labeled data set through the fine tuned space mapper and clustering the entire bootstrapped labeled set in the fine tuned space as well; examining the mapping characteristics of the labeled data between the pretrained space and the fine tuned space, e.g., by treating the mapping from pretrained space to fine tuned space as a bipartite graph and quantifying the mapping of input from pretrained space clusters to fine tuned space clusters; and using these mapping characteristics to further inform the choice of candidates to label to improve model performance, in addition to the use of the pretrained space to choose labeling candidates as described earlier.
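One possible way to quantify the bipartite mapping just described is to count, for each (pretrained cluster, fine tuned cluster) pair, how many inputs make that transition. The dict-based data layout and the `fan_out` helper below are illustrative assumptions.

```python
from collections import defaultdict

def bipartite_edges(pre_assign, fine_assign):
    """Map (pretrained cluster, fine tuned cluster) -> number of inputs that
    made that transition. High fan-out flags pretrained clusters whose
    members scatter in the fine tuned space and may merit more labeling."""
    edges = defaultdict(int)
    for item, pre_c in pre_assign.items():
        edges[(pre_c, fine_assign[item])] += 1
    return dict(edges)

def fan_out(edges, pre_cluster):
    """Number of distinct fine tuned clusters a pretrained cluster spreads into."""
    return len({f for (p, f) in edges if p == pre_cluster})

# Hypothetical assignments: input id -> cluster id in each space.
pre = {"a": 0, "b": 0, "c": 0, "d": 1}
fine = {"a": 10, "b": 10, "c": 11, "d": 12}
edges = bipartite_edges(pre, fine)
```

In this toy example, pretrained cluster 0 splits into two fine tuned clusters, which under the approach above would make its members candidates for further labeling.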
In some embodiments, the clusters in pretrained and fine tuned space, as well as the mapping characteristics from pre-trained to fine tuned space, are utilized to estimate out-of-distribution candidates at inference time.
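A minimal sketch of one such out-of-distribution estimate at inference time: treat the labeled set's cluster centroids as the in-distribution reference and flag inputs whose embedding lies outside every cluster's radius. The radius threshold and the Euclidean metric are illustrative assumptions.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_centroid_distance(x, centroids):
    return min(euclidean(x, c) for c in centroids)

def is_out_of_distribution(x, centroids, radius=1.0):
    """Flag inputs whose embedding is farther than `radius` from every
    labeled cluster centroid."""
    return nearest_centroid_distance(x, centroids) > radius

centroids = [(0.0, 0.0), (5.0, 5.0)]
near = is_out_of_distribution((0.5, 0.0), centroids)   # inside a cluster radius
far = is_out_of_distribution((10.0, 10.0), centroids)  # outside every radius
```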
In some embodiments, the heterogeneity measure of the clusters in the fine tuned space is utilized to ascribe confidence scores to model output (this could lead to an additional category of “can't say” in the model output). This helps reduce cases where the model is confidently wrong.
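One simple heterogeneity measure consistent with the above is label purity: the fraction of a cluster's members carrying the majority label. The purity cutoff and the helper names below are illustrative assumptions, not the disclosed metric.

```python
from collections import Counter

def cluster_purity(labels):
    """Fraction of cluster members carrying the majority label (1.0 = homogeneous)."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def classify_with_confidence(model_label, cluster_labels, min_purity=0.8):
    """Downgrade the model output to "can't say" when the input lands in a
    heterogeneous cluster, reducing confidently-wrong outputs."""
    purity = cluster_purity(cluster_labels)
    if purity < min_purity:
        return "can't say", purity
    return model_label, purity

# Input lands in a mixed cluster: output is withheld.
out, conf = classify_with_confidence("cat", ["cat", "cat", "cat", "dog"])
```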
In some embodiments, an ensemble of models optionally trained by the above described methods is utilized to determine model output for each input, particularly using the confidence scores of the model outputs as a metric to weight the model outputs.
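Confidence-weighted ensembling as described above can be sketched as a weighted vote; the (label, confidence) pair format is an assumed interface.

```python
from collections import defaultdict

def ensemble_vote(outputs):
    """outputs: list of (label, confidence) pairs, one per ensemble member.
    Each model's vote is weighted by its confidence score; the label with
    the largest total weight wins."""
    weights = defaultdict(float)
    for label, confidence in outputs:
        weights[label] += confidence
    return max(weights, key=weights.get)

# One confident model outweighs two hesitant ones only if their total is lower.
label = ensemble_vote([("cat", 0.9), ("dog", 0.4), ("cat", 0.3)])
```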
In some embodiments, an ensemble of models optionally trained by the above described methods is utilized to estimate out-of-distribution candidates with respect to the train set. This ensemble is not limited to supervised models trained by the method described above—it could include the self-supervised model itself that could potentially be adapted to classify without labeled data, even if only with limited capabilities, and supervised models created by other means.
In some embodiments, an ensemble of models is utilized to output a confidence score that captures model uncertainty where the ensemble is unable to decide on the output with enough certainty. This score reflects both out-of-distribution cases and cases where the input falls into heterogeneous clusters for most models.
In some embodiments, the methods described above are utilized to create additional training data to further train the supervised models on “low confidence” samples it encounters during production use. In this case each “low confidence” sample, regardless of the model classifying it correctly, is treated analogous to a cluster centroid encountered in pretrained space and representative samples of the different classes (or quantized ranges) are picked, if available, to train the supervised model and improve future model performance.
In some embodiments, the utility of the train/dev/test data sets, as well as the OOD set, is extended beyond its traditional pre-deployment role to continuous lifelong learning: the sets are updated with new data encountered in production that a model struggles with or fails on, and they are used to ascribe confidence to model outputs based on the mapping of input from the pretrained space to the fine tuned space and on the characteristics of the set the input belongs to in each space.
In some embodiments, the methods and systems include use of the embedding space created by a model that is trained by some means (e.g. self-supervised, or even supervised) to cluster the unlabeled data candidates, a subset of which is used to train a supervised model.
In some embodiments, the methods and systems include use of the clusters to choose candidates for labeling to train the supervised model.
In some embodiments, the methods and systems include using clustering characteristics of input in the pretrained space as a means to partition the labeled input into train and dev/test sets.
In some embodiments, the methods and systems include training the supervised model with those candidates.
In some embodiments, the methods and systems include use of the fine tuned embedding space of the supervised model to further choose candidates to create newer versions of the fine tuned model.
In some embodiments, the methods and systems include use of one or more embeddings space learned independently or jointly to choose candidates for fine tuning a model.
In some embodiments, the methods and systems include use of two embedding spaces and the mapping characteristics of input between those two spaces as a means to choose candidates for labeling as well as to detect out-of-distribution cases.
In some embodiments, the methods and systems include using clustering characteristics of input in the pretrained and/or fine tuned space as a means to partition the labeled input into train and dev/test sets.
In some embodiments, the methods and systems include use of one or more embedding spaces to generate interpretable outputs.
In some embodiments, the methods and systems include use of one or more embedding spaces to identify out-of-distribution candidates.
In some embodiments, the methods and systems include use of one or more embedding spaces to retrain a model on out-of-distribution cases encountered at inference time.
In some embodiments, the methods and systems include use of one or more embedding spaces to retrain a model on cases it failed at inference time.
In an embodiment, shown in
In some embodiments, a fine tuned model is created by starting with a pretrained model (e.g., a trained self-supervised model) and then adding an additional layer (typically called a head) to the pretrained model. In some embodiments, the choice of additional layer is specific to the fine tuning task. For example, when fine tuning a model for a classification task, an additional classification layer is added on top of the pretrained layers. In some embodiments, the weights of the additional layer are updated during the fine tuning task. In other embodiments, the weights of the pretrained model may also be updated during fine tuning. In some embodiments, the choice of which weights to update is driven by the amount of data available for fine tuning. For example, when there is a large amount of fine tuning data, the pretrained model weights are often updated in addition to the weights of the layer added specifically for fine tuning. When the amount of fine tuning data is smaller, typically the pretrained model weights are frozen and only the weights of the additional layer (the head) are updated. In some embodiments, the number of pretrained model layers that are frozen or updated during fine tuning is determined by practitioners based on the amount of data available for fine tuning.
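The head-on-frozen-encoder pattern above can be illustrated framework-independently with numpy: the "pretrained" encoder weights are held fixed while only a small logistic head is trained by gradient descent. The random encoder, the head architecture, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(4, 3))        # stand-in "pretrained" encoder, frozen
W_enc_before = W_enc.copy()            # kept to verify the encoder never changes

def encode(x):
    return np.tanh(x @ W_enc)          # frozen pretrained mapping

def train_head(X, y, lr=0.5, steps=200):
    """Train only the classification head; W_enc is never touched."""
    w = np.zeros(3)                    # head weights: the only trained part
    b = 0.0
    H = encode(X)                      # encoder run once, as its weights are frozen
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))   # sigmoid head
        grad = p - y                             # logistic-loss gradient
        w -= lr * H.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

X = rng.normal(size=(20, 4))
y = (X[:, 0] > 0).astype(float)
w, b = train_head(X, y)
preds = (1.0 / (1.0 + np.exp(-(encode(X) @ w + b))) > 0.5).astype(float)
```

Unfreezing some encoder layers would correspond to also applying gradient updates to `W_enc`, which, per the discussion above, is typically only done when ample fine tuning data is available.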
The approach described above can be used to pretrain and fine tune one or more models, with additional selective labeling being performed iteratively if required to improve a fine tuned model performance.
In an embodiment, additional labeling is also done algorithmically (by autonomous agents), and a confidence score is assigned to the algorithmically labeled data, leveraging the labeled data humans have already created and described in
In an embodiment, when multiple agents, human or autonomous, label a particular input, the disagreement between agents is captured in an uncertainty measure that is used at test time to check a model's uncertainty (or an ensemble uncertainty) on the same input. This approach of using a model's ensemble confidence score for an input is in effect a soft analogue to creating a separate “can't say” class distinct from all classification classes. Such inputs are then logged to continually retrain the model for its full life cycle as described below.
The present disclosure also describes a means to quantify versions of a data set as it is being created, as well as to compare any two data sets by a comparison metric at the data set level. For instance, two data sets are deemed similar if their comparison metric is 1, orthogonal if it is 0, and dissimilar if it is −1. Two similar data sets A and B would have a comparison score greater than the comparison score between data sets A and C, where A and C are more dissimilar than A and B. In some embodiments, the comparison score is determined using the average cosine distance between sentences in the data sets. For example, if there are M sentences in A and N in B, the pairwise dot products yield M*N values. In some embodiments, the average of these dot products is the comparison score.
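The comparison score above can be sketched as the mean of the M*N pairwise dot products between unit-normalized sentence embeddings, which lands in [−1, 1]. The toy two-dimensional "embeddings" are illustrative assumptions.

```python
import math

def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def comparison_score(set_a, set_b):
    """Mean of the M*N pairwise dot products of unit-normalized embeddings:
    1 for similar sets, 0 for orthogonal sets, -1 for dissimilar sets."""
    a = [normalize(v) for v in set_a]
    b = [normalize(v) for v in set_b]
    total = sum(sum(x * y for x, y in zip(u, v)) for u in a for v in b)
    return total / (len(a) * len(b))

A = [(1.0, 0.0), (1.0, 0.0)]
B = [(1.0, 0.0)]
C = [(0.0, 1.0)]
D = [(-1.0, 0.0)]
score_ab = comparison_score(A, B)   # identical direction: similar
score_ac = comparison_score(A, C)   # orthogonal
score_ad = comparison_score(A, D)   # opposite direction: dissimilar
```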
Additionally, the individual model performance of two ensemble models trained by the methods described herein was 94.5% and 95.5% (F1-score), and the ensemble score was 97%. The OOD performance for a second use case, also a binary classifier with an ensemble of two models, was (93%, 94%).
Table 1 shows utilization of pretrained and fine tuned spaces in the different stages. Pretrained models are used from the instance they are created through to the full lifespan of the ensemble deployment. Pretrained models are retrained too, as the corpus changes, though this frequency is typically less than fine tuning model retraining. Fine tuned models are retrained at a higher frequency and are also used for the full lifespan of ensemble deployment, until better performing models potentially replace them.
Table 2 illustrates the labeling priority of input candidates, serving as a guideline for human labelers. The separation of unlabeled data into distinct categories (clusters, cluster children, singletons), as well as the ordering of unlabeled data within these categories, addresses some of the inefficiencies and problems associated with manual labeling, which is expensive in most cases. For instance, it addresses the problem of humans labeling near duplicates, if not exact duplicates. It also offers a means to quantify the labeling work not just in terms of raw counts of labeled data, but also in terms of the quality of the labeling, particularly its breadth. For instance, singletons, particularly ones that are farther from each other, are more valuable than those close to each other. Cluster children are necessary for dev and test sets, but one only needs to pick a few distant ones in each cluster (algorithmic determination of near and far items makes this choice easy). Despite all the algorithmic support, humans play a role in picking diverse as well as representative candidates for each output class of interest; the queues and the ordering of unlabeled candidates purely assist the user in this labeling process.
Tables 3 and 4 are examples of binary classifier performance for various partitionings of train/dev/test sets. When the train, dev, and test sets were partitioned algorithmically, the input space was clustered, and then cluster centroids and singletons were added to the train set, while cluster children were added to the dev/test sets. Performance tests include movement tests that determine the effect on model performance of moving certain types of inputs into different sets, compared to the algorithmic partitioning. For example, as shown in Table 3 and Table 4, when singletons were moved from the train set to the test set, such that only centroids remained in the train set, there was a decrease in model performance compared to algorithmic partitioning. A similar drop in model performance relative to algorithmic partitioning was observed when centroids were moved from the train set to the test set, such that only singletons remained in the train set. As shown in Table 4, moving centroid children from the test set to the train set did not improve model performance, because the centroids are already present in the train set. While the examples use binary classifiers to illustrate the methods described herein, these methods apply to any supervised learning problem, including but not limited to multi class classification, multi class multilabel classification, and continuous output models. Also, while the embodiments described herein are examples of treating each input as a whole for classification, the methods described herein do not preclude their use to classify parts of an input, such as tagging terms/phrases in text, classifying objects in an image, or segmenting objects in an image. These diverse sets of problems require an appropriate choice of models for the pretrained and fine tuned spaces to accomplish these tasks.
In some embodiments, models for pretrained space and fine tuned space are chosen such that those models yield representations that have good clustering properties, including number of clusters, size of clusters, and heterogeneity of clusters.
This causes a single input to map to multiple clusters in the fine tuned space, more than it does in the pretrained space, as can be seen by comparing 1202 and 1204. Charts 1202 and 1204 show histograms of how many clusters each input maps to. For example, in 1202, 2067 inputs map to singletons, 4530 inputs map to one cluster, and 868 inputs map to two clusters. In 1202, each input maps to up to seven clusters in the pretrained space, while, in 1204, each input maps to up to seventeen clusters in the fine tuned space. These clusters are created by passing the entire labeled input through both the pretrained and fine tuned models. At deployment time, a new input may remain a singleton or fall into one or more of these clusters, as seen in 1204. If an input falls into only one cluster, the input's label and uncertainty are determined from that single cluster. If an input falls into more than one cluster, each cluster provides information about the class type and the certainty of labeling for that input: both the heterogeneity of each individual cluster and the heterogeneity across the multiple clusters it falls into are signals of the class type. For example, if an input falls into one or more heterogeneous clusters, or into clusters of opposite sense, that is an indication of uncertainty in the labeling of that input. This is a key distinction of the method described herein: it utilizes the entire labeled data set as a baseline reference in vector spaces both to label new data and to complement an ensemble that includes supervised models in determining how to classify a new input.
When a supervised model produces an output for a given input, despite its known performance on a test set, it is not possible to know whether the model is generating a correct output by generalizing beyond the train set or whether it is confidently wrong (here, confidently wrong means the model's classification score for a particular class is unambiguously high). Vector spaces, particularly those of self-supervised pretrained models, tend to have predictable outcomes if used correctly. While the vector spaces of fine tuned models are subject to the limited learning from the train set, utilizing the entire train set as a reference to help classify an input tends to reduce the vagaries and opacity of model learning as applied to a single input, and adds a level of interpretability to model output through the clustering properties of the entire labeled data set, which can be studied offline and selected for by the right choice of clustering hyperparameters. Leveraging the entire data set by mapping it to the pretrained and fine tuned vector spaces infuses predictability and interpretability into an input-to-output mapping that is otherwise opaque.
Table 5 illustrates the movement of input between the bipartite graph of clusters in pretrained space and fine tuned space.
Table 6 shows an example where two ensemble models of a binary classification use case disagree in their output. For each model, the table shows the output of the model for each input, the confidence ratio for each output, the number of clusters that input maps to that are predominantly Label0, and the number of clusters that input maps to that are predominantly Label1. The confidence ratio is calculated by dividing the number of clusters that input maps to that are predominantly the output label by the number of clusters that input maps to that are predominantly the other label. The clusters mirror the disagreement in output in their cluster counts in this case, although this need not be the case. For example, for input data_5105, Model 1 assigns the output Label2, and Model 2 assigns the output Label1. Model 1 maps this input to 59 clusters that are predominantly Label1 and to 332 clusters that are predominantly Label2, while Model 2 maps this input to 58 clusters that are predominantly Label1 and no clusters that are predominantly Label2. Since Model 1 maps the input to both clusters that are predominantly Label1 and clusters that are predominantly Label2, Model 1 has some uncertainty in its labeling of this input, shown by an uncertainty of 0.18 (calculated by dividing 59 by 332). In contrast, Model 2 maps this input only to clusters that are predominantly Label1 and therefore has an uncertainty of zero for its labeling of this input. The ensemble uncertainty for an input is captured in general by disagreements between an individual model's output and the corresponding cluster counts in its fine tuned space, as well as by disagreements across model results, as illustrated in Table 6.
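The per-model uncertainty in the data_5105 example can be computed directly as the ratio of disagreeing to agreeing cluster counts. The function name below is an illustrative assumption; the numbers follow the example above.

```python
def uncertainty_ratio(agreeing, disagreeing):
    """Ratio of clusters contradicting the model's output to clusters backing
    it; 0 means every cluster the input maps to supports the output."""
    if agreeing == 0:
        return float("inf")  # every mapped cluster contradicts the output
    return disagreeing / agreeing

# Model 1: 332 clusters back its output, 59 contradict it -> uncertainty ~0.18.
model1_uncertainty = uncertainty_ratio(agreeing=332, disagreeing=59)
# Model 2: all 58 mapped clusters back its output -> uncertainty 0.
model2_uncertainty = uncertainty_ratio(agreeing=58, disagreeing=0)
```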
When the uncertainty is beyond a certain threshold the output could be classified as “can't say”—an additional class to the existing classification classes, or alternatively the uncertainty can be used in conjunction with the predicted class.
Table 6 illustrates a couple of aspects unique to the methods disclosed herein. The heterogeneity measure for each model, in addition to capturing model uncertainty and OOD, also serves as a predictor of the result. So, given n classification models in an ensemble, there are effectively twice the number of model results to ensemble when determining the final result.
As shown in
Finally, as shown in
As shown in
Those of skill in the art would appreciate that the various illustrations in the specification and drawings described herein can be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware, software, or a combination depends upon the particular application and design constraints imposed on the overall system. Skilled artisans can implement the described functionality in varying ways for each particular application. Various components and blocks can be arranged differently (for example, arranged in a different order, or partitioned in a different way) all without departing from the scope of the subject technology.
Furthermore, an implementation of the communication protocol can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system, or other apparatus adapted for carrying out the methods described herein, is suited to perform the functions described herein.
A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein. The methods for the communications protocol can also be embedded in a non-transitory computer-readable medium or computer program product, which includes all the features enabling the implementation of the methods described herein, and which, when loaded in a computer system is able to carry out these methods. Input to any part of the disclosed systems and methods is not limited to a text input interface. For example, they can work with any form of user input including text and speech.
Computer program or application in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form. Significantly, this communications protocol can be embodied in other specific forms without departing from the spirit or essential attributes thereof, and accordingly, reference should be had to the following claims, rather than to the foregoing specification, as indicating the scope of the invention.
The communications protocol has been described in detail with specific reference to these illustrated embodiments. It will be apparent, however, that various modifications and changes can be made within the spirit and scope of the disclosure as described in the foregoing specification, and such modifications and changes are to be considered equivalents and part of this disclosure.
It is to be understood that the disclosed subject matter is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, systems, methods and media for carrying out the several purposes of the disclosed subject matter. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the disclosed subject matter.
It will be appreciated that while one or more particular materials or steps have been shown and described for purposes of explanation, the materials or steps may be varied in certain respects, or materials or steps may be combined, while still obtaining the desired outcome. Additionally, modifications to the disclosed embodiment and the invention as claimed are possible and within the scope of this disclosed invention.
This application claims priority to U.S. Provisional Application No. 63/270,243, filed Oct. 21, 2021, which is incorporated by reference herein in its entirety.