The subject matter described herein relates to making machine learning more efficient by reducing the computing resources associated with active learning.
Machine learning (ML) models may learn via training. The ML model may take a variety of forms, such as an artificial neural network (or neural network, for short), decision trees, and/or the like. The training of the ML model may be supervised (with labeled training data), semi-supervised, or unsupervised. When trained, the ML model may be used to perform a task, such as an inference task.
In some embodiments, there is provided receiving, as an input to a first machine learning model, a plurality of data; learning, by the first machine learning model and based at least on the plurality of data, a latent space; generating, based on the plurality of data and the latent space, a proximity graph, wherein label knowledge from labeled data is diffused on a plurality of nodes of the proximity graph; filtering, by the proximity graph, the plurality of nodes to provide a top k most uncertain nodes, wherein the top k most uncertain nodes form a subset of a plurality of unlabeled data; and providing the subset of the plurality of unlabeled data to a second machine learning model comprised in an active learning process. Related system, methods, and articles of manufacture are also disclosed.
In some variations, one or more of the features disclosed herein including the following features can optionally be included in any feasible combination. The first machine learning model may include a variational auto encoder including an encoder and a decoder, wherein the encoder is coupled to an encoder input to receive the plurality of data and an encoder output provides the latent space, and wherein a decoder output reproduces a representation of the input. Each node of the proximity graph corresponds to a data sample from an unlabeled pool or from a labeled training set. The latent space is a lower order dimension when compared to a dimension of the input of the first machine learning model or an output of the first machine learning model. The proximity graph includes at least one edge coupling at least a pair of nodes, wherein the label knowledge is diffused based on a similarity metric between at least the pair of nodes. The filtering further may include ranking, based on uncertainty of the diffused labels, the plurality of nodes of the proximity graph from a high uncertainty value to a low uncertainty value and in response to the ranking, selecting the top k most uncertain nodes, wherein the top k most uncertain nodes are associated with magnitudes of diffused labels. The second machine learning model outputs at least a portion of the subset of the unlabeled data toward an oracle to label the portion of the subset of the unlabeled data. The second machine learning model includes a decoder of a variational auto encoder. 
There may also be provided updating the label knowledge with one or more additional labels for at least a portion of the subset of the unlabeled data, wherein the updated label knowledge is diffused among the plurality of nodes of the proximity graph; filtering, by the proximity graph and based on the updated label knowledge, the plurality of nodes to provide an updated top k most uncertain nodes, wherein the updated top k most uncertain nodes form an updated subset of the unlabeled data; and providing the updated subset to the second machine learning model comprised in the active learning process.
The above-noted aspects and features may be implemented in systems, apparatus, methods, and/or articles depending on the desired configuration. The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims.
In the drawings,
Like labels are used to refer to the same or similar items in the drawings.
Machine learning (ML) and, in particular, deep learning provides unprecedented performance in various semi-supervised learning tasks including speech recognition, computer vision, natural language processing, and the like. And in the case of deep Convolutional Neural Networks (CNN), the deep CNNs can recognize objects (in some cases better than a human). The success of ML comes with a need for massive amounts of labeled data to train the ML models used to perform a task, such as object recognition and the like. While data collection at a large scale may be considered relatively straightforward, the annotation of the collected data with labels is considered a bottleneck for successful ML training and execution.
Active learning may be used to select a set of data points for labeling to optimally minimize the error probability under a fixed budget of labeling effort. The phrase “active learning” refers to machine learning, in which a learning algorithm selectively queries an oracle (e.g., a user or an automated labeler) to selectively label the data. With active learning, the algorithm may proactively select a subset of data (also referred to as samples) to be labeled from a larger pool of unlabeled data. In this way, active learning may reduce the labeling effort needed from the oracle as part of training and developing ML models including semi-supervised ML models.
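The pool-based active learning loop described above can be illustrated with a minimal sketch. The toy one-dimensional pool, the centroid "model," the distance-to-boundary uncertainty measure, and the simulated oracle below are all illustrative assumptions standing in for the ML models and the oracle of the embodiments, not the disclosed method itself:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy pool: two 1-D Gaussian classes; the "oracle" is the hidden true label.
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
true_y = np.concatenate([np.zeros(100), np.ones(100)])

labeled = {0, 199}                         # start with one label per class
for _ in range(10):                        # fixed labeling budget: 10 queries
    idx = np.array(sorted(labeled))
    # Trivial stand-in "model": class centroids from the labeled samples.
    c0 = X[idx][true_y[idx] == 0].mean()
    c1 = X[idx][true_y[idx] == 1].mean()
    # Uncertainty: proximity to the decision boundary between the centroids.
    uncertainty = -np.abs(X - (c0 + c1) / 2)
    uncertainty[idx] = -np.inf             # never re-query labeled samples
    pick = int(np.argmax(uncertainty))     # most uncertain unlabeled sample
    labeled.add(pick)                      # "oracle" supplies true_y[pick]

accuracy = np.mean((X > (c0 + c1) / 2) == (true_y == 1))
```

With only 12 labels out of 200 samples, the queried points concentrate near the decision boundary, which is where labels are most informative for this kind of learner.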
Although the active learning of
In some embodiments, there are provided systems, articles of manufacture, and methods for selection of a restricted pool subset of unlabeled samples using latent graph diffusion.
In some embodiments, a graph, such as a proximity graph, is generated (e.g., computed, determined, etc.) once based on the latent space of a Variational Auto Encoder (VAE). The proximity graph includes nodes, each of which may correspond to a data sample in the pool of unlabeled data, such as unlabeled pool 195. And, the latent space (which includes labeling or classification knowledge) is used to diffuse (among the nodes of the proximity graph) possible, likely labels for the data samples of the unlabeled pool 195.
The system 100 also includes a machine learning (ML) model, such as a VAE 104. During learning, the VAE is trained using samples from the dataset 102 as inputs 106A. The unlabeled samples (which are selected from the dataset 102) are provided as an input 106A until the VAE learns to reproduce the input 106A at its output 106B. For example, the system 100 may iterate over a plurality of samples in the dataset 102 until the VAE learns to reproduce the input 106A at its output 106B. Although this example refers to the VAE receiving at the input 106A unlabeled samples from the dataset 102, the VAE may also receive one or more labeled data samples as well.
In some embodiments, the VAE 104 includes an encoder 108A and a decoder 108B, so during learning the encoder 108A encodes the input into a “latent” space 110, while the decoder 108B takes the encoded data (which is in the latent space or domain) and transforms it back into its original input form or domain. During training of the VAE 104, the encoder and decoder neural networks are optimized, such that if an unlabeled image “4” from the dataset 102 is provided as an input 106A, the encoder 108A encodes that input into the latent space 110 (e.g., “domain”), and the decoder 108B takes the encoded data and transforms it back into its original input form and provides it at the output 106B, which in this example would be a reproduced version of “4”. The training of the VAE may continue until the input 106A is reproduced at the output 106B with less than a threshold amount of mean squared error. Although the latent space is depicted at
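The notion of a lower-dimensional latent space that still reconstructs the input can be illustrated with a small sketch. The sketch below is not the VAE 104 itself: it substitutes a linear autoencoder (whose optimal encoder/decoder pair is given by principal components via the SVD) for the probabilistic encoder and decoder, and the toy data are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the pool: 200 samples in 10 dimensions that actually lie
# on a 2-dimensional subspace, like images governed by a few factors.
basis = rng.normal(size=(2, 10))
X = rng.normal(size=(200, 2)) @ basis

# "Encoder": project onto the top-2 principal directions (the optimal
# linear autoencoder).  "Decoder": map the 2-D codes back to 10 dimensions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                       # latent basis (2 x 10)

Z = Xc @ components.T                     # latent space: 200 x 2
X_hat = Z @ components + X.mean(axis=0)   # reconstruction at the "output"

mse = float(np.mean((X_hat - X) ** 2))    # near zero: input is reproduced
```

Because the toy samples lie on a 2-dimensional subspace, the 2-dimensional latent codes Z suffice to reproduce the input at the output with near-zero mean squared error, mirroring the stopping criterion described above.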
After training of the VAE 104, the latent space 110 may be used to generate a proximity graph 115 and, in particular, provide latent graph diffusion. The phrase “latent graph diffusion” refers to using the latent space 110 representation to build a proximity graph and then diffusing the label knowledge using weights on the proximity graph edges. The process of diffusion using the proximity graph is described further below.
The proximity graph 115 provides a structure where each node represents a sample in the dataset 102, and the nodes are connected via edges (e.g., links).
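A proximity graph of this kind may be sketched as follows, assuming latent vectors are already available. The helper name, the Gaussian edge-weight form, and the toy coordinates are illustrative assumptions rather than the specific graph construction of the embodiments:

```python
import numpy as np

def proximity_graph(Z, k=3, sigma=1.0):
    """Build a k-nearest-neighbor weighted adjacency matrix from latent
    vectors Z (one row per node/sample), with Gaussian edge weights
    w_ij = exp(-||z_i - z_j||^2 / sigma^2).  A hypothetical helper."""
    n = len(Z)
    # Pairwise squared Euclidean distances in the latent space.
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbors of node i (index 0 in the sort is i itself).
        nbrs = np.argsort(d2[i])[1:k + 1]
        W[i, nbrs] = np.exp(-d2[i, nbrs] / sigma**2)
    return np.maximum(W, W.T)  # symmetrize: edge kept if either end chose it

Z = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
W = proximity_graph(Z, k=2)
```

In this toy layout the first three nodes and the last two nodes form two tight groups, so edges within a group carry weights near 1 while cross-group weights are negligibly small.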
Referring again to
to form a weight for the edge. In the example of
Referring to
In the example of
Given for example an unlabeled dataset 102 of 10 million images, the graph 115 may output the top 500 (where k=500) images with respect to labeling. In this example, the 500 nodes (which correspond to unlabeled images from the pool dataset 102) are the most uncertain nodes with respect to their label assignment, so labeling by the oracle 190 should be performed. The top 500 images output at 120 are provided as an input to a second machine learning model, such as the ML model 212B, for further classification or label processing as part of active learning. In this example, the ML model 212B may (or may not) be able to label some of the top 500 images. The unlabeled images that cannot be classified by the ML model 212B are output at 220 by the ML model 212B and sent toward the oracle for a label query at 220. In this way, the proximity graph 115 with the diffusion of labels is able to reduce the resources needed by the ML model 212B (e.g., from 10 million images to the top 500 in this example). The subset at 120 may also reduce the overall candidate samples at 220 for labeling by the oracle 190, so the overall compute resources needed by the ML model 212B are reduced (which also decreases the ML learning time and decreases overall query times to the oracle 190).
After the oracle 190 labels some of the candidate samples provided at 220, these annotated samples are provided at 225 to the labeled training set 210. The labeled training set may be used to re-train at 230 the ML model 212A-B to determine labels for samples. Moreover, the additional training data may be used by the proximity graph 115 to further learn by diffusing classification knowledge with respect to labels among the nodes of the proximity graph 115. In other words, the structure of the proximity graph 115 does not change, but the diffusion of label information among nodes may be updated or change, so the certainty or uncertainty associated with labels for some of the nodes (which represent data samples in the dataset 102) may also change or be updated at 240. As noted with the example of
Referring again to
The graph-based diffusion of the labels via graph 115 used at
In some implementations, the diffusion of labels using the proximity graph to “filter” candidate unlabeled data samples from a pool of samples may improve active learning query time and scale to large datasets.
In some embodiments, a graph-based diffusion process is used to restrict or filter the candidate samples in a pool of unlabeled samples to a smaller subset of candidate samples, such that the smaller subset is used in an active learning process (which is more efficiently performed as a result of the smaller subset of candidate samples). In some implementations, the active learning process using the graph-based diffusion process may accelerate the query time, when compared to other active learning schemes.
At 302, the process may include receiving, as an input to a first machine learning model, a plurality of data, in accordance with some embodiments. For example, a first machine learning model, such as the VAE 104 depicted at
At 304, the process may include learning, by the first machine learning model and based at least on the plurality of data, a latent space, in accordance with some embodiments. For example, the first machine learning model, such as VAE 104, may learn to reproduce, at the decoder output 106B, the input images provided at the input 106A. In this example, the encoder output provides the latent space 110. When the first machine learning model learns to reproduce the input to within a threshold mean squared error, the latent space 110 may be used as a data representation because the latent space may provide structure correlated with the labeling knowledge. Alternatively, or additionally, the learning by the machine learning model may, as noted, include one or more labeled data samples as well. In the example of
At 306, the process may include generating, based on the plurality of data and the latent space, a proximity graph, wherein label knowledge from labeled data is diffused among a plurality of nodes of the proximity graph, in accordance with some embodiments. For example, a proximity graph, such as proximity graph 115, may be generated, such that the data samples received as an input at 106A are represented by nodes connected by edges. The nodes may correspond to data samples obtained from for example an unlabeled pool of data samples, such as dataset 102. Alternatively, or additionally, one or more of the nodes may comprise labeled data obtained from for example a labeled training set, such as labeled training set 210 (some of which may include training data classified by the oracle or labeled autonomously by the active learning). The label knowledge obtained from the training data may be diffused among the nodes of the proximity graph. The label knowledge can be diffused and the uncertainty (or, e.g., certainty) of this diffusion among the nodes may be used to select a subset of the nodes.
At 308, the process may include filtering, by the proximity graph, the plurality of nodes to provide a top k most uncertain nodes, wherein the top k most uncertain nodes form a subset of a plurality of unlabeled data, in accordance with some embodiments. For example, the proximity graph may be used to filter (e.g., sub-sample) the nodes, such that the k most uncertain nodes are selected and form a subset of the unlabeled data, which is provided to the second machine learning model 212B. In an implementation, the plurality of nodes of the proximity graph may be ranked from a high uncertainty value to a low uncertainty value. In response to the ranking for example, the top k most uncertain nodes (with respect to labeling) may be selected.
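The ranking-and-selection at 308 can be sketched in a few lines, assuming each node already carries a diffused label score in [−1, 1] whose small magnitude indicates high uncertainty (the scores below are made-up values for illustration only):

```python
import numpy as np

# Hypothetical per-node diffused label scores in [-1, 1]; magnitude near 0
# means the diffusion left the node's label uncertain.
scores = np.array([0.9, -0.05, 0.4, 0.02, -0.8, -0.1])
k = 3

# Rank nodes from high uncertainty (small |score|) to low uncertainty,
# then keep the top k most uncertain nodes.
order = np.argsort(np.abs(scores))
top_k_uncertain = order[:k]
```

Here nodes 3, 1, and 5 are selected because their diffused magnitudes are closest to zero; the corresponding unlabeled samples form the subset passed onward at 310.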
At 310, the process may include providing the subset of the plurality of unlabeled data to a second machine learning model comprised in an active learning process, in accordance with some embodiments. For example, the subset of the unlabeled data, such as images and/or the like, may be output at 120 towards an active learning process. In the example of
As noted, the active learning of
The following provides additional description with respect to semi-supervised learning, active learning, diffusion, and other related aspects.
Graph-based semi-supervised learning (GSSL) refers to machine learning that exploits the information of both the labeled and unlabeled datasets to learn a good classifier for unlabeled samples. As a form of semi-supervised learning, active learning automates the process of data acquisition from a large pool of unlabeled samples for annotation, which may achieve a certain performance of the classifier. GSSL (which is a type of semi-supervised learning) aims to represent the data in a graph, such that the label information of the unlabeled set can be inferred using the labeled data. For example, label propagation or Laplace learning may be used. With label propagation, label information is diffused from a labeled set to unlabeled instances. Moreover, the computational complexity of label propagation may be characterized as linear in the size of the data. The success of label propagation hinges on an informative graph that retains the similarity of the data points. Due to the volatile nature of image pixels (e.g., instability to noise, rotation, etc.), feature transformations may be applied to build good quality graphs, and variational auto-encoders may be used to generate high-quality latent representations of data for feature extraction and similarity graph construction.
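Label propagation as described above can be sketched with a toy chain graph, assuming a row-stochastic transition matrix and labeled nodes clamped to their known values at every step (a generic sketch of label propagation, not the specific embodiment):

```python
import numpy as np

# Tiny chain graph: nodes 0-4 with unit edges between consecutive nodes.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
P = W / W.sum(axis=1, keepdims=True)       # row-stochastic transition matrix

labels = np.zeros(5)
labels[0], labels[4] = 1.0, -1.0           # two labeled endpoint nodes
is_labeled = np.array([True, False, False, False, True])

X = labels.copy()
for _ in range(100):                       # diffuse toward convergence
    X = P @ X                              # each node averages its neighbors
    X[is_labeled] = labels[is_labeled]     # clamp (keep) the known labels
```

The propagated values converge to a harmonic interpolation between the two clamped endpoints, and each step costs one sparse matrix-vector product, consistent with the linear complexity noted above.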
With respect to generative models in active learning, deep generative models may be used to learn the latent representation of data in both semi-supervised and unsupervised learning. In addition to constructing a similarity graph in graph-based semi-supervised learning, generative models in active learning may also be used to generate adversarial data/models for more efficient and robust active learning. For example, a task-agnostic generative model may be used to train an adversarial network to select unlabeled instances that are distinct from the labeled set in the latent space of a variational auto-encoder (VAE).
First, consider a dataset D in Rd of cardinality n, wherein Rd is the space of all real vectors of dimension d (its standard notation). The dataset D can be split into a labeled subset Dl (which represents a labeled subset of D) and an unlabeled subset Du. C denotes the number of classes. Fixing a batch size of B, the most “informative” subset D* is sought for annotation from the unlabeled pool set given a limited budget for annotation.
With respect to active learning, let fθ be a classifier, where the data are assumed to be sampled, for example, independently and identically distributed over a space D×[C] according to a distribution Pz. And, as the subset D* is sought,
gives a minimum expected error.
Algorithm 1 (which is depicted at
Although some of the examples refer to a latent space using the VAE-latent representation space, other latent spaces may be used. The VAE-latent representation space may provide an advantage as the VAE-latent representation space is derived via an unsupervised method which requires no labels, and can be performed only once, prior to label acquisition. The underlying assumption is that the representation space bears structure that correlates with the class function, and therefore can be used to restrict the pool set to a smaller, yet, useful set of candidates.
Latent variable models, such as the VAE, learn representations of data, such as images. The VAE may be considered a generative neural network including a probabilistic encoder q(z|x) and a generative model p(x|z). The generator models a distribution over the input data x, conditioned on a latent variable z with prior distribution pθ(z). The encoder approximates the posterior distribution p(z|x) of the latent variables z given input data x and is trained along with the generative model by maximizing the evidence lower bound (ELBO):
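For readability, the ELBO referenced above may be written in its standard form (reconstructed here from the surrounding definitions: the expectation term is the reconstruction objective and the KL term regularizes the approximate posterior toward the prior):

```latex
\mathrm{ELBO}(x) \;=\; \mathbb{E}_{q(z \mid x)}\!\big[\log p(x \mid z)\big] \;-\; \mathrm{KL}\!\big(q(z \mid x)\,\big\|\,p_{\theta}(z)\big)
```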
where KL is the Kullback-Leibler divergence and log(p(x))≥ELBO(x).
To get a representative latent space from data (without label information), a ML model, such as a CNN, a deep CNN (e.g., ResNet18), and/or the like, may be used. For example, a ML model (e.g., ResNet18) based encoder may be used for large datasets, and a CNN-based VAE for all other datasets. In the first term,
where x̂i is the i-th (out of n) reconstructed image. The mean squared error (MSE) loss may be used for the first term (e.g., the reconstruction loss). Next, a proximity graph is constructed from the latent representation to be used within a graph diffusion process and label acquisition.
The optimized latent representation, such as the latent space 110 learned by the VAE 104, is used to construct a k-nearest neighbor (KNN) based weighted proximity graph G=(V, E), wherein
with m as a similarity metric, ρ as a distance metric, σij as a local scaling factor, and N(i) as the K-NN neighborhood of node i in the latent space g(D)=Z for encoder g, such as encoder 108A. A graph transition matrix P=T−1W may be defined to stand for the transition probabilities of a Markov random walk on the proximity graph G. The matrix T is diagonal, where Tii=Σj Wij.
The diffusion of labels in the proximity graph may be considered as a stochastic process, such as a Markov process that is used to propagate the label information of Dl to Du (e.g., from one node to another node). The transition probability of one step between states i and j can be denoted as Pij. Considering a binary classification, the classification probability P(y(zi)=1|zi) is associated with Pt(y(z)=1|i). A label is assigned to zi in Zu on this proximity graph G after a t-step random walk. Now, let Xi=2Pt(y(z)=1|i)−1∈[−1, 1]. The sign of X stands for the binary class. Denote
In matrix form
Let the graph Laplacian L=D−W in the system, then
Equation (1) above can be solved via iteration:
And, Equation (2) updates the label χi(t+1) of node i as a weighted average of the labels of its neighbors, using the transition weights. The diffusion process initializes with
The labels are propagated to Xu gradually for t steps. At the t-th step,
The matrix χ(T) of propagated values can be interpreted as uncertainties measured by the absolute value ∥χc,i(T)∥. Specifically, the absolute value magnitude represents a measure of uncertainty on whether vertex i belongs to a class c. The magnitude can be used to select a new batch to query towards the oracle 190 as
where minB denotes the B smallest elements. The selection criterion in Equation (3) depends on the initialization strategy used.
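Under the definitions above, the transition matrix P=T−1W, the t-step diffusion, and the selection of the B smallest-magnitude entries per Equation (3) can be sketched end to end. The two-clique toy graph, the clamping of labeled values at each step, and the parameter choices below are illustrative assumptions, not the disclosed configuration:

```python
import numpy as np

# Two 4-node cliques joined by one weak bridge edge; one labeled node per class.
n = 8
W = np.zeros((n, n))
for a in range(4):
    for b in range(a + 1, 4):
        W[a, b] = W[b, a] = 1.0                  # clique on nodes 0-3
        W[a + 4, b + 4] = W[b + 4, a + 4] = 1.0  # clique on nodes 4-7
W[3, 4] = W[4, 3] = 0.1                          # weak bridge between cliques

T = np.diag(W.sum(axis=1))                       # T_ii = sum_j W_ij
P = np.linalg.inv(T) @ W                         # transition matrix P = T^-1 W

chi = np.zeros(n)
chi[0], chi[7] = 1.0, -1.0                       # labeled nodes: classes +1, -1

for _ in range(10):                              # t-step random-walk diffusion
    chi = P @ chi
    chi[0], chi[7] = 1.0, -1.0                   # keep labeled values fixed

B = 2
query = np.argsort(np.abs(chi))[:B]              # Eq. (3): B smallest |chi|
```

The two bridge nodes sit closest to the decision boundary between the clusters, so their diffused magnitudes are smallest and they are the samples queried toward the oracle.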
With respect to exploration and refinement, the query criterion (see, e.g., Equation (3)) coupled with the diffusion process allows exploration of the dataset at early stages of active learning and then switching to refinement when exploration has saturated. To understand this mechanism, we show in the following that the diffusion iterant χt converges to the second eigenvector ϕ2 of the graph's Laplacian as t→∞. Asymptotically, ϕ2 provides a relaxed solution to the minimal normalized cut problem, where the cut corresponds to the decision boundary between the two classes in G(V, E, W). At early stages of label acquisition, low magnitude entries in χ correspond to data points that are unreachable from the training set via diffusion and need to be explored. At later stages, all unlabeled data points Xu are reachable via diffusion from the labeled set Xl. At this stage, low magnitude entries correspond to the transition between the two classes −1 and 1. These nodes capture the eigenvector's transition from negative to positive entries. Therefore, sampling these points corresponds to the refinement of the decision boundary.
Referring again to
Without in any way limiting the scope, interpretation, or application of the claims appearing below, a technical effect of one or more of the example embodiments disclosed herein may include reduced use of compute resources during active learning.
The subject matter described herein may be embodied in systems, apparatus, methods, and/or articles depending on the desired configuration. For example, the base stations and user equipment (or one or more components therein) and/or the processes described herein can be implemented using one or more of the following: a processor executing program code, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), an embedded processor, a field programmable gate array (FPGA), and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. These computer programs (also known as programs, software, software applications, applications, components, program code, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “computer-readable medium” refers to any computer program product, machine-readable medium, computer-readable storage medium, apparatus and/or device (for example, magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions. Similarly, systems are also described herein that may include a processor and a memory coupled to the processor. The memory may include one or more programs that cause the processor to perform one or more of the operations described herein.
Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations may be provided in addition to those set forth herein. Moreover, the implementations described above may be directed to various combinations and subcombinations of the disclosed features and/or combinations and subcombinations of several further features disclosed above. Other embodiments may be within the scope of the following claims.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Although various aspects of some of the embodiments are set out in the independent claims, other aspects of some of the embodiments comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims. It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications that may be made without departing from the scope of some of the embodiments as defined in the appended claims. Other embodiments may be within the scope of the following claims. The term “based on” includes “based on at least.” The use of the phrase “such as” means “such as for example” unless otherwise indicated.
The present application claims priority to U.S. Provisional Application No. 63/512,972 filed Jul. 11, 2023, and entitled “ACCELERATED DEEP ACTIVE LEARNING WITH GRAPH-BASED SUB-SAMPLING,” and incorporates its disclosure herein by reference in its entirety.
| Number | Date | Country |
|---|---|---|
| 63512972 | Jul 2023 | US |