Various example embodiments relate generally to methods and apparatuses for active learning in the deep learning training of neural networks using a training data set, wherein a trained neural network may be used to classify new data that is similar in nature to the training data set.
In the field of machine learning, many scenarios involve neural networks that are organized as a set of layers, such as an input layer that receives an input, one or more hidden layers that process the input based on weighted connections with the neurons of a preceding layer, and an output layer that generates an output that may indicate a classification of the input. As an example, each input may be classified into one of N classes by providing an output layer with N neurons, where the neuron of the output layer having a maximum output indicates the class into which the input is classified.
Neural networks may be trained to classify data through a learning process. As an example involving fully-connected layers, each neuron of a layer is connected to each and every neuron of a preceding layer, and each connection includes a weight that is initially set to a value, such as a random value. Each neuron determines a weighted sum of the outputs of the neurons of the preceding layer and provides an output based on the weighted sum and an activation function, such as a linear activation, a rectified linear activation, a sigmoid activation, and/or a softmax activation. The output layer may similarly generate an output based on a weighted sum and an activation function.
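By way of illustration only, the following sketch (in Python, using NumPy) shows how a small fully-connected network may compute such weighted sums and activations and classify an input into one of N classes by selecting the output neuron with the maximum value. The layer sizes, variable names, and random initialization are assumptions made for this illustration and are not part of any particular embodiment.

```python
import numpy as np

def relu(z):
    # Rectified linear activation.
    return np.maximum(z, 0.0)

def softmax(z):
    # Softmax activation; the maximum is subtracted for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))                      # one input with 8 features
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)  # hidden layer weights, randomly initialized
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)   # output layer for N = 3 classes

h = relu(x @ W1 + b1)                            # weighted sums of the input, then activation
p = softmax(h @ W2 + b2)                         # weighted sums of hidden outputs, then activation
predicted_class = int(p.argmax(axis=-1)[0])      # output neuron with the maximum output
```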
A training data set of inputs with labels (for example, the expected classification of each input) is provided to train the neural network. Each input is processed by the neural network, and a backpropagation process is performed to adjust the weights of each layer such that the output is closer to the label. Some training processes may involve dividing the inputs of the training data set into mini-batches and performing backpropagation on an aggregate of the outputs for the inputs of each mini-batch. Continued training may be performed until the neural network converges, such that the neural network produces output that is at least close to the label for each input. A neural network that is trained to perform discriminant analysis between two or more classes may form a decision boundary in an input space or sample space, wherein inputs that are on one side of the decision boundary are classified into a first class and inputs that are on the other side of the decision boundary are classified into a second class. When the neural network is fully trained, new data may be provided, such as inputs without known labels, and the neural network may classify the new data based upon the training over the training data set.
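A minimal sketch of such mini-batch training with backpropagation is shown below, here using the PyTorch library purely as an example. The synthetic data, network architecture, learning rate, and number of epochs are illustrative assumptions rather than parameters of any embodiment.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

inputs = torch.randn(256, 8)          # hypothetical training inputs
labels = torch.randint(0, 3, (256,))  # hypothetical labels for 3 classes
loader = DataLoader(TensorDataset(inputs, labels), batch_size=32, shuffle=True)

model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 3))
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(20):                   # continued training toward convergence
    for xb, yb in loader:             # one mini-batch of inputs and labels
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb) # compare outputs with labels
        loss.backward()               # backpropagation
        optimizer.step()              # adjust the weights
```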
The field of deep learning involves neural networks with a significant number of hidden layers and/or a significant number of neurons, which may enable a more complex classification process, such as the classification of high-dimensionality input. The number of weights (also known as parameters) and/or the number of inputs in the training data set may be large, such that the training may take a long time to converge. An extended duration of training may delay the availability of a trained neural network, and/or may be computationally expensive, consuming significant computational resources such as processing capacity, memory capacity, network capacity, and/or energy while training is applied until the neural network converges.
Neural networks have reached record performance in many fields, including computer vision and natural language processing. However, as the size of collected data has increased dramatically over the past two decades, the effort of training neural networks has become a key challenge in advancing the state of the art. In particular, the optimization at the core of the training process and the annotation effort required to label massive data sets have become two major bottlenecks in the training of deep neural network models.
Some example embodiments include an apparatus and methods for optimizing the training of neural networks. In some embodiments, one or more of the methods may be incorporated into an active learning framework or apparatus.
At least some example embodiments will become more fully understood from the detailed description provided below and the accompanying drawings, wherein like elements are represented by like reference numerals, which are given by way of illustration only and thus are not limiting of example embodiments and wherein:
Various example embodiments will now be described more fully with reference to the accompanying drawings in which some example embodiments are shown.
Detailed illustrative embodiments are disclosed herein. However, specific structural and functional details disclosed herein are merely representative for purposes of describing at least some example embodiments. Example embodiments may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
Accordingly, while example embodiments are capable of various modifications and alternative forms, embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit example embodiments to the particular forms disclosed, but on the contrary, example embodiments are to cover all modifications, equivalents, and alternatives falling within the scope of example embodiments. Like numbers refer to like elements throughout the description of the figures. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “includes” and/or “including”, when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two operations shown in succession in the figures may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
Example embodiments are discussed herein as being implemented in a suitable computing environment. Although not required, example embodiments will be described in the general context of computer-executable instructions (e.g., program code), such as program modules or functional processes, being executed by one or more computer processors or CPUs. Generally, program modules or functional processes include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
In the following description, example embodiments will be described with reference to acts and symbolic representations of operations (e.g., in the form of flowcharts) that are performed by one or more processors (i.e., processing circuitry), unless indicated otherwise. As such, it will be understood that such acts and operations, which are at times referred to as being computer-executed, include the manipulation by the processor of electrical signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the memory system of the computer, which reconfigures or otherwise alters the operation of the computer in a manner well understood by those skilled in the art.
Example embodiments being thus described, it will be obvious that embodiments may be varied in many ways. Such variations are not to be regarded as a departure from example embodiments, and all such modifications are intended to be included within the scope of example embodiments.
In the example 100 of FIG. 1, an apparatus 102 including processing circuitry 110 is shown together with a neural network 106, in accordance with some example embodiments.
In the example 200 of FIG. 2, further aspects of some example embodiments are shown.
The process begins with step 302. Step 302 includes extracting a subset of labeled data points from a pool set of labeled data points. For example, as shown in FIG. 4, a pool set 402 includes labeled data points 404.
A processing unit, such as the processing circuitry 110 of apparatus 102, may be configured to extract at least a subset 412 of labeled data points from the labeled data points 404 in the pool set 402. The subset 412 of labeled data points that are extracted from the pool set 402 in step 302 may be selected in several ways. In one embodiment, the processing circuitry 110 may use random selection to select and extract the subset 412 from the labeled data points 404 in the pool set 402. In other embodiments, the selection and extraction may be based upon one or more predetermined criteria, or a combination of predetermined criteria and random selection.
Step 304 includes populating an anchor set with the extracted subset of the labeled data points from the pool set. For example, in some embodiments, the processing circuitry 110 may be configured to initially create and populate the anchor set 414 with the extracted subset of labeled data points 412 in step 304. In FIG. 4, the labeled data points that populate the anchor set 414 are shown as labeled data points 416.
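A hedged sketch of steps 302 and 304, assuming that the labeled data points are tracked by index, is given below; the names extract_anchor, pool_indices, and anchor_size are hypothetical and are not taken from the disclosure.

```python
import random

def extract_anchor(pool_indices, anchor_size, seed=None):
    # Steps 302/304 sketch: randomly extract a subset of labeled data points
    # from the pool set and use it to populate the anchor set; the remainder
    # stays in the pool set.
    rng = random.Random(seed)
    anchor_indices = rng.sample(list(pool_indices), anchor_size)  # random selection (one option)
    remaining_pool = [i for i in pool_indices if i not in set(anchor_indices)]
    return anchor_indices, remaining_pool

# Example usage: 1000 labeled points in the pool set, 200 of which populate the anchor set.
anchor_idx, pool_idx = extract_anchor(range(1000), 200, seed=0)
```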
Step 306 includes using the anchor set of labeled data points as labeled inputs to partially train a neural network. For example, in some embodiments, the processing circuitry 110 may be configured to use the labeled data points 416 in the anchor set 414 to train a neural network, such as the neural network 106 depicted and described in FIG. 1.
Step 308 includes selectively swapping at least some of the labeled data points in the anchor set with at least some of the labeled data points in the pool set. For example, the processing circuitry 110 may be configured to selectively swap at least some of the labeled data points 418, from among the labeled data points 416 in the anchor set 414 that were used as the labeled inputs to partially train the neural network 106, with at least some of the remaining labeled data points 420 in the pool set 402, such that the overall sizes of the pool set 402 (after extraction of the labeled data points 412 in step 302) and of the anchor set 414 remain the same after the swapping in step 308. The result of step 308 is that at least some of the labeled data points in the anchor set 414 that were used to train the neural network 106 are replaced with a new set of labeled data points selected from the remaining labeled data points 420 in the pool set 402. Furthermore, the labeled data points in the anchor set 414 that were replaced by the newly added labeled data points from the pool set 402 are moved back into the pool set 402, completing a two-way transfer between the pool set 402 and the anchor set 414 such that the overall sizes of the labeled data points 420 in the pool set 402 and of the labeled data points 416 in the anchor set 414 remain the same.
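One possible sketch of the two-way swap of step 308 is shown below. The optional score_pool argument stands in for whatever relevance measure (for example, the graph-diffusion selection described later) is used to pick points from the pool set; all names are assumptions made for illustration.

```python
import random

def swap(anchor_idx, pool_idx, k, score_pool=None, seed=None):
    # Move k points out of the anchor set and k points in from the pool set,
    # so that the sizes of both sets are unchanged after the swap.
    rng = random.Random(seed)
    out_of_anchor = rng.sample(anchor_idx, k)            # points returned to the pool set
    if score_pool is not None:                           # e.g. a diffusion-based relevance score
        into_anchor = sorted(pool_idx, key=score_pool, reverse=True)[:k]
    else:
        into_anchor = rng.sample(pool_idx, k)
    new_anchor = [i for i in anchor_idx if i not in set(out_of_anchor)] + into_anchor
    new_pool = [i for i in pool_idx if i not in set(into_anchor)] + out_of_anchor
    return new_anchor, new_pool
```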
Step 310 includes retraining the neural network using the labeled data points 416 in the anchor set 414 as labeled inputs, in a manner that may be similar to the training performed in step 306.
Step 312 includes repeating step 308 (selective swapping) and step 310 (retraining) for at least one of a preselected number of iterations or until a specified learning rate criterion for training the neural network 106 is met.
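The overall loop of steps 306 through 312 can then be sketched as follows, where train_on, swap, and stop_criterion are assumed to be supplied by the caller (they are placeholders, not functions defined by the disclosure).

```python
def focused_training(train_on, swap, stop_criterion, anchor_idx, pool_idx, k, max_iters):
    train_on(anchor_idx)                                      # step 306: partial training on the anchor set
    for _ in range(max_iters):                                # step 312: repeat for a preselected number ...
        anchor_idx, pool_idx = swap(anchor_idx, pool_idx, k)  # step 308: selective swapping
        train_on(anchor_idx)                                  # step 310: retraining
        if stop_criterion():                                  # ... or until a learning rate criterion is met
            break
    return anchor_idx, pool_idx
```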
In some embodiments, the process includes selecting, by the processing circuitry, at least some of the labeled data points in the anchor set to be swapped with at least some of the selected labeled data points in the pool set based on a number of times respective data points in the anchor set have been used as labeled inputs to train the neural network.
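As a simple illustration of this selection rule, the number of times each anchor point has been used as a training input can be tracked with a counter, and the most-used points can be chosen first when swapping back into the pool set; the helper names below are hypothetical.

```python
from collections import Counter

usage_count = Counter()   # times each anchor point has been used as a labeled training input

def record_training_pass(anchor_idx):
    usage_count.update(anchor_idx)   # call once per training pass over the anchor set

def most_used(anchor_idx, k):
    # Anchor points used most often are the first candidates to be swapped out.
    return sorted(anchor_idx, key=lambda i: usage_count[i], reverse=True)[:k]
```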
In some embodiments, example process 300 may advantageously be integrated into an active learning framework by: selecting, by the processing circuitry, one or more unlabeled data points from a set of unlabeled data points; annotating the selected one or more unlabeled data points with labels to create a set of annotated data points with labels; adding, by the processing circuitry, at least one of the annotated data points to the pool set of labeled data points; and repeating (e.g., as in step 312), by the processing circuitry, step 308 and step 310 for at least one of a preselected number of iterations or until a specified learning rate criterion is met.
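A hedged sketch of this active-learning integration, again assuming index-based bookkeeping and caller-supplied helpers (select_unlabeled for the query strategy and annotate for the human or oracle labelling step), might look as follows.

```python
def active_learning_round(select_unlabeled, annotate, unlabeled_idx, pool_idx, query_size):
    queries = list(select_unlabeled(unlabeled_idx, query_size))  # choose unlabeled points to label
    for i in queries:
        annotate(i)                                              # obtain a label (e.g. from an annotator)
    pool_idx = list(pool_idx) + queries                          # add annotated points to the pool set
    unlabeled_idx = [i for i in unlabeled_idx if i not in set(queries)]
    return pool_idx, unlabeled_idx                               # then repeat steps 308 and 310
```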
The process 300 depicted in FIG. 3 is discussed in further detail below.
Standard DNN methods train the model by simply looping for a certain number of epochs over all the data points uniformly. This does not consider the fact that not all of the data that have been selected have the same importance during different stages of training. In the present disclosure, the training points that are used for training the neural network are selected based on a focused training method that is related to the concept of importance sampling but is computationally less expensive (as fewer, relatively more impactful labeled data points are selectively used for training). The focused training method, as applied using a diffusion process on the graph constructed from the penultimate-layer representation of labeled data, improves upon conventional approaches to importance sampling. Although conventional importance sampling is a popular tool to accelerate training by selecting the points that are most relevant to the model at each stage of training, it relies upon a metric related to the norm of the gradient of the loss function. However, the explicit computation of those gradients is often infeasible in practice, due to the extremely large number of parameters in the model. In the present disclosure, it has been found that, even though diffusion does not look directly at the gradient for the data points that it selects, it still chooses points that score very high according to such a metric.
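For contrast only, the sketch below shows the per-example gradient-norm score that conventional importance sampling relies upon; the explicit per-example backward passes are precisely the cost that the diffusion-based selection avoids. The model, loss function, and data are assumed to be supplied by the caller.

```python
import torch

def gradient_norm_scores(model, loss_fn, inputs, labels):
    # Score each example by the norm of the gradient of its loss with respect
    # to the model parameters (one backward pass per example, hence expensive).
    scores = []
    for x, y in zip(inputs, labels):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        squared = sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None)
        scores.append(squared.sqrt().item())
    return scores
```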
The selective swapping of step 308 that is used to select new labeled data points for training the neural network is now described in greater detail. In some embodiments, step 308 of selectively swapping at least some of the labeled data points 418 in the anchor set 414 with at least some of the labeled data points 420 in the pool set 402, such that the respective sizes of the anchor set and the pool set (after the initial extraction) are unchanged, comprises constructing a proximity graph based on similarities of the output of a hidden layer (e.g., the penultimate layer) of the neural network for each of the labeled inputs used to train the neural network. In this embodiment, the processing circuitry 110 may be configured to create the proximity graph based on the output of the selected hidden layer (e.g., the penultimate layer) of the neural network for the labeled data points that were used as labeled inputs to the neural network. The processing circuitry 110 may be further configured to select at least some of the labeled data points in the pool set to be swapped with at least some labeled data points in the anchor set based on a graph-diffusion process of the labels of the data points of the anchor set to the data points of the pool set.
Intuitively, the new selection of labeled data points for training the neural network in accordance with various aspects of the disclosure may be understood as a diffusion-based sampling method that aims at selecting, from the pool set 402, new labeled data points for training that will make the model's performance improve faster. In order to do so, as stated previously, step 308 includes, at the outset, creating a nearest-neighbor graph representation of the training set of labeled data points that were previously used to train the neural network (labeled data points 416 in FIG. 4).
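A minimal sketch of such a graph-diffusion selection is given below, assuming that penultimate-layer feature vectors for the labeled points have already been extracted from the network. The Gaussian affinities, the label-propagation update, and the entropy-based scoring of pool points are illustrative choices made for this sketch and are not the exact construction of the disclosure.

```python
import numpy as np

def knn_affinity(features, k=10, sigma=1.0):
    # Nearest-neighbor proximity graph over penultimate-layer features.
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    keep = np.argsort(-W, axis=1)[:, :k]        # k strongest neighbors per node
    mask = np.zeros_like(W, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return np.where(mask | mask.T, W, 0.0)      # symmetrized kNN graph

def diffuse_labels(W, anchor_idx, anchor_labels, n_classes, alpha=0.9, iters=50):
    # Diffuse the anchor labels over the graph toward the pool points.
    n = W.shape[0]
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic transitions
    Y = np.zeros((n, n_classes))
    Y[anchor_idx, anchor_labels] = 1.0                        # clamp the known anchor labels
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (P @ F) + (1.0 - alpha) * Y
    return F

def select_pool_points(F, pool_idx, k):
    # Pick the pool points whose diffused labels are most uncertain (highest entropy).
    probs = F[pool_idx] / np.maximum(F[pool_idx].sum(axis=1, keepdims=True), 1e-12)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    order = np.argsort(-entropy)
    return [pool_idx[i] for i in order[:k]]
```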
More particularly, in step 302 and step 304 the process starts by randomly dividing the labeled data set (pool set 402) into two groups (labeled data points 420 and extracted data points 416): one in which the labels are maintained (extracted data points 416), and one in which the process in effect assumes that it does not yet know them (remaining data points 420). The real training (backpropagation of the loss and SGD update) is performed only using the labeled data points 416, while the set of remaining labeled data points 420 is used as the pool set 402 from which to query the new data that is selectively added to the labeled data points 416 in step 308. The same number of data points that are newly added to the labeled data points 416 from the remaining labeled data points 420 are also put back (labeled data points 418) into the remaining data points 420, to keep the sizes of these two sets (416, 420) constant. This focused querying is performed in batches, and two parameters of the procedure may be used to determine the sizes of the labeled data set 416 and the labeled data set 420.
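For example, under the assumption that the two parameters are an anchor fraction and a per-iteration query batch size (names chosen here for illustration only, not taken from the disclosure), the sizes of the two sets could be derived as follows.

```python
def split_sizes(n_labeled, anchor_fraction=0.2, query_batch=64):
    # Hypothetical parameterisation: the anchor fraction fixes how many labeled
    # points form the anchor set 416; the rest remain in the pool set 420, and
    # query_batch points are swapped between the two sets at each iteration.
    anchor_size = int(anchor_fraction * n_labeled)
    pool_size = n_labeled - anchor_size
    return anchor_size, pool_size, query_batch
```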
The underlying idea of this step (step 308) is that not all of the remaining data points 420 that are in the pool set 402 are necessarily useful for training at the current stage, so the additional diffusion-based selection allows the process 300 to distinguish the points that are more relevant and to use those to perform more updates/retraining of the model.