The present disclosure relates generally to the field of machine learning. More specifically, the present disclosure relates to machine learning systems and methods for evaluating sampling bias in deep active classification.
Deep neural networks (DNNs) trained on large datasets provide state-of-the-art results on various natural language processing (NLP) problems including text classification. However, the increasing cost and time required for data labeling and model training are bottlenecks for training DNN models on large datasets to create new and/or better models. Identifying smaller representative data samples via strategies like active learning can aid with mitigating such bottlenecks. In particular, a smaller representative dataset can be utilized to train DNNs to yield a similar test accuracy as that obtained utilizing a full training dataset (i.e., the smaller sample can be considered a surrogate for the full training dataset). However, there is a lack of clarity regarding biases in a smaller sample. In particular, there is a lack of clarity regarding sampling bias in a query including, but not limited to, its dependence on the models, functions and parameters utilized to acquire the sample.
Therefore, there is a need for machine learning systems and methods which can evaluate sampling bias in deep active classification while improving an ability of computer systems to more efficiently process data. These and other needs are addressed by the machine learning systems and methods of the present disclosure.
The present disclosure relates to machine learning systems and methods for evaluating sampling bias in deep active classification. The system generates an acquisition function based on an uncertainty based query strategy. A query strategy refers to the acquisition function utilized to select at least one unlabeled data sample (i.e., a query) from the input data. The system utilizes the Least Confidence and Entropy uncertainty based query strategies. In particular, the system utilizes four query strategies, namely Least Confidence computed utilizing single and ensemble models and Entropy computed utilizing single and ensemble models. The system acquires at least one data sample from the input data based on the acquisition function. The input data can include, but is not limited to, large datasets widely utilized for text classification. The system labels the data sample via an oracle and generates a training dataset with the labeled data sample. In particular, the system generates a sequence of training datasets by sampling b queries from the input data, each of size K. The system evaluates the efficiency and bias of sample datasets S_b^1, S_b^2, . . . , S_b^t obtained by different query strategies Q_1, Q_2, . . . , Q_t. The system also trains a network with the generated training dataset. The system can select either of two text classification models representative of deep learning and classical approaches: FastText.zip (FTZ) and Multinomial Naive Bayes (MNB) with term frequency-inverse document frequency (TF-IDF). These models are fast to train and yield quality performance on text classification, which provides for efficiently conducting a large-scale study.
Accordingly, at each iteration, the system trains the network on a current training dataset of the training input data and utilizes a network dependent query strategy via an acquisition function generation module to acquire new data samples from the input data, label the acquired data samples by an oracle, and add the labeled samples to another training dataset.
The foregoing features of the disclosure will be apparent from the following Detailed Description, taken in connection with the accompanying drawings, in which:
The present disclosure relates to machine learning systems and methods for evaluating sampling bias in deep active classification, as discussed in detail below in connection with
The machine learning system and method of the present disclosure addresses key questions of sampling bias and efficiency and the impact of algorithmic choices in the context of deep active learning (AL) text classification on large models. In particular, the system and method of the present disclosure utilize a DNN which demonstrates acceptable properties without utilizing ensembles or dropouts.
Turning to the drawings,
Beginning in step 32, the acquisition function generation module 14 generates an acquisition function based on an uncertainty based query strategy. A query strategy refers to the acquisition function utilized to select at least one unlabeled data sample (i.e., a query) from the input data 12. A query refers to an incremental set of points selected to be labeled and added to a labeled training set. Uncertainty based query strategies generally utilize a scoring function on the softmax output of a single model. The system 10 utilizes the Least Confidence (LC) and the Entropy (Ent) uncertainty based query strategies. Independently training ensembles of models is a known approach to obtain uncertainties associated with an output estimate. As such, the system 10 utilizes four query strategies, namely LC computed utilizing single and ensemble models and Entropy computed utilizing single and ensemble models. The system 10 evaluates each of the four query strategies against random sampling (chance) as a baseline. Regarding ensembles, the system 10 utilizes the FastText.zip (FTZ) ensembles. It should be understood that FTZ is a compressed version of FastText (FT), a practical model that yields the same performance with memory savings.
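By way of a non-limiting illustration, the two uncertainty based scoring functions can be sketched as follows. The function names and the averaging of ensemble softmax outputs are illustrative assumptions, not the exact implementation of the system 10:

```python
import numpy as np

def least_confidence(probs):
    """Score each sample by 1 - max predicted probability (higher = more uncertain)."""
    return 1.0 - probs.max(axis=1)

def entropy_score(probs, eps=1e-12):
    """Score each sample by the Shannon entropy of its softmax output."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def ensemble_probs(member_probs):
    """Average the softmax outputs of independently trained ensemble members
    to obtain a single uncertainty estimate per sample."""
    return np.mean(member_probs, axis=0)

def select_query(probs, K):
    """Greedily select the K most uncertain unlabeled samples as the next query."""
    return np.argsort(-entropy_score(probs))[:K]
```

Each of the four query strategies of the system 10 then corresponds to one scoring function applied either to a single model's softmax output or to the ensemble average.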
In step 34, the system 10 acquires at least one data sample from the input data 12 based on the acquisition function. The input data 12 can include, but is not limited to, large datasets widely utilized for text classification such as AG News (AGN), DBPedia (DBP), Amazon Review Polarity (AMZP), Amazon Review Full (AMZF), Yelp Review Polarity (YRP), Yelp Review Full (YRF), Yahoo Answers (YHA), and Sogou News (SGN). Then, in step 36, the system 10 labels the data sample via an oracle.
In step 38, the system 10 generates a training dataset with the labeled data sample. In particular, let D_S = {(x_i, y_i)} denote a dataset consisting of |S| = n i.i.d. samples of data/label pairs, where |·| denotes cardinality. Let S_0 ⊂ S denote an initial randomly drawn sample from the initial input data 12. A sequence of training datasets [S_1, S_2, . . . , S_b] is generated by sampling b queries from the input data 12, each of size K. The b queries are given by [S_1−S_0, S_2−S_1, . . . , S_b−S_{b−1}]. It should be understood that |S_i| = |S_0| + i×K and S_1 ⊂ S_2 ⊂ . . . ⊂ S_b ⊂ S. As described in further detail below, the system 10 evaluates the efficiency and bias of sample datasets S_b^1, S_b^2, . . . , S_b^t obtained by different query strategies Q_1, Q_2, . . . , Q_t. The system 10 excludes the randomly acquired initial dataset and compares the actively acquired sample datasets defined as Ŝ_j^i = S_j^i − S_0^i.
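The incremental construction of the nested dataset sequence can be sketched as follows. Here `build_dataset_sequence` and its `score_fn` argument are hypothetical stand-ins for retraining the network and scoring the remaining pool at each round:

```python
import numpy as np

def build_dataset_sequence(pool_size, init_size, K, b, score_fn, rng=None):
    """Grow nested labeled index sets S_0 subset S_1 subset ... subset S_b.

    score_fn(labeled_indices, unlabeled_indices) returns one uncertainty
    score per unlabeled index; it stands in for model retraining + scoring.
    """
    rng = rng or np.random.default_rng(0)
    S = set(rng.choice(pool_size, size=init_size, replace=False).tolist())
    sequence = [set(S)]
    for _ in range(b):
        unlabeled = np.array(sorted(set(range(pool_size)) - S))
        scores = score_fn(sorted(S), unlabeled)
        query = unlabeled[np.argsort(-scores)[:K]]  # K most uncertain points
        S |= set(query.tolist())
        sequence.append(set(S))
    return sequence
```

The resulting sets satisfy |S_i| = |S_0| + i×K, and the actively acquired portion Ŝ = S_b − S_0 is recovered by set difference of the final and initial sets.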
In step 40, the system 10 trains the network 16 with the generated training dataset. As described above, the system 10 can select as the network 16 two text classification models representative of deep learning and classical approaches: FTZ and MNB with TF-IDF. These models are fast to train and yield quality performance on text classification which provides for efficiently conducting a large scale study. The system 10 selects, as a DNN model, FTZ which yields results that are competitive with Very Deep Convolutional Neural Networks (a 29 layer CNN) but with over 15,000× speedup. This provides for conducting over 2,300 trials on large datasets of size 100K-3.6M. The traditional network MNB with TF-IDF is accurate, fast and a popular and classical baseline for text classification.
In step 42, the system 10 determines whether to acquire another data sample. If the system 10 determines to acquire another data sample, then the process returns to step 32. Alternatively, if the system 10 determines not to acquire another data sample, then the process ends. Accordingly, at each iteration, the system 10 trains the network 16 on a current training dataset of the training input data 20 and utilizes a network 16 dependent query strategy via the acquisition function generation module 14 to acquire new data samples from the input data 12, label the acquired data samples by an oracle, and add the labeled samples to another training dataset.
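The iterative procedure of steps 32-42 can be summarized in a short sketch, where `train_fn`, `score_fn` and `oracle` are hypothetical placeholders for the network 16, the acquisition function generation module 14 and the labeling oracle, respectively:

```python
import numpy as np

def active_learning_loop(X, oracle, train_fn, score_fn, init_size, K, b, rng=None):
    """Pool-based loop: train on the current labeled set, score the remaining
    pool with an uncertainty function, query the top-K points, label them via
    the oracle, and repeat for b rounds."""
    rng = rng or np.random.default_rng(0)
    n = len(X)
    labeled = set(rng.choice(n, size=init_size, replace=False).tolist())
    y = {i: oracle(i) for i in labeled}           # initial random labels
    for _ in range(b):
        model = train_fn([(X[i], y[i]) for i in sorted(labeled)])
        pool = np.array(sorted(set(range(n)) - labeled))
        scores = score_fn(model, X[pool])
        query = pool[np.argsort(-scores)[:K]]     # acquire K new samples
        for i in query:
            y[int(i)] = oracle(int(i))            # oracle labels the query
        labeled |= {int(i) for i in query}
    return labeled, y
```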
Training, testing and results of the system 10 will now be described in greater detail. As described above, the system 10 evaluates whether DNN models and their shallow counterparts exhibit similar behavior with regards to sample dataset bias and efficiency.
It should be understood that the system 10 can be implemented using a wide variety of parameters and hardware. As shown in table 100, the system 10 conducts 2,304 tests. Additionally, the system 10 tests the results on three random initial datasets and three runs per dataset (to account for stochasticity in FTZ) for each of the eight datasets. The query sizes include 0.5% of the dataset for each of AGN, AMZF, YRF, and YHA and 0.25% for each of SGN, DBP, YRP and AMZP for b=30 sequential and active queries. The system 10 conducts tests with different query sizes while maintaining the size of the final training dataset, b×K, constant. The default query strategy of the system 10 utilizes a single model with output Entropy unless explicitly modified. Results of the system 10 are shown in the chance column of table 140 (as shown in
Several aspects of sampling bias (e.g., class bias and feature bias) and relevant algorithmic factors (e.g., initial dataset selection, query size and query strategy in relation to the model and acquisition function) will now be described in relation to the testing results of the system 10. Sampling bias can include different types of sampling biases such as class bias and feature bias. Greedy uncertainty based query strategies are known to select disproportionately from a subset of classes per query, thereby yielding an unbalanced representation in each query. However, the effect on the resulting sample dataset is unclear. The system 10 tests this by measuring the Kullback-Leibler (KL) divergence between the ground truth label distribution and the distribution obtained per query as one test, and over the resulting sample dataset as a second test. In particular, let P denote the true distribution of labels, P̂ the sample distribution and C the total number of classes. Since P follows a uniform distribution, the label entropy L = log(C) − KL(P∥P̂) can be utilized. Label entropy L is an intuitive measure. A maximum label entropy is attained when sampling is uniform, i.e., P̂(x) = P(x), in which case L = log(C).
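A minimal sketch of the label entropy measure follows, assuming integer class labels and a uniform true distribution P; the epsilon guard against empty classes is an implementation assumption:

```python
import numpy as np

def label_entropy(sample_labels, num_classes):
    """Label entropy L = log(C) - KL(P || P_hat), where P is uniform over
    C classes and P_hat is the empirical label distribution of the sample.
    L attains its maximum log(C) when P_hat matches P exactly."""
    counts = np.bincount(sample_labels, minlength=num_classes)
    p_hat = counts / counts.sum()
    p = np.full(num_classes, 1.0 / num_classes)
    eps = 1e-12  # guard: KL diverges when a class is absent from the sample
    kl = np.sum(p * np.log(p / (p_hat + eps)))
    return np.log(num_classes) - kl
```

A perfectly balanced sample yields L = log(C), while a sample that omits or over-represents classes yields a strictly smaller value.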
Uncertainty sampling can yield sampling bias. In the context of active classification, it can be beneficial to have biased sampling, as the most informative samples can be expected to be the ones closer to class boundaries. It should be understood that the system 10 assumes ergodicity and does not consider incremental online or continuous learning scenarios where new modes or new classes are sequentially encountered. Recent approaches suggest that learning in deep classification networks may focus on a small part of the data closer to class boundaries, thereby resembling support vectors. To determine whether sampling bias also exhibits this behavior, the system 10 executes a direct comparison with support vectors from a support vector machine (SVM). In particular, the system 10 trains a FTZ model on the full training dataset (for a common feature space), trains an SVM on the resulting features to obtain the support vectors, and determines an intersection of the support vectors with each selected training dataset.
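The overlap measurement can be sketched as follows; comparing against a random sample of the same size is an illustrative baseline for context, not necessarily the exact procedure of the system 10:

```python
import numpy as np

def support_overlap(selected_idx, support_idx, pool_size, rng=None):
    """Fraction of actively selected indices that are also SVM support
    vectors, alongside the overlap expected from random sampling."""
    selected, support = set(selected_idx), set(support_idx)
    actual = len(selected & support) / len(selected)
    rng = rng or np.random.default_rng(0)
    random_sel = set(rng.choice(pool_size, size=len(selected),
                                replace=False).tolist())
    baseline = len(random_sel & support) / len(random_sel)
    return actual, baseline
```

An actively acquired sample concentrated near class boundaries would show an `actual` overlap well above the random `baseline`.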
As shown in table 140, a high percentage overlap demonstrates that sampling is biased in a positive manner. Since the support vectors are indicative of the class boundaries, a large percentage of selected data consists of samples around the class boundaries. The system 10 utilizes a fast graphical processing unit (GPU) implementation for training an SVM with a linear kernel with default hyperparameters but it should be understood that any suitable graphics card can be utilized.
The system 10 also evaluates three algorithmic factors relevant to sampling bias including initial dataset selection, query size and query strategy. With regards to the initial dataset selection, the system 10 evaluates a dependence of a final selected sample dataset on the initial dataset. The system 10 compares an overlap (i.e. intersection) of final datasets incrementally constructed from different random initial datasets versus the same initial dataset. It should be understood that due to the stochasticity of training, non-identical final datasets can be expected in the latter case.
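One plausible way to quantify the overlap between final datasets constructed from different initial datasets is the Jaccard index; the choice of Jaccard (rather than a raw intersection count) is an assumption made here for illustration:

```python
def dataset_overlap(final_a, final_b):
    """Jaccard overlap |A intersect B| / |A union B| between two final
    sample datasets, used to test how strongly the selection depends on
    the initial random dataset."""
    A, B = set(final_a), set(final_b)
    return len(A & B) / len(A | B)
```

Comparable overlap values for different versus identical initial datasets would indicate that the final selection is largely insensitive to the initialization.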
As shown in
As described above, the system 10 evaluates sampling bias in deep active text classification via over 2,300 tests involving eight large datasets having sizes ranging from 100K to 3.6M. In particular, the system 10 conducts 20 times more tests and utilizes datasets that are at least two orders of magnitude larger than similar and known approaches. Additionally, the small query samples provided by the system 10 are often the size of the entire datasets utilized by the similar and known approaches. The system 10 also demonstrates that the selected samples are robust to sampling biases (e.g., class and feature biases) in the context of text classification and to various algorithmic factors including, but not limited to, initial dataset selection, query size and query strategy including utilized models and acquisition functions. The system 10 can be implemented utilizing default hyperparameters and trained on an NVIDIA Tesla V100 16 GB, although any suitable graphics card can be utilized.
Additionally, the system 10 demonstrates that AL with query strategies utilizing a single FTZ model with output uncertainty as an acquisition function yields state-of-the-art accuracy and provides sample datasets similar to those from other approaches (e.g., ensemble models). For example, a single model used for querying, utilizing a greedy uncertainty strategy with a large query size, outperforms approaches utilizing Bayesian dropout and ensemble models or diversity based query strategies for active classification as well as for creating small surrogate training datasets. In particular, the FTZ with output Entropy (FTZ_Ent) model is effective to generate compact surrogate datasets (e.g., 5×-20× compression) that exhibit negligible class bias, are favorably biased toward sampling data points near class boundaries and are robust to various algorithmic factors.
Lastly, the system 10 demonstrates an effectiveness of the selected samples by generating small and high-quality datasets to efficiently and cost-effectively train large models. In particular, the system 10 demonstrates that the small surrogate training datasets can be effectively utilized to bootstrap the training of large DNN models (e.g., ULMFiT) to a high accuracy at 25×-200× speedups. It should be understood that the capabilities of the system 10 and results provided by the system 10 can be applicable to several issues including, but not limited to, the nature of sampled data (e.g., distribution in the feature space and importance for a task at hand), generation of surrogate datasets for a variety of applications (e.g., hyper-parameter search and architecture search), extension to other deep models beyond FTZ, extension beyond classification models, dataset compression problems, and active semi-supervised, incremental-online and continuous learning scenarios.
The functionality provided by the present disclosure could be provided by computer software code 406, which could be embodied as computer-readable program code stored on the storage device 404 and executed by the CPU 412 using any suitable, high or low level computing language, such as Python, Java, C, C++, C#, .NET, MATLAB, etc. The network interface 408 could include an Ethernet network interface device, a wireless network interface device, or any other suitable device which permits the server 402 to communicate via the network. The CPU 412 could include any suitable single-core or multiple-core microprocessor of any suitable architecture that is capable of implementing and running the computer software code 406 (e.g., Intel processor). The random access memory 414 could include any suitable, high-speed, random access memory typical of most modern computers, such as dynamic RAM (DRAM), etc.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art can make any variations and modification without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure. What is desired to be protected by Letters Patent is set forth in the following claims.
This application claims priority to U.S. Provisional Patent Application Ser. No. 62/869,721 filed on Jul. 2, 2019, the entire disclosure of which is hereby expressly incorporated by reference.