Implementing natural language processing systems that allow computers to respond to natural language input is a challenging task. The task becomes increasingly difficult when machines attempt to understand opinions expressed in the input text and extract classification information based on limited training data. There is a need for techniques and systems that can respond to the needs of modern natural language systems in a time- and cost-effective manner.
Certain embodiments of the present disclosure relate to a non-transitory computer readable storage medium storing instructions that are executable by a data augmentation system that includes one or more processors to cause the data augmentation system to perform a method for generating training data for a machine learning model. The method can include accessing a machine learning model from a machine learning model repository; identifying a data set associated with the machine learning model; generating a set of data augmentation operators using the data set; selecting a sequence of tokens associated with the machine learning model; generating at least one sequence of tokens by applying at least one data augmentation operator of the set of data augmentation operators on the selected sequence of tokens; selecting a subset of sequences of tokens from the generated at least one sequence of tokens; storing the subset of sequences of tokens in a training data repository; and providing the subset of sequences of tokens to the machine learning model.
According to some disclosed embodiments, generating a set of data augmentation operators using the data set further comprises: selecting one or more data augmentation operators; generating sequentially formatted input sequences of tokens of the identified data set; applying the one or more data augmentation operators to at least one sequence of tokens of the sequentially formatted input sequences of tokens to generate at least one modified sequence of tokens; and determining the set of data augmentation operators to reverse the at least one modified sequence of tokens to the corresponding sequentially formatted input sequences of tokens.
According to some disclosed embodiments, the accessed machine learning model is a sequence-to-sequence machine learning model.
According to some disclosed embodiments, selecting a subset of sequences of tokens further comprises: filtering at least one sequence of tokens from the generated at least one sequence of tokens using a filtering machine learning model; determining a weight of at least one sequence of tokens in the filtered at least one sequence of tokens, using a weighting machine learning model; and applying the weight to at least one sequence of tokens of the filtered at least one sequence of tokens.
According to some disclosed embodiments, the weight of the at least one sequence of tokens is determined based on the importance of the sequence of tokens in training the machine learning model.
According to some disclosed embodiments, the importance of the at least one sequence of tokens is determined by calculating a validation loss of the machine learning model when trained using the at least one sequence of tokens.
According to some disclosed embodiments, the filtering machine learning model is trained using the validation loss.
According to some disclosed embodiments, the weighting machine learning model is trained until the validation loss reaches a threshold value.
According to some disclosed embodiments, the at least one data augmentation operator includes at least one of a token deletion operator, a token insertion operator, a token replacement operator, a token swap operator, a span deletion operator, a span shuffle operator, a column shuffle operator, a column deletion operator, an entity swap operator, a back translation operator, a class generator operator, or an inverse data augmentation operator.
According to some disclosed embodiments, the inverse data augmentation operator is a combination of multiple data augmentation operators.
According to some disclosed embodiments, the at least one data augmentation operator is context dependent.
According to some disclosed embodiments, providing the subset of sequences of tokens as input to the machine learning model further comprises: accessing unlabeled data from an unlabeled data repository; generating augmented unlabeled sequences of tokens of the accessed unlabeled data; determining soft labels of the augmented unlabeled sequences of tokens; and providing the augmented unlabeled sequences of tokens with associated soft labels as input to the machine learning model.
Certain embodiments of the present disclosure relate to a non-transitory computer readable storage medium storing instructions that are executable by a data augmentation system that includes one or more processors to cause the data augmentation system to perform a method for generating data augmentation operators to generate augmented sequences of tokens. The method can include accessing unlabeled data from an unlabeled data repository; preparing one or more sequences of tokens of the accessed unlabeled data; transforming the one or more sequences of tokens to generate at least one corrupted sequence; providing as input the one or more sequences of tokens and the at least one corrupted sequence to a sequence-to-sequence model of the data augmentation system; executing the sequence-to-sequence model to determine at least one operation needed to reverse the at least one corrupted sequence to the sequence in the one or more sequences of tokens used to generate the at least one corrupted sequence; and generating inverse data augmentation operators based on the determined at least one operation to reverse the at least one corrupted sequence.
According to some disclosed embodiments, preparing one or more token sequences of the accessed unlabeled data further comprises: transforming each row in a database table into a sequence of tokens, wherein the sequence of tokens includes indicators for the beginning and end of a column value.
According to some disclosed embodiments, transforming the one or more token sequences to generate at least one corrupted sequence further comprises: selecting a sequence of tokens from one or more sequences of tokens; selecting a data augmentation operator from a set of data augmentation operators; and applying the data augmentation operator to the selected sequence of tokens.
According to some disclosed embodiments, generating at least one corrupted sequence further comprises: generating an aggregate corrupted sequence by applying a plurality of data augmentation operators in a sequential order to the selected sequence of tokens.
Certain embodiments of the present disclosure relate to a non-transitory computer readable storage medium storing instructions that are executable by a data augmentation system that includes one or more processors to cause the data augmentation system to perform a method for extracting classification information from input data. The method can include adding task-specific layers to a machine learning model to generate a modified network; initializing the modified network of the machine learning model and the added task-specific layers; selecting input data entries, wherein the selection includes serializing the input data entries; providing the serialized input data entries to the modified network; and extracting classification information using the task-specific layers of the modified network.
According to some disclosed embodiments, the method further comprises: generating augmented data using at least one inverse data augmentation operator; and pre-training the machine learning model using the augmented data.
According to some disclosed embodiments, serializing the input data entries further comprises: identifying a class token and other tokens in the input data entries; and marking the class token and the other tokens using different markers representing the beginning and end of each token.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate several embodiments and, together with the description, serve to explain the disclosed principles. In the drawings:
In the following detailed description, numerous details are set forth to provide a thorough understanding of the disclosed example embodiments. It is understood by those skilled in the art that the principles of the example embodiments can be practiced without every specific detail. The embodiments disclosed are exemplary and are not intended to disclose every possible embodiment consistent with the claims and disclosure. Well-known methods, procedures, and components have not been described in detail so as not to obscure the principles of the example embodiments. Unless explicitly stated, the example methods and processes described herein are neither constrained to a particular order or sequence nor constrained to a particular system configuration. Additionally, some of the described embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.
As used herein, unless specifically stated otherwise, the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a component can include A or B, then, unless specifically stated otherwise or infeasible, the component can include A, or B, or A and B. As a second example, if it is stated that a component can include A, B, or C, then, unless specifically stated otherwise or infeasible, the component can include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.
Reference will now be made in detail to the disclosed embodiments, examples of which are illustrated in the accompanying drawings. Unless explicitly stated, sending and receiving as used herein are understood to have broad meanings, including sending or receiving in response to a specific request or without such a specific request. These terms thus cover both active forms, and passive forms, of sending and receiving.
The embodiments described herein provide technologies and techniques for data integration, data cleaning, and text classification that allow computing systems to extract classification information from limited training data using natural language processing techniques.
The described embodiments provide a distinct advantage over existing techniques of natural language processing. Existing natural language processing systems can require a large set of labeled training data to operate effectively. Alternatively, existing systems can apply a set of static operators against limited labeled data to generate additional data that is not sufficiently diverse from the limited labeled data. Further, the labels of the limited labeled data may not be appropriate for the generated additional data. Thus, there is a need for generating operators that can, in turn, generate additional data with minimal supervision.
The embodiments disclosed herein can help generate data augmentation operators and perform various natural language processing tasks in a semi-supervised manner by generating diverse training data using the generated data augmentation operators. The described embodiments can help with natural language processing tasks such as entity matching, error detection, data cleaning, and text classification, to name a few. The disclosed embodiments transform data into a sequential representation so that the same operators can generate training data for the different natural language processing tasks listed above. This can provide significant advantages in natural language processing systems that may need to respond to different individuals or questions that often say the same thing but in different ways. By allowing for semi-supervised learning, generated data augmentation operators, and sequential data generation, the embodiments disclosed herein can improve the ability to use natural language processing in various industries and particularized contexts without the need for a time-consuming and expensive pre-training process.
Data augmentation system 100 can include a data augmentation operator (DAO) generator 110 to generate augmentation operators that can generate data used for training a machine learning model. DAO generator 110 can be a set of software functions or one or more software programs applied to a sequence (for example, a text sentence) to generate transformed sequences. A processor (e.g., CPU 520 of
A transformed sequence can include an original sequence with updates to one or more words of the original sequence. The updates can include adding, removing, replacing, or swapping words or phrases in the original sequence. The transformed sequences can help augment the original sequence as training data to a machine learning model. DAO generator 110 software functions can include data augmentation operators used to generate the transformed sequences to train a machine learning model. DAO generator 110 can generate data augmentation operators that can be stored in Data Augmentation Operator (DAO) repository 120 for later use.
DAO repository 120 can organize data augmentation operators by the machine learning model that intends to use the augmented sequences generated by the data augmentation operators. For example, DAO repository 120 can include a separate database for managing each set of data augmentation operators that generate a training set for training a machine learning model. In some embodiments, a data augmentation operator can have references to all machine learning models whose training data of augmented sequences are generated using the data augmentation operator. In some embodiments, DAO generator 110 can take as input an existing data augmentation operator of DAO repository 120 to generate new data augmentation operators. DAO repository 120 can store relationship information between previous data augmentation operators and new data augmentation operators generated using the previous data augmentation operators as input. In some embodiments, DAO repository 120 can include differences between new and previous data augmentation operators. DAO repository 120 can function as a version control managing multiple versions of a data augmentation operator generated and updated by DAO generator 110.
As illustrated in
Unlabeled data repository 130 can include unannotated data (e.g., data that has not been labeled or annotated by humans or other processes). Unlabeled data repository 130 can be an RDBMS, an NRDBMS, or other types of data store. Unlabeled data repository 130 can provide a large quantity of data for training that is not annotated by humans or other processes, making it difficult to use for supervised learning of a natural language processing system. Data augmentation system 100 can use data augmentation operators of DAO repository 120 to generate additional unlabeled data for training a machine learning model (e.g., target model 141). Data augmentation system 100 can encode the unlabeled data in unlabeled data repository 130 and guess labels using the MixMatch method adjusted for natural language processing. The MixMatch method can guess low-entropy labels to be assigned to unlabeled data generated using data augmentation operators of DAO repository 120. The MixMatch method can receive feedback on the guessed labels to improve the guessing in future iterations. The MixMatch method can improve its label guessing ability based on the use of unlabeled data with guessed labels in downstream language processing tasks. Downstream language processing tasks can evaluate the unlabeled data during training and use by machine learning models and provide feedback to the MixMatch method on the guessed labels. Data augmentation system 100 can combine the unlabeled data of unlabeled data repository 130, annotated with guessed labels, with additional generated data to satisfy target model 141 training data requirements. A detailed description of using unlabeled data repository 130 to generate additional data is presented in connection with
Data augmentation system 100 can also include Machine learning (ML) models repository 140 that can provide machine learning models for generating data augmentation operators that are, in turn, used to generate training data for other machine learning models in ML models repository 140. ML models repository 140 can also include ML models, such as target model 141, that can be trained using the additional training data to extract classification information. Data augmentation system 100 can use data stored in both text corpus repository 160 and unlabeled data repository 130 as input to train target model 141 to improve the extraction of classification information from an input sentence.
Data augmentation system 100 can use data augmentation operators in DAO repository 120 to generate additional data to train target model 141 to improve extraction of classification information from an input sentence. Target model 141 is a machine learning model and can include a plurality of layers. The plurality of layers can include fully connected layers or partially connected layers. Target model 141 can transform the data in text corpus repository 160 and unlabeled data repository 130 before other layers of target model 141 use the data. Target model 141 can be a language model that can use embedding layer 611 (as described in
ML models repository 140 can provide target model 141 that can aid in the extraction of classification information of an input sentence. Target model 141 can include an encoding layer (e.g., embedding layer 611 and encoding layer 612 of
In some embodiments, ML models repository 140 can provide a machine learning model as input to DAO generator 110 to generate data augmentation operators to generate additional training data. ML models repository 140 can provide sequence-to-sequence models as input to DAO generator 110 to generate data augmentation operators. In some embodiments, a sequence-to-sequence model can be a standard sequence-to-sequence model, such as the T5 model from Google. A detailed description of the sequence-to-sequence model used to generate data augmentation operators is presented in connection with
(DAO) generator 110 (as shown in
As illustrated in
As illustrated in
Unlabeled data 231 can be a sequential representation of data generated by sequence-to-sequence model 220 using data from unlabeled data repository 130. In some embodiments, unlabeled data 231 can be generated using labeled data in text corpus repository 160 or a mix of labeled and unlabeled data. Sequential data representation can include identifying individual tokens in a text and including markers showing each token's beginning and end. The sequential data can include the markers and the tokens as a character string. In some embodiments, sequence-to-sequence model 220 can introduce only markers for the beginning of a token and can use the same markers as end markers for a preceding token. The input text from unlabeled data 231 can be sequentially represented by including a class type token marker “[CLS]” and an other type token marker “[SEP].” For example, “The room was modern room” could be represented in sequential form as “[CLS] The room was modern [SEP] room,” indicating two tokens: the class token (“the room was modern”) and another token (“room”) identified using the markers “[CLS]” and “[SEP].” The format of sequential representation of data can depend on the text classification task for which unlabeled data 231 is used as training data to train data augmentation system 100. A data augmentation system (e.g., data augmentation system 100 of
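The following sketch illustrates one possible way to produce the sequential representation described above; the function name and marker placement are illustrative assumptions rather than a required implementation.

```python
# Illustrative sketch only: serializes a class phrase and remaining tokens with the
# "[CLS]" and "[SEP]" markers described above. The function name is hypothetical.

def serialize_text(class_phrase: str, other_phrases: list[str]) -> str:
    """Return a sequential representation such as '[CLS] The room was modern [SEP] room'."""
    parts = ["[CLS]", class_phrase]
    for phrase in other_phrases:
        parts.extend(["[SEP]", phrase])
    return " ".join(parts)

if __name__ == "__main__":
    print(serialize_text("The room was modern", ["room"]))
    # [CLS] The room was modern [SEP] room
```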
In some embodiments, sequence-to-sequence model 220 can transform tabular input data to sequential representation before using it to generate data augmentation operators. In some embodiments, data augmentation operators can include the functionality to serialize data before any transformation of the serialized data using augmentation functionality in the data augmentation operators. Sequence-to-sequence model 220 can transform tabular data into sequential data by converting each row of data to include markers for each column cell and its content value using the markers “[COL]” and “[VAL].” An example row in a table with contact information (in three columns, Name, Address, and Phone) can be sequentially represented as follows: “[COL] Name [VAL] Apple Inc. [COL] Address [VAL] 1 Infinity Loop [COL] Phone [VAL] 408-000-0000.” In some embodiments, multiple rows of table data can be combined into a single sequence using an additional marker, such as “[SEP],” placed between tokens representing two rows. A detailed description of tabular data transformation to sequential representation is presented in connection with
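A minimal sketch of the tabular serialization described above, assuming a Python dictionary per row; the helper names and the use of “[SEP]” between rows are illustrative assumptions.

```python
# Illustrative sketch: converts table rows into the "[COL]"/"[VAL]" sequence format
# described above; helper names and the "[SEP]" row separator are assumptions.

def serialize_row(row: dict) -> str:
    """Serialize one table row, e.g. {'Name': 'Apple Inc.', ...}."""
    return " ".join(f"[COL] {col} [VAL] {val}" for col, val in row.items())

def serialize_rows(rows: list) -> str:
    """Combine multiple rows into a single sequence separated by '[SEP]'."""
    return " [SEP] ".join(serialize_row(r) for r in rows)

if __name__ == "__main__":
    row = {"Name": "Apple Inc.", "Address": "1 Infinity Loop", "Phone": "408-000-0000"}
    print(serialize_row(row))
    # [COL] Name [VAL] Apple Inc. [COL] Address [VAL] 1 Infinity Loop [COL] Phone [VAL] 408-000-0000
```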
As illustrated in
In some embodiments, the serialization process can depend on the purpose of generating data augmentation operators. A detailed description of serialization of data specific for error detection/data cleaning purposes is presented in connection with
For example, data 291 represents row 281, which has three columns “Name,” “Address,” and “Phone,” using “[COL]” markers followed by the same column names and “[VAL]” markers followed by the values present in row 281 for each of the columns. Unlike
As illustrated in
Referring back to
DAO generator 110 can include corrupted unlabeled data 232 used as input to sequence-to-sequence model 220 to generate data augmentation operators. DAO generator 110 can generate corrupted unlabeled data 232 by applying existing data augmentation operators to unlabeled data 231. DAO generator 110 can access existing data augmentation operators from DAO repository 120 (as shown in
Data in corrupted unlabeled data 232 can be generated by applying multiple existing data augmentation operators to each sequence in unlabeled data 231. In some embodiments, DAO generator 110 can apply a different set of existing data augmentation operators to each sequence of tokens in unlabeled data 231. In some embodiments, DAO generator 110 can apply the same set or subset of existing data augmentation operators on each sequence in a different order. A data augmentation operator can be selected randomly from an existing set of data augmentation operators to achieve the application of different data augmentation operators. DAO generator 110 can select data augmentation operators in different orders to apply to each sequence of tokens to create a randomizing effect on the sequences of tokens. The set of data augmentation operators applied to the sequence of tokens in unlabeled data 231 can be based on the topic of the unlabeled data 231. In some embodiments, the set of data augmentation operators applied can depend on the classification task conducted by a machine learning model (e.g., target model 141 of
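The following sketch, offered only as an illustration, shows how a sequence might be corrupted by applying a random subset of baseline operators in a random order; the specific operators and the “[MASK]” filler token are assumptions, not the disclosed operator set.

```python
# Illustrative sketch: corrupts a token sequence by applying a random subset of
# baseline operators (deletion, swap, insertion) in a random order.
import random

def delete_token(tokens):
    i = random.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def swap_tokens(tokens):
    i, j = random.sample(range(len(tokens)), 2)
    tokens = list(tokens)
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def insert_token(tokens, filler="[MASK]"):
    i = random.randrange(len(tokens) + 1)
    return tokens[:i] + [filler] + tokens[i:]

BASELINE_OPERATORS = [delete_token, swap_tokens, insert_token]

def corrupt(sequence: str, n_ops: int = 2) -> str:
    tokens = sequence.split()
    for op in random.sample(BASELINE_OPERATORS, k=min(n_ops, len(BASELINE_OPERATORS))):
        if len(tokens) > 2:  # keep the sequence non-trivial
            tokens = op(tokens)
    return " ".join(tokens)

if __name__ == "__main__":
    print(corrupt("[CLS] The room was modern [SEP] room"))
```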
Corrupted unlabeled data 232 can include a relationship to the original data in unlabeled data 231 transformed using existing data augmentation operators. The relationship can include a reference to the original sequence of tokens that is corrupted using a series of existing data augmentation operators. DAO generator 110 can use corrupted unlabeled data 232 as input to sequence-to-sequence model 220 to generate data augmentation operators. Sequence-to-sequence model 220 can generate data augmentation operators by determining a set of data transformation operations needed to revert a transformed sequence of tokens in corrupted unlabeled data 232 to the associated original sequence of tokens. Sequence-to-sequence model 220 can be pre-trained to learn about the transformation operations needed to revert a transformed sequence of tokens in a corrupted sequence to the original sequence. Sequence-to-sequence model 220 can determine transformation operations used to construct data augmentation operators. Such data augmentation operators, constructed by reversing the effect of a transformation, are called inverse data augmentation operators. A detailed description of inverse data augmentation construction is presented in connection with
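As a rough illustration of the reversal just described, the sketch below fine-tunes an off-the-shelf T5 model with the Hugging Face transformers library to map corrupted sequences back to their originals; the library, checkpoint, and training loop are assumptions standing in for sequence-to-sequence model 220, not the disclosed implementation.

```python
# Illustrative sketch (assumption): fine-tuning a T5 checkpoint so it learns to map
# corrupted sequences back to their originals, i.e., the behavior attributed above
# to sequence-to-sequence model 220.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# (corrupted, original) pairs; in the disclosure these would come from unlabeled
# data 231 corrupted into corrupted unlabeled data 232 by existing operators.
pairs = [
    ("[CLS] The was modern [SEP] room", "[CLS] The room was modern [SEP] room"),
]

model.train()
for corrupted, original in pairs:
    inputs = tokenizer(corrupted, return_tensors="pt")
    labels = tokenizer(original, return_tensors="pt").input_ids
    loss = model(input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss  # teacher-forced reconstruction loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After training, generating from a new corrupted sequence yields a repaired sequence,
# so the learned mapping can act as an inverse data augmentation operator.
```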
DAO generator 110 can transfer the generated inverse data augmentation operators to DAO repository 120. In some embodiments, DAO generator 110 can notify the data augmentation system 100 about the generated inverse data augmentation operators. Data augmentation system 100 can retrieve the inverse data augmentation operators from DAO generator 110 and store them in DAO repository 120. In some embodiments, DAO generator 110 or data augmentation system 100 can supply the inverse data augmentation operators to ML model platform 150 (as shown in
Referring back to
By generating additional phrases according to the embodiments described, data augmentation system 100 can extract classification information cost-effectively and efficiently. Moreover, the data augmentation system 100, outlined above, and described in more detail below, can generate additional data from limited labeled data, which other existing systems may consider an insufficient amount of data. Data augmentation system 100 can utilize unlabeled data in unlabeled data repository 130 that may be considered unusable by the existing systems.
Data augmentation system 100 can include Machine Learning (ML) model platform 150 that can be used to train a machine learning model. ML model platform 150 can access a machine learning model to train from Machine Learning (ML) models repository 140. ML model platform 150 can train a model from ML models repository 140 using data from text corpus repository 160 and/or unlabeled data repository 130 as input. ML model platform 150 can also take as input data augmentation operators from DAO repository 120 to generate additional training data to train the machine learning model.
In some embodiments, ML model platform 150 can connect with meta-learning policy framework 170 to determine if additional training data generated using data augmentation operators from DAO repository 120 can be used to train the machine learning model. A detailed description of components of meta-learning policy framework 170 is presented in connection with
Components of meta-learning policy framework 170 can include machine learning models, such as filtering model 371 and weighting model 372, to identify important sequences of tokens to use as training data for a machine learning model. Meta-learning policy framework 170 can also include learning capability to improve the identification of sequences of tokens. Meta-learning policy framework 170 can help improve machine learning models' training through iterations of improved identification of important sequences of tokens used as input to the machine learning models. Meta-learning policy framework 170 also includes a loss function 373 that can assist filtering model 371 and weighting model 372 in improving their identification of important sequences of tokens to train the machine learning model. Loss function 373 can help improve the identification capabilities of filtering model 371 and weighting model 372 by evaluating the effectiveness of a machine learning model trained using data identified by filtering model 371 and weighting model 372. Loss function 373 can help teach filtering model 371 and weighting model 372 to learn how to identify important data for training a machine learning model. Utilization of loss function 373 to improve filtering model 371 and weighting model 372 is presented in connection with
Filtering model 371 can help filter out unwanted sequences generated using data augmentation operators applied to sequences in text corpus repository 160 by ML model platform 150. Filtering model 371 can be a binary classifier that can decide whether to consider or discard an augmented sequence generated by ML model platform 150 by applying data augmentation operators (generated by DAO generator 110 of
Filtering model 371 can be used to quickly filter augmented sentences when the number of augmented sentences is above a certain threshold. In some embodiments, filtering model 371 can filter augmented examples based on certain pre-determined conditions. A user (e.g., user 190 of
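The sketch below stands in for filtering model 371 as a simple binary keep/discard classifier; scikit-learn, TF-IDF features, and the toy labels are assumptions and not the disclosed filtering architecture.

```python
# Illustrative sketch (assumption): a simple binary classifier standing in for
# filtering model 371, trained to keep or discard augmented sequences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1 = keep the augmented sequence, 0 = discard it (labels would come from feedback).
examples = ["[CLS] the room was modern [SEP] room",
            "[CLS] room room room modern [SEP]",
            "[CLS] the staff was friendly [SEP] staff",
            "[CLS] was was the [SEP]"]
labels = [1, 0, 1, 0]

filtering_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
filtering_model.fit(examples, labels)

augmented = ["[CLS] the room was quite modern [SEP] room"]
kept = [s for s, keep in zip(augmented, filtering_model.predict(augmented)) if keep == 1]
print(kept)
```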
Weighting model 372 can determine the importance of the selected examples by assigning weights to the augmented sequences, which are used to compute the loss of a machine learning model trained using the augmented sequences. In some embodiments, weighting model 372 can be directly applied to the augmented examples generated using data augmentation operators of DAO repository 120. In some embodiments, weighting model 372 can determine which data augmentation operators can be used more than the others by applying weights to the data augmentation operators instead of the augmented sequences.
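A minimal sketch, assuming fixed-size sequence encodings, of a small weighting network standing in for weighting model 372; the encoder, dimensions, and architecture are placeholders.

```python
# Illustrative sketch (assumption): a tiny weighting network standing in for weighting
# model 372. It maps an encoded augmented sequence to a weight in (0, 1).
import torch
import torch.nn as nn

class WeightingModel(nn.Module):
    def __init__(self, encoding_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(encoding_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid(),  # weight between 0 and 1
        )

    def forward(self, encodings: torch.Tensor) -> torch.Tensor:
        # encodings: (batch, encoding_dim) -> (batch,) per-example weights
        return self.scorer(encodings).squeeze(-1)

weighting_model = WeightingModel()
fake_encodings = torch.randn(4, 768)        # placeholder sequence encodings
weights = weighting_model(fake_encodings)   # later used to scale each example's loss
print(weights.detach())
```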
A loss function 373 can be used to train filtering model 371 and weighting model 372. In some embodiments, loss function 373 can be a layer of the target machine learning model (e.g., target model 141) that can help evaluate the validation loss values when executing the target machine learning model to classify a validation sequence set. Validation loss calculation and back-propagating of loss values are presented in connection with
Referring back to
ML model platform 150 can store the additional data (e.g., in the form of sentences) in text corpus repository 160 for later use. In some embodiments, the additional data is temporarily stored in memory and supplied to DAO generator 110 to generate additional training data. Text corpus repository 160 can receive and store the additional data generated by ML model platform 150.
ML model platform 150 can select different data augmentation operators to apply to input data selected from text corpus repository 160 to generate additional data. The ML model platform 150 can select a different data augmentation operator for each input data sentence. ML model platform 150 can also select data augmentation operators based on predefined criteria or in a random manner. In some embodiments, ML model platform 150 can apply the same data augmentation operator for a set of sentences or a set period.
Text corpus repository 160 can be pre-populated using a corpus of sentences. In some embodiments, text corpus repository 160 saves a set of input sentences supplied by a user before passing them to other components of data augmentation system 100. In other embodiments, the sentences in text corpus repository 160 can be supplied by a separate system. For example, text corpus repository 160 can include sentences supplied by user input, other systems, other data sources, or feedback from data augmentation system 100 or its components. As described above in reference to unlabeled data repository 130, text corpus repository 160 can be a Relational Database Management System (RDBMS) (e.g., Oracle Database, Microsoft SQL Server, MySQL, PostgreSQL, or IBM DB2). An RDBMS can be designed to efficiently return data for an entire row, or record, from the database in as few operations as possible. An RDBMS can store data by serializing each row of data in a data structure. In an RDBMS, data associated with a record can be stored serially such that data associated with all categories of the record can be accessed in one operation. Moreover, an RDBMS can efficiently allow access to related records stored in disparate tables. For example, in an RDBMS, tables can be linked by a referential column, and the RDBMS can join tables together to retrieve data for a data structure. In some embodiments, the text corpus repository 160 can be a non-relational database system (NRDBMS) (e.g., XML, Cassandra, CouchDB, MongoDB, Oracle NoSQL Database, FoundationDB, or Redis). A non-relational database system can store data using a variety of data structures such as, among others, a key-value store, a document store, a graph, and a tuple store. For example, a non-relational database using a document store could combine all of the data associated with a particular identifier into a single document encoded using XML. The text corpus repository 160 can also be an in-memory database such as Memcached. In some embodiments, the contents of text corpus repository 160 can exist both in a persistent storage database and in an in-memory database, such as is possible in Redis. In some embodiments, text corpus repository 160 can be stored on the same database as unlabeled data repository 130.
Data augmentation system 100 can receive requests for various tasks, including generating data augmentation operators to generate augmented data to train machine learning models. Requests to data augmentation system 100 can include generating augmented data for training machine learning models and extracting classification information using machine learning models. Data augmentation system 100 can receive various requests over network 180. Network 180 can be a local network, the Internet, or a cloud network. User 190 can send requests for various tasks listed above to data augmentation system 100 over network 180. User 190 can interact with data augmentation system 100 over a tablet, laptop, or portable computer using a web browser or an installed application. User 190 can send request 195 over network 180 to data augmentation system 100 to generate augmented data and operators, train a machine learning model, and extract classification information using the machine learning model.
The components of data augmentation system 100 can run on a single computer or can be distributed across multiple computers or processors. The different components of data augmentation system 100 can communicate over a network (e.g., LAN or WAN) 180 or the Internet. In some embodiments, each component can run on multiple compute instances or processors. The instances of each component of the data augmentation system 100 can be a part of a connected network such as a cloud network (e.g., Amazon AWS, Microsoft Azure, Google Cloud). In some embodiments, some, or all, of the components of data augmentation system 100 are executed in virtualized environments such as a hypervisor or virtual machine.
As illustrated in
In stage 1, DAO generator 110 can receive unlabeled data 231 as input to generate a set of data augmentation operators 421. Data augmentation operators 421 can be different from baseline data augmentation operators, such as operators to insert, delete, and swap tokens or spans. Baseline data augmentation operators used for generating inverse data augmentation operators are presented in connection with
In stage 2, the newly generated inverse data augmentation operators stored as data augmentation operators 421 can be applied to the existing training data 411 of text corpus repository 160 to produce augmented data 412. A subset of data augmentation operators of data augmentation operators 421 can produce augmented data 412 by applying each data augmentation operator to generate multiple augmented sentences. In some embodiments, data augmentation system 100 can skip applying some data augmentation operators to certain sentences in training data 411. Selective application of data augmentation operators on an original sentence of training data 411 can be preconfigured or can be based on the configuration provided by a user (e.g., user 190 of
In some embodiments, unlabeled data 231 can be used to generate additional training data. Unlabeled data 231 needs to be associated with labels before utilization as additional training data. As illustrated in
Unlabeled data 231 and augmented unlabeled data 432 can have soft labels 481 applied using a close guess algorithm. A close guess algorithm can be based on the proximity of sequences in unlabeled data 231 and training data 411. Proximity of the sequences can be determined based on the proximity of vector-encoded representations of sequences in unlabeled data 231 and training data 411. In some embodiments, the label determination process can include averaging multiple versions of labels generated by a machine learning model. The averaging process can include averaging vectors representing the multiple labels. This label determination process can be part of the MixMatch method as described in
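The sketch below illustrates MixMatch-style soft-label guessing consistent with the averaging described above; the predicted distributions and the sharpening temperature are illustrative assumptions.

```python
# Illustrative sketch (assumption): average a model's predicted label distributions
# over several augmented copies of an unlabeled sequence, then sharpen the average
# toward a low-entropy soft label in the spirit of MixMatch.
import torch

def guess_soft_label(predictions: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """predictions: (num_augmentations, num_classes) probability vectors."""
    avg = predictions.mean(dim=0)                  # average the label vectors
    sharpened = avg ** (1.0 / temperature)         # sharpen toward low entropy
    return sharpened / sharpened.sum()

# Predicted class distributions for 3 augmented copies of one unlabeled sequence.
preds = torch.tensor([[0.6, 0.3, 0.1],
                      [0.5, 0.4, 0.1],
                      [0.7, 0.2, 0.1]])
soft_label = guess_soft_label(preds)   # shared by the sequence and its augmentations
print(soft_label)
```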
Unlabeled data 231 can be associated with soft labels 481 applied using a close guess algorithm. A close guess algorithm can be based on the proximity of unlabeled sequences of unlabeled data 231, labeled sequences of training data 411, and augmented sequences of augmented data 412. The proximity of the sequences can be determined based on the proximity of vectors of the sequences. In some embodiments, the label determination process can include averaging multiple versions of labels. The averaging process can include averaging vectors representing the multiple labels. Soft labels 481 associated with sequences of unlabeled data 231 can also be associated with sequences of augmented unlabeled data 432. Soft labels 481 association with augmented unlabeled data 432 can be based on the relationship between sequences of unlabeled data 231 and augmented unlabeled data 432. In some embodiments, a sequence of unlabeled data 231 can share the same soft label with augmented sequences of augmented unlabeled data 432 generated from it (by applying data augmentation operators 421). Data augmentation system 100 can generate soft labels to associate with augmented sequences of augmented unlabeled data 432 using the close guess algorithm discussed above. Data augmentation system 100 uses policy models 470 in later stages to identify the training data that meets the policy. Policy models 470 can help set the quality of training data for training a machine learning model (e.g., target model 141). Policy models 470 can be implemented using machine learning models such as filtering model 371 and weighting model 372.
In stage 3, the new set of training data present in augmented data 412 and augmented unlabeled data 432 is reviewed to identify high-quality training data using filtering model 371. Filtering model 371 engages in a strict determination of whether to include certain sequences of augmented data 412 and augmented unlabeled data 432. In some embodiments, filtering model 371 can apply different filtering strategies to filter augmented data 412 and augmented unlabeled data 432. A detailed description of filtering model 371 is presented in connection with
In stage 4, data augmentation system 100 can further identify the most important training data from the filtered set of training data using weighting model 372. Unlike the strict determination method of filtering model 371, weighting model 372 reduces certain sequences' effect in training a machine learning model by associating a weight with each sequence. A machine learning model consuming a sequence for training can discount its results if a low weight is applied to the sequence by weighting model 372. A detailed description of weighting model 372 is presented in connection with
In stage 5, the target batch 413 of sequences identified by the set of policy models 470 implemented by filtering model 371 and weighting model 372 can be used to train target model 141. Target model 141 can also be fine-tuned using a loss function 373.
Loss function 373 can fine-tune the behavior of target model 141 for the provided target batch 413, as well as the behavior of other machine learning models, such as filtering model 371 and weighting model 372, in determining the sequences to include in target batch 413. Loss function 373 can fine-tune machine learning models by determining the amount of deviation between the original data (e.g., training data 411) and the data generated through the process represented by stages 1 to 4. Loss function 373 can be a cross-entropy loss or an L2 loss function. Loss function 373 can be used to compare the probabilistic output of the machine learning model from a SoftMax layer (as described below in
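A minimal sketch, assuming per-example weights from weighting model 372, of how a weighted cross-entropy could serve as loss function 373; the tensors shown are placeholders.

```python
# Illustrative sketch (assumption): a weighted cross-entropy standing in for loss
# function 373, scaling each augmented example's loss by its assigned weight.
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.6, 0.0, 0.8],     # SoftMax-layer inputs for 2 examples
                       [0.2, 2.1, 0.3]])
targets = torch.tensor([0, 1])              # class indices (e.g., sentiment labels)
weights = torch.tensor([0.9, 0.4])          # per-example weights from weighting model 372

per_example = F.cross_entropy(logits, targets, reduction="none")
loss = (weights * per_example).mean()       # loss value to back-propagate
print(loss)
```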
The back-propagation step of stage 5 (shown in
In phase 1, loss function 373 can back-propagate success levels of target model 141 irrespective of the behavior of policy models 470. Target model 141, upon receipt of the success level, can determine whether to use the current training data (e.g., training data 411) or request an updated training data set.
In phase 2, the back-propagation process can optimize policy models 470, including filtering model 371 and weighting model 372, so they can generate more effective augmented sequences to train target model 141. Policy models 470 can generate a batch of selected augmented sequences using the training data 411 supplied to target model 141 to train and generate the updated target model 441. The updated target model 441 can then be used to calculate loss using validation data 414. The calculated loss can be used to update the policy models 470 by back-propagating loss values to policy models 470. Policy models 470 can review the calculated loss to improve and reduce the loss from updated target model 441. The second phase of the back-propagation process can train the data augmentation policy defined by policy models 470 such that the trained model can perform well on validation data 414.
Computing device 500 can include one or more central processing units (CPUs) 520 and a system memory 521. Computing device 500 can also include one or more graphics processing units (GPUs) 525 and graphic memory 526. In some embodiments, computing device 500 can be a headless computing device that does not include GPU(s) 525 or graphic memory 526.
CPUs 520 can be single or multiple microprocessors, field-programmable gate arrays, or digital signal processors capable of executing sets of instructions stored in a memory (e.g., system memory 521), a cache (e.g., cache 541), or a register (e.g., one of registers 540). CPUs 520 can contain one or more registers (e.g., registers 540) for storing various types of data including, inter alia, data, instructions, floating-point values, conditional values, memory addresses for locations in memory (e.g., system memory 521 or graphic memory 526), pointers and counters. CPU registers 540 can include special-purpose registers used to store data associated with executing instructions such as an instruction pointer, an instruction counter, or a memory stack pointer. System memory 521 can include a tangible or a non-transitory computer-readable medium, such as a flexible disk, a hard disk, a compact disk read-only memory (CD-ROM), magneto-optical (MO) drive, digital versatile disk random-access memory (DVD-RAM), a solid-state disk (SSD), a flash drive or flash memory, processor cache, memory register, or a semiconductor memory. System memory 521 can be one or more memory chips capable of storing data and allowing direct access by CPUs 520. System memory 521 can be any type of random-access memory (RAM), or other available memory chip capable of operating as described herein.
CPUs 520 can communicate with system memory 521 via a system interface 550, sometimes referred to as a bus. In embodiments that include GPUs 525, GPUs 525 can be any type of specialized circuitry that can manipulate and alter memory (e.g., graphic memory 526) to provide or accelerate the creation of images. GPUs 525 can have a highly parallel structure optimized for processing large, parallel blocks of graphical data more efficiently than general-purpose CPUs 520. Furthermore, the functionality of GPUs 525 can be included in a chipset of a special purpose processing unit or a co-processor.
CPUs 520 can execute programming instructions stored in system memory 521 or other memory, operate on data stored in memory (e.g., system memory 521), and communicate with GPUs 525 through the system interface 550, which bridges communication between the various components of the computing device 500. In some embodiments, CPUs 520, GPUs 525, system interface 550, or any combination thereof, are integrated into a single chipset or processing unit. GPUs 525 can execute sets of instructions stored in memory (e.g., system memory 521), to manipulate graphical data stored in system memory 521 or graphic memory 526. For example, CPUs 520 can provide instructions to GPUs 525, and GPUs 525 can process the instructions to render graphics data stored in the graphic memory 526. Graphic memory 526 can be any memory space accessible by GPUs 525, including local memory, system memory, on-chip memories, and hard disk. GPUs 525 can enable displaying of graphical data stored in graphic memory 526 on display device 524 or can process graphical information and provide that information to connected devices through network interface 518 or I/O devices 530.
Computing device 500 can include a display device 524 and input/output (I/O) devices 530 (e.g., a keyboard, a mouse, or a pointing device) connected to I/O controller 523. I/O controller 523 can communicate with the other components of computing device 500 via system interface 550. It should now be appreciated that CPUs 520 can also communicate with system memory 521 and other devices in manners other than through system interface 550, such as through serial communication or direct point-to-point communication. Similarly, GPUs 525 can communicate with graphic memory 526 and other devices in ways other than system interface 550. In addition to receiving input, CPUs 520 can provide output via I/O devices 530 (e.g., through a printer, speakers, bone conduction, or other output devices).
Furthermore, the computing device 500 can include a network interface 518 to interface to a LAN, WAN, MAN, or the Internet through a variety of connections including, but not limited to, standard telephone lines, LAN or WAN links (e.g., 802.21, T1, T3, 56 kb, X.25), broadband connections (e.g., ISDN, Frame Relay, ATM), wireless connections (e.g., those conforming to, among others, the 802.11a, 802.11b, 802.11b/g/n, 802.11ac, Bluetooth, Bluetooth LTE, 3GPP, or WiMax standards), or some combination of any or all of the above. Network interface 518 can comprise a built-in network adapter, network interface card, PCMCIA network card, card bus network adapter, wireless network adapter, USB network adapter, modem, or any other device suitable for interfacing the computing device 500 to any type of network capable of communication and performing the operations described herein.
As illustrated in
As illustrated in
Data augmentation operators can be applied to a different training data set (e.g., original sequence 810) to generate new augmented sequences 820 for training a machine learning model for a different classification task (for example, error detection).
Data augmentation operators applied to original sequence 710 can be applied to a different training data set (e.g., original sequence 910) to generate new augmented sequences 920 for training a machine learning model for a different classification task (for example, error detection). Further, the same inverse data augmentation operator, such as “invDA1” 713, when applied to different sequences (e.g., original sequences 710, 810, 910), can result in different operations applied. For example, “invDA1” 713 applied to original sequences 710 and 810 results in a token insertion operation at the end of the augmented sequence, as shown in augmented sequences 723 and 823. The same operator applied to original sequence 910 results in two token insertions, as shown in augmented sequence 923. The different operations applied by the same inverse data augmentation operator (e.g., “invDA1” 713) can be based on context, such as a classification task.
Linear layers 1031-1032 can aid in extracting classification information of various kinds. For example, linear layer 1031 can extract sentiment classification information. As described in
SoftMax layers 1041-1042 can help understand the probabilities of each encoding and the associated labels. SoftMax layers 1041-1042 can understand the probabilities of each encoding by converting prediction scores of classes into a probabilistic prediction for an input instance (e.g., input sequence 1020). The conversion can include converting a vector representing classes of a classification task to probability percentages. For example, input sequence 1020 with a vector (1.6, 0.0, 0.8) representing various sentiment class values can be provided as input to SoftMax layer 1041 to generate probabilities for the positive, neutral, and negative sentiment classes. The output vector of probabilities generated by SoftMax layer 1041 can be close to a one-hot distribution. The proximity of the SoftMax layer 1041 output to such a distribution can be based on properties of the SoftMax function used by SoftMax layer 1041. A one-hot distribution can be a multi-dimensional vector in which one dimension has the value 1 and the other dimensions have the value 0. An ML model can utilize the multi-dimensional vector to predict one of the classes with 100% probability.
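A short worked example of the SoftMax conversion described above, using the score vector (1.6, 0.0, 0.8); the use of PyTorch here is only for illustration.

```python
# Worked example: the score vector (1.6, 0.0, 0.8) for (positive, neutral, negative)
# becomes a probability vector after the SoftMax conversion.
import torch

scores = torch.tensor([1.6, 0.0, 0.8])
probs = torch.softmax(scores, dim=0)
print(probs)  # roughly [0.61, 0.12, 0.27] -- positive is most likely, not quite one-hot
```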
In step 1110, the system can access a machine learning model (e.g., target model 141 of
In step 1120, the system can identify a data set (e.g., training data 411 of
In step 1130, the system can generate a set of data augmentation operators using the identified data set. The system can generate the set of data augmentation operators by selecting and applying one or more operators to the identified data set to generate modified sequences and determining the operations needed to reverse the modified sequences to the identified data set. The identified operations can result in the determination of data augmentation operators. In some embodiments, the system can select an input sequence from the identified data set (e.g., training data 411) to apply the data augmentation operators. Further description of the process of generating data augmentation operators to generate new sequences of tokens is presented in connection with
In step 1140, the system can apply at least one data augmentation operator of the set of data augmentation operators (e.g., data augmentation operators 421) on a selected input sequence of tokens (e.g., training data 411) to generate at least one sequence of tokens (e.g., augmented data 412). The system can select input sequences of tokens based on a policy criterion. A user can supply the policy criteria as part of the request sent in step 1110 to the system. In some embodiments, all training data (e.g., training data 411) previously associated with the accessed machine learning model (e.g., target model 141) is selected as an input sequence of tokens. In some embodiments, selecting a sequence of tokens can be based on converting training data into a sequence of tokens by serializing the training data. Serialization of training data in different formats is presented in connection with
The system can select data augmentation operators (e.g., data augmentation operators 421) from data augmentation operators (DAO) repository 120 that includes the set of data augmentation operators generated in step 1130. The system can select data augmentation operators based on the requested task in step 1110. In some embodiments, data augmentation operators can be selected based on the machine learning model accessed in step 1110 or an available training set (e.g., training data 411). For example, if a training set for a machine learning model is limited, the system can select a higher number of data augmentation operators to generate a large number of sequences of tokens to train the machine learning model. The system can apply a plurality of identified data augmentation operators to each of the selected input sequences. In some embodiments, certain data augmentation operators can be skipped based on a policy either provided by a user or determined during previous runs. A policy for selecting important new sequences of tokens and, in turn, the data augmentation operators creating the new sequences of tokens is presented in connection with
In step 1150, the system can filter at least one sequence of tokens using a filtering model (e.g., filtering model 371 of
In step 1160, the system can determine a weight of at least one sequence of tokens in the filtered at least one sequence of tokens. The system can use weighting model 372 to determine the weights of each sequence of tokens. A detailed description of calculating the weight of sequences of tokens is presented in connection with
In step 1170, the system can apply the weight to at least one sequence of tokens of the filtered at least one sequence of tokens. The system can apply weights only to the sequences filtered by the filtering model in step 1150. The system can apply a weight by associating the calculated weight with a sequence of tokens. In some embodiments, the associations of the sequences and their associated weights can be stored in a database (e.g., text corpus repository 160).
In step 1180, the system can identify a subset of sequences of tokens (e.g., target batch 413 of
In step 1190, the system can provide selected subset of sequences of tokens as input to machine learning model (e.g., target model 141). In some embodiments, the system can store the selected subset of sequences in a training data repository (e.g., text corpus repository 160) before providing them to the training machine learning model. The system, upon completion of step 1190, completes (step 1199) executing method 1100 on computing device 500.
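The sketch below restates the overall flow of method 1100 in plain Python; every helper is a placeholder stub for the corresponding component (DAO generator 110, filtering model 371, weighting model 372, target model 141) and is not the disclosed implementation.

```python
# Illustrative sketch (assumption): the flow of method 1100 with placeholder stubs.

def generate_operators(data):            # step 1130
    return [lambda s: s + " !"]          # stand-in inverse data augmentation operator

def augment(sequences, operators):       # step 1140
    return [op(s) for s in sequences for op in operators]

def filter_sequences(sequences):         # step 1150
    return [s for s in sequences if len(s.split()) > 3]

def weigh(sequences):                    # steps 1160-1170
    return [(s, 1.0 / len(s.split())) for s in sequences]

def train(model_name, weighted_batch):   # steps 1180-1190
    print(f"training {model_name} on {len(weighted_batch)} weighted sequences")

training_data = ["[CLS] The room was modern [SEP] room"]
operators = generate_operators(training_data)
augmented = augment(training_data, operators)
target_batch = weigh(filter_sequences(augmented))
train("target model 141", target_batch)
```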
In step 1210, the system can access unlabeled data 231 from unlabeled data repository 130. The system can access unlabeled data 231 when it is configured to function in a semi-supervised learning mode. A user (e.g., user 190 of
In step 1220, the system can generate augmented unlabeled sequences of tokens (augmented unlabeled data 432 of
In step 1230, the system can determine soft labels of augmented unlabeled sequences of tokens (e.g., augmented unlabeled data 432). The data augmentation system can determine soft labels by guessing labels (e.g., soft labels 481) for unlabeled data 231. The system can apply the guessed soft labels to all the augmented unlabeled sequences of tokens generated from a sequence of tokens in unlabeled data 231. In some embodiments, the system can also generate soft labels for augmented unlabeled sequences of tokens by guessing the labels.
In step 1240, the system can associate soft labels to augmented unlabeled sequences of tokens. Association of labels can involve the system storing the association in a database (e.g., unlabeled data repository 130).
In step 1250, the system can provide augmented unlabeled sequences of tokens with associated soft labels as input to the machine learning model (e.g., target model 141). The system, upon completion of step 1250, completes (step 1299) executing method 1200 on computing device 500.
In step 1310, the DAO generator can access unlabeled data (e.g., unlabeled data 231 of
In step 1320, the DAO generator can check whether the accessed unlabeled data is formatted as a database table. If the answer to the question is yes, then the DAO generator can proceed to step 1330.
In step 1330, the DAO generator can transform each row in the database table into a sequence of tokens. Serializing tabular data to sequences of tokens is presented in connection with
If the answer to the question in step 1320 was no, i.e., unlabeled data 231 is not formatted as a database table, then the DAO generator can jump to step 1340. In step 1340, the DAO generator can prepare one or more sequences of tokens of the accessed unlabeled data. The DAO generator can prepare sequences of tokens by serializing the unlabeled data. Serializing unlabeled data can involve including markers separating tokens of the text in the unlabeled data. Tokens in a text can be words in a sentence identified by separating space characters.
In step 1350, the DAO generator can transform the prepared one or more sequences of tokens to generate at least one corrupted sequence (e.g., corrupted unlabeled data 232). The DAO generator can generate corrupted sequences using baseline data augmentation operators (e.g., data augmentation operators 711-712). Baseline data augmentation operators can include operators for tokens (e.g., token replacement, swap, insertion, deletion), spans including multiple tokens (e.g., span replacement, swap), or whole sequences (e.g., back translation). The DAO generator can generate the corrupted sequences by applying multiple baseline data augmentation operators in a sequence. The DAO generator can randomly select baseline data augmentation operators to apply to sequences of tokens to generate corrupted sequences. The generated corrupted sequences map to the original sequences of tokens. In some embodiments, a single sequence of tokens can be mapped to multiple corrupted sequences generated by applying a different set of baseline data augmentation operators or the same set of baseline data augmentation operators applied in different orders.
In step 1360, the DAO generator can provide as input one or more sequences of tokens and generated at least one corrupted sequence to sequence-to-sequence model 220 (as shown in
In step 1370, the DAO generator can execute sequence-to-sequence model 220 to determine one or more operations needed to reverse the at least one corrupted sequence to its respective sequence in the one or more sequences of tokens.
In step 1380, the DAO generator can generate a data augmentation operator (e.g., inverse data augmentation operators 713-715 of
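One way to realize steps 1360-1380 is to pair each corrupted sequence with its original sequence and let sequence-to-sequence model 220 learn the reversal; the helper below is a sketch under that assumption, and build_inverse_dao_pairs and corrupt_fn are hypothetical names.

```python
# Illustrative preparation of (corrupted -> original) training pairs for
# sequence-to-sequence model 220 (hypothetical helper).
def build_inverse_dao_pairs(original_sequences, corrupt_fn, copies_per_sequence=3):
    pairs = []
    for seq in original_sequences:
        for _ in range(copies_per_sequence):
            pairs.append((corrupt_fn(seq), seq))   # (model input, model target)
    return pairs
```

Once trained on such pairs, the model's learned reversal behavior could serve as an inverse data augmentation operator (e.g., one of operators 713-715) stored in DAO repository 120.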
In step 1410, the DAO generator can select a sequence of tokens (for example, a text sentence) from one or more sequences of tokens (e.g., unlabeled data 231 of
In step 1420, sequence-to-sequence model 220 can select a data augmentation operator from a set of data augmentation operators present in DAO repository 120. Data augmentation system 100 can pre-select the set of data augmentation operators available to the DAO generator.
In step 1430, sequence-to-sequence model 220 can apply the selected data augmentation operator to the selected sequence of tokens to generate a transformed sequence of tokens. The transformed sequence of tokens can include updated tokens or spans of tokens. The transformation can depend on the data augmentation operator applied to the sequence of tokens selected in step 1410. Sequence-to-sequence model 220, upon completion of step 1430, completes (step 1499) executing method 1400 on computing device 500.
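A compact sketch of steps 1410-1430, under the assumption that the DAO repository exposes its operators as a name-to-callable mapping:

```python
# Illustrative selection and application of one operator from a DAO
# repository (assumed to be a dict of callables).
import random

def apply_random_operator(sequences, dao_repository):
    sequence = random.choice(sequences)                            # step 1410
    name, operator = random.choice(list(dao_repository.items()))   # step 1420
    return name, sequence, operator(sequence)                      # step 1430
```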
In step 1510, the system can generate augmented data (e.g., augmented data 412) using at least one inverse data augmentation operator (e.g., inverse data augmentation operators 713-715 of
In step 1520, the system can pre-train a machine learning model (e.g., target model 141) using augmented data as training data. Training a machine learning model using augmented data is presented in connection with
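The pre-training of step 1520 could, for example, take the form of a standard supervised loop over the augmented data; the loop below is a sketch, with target_model, the batch format, and the cross-entropy loss assumed for illustration.

```python
# Illustrative pre-training loop over augmented training data (step 1520).
import torch
import torch.nn.functional as F

def pretrain(target_model, augmented_batches, epochs=1, lr=1e-4):
    optimizer = torch.optim.Adam(target_model.parameters(), lr=lr)
    target_model.train()
    for _ in range(epochs):
        for inputs, labels in augmented_batches:   # labels carried over from source sequences
            optimizer.zero_grad()
            loss = F.cross_entropy(target_model(inputs), labels)
            loss.backward()
            optimizer.step()
    return target_model
```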
In step 1530, the system can add task-specific layers (e.g., linear layers 1031-1033, SoftMax Layers 1041-1043) to the machine learning model (e.g., target model 141) to generate modified network 1000 (as shown in
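For illustration, the modified network of step 1530 could be expressed as an encoder wrapped with one linear-plus-softmax head per task; hidden_size and the task names below are assumptions.

```python
# Illustrative modified network: a shared encoder plus task-specific heads
# (cf. linear layers 1031-1033 and SoftMax layers 1041-1043).
import torch.nn as nn

class ModifiedNetwork(nn.Module):
    def __init__(self, encoder, hidden_size, task_num_classes):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict({
            task: nn.Sequential(nn.Linear(hidden_size, n), nn.Softmax(dim=-1))
            for task, n in task_num_classes.items()
        })

    def forward(self, inputs, task):
        hidden = self.encoder(inputs)    # assumed to return [batch, hidden_size]
        return self.heads[task](hidden)

# Hypothetical usage:
# net = ModifiedNetwork(encoder, hidden_size=768,
#                       task_num_classes={"sentiment": 3, "opinion": 2})
```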
In step 1540, the system can initialize the modified network. The system can initialize a network by copying it to a memory and executing it on a computing device (e.g., computing device 500).
In step 1550, the system can identify the class token and the other tokens in the input data entries. The system can use a machine learning model meant for natural language tasks to identify the tokens. For example, a sentiment analysis machine learning model can identify tokens by identifying a subject and an opinion phrase addressing the subject in an example sentence.
In step 1560, the system can mark the class token and the other tokens using different markers representing each token's beginning and end. The data augmentation system can mark various tokens in an input sequence using markers such as “[CLS]” and “[SEP].” In some embodiments, only the beginning of a token is marked, and that marker can also act as the end marker of the previous token.
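A minimal sketch of the marking in steps 1550-1560, assuming the class token and the remaining tokens have already been identified; the exact marker placement is illustrative.

```python
# Illustrative marking of the class token and other tokens with "[CLS]" and
# "[SEP]" markers (each "[SEP]" also ends the preceding segment).
def mark_tokens(class_token: str, other_tokens: list) -> str:
    return " ".join(["[CLS]", class_token, "[SEP]", " ".join(other_tokens), "[SEP]"])

# mark_tokens("great", ["the", "service", "was", "great"])
# -> '[CLS] great [SEP] the service was great [SEP]'
```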
In step 1570, the system can serialize input data entries using a sequence-to-sequence data augmentation model.
In step 1580, the system can provide the serialized input data entries to the modified network. The system may associate each entry with references to the classification layers (e.g., linear layers 1031-1032, SoftMax Layers 1041-1043) that can extract the relevant classification information from the serialized input data. The associated references can be based on a request from a user (e.g., user 190) or on the content of the input data entries. For example, if an input data entry has an intent-question structure, then the system may associate references to the opinion extraction and sentiment analysis classification layers with that entry.
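As a hedged sketch of the association in step 1580, a simple rule could map each serialized entry to head references based on an explicit user request or on a structural cue in the entry; the keyword rule below is purely illustrative.

```python
# Illustrative association of classification-head references with an entry.
def associate_heads(entry: str, requested=None):
    if requested:                         # an explicit user request takes priority
        return list(requested)
    heads = ["sentiment_analysis"]        # default head reference
    if entry.strip().endswith("?"):       # e.g., an intent-question structure
        heads.append("opinion_extraction")
    return heads
```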
In step 1590, the system can extract classification information using the modified network's task-specific layers. Further description of extraction of classification information is presented in connection with
Example embodiments are described above with reference to flowchart illustrations or block diagrams of methods, apparatus (systems), and computer program products. It will be understood that each block of the flowchart illustrations or block diagrams, and combinations of blocks in the flowchart illustrations or block diagrams, can be implemented by a computer program product or by instructions on a computer program product. These computer program instructions can be provided to a processor of a computer or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart or block diagram block or blocks.
These computer program instructions can also be stored in a computer readable medium that can direct one or more hardware processors of a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium form an article of manufacture including instructions that implement the function/act specified in the flowchart or block diagram block or blocks.
The computer program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other devices to produce a computer implemented process such that the instructions that execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart or block diagram block or blocks.
Any combination of one or more computer readable medium(s) can be utilized.
The computer readable medium can be a non-transitory computer readable storage medium. In the context of this document, a computer readable storage medium can be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium can be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, IR, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of, for example, the disclosed embodiments can be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++, or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet Service Provider). The computer program code can be compiled into object code that can be executed by a processor or can be partially compiled into intermediary object code or interpreted in an interpreter, just-in-time compiler, or a virtual machine environment intended for executing computer program code.
The flowchart and block diagrams in the figures illustrate examples of the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams can represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block can occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks can sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It is understood that the described embodiments are not mutually exclusive, and elements, components, materials, or steps described in connection with one example embodiment can be combined with, or eliminated from, other embodiments in suitable ways to accomplish desired design objectives.
In the foregoing specification, embodiments have been described with reference to numerous specific details that can vary from implementation to implementation. Certain adaptations and modifications of the described embodiments can be made. Other embodiments can be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only. It is also intended that the sequences of steps shown in the figures are for illustrative purposes only and are not intended to limit the methods to any particular sequence of steps. As such, those skilled in the art can appreciate that these steps can be performed in a different order while implementing the same method.