The present invention relates to information classification, and more particularly to augmenting text to balance a dataset to improve performance of a supervised machine learning model.
Text classification techniques automatically assign categories from a set of predefined categories to unstructured text (e.g., assignment of tags to customer queries). Text classification is a fundamental task included in natural language processing. Supervised text classification is text classification using a supervised machine learning model whereby the assignment of categories to text is based on past observations (i.e., labeled training data consisting of a set of training examples).
Performance of a supervised machine learning model can be improved by using a larger set of training examples. Data augmentation can provide a larger set of training examples by generating additional, synthetic training data using the already existing training data. Existing data augmentation approaches avoid the costly and time-consuming approach of acquiring and labeling additional actual observations. One known data augmentation technique is Easy Data Augmentation (EDA), which consists of four operations: synonym replacement, random insertion, random swap, and random deletion.
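For illustration only, the following sketch shows the four EDA operations in Python; the synonym source (a simple word-to-synonyms dictionary) and the augmentation rate p are assumptions made for the example rather than part of the EDA specification.

```python
import random

def eda_augment(tokens, synonyms, p=0.1):
    """Illustrative sketch of the four EDA operations; `synonyms` is assumed
    to map a word to a list of candidate synonyms."""
    tokens = list(tokens)
    n = max(1, int(p * len(tokens)))
    # Synonym replacement: replace up to n words that have synonyms.
    replaceable = [i for i, t in enumerate(tokens) if synonyms.get(t)]
    for i in random.sample(replaceable, min(n, len(replaceable))):
        tokens[i] = random.choice(synonyms[tokens[i]])
    # Random insertion: insert a synonym of a random word at a random position.
    for _ in range(n):
        sources = [t for t in tokens if synonyms.get(t)]
        if sources:
            tokens.insert(random.randrange(len(tokens) + 1),
                          random.choice(synonyms[random.choice(sources)]))
    # Random swap: swap the positions of two randomly chosen words.
    for _ in range(n):
        i, j = random.randrange(len(tokens)), random.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    # Random deletion: drop each word with probability p, keeping at least one.
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

print(eda_augment("the customer query was not resolved quickly".split(),
                  {"quickly": ["promptly"], "resolved": ["answered"]}))
```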
In one embodiment, the present invention provides a computer-implemented method. The method includes receiving, by one or more processors, an imbalanced dataset. The method further includes identifying, by the one or more processors, a small class that includes initial text records included in the imbalanced dataset. The method further includes generating, by the one or more processors, a balanced dataset from the imbalanced dataset by augmenting the initial text records by using weighted word scores indicating respective measures of importance of words in classes in the imbalanced dataset. The method further includes sending, by the one or more processors, the balanced dataset to a supervised machine learning model. The method further includes training, by the one or more processors, the supervised machine learning model on the balanced dataset. The method further includes, using the supervised machine learning model employing the augmented initial text records, performing, by the one or more processors, a text classification of a new dataset whose domain matches a domain of the imbalanced dataset.
A computer program product and a computer system corresponding to the above-summarized method are also described and claimed herein.
Overview
A supervised text classification dataset often includes highly imbalanced data, which has a few classes that contain a very large number of text records (i.e., large classes, which are also known as major classes) and a few classes that contain a very small number of text records (i.e., small classes, which are also known as minor classes). New records to be classified can confuse a supervised machine learning algorithm if the new records belong in small classes because the algorithm is biased towards placing records in large classes. Thus, the highly imbalanced data decreases performance of a text classification machine learning model (i.e., supervised text classification model) in terms of decreased accuracy, decreased F1 score, and similar effects on other metrics. Current approaches to balancing a dataset include (1) text record level methods of oversampling small class text records and under-sampling large class text records; (2) word level methods of (i) randomly replacing any word by its synonym or antonym, per a language dictionary, (ii) randomly replacing any word by an equivalent word generated by a static word embeddings model based on cosine similarity, (iii) randomly replacing any word by an equivalent word generated by a contextual language model based on surrounding words in a text record, (iv) randomly inserting any word at any position in a text record, and (v) randomly deleting any word at any position in a text record; and (3) character level methods of randomly inserting any character at any position in any word in a text record and randomly deleting any character at any position in any word in a text record. Using the current approaches for balancing the dataset, the text classification machine learning model remains deficient in terms of accuracy and F1 score. The aforementioned static word embeddings model and contextual language model do not consider class-specific word structure. Furthermore, contextual language models generate different embeddings for the same word in different contexts, whereas static word embeddings models do not consider the context of the word. In the aforementioned word level methods, synonyms may not reflect word context or class-specific word structure, and some words may not have synonyms.
As used herein, an F1 score is defined as the harmonic mean of a supervised machine learning model's precision and recall and is a measure of the model's accuracy on a dataset.
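Expressed as a formula, the harmonic-mean definition given above is:

```latex
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```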
Embodiments of the present invention address the aforementioned unique challenges of traditional text classification techniques by providing an approach of text augmentation of a small class to balance a dataset to improve the performance of a text classification machine learning model in terms of accuracy, F1 score, and generalizability. In one embodiment, text augmentation of a small class balances the dataset from a class point of view rather than at a record level. In one embodiment, text augmentation of a small class in text classification problems uses a combination of word importance statistics, natural language processing (i.e., lexical natural language features and syntactic natural language features), and natural language generation (i.e., word context). In one embodiment, the text augmentation approach includes selecting words to be replaced in the text record by using lexical features, syntactic features, and word importance statistics. In one embodiment, the text augmentation approach includes generating replacement word(s) by using lexical features, contextual relevance, and word importance statistics.
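As a rough illustration of the overall flow described above, the following Python sketch augments small-class records by replacing priority words with suitable replacement words until a target class size is reached. The helper functions `word_priority_list` and `suitable_words` are hypothetical stand-ins for the modules detailed in the sections below, and the loop bound and stub replacements are assumptions.

```python
import random

def augment_small_class(records, target_size, word_priority_list, suitable_words,
                        max_tries=1000):
    """records: initial small-class text records (strings).
    word_priority_list(record) -> ordered list of replaceable words.
    suitable_words(word, record) -> candidate replacement words, best first."""
    augmented = list(records)
    tries = 0
    while len(augmented) < target_size and tries < max_tries:
        tries += 1
        record = random.choice(records)
        new_record = record
        for word in word_priority_list(record):
            candidates = suitable_words(word, record)
            if candidates:
                # Replace one occurrence of the priority word with its best candidate.
                new_record = new_record.replace(word, candidates[0], 1)
        if new_record != record and new_record not in augmented:
            augmented.append(new_record)
    return augmented

# Hypothetical usage with stub helpers standing in for the modules described below:
print(augment_small_class(
    ["refund not received", "refund is late"], target_size=4,
    word_priority_list=lambda r: [w for w in r.split() if w not in {"is", "not"}],
    suitable_words=lambda w, r: {"refund": ["reimbursement"], "late": ["delayed"],
                                 "received": ["credited"]}.get(w, [])))
```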
System for Augmenting Text of a Small Class
Small class text augmentation system 104 receives an imbalanced dataset 116, which is a supervised text classification dataset. As used herein, an imbalanced dataset is a dataset that includes (1) a relatively small number of classes (i.e., large classes) that each include a substantially large number of text records and (2) a relatively small number of classes (i.e., small classes) that each include a substantially small number of text records. As an example, imbalanced dataset 116 may include a single large class that includes 65% of the total number of text records in imbalanced dataset 116 and a single small class that includes 1% of the total number of text records. As used herein, “relatively small number of classes” means a number of classes that is substantially less than the total number of classes that categorize text records in the dataset.
Main module 106 receives the imbalanced dataset 116 and reads the text records and their classes in the imbalanced dataset 116. Main module 106 identifies a small class among the classes in the imbalanced dataset 116 and sends the small class to text augmentation module 108, which augments the text records in the small class and sends back to main module 106 (i) the old text records (i.e., initial text records) that were initially in the small class in the imbalanced dataset 116 and (ii) new augmented text records. Main module 106 creates a balanced dataset 118 that includes the old text records and the new augmented text records. Although not shown in
Text augmentation module 108 sends text records of the small class to word priority module 110, which selects the words in the text records that can be replaced and determines in what order the selected words can be replaced. Word priority module 110 selects the words to be replaced by considering lexical features, syntactic features, and word importance statistics (i.e., word frequency statistics). Word priority module 110 creates a word priority list that includes the selected words in a descending order based on respective word scores calculated by class word score module 114. The word scores indicate a measure of importance of the corresponding words from a class point of view.
Text augmentation module 108 sends each word in the aforementioned word priority list and the corresponding text record to suitable word generation module 112, which generates a suitable word list for a given word. The suitable word list includes word(s) that text augmentation module 108 uses to replace the given word, thereby creating a new augmented text record. Suitable word generation module 112 selects the word(s) in the suitable word list based on suitable word scores that are based on cosine similarity scores with class-based word importance statistics, contextual probability scores with class-based word importance statistics, and synonym-based class word scores. Class word score module 114 calculates class word scores that are used in conjunction with the aforementioned cosine similarity scores, contextual probability scores, and synonym-based class word scores to generate suitable word scores.
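The class word scores referenced above are word frequency statistics computed per class; the exact statistic is not reproduced here, so the sketch below uses a TF-IDF-style weighting over classes as an illustrative assumption (words frequent in a class but rare in other classes score highest).

```python
import math
from collections import Counter

def class_word_scores(records_by_class):
    """records_by_class maps class label -> list of tokenized text records.
    Returns class label -> {word: score}; words frequent in a class but rare
    across other classes receive the highest scores."""
    class_counts = {c: Counter(w for rec in recs for w in rec)
                    for c, recs in records_by_class.items()}
    n_classes = len(class_counts)
    scores = {}
    for c, counts in class_counts.items():
        total = sum(counts.values())
        scores[c] = {}
        for word, freq in counts.items():
            classes_with_word = sum(1 for other in class_counts.values() if word in other)
            tf = freq / total
            icf = math.log((1 + n_classes) / (1 + classes_with_word)) + 1
            scores[c][word] = tf * icf
    return scores

print(class_word_scores({
    "refund":   [["refund", "not", "received"], ["late", "refund"]],
    "delivery": [["late", "delivery", "update"]],
}))
```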
In one or more embodiments, system 100 includes the components of system 200 in
The functionality of the components shown in
Word priority module 110 identifies words in the text record that can be replaced and in what order the identified words can be replaced. Word priority module 110 receives a text record from text augmentation module 108 (see
For a given word in the received text record, word priority module 110 weights the class word score, the POS score, the stop word score, and the dependency score with different respective weights, and then calculates a word priority score by adding the aforementioned weighted scores. Word priority module 110 determines whether the word priority score exceeds a defined threshold score. If the word priority score exceeds the threshold score, word priority module 110 adds the word corresponding to the word priority score to a word priority list, which is a list of words that can be replaced to create an augmented text record. After similarly processing the other words in the received text record, word priority module 110 sends the resulting word priority list to text augmentation module 108 (see
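A minimal sketch of this weighted-sum computation follows; the particular weights and threshold are illustrative assumptions, and the four input scores are assumed to have already been converted to a common scale.

```python
def word_priority_list(word_scores, weights=(0.4, 0.3, 0.2, 0.1), threshold=0.5):
    """word_scores maps word -> (class_word_score, pos_score, stop_word_score,
    dependency_score), each already converted to a common scale."""
    priorities = {}
    for word, scores in word_scores.items():
        # Weighted sum of the four component scores gives the word priority score.
        priority = sum(w * s for w, s in zip(weights, scores))
        if priority > threshold:
            priorities[word] = priority
    # Words that can be replaced, in descending order of priority.
    return sorted(priorities, key=priorities.get, reverse=True)

print(word_priority_list({
    "refund":  (0.9, 1.0, 1.0, 0.8),
    "the":     (0.1, 0.0, 0.0, 0.2),
    "delayed": (0.7, 1.0, 1.0, 0.6),
}))
```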
POS module 202 uses a POS tagger model (not shown) to generate respective POS scores of the words in the text record. The POS tagger model determines respective lexical categories of the words in the text record.
Stop word module 204 uses a stop word list (not shown) to generate respective stop word scores of the words in the text record. The stop word list includes stop words, which are commonly used words in a natural language (e.g., “a,” “an,” “the,” etc.) that add little or no value to text classification.
Dependency module 206 uses a dependency parser model (not shown) to generate respective dependency scores for the words in the text record. The dependency parser model determines a syntactic dependency relationship between the words of the text record by analyzing the grammatical structure of the sentences that include the words.
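The three feature scores described above can be derived from an off-the-shelf NLP pipeline. The sketch below uses spaCy (assuming the en_core_web_sm model is installed); the specific score values assigned per tag are illustrative assumptions rather than values prescribed by the modules.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

REPLACEABLE_POS = {"NOUN", "VERB", "ADJ", "ADV"}   # lexical categories worth replacing
CORE_DEPS = {"nsubj", "dobj", "ROOT"}              # syntactically central relations

def feature_scores(text_record):
    """Return (word, pos_score, stop_word_score, dependency_score) per token."""
    rows = []
    for token in nlp(text_record):
        pos_score = 1.0 if token.pos_ in REPLACEABLE_POS else 0.0
        stop_word_score = 0.0 if token.is_stop else 1.0   # stop words add little value
        dependency_score = 0.5 if token.dep_ in CORE_DEPS else 1.0
        rows.append((token.text, pos_score, stop_word_score, dependency_score))
    return rows

print(feature_scores("The delivery arrived very late"))
```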
The functionality of the components shown in
Suitable word generation module 112 generates a suitable word list that includes word(s) that are suitable replacements for a given word by considering cosine similar word embeddings with class-based word importance statistics, synonyms with class-based word importance statistics, and contextual words with class-based word importance statistics. Suitable word generation module 112 receives from text augmentation module 108 the word priority list and the corresponding text record. Suitable word generation module 112 sends the word priority list to synonym based sub-module 302, which returns suitable word scores corresponding to the words in the word priority list. Suitable word generation module 112 sends the word priority list to static embeddings similarity module 310, and in response receives similar words (i.e., words similar to the words in the word priority list) and cosine similarity scores of the similar words. Suitable word generation module 112 sends the word priority list and the corresponding text record to contextual language module 312, and in response receives likely words (i.e., words that are contextually relevant to words in the word priority list) and probability scores of the likely words.
Static embeddings module 308 generates top k similar tokens for a given token in the vocabulary of a transfer learning based static embeddings model. Static embeddings module 308 sends a static embeddings similarity list to static embeddings similarity module 310. The static embeddings similarity list includes the top k similar tokens and their cosine similarity scores.
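For illustration, the sketch below computes the top k similar tokens for a given token by cosine similarity over a static embeddings vocabulary; the embedding source (a dictionary of token vectors, e.g., from GloVe or word2vec) is an assumption.

```python
import numpy as np

def top_k_similar(token, embeddings, k=5):
    """embeddings maps token -> 1-D numpy vector; returns the k tokens whose
    vectors have the highest cosine similarity with the given token's vector."""
    query = embeddings[token]
    scores = {}
    for other, vec in embeddings.items():
        if other == token:
            continue
        cosine = float(np.dot(query, vec) /
                       (np.linalg.norm(query) * np.linalg.norm(vec)))
        scores[other] = cosine
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Usage (assuming `vectors` holds pre-trained static embeddings):
# print(top_k_similar("refund", vectors, k=5))
```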
Static embeddings similarity module 310 generates top k similar words for a given word in a word priority list using the static embeddings model. Static embeddings similarity module 310 sends the aforementioned similar words and cosine similarity scores of the similar words to static embeddings similarity based sub-module 304. Class word score module 114 sends class word scores to static embeddings similarity based sub-module 304. Using the cosine similarity scores and the class word scores, static embeddings similarity based sub-module 304 generates suitable word scores.
Contextual language module 312 generates top j likely words for a given word by using a contextual language model. Contextual language module 312 sends likely words and contextual probability scores of the likely words to contextual language based sub-module 306. Class word score module 114 sends class word scores to contextual language based sub-module 306. Using the contextual probability scores and the class word scores, contextual language based sub-module 306 generates suitable word scores.
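One way to obtain the top j contextually likely words is a masked language model; the sketch below assumes the Hugging Face transformers fill-mask pipeline with a BERT-style model, which is one possible realization rather than the specific contextual language model of the embodiments.

```python
from transformers import pipeline

# Assumes the transformers library and the bert-base-uncased weights are available.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def top_j_likely(text_record, word, j=5):
    """Mask one occurrence of `word` in the text record and return the j most
    likely fillers with their contextual probability scores."""
    masked = text_record.replace(word, fill_mask.tokenizer.mask_token, 1)
    return [(r["token_str"], r["score"]) for r in fill_mask(masked)[:j]]

print(top_j_likely("the delivery arrived very late", "late", j=3))
```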
Class word score module 114 sends class word scores to synonym based sub-module 302. Using the class word scores, synonym based sub-module 302 generates suitable word scores.
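A minimal sketch of the score combination follows; multiplying each candidate's model-derived score (cosine similarity, contextual probability, or a synonym indicator) by its class word score is shown as one plausible combination and is an assumption rather than the exact formula of the embodiments.

```python
def suitable_word_scores(candidates, class_word_scores):
    """candidates maps replacement word -> model-derived score (cosine similarity,
    contextual probability, or a synonym indicator); class_word_scores maps
    word -> its importance in the class being augmented."""
    return {word: model_score * class_word_scores.get(word, 0.0)
            for word, model_score in candidates.items()}

# Candidates for "late" with cosine similarity scores, weighted by class importance:
print(suitable_word_scores({"delayed": 0.81, "overdue": 0.74, "tardy": 0.69},
                           {"delayed": 0.9, "overdue": 0.4, "tardy": 0.1}))
```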
The functionality of the components shown in
Process for Augmenting Text of a Small Class
In step 404, small class text augmentation system 104 (see
In step 406, main module 106 (see
In step 408, text augmentation module 108 (see
Text augmentation module 108 (see
Step 408 also includes text augmentation module 108 (see
Additional details of step 408 are described below in the discussions of
In step 410, small class text augmentation system 104 (see
In step 412, small class text augmentation system 104 (see
In step 416, using the supervised machine learning model, which is employing the augmented initial text records, small class text augmentation system 104 (see
The process of
In step 504, text augmentation module 108 (see
In step 506, text augmentation module 108 (see
In step 508, text augmentation module 108 (see
In step 510, text augmentation module 108 (see
In step 512, text augmentation module 108 (see
In step 514 and for given word(s) that are in an initial text record and are identified as being in a word priority list in step 512, text augmentation module 108 (see
In step 516, text augmentation module 108 (see
In step 518, text augmentation module 108 (see
In one embodiment, step 408 (see
The process of
(1) determines which words in the text record are to be replaced;
(2) selects the right word and the right position in a sentence in a given text record by considering a weighted average of lexical features (i.e., parts of speech), syntactic features (i.e., dependency tag), word frequency statistics (i.e., class word score), and a stop word score; and
(3) generates a word priority list for the given text record.
The process of
In step 604, word priority module 110 (see
In step 606, word priority module 110 (see
In step 608, word priority module 110 (see
In step 610, word priority module 110 (see
In response to receiving the text record sent by word priority module 110 (see
In response to receiving the text record sent by word priority module 110 (see
In response to receiving the text record sent by word priority module 110 (see
In one embodiment, stop word module 204 (see
In step 612, for each word in the text record received in step 608, word priority module 110 (see
After step 612, the process of
Prior to step 614 and for each word in the text record received in step 608 (see
In step 616, word priority module 110 (see
In step 618 and for each word in the text record received in step 608 (see
In step 620, word priority module 110 (see
Returning to step 620, if word priority module 110 (see
In step 624, word priority module 110 (see
Returning to step 624, if word priority module 110 (see
In step 626, word priority module 110 (see
In step 628, word priority module 110 (see
The process of
In one embodiment, the process of
The process of
In step 704 and for each class in imbalanced dataset 116 (see
In step 706, class word score module 114 (see
In step 708, class word score module 114 (see
In step 710 and for each unique word in each class, class word score module 114 (see
In step 712 and for each class, class word score module 114 (see
After step 712, the process of
In step 714, class word score module 114 (see
In step 716 and for each unique word across all the classes, class word score module 114 (see
In step 718, class word score module 114 (see
In step 720, class word score module 114 (see
After step 720, class word score module 114 (see
In one embodiment, the process of
The process of
In step 804, suitable word generation module 112 (see
In step 806, suitable word generation module 112 (see
In step 808, suitable word generation module 112 (see
In step 810, suitable word generation module 112 (see
In step 812 and for each word in the word priority list received in step 810, suitable word generation module 112 (see
In one embodiment, using the static embeddings similarity module 310 (see
In one embodiment, static embeddings module 308 generates top k similar tokens for a given token in the vocabulary of a static embeddings model by performing the following actions:
After step 812, the process of
In step 814 and for each word in the word priority list received in step 810 (see
In one embodiment, contextual language module 312 (see
In step 816 and for each word in the word priority list received in step 810 (see
In step 818 and for each word in the word priority list received in step 810 (see
In step 820 and for each word in the word priority list received in step 810 (see
In step 822, suitable word generation module 112 (see
Following step 822, the process of
In one embodiment, the process of
In one embodiment, the process of
In one embodiment, the conversions of scores from one scale to another scale (i.e., from an old scale to a new scale), as described in step 616 (see
Scale Converter Formula:
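The formula itself is not reproduced in the text above; a standard linear (min–max) rescaling from an old scale to a new scale, consistent with the surrounding description, is:

```latex
\text{new\_score} = \text{new\_min} +
  \frac{(\text{old\_score} - \text{old\_min}) \cdot (\text{new\_max} - \text{new\_min})}
       {\text{old\_max} - \text{old\_min}}
```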
Text augmentation module 108 (see
Text augmentation module 108 (see
In step 1008, class word score module 114 (see
In step 618 (see
In step 620 (see
In step 812 (see
In step 812 (see
In step 1208, suitable word generation module 112 (see
Computer System
Memory 1304 includes a known computer readable storage medium, which is described below. In one embodiment, cache memory elements of memory 1304 provide temporary storage of at least some program code (e.g., program code 1314) in order to reduce the number of times code must be retrieved from bulk storage while instructions of the program code are executed. Moreover, similar to CPU 1302, memory 1304 may reside at a single physical location, including one or more types of data storage, or be distributed across a plurality of physical systems or a plurality of computer readable storage media in various forms. Further, memory 1304 can include data distributed across, for example, a local area network (LAN) or a wide area network (WAN).
I/O interface 1306 includes any system for exchanging information to or from an external source. I/O devices 1310 include any known type of external device, including a display, keyboard, etc. Bus 1308 provides a communication link between each of the components in computer 102, and may include any type of transmission link, including electrical, optical, wireless, etc.
I/O interface 1306 also allows computer 102 to store information (e.g., data or program instructions such as program code 1314) on and retrieve the information from computer data storage unit 1312 or another computer data storage unit (not shown). Computer data storage unit 1312 includes one or more known computer readable storage media, where a computer readable storage medium is described below. In one embodiment, computer data storage unit 1312 is a non-volatile data storage device, such as, for example, a solid-state drive (SSD), a network-attached storage (NAS) array, a storage area network (SAN) array, a magnetic disk drive (i.e., hard disk drive), or an optical disc drive (e.g., a CD-ROM drive which receives a CD-ROM disk or a DVD drive which receives a DVD disc).
Memory 1304 and/or storage unit 1312 may store computer program code 1314 that includes instructions that are executed by CPU 1302 via memory 1304 to augment text of a small class. Although
Further, memory 1304 may include an operating system (not shown) and may include other systems not shown in
As will be appreciated by one skilled in the art, in a first embodiment, the present invention may be a method; in a second embodiment, the present invention may be a system; and in a third embodiment, the present invention may be a computer program product.
Any of the components of an embodiment of the present invention can be deployed, managed, serviced, etc. by a service provider that offers to deploy or integrate computing infrastructure with respect to augmenting text of a small class. Thus, an embodiment of the present invention discloses a process for supporting computer infrastructure, where the process includes providing at least one support service for at least one of integrating, hosting, maintaining and deploying computer-readable code (e.g., program code 1314) in a computer system (e.g., computer 102) including one or more processors (e.g., CPU 1302), wherein the processor(s) carry out instructions contained in the code causing the computer system to augment text of a small class. Another embodiment discloses a process for supporting computer infrastructure, where the process includes integrating computer-readable program code into a computer system including a processor. The step of integrating includes storing the program code in a computer-readable storage device of the computer system through use of the processor. The program code, upon being executed by the processor, implements a method of augmenting text of a small class.
While it is understood that program code 1314 for augmenting text of a small class may be deployed by manually loading directly in client, server and proxy computers (not shown) via loading a computer-readable storage medium (e.g., computer data storage unit 1312), program code 1314 may also be automatically or semi-automatically deployed into computer 102 by sending program code 1314 to a central server or a group of central servers. Program code 1314 is then downloaded into client computers (e.g., computer 102) that will execute program code 1314. Alternatively, program code 1314 is sent directly to the client computer via e-mail. Program code 1314 is then either detached to a directory on the client computer or loaded into a directory on the client computer by a button on the e-mail that executes a program that detaches program code 1314 into a directory. Another alternative is to send program code 1314 directly to a directory on the client computer hard drive. In a case in which there are proxy servers, the process selects the proxy server code, determines on which computers to place the proxy servers' code, transmits the proxy server code, and then installs the proxy server code on the proxy computer. Program code 1314 is transmitted to the proxy server and then it is stored on the proxy server.
Another embodiment of the invention provides a method that performs the process steps on a subscription, advertising and/or fee basis. That is, a service provider can offer to create, maintain, support, etc. a process of augmenting text of a small class. In this case, the service provider can create, maintain, support, etc. a computer infrastructure that performs the process steps for one or more customers. In return, the service provider can receive payment from the customer(s) under a subscription and/or fee agreement, and/or the service provider can receive payment from the sale of advertising content to one or more third parties.
The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) (i.e., memory 1304 and computer data storage unit 1312) having computer readable program instructions 1314 thereon for causing a processor (e.g., CPU 1302) to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions (e.g., program code 1314) for use by an instruction execution device (e.g., computer 102). The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions (e.g., program code 1314) described herein can be downloaded to respective computing/processing devices (e.g., computer 102) from a computer readable storage medium or to an external computer or external storage device (e.g., computer data storage unit 1312) via a network (not shown), for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card (not shown) or network interface (not shown) in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions (e.g., program code 1314) for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations (e.g.,
These computer readable program instructions may be provided to a processor (e.g., CPU 1302) of a general purpose computer, special purpose computer, or other programmable data processing apparatus (e.g., computer 102) to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium (e.g., computer data storage unit 1312) that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions (e.g., program code 1314) may also be loaded onto a computer (e.g. computer 102), other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be accomplished as one step, executed concurrently, substantially concurrently, in a partially or wholly temporally overlapping manner, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
While embodiments of the present invention have been described herein for purposes of illustration, many modifications and changes will become apparent to those skilled in the art. Accordingly, the appended claims are intended to encompass all such modifications and changes as fall within the true spirit and scope of this invention.
Number | Name | Date | Kind |
---|---|---|---|
6279017 | Walker | Aug 2001 | B1 |
7036075 | Walker | Apr 2006 | B2 |
7765471 | Walker | Jul 2010 | B2 |
7861163 | Walker | Dec 2010 | B2 |
8429098 | Pawar | Apr 2013 | B1 |
8504562 | Ikeda | Aug 2013 | B1 |
20020091713 | Walker | Jul 2002 | A1 |
20060129922 | Walker | Jun 2006 | A1 |
20080222518 | Walker | Sep 2008 | A1 |
20100306144 | Scholz | Dec 2010 | A1 |
20130041885 | Bennett | Feb 2013 | A1 |
20130145241 | Salama | Jun 2013 | A1 |
20130311181 | Bachtiger | Nov 2013 | A1 |
20170075935 | Lagos | Mar 2017 | A1 |
20170270546 | Kulkarni | Sep 2017 | A1 |
20180373691 | Alba | Dec 2018 | A1 |
20190050624 | Chai | Feb 2019 | A1 |
20190121842 | Catalano | Apr 2019 | A1 |
Number | Date | Country |
---|---|---|
108897769 | Nov 2018 | CN |
302985 | Feb 2012 | CZ |
Entry |
---|
Munkhdalai et al., “Self-training in significance space of support vectors for imbalanced biomedical event data”, BMC Bioinformatics, Apr. 23, 2015, pp. 1-8 (Year: 2015). |
Hakim et al., “Oversampling Imbalance Data: Case Study on Functional and Non Functional Requirement,” 2018 Electrical Power, Electronics, Communications, Controls and Informatics Seminar (EECCIS), 2018, pp. 315-319 (Year: 2018). |
Abdollahi et al., “A Dictionary-based Oversampling Approach to Clinical Document Classification on Small and Imbalanced Dataset,” 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), 2020, pp. 357-364 (Year: 2020). |
Nikhila et al., “Text Imbalance Handling and Classification for Cross-platform Cyber-crime Detection using Deep Learning,” 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2020, pp. 1-7 (Year: 2020). |
Kobayashi, Sosuke; Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations; Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 2 (Short Papers); Jun. 2018; pp. 452-457. |
Kothiya, Yogesh; How I handled imbalanced text data; https://towardsdatascience.com/how-i-handled-imbalanced-text-data-ba9b757ab1d8; May 15, 2019; 6 pages. |
Koto, Fajri; SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: An Enhancement Strategy to Handle Imbalance in Data Level; Conference: The 6th International Conference on Advanced Computer Science and Information Systems (ICACSIS); Oct. 2014; pp. 193-197. |
Liu, Ruibo et al.; Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation; arXiv:2012.02952v1; Dec. 5, 2020; 11 pages. |
Mnasri, Maali; Text augmentation for Machine Learning tasks: How to grow your text dataset for classification? https://medium.com/opla/text-augmentation-for-machine-learning-tasks-how-to-grow-your-text-dataset-for-classification-38a9a207f88d; Jan. 18, 2019; 9 pages. |
Paduraiu, Cristian et al.; Dealing with Data Imbalance in Text Classification; Procedia Computer Science 159; 2019; pp. 736-745. |
Wei, Jason et al; EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks; Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing; Nov. 2019; pp. 6382-6388. |