This invention relates to the field of natural language processing (NLP) and information extraction (IE).
Information Extraction (IE) aims at structuring and organizing the information contained in unstructured text into acquired knowledge, thereby enabling efficient and effective use of that information in downstream applications (e.g., question answering). The output of IE often takes the form of (subject, relation, object) triples, which serve as the atomic units of knowledge in a knowledge graph. Partly because of this knowledge-base formulation, IE is often considered to comprise two tasks: entity detection and relation extraction. Entity detection aims at recognizing entity mentions in text and determining their types. For example, given the sentence “Joe Biden is the president of the United States.”, entity detection is expected to detect two entity mentions: “Joe Biden” with type ‘Person’ and “United States” with type ‘Location’ (or ‘Country’). Relation extraction aims at detecting relations between two entity mentions. This invention focuses on relation extraction, with entity mentions already given in the text.
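For illustration only, the following minimal Python sketch shows one possible way a single labeled relation-extraction example could be represented, using the sentence above; the field names and the use of character-offset entity spans are illustrative assumptions and are not prescribed by this disclosure.

```python
# Illustrative representation of one labeled relation-extraction example
# (field names and offsets are hypothetical, not prescribed by this disclosure).
from dataclasses import dataclass
from typing import Tuple

@dataclass
class RelationExample:
    text: str                    # the sentence
    subj_span: Tuple[int, int]   # character offsets of the subject mention
    obj_span: Tuple[int, int]    # character offsets of the object mention
    relation: str                # relation label

example = RelationExample(
    text="Joe Biden is the president of the United States.",
    subj_span=(0, 9),            # "Joe Biden"
    obj_span=(34, 47),           # "United States"
    relation="president_of",
)
```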
Relation extraction is widely studied in the field of Natural Language Processing (NLP). As with other NLP tasks, the state of the art for relation extraction employs deep learning models, e.g., Long Short-Term Memory (LSTM) networks and Transformers, mainly because they have been shown to achieve higher performance on several benchmark datasets than prior models, such as rule-based or traditional feature-rich machine learning models. The benchmark datasets (e.g., sets of sentences or documents annotated with predefined relation labels) are created for particular domains. Because these existing domains and labels do not necessarily match the domains or labels of interest, deep learning models trained on such datasets are not directly applicable to a given domain-specific task. Therefore, it is necessary to create a dedicated dataset with sentences and labels of interest for model training.
The main disadvantage of such supervised deep learning models is that they rely on a large amount of manually labeled data for supervision. A small amount of training data is not sufficient, because the models are likely to overfit the small dataset and fail to generalize. However, creating a large amount of human-curated training data is often difficult because human annotation of sentences by domain experts is expensive in practice. Therefore, generalizing deep learning models without relying on a large amount of manually annotated data is an important problem for NLP tasks, including relation extraction.
Data augmentation is often used to increase the number of samples in a training data set. The basic idea is that, given a small set of training examples, data augmentation generates new synthesized examples from the original training data. Data augmentation is common and widely used for images in computer vision, because simple operations (flipping, rotating, cropping, etc.) have been shown to be effective. However, such intuitive operations are not readily available for text due to its discrete nature. Unlike a pixel in an image, changing a single word can significantly change the meaning of a phrase or a sentence.
According to one embodiment, a computer-implemented method for training a relation extraction model using data augmentation of training data includes receiving an original labeled sentence as input, the labeled sentence including entities and at least one relation. A dependency parsing process is used on the labeled sentence to generate first augmented training data. A constituency parsing process is used on the labeled sentence to generate second augmented training data. A scoring function is used to order a training set based on difficulty. The training set includes the original labeled sentence, the first augmented training data, and the second augmented training data. A curriculum learning process is then used to train the relation extraction model by feeding the scored training set to the relation extraction model. The trained relation extraction model is then stored in a memory.
According to another embodiment, a computer-implemented method for training a relation extraction model using data augmentation of training data includes receiving an original labeled sentence as input, the labeled sentence including entities. A lexically constrained paraphrasing process is then used on the labeled sentence to generate first augmented training data. A scoring function is used to order a training set based on difficulty. The training set includes the original labeled sentence and the first augmented training data. A curriculum learning process is then used to train the relation extraction model by feeding the scored training set to the relation extraction model. The trained relation extraction model is then stored in a memory.
According to yet another embodiment, a computer-implemented method for training a relation extraction model using data augmentation of training data includes receiving an original labeled sentence as input, the labeled sentence including entities. At least one of a dependency parsing process, a constituency parsing process, and a lexically constrained paraphrasing process is then used on the labeled sentence to generate augmented training data. A first scoring function is selected from a plurality of scoring functions to order a training set based on difficulty. The relation extraction model is then trained using a curriculum learning process by feeding the scored training set to the relation extraction model in an order determined by the selected scoring function to generate an intermediate model. A respective performance metric is determined for each scoring function in the plurality by evaluating a performance of the intermediate model using a validation data set ordered respectively by each of the scoring functions. Another scoring function is then selected from the plurality of scoring functions to order the training set based on difficulty, the selection being based on the determined performance metrics. The relation extraction model is then trained again using the training set ordered by the newly selected scoring function.
For the purposes of promoting an understanding of the principles of the disclosure, reference will now be made to the embodiments illustrated in the drawings and described in the following written specification. It is understood that no limitation to the scope of the disclosure is thereby intended. It is further understood that the present disclosure includes any alterations and modifications to the illustrated embodiments and includes further applications of the principles of the disclosure as would normally occur to a person of ordinary skill in the art to which this disclosure pertains.
As mentioned in the previous section, data augmentation for textual tasks is not straightforward. Augmentation of relation extraction data is especially challenging and still understudied. One particular challenge is that the augmentation methods need to preserve the entity mentions in the original sentence as well as the textual statement of the relationship between them. This disclosure is directed to data augmentation strategies for the task of extracting sentence-level relations between provided entity mentions. Note that the relationship manifested in a sentence is specific to the (typically two) entity mentions involved. The data augmentation methods described herein are intended to satisfy two constraints: (1) preserving the entity mentions, and (2) preserving the relationship between them. In this disclosure, two simple yet effective data augmentation strategies are proposed that satisfy the two aforementioned constraints. The first strategy relies on the simple intuition that, for a long sentence, the relationship exhibited between two entities can often be captured by a smaller span of the original sentence. The second strategy is based on paraphrasing the original text, in which the two constraints are satisfied by additionally applying lexically constrained decoding to neural paraphrasing systems (explained in more detail below).
An embodiment of an augmented data generator 200 for generating augmented data samples is depicted in
The dependency parsing strategy makes use of graph-based dependency parsing of sentences. Given a sentence as input, this method first constructs a weighted, fully-connected graph between all words in the sentence and then constructs a tree by extracting a maximum spanning tree from this graph. For the purposes of this disclosure, it is assumed that the most important information concerning the relation between the two entity mentions is contained on the Shortest Dependency Path (SDP) connecting the two entities in this tree. Therefore, all the words located on the SDP connecting the two entities in the dependency tree of the sentence are selected.
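For illustration only, the following minimal Python sketch outlines the SDP selection described above, assuming the spaCy and networkx libraries (and an installed en_core_web_sm model); the entity head-word matching is deliberately simplified and the function name is illustrative.

```python
# Sketch of the dependency-parsing augmentation: keep only the words on the
# shortest dependency path (SDP) between the two entity mentions.
# Assumes spaCy (with the en_core_web_sm model installed) and networkx.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def sdp_augment(sentence, head1, head2):
    """Return the words on the SDP between two entity head words."""
    doc = nlp(sentence)
    # Treat the dependency tree as an undirected graph over token indices.
    graph = nx.Graph()
    for token in doc:
        graph.add_edge(token.i, token.head.i)
    # Simplest possible lookup of the entity head words by surface form.
    idx = {token.text: token.i for token in doc}
    path = nx.shortest_path(graph, source=idx[head1], target=idx[head2])
    return " ".join(doc[i].text for i in sorted(path))

# Hypothetical usage on the example sentence from the background section:
print(sdp_augment("Joe Biden is the president of the United States.",
                  "Biden", "States"))
```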
As an illustration, consider the example from
Referring to
For the purposes of this disclosure, it is assumed that, in a constituency parse tree, the most relevant information for the concerned relation is contained in the sub-tree rooted at the lowest common ancestor (LCA) of the two entity mentions in the tree. In the example, for the two entities ‘Robert Bosch GmbH’ and ‘Stuttgart’, this process essentially selects the sub-tree containing the words ‘Robert Bosch GmbH, founded in Stuttgart in 1886,’ (highlighted with bold lines in
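For illustration only, the following Python sketch demonstrates the LCA sub-tree selection, assuming the nltk library; the bracketed constituency parse (including the completion of the example sentence) is hand-written for illustration and would in practice be produced by a constituency parser.

```python
# Sketch of the constituency-parsing augmentation: keep the sub-tree rooted at
# the lowest common ancestor (LCA) of the two entity mentions.
# Assumes nltk; the bracketed parse below is hand-written for illustration.
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (NP (NNP Robert) (NNP Bosch) (NNP GmbH)) (, ,) "
    "(VP (VBN founded) (PP (IN in) (NP (NNP Stuttgart))) "
    "(PP (IN in) (NP (CD 1886))) (, ,))) "
    "(VP (VBZ is) (NP (DT a) (JJ German) (NN company))) (. .))"
)

def lca_subtree(tree, word1, word2):
    """Return the lowest sub-tree whose leaves contain both entity words."""
    best = tree
    for subtree in tree.subtrees():
        leaves = subtree.leaves()
        if word1 in leaves and word2 in leaves:
            best = subtree  # subtrees() is top-down, so the last hit is lowest
    return best

aug = lca_subtree(parse, "GmbH", "Stuttgart")
print(" ".join(aug.leaves()))
# -> Robert Bosch GmbH , founded in Stuttgart in 1886 ,
```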
Referring again to
To enable effective use of paraphrasing as a data augmentation strategy for relation extraction, lexically constrained decoding with prevailing neural paraphrasing models is used. This approach enables retention of the original entity mentions while paraphrasing other parts of the same sentence. Common paraphrase systems use a neural sequence-to-sequence framework for generating the additional paraphrased sentences. In these frameworks, the original sentence is fed as a sequence to the input, and the output sequence is decoded one word at a time using a procedure called Beam Search, which maintains a best-k list of the most likely output sequences. Lexically constrained decoding modifies this procedure so that the best-k list contains only those sequences that satisfy the required lexical constraints. In the framework described herein, the original entity mentions are used as the lexical constraints.
In one embodiment, the method used for paraphrasing the original sentence is back-translation, although the proposed framework is general and can be applied to any paraphraser that makes use of Beam Search for decoding. A back-translation model uses two textual translation models, called a forward translation model and a backward translation model, for obtaining the paraphrase of a sentence. In essence, such a model first translates a sentence from an original language (e.g., English) into a foreign language (e.g., German) using the forward model and then translates the result back to the original language using the backward model. Examples illustrating lexically constrained paraphrasing are shown as A5 and A6 in
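For illustration only, the following Python sketch combines back-translation with lexically constrained beam search, assuming the Hugging Face transformers library, its force_words_ids constrained-generation option, and the publicly available Helsinki-NLP Marian English-German translation models; the function name and beam settings are illustrative choices, not part of the claimed method.

```python
# Sketch of lexically constrained back-translation for paraphrase-based
# augmentation: the backward (German-to-English) pass is constrained so that
# the original entity mentions appear verbatim in the paraphrase.
from transformers import MarianMTModel, MarianTokenizer

fwd_name, bwd_name = "Helsinki-NLP/opus-mt-en-de", "Helsinki-NLP/opus-mt-de-en"
fwd_tok = MarianTokenizer.from_pretrained(fwd_name)
fwd_model = MarianMTModel.from_pretrained(fwd_name)
bwd_tok = MarianTokenizer.from_pretrained(bwd_name)
bwd_model = MarianMTModel.from_pretrained(bwd_name)

def constrained_backtranslate(sentence, entities, num_beams=5):
    # Forward pass: English -> German (unconstrained beam search).
    de_ids = fwd_model.generate(**fwd_tok(sentence, return_tensors="pt"),
                                num_beams=num_beams)
    german = fwd_tok.batch_decode(de_ids, skip_special_tokens=True)[0]
    # Backward pass: German -> English with the entity mentions as hard
    # lexical constraints on the beam search.
    force_ids = bwd_tok(entities, add_special_tokens=False).input_ids
    en_ids = bwd_model.generate(**bwd_tok(german, return_tensors="pt"),
                                num_beams=num_beams,
                                force_words_ids=force_ids)
    return bwd_tok.batch_decode(en_ids, skip_special_tokens=True)[0]

print(constrained_backtranslate(
    "Joe Biden is the president of the United States.",
    ["Joe Biden", "United States"]))
```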
As depicted in
Once the augmented training data has been generated, the relation extraction model may be trained using the augmented training data. In accordance with the present disclosure, a curriculum learning process is used to train the model. Similar to established curriculums in human teaching, curriculum learning aims to provide a structure to the training set that can aid the model during the training process. For instance, feeding easier examples under a curriculum at the start (or the end) might improve the generalization performance of the model. One important challenge in developing an effective curriculum for any training model is the selection of a scoring function. The scoring function determines the order in which the examples from the training set are fed to the learning model - the examples with higher scores are introduced later in training.
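For illustration only, the following short Python sketch shows how a scoring function orders a training set for curriculum learning; sentence length is used purely as an illustrative difficulty score, and the example dictionaries are hypothetical.

```python
# Sketch of curriculum ordering: a scoring function assigns a difficulty score
# to each training example, and examples are fed to the model from easy to
# hard (higher scores appear later in training).
def length_score(example):
    return len(example["text"].split())  # illustrative difficulty score

def order_by_curriculum(train_set, scoring_fn):
    return sorted(train_set, key=scoring_fn)

train_set = [
    {"text": "Robert Bosch GmbH, founded in Stuttgart in 1886, makes tools.",
     "label": "founded_in"},                               # original sentence
    {"text": "founded in Stuttgart", "label": "founded_in"},  # parsing-based augmentation
]
for example in order_by_curriculum(train_set, length_score):
    print(length_score(example), example["text"])
```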
However, it may be difficult to determine a fixed order of examples in the case of data-augmented relation extraction. On one hand, the proposed parsing-based augmentation provides simpler examples for the model to learn; on the other hand, it might also introduce some noise into the training examples. Thus, the relative easiness of examples can change depending on the particular dataset and on the state of the model during training. Therefore, it is proposed herein that the scoring function be determined adaptively with respect to the dataset and the model. In particular, a novel adaptive curriculum learning framework is proposed that, instead of using a single scoring function, adaptively chooses a scoring function at different stages of the training process. Such a framework is additionally useful when there is no single scoring function that suits multiple different datasets of interest.
A schematic illustration of a training system 500 for training the relation extraction model is depicted in
Referring to
Referring to
The selected scoring function is then communicated to the model trainer, which uses the selected scoring function to order the training data set and then trains the intermediate model again using the newly ordered training data set. This process is repeated until convergence of the model is determined. As is known in the art, a machine learning model reaches convergence when it achieves a state during training in which the loss settles to within an error range around its final value. In other words, a model converges when additional training will not improve the model. The output evaluator 512 is configured to evaluate the task performance to determine whether convergence of the model has occurred. Referring to
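For illustration only, the following Python sketch outlines one possible realization of the adaptive curriculum loop described above; the train_one_stage and evaluate arguments are placeholders standing in for the actual relation-extraction trainer and validation metric, and the tolerance-based convergence test is an illustrative simplification.

```python
# Sketch of the adaptive curriculum loop: at each stage, every candidate
# scoring function is assessed by evaluating the current (intermediate) model
# on a validation set ordered by that function; the best-performing function
# is then used to order the training set for the next training stage.
def adaptive_curriculum_train(model, train_set, valid_set, scoring_fns,
                              train_one_stage, evaluate,
                              max_stages=10, tol=1e-3):
    prev_metric = float("-inf")
    for stage in range(max_stages):
        # Select the scoring function whose validation-set ordering the
        # current model handles best (placeholder evaluate() returns a metric).
        best_fn = max(scoring_fns,
                      key=lambda fn: evaluate(model, sorted(valid_set, key=fn)))
        # Train one stage on the training set ordered easy-to-hard by best_fn.
        model = train_one_stage(model, sorted(train_set, key=best_fn))
        metric = evaluate(model, valid_set)
        if abs(metric - prev_metric) < tol:  # convergence: no further improvement
            break
        prev_metric = metric
    return model
```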
The augmented data generator 200 and the adaptive curriculum learning system 500 may be implemented by a computer system which comprises one or more processors and associated memories that cooperate together to implement the operations discussed herein. These components can interconnect with each other in any of a variety of manners (e.g., via a bus, via a network, etc.). For example, the computer system can take the form of a distributed computing architecture where one or more processors implement the various tasks described above. The one or more processors may comprise general-purpose processors (e.g., a single-core or multi-core microprocessor), special-purpose processors (e.g., an application-specific integrated circuit or digital-signal processor), programmable-logic devices (e.g., a field programmable gate array), etc. or any combination thereof that are suitable for carrying out the operations described herein. The associated memories may comprise one or more non-transitory computer-readable storage mediums, such as volatile storage mediums (e.g., random access memory, registers, and/or cache) and/or non-volatile storage mediums (e.g., read-only memory, a hard-disk drive, a solid-state drive, flash memory, and/or an optical-storage device). The memory may also be integrated in whole or in part with other components of the system. Further, the memory may be local to the processor(s), although it should be understood that the memory (or portions of the memory) could be remote from the processor(s), in which case the processor(s) may access such remote memory through a network interface. The memory may store software programs or instructions that are executed by the processor(s) during operation of the system. Such software programs can take the form of a plurality of instructions configured for execution by the processor(s). The memory may also store project or session data generated and used by the system.
The system may include an input/output (I/O) interface that may be configured to provide digital and/or analog inputs and outputs. The I/O interface may include additional serial interfaces for communicating with external devices (e.g., a Universal Serial Bus (USB) interface). The system may also include a human-machine interface (HMI) device that may include any device that enables the system to receive control input. Examples of input devices may include human interface inputs such as keyboards, mice, touchscreens, voice input devices, and other similar devices. The system may include a display device. The system may include hardware and software for outputting graphics and text information to the display device. The display device may include an electronic display screen, projector, printer, or other suitable device for displaying information to a user or operator. The system may be further configured to allow interaction with remote HMI and remote display devices via the network interface device.
While the disclosure has been illustrated and described in detail in the drawings and foregoing description, the same should be considered as illustrative and not restrictive in character. It is understood that only the preferred embodiments have been presented and that all changes, modifications and further applications that come within the spirit of the disclosure are desired to be protected.