The present invention relates to automatic extraction of complex relations from free natural language text.
Trainable Machine Learning-based sequence classifiers are proficient at performing tasks such as part-of-speech (PoS) tagging (Avinesh, P. and Karthik, G. 2007. Part-Of-Speech Tagging and Chunking using Conditional Random Fields and Transformation Based Learning. Proceedings of SPSAL 2007), named entity recognition (NER) (McCallum, A. and Li, W. 2003. Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. Proceedings of CoNLL-2003, Edmonton, Canada: 188-191; Zhang, T. and Johnson, D. 2003. A Robust Risk Minimization based Named Entity Recognition System. Proceedings of CoNLL-2003, Edmonton, Canada: 204-207), and shallow parsing (Zhang, T., Damerau, F. et al. 2001. Text Chunking using Regularized Winnow. Meeting of the Association for Computational Linguistics: 539-546; Sha, F. and Pereira, F. 2003. Shallow Parsing With Conditional Random Fields. Technical Report CIS TR MS-CIS-02-35, University of Pennsylvania). However, they are less proficient at the task of relation extraction as shown by their relatively poor performance in Automatic Content Extraction (ACE) relation extraction shared tasks.
There are several reasons for the poor performance of Trainable Machine Learning-based sequence classifiers in relation extraction tasks.
Firstly, relation extraction is structurally more complex than PoS tagging, NER and shallow parsing. PoS tagging, NER, and, to a lesser extent, shallow parsing can be easily and precisely formulated in terms of sequential classification whereas relations, especially non-binary ones, often require multi-level structures, at least in the intermediate form during parsing.
Secondly, the volume of useful training data available for relation extraction in a corpus of a given size is significantly lower than that available for PoS tagging, NER and shallow parsing. In a reasonably-sized training corpus, tens of thousands of instances of each token class (e.g., “Noun” part-of-speech or “Person” entity) may be found, but at best only several hundred instances of relations. Achieving a comparable level of accuracy for relation extraction as for entity extraction (95% F1-measure or higher) using the same algorithms requires a much larger training corpus.
Currently, the task of relation extraction is usually formulated as extraction of binary relations between entities, which entities are assumed to be known, for example by means of having been extracted by a separate NER component. Such formulation allows the task to be modeled as a classification-of-pairs-of-entities problem or as a sequence classification problem. However, this approach has limited applicability since it cannot easily be generalized to relations with more than two slots or with a variable number of slots. Furthermore, attempts to combine several different binary relations into a single n-ary relation fail because the interdependencies between the relations are missed. In addition, the sentence structure complexity is missed unless a full parsing of the sentences is first performed. However, full parsing is relatively inaccurate due to ambiguities that cannot be resolved without reference to semantic processing, which semantic processing is only performed following parsing and therefore cannot inform the parsing. Also, full parsing is costly and, since only a small number of sentences contain instances of the target relation, performing it on every sentence is wasteful.
Rule-based entity and relation extraction systems based on context-free grammars are long known (Feldman, R. 2002. Text Mining. Handbook of Data Mining and Knowledge Discovery, Kloesgen, W. and Zytkow, J. Cambridge, Mass., MIT Press). Such systems are notoriously difficult to build and maintain due to a large number of rules and exceptions and the necessity of resolving every ambiguity by using rule ordering or complex constraints.
Systems that learn rules automatically from training data have also been tried, but with limited success (Freitag, D. 1997. Using grammatical inference to improve precision in information extraction. ICML'97. Nashville, Tenn.; Soderland, S. 1999. Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34(1-3): 233-272; Aitken, J. S. 2002. Learning Information Extraction Rules: An Inductive Logic Programming approach. 15th European Conference on Artificial Intelligence. Amsterdam; Kushmerick, N. 2002. Finite-state approaches to Web information extraction. 3rd Summer Convention on Information Extraction. Rome). Learning complex structures is very difficult for such systems and so the rules usually have a relatively simple “flat” form. Although the accuracy of automatic rules is relatively low for classic information extraction tasks involving finding and labeling all mentions of entities and relations in a given text, automatic rules may be successfully applied in redundancy-based settings, such as extracting many instances of a given relation from the Internet (Cafarella, M. J., Downey, D. et al. 2005. KnowItNow: Fast, Scalable Information Extraction from the Web. EMNLP 2005; Rosenfeld, B. and Feldman, R. 2007. Using Corpus Statistics on Entities to Improve Semi-supervised Relation Extraction from the Web. ACL-2007).
More recently, research has focused on relation extraction systems based on classification methods using arbitrary context features rather than rules. Such systems are typically only capable of extracting binary relations (Zelenko, D., Aone, C. et al. 2003. Kernel methods for relation extraction. The Journal of Machine Learning Research 3: 1083-1106; Chen, J., Ji, D. et al. 2005. Unsupervised Feature Selection for Relation Extraction. IJCNLP-05, Jeju Island, Korea).
Relation extraction based on full parsing of sentences and subsequent analysis of the parse trees are still state-of-the-art for more complex relations (Miller, S., Crystal, M. et al. 1998. Description of the SIFT system as used for MUC-7. Proceedings of the Seventh Message Understanding Conference (MUC-7)).
Systems based on partial and relation-specific parsing of sentences are less well known. Trainable Extraction Grammar (TEG) (Feldman, R., Rosenfeld, B. et al. 2006. TEG—a hybrid approach to information extraction. Knowledge and Information Systems 9(1): 1-18) is one such system and is the direct predecessor of CARE, which can be seen as a discriminative version of a generative TEG.
The present invention seeks to provide a system for extracting complex relations from free natural language text.
There is thus provided in accordance with a preferred embodiment of the present invention a system for extracting information from text, the system including parsing functionality operative to parse a text using a grammar, the parsing functionality including named entity recognition functionality operative to recognize named entities and recognition probabilities associated therewith and relationship extraction functionality operative to utilize the named entities and the probabilities to determine relationships between the named entities, and storage functionality operative to store outputs of the parsing functionality in a database.
There is also provided in accordance with another preferred embodiment of the present invention another system for extracting information from text, the system including parsing functionality operative to parse a text using a grammar, the grammar including a plurality of rules, at least some of the plurality of the rules having different weights assigned thereto, the parsing functionality employing the weights to select preferred results of the parsing, and storage functionality operative to store outputs of the parsing functionality in a database.
In accordance with a preferred embodiment of the present invention the parsing functionality includes named entity recognition functionality operative to recognize named entities and recognition probabilities associated therewith, and relationship extraction functionality operative to utilize the named entities and the probabilities to determine relationships between the named entities.
Preferably, both the weights and the probabilities are employed to select preferred results of the parsing. Additionally, the weights are trained using a labeled corpus. Alternatively, the weights are specified by a knowledge engineer. Preferably, the rules utilize results of sequence classifiers.
There is further provided in accordance with yet another preferred embodiment of the present invention a method for extracting information from text, the method including parsing a text using a grammar, the parsing including recognizing named entities and recognition probabilities associated therewith and utilizing the named entities and the probabilities to determine relationships between the named entities, and storing outputs of the parsing in a database.
There is yet further provided in accordance with still another preferred embodiment of the present invention a method for extracting information from text, the method including parsing a text using a grammar, the grammar including a plurality of rules, at least some of the plurality of the rules having different weights assigned thereto, the parsing employing the weights to select preferred results of the parsing, and storing outputs of the parsing in a database.
In accordance with a preferred embodiment of the present invention the parsing includes recognizing named entities and recognition probabilities associated therewith and utilizing the named entities and the probabilities to determine relationships between the named entities.
Preferably, both the weights and the probabilities are employed to select preferred results of the parsing. Additionally, the weights are trained using a labeled corpus. Alternatively, the weights are specified by a knowledge engineer. Preferably, the rules utilize results of sequence classifiers.
The present invention will be understood and appreciated more fully from the following detailed description, taken in conjunction with the drawings in which:
The present invention relates to a hybrid Machine Learning/Knowledge Engineering-based system, referred to as CARE (CRF Assisted Relation Extraction), for extracting complex relations from free natural language text. CARE can be thought of as an engine or a core of a Natural Language Processing (NLP) system, which receives an input comprising a block or corpus of natural language text such as English, and returns an output comprising the block after being parsed and processed. The parsing and processing may include identifying simple categorical entities such as personal names or addresses, referred to as NER, as well as more complex tasks such as extracting relations between predefined entities, such as identifying a relation between a person and the position of the person in a given company, also known as the PPC relation.
The present invention provides a system for extracting information from text including parsing functionality operative to parse a text using a grammar which includes named entity recognition functionality operative to recognize named entities and recognition probabilities associated therewith and relationship extraction functionality operative to utilize said named entities and said probabilities to determine relationships between said named entities, and storage functionality operative to store outputs of said parsing functionality in a database.
The system is based on weighted deterministic context free grammars, which work together with feature-rich sequence classifiers. An extraction grammar of CARE is a set of manually written rules, which specify both the structure of sentences and the slot assignment rules in a simple unified syntax. The rules may include real-valued weights, which can either be trained using a labeled corpus or specified intuitively by a knowledge engineer, as will be shown hereinbelow. CRF (Conditional Random Fields) is the mathematical structure underlying the algorithms used for stochastic parsing of the grammar.
The rules have access to the results of sequence classifiers such as an NER system or a PoS tagger. The interface between sequence classifiers and the grammar is a unique feature of the present invention. The interface is flexible in the sense that the grammar is able not only to access the classification results but also to modify the classification results according to the specific needs of the interface. The flexible interface is one of the key features of the present invention, as will be demonstrated in the experimental section hereinbelow.
The present invention is innovative in its use of a tree-based CRF for a “partial” focused parsing model and in its flexible interface between the parsing component and a lower-level sequence classification component.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent to one skilled in the art, however, that the present invention may be practiced without these specific details.
Preferably, CARE is used as a parsing component of a fully automatic Web-to-DB relation extraction system. CARE parses short segments of input text, such as sentences or paragraphs, into entities, and labels the entities in a manner which enables a deterministic post-processor to produce a final set of relations after resolving co-references and combining knowledge extracted from different sentences of the same document.
The output of CARE is a labeled sentence or paragraph, wherein the entities, their types and which slots of which relations, if any, they occupy, are labeled. The labeling may be multi-level, whereby entities may comprise sub-entities having relations between themselves.
Reference is now made to
Once the labeling of the entities is provided the final “flat” relation frames may be extracted. Three “flat” relation frames are shown in the example of
The role of a post-processor in producing such final relation frames is not described herein in detail.
The information that CARE uses to parse the input sentence and to produce the labeling as shown in
The standalone token-level sequence classifier receives a sequence of words or tokens as input, and labels each token with a tag from a small predefined set of tags. The most common sequence classifiers in NLP are NER systems, PoS taggers, and shallow parsers. State-of-the-art sequence classifiers are based on discriminative models such as CRF or Maximal Margin-based, which allow the use of arbitrary context features. These models use similar dynamic programming-based inference algorithms, differing only in the way the weights of various state transitions are calculated as functions of the context features. Because of the similarity between the models, CARE is able to function identically using any of the models.
The interface between CARE and the sequence classification is flexible, which, as mentioned above, is one of the most important aspects of the CARE architecture. Instead of running NER and/or PoS, and/or shallow parser systems separately and using their results as input, CARE receives as inputs the weights of state transitions and uses these weights as special context feature functions. This allows CARE to selectively modify a given trained model, adapting it in specific places and contexts but otherwise retaining unchanged scores. This results in significant flexibility as well as significantly improved accuracy as will be shown herein.
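The effect of this interface can be illustrated with a toy sketch in Python. All names, labels and weight values below are invented for illustration and do not represent CARE's actual interface; the point is only that a grammar rule can add its own weight on top of the classifier's transition weights, shifting the preferred labeling without retraining the underlying model.

```python
# Hypothetical sketch: a grammar consuming sequence-classifier transition
# weights rather than hard labels. Names and values are illustrative only.

def ner_transition_weight(prev_label, label, token):
    # Stand-in for a trained CRF's transition-weight function.
    # A real model would compute this from many context features.
    table = {
        ("None", "PERSON"): 2.0 if token.istitle() else -1.0,
        ("PERSON", "PERSON"): 1.5 if token.istitle() else -2.0,
        ("None", "None"): 0.5,
        ("PERSON", "None"): 0.5,
    }
    return table.get((prev_label, label), -3.0)

def score_labeling(tokens, labels, grammar_bonus=0.0):
    # Total score = sum of classifier transition weights, plus any
    # rule-weight adjustment contributed by the grammar in this context.
    total = grammar_bonus
    prev = "None"
    for tok, lab in zip(tokens, labels):
        total += ner_transition_weight(prev, lab, tok)
        prev = lab
    return total

tokens = ["John", "Smith", "resigned"]
base = score_labeling(tokens, ["PERSON", "PERSON", "None"])
# A grammar rule that expects a PERSON here can add its own weight,
# adapting the model locally while leaving other scores unchanged.
boosted = score_labeling(tokens, ["PERSON", "PERSON", "None"], grammar_bonus=1.0)
```

Because the adjustment is additive and local, the trained model's scores elsewhere are retained unchanged, which is the flexibility the paragraph above describes.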
The inference algorithm for a standalone sequence classification model comprises the following elements:
The Transition Weights Function is the most important and most complex element of those listed and is the only element that CARE uses. The transition weights function depends on the model type, on the set of context feature functions and on the training data that were used to train the model. Several possible model types, feature functions and datasets are described in (Rosenfeld, B., Fresko, M. et al. 2005. A Systematic Comparison of Feature-Rich Probabilistic Classifiers for NER Tasks. PKDD). In the experiments described herein a CRF model trained on a CoNLL-2003 dataset for the shared NER task was used (Tjong Kim Sang, E. F. and De Meulder, F. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. Proceedings of CoNLL-2003, Edmonton, Canada).
The method by which CARE uses the NER Transition Weights Function is now explained in more detail.
The CARE parser is a weighted discriminative context-free grammar (WDCFG). A context-free grammar (CFG) is a precise description of a language defined by formation rules, in which every rule is of the form V → w, where V is a single nonterminal symbol and w is a string of terminal and/or nonterminal symbols.
CFGs are used for analyzing the syntax of natural languages. The structure of CARE rulebooks, which will be described in detail hereinbelow, closely resembles the structure of a context free grammar, and the rules specified therein can be regarded as weighted context-free syntax generating rules. “Weighted” in this context means that each generating rule is assigned a weight, which is used by the CARE engine to find the highest-scoring parse and to determine which generating-rule should be applied.
The terminal symbols of the CARE grammar are token patterns. Such patterns may either match or not match a given token based on arbitrary properties of the token and its context, as well as on the classification of the token and the previous token according to the available sequence classification models.
The rules of the CARE grammar are conventional context-free production rules of the form
During the inference process or parsing, the task of the system is to find a parse (a tree of expansions of rules, the leaves of which are terminals) which matches the input sequence of tokens and has a maximal total weight among all such parses. The total weight is a sum of the weights of all rules participating in the grammar, with the addition of the weights of all participating sequence classification transitions.
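The inference task described above can be sketched as a minimal CKY-style dynamic program over a toy weighted CFG. The grammar, weights and data layout below are illustrative assumptions, not the CARE grammar or its internal representation; the sketch only shows how the maximal total weight over all matching parses can be found.

```python
from itertools import product

# Rules in Chomsky-like form: (child1, child2) -> [(head, rule_weight)]
BINARY = {("NP", "VP"): [("S", 1.0)], ("Det", "N"): [("NP", 0.8)]}
# Terminal rules: token -> [(head, rule_weight)]
LEXICAL = {"the": [("Det", 0.5)], "dog": [("N", 0.5)], "barks": [("VP", 0.7)]}

def best_parse_weight(tokens, goal="S"):
    n = len(tokens)
    # chart[i][j] maps nonterminal -> best total weight over tokens[i:j]
    chart = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        for head, w in LEXICAL.get(tok, []):
            chart[i][i + 1][head] = max(chart[i][i + 1].get(head, float("-inf")), w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (a, wa), (b, wb) in product(chart[i][k].items(), chart[k][j].items()):
                    for head, w in BINARY.get((a, b), []):
                        cand = wa + wb + w  # children's weights plus rule weight
                        if cand > chart[i][j].get(head, float("-inf")):
                            chart[i][j][head] = cand
    return chart[0][n].get(goal)

weight = best_parse_weight(["the", "dog", "barks"])
```

In CARE the terminal weights would come from the sequence classification transitions rather than a lexicon, but the dynamic program over spans is of this general shape.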
A WDCFG can be interpreted as a CRF over the set of input sequences and their possible parses. In a preferred embodiment of the present invention, this interpretation is achieved by translating the grammar into a set of FSAs, in order to simplify the description of the algorithms. This may also be achieved by translating the grammar into an equivalent grammar of a Chomsky Normal Form, as in (Taskar, B., Klein, D. et al. 2004. Max-margin parsing. EMNLP 2004).
In an FSA representation, each nonterminal symbol of the grammar gives rise to an FSA with one starting state and one finishing state and with weighted transitions labeled by terminal, nonterminal, or “empty” symbols. The FSA for a nonterminal is built in a manner wherein every path from the starting state to the ending state corresponds to a single rule of the nonterminal, and each such path contains exactly one transition with a weight equal to the weight of the rule. The weights of all other transitions are zero.
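The construction just described can be sketched as follows. The encoding of an FSA as a dictionary of transitions is an assumption made for the example; CARE's internal format is not specified here. Each rule becomes a path from the start state to the end state, with the rule's weight carried by exactly one transition on that path.

```python
# Illustrative FSA construction for one nonterminal: each rule of the
# nonterminal becomes a path from the start state to the end state, with
# exactly one transition carrying the rule's weight (here, the first one)
# and all other transitions on the path weighted zero.

def build_fsa(nonterminal, rules):
    # rules: list of (weight, [symbols]) for this nonterminal
    fsa = {"name": nonterminal, "start": 0, "end": 1, "transitions": []}
    next_state = 2
    for weight, symbols in rules:
        src = fsa["start"]
        for idx, sym in enumerate(symbols):
            last = idx == len(symbols) - 1
            dst = fsa["end"] if last else next_state
            if not last:
                next_state += 1
            # the rule weight goes on the first transition of the path
            fsa["transitions"].append((src, dst, sym, weight if idx == 0 else 0.0))
            src = dst
        if not symbols:  # empty rule: a single weighted "empty" transition
            fsa["transitions"].append((fsa["start"], fsa["end"], None, weight))
    return fsa

fsa = build_fsa("PPC", [(1.0, ["Person", "Position", "Company"]),
                        (0.5, ["Person", "Position"])])
```

Any start-to-end path then accumulates exactly one rule weight, matching the property stated above.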
The CRF that corresponds to the grammar can be formally defined as follows: let X be the set of all possible input sequences and Y be the set of all pairs of the form (state, pos), where state is a state in one of the FSAs, and pos is a position within a sequence. In the graphical model of the CRF, the pairs are nodes and a pair (state1, pos1) is connected to (state2, pos2) if there is a transition from state1 to state2 and pos1 ≤ pos2.
The CRF model defines a conditional probability
P(y|x) = Z(x)^−1 Π_c exp(Σ_i w_i f_i(x, c, y_c))
where c ranges over the 2-cliques of the graphical model, which are, basically, transitions with specified places. Only 2-cliques are used in order to keep the inference in the model tractable. For larger cliques, the values of f are assumed to be zero.
There are K+L feature functions f_i, where K is the number of rules and L is the number of available sequence classification models. Given an input sentence x, the value of the i-th feature function for i ≤ K is:
f_i(x, (t, from, to)) = 1, if t is the non-zero-weighted transition that corresponds to the i-th rule;
and otherwise:
f_i(x, (t, from, to)) = 0.
For i > K:
f_i(x, (t, from, to)) = TF_(i−K)(x, from, t.Label_(i−K), t.PrevLabel_(i−K)), if t is a terminal transition;
and otherwise:
f_i(x, (t, from, to)) = 0.
The t.Label and t.PrevLabel denote the classification labels allowed for the token by the transition's token pattern.
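The K+L feature functions described above can be sketched in Python. The transition encoding, the toy transition-weight function and the constant K are assumptions made for illustration; the sketch only shows the split between rule-indicator features (i ≤ K) and classifier-delegating features (i > K).

```python
# Sketch of the K+L feature functions: the first K fire as indicators for
# rule transitions; the remaining L delegate to sequence-classifier
# transition-weight functions TF. All structures are illustrative.

K = 2  # number of grammar rules in this toy example

def make_feature_functions(transition_weight_fns):
    def f(i, x, t, from_pos, to_pos):
        if i < K:
            # indicator: is t the non-zero-weighted transition of rule i?
            return 1.0 if t.get("rule") == i else 0.0
        if t.get("terminal"):
            tf = transition_weight_fns[i - K]
            return tf(x, from_pos, t["label"], t["prev_label"])
        return 0.0
    return f

def toy_tf(x, pos, label, prev_label):
    # stand-in for a trained NER transition-weight function
    return 1.0 if label == "PERSON" and x[pos].istitle() else 0.0

f = make_feature_functions([toy_tf])
rule_t = {"rule": 1, "terminal": False}
term_t = {"terminal": True, "label": "PERSON", "prev_label": "None"}
x = ["John", "sleeps"]
```

Under this split, the total parse score is a weighted sum of rule indicators plus the raw transition weights of the underlying classifiers.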
The interpretation of a WDCFG as a CRF allows a CRF training algorithm to be used, given a training corpus, to set the weights of rules automatically. Assuming there is a set of training sequences {x(k)}, for which the correct parses {y(k)} are known, the rule weights wi can be set to maximize the (Gauss-penalized) log-likelihood:
L(w) = Σ_k log P_w(y^(k) | x^(k)) − Σ_i σ_i (w_i − α_i)^2
where P_w(y|x) is the conditional probability of y given x, assigned by a CRF with the weight vector w. Usually, α_i = 1 and σ_i = ∞ are set for i > K, and α_i = 0 and σ_i = 1 are set for i ≤ K.
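The penalized objective can be made concrete with a toy computation, treating σ_i as a penalty strength multiplying the squared deviation, as in the formula above. The feature counts, weights and candidate-parse sets are invented for illustration.

```python
import math

# Worked sketch of the penalized log-likelihood, for a toy model in which
# each training sentence has a small explicit set of candidate parses.

def log_likelihood(w, data, alpha, sigma):
    # data: list of (features_of_correct_parse, [features_of_each_candidate])
    ll = 0.0
    for correct, candidates in data:
        score = lambda feats: sum(wi * fi for wi, fi in zip(w, feats))
        z = sum(math.exp(score(c)) for c in candidates)  # partition function
        ll += score(correct) - math.log(z)
    # Gaussian penalty: sigma_i acts as the prior strength
    ll -= sum(s * (wi - a) ** 2 for wi, a, s in zip(w, alpha, sigma))
    return ll

# one sentence, two candidate parses, two features
data = [([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])]
alpha, sigma = [0.0, 0.0], [0.1, 0.1]
weak = log_likelihood([0.0, 0.0], data, alpha, sigma)
better = log_likelihood([1.0, 0.0], data, alpha, sigma)
```

Raising the weight of a feature that fires on the correct parse increases the objective, as long as the gain outweighs the prior penalty; a training algorithm would search for the w maximizing L(w).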
In principle, this method allows optimal setting of the weights. However, a large volume of training data is required even for a moderately small number of rules. Fortunately, for rules written by humans, the weights have a simple meaning that can be easily grasped intuitively: the weight of a rule is the strength of a “force” that compels the parser to include the rule in the final parse if the rule matches. Provided that the orders of magnitude of the weights are correct, the precise values of the weights are insignificant. Thus, it is usually sufficient to use just six different weight values—“small”, “medium”, and “large” in magnitude, each of which may be positive or negative. These values can be specified manually in an intuitive manner, as will be shown in examples below.
A grammar, its connection to the sequence classifiers and its output-producing directives are all defined within a rulebook, which is a text file written in a special-purpose formal language. A rulebook contains the following parts: declarations of the sequence classifier models, declarations of the target relations and a definition of the WDCFG.
In a preferred embodiment of the present invention only one sequence classifier model, which is a CRF-based named entity recognition model trained on the CoNLL-2003 shared task data, is used. Thus, a sequence classifier declaration specifies the file containing the NER model and the set of NER labels that can be used as terminal symbols within the grammar specification. This set contains the labels PERSON, ORG, LOC and “None”, which includes everything else.
Target relation declarations specify the names of the target relations and their slots. Relations without slots are target entities. It is important to note that the output entities of CARE can be and often are different from the entity labels defined by the NER model.
Below are examples of declarations, suitable for describing the relations and entities shown in
The WDCFG syntax is similar to the syntax of regular production rules. The EBNF of the grammar syntax is:
Nonterm, Relation and Slot are identifiers. The syntactic elements |, ( . . . ), [ . . . ], { . . . } and + constitute “syntactic sugar”; they are not required in principle but help greatly in practice. The | (OR) is equivalent to several rules with the same nonterminal. The other elements are translated into calls of additional unnamed nonterminals.
It is not necessary to declare nonterminals, with the exception of the start nonterminal, the relation-producing nonterminals and nonterminals that are used in a rule before their first rule appears.
For all rules used in the preferred embodiments of the present invention, the following patterns are sufficient:
As described hereinabove, the nonterminals can be declared to generate a target relation. When a relation-generating nonterminal appears in the final parse of a sentence, the fragment of text that matches the nonterminal will be marked by the tags <RelName> . . . </RelName>. However, the assignment of slots is specified by Elements which are the building blocks of the rules as specified in the EBNF. If an element with a slot assignment appears in the final parse of a sentence, the matching fragment of text will be marked by the tags <_SlotName> . . . </_SlotName>. Note that it is not necessary for slot assignments to appear within a rule for the relation-generating nonterminal. They may appear at any location within the parse tree, provided that the location is a descendant of such a nonterminal.
Reference is now made to
5. The starting nonterminal Sentence (line 12) is defined as an arbitrary sequence of defined entities, relations, and “None”-s. The <−S> weight ensures that, all else being equal, the longest entities and relations are chosen, since they produce the fewest invocations of this rule.
As described hereinabove, the internal representation of a CARE grammar is a set of FSAs, with transitions being labeled by terminals and nonterminals, weights and output actions.
The allowed feature functions are such that, given a segment within a sentence and a nonterminal matching the segment, the weights of the nonterminal do not depend on the parse of the rest of the sentence. Conversely, the weights of the rest of the parse do not depend on the chosen path through the nonterminal's FSA. This allows dynamic programming to be used to solve the training and inference problems of the corresponding CRF. In the case of inference, if a best (“heaviest”) path is found for a given nonterminal and a given fragment within a sentence, it can be stored and reused. Similarly, in the case of training, if the total weight mass is calculated for a nonterminal and for a fragment, it can be stored and reused. This is the basis of the training and inference algorithms employed.
In a first step, the “syntactic sugar” referred to above is converted to an internal “austere” representation:
After this step, each rule can be represented by a WDCFG rule of the form
S → E1 E2 . . . Ek
where each Ei is either a nonterminal or a terminal (token pattern, NER label, or wordclass). Each Ei may also be marked with an output tag, if so indicated by the source rulebook.
In the next step, an FSA is built for every nonterminal. Each rule adds a path from the beginning node of the FSA of its head nonterminal to the ending node. Every Ei becomes a transition, labeled with Ei and with “output tag actions” where appropriate.
For example, following the three steps detailed above, a rule
N1 and N4 are the starting and ending nodes of the PPC FSA, and N5 and N8 are the starting and ending nodes of the Temp nonterminal's FSA. The other nodes are internal. The transitions are:
In the next step, the set of FSAs is optimized, as described in the next section below.
Usually, parse-based CRFs are formulated in a manner which makes it possible for the feature function of a rule to depend on a caller. While a CRF with such dependency would still allow for dynamic programming training and inference, its complexity would be significantly higher. Thus, such dependencies are preferably forbidden in preferred embodiments of the present invention. In the rare cases where such dependencies are necessary, they may be simulated by duplicating the corresponding FSAs.
The set of FSAs produced by the grammar according to the steps described hereinabove is highly suboptimal in that there are many unnecessary FSAs and extra transitions, leading to costs in terms of time and memory. The set of FSAs is optimized by applying the following four types of transformations:
The following are examples of the above optimizations:
The four empty transitions are unnecessary and can be removed, by gluing together the nodes N3, N5, N7, and N8, which results in:
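The node-gluing step for removing unnecessary empty transitions can be sketched as follows. The list-of-tuples FSA encoding and the union-find helper are assumptions made for the example; the point is that zero-weight empty transitions disappear and their endpoint nodes merge.

```python
# Illustrative sketch of one optimization: removing zero-weight empty
# transitions by gluing their endpoint nodes together (union-find).

def glue_empty_transitions(transitions):
    # transitions: list of (src, dst, symbol, weight); symbol None = empty
    parent = {}

    def find(n):
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    # glue the endpoints of every zero-weight empty transition
    for src, dst, sym, w in transitions:
        if sym is None and w == 0.0:
            parent[find(src)] = find(dst)
    out = []
    for src, dst, sym, w in transitions:
        if sym is None and w == 0.0:
            continue  # the transition itself disappears
        out.append((find(src), find(dst), sym, w))
    return out

# a chain A -(empty)- B -(empty)- C, with two removable empty transitions
fsa = [(0, 1, "A", 1.0), (1, 2, None, 0.0), (2, 3, "B", 0.0),
       (3, 4, None, 0.0), (4, 5, "C", 0.0)]
optimized = glue_empty_transitions(fsa)
```

After gluing, the remaining labeled transitions form a contiguous chain over the merged nodes, with fewer states and transitions to visit during parsing.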
The limiting of a generic tree parse CRF and the optimization and programming steps described above improve the time and memory complexity by at least two orders of magnitude. This permits the use of far more complex grammars which are required in order to achieve the desired level of accuracy.
Detailed guidance for the composition of a basic rulebook is provided in this section. A rulebook can be written as a plain text file with an “*.rec” extension. Every rulebook begins with a preamble which includes several general definitions. In the preamble the NER model and general entities are defined, the basic sentence processing rules are specified and an optional set of macros and wordclasses are provided, as will now be described.
Formally, the only component required in a preamble is an NER declaration. However, it is a matter of convention and of good coding style to place the fundamental parsing rules, relation definitions, macros and wordclasses at the beginning of the rulebook for the purpose of order and clarity. This convention is adhered to in the composition of the CARE rulebook, and all general purpose macros and wordclasses are placed at the beginning of the rulebook. Note that once declared, specific or ad-hoc wordclasses and macros may be placed anywhere within the rulebook.
Preferably, a rulebook begins with the following template lines:
@DefaultTokenFeature is an instruction specifying the default token feature. A token feature is a property of tokens defined by regular expressions in the NER file. Formally it is a function which maps strings to strings. For example, a feature can be set to evaluate numeric characters, uppercase characters, or even whitespace characters. The default token feature used is ‘WordAll’ which takes the entire token and treats it as if it were in lowercase, regardless of its original case. Thus, provided that the default feature is not changed, CARE evaluates the text comprising the corpus as if it were entirely in lowercase.
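As a minimal sketch, a token feature in the sense described above is just a string-to-string function. The ‘word_all’ function below mimics the described ‘WordAll’ behavior; ‘digit_pattern’ is an invented example of another feature one might define, not one taken from the rulebook.

```python
# Token features as string -> string functions (illustrative sketch).

def word_all(token):
    # mimics the described 'WordAll' default: the whole token, case-folded
    return token.lower()

def digit_pattern(token):
    # hypothetical example feature: collapse every digit to '9'
    return "".join("9" if ch.isdigit() else ch for ch in token)
```

A rule's token pattern could then test the feature value rather than the raw token, making rules insensitive to capitalization or to the exact digits in numbers.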
The above instruction follows the more general syntax:
Following the NER and general entity declaration, the sentence processing rules need to be specified:
‘concept’ is used to define a new non-terminal symbol, and ‘start’ indicates that this is the initial symbol of the CFG, to which all rules can be traced back. In effect, the above predicate states:
It is noted that Sentence is only a name and may be replaced by any other identifier or case-sensitive sequence of alphanumeric characters and underscores which starts with a letter.
‘concept Phrase’ defines a new non-terminal symbol called Phrase, which will be used in the following rules.
‘Sentence :- Phrase;’ is the first rule of the CFG. It states the following:
No weight is explicitly stated for this rule, in which case CARE assigns this rule a default weight of 0. This concept will be discussed in further detail herein.
‘Phrase :- <−0.01, 0, 0> None Phrase;’ is the second CFG rule. Whereas the first rule had no weight assigned to it, this second rule has three weight-related numbers assigned to it. The rule is a simple recursive directive stating:
In effect, this rule, together with the next one, allows ‘Phrase’ to comprise any number of unlabeled tokens. The first number is called the weight. The weight is the value that CARE assigns to every instance of this rule's application when calculating which parse is best. The higher the weight of a rule, the more likely it is to be applied in generating the preferred parse. The very small negative weight in the above example (−0.01) is intended to discourage greediness, so that the rule is not applied repeatedly a large number of times. Note that the scale of a weight is relative to the weights assigned to the other rules and to the weights specified by the underlying NER model. A detailed explanation of weights and how to assign them is provided hereinbelow.
The next two numbers are named prior weight and sigma, respectively. These values may be omitted, and the normal syntax usually includes only the first number; the prior weight and sigma are used only during training, in which the weights of the rules are adjusted to optimize the accuracy of the extraction from the training data. This adjustment should not happen for entities defined by NER, such as None, since NER training will have already optimized these rules in the model, and they should therefore not be changed during the further relation-rule training. These values are rarely explicitly specified. The prior weight indicates the standard value of the weight, and sigma specifies the prior strength. When the system trains, it penalizes the deviation of the weight from its standard value. By “penalize” it is meant that it balances the magnitude of the difference between the weight and the prior weight against the fit of the model to the training data. The default prior weight is 0. The prior strength is a measure of how important the |weight − prior weight| deviation is versus the importance of the training fit. The default sigma is 1. Small values of sigma indicate a very strong prior, so that the trained weights deviate very little from their prior weight value; a zero sigma means the weight is not changed at all during training.
‘Phrase :- ;’ allows the empty word to be considered a phrase, thus guaranteeing a finite parse for recursive rules.
Wordclasses are explicitly defined sets of words or word-sequences used as lookup libraries as specified in the rules. The syntax for declaring a wordclass is as follows:
Note that a sequence of words must be placed in parentheses, e.g. (olive green) as shown above.
When the set is too large to be directly included in the rulebook, we can put it into a separate file and include it using an “%include” directive, as follows:
Relations are the target labels which CARE uses to tag its output. In other words, relations are used strictly for output. In order to allow CARE to tag the corpus with a specific relation, one must tie the relation to a concept using the following syntax:
This will produce the respective XML tags
Each relation can have a number of slots or subfields, which are used as inner tags. A relation with no slots is simply an Entity, which is tagged as above using the corresponding concept. The tags corresponding to the slots of relations are slightly different, in that they are produced by rule elements, rather than by concepts. Also, the slots in the output are indicated by underscores appearing in their names as in:
The essence of the rules described here is to extract from the text corpus the instances of the defined relations. Relations are therefore referred to extensively throughout this description. The declaration syntax:
Macros are single instructions that expand automatically into a set of instructions which perform a task. The last part of this initial part of the rulebook may contain macros which look and perform exactly like C/C++ macros. Two examples of prevalent macros are given:
The function of this macro can be explained using an example, namely:
The second example of a macro is:
This macro is used to generate the set of all permutations of any given three symbols. It can be applied for example to create a wordclass as follows:
This word class would now include all the permutations of the three words:
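As a language-neutral illustration of what such a permutation macro expands to (the macro itself is written in the rulebook syntax, and the colour words below are merely example inputs), the expansion may be sketched in Python:

```python
from itertools import permutations

def permute3(a, b, c):
    """Expand three symbols into all six of their orderings,
    each rendered as a word sequence."""
    return [" ".join(p) for p in permutations((a, b, c))]

# e.g. a wordclass covering every ordering of three example words:
wordclass = permute3("dark", "olive", "green")
# six sequences, from "dark olive green" through "green olive dark"
```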
This section describes how to write a relatively short, yet fully functional rulebook for the extraction of a specific simple relation. A relation is considered simple if it appears as one consecutive sequence in a sentence and can be relatively easily extracted. When a rulebook extracts a relation or an entity, this means it aims to direct CARE in recognizing and tagging all instances of that relation or entity in a given text corpus.
The entity to be extracted is an address. In other words, the rulebook described here attempts to extract all instances of addresses from a given free language unstructured text. Since the rulebook deals with composite entities, which do not necessarily share a well-defined format, it is inevitable that some targets will be missed and conversely some false instances declared.
The extent to which all true instances are successfully extracted is called recall, and the measure of the accuracy of a declaration that an extraction is indeed a designated target is called precision.
Recall and precision are defined as follows:
Recall=TP/(TP+FN)
Prec=TP/(TP+FP)
where TP is True Positive, meaning correctly declared instances, FN is False Negative, meaning missed instances, and FP is False Positive, meaning incorrectly declared instances. Ideally, recall and precision would both reach 100%, but as explained above this is generally impossible to attain unless the rulebook is dealing with a concept whose form can be precisely defined, such as one given by a regular expression. Instead, there exists a delicate balance between recall and precision, and the goal is to find an operating point at which both values are optimized. Usually, precision is considered more important than recall, and thus very high precision rates are sought. Recall above 95% is usually considered to be very good.
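The recall and precision definitions above may be sketched in Python as follows (the function name and example counts are illustrative):

```python
def recall_precision(tp, fp, fn):
    """Recall = TP / (TP + FN); Precision = TP / (TP + FP)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# e.g. 95 correct extractions, 3 spurious ones, 5 missed targets:
r, p = recall_precision(95, 3, 5)  # recall 0.95, precision 95/98
```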
The rulebook begins with
What remains to be explained of this is the syntax:
This syntax already appears in the rulebook, at least partially, in the first line:
This is taken to be an instruction specifying the default token feature. The last few lines of code switch back and forth between the token feature WordAll and the token feature Feature. Token features are functions of tokens with string values. They are defined by regular expressions in the NER definitions file, NERCRF_nerdefes.crf. The value of “WordAll” for a token is the token converted to lowercase. The value of “Feature” for a token is one string from the set { “Capital”, “AllCaps”, “Punctuation”, . . . }, chosen according to whether the token matches the regular expressions “[A-Z][a-z]+”, “[A-Z]+”, “[^a-zA-Z0-9]”, . . . , respectively. The set of possible values and the regular expressions are all defined in NERCRF_nerdefes.crf.
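The token-feature lookup described above may be sketched in Python. The three patterns are those quoted in the text; the table structure and function names are illustrative reconstructions, not the actual contents of NERCRF_nerdefes.crf:

```python
import re

# The patterns quoted above; the full set is defined in the
# NERCRF_nerdefes.crf file.
FEATURE_PATTERNS = [
    ("Capital",     re.compile(r"^[A-Z][a-z]+$")),
    ("AllCaps",     re.compile(r"^[A-Z]+$")),
    ("Punctuation", re.compile(r"^[^a-zA-Z0-9]$")),
]

def word_all(token):
    """The 'WordAll' token feature: the token converted to lowercase."""
    return token.lower()

def feature(token):
    """The 'Feature' token feature: the first matching class name, if any."""
    for name, pattern in FEATURE_PATTERNS:
        if pattern.match(token):
            return name
    return None
```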
The default token feature is used in CARE in the following way: whenever a terminal symbol appears in a rule
Rules all have the following general form:
The Weight is denoted by angular quotes: <weight>.
A template example is given by the following:
Note that [<w> A|B] is not equivalent to [<w> A|<w> B]
To further clarify the above explanation, an example of a real rule is presented below:
Here the non-terminal symbol Streetword is provided by the following:
The aforementioned rules are the main syntactic rules. The following section describes guidelines for assigning weights to the various rules and their constituents.
Some heuristics for assigning weights to the rules are now presented in order to provide insight into what the rulebook is trying to achieve.
A phone number has the general form of +X(XXX)XX-XX-XX. The appearance of any one small clue does not guarantee that the result is definitely a phone number, but if all the small clues appear, namely a plus sign, brackets and minus signs, there is a cumulatively high chance that this indeed is a phone number.
The idea here is that the non-terminal symbol “DomainX” should consume as many tokens as possible, as long as other higher-weight constraints do not match. Such higher-weight constraints may come from NER when a high-probability entity occurs or from another CARE rule.
Sometimes it may be necessary to raise or lower particular weights in order to resolve conflicts between competing rules. Also, after a rulebook is trained using a training data corpus, the accuracy of the parsing should increase significantly.
A complete basic rulebook is now presented with annotations. /* . . . */ and // . . . denote comments, enclosed and until end of line, respectively.
In order to run CARE, a directory containing the following files needs to be created:
In order to run CARE on a given rulebook the following line is typed at the command prompt:
This compiles the rulebook RuleBook.rec, runs it over Corpus.txt and places the tagged corpus in a new file “results”.
Note that a corpus must be in the following format:
A complex relation refers to a relation with multiple slots, which may be broken up into segments across a sentence. As implied by the structure of the legal corpus provided hereinabove, the basic units that CARE knows how to deal with are sentences as delineated by the tag <S> . . . </S>. Extracting any relation that spans several sentences must be done in post-processing and is therefore beyond the scope of this description.
Obviously, there is no one way of writing a rulebook, in particular for extracting a complex relation. However, there are some very useful tricks and tips which are presented below.
The following is an example of how macros can be harnessed to create very general and comprehensive wordclasses using a few simple instructions. For example, suppose a rulebook needs to be written to identify a complex relation between a Person, the Company the Person works for and the Person's Position at that company (PPC). In order to extract such a relation, the basic constituents, namely people, companies and positions, must be identifiable, as well as the relationship binding them together. A few examples are provided in the following sentences, wherein relation constituents are marked in bold:
Aside from the three constituents, these sentences all seem to contain a verb indicating that the person was appointed to the job. The question is how such verbs can be identified inside sentences since they seem to be in so many different tenses, conjunctions and positions. To answer this question, first a macro is defined that generates a set in which an element x is moved across a sequence w1, . . . , wn, while keeping the sequence's order fixed:
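The macro itself is written in the rulebook syntax; as a language-neutral sketch of the set it generates, the following Python function produces the same expansion (the example words are illustrative):

```python
def move_across(x, words):
    """Insert the element x at every position of the sequence w1..wn,
    keeping the sequence's own order fixed."""
    return [" ".join(words[:i] + [x] + words[i:])
            for i in range(len(words) + 1)]

# e.g. moving "recently" across the sequence "was named":
variants = move_across("recently", ["was", "named"])
# -> ["recently was named", "was recently named", "was named recently"]
```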
Then, an auxiliary macro is defined which will help create the active forms of a given verb.
To show how this macro works, it can be fed with the following input set:
The last parameter is empty, which is a legal syntax for omitting a parameter. Additionally, the last two parameters are identical, since the perfect tense of the verb “to name” is “has named” and not “has namen”. The result of feeding the above set to the macro is:
Had the input set included the last parameter, such as:
This example shows how powerful the macro is. Now this auxiliary macro is used as input to a more general macro to generate a full word class of active verb forms:
In order to generate almost any active verb with the stem verb “promote” it is sufficient to write:
For irregular verbs, such as “choose” above, all the conjugations are explicitly specified, manually omitting the stem or root. Similar macros can be specified for passive forms of verbs:
which is used as an auxiliary macro for the following macro:
A symmetric analysis of these macros is omitted here; however, their use will be clear from the above description.
Determining the Relative Position in which a Relation Can Appear
There are several standard ways in which a calling rule for a relation, pointed to by a concept, can be added into the rulebook. Three of the most prevalent ways to do so are presented here.
“Phrase” is meant to match an arbitrary sequence of entities, relations, and “None”s, as specified by the standard preamble rule:
Phrase can be visualized as a state in a finite-state automaton (FSA) with multiple transitions to itself, labeled by every possible entity and relation, and one ending transition, defined by:
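This FSA view may be sketched in Python; the label set below is illustrative, not the actual set of entities and relations in any particular rulebook:

```python
# One state, "Phrase", that loops on every entity or relation label
# (including "None") and has a single ending transition.
LOOP_LABELS = {"Person", "Organization", "Position", "PPC", "None"}

def is_phrase(labels):
    """Accept any (possibly empty) sequence of loop labels."""
    for label in labels:
        if label not in LOOP_LABELS:
            return False  # no transition for this label
    return True  # the ending transition from Phrase is always available
```

Note that the empty sequence is accepted, consistent with the ‘Phrase :- ;’ rule that allows the empty word to be a phrase.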
The rules for a complex relation are presented here. The following code is an excerpt from a rulebook for PPC extraction:
It is instructive to show how to ensure that PPC instances are indeed tagged in the output:
Writing a rulebook is a creative process in which not all instances to be extracted can be predicted in advance. A rulebook often starts out as a simple collection of elementary rules that undergo modifications and extensions, unfortunately sometimes up to a point of congestion. Before such a state is reached, it is important to restructure the rules, so that medium level common phrases are aggregated into auxiliary midlevel rules, as depicted in aforementioned example (OrgX, PersonX).
In order to build comprehensible and effective rulebooks, it is important to invest some time in articulating the rules elegantly, so that the result is concise, readable, modular and extensible. There will always be new instances not “caught” by the rulebook. Repeated iterations over the tagged results will be needed to ensure that the correct instances are extracted, but even then it is unlikely that all possible cases will be covered, unless the relation is extremely simple or given completely by deterministic clues. However, if an effort is made to organize the rules in a readable manner, extending the rulebook becomes much easier. Another aspect in which creativity is evinced is in the weights assigned to the rules. A delicate balance sometimes needs to be reached between competing rules. Producing a precise and extensive training corpus for the rules will also affect the accuracy of the rulebook.
This section contains additional CARE syntax and tips for reducing the compilation time of rulebooks.
At a later stage of CARE development, an explicit syntax for manipulating regular expressions was added as an integral part of the CARE syntax. The syntax employed is that of Perl's Regular Expressions. Full documentation is available on the Internet and is beyond the scope of this description. However, several examples and explanations of usage are presented below:
The following are several additional examples:
Note that these concepts, once defined, can be used as any other concept would be used, for example:
Long rulebooks may take a significant time to compile, up to 3-6 minutes. This becomes a major impediment for development when a rulebook needs to be regularly updated to handle new examples.
A solution for reducing compilation time does exist. In fact, the long compilation time is almost entirely due to normalization of wordclasses. In order to skip normalization of wordclasses simply use the following syntax:
Experiments are described here for three main purposes. Firstly, the experiments demonstrate the present invention's ability to cope with complex structures and to successfully extract relations with very high accuracy (95% and higher) in terms of both precision and recall. The present invention achieves this accuracy using a standard training corpus for training a named entity recognition model and a limited number of manually-written relation extraction rules. Secondly, the experiments demonstrate the importance of various parts of the present invention's architecture. This is shown by a series of “handicap” experiments, in which one of the key components of the system is disabled. Thirdly, the experiments show that although training the weights of the CARE grammar automatically is possible, it produces worse results than setting the weights manually and in an intuitive way.
No precise numerical comparisons to other relation extraction systems are attempted, due to the difficulty of establishing a standard benchmark test. Instead, comparisons are made at qualitative levels.
In this experiment a complete PPC (“Person-Position-Company”) rulebook is used, working with a CRF-based NER model trained on the CoNLL-2003 shared task data. The experimental corpus is a large set of short biography pages from the BusinessWeek website. In order to test the performance of the system a random set of ten such pages was taken containing 113 instances of the PPC relation. The rulebook was run on these pages and the results were then checked manually. The results themselves have the same basic form as that shown in
It is difficult to numerically define the accuracy measures for a tree-structured output. Consequently, two different sets of measures are used: one takes into account only the accuracy of slot assignments, irrespective of the structure, whereas the other counts the number of correctly extracted “final” relations produced after post-processing the CARE output. This second measure indirectly takes the structure into account. Two counts are shown, the “exact” and “partial”. For the “exact” count, an extracted instance is considered a “true positive” if all of the slots of the relation instance are found and correctly filled. For the “partial” count, the position and dates can be missed, but not incorrectly filled.
In these experiments, the importance of the various parts of CARE architecture is demonstrated by disabling them in turn and evaluating the results produced by the handicapped systems.
There are two such experiments. In the first experiment, the flexible interface between CARE and NER is disabled, in effect making all NER decisions immutable. In the second experiment, NER is disabled altogether by making the weights of all NER transitions equal, so that all named entity recognition decisions are made by the rules. The second experiment is thus seen to be the opposite of the first.
The statistics for these two experiments are shown in
The Training Experiment
In this experiment the results of a system in which the weights were trained instead of being set manually and in an intuitive way are shown. The amount of training data used was relatively small due to the difficulty in labeling a sufficiently large training corpus. The results in
It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather the scope of the present invention includes both combinations and subcombinations of various features described hereinabove as well as modifications of such features which would occur to a person of ordinary skill in the art upon reading the foregoing description and which are not in the prior art.
Reference is made to U.S. Provisional Patent Application Ser. No. 61/273,961, filed Aug. 10, 2009 and entitled “CONDITIONAL RANDOM FIELDS (CRF)-BASED RELATION EXTRACTION SYSTEM”, the disclosure of which is hereby incorporated by reference and priority of which is hereby claimed pursuant to 37 CFR 1.78(a) (4) and (5)(i).