Ever since its invention, text has been the fundamental repository of human knowledge and understanding. With the invention of the printing press, the development of the computer, and the explosive growth of the Web, the amount of readily accessible text has long surpassed the ability of humans to read it. This challenge has only grown with the popularity of new text production engines such as Twitter, where hundreds of millions of short “texts” are created daily [Ritter et al., 2011]. Even finding relevant text has become increasingly challenging. Clearly, automatic text understanding has the potential to help, but the relevant technologies have to scale to the Web.
Starting in 2003, the KnowItAll project at the University of Washington has sought to extract high-quality collections of assertions from massive Web corpora. In 2006, it was noted: “The time is ripe for the AI community to set its sights on Machine Reading—the automatic, unsupervised understanding of text.” [Etzioni et al., 2006]. In response to the challenge of Machine Reading, the Open Information Extraction (Open IE) paradigm, which aims to scale IE methods to the size and diversity of the Web corpus, was investigated [Banko et al., 2007].
Typically, Information Extraction (IE) systems learn an extractor for each target relation from labeled training examples [Kim and Moldovan, 1993; Riloff, 1996; Soderland, 1999]. This approach to IE does not scale to corpora where the number of target relations is very large, or where the target relations cannot be specified in advance. Open IE solves this problem by identifying relation phrases—phrases that denote relations in English sentences [Banko et al., 2007]. The automatic identification of relation phrases enables the extraction of arbitrary relations from sentences, obviating the restriction to a pre-specified vocabulary.
Open IE systems deliberately avoid tying their extractors to specific nouns and verbs. The extractors are unlexicalized—formulated only in terms of syntactic tokens (e.g., part-of-speech tags) and closed word classes (e.g., of, in, such as). Thus, Open IE extractors focus on generic ways in which relationships are expressed in English—naturally generalizing across domains.
Open IE systems have achieved a notable measure of success on massive, open-domain corpora drawn from the Web, Wikipedia, and elsewhere [Banko et al., 2007; Wu and Weld, 2010; Zhu et al., 2009]. The output of Open IE systems has been used to support tasks like learning selectional preferences [Ritter et al., 2010], acquiring common-sense knowledge [Lin et al., 2010], and recognizing entailment rules [Schoenmackers et al., 2010; Berant et al., 2011]. In addition, Open IE extractions have been mapped onto existing ontologies [Soderland et al., 2010].
Open IE systems make a single (or constant number of) pass(es) over a corpus and extract a large number of relational tuples (Arg1, Pred, Arg2) without requiring any relation-specific training data. For instance, given the sentence, “McCain fought hard against Obama, but finally lost the election,” an Open IE system should extract two tuples, (McCain, fought against, Obama) and (McCain, lost, the election). The strength of Open IE systems is in their efficient processing as well as their ability to extract an unbounded number of relations.
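As a concrete illustration, the following is a minimal Python sketch of this tuple representation, using the two extractions from the example sentence above; the type name and structure are illustrative assumptions, not part of any particular system.

```python
from typing import NamedTuple

class Extraction(NamedTuple):
    """An Open IE relational tuple (Arg1, Pred, Arg2)."""
    arg1: str
    pred: str
    arg2: str

# The two tuples an Open IE system should extract from:
# "McCain fought hard against Obama, but finally lost the election."
expected = [
    Extraction("McCain", "fought against", "Obama"),
    Extraction("McCain", "lost", "the election"),
]
```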
Several Open IE systems have been proposed, including TextRunner [Banko et al., 2007], WOEpos and WOEparse [Wu and Weld, 2010], and StatSnowBall [Zhu et al., 2009].
The first Open IE system was TextRunner [Banko et al., 2007].
All prior Open IE systems have two significant problems: incoherent extractions and uninformative extractions. Incoherent extractions are cases where the extracted relation phrase has no meaningful interpretation.
Table 1 provides examples of incoherent extractions. Incoherent extractions make up approximately 13% of TextRunner's output.
The second problem, uninformative extractions, occurs when extractions omit critical information. For example, consider the sentence “Hamas claimed responsibility for the Gaza attack.” Previous Open IE systems return the uninformative (Hamas, claimed, responsibility) instead of (Hamas, claimed responsibility for, the Gaza attack). This type of error is caused by improper handling of light verb constructions (LVCs). An LVC is a multi-word predicate composed of a verb and a noun, with the noun carrying the semantic content of the predicate [Grefenstette and Teufel, 1995; Stevenson et al., 2004; Allerton, 2002]. Table 2 illustrates the wide range of relations expressed with LVCs, which are not captured by previous open extractors.
Table 2 provides examples of uninformative relations (left) and their completions (right). Uninformative extractions account for approximately 4% of WOEparse's output and 7% of TextRunner's output.
A method and system for extracting a relation phrase from a sentence having words is provided. In some embodiments, the system (“ReVerb”) extracts relation phrases that satisfy both a syntactic constraint and a lexical constraint, as described below.
In some embodiments, the system (“ArgLearner”) identifies the arguments of each relation phrase using classifiers that detect the left and right bounds of the arguments.
ReVerb addresses the problems of incoherent and uninformative extractions by imposing two constraints on the relation phrases it extracts: a syntactic constraint and a lexical constraint.
The syntactic constraint serves two purposes. First, it eliminates incoherent extractions, and second, it reduces uninformative extractions by capturing relation phrases expressed via light verb constructions.
The syntactic constraint requires relation phrases to match the POS tag pattern shown in Table 3.
Table 3 shows a simple part-of-speech-based regular expression that reduces the number of incoherent extractions like was central torpedo and covers relations expressed via light verb constructions like made a deal with. The pattern limits relation phrases to be either a simple verb phrase (e.g., invented), a verb phrase followed immediately by a preposition or particle (e.g., located in), or a verb phrase followed by a simple noun phrase and ending in a preposition or particle (e.g., has atomic weight of). If there are multiple possible matches in a sentence for a single verb, ReVerb chooses the longest possible match.
Finally, if the pattern matches multiple adjacent sequences, ReVerb merges them into a single relation phrase (e.g., wants to extend).
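To make the constraint concrete, here is a minimal Python sketch of the Table 3 pattern as a regular expression over Penn Treebank POS tags. The exact tag groupings below are assumptions based on the pattern's description (verb with optional particle and adverb, intervening nouns/adjectives/adverbs/pronouns/determiners, and an ending preposition, particle, or infinitive marker), not the precise tag sets of any deployed system.

```python
import re

V = r"VB[DGNPZ]?(?:\sRP)?(?:\sRB)?"           # verb, optional particle, optional adverb
W = r"(?:NN[SP]*|JJ[RS]?|RB[RS]?|PRP\$?|DT)"  # noun/adjective/adverb/pronoun/determiner
P = r"(?:IN|RP|TO)"                           # preposition/particle/infinitive marker

# The pattern V | VP | VW*P, with the longest alternative tried first.
RELATION_PATTERN = re.compile(rf"{V}(?:\s{W})*\s{P}|{V}\s{P}|{V}")

def satisfies_syntactic_constraint(pos_tags):
    """Return True if the whole POS tag sequence of a candidate
    relation phrase matches the Table 3 pattern."""
    return RELATION_PATTERN.fullmatch(" ".join(pos_tags)) is not None

# "located in" -> VBN IN: a verb phrase followed by a preposition.
assert satisfies_syntactic_constraint(["VBN", "IN"])
# "has atomic weight of" -> VBZ JJ NN IN: verb, noun phrase, ending preposition.
assert satisfies_syntactic_constraint(["VBZ", "JJ", "NN", "IN"])
# "was central torpedo" -> VBD JJ NN: no ending preposition, so rejected.
assert not satisfies_syntactic_constraint(["VBD", "JJ", "NN"])
```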
While this syntactic pattern identifies relation phrases with high precision, the extent to which it limits recall was determined by an analysis of Wu and Weld's set of 300 Web sentences. The analysis manually identified all verb-based relationships between noun phrase pairs, resulting in a set of 327 relation phrases. For each relation phrase, the analysis checked whether it satisfies the ReVerb syntactic constraint.
Table 4 illustrates that approximately 85% of the binary verbal relation phrases in this sample of Web sentences satisfy the constraint. Most of the remaining cases involve long-range dependencies between words in the sentence. Attempting to cover these harder cases using a dependency parser can actually reduce recall as well as precision.
While the syntactic constraint greatly reduces uninformative extractions, it can sometimes match relation phrases that are so specific that they have only a few possible instances, even in a Web-scale corpus. Consider the sentence “The Obama administration is offering only modest greenhouse gas reduction targets at the conference.” The POS pattern will match the phrase:
is offering only modest greenhouse gas reduction targets at (1)
Thus, there are phrases that satisfy the syntactic constraint, but are not useful relations.
To overcome this limitation, ReVerb imposes a lexical constraint to separate valid relation phrases from over-specified ones, based on the intuition that a valid relation phrase should take many distinct arguments in a large corpus.
ReVerb takes as input a POS-tagged and NP-chunked sentence and returns a set of (x, r, y) extraction triples. Given an input sentence s, ReVerb first extracts relation phrases: for each verb v in s, it finds the longest sequence of words rv such that (1) rv starts at v, (2) rv satisfies the syntactic constraint, and (3) rv satisfies the lexical constraint; adjacent or overlapping matches are merged into a single match. ReVerb then extracts arguments: for each relation phrase r, it takes as Arg1 the nearest noun phrase x to the left of r that is not a relative pronoun, WH-term, or existential “there,” and as Arg2 the nearest noun phrase y to the right of r. If such an (x, y) pair is found, the extraction (x, r, y) is returned.
This algorithm differs in three important ways from previous methods. First, the relation phrase is identified holistically rather than word-by-word. Second, potential phrases are filtered based on statistics over a large corpus (the lexical constraint). Finally, ReVerb is “relation first” rather than “arguments first,” which enables it to avoid a common error made by previous methods: confusing a noun in the relation phrase for an argument (e.g., the noun responsibility in claimed responsibility for).
ReVerb checks whether a candidate relation phrase rv satisfies the syntactic constraint by matching it against the regular expression of Table 3.
To determine whether rv satisfies the lexical constraint, ReVerb uses a large dictionary D of relation phrases that are known to take many distinct arguments. In one embodiment, D was constructed offline by finding all matches of the POS pattern in a corpus of 500 million Web sentences, normalizing each matching phrase (e.g., removing inflection, auxiliary verbs, adjectives, and adverbs), and retaining every normalized phrase that takes at least 20 distinct argument pairs, resulting in a set of approximately 1.7 million distinct normalized relation phrases. ReVerb then checks whether the normalized form of rv appears in D.
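As an illustration, here is a minimal Python sketch of this lexical check, under the assumption that the dictionary has been materialized as a mapping from normalized relation phrases to their observed argument pairs; the normalization shown is a simplified stand-in for the offline procedure just described, not its exact implementation.

```python
MIN_DISTINCT_ARG_PAIRS = 20  # the threshold described above

def normalize(tokens, pos_tags):
    """Reduce a relation phrase to its content words by dropping
    adjective, adverb, and modal tokens and lowercasing the rest."""
    kept = [t.lower() for t, tag in zip(tokens, pos_tags)
            if not tag.startswith(("JJ", "RB", "MD"))]
    return " ".join(kept)

def satisfies_lexical_constraint(tokens, pos_tags, arg_pair_index):
    """arg_pair_index: dict mapping a normalized relation phrase to the
    set of distinct (Arg1, Arg2) pairs it was seen with in the corpus."""
    pairs = arg_pair_index.get(normalize(tokens, pos_tags), set())
    return len(pairs) >= MIN_DISTINCT_ARG_PAIRS
```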
In addition to the relation phrases, the Open IE task also requires identifying the proper arguments for these relations. Previous research and an error analysis of ReVerb indicate that improper argument identification, particularly for arguments that are more complex than a simple noun phrase, is a significant source of extraction errors.
For example, from the sentence “The cost of the war against Iraq has risen above 500 billion dollars,” ReVerb's nearest-noun-phrase heuristic extracts only “Iraq” as Arg1, rather than the full argument “The cost of the war against Iraq.”
A goal of this linguistic-statistical analysis is to find the largest subset of language from which arguments can be extracted reliably and efficiently. To this end, a sample of 250 random Web sentences was first analyzed to understand the frequent argument classes, answering questions such as: How often do arguments deviate from simple noun phrases? What are the prominent argument categories? Where are the arguments located relative to the relation phrase?
Chicago was founded in 1833.
The forest in Brazil is threatened by ranching.
…five Great Lakes of North America.
Google and Apple are headquartered in Silicon Valley.
…stellar remnants.
Google will acquire YouTube, announced the New York Times.
…oil remains a threat.
Chicago, which is located in Illinois, has three million residents.
…dwarf galaxies, which are small.
Table 5 illustrates a taxonomy of arguments for binary relationships. In each sentence, the argument is bolded and the relation phrase is italicized. Multiple patterns can appear in a single argument, so percentages need not sum to 100. In the interest of space, argument structures that appear in less than 5% of extractions are omitted. Uppercase abbreviations represent noun phrase chunk tags and part-of-speech tags.
By far the most common patterns for arguments are simple noun phrases such as “Obama,” “vegetable seeds,” and “antibiotic use.” This explains the success of previous open extractors that use simple NPs. However, simple NPs account for only 65% of Arg1s and about 60% of Arg2s. This naturally dictates an upper bound on recall for systems that do not handle more complex arguments. Fortunately, there are only a handful of other prominent categories—for Arg1: prepositional phrases and lists, and for Arg2: prepositional phrases, lists, Arg2s with independent clauses, and relative clauses. These categories cover over 90% of the extractions, suggesting that handling these well will boost the precision significantly.
The analysis also explored the arguments' position in the overall sentence. It was determined that 85% of Arg1s are adjacent to the relation phrase. Nearly all of the remaining cases are due to either compound verbs (10%) or intervening relative clauses (5%). These three cases account for 99% of the relations in the sample.
An example of a compound verb appears in the sentence “Mozart was born in Salzburg, but moved to Vienna in 1781,” which results in an extraction with a non-adjacent Arg1: (Mozart, moved to, Vienna).
Arg2s almost always immediately follow the relation phrase, but their end delimiters are trickier, making this a more difficult problem. In 58% of the extractions, Arg2 extends to the end of the sentence. In 17% of the cases, Arg2 is followed by a conjunction or function word such as “if,” “while,” or “although,” and then by an independent clause or verb phrase. Harder to detect are the 9% of cases where Arg2 is directly followed by an independent clause or verb phrase. Hardest of all are the 11% where Arg2 is followed by a preposition, since prepositional phrases could also be part of Arg2; this leads to the well-studied but difficult problem of prepositional phrase attachment. For now, limited syntactic evidence (POS tagging, NP chunking) was used to identify arguments, though richer semantic knowledge for disambiguating prepositional phrases could aid this task.
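The two easiest delimiter cases lend themselves to a simple heuristic. Below is a minimal Python sketch covering only those cases, assuming POS-tagged input; the clause-marker list is an assumption drawn from the examples above, and the harder independent-clause and prepositional-attachment cases are deliberately left unhandled.

```python
CLAUSE_MARKERS = {"if", "while", "although"}

def arg2_right_bound(tokens, pos_tags, arg2_start):
    """Return the index one past Arg2's last token."""
    for i in range(arg2_start, len(tokens)):
        # Case 2: a conjunction or clause-introducing function word.
        if tokens[i].lower() in CLAUSE_MARKERS or pos_tags[i] == "CC":
            return i
    # Case 1: Arg2 extends to the end of the sentence.
    return len(tokens)
```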
The analysis of syntactic patterns reveals that the majority of arguments fit into a small number of syntactic categories. Similarly, there are common delimiters that can aid in detecting argument boundaries. This analysis led to the development of ArgLearner, a learning-based argument extraction component.
ArgLearner divides the argument identification problem into two subtasks—finding Arg1 and Arg2—and subdivides each of these again into identifying the left bound and the right bound of each argument.
ArgLearner employs three classifiers for this task: two classifiers identify the left and right bounds of Arg1, and a third identifies the right bound of Arg2. Because Arg2 almost always immediately follows the relation phrase, its left bound is taken to be the first word after the relation phrase. The classifiers rely on syntactic evidence such as POS tags and NP-chunk boundaries.
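The following is a minimal Python sketch of how three such bound classifiers could be combined into an argument extractor. The best_position() method and the classifier objects are hypothetical placeholders standing in for the trained models described above.

```python
def extract_arguments(tokens, rel_start, rel_end,
                      arg1_left_clf, arg1_right_clf, arg2_right_clf):
    """rel_start/rel_end delimit the relation phrase as [rel_start, rel_end)."""
    # Arg1 lies to the left of the relation phrase; find its right
    # bound first, then its left bound within that prefix.
    a1_right = arg1_right_clf.best_position(tokens, candidates=range(rel_start))
    a1_left = arg1_left_clf.best_position(tokens, candidates=range(a1_right + 1))
    # Arg2's left bound is fixed: it immediately follows the relation phrase.
    a2_left = rel_end
    a2_right = arg2_right_clf.best_position(
        tokens, candidates=range(a2_left + 1, len(tokens) + 1))
    return tokens[a1_left:a1_right + 1], tokens[a2_left:a2_right]
```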
The other key challenge for a learning system is training data. Unfortunately, there is no large training set available for Open IE, so a novel training set was built by adapting data available for semantic role labeling (SRL), which has been shown to be closely related to Open IE [Christensen et al., 2011b]. It was found that a set of post-processing heuristics over SRL data can readily convert it into a form suitable for Open IE training.
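As a simplified illustration of such a conversion, the sketch below maps a PropBank-style SRL frame to an Open IE tuple; the frame format and the A0/A1 pairing are illustrative assumptions, not the full set of heuristics used.

```python
def srl_frame_to_openie_tuple(frame):
    """frame: dict such as {"predicate": "acquired",
    "A0": "Google", "A1": "YouTube"}."""
    if "A0" not in frame or "A1" not in frame:
        return None  # a binary Open IE tuple needs both arguments
    return (frame["A0"], frame["predicate"], frame["A1"])

assert srl_frame_to_openie_tuple(
    {"predicate": "acquired", "A0": "Google", "A1": "YouTube"}
) == ("Google", "acquired", "YouTube")
```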
A subset of the training data adapted from the CoNLL 2005 Shared Task [Carreras and Marquez, 2005] was used. The dataset consists of 20,000 sentences and generates about 29,000 Open IE tuples. The cross-validation accuracies of the classifiers on the CoNLL data are 96% for Arg1 right bound, 92% for Arg1 left bound, and 73% for Arg2 right bound. The low accuracy for Arg2 right bound is primarily due to Arg2's more complex categories such as relative clauses and independent clauses and the difficulty associated with prepositional attachment in Arg2.
Additionally, a confidence metric was trained on a hand-labeled development set of random Web sentences, using Weka's implementation of logistic regression; the classifier's weights are then used to order the extractions.
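A minimal sketch of an analogous confidence function in Python follows, with scikit-learn's logistic regression standing in for Weka's and a hypothetical featurize() placeholder for the extraction features.

```python
from sklearn.linear_model import LogisticRegression

def train_confidence(features, labels):
    """features: 2-D array of per-extraction features; labels: 0/1 correctness."""
    clf = LogisticRegression()
    clf.fit(features, labels)
    return clf

def order_extractions(clf, extractions, featurize):
    """Sort extractions by the classifier's probability of correctness."""
    scored = [(clf.predict_proba([featurize(e)])[0, 1], e) for e in extractions]
    return [e for _, e in sorted(scored, key=lambda s: s[0], reverse=True)]
```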
The combination of ReVerb's relation phrase identification and ArgLearner's argument identification yields an Open IE extractor with substantially higher precision and recall than previous Open IE systems.
The references cited herein are hereby incorporated by reference.
From the foregoing, it will be appreciated that specific embodiments of the invention have been described herein for purposes of illustration, but that various modifications may be made without deviating from the scope of the invention. Accordingly, the invention is not limited except as by the appended claims.
This application claims the benefit of U.S. Provisional Patent Application No. 61/676,579 (Attorney Docket No. 72227-8061.US01) filed Jul. 27, 2012, entitled TEXTRUNNER, which is incorporated herein by reference in its entirety.