IBM® is a registered trademark of International Business Machines Corporation, Armonk, N.Y., U.S.A. Other names used herein may be registered trademarks, trademarks or product names of International Business Machines Corporation or other companies.
1. Field of the Invention
This invention relates to information extraction, and particularly to systems, methods and computer program products for an algebraic approach to rule-based information extraction.
2. Description of Background
Search and business intelligence applications are increasingly relying on the wealth of structured information that can be extracted from text. Information of interest to such applications ranges from mentions of entities and relationships (e.g., persons, phone numbers, addresses, etc.) to significantly more complex information such as reviews, opinions, and sentiments. Extracting structured information from unstructured text by finding instances of complex, multilevel patterns in the text can be a difficult task. This structured information serves as an input to applications such as search and business intelligence. Some known solutions include grammar-based systems that are based on cascading regular expressions. However, there are drawbacks to grammar-based systems, including: a) extraction performance degrades severely as patterns become more complex; and b) it can be difficult or impossible to express important constructs like “an instance of pattern x contained within an instance of pattern y” and “an instance of pattern x that does not satisfy pattern y”.
In addition, the area of rule-based information extraction (IE) has developed several rule languages and frameworks for building such information extraction programs (called annotators). Since extraction is viewed as a sequential operation over text, such rule languages and their implementations are predominantly based on the theory of grammars and finite-state automata. However, there is a significant issue with the scalability of such approaches, particularly as the complexity of the annotators and the size of the document collections increase. For example, execution times can be high due to the cost associated with the actual evaluation of each grammar rule. Such high CPU cost is a consequence of the fact that, for a grammar rule to be evaluated over a document, potentially every character in that document must be examined. As the number of rules increases, the associated CPU cost per document continues to grow, resulting in a large execution time over the entire collection. One approach to address this scalability problem is that of employing more hardware, distributing the document collection over a large number of processing nodes, and executing the annotators in parallel. However, it is desirable to achieve scalability by improving the efficiency of the processing operations performed by the annotator.
In a current grammar approach, the following example is considered. In the task of extracting, from blogs, informal reviews of live performances by music bands, a grammar approach can be implemented.
In a traditional rule-based IE system, the annotator described in
A translation of this specification into a cascading grammar yields the results shown in
A popular and well-understood standard for cascading grammars is the Common Pattern Specification Language (CPSL). Using such a CPSL-like language a large number of annotators over several diverse data sets can be developed. A significant drawback of the cascading grammar implementations is their enormous execution time. For example, even after extensive performance tuning, the total running time for the annotator shown in
Exemplary embodiments include a method for rule-based information extraction, the method including specifying an annotator using algebraic operators, wherein each algebraic operator describes annotations identification from text documents.
Further exemplary embodiments include a method of annotation plan optimizing in an environment where annotators are expressed as a graph of algebraic operators, the method including identifying subgraphs that exclusively contain relational operators and span extraction operators, applying topological sort to determine order in which to process the subgraphs, optimizing each subgraph independently, selecting the least cost plan for each subgraph and combining the least cost plan for each subgraph into a final plan.
Further exemplary embodiments include a computer program product for annotation plan optimizing in an environment where annotators are expressed as a graph of algebraic operators, the computer program product including instructions for causing a computer to implement a method, including identifying subgraphs that exclusively contain relational operators and span extraction operators, applying topological sort to determine order in which to process the subgraphs, optimizing each subgraph independently, selecting the least cost plan for each subgraph and combining the least cost plan for each subgraph into a final plan.
Additional exemplary embodiments include a computer program product for rule-based information extraction, the computer program product including instructions for causing a computer to implement a method, including specifying an annotator using algebraic operators, wherein each algebraic operator describes annotations identification from text documents.
System and computer program products corresponding to the above-summarized methods are also described and claimed herein.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with advantages and features, refer to the description and to the drawings.
As a result of the summarized invention, technically we have achieved a solution which provides an algebraic approach to rule-based information extraction, providing an algebra that includes span and text-specific operators based on building information extraction modules over a wide range of data-sets. In general, an algebra can express annotators that are impossible to describe with a cascading grammar.
By viewing data manipulation procedures as operators in an algebra, database query execution engines are able to consider equivalent but potentially faster execution plans for a given user query. As a result, optimization significantly speeds up annotation running time by reordering operations and eliminating redundant work. The benefits can further include clean semantics and the ability to leverage previous work on optimizing relational algebra queries. In addition, the novel operators and context that are required by IE lend themselves to novel optimizations that are shown to yield impressive improvements to response time when compared to a grammar-based approach.
The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
In exemplary embodiments, a data model and associated operator algebra for representing the text manipulation tasks that are performed by an annotator are provided. In an exemplary embodiment, the systems and methods described herein focus on a single document for information extraction tasks and thus implement intra-document operations. In exemplary embodiments, the core operations of an annotator involve the generation or examination of contiguous regions of text. Therefore, the fundamental concept in the exemplary algebra is that of a span, a region of text within a document identified by its “begin” and “end” positions. In exemplary embodiments, the user expresses annotators as a graph of algebraic operators, either by directly specifying the graph or by writing a query that the system translates to a graph. An annotation optimizer implements a set of algebraic equivalences and a cost model to analyze many alternate execution plans and chooses the most efficient plan.
In exemplary embodiments, the exemplary systems for an algebraic approach to rule-based information extraction implement an object-relational data model for representing annotations over a given document. Furthermore, a set of logical operators can be applied over this model to demonstrate that complex rule-based annotators can be expressed as compositions of these operators.
In exemplary embodiments, the systems and methods implement the exemplary algebra to extract annotations from a single document at a time, so the algebra's semantics are defined in terms of the current document being analyzed. In an exemplary implementation, the current document can be modeled as a string called doctext. In exemplary embodiments, each annotator finds regions of doctext that satisfy a set of rules and marks each region with an object called a span. In an exemplary embodiment, a span is an ordered pair <begin,end> that denotes the region of doctext from position begin to position end. In addition, the text of the span's region can be included in the notation. For example, if doctext were the string “Document text”, then <9, 12>, “text” would denote the range of characters from position 9 to position 12 of the document.
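The span notation can be sketched in Python as follows. This is only a minimal illustration, assuming 0-based character positions with an inclusive end position (the convention under which <9, 12> covers “text” in the example); the Span class and the text_of helper are hypothetical names and are not part of the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Span:
    """A region of doctext, identified by its begin and end positions."""
    begin: int
    end: int  # assumed inclusive and 0-based, to match the <9, 12> example


def text_of(span: Span, doctext: str) -> str:
    """Return the text covered by a span (the end position is inclusive)."""
    return doctext[span.begin:span.end + 1]


doctext = "Document text"
span = Span(9, 12)
covered = text_of(span, doctext)  # the text of the span's region
```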
In exemplary embodiments, the algebra operates over a simple relational data model with three data types: span, tuple, and relation. In the data model, a tuple is a finite sequence of w spans s1, . . . , sw; w is the width of the tuple. A relation is a multiset of tuples, with the constraint that every tuple in the relation must be of the same width. In exemplary embodiments, each operator in the algebra takes zero or more relations as input and produces a single output relation.
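The three data types can be illustrated with a short Python sketch. It assumes that a span is represented as a plain (begin, end) pair and that a multiset is represented with collections.Counter; the make_relation helper is a hypothetical name introduced only for this illustration.

```python
from collections import Counter
from typing import List, Tuple

Span = Tuple[int, int]        # (begin, end) pair
SpanTuple = Tuple[Span, ...]  # a finite sequence of spans; its length is the width


def make_relation(tuples: List[SpanTuple]) -> Counter:
    """Build a relation (a multiset of tuples) and enforce a single width."""
    widths = {len(t) for t in tuples}
    if len(widths) > 1:
        raise ValueError(f"tuples have mixed widths: {sorted(widths)}")
    return Counter(tuples)  # Counter keeps duplicate tuples with their counts


relation = make_relation([
    ((0, 3), (5, 8)),
    ((0, 3), (5, 8)),      # duplicates are allowed in a multiset
    ((10, 12), (14, 20)),
])
```

Representing the relation as a Counter keeps duplicate tuples, which a plain set would silently drop.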
In exemplary embodiments, the algebra runs over a local annotation database including the current document and a set of annotation relations that represent pre-computed annotations. As part of the process of loading a document, the system 100 computes a set of useful general-purpose annotations like Sentence, Paragraph, Noun, and Verb and inserts these annotations into the local annotation database. Since a local annotation database only deals with a single document, it generally fits entirely in main memory. In exemplary embodiments, a collection of local annotation databases forms a global annotation database. To annotate all the documents in a global annotation database, the execution framework applies an algebra expression to every local annotation database separately. In exemplary embodiments, execution can proceed as follows:
To run multiple annotators in a single pass, step 2 in the above process can be repeated multiple times per document.
In exemplary embodiments, the set of operators in the algebra can be categorized broadly into relational operators, span extraction operators, and span aggregation operators as shown in Table 1. Since the data model is a minimal extension to the relational model, all of the standard relational operators (select, project, join, etc.) apply without any change.
In exemplary embodiments, span extraction operators identify segments of text that match a particular input pattern and produce spans corresponding to each such text segment. Since text pattern matching is at the core of almost any information extraction task, these extraction operators perform a significant number of operations for the algebra. The general form of the extraction operators is now described.
In exemplary embodiments, for a function f: Pattern,String→{Span} that maps a string to a set of pattern matches within the string, the corresponding span extraction operator εf(Pattern) returns the maximal set of tuples {T1, . . . , Tn}, where each Ti consists of a span from f(Pattern, doctext( )).
In exemplary embodiments, the algebra incorporates two kinds of span extraction operators: standard regular expression matcher (εre) and dictionary matcher (εd). Given a regular expression r, εre(r) identifies all non-overlapping matches when r is evaluated from left to right over the text represented by s. The output of εre(r) is the set of spans corresponding to these matches. Given a dictionary, dict, including a set of words/phrases, the dictionary matcher εd(dict) produces an output span for each occurrence of some entry in dict within the current document text. A separate dictionary operator is included because most regular expression engines only produce non-overlapping matches, whereas the dictionary operator produces all possible matches for each dictionary entry. In addition, regular expressions operate at the character level, whereas dictionaries operate at the level of tokens (i.e., words and phrases). Finally, dictionaries automatically enforce the semantics of word boundaries, i.e., dictionary matches only include complete words and phrases. For example, as shown in
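The difference between the two extraction operators can be sketched in Python. This is an illustrative approximation rather than the described implementation: it assumes half-open (begin, end) character offsets, a simple \w+ tokenizer, and case-insensitive dictionary matching, and the names extract_regex and extract_dict are hypothetical.

```python
import re
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (begin, end) offsets, end exclusive in this sketch


def extract_regex(pattern: str, doctext: str) -> List[Span]:
    """Non-overlapping matches, found left to right (re.finditer semantics)."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, doctext)]


def extract_dict(entries: Set[str], doctext: str) -> List[Span]:
    """All token-aligned occurrences of any dictionary entry (may overlap)."""
    # Assumed tokenizer: maximal runs of word characters, lower-cased.
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", doctext)]
    phrases = {tuple(e.lower().split()) for e in entries if e.split()}
    spans = []
    for i in range(len(tokens)):
        for phrase in phrases:
            window = tokens[i:i + len(phrase)]
            if tuple(t[0] for t in window) == phrase:
                spans.append((window[0][1], window[-1][2]))
    return sorted(spans)


doctext = "The New York Times reviewed the Yorkshire show"
regex_spans = extract_regex(r"(?i)new york|york times", doctext)
dict_spans = extract_dict({"new york", "york times"}, doctext)
york_only = extract_dict({"york"}, doctext)
```

Here the regular expression reports only “New York”, because the overlapping “York Times” starts inside an earlier match, while the dictionary matcher reports both. The single-entry dictionary matches the word “York” but not the prefix of “Yorkshire”, reflecting the word-boundary semantics.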
In exemplary embodiments, span aggregation operators take in a set of input spans and produce a set of output spans by performing certain aggregation operations over their entire input. In exemplary embodiments, the input and output of every span aggregation operator is a single-column relation of the form R(a), where R.a is of type Span. In exemplary embodiments, the systems and methods described herein can include: containment consolidation, overlap consolidation, and block.
In exemplary embodiments, consolidate operators are implemented when multiple extraction patterns are used to identify the same concept; two different patterns often produce matches over the same or overlapping pieces of text. To resolve such “duplicate” matches, two kinds of consolidation operations are implemented: containment consolidation and overlap consolidation.
In exemplary embodiments, containment consolidation (Ωc) is used to discard annotation spans that are wholly contained within other annotation spans. Specifically, given a set of input spans, Ωc produces as output only those spans in the input that are not contained within another. In exemplary embodiments, containment consolidation can be expressed using relational operators by applying the correct span predicate. Given a relation R(a), Ωc(R) can be computed as:
R1(x) = πx(R(a as x) ⋈[x ⊂ y] R(a as y))
Ωc(R) = R(a) − R1(x as a)
Here ⋈[x ⊂ y] denotes a join on the span predicate that span x is contained within span y.
Since containment consolidation is a common operation in several extraction tasks, a first class operator is retained in the algebra.
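A minimal Python sketch of containment consolidation follows. It assumes spans are (begin, end) pairs, treats the input as a set of distinct spans rather than a multiset, and computes the set of contained spans with a nested loop that plays the role of the join; the function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def contains(outer: Span, inner: Span) -> bool:
    """True if inner lies wholly within outer and the two spans differ."""
    return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]


def consolidate_containment(spans: List[Span]) -> List[Span]:
    """Keep only the spans that are not contained within another input span."""
    distinct = set(spans)
    # Spans x for which some other span y contains x (the joined relation).
    contained = {x for x in distinct for y in distinct if contains(y, x)}
    # Set difference: the input minus the contained spans.
    return sorted(distinct - contained)


kept = consolidate_containment([(0, 10), (2, 5), (8, 12), (9, 11)])
```

The nested loop is quadratic in the number of spans, which is usually tolerable here because per-document relations are small.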
In exemplary embodiments, overlap consolidation (Ωo) is used to produce new spans by merging overlapping spans. Given a set of spans as input, Ωo produces a set of non-overlapping spans generated by repeatedly merging all possible spans in the input. In exemplary embodiments, an expression for Ωo in terms of relational operators requires a recursive fixed-point computation.
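Although a purely relational expression needs a fixed-point computation, the same result can be obtained procedurally with a sort followed by a single sweep. The sketch below is one possible approach, not necessarily the one used by the described system; it assumes half-open (begin, end) offsets, so spans that merely touch are not merged, and the function name is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) offsets, end exclusive in this sketch


def consolidate_overlap(spans: List[Span]) -> List[Span]:
    """Merge overlapping spans until no two output spans overlap."""
    merged: List[Span] = []
    for begin, end in sorted(spans):
        if merged and begin < merged[-1][1]:
            # The span overlaps the most recent merged span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


result = consolidate_overlap([(0, 5), (3, 8), (10, 12), (7, 9)])
```

Because the spans are processed in order of their begin offsets, each span only needs to be compared with the most recently merged span.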
In exemplary embodiments, the block operator (β) identifies a large span of text enclosing a set of input spans such that no two successive spans are more than a specified distance apart. In exemplary embodiments, the systems and methods described herein identify regions of text where input spans occur with enough regularity. For example, as shown in
In exemplary embodiments, the single-column relation R(a), where R.a is of type Span, is the input to a block operator β with distance constraint d and count constraint n. A span (b,e) is produced as output by this block operator if there exists a set of input spans ρ((b,e)) ⊂ R.a such that:
In exemplary embodiments, the output of the block operator β(n,d,R) is the set of all such spans that satisfy conditions B1..B5. Condition B5 ensures that every span output by the block operator begins and ends with one of the input spans.
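The general behavior of the block operator can be approximated with the Python sketch below. Because the individual conditions are not reproduced here, the sketch rests on stated assumptions: spans are half-open (begin, end) offsets, the distance between successive spans is measured from the end of the block so far to the begin of the next span, a block needs at least n input spans, and only maximal blocks are emitted. The function name is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) offsets, end exclusive in this sketch


def block(n: int, d: int, spans: List[Span]) -> List[Span]:
    """Emit maximal runs of at least n spans whose successive gaps are <= d."""
    blocks: List[Span] = []
    run_begin = run_end = 0
    count = 0
    for begin, end in sorted(spans):
        if count and begin - run_end <= d:
            # Close enough to the current run: extend it.
            run_end = max(run_end, end)
            count += 1
        else:
            if count >= n:
                blocks.append((run_begin, run_end))
            run_begin, run_end, count = begin, end, 1
    if count >= n:
        blocks.append((run_begin, run_end))
    return blocks


spans = [(0, 4), (6, 10), (12, 15), (40, 44), (46, 50)]
blocks_of_three = block(3, 5, spans)
blocks_of_two = block(2, 5, spans)
```

Each emitted block starts at the begin offset of its first input span and stops at the end offset of one of its input spans, which is in line with the statement that every output span begins and ends with an input span.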
In exemplary embodiments, an algebraic approach is applied to information extraction so that a principled annotation optimizer, similar to database query optimizers, can be developed. Since the data model and algebra build upon the standard relational model, strategies for generating alternative plans in the relational model known in the art (e.g., pushing down selections, re-ordering joins, etc.) are directly applicable. However, significantly more transformations can be performed by exploiting the semantics of the text-specific operators.
In exemplary embodiments, the systems and methods described herein implement three design guidelines: 1) document-at-a-time processing; 2) CPU-intensive text operations; and 3) Span properties. In keeping with the per-document nature of information extraction, the algebra operates on a single document at a time. As a result, the individual per-document relations that the operators described herein produce and consume are generally quite small and are often completely empty. The core text processing operations of the algebra are the span extraction operators εre and εd. In the absence of any index structures, these operators require the examination of each character or token in a document, resulting in significant CPU cost that often dominates the overall running time of an annotator. A span is merely a special instance of the general mathematical object called an interval. Therefore, spans obey all of the natural properties of interval algebra and these properties yield powerful transformation rules.
Techniques for transforming annotator execution plans are now described in accordance with exemplary embodiments. In exemplary embodiments, it is advantageous to reduce the effect of CPU-intensive text operations by exploiting document-at-a-time processing and span properties.
In exemplary embodiments, dictionary matching involves tokenizing the current document's text and looking for all occurrences of the set of words and phrases listed in a specified dictionary. However, dictionaries are also fairly powerful information extraction primitives and therefore used quite often. For example,
In exemplary embodiments, conditional evaluation (CE) avoids evaluating an entire subquery over a particular document if it is possible to infer that that document is not going to yield any output annotations. For instance, consider the last step in the BandReview annotator in which ConcertInstance and ReviewBlock are joined together. If the subquery corresponding to ConcertInstance is evaluated first on each document, the evaluation of BandReview can be avoided on documents in which there are no instances of the former. In exemplary embodiments, the entire computation proceeds one document at a time, providing a natural granularity at which to implement such conditional evaluation. The symmetric transformation of evaluating ReviewBlock and conditionally evaluating ConcertInstance is also possible.
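Conditional evaluation can be sketched in Python as a join that evaluates one input first and skips the other when the first is empty. The extractors below are hypothetical stand-ins for the ConcertInstance and ReviewBlock subqueries, and the join predicate (the right span starts at or after the end of the left span) is an assumption chosen only to make the example concrete.

```python
import re
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # (begin, end) offsets, end exclusive in this sketch
Extractor = Callable[[str], List[Span]]


def join_conditionally(first: Extractor, second: Extractor,
                       doctext: str) -> List[Tuple[Span, Span]]:
    """Evaluate first; run second and the join only if first is non-empty."""
    left = first(doctext)
    if not left:
        return []  # no output is possible, so second is never evaluated
    right = second(doctext)
    # Placeholder join predicate (an assumption, see the lead-in).
    return [(l, r) for l in left for r in right if r[0] >= l[1]]


calls: Dict[str, int] = {"second": 0}


def concert_instances(doctext: str) -> List[Span]:
    # Hypothetical stand-in for the ConcertInstance subquery.
    return [(m.start(), m.end()) for m in re.finditer(r"concert", doctext)]


def review_blocks(doctext: str) -> List[Span]:
    # Hypothetical stand-in for the ReviewBlock subquery; counts its calls.
    calls["second"] += 1
    return [(m.start(), m.end()) for m in re.finditer(r"amazing", doctext)]


skipped = join_conditionally(concert_instances, review_blocks,
                             "An amazing dinner.")
joined = join_conditionally(concert_instances, review_blocks,
                            "The concert was amazing.")
```

Swapping the two extractor arguments gives the symmetric transformation; which order is cheaper depends on the relative cost and selectivity of the two subqueries.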
In exemplary embodiments, both SDM and CE attempt to either reduce or eliminate work at the document level. In contrast, restricted span extraction (RSE) operates at the sub-document level. In exemplary embodiments, RSE restricts the evaluation of the expensive span extraction operators to some carefully chosen region(s) of text (as opposed to the entire document).
To illustrate this approach, Plan A from
In exemplary embodiments, RSE optimization is a generalization of the technique illustrated by the above example. RSE is applicable for expressions, such as the one shown in
In exemplary embodiments, the systems and methods described herein implement extraction operators that accept bindings for all but one of the unbound variables in a given join predicate p. The RSE extraction operators compute the pattern matches that satisfy p for a given set of bindings, and they do so without examining the entire document. The RSE implementation supports bindings for all the predicates listed in Table 2.
As described herein, dictionary matches enforce word boundaries, i.e., only match complete words or phrases. When restricting the execution of the dictionary extractor to a particular window of text, it is possible that spurious matches are returned at the two end-points of the window.
In exemplary embodiments, the design of an RSE regular expression extractor takes into account the left-to-right matching semantics of the regular expression operator. Regular expression matches are evaluated in left-to-right order over the entire document. By evaluating a regular expression over an arbitrary window within this text, it may not be possible to precisely compute the set of matches in this window that would have been produced by evaluating over the entire document. Therefore, whenever εre is involved, the join span bindings are used only to compute the end-offset, and the regular expression is always evaluated from the very beginning of the document.
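The following Python sketch illustrates why a regular expression cannot simply be evaluated over an arbitrary window, and one way to restrict the scan by end-offset only. It assumes half-open (begin, end) offsets and keeps matches that begin before the end-offset; the function names are hypothetical, and the early exit relies on re.finditer producing matches lazily from left to right.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) offsets, end exclusive in this sketch


def full_matches(pattern: str, doctext: str) -> List[Span]:
    """All non-overlapping matches over the given text."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, doctext)]


def restricted_matches(pattern: str, doctext: str,
                       end_offset: int) -> List[Span]:
    """Scan from the start of the document and stop at end_offset."""
    spans = []
    for m in re.finditer(pattern, doctext):  # lazy, left to right
        if m.start() >= end_offset:
            break  # later matches cannot fall in the region of interest
        spans.append((m.start(), m.end()))
    return spans


doctext = "xaaa"
pattern = r"aa"
full = full_matches(pattern, doctext)

# Evaluating over a window that starts at offset 2 yields a spurious match.
window_begin = 2
windowed = [(b + window_begin, e + window_begin)
            for b, e in full_matches(pattern, doctext[window_begin:])]

restricted = restricted_matches(pattern, doctext, end_offset=3)
```

In this example the window-based evaluation reports a match at offsets 2 to 4 that the full-document evaluation never produces, whereas the scan from the beginning reproduces the full-document match exactly.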
A high-level design of an annotation plan optimizer based on the algebra and optimization is now discussed. Given an operator graph for an annotator in terms of the algebra, the first step is to identify subgraphs that exclusively contain the operators σ, π, ×, εd, and εre (i.e., a Select-Project-Join (SPJ) block extended to include the span extraction operators). In the case of the band review annotator, there are 40 such subgraphs as shown in
Within each subgraph, a space of possible plans is independently enumerated by: 1) all possible join orders including ones that involve cross-products; 2) standard transformations such as pushing down selections and projections to the extent possible, and 3) additional plans generated by the application of the CE and RSE techniques as described herein.
In exemplary embodiments, each subgraph would be treated independently; the least cost plan would be picked for each, and combined to produce the final plan. However, with the SDM optimization, the cost of evaluating dictionaries is now amortized across subgraphs and must be carefully accounted for. In exemplary embodiments, sharing of dictionary computations is possible only between dictionary operators that are completely evaluated over a document, not when an optimization such as RSE has been applied to restrict the evaluation to a smaller span. In addition, the cost of executing dictionary matches can include two parts: a certain fixed cost associated with tokenization and a variable cost associated with the actual matches produced by each operator. Given these considerations, an approach similar to the one used to handle interesting orders is adopted. For each subgraph B, two optimal plans along with their associated costs are computed: 1) A plan under the assumption that at least one dictionary is evaluated over the entire document, thus enabling amortization of the tokenization cost; and 2) Another plan under the assumption that no dictionary is evaluated over the entire document. Once this pair of plans has been computed, a global pass over all the blocks is used to pick one of the two plans for each block and build the overall execution plan.
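The global pass over the pairs of plans can be sketched in Python under a deliberately simple cost model. The sketch assumes that the tokenization cost is a single fixed amount paid once per document if any block evaluates a dictionary over the entire document, and that the cost recorded for each "with" plan excludes that fixed amount. The BlockPlans type, the choose_plans function, and all cost figures are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class BlockPlans(NamedTuple):
    name: str
    cost_with_full_dict: float     # best plan with a full-document dictionary
    cost_without_full_dict: float  # best plan with no full-document dictionary


def choose_plans(blocks: List[BlockPlans],
                 tokenization_cost: float) -> Tuple[float, Dict[str, str]]:
    """Pick one plan per block, paying the tokenization cost at most once."""
    # Option 1: no block evaluates a dictionary over the whole document.
    none_total = sum(b.cost_without_full_dict for b in blocks)
    none_choice = {b.name: "without" for b in blocks}

    # Option 2: pay tokenization once; each block takes its cheaper plan.
    shared_choice = {
        b.name: ("with" if b.cost_with_full_dict < b.cost_without_full_dict
                 else "without")
        for b in blocks
    }
    shared_total = tokenization_cost + sum(
        min(b.cost_with_full_dict, b.cost_without_full_dict) for b in blocks)

    if "with" in shared_choice.values() and shared_total < none_total:
        return shared_total, shared_choice
    return none_total, none_choice


blocks = [BlockPlans("A", 4.0, 10.0),
          BlockPlans("B", 3.0, 5.0),
          BlockPlans("C", 9.0, 2.0)]
cheap_tokenization = choose_plans(blocks, tokenization_cost=6.0)
costly_tokenization = choose_plans(blocks, tokenization_cost=10.0)
```

With a tokenization cost of 6, sharing pays off and blocks A and B use full-document dictionaries; with a cost of 10, every block falls back to its plan without a full-document dictionary.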
The goal of the experimental study is two-fold: 1) validate the performance benefits obtained by using an algebraic approach to information extraction; and 2) understand and contrast the different optimization techniques as discussed herein.
The document corpus used in the experiments is a collection of 4.5 million web logs (5.1 GB of data) crawled from http://www.blogspot.com. Two annotators that identify informal reviews from these blogs (a) BandReview as shown in
The first set of experiments compares the running times of the grammar-based implementation and an embodiment of the algebraic approach to rule-based information extraction as described herein. The following implementations are executed: 1) GRAMMAR: A hand-optimized grammar-based implementation that has been tuned separately for both BandReview and RestaurantReview; 2) ALGEBRABaseline: Baseline for the algebraic approach obtained by directly implementing the plan from GRAMMAR into the operator algebra; and 3) ALGEBRAOptimized: Plan obtained by applying the optimization algorithm presented herein over ALGEBRABaseline.
The execution times for BandReview and RestaurantReview are shown in
Despite the fact that ALGEBRABaseline is a direct implementation of GRAMMAR, there is still a significant improvement in running time, which is explained by the fact that every rule in a cascading grammar is evaluated over the complete text of the document. On the other hand, operations in an algebra work only over the input annotations, and consequently the running time depends primarily on the size of the input annotations. The exact same information extraction task (BandReview) that took about eight hours in an optimized grammar-based implementation now runs in just under 30 minutes.
To understand the individual transformations and study their interactions with each other, multiple versions of BandReview are run. Each version applies a restricted combination of transformations, and seven combinations were executed. Four combinations were obtained directly by applying each transformation, discussed herein, individually. Two more were obtained by combining the traditional transformations with each of SDM and RSE, and the last one was obtained by applying all transformations.
While the exemplary algebraic approach addresses problems of scalability, the approach has another significant advantage over cascading grammars. The following example illustrates a common problem in complex information extraction tasks, namely, overlapping annotations.
The annotations overlap because: (a) individual rules are run independently, and (b) rules may make mistakes (in the sense that the author of that rule did not intend to capture a particular text snippet even though the snippet turned out to be a match). In a grammar-based implementation, overlapping annotations must necessarily be disambiguated, i.e., “Pipe” must either be an Instrument or a part of BandMember and a similar choice must be made for “Hammond”. To make these choices, one of several ad hoc disambiguation strategies is employed. Two popular strategies are: (a) retain the annotation that starts earlier (e.g., BandMember for John Pipe), and (b) a priori, impose global tie-breaking rules (e.g., BandMember dominates Instrument). Using (a), the choice in Snippet 2 is unclear since both annotations start at the beginning of Hammond. Using (b) and assuming BandMember dominates, Snippet 2 is not identified by the cascading grammar in
To appreciate the true effects of such disambiguation, two experiments were run using the rules from
In exemplary embodiments, in terms of hardware architecture, as shown in
The processor 105 is a hardware device for executing software, particularly that stored in memory 110. The processor 105 can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer 101, a semiconductor based microprocessor (in the form of a microchip or chip set), a macroprocessor, or generally any device for executing software instructions.
The memory 110 can include any one or combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), tape, compact disc read only memory (CD-ROM), disk, diskette, cartridge, cassette or the like, etc.). Moreover, the memory 110 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory 110 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor 105.
The software in memory 110 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example of
The algebraic rule-based information extraction methods described herein may be in the form of a source program, executable program (object code), script, or any other entity comprising a set of instructions to be performed. When the methods are in the form of a source program, the program needs to be translated via a compiler, assembler, interpreter, or the like, which may or may not be included within the memory 110, so as to operate properly in connection with the OS 111. Furthermore, the algebraic rule-based information extraction methods can be written in an object oriented programming language, which has classes of data and methods, or a procedural programming language, which has routines, subroutines, and/or functions.
In exemplary embodiments, a conventional keyboard 150 and mouse 155 can be coupled to the input/output controller 135. Other I/O devices 140, 145 may include input and output devices, for example but not limited to a printer, a scanner, a microphone, and the like. Finally, the I/O devices 140, 145 may further include devices that communicate both inputs and outputs, for instance but not limited to, a network interface card (NIC) or modulator/demodulator (for accessing other files, devices, systems, or a network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, and the like. The system 100 can further include a display controller 125 coupled to a display 130. In exemplary embodiments, the system 100 can further include a network interface 160 for coupling to a network 165. The network 165 can be an IP-based network for communication between the computer 101 and any external server, client and the like via a broadband connection. The network 165 transmits and receives data between the computer 101 and external systems. In exemplary embodiments, network 165 can be a managed IP network administered by a service provider. The network 165 may be implemented in a wireless fashion, e.g., using wireless protocols and technologies, such as WiFi, WiMax, etc. The network 165 can also be a packet-switched network such as a local area network, wide area network, metropolitan area network, Internet network, or other similar type of network environment. The network 165 may be a fixed wireless network, a wireless local area network (LAN), a wireless wide area network (WAN), a personal area network (PAN), a virtual private network (VPN), intranet or other suitable network system and includes equipment for receiving and transmitting signals.
If the computer 101 is a PC, workstation, intelligent device or the like, the software in the memory 110 may further include a basic input output system (BIOS) (omitted for simplicity). The BIOS is a set of essential software routines that initialize and test hardware at startup, start the OS 111, and support the transfer of data among the hardware devices. The BIOS is stored in ROM so that the BIOS can be executed when the computer 101 is activated.
When the computer 101 is in operation, the processor 105 is configured to execute software stored within the memory 110, to communicate data to and from the memory 110, and to generally control operations of the computer 101 pursuant to the software. The algebraic rule-based information extraction methods described herein and the OS 111, in whole or in part, but typically the latter, are read by the processor 105, perhaps buffered within the processor 105, and then executed.
When the systems and methods described herein are implemented in software, as is shown in
In exemplary embodiments, where the algebraic rule-based information extraction methods are implemented in hardware, the algebraic rule-based information extraction methods described herein can be implemented with any or a combination of the following technologies, which are each well known in the art: a discrete logic circuit(s) having logic gates for implementing logic functions upon data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array(s) (PGA), a field programmable gate array (FPGA), etc.
As described above, the embodiments of the invention may be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. Embodiments of the invention may also be embodied in the form of computer program code containing instructions embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other computer-readable storage medium, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. The present invention can also be embodied in the form of computer program code, for example, whether stored in a storage medium, loaded into and/or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits.
While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims. Moreover, the use of the terms first, second, etc. does not denote any order or importance, but rather the terms first, second, etc. are used to distinguish one element from another.