The present invention relates generally to extraction of information from document corpora. Computer-implemented methods are provided for producing a searchable representation of information contained in a corpus of documents. Information extraction systems and computer program products implementing such methods are also provided.
The publication of scientific papers, articles and other technical documents has increased exponentially over the last few decades. These documents provide a vast repository of technological knowledge, calling for systems which can make this knowledge discoverable and usable to further advance technology. Extracting knowledge from large document collections is an important strategy in numerous technical applications, such as materials science, the oil and gas industry, and medical applications such as disease analysis and treatment development.
Knowledge graphs are well-known data structures for representing information derived from a large corpus of documents. A knowledge graph essentially comprises nodes, which represent particular entities about which associated information is stored, interconnected by edges which represent defined relations between entities. To generate a knowledge graph for a document corpus, machine learning models trained to implement NLP (Natural Language Processing) tasks are applied to the documents to extract entities and relations from the text. Entities here may be document items, such as paragraphs, images, tables, and so on, as well as language items such as words or phrases defining particular things, or types or properties of things, contained in those document items. Language items and their relationships can be identified using various NLP techniques. For example, NER (Named Entity Recognition) models can be trained to identify words/phrases defining particular entities and annotate these by type, such as polymer classes, polymer names, material properties, and so on. NLP relation models can analyze text to identify relations between two entities X and Y, such as X “is a type of” Y, or X “is a property of” Y, where text in quotation marks defines the relation.
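By way of a hedged illustration, entities and relations of the kind described above can be viewed as subject-relation-object triples which map directly to nodes and edges of a knowledge graph. The entity names below are hypothetical examples, not the output of any particular NER model:

```python
# Illustrative sketch: extracted entities and relations represented as
# subject-relation-object triples. All names here are hypothetical.
triples = [
    ("polyethylene", "is a type of", "polymer"),
    ("melting point", "is a property of", "polyethylene"),
]

# A minimal adjacency view of the resulting knowledge graph fragment:
# each subject node maps to its outgoing (relation, object) edges.
graph = {}
for subj, rel, obj in triples:
    graph.setdefault(subj, []).append((rel, obj))

print(graph["polyethylene"])  # [('is a type of', 'polymer')]
```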
“Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale”, Peter Staar et al., KDD 2018: 774-782, describes a system for identifying particular types of document items (titles, subtitles, text paragraphs, figures, etc.) in documents to produce an annotated list of the items contained in each document in a corpus. “Corpus Processing Service: A Knowledge Graph Platform to perform deep data exploration on corpora”, Peter Staar et al., Authorea, Sep. 16, 2020, describes a system which uses NLP techniques to process the individual document items in these lists to identify entities/relations and generate a knowledge graph for a corpus. The resulting knowledge graph can be loaded to a database for querying and searching the graph.
The ultimate goal of such information extraction systems is to extract all information relevant to the domain of a document corpus. Different technical domains require different annotations, and hence require models trained to identify the particular entities and relations relevant to a given domain. NLP models for identifying relations are typically based on closeness of entities in the original text. In generic models, closeness is often the only criterion. Some models also use grammar analysis, but such analysis is inherently local to individual sentences.
Extracting all relevant information from a document corpus is an extremely challenging task. In view of the wealth of information contained in these corpora, improved information extraction techniques would be highly desirable.
One aspect of the present invention provides a computer-implemented method for producing a searchable representation of information contained in a corpus of documents. The method comprises: for each document, generating a document structure graph indicating a structural hierarchy of document items in that document, based on a predefined hierarchy of predetermined item-types, and linking document items to a parent document item in the structural hierarchy; generating a knowledge graph including first nodes, representing document items in the corpus, and second nodes, representing language items identified in those document items, the first and second nodes being interconnected by edges each representing a defined relation between the items represented by the nodes interconnected by that edge; storing the knowledge graph in a knowledge graph database; and producing the searchable representation by traversing edges of the graph in response to input search queries.
Another aspect of the invention provides an information extraction system for producing a searchable representation of information contained in a corpus of documents each comprising a succession of document items of predetermined item-types defined for the corpus. The system comprises: memory for storing the documents, document graph logic adapted to generate a document structure graph as described above for each document, a knowledge graph generator adapted to generate a knowledge graph including edges representing parent-child relations as described above, and a knowledge graph database for storing the knowledge graph to produce the searchable representation of information contained in the corpus, wherein the knowledge graph database is adapted to search the knowledge graph by traversing edges of the graph, in response to input search queries.
A further aspect of the invention provides a computer program product comprising a computer readable storage medium embodying program instructions, executable by a computing system, to cause the computing system to implement a method described above for producing a searchable representation of information contained in a document corpus.
Embodiments of the invention will be described in more detail below, by way of illustrative and non-limiting example, with reference to the accompanying drawings.
By providing parent-child edges in the knowledge graph based on the document structure graphs for documents, methods embodying the invention assimilate the structures of the documents themselves in the overall knowledge representation. Information which is implicit in the hierarchical structure of a document as a whole can be embedded in the knowledge graph and extracted via search operations. The structural layout of a document, such as titles, section headers, and sub-headers for sub-sections at various nested levels, expresses valuable information that may not otherwise be expressed in the text of individual document items. For example, a key term may be stated in a section header and not repeated in paragraphs under that header, or information in an introductory statement may relate to all items in a subsequent list. Methods embodying the invention can capture such additional information encoded in the structural hierarchy of each document. The resulting knowledge graph thus enables extraction of more information from a corpus than can be derived from individual document items in the documents. This constitutes a significant advance in knowledge extraction systems, offering improved search processes, better search results, and better solutions to the real-life problems supported by these searches.
It will be appreciated that edges representing parent-child relations in the knowledge graph indicate which document items are subordinate/superior to which other items in the document structure. By traversing these edges, information implicit in this hierarchical relationship can be extracted in search operations. As explained further below, parent-child edges can be exploited in user-constructed search queries, and/or predefined template search queries, to extract this information and provide more comprehensive search results. Moreover, parent-child relations can be exploited by NLP processes to deduce new relations between language items in related document items. This results in new edges in the knowledge graph between nodes representing these items, further supplementing the body of knowledge represented in the graph. By way of example, it may be deduced that a term mentioned in a paragraph with a parent section header is a particular example of a more generic term appearing in that header. In general, relations expressly or implicitly encoded in the knowledge graph produced by embodiments of the invention are not limited to proximity of terms in individual documents items or by grammatical analysis of individual sentences.
Knowledge graphs generated by methods embodying the invention may further include edges, representing ancestral relations, between nodes representing document items in each document and nodes representing at least one ancestor of their respective parent document items in the structural hierarchy for that document. Such knowledge graphs can therefore include direct edges between a document item node and nodes representing the parent-of-its-parent document item, the grandparent of its parent document item, and so on up to a desired hierarchy level in the document structure graph. These direct ancestral edges offer more flexible and efficient search operations. For example, multiple ancestral edges may be traversed in parallel to retrieve information associated with multiple ancestors or descendants of a given node. In addition, NLP relation models may be applied to deduce relations between language items in document items and language items in ancestors of those document items in the structural hierarchy of a document, resulting in additional edges explicitly encoding these relations in the knowledge graph.
Advantageously, knowledge graphs produced by embodiments of the invention can also include edges, representing neighbor relations, between nodes representing document items in each document and nodes representing their respective succeeding document items in the succession of document items for that document. These edges allow potentially relevant information to be retrieved from neighboring document items, such as neighboring paragraphs, which often contain related information.
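To illustrate how the parent-child, ancestral and neighbor relations described above might be materialized as graph edges, the following sketch derives all three edge types from a hypothetical parsed item list. The field names and relation labels are illustrative assumptions, not part of the specification:

```python
# Hedged sketch: deriving parent-child, ancestral, and neighbor edges from
# a document structure graph. The "parent" field stands in for the parent
# links of the DSG described above; indices are positions in reading order.
items = [
    {"index": 0, "type": "title", "parent": None},
    {"index": 1, "type": "section-level-1", "parent": 0},
    {"index": 2, "type": "paragraph", "parent": 1},
    {"index": 3, "type": "paragraph", "parent": 1},
]

edges = []
for item in items:
    if item["parent"] is not None:
        # Parent-child edge to the immediate parent in the hierarchy.
        edges.append((item["index"], "child-of", item["parent"]))
        # Ancestral edges: direct edges to every ancestor of the parent.
        ancestor = items[item["parent"]]["parent"]
        while ancestor is not None:
            edges.append((item["index"], "descendant-of", ancestor))
            ancestor = items[ancestor]["parent"]
    # Neighbor edge to the succeeding document item, if any.
    if item["index"] + 1 < len(items):
        edges.append((item["index"], "precedes", item["index"] + 1))
```

The direct "descendant-of" edges are what allow a search to reach the grandparent of a paragraph in a single traversal step rather than two.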
Particularly preferred methods include providing a graphical user interface, for display by a user computer, for input of search queries to the knowledge graph database. These methods can provide a mechanism in the interface for selecting traversal of edges representing parent-child relations between document items in search operations for input search queries. Corresponding mechanisms can be included for selecting traversal of edges representing ancestral and/or neighbor relations where provided. In addition, or as an alternative, these methods can provide predefined template search queries using the various structure-derived edges in the interface, where each template, or “search workflow”, defines a particular type of search query which can be further customized to particular user requirements in the interface.
Methods embodying the invention may include a preprocessing step in which each document in a source document corpus is first processed to parse the document into the succession of document items, annotated with their item-types as predefined for the corpus. However, document structure graphs can be generated from any corpus of documents which have been processed to identify the succession of document items in each document. In preferred embodiments, each document structure graph is generated in a particularly efficient manner via a recursive process. This process identifies a parent document item for each document item, sequentially in order of succession in the document, based on the relative positions in the predefined item-type hierarchy of that item's item-type and the item-types of items earlier in the succession. This and other features and advantages of methods embodying the invention are described in more detail below.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Embodiments to be described can be performed as computer-implemented methods for generating a searchable representation of information contained in a document corpus. Such methods may be implemented by a computing system comprising one or more general- or special-purpose computers, each of which may comprise one or more (real or virtual) machines, providing functionality for implementing operations described herein. Steps of methods embodying the invention may be implemented by program instructions, e.g. program modules, implemented by a processing apparatus of the system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. The computing system may be implemented in a distributed computing environment, such as a cloud computing environment, where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Bus 4 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer 1 typically includes a variety of computer readable media. Such media may be any available media that is accessible by computer 1 including volatile and non-volatile media, and removable and non-removable media. For example, system memory 3 can include computer readable media in the form of volatile memory, such as random access memory (RAM) 5 and/or cache memory 6. Computer 1 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 7 can be provided for reading from and writing to a non-removable, non-volatile magnetic medium (commonly called a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can also be provided. In such instances, each can be connected to bus 4 by one or more data media interfaces.
Memory 3 may include at least one program product having one or more program modules that are configured to carry out functions of embodiments of the invention. By way of example, program/utility 8, having a set (at least one) of program modules 9, may be stored in memory 3, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data, or some combination thereof, may include an implementation of a networking environment. Program modules 9 generally carry out the functions and/or methodologies of embodiments of the invention as described herein.
Computer 1 may also communicate with: one or more external devices 10 such as a keyboard, a pointing device, a display 11, etc.; one or more devices that enable a user to interact with computer 1; and/or any devices (e.g., network card, modem, etc.) that enable computer 1 to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces 12. Also, computer 1 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 13. As depicted, network adapter 13 communicates with the other components of computer 1 via bus 4. Computer 1 may also communicate with additional processing apparatus 14, such as a GPU (graphics processing unit) or FPGA, for implementing embodiments of the invention. It should be understood that although not shown, other hardware and/or software components could be used in conjunction with computer 1. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.
Logic modules 24 through 27 interface with memory 21 which stores various data structures used in operation of system 20. These data structures include a parsed document corpus 31, an item-label hierarchy (HDI) 32 which defines a hierarchy of document item-types, a set of document structure graphs 33 produced by DSG generator 25 in operation, and KG data 34 which comprises data defining the nodes, edges and associated metadata for a KG generated by KG generator 26. System 20 further comprises a knowledge graph database (KGDB) 35 comprising a database management system (DBMS) 36 and associated memory 37 for storing a KG which is assembled and loaded to the database for searching.
In general, functionality of logic modules 24 through 27 may be implemented by software (e.g., program modules) or hardware or a combination thereof. Functionality described may be allocated differently between system modules in other embodiments, and functionality of one or more modules may be combined. The various components of system 20 may be provided in one or more computers of a computing system. For example, all modules may be provided in a computer 1 at which GUI 30 is displayed to a user, or modules may be provided in one or more computers/servers to which user computers can connect via a network (which may comprise one or more component networks and/or internetworks, including the Internet). System memory 21 may be implemented by one or more memory/storage components associated with one or more computers of system 20.
Document corpus 23 may be local or remote from system 20 and may comprise documents from one or more information sources spanning the domain(s) of interest for a particular application. Documents in this corpus may be distributed over a plurality of information sources, e.g. databases and/or websites, which may be accessed dynamically by the system via a network, or the corpus 23 may be precompiled for system operation and stored in system memory 21.
In KGDB 35, the database management system 36 typically comprises a set of program modules providing functionality for storing and accessing the KG data in database memory 37. Such management systems can be implemented in generally known manner and the particular implementation is orthogonal to the operations described herein. Various data structure formats, of generally known type, can be used for storing the KG in memory 37, and the stored data structures may correspond directly or indirectly to features of the graph. In particular, KGDB 35 may employ native graph storage, which is specifically designed around the structure of the graph, or non-native storage such as a relational or object-oriented database structure. It suffices to understand that, in a knowledge graph database, a knowledge graph is defined at some level of the database model.
Steps 42 and 43 represent the knowledge graph generation process in KG generator 26. In step 42, the KG generator applies NLP models 28 to extract entities and relations, which will correspond to nodes and edges respectively of the knowledge graph, from documents 31. NLP models applied here may use generally known techniques for identifying and labelling language items as named entities (NEs), and for deducing relations between these language entities by locally analyzing text within individual document items. However, as indicated in brackets in step 42, preferred embodiments can apply “structure-aware” NLP models here. A structure-aware NLP model can exploit document structure as defined by the document structure graphs to derive additional relations between language entities in different document items. This is explained further below.
In step 43, the KG generator 26 generates the knowledge graph elements by storing data defining all nodes and edges of the graph as KG data 34 in system memory 21. Nodes are defined here for respective document items in corpus 31 and also language items identified in those document items in step 42. Edges interconnecting language item nodes are defined for all relations identified in step 42, along with edges connecting document items nodes to nodes representing the language items in each document item. In addition, the KG generator uses the document structure graph (DSG) 33 for each document to define edges, representing parent-child relations, between nodes representing document items in each document and nodes representing their respective parent document items in the structural hierarchy for that document. Various other nodes/edges may be included in the KG as described for particular embodiments below. The resulting KG data, defining all nodes and edges with their associated metadata (such as labels, properties, and/or any other data associated with graph elements) is stored as KG data 34 in system memory 21. In step 44, the resulting knowledge graph is loaded to KGDB 35 and stored in KG memory 37, providing a searchable representation of information contained in the document corpus 23.
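For illustration only, a search over such a graph might traverse parent-child edges upward from a matching document item to gather its structural context, such as the headers under which a matched paragraph appears. The node indices and texts below are hypothetical:

```python
# Illustrative search sketch (an assumption, not the specified search
# implementation): follow parent-child edges from a hit to the root to
# retrieve contextualizing section headers.
parents = {2: 1, 1: 0}  # child index -> parent index (hypothetical DSG)
texts = {
    0: "Polymers",
    1: "Thermal properties",
    2: "PE melts between 115 and 135 C.",
}

def context_for(node):
    """Collect ancestor texts by traversing parent edges to the root."""
    chain = []
    while node in parents:
        node = parents[node]
        chain.append(texts[node])
    return chain

print(context_for(2))  # ['Thermal properties', 'Polymers']
```

A query hitting node 2 can thus report that the paragraph sits under "Thermal properties" within "Polymers", information absent from the paragraph text itself.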
The I/F manager 27 of this embodiment provides GUI 30 to assist users with KG searches. This module provides tools for construction of search queries in the GUI, receives input search queries for submission to KGDB 35, and controls presentation of search results in the GUI. In a KG search operation, the I/F manager receives an input search query, as indicated at step 45 of
Steps of the KG generation process are described in more detail in the following. Document analysis step 40 can be implemented using generally known feature extraction techniques for documents in a given format, such as PDF (Portable Document Format) or bitmap images. For example, interpretation of PDF printing commands can identify text characters and groupings for PDF documents generated from computer inputs such as Microsoft Word or LaTeX applications. OCR (Optical Character Recognition) techniques can also identify text characters in PDF documents produced by scanning, with morphological dilation applied to identify character strings and lines of text. Location of features such as horizontal/vertical lines and spaces, and vertical/horizontal feature alignment, can be used to identify boundaries of items such as paragraphs, pictures, tables, etc., and recognition of text features such as section numbers, capitals and bold type can assist with header and sub-header identification. Such feature extraction techniques can be used to parse each document into a succession of document items in the order of presentation in the textual flow of the document, and label each item with an item-type according to a predefined set of item-type labels for a corpus. Examples of such item-types comprise: document title; subtitle; document author; document abstract; author affiliation; chapter; section heading; subsection heading; paragraph; table; picture; caption; keyword; citation; table-of-contents; list item; sub-list item; table column-header; table row-header; table cell; list in table cell; code; form; formula; footnote, and so on. All or a subset of these or other predefined item labels may be used as appropriate for a given document corpus. Labels for subsection headings can specify an associated level to accommodate multiple levels of progressively subordinate subheadings.
Levels can be similarly specified in labels for sub-list items, sub-sub-list items, and so on. In a preferred embodiment, document analyzer 24 is implemented using the Corpus Conversion Service (CCS) system described in the reference above. The parsed documents produced by this system are formatted as labeled lists of document items, in reading order of a document, defined in JSON (JavaScript Object Notation) format.
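A parsed document of this kind might look as follows. This is a hedged sketch of a reading-order item list; the field names are illustrative and do not reproduce the exact CCS JSON schema:

```python
# Hypothetical parsed document: a labeled list of document items in
# reading order, of the kind a CCS-style converter might emit.
import json

parsed_doc = [
    {"index": 0, "label": "title", "text": "A Study of Polymer Blends"},
    {"index": 1, "label": "section-level-1", "text": "1. Introduction"},
    {"index": 2, "label": "paragraph", "text": "Polymer blends combine two or more polymers."},
    {"index": 3, "label": "table", "text": ""},
    {"index": 4, "label": "caption", "text": "Table 1: Blend compositions"},
]

print(json.dumps(parsed_doc[1], indent=2))
```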
Generation of the DSGs in step 41 of
Item-type Hierarchy HDI:
“supertitle”: 1000, # this label does not exist (used for initializing the DSG generation process detailed below)
“title”: 200,
“subtitle”, “author”: 190, # Independent items under the title
“affiliation”: 185,
“chapter”: 180,
“section-level-1”: 160,
“section-level-2”: 150,
“section-level-3”: 140,
“section-level-4”: 130,
“section-level-5”: 120,
“paragraph”, “table-of-contents”, “abstract”, “keyword”, “citation”: 100, # Separate items under headings
“list item”: 90, “sub-list item”: 89, “sub-sub-list item”: 88,
“code”, “caption”, “form”, “formula”: 80, # Items that can occur inside normal text
“table”, “picture”: 70, # Subordinate to their captions if present
“column-header”, “row-header”: 60 # Inside tables
“table cell”: 50
“list in table cell”: 40
“footnote”: 10, # As it can also belong to table elements
“nothing”: 0 # Just an initialization value for the DSG generation process below.
CCS labels such as “page-footer” and “page-header” for items which are outside the normal text flow of a document are omitted from the above hierarchy and from the succession of document items used in the DSG generation process below.
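For reference, the hierarchy HDI above can be written as a simple mapping from item label to hierarchy level, where a larger number denotes a position closer to the document root; this rendering is illustrative and abbreviates the comments given above.

```python
# Item-type hierarchy HDI as a label-to-level mapping (higher = closer to root).
HDI = {
    "supertitle": 1000,                                 # initialization label only
    "title": 200,
    "subtitle": 190, "author": 190,                     # independent items under the title
    "affiliation": 185,
    "chapter": 180,
    "section-level-1": 160, "section-level-2": 150,
    "section-level-3": 140, "section-level-4": 130, "section-level-5": 120,
    "paragraph": 100, "table-of-contents": 100, "abstract": 100,
    "keyword": 100, "citation": 100,                    # separate items under headings
    "list item": 90, "sub-list item": 89, "sub-sub-list item": 88,
    "code": 80, "caption": 80, "form": 80, "formula": 80,
    "table": 70, "picture": 70,                         # subordinate to captions if present
    "column-header": 60, "row-header": 60,              # inside tables
    "table cell": 50,
    "list in table cell": 40,
    "footnote": 10,                                     # may also belong to table elements
    "nothing": 0,                                       # initialization value only
}
```

Comparing two labels' levels then directly answers whether one item-type is hierarchically superior to another.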
In step 50 of
current_index=0;
previous_index=−1;
previous_label=“nothing” (corresponding to level 0 in hierarchy HDI above);
previous_parent_label=“supertitle” (corresponding to level 1000 in hierarchy HDI above);
previous_parent_index=−1.
An “index” here is the index number of a document item in the succession order of the parsed document, and can be indicated by an explicit index field in the metadata for document items. After initialization, the structure-linker process progresses through the succession of document items for a document, selecting each item in turn. For each selected item, the process identifies the index, denoted by “parent_index”, of its parent document item in the structural hierarchy of that document. For example, the parent index of a normal text paragraph should be the index of the nearest preceding heading (i.e., a document item with a label “section-level-x” for some number x), and the parent index of an item with label “section-level-x”, where x>1, should be the nearest preceding higher heading, i.e., a document item with label “section-level-y” and y<x.
Considering first the steps in column A of
In response to decision “No” at step 52, operation proceeds to column B of
In response to decision “No” at step 55, operation proceeds to column C. In step 58 here, the DSG generator checks whether the hierarchy level of the current item is lower than that of the previous item's parent (e.g., when proceeding from a paragraph in a level-2 section to a level-3 heading). If so, the current and previous items have the same parent item. The current item's parent index is set accordingly in step 59, variables are updated in step 60, and operation returns to re-entry point R.
In response to decision “No” at step 58, operation proceeds to
In response to decision “No” at step 62, the DSG generator loops through steps 65 through 67 back to step 62, in each loop comparing the hierarchy level of the current item with that of a progressively earlier ancestor (parent of a parent) of the previous item. At decision step 66 of any loop here, if the hierarchy level of the current item is less than that of the current ancestor, then that ancestor is the current item's parent. The parent index is set accordingly in step 68, parameters are updated in step 69, and operation reverts to
The structure-linker process defined above thus identifies a parent document item for each document item, sequentially in order of the document item succession, based on the relative location in the hierarchy HDI of the item-type of that item and the item-types of items earlier in the succession. The DSG for a document is fully defined by the parent indexes assigned to document items by this structure-linker process. It can be seen that all the parent indexes are identified by this process without going back linearly through the document. This provides a highly efficient DSG generation process with linear complexity: the process goes through the document items only once, in the original linear order, with a constant maximum amount of processing per item.
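The single-pass behaviour of the structure-linker can be sketched as follows. This is a minimal stack-based formulation that is equivalent to the ancestor walk described above; the reduced hierarchy, function name and variable names are illustrative only, not the actual implementation.

```python
# Reduced item-type hierarchy for the sketch (higher level = closer to root).
HDI = {"supertitle": 1000, "title": 200, "section-level-1": 160,
       "section-level-2": 150, "paragraph": 100}

def link_structure(labels, hdi):
    """labels: item-type labels in reading order; returns one parent index per item."""
    parent_index = []
    # Stack of (index, level) pairs for the open ancestors; index -1 is the
    # virtual "supertitle" root used for initialization.
    stack = [(-1, hdi["supertitle"])]
    for index, label in enumerate(labels):
        level = hdi[label]
        # Pop earlier items that are not strictly higher in the hierarchy;
        # each item is pushed and popped at most once, so the pass is linear.
        while stack[-1][1] <= level:
            stack.pop()
        parent_index.append(stack[-1][0])
        stack.append((index, level))
    return parent_index

print(link_structure(
    ["title", "section-level-1", "paragraph", "section-level-2", "paragraph"],
    HDI))  # [-1, 0, 1, 1, 3]
```

Here the level-2 heading (index 3) correctly receives the level-1 heading (index 1) as parent, and the final paragraph attaches to the level-2 heading, matching the decision logic of steps 52 through 69.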
The extraction of entities from document items in step 42 of
Known NLP relation techniques may then be applied to identify relations between items. Examples here include: proximity analysis; regular expressions; grammar analysis; LSTM networks; CRFs, CNNs, and RNNs; classification systems based on transformer networks such as BERT (see, e.g., “Simple BERT Models for Relation Extraction and Semantic Role Labeling”, Peng Shi et al., arXiv:1904.05255v1 (2019)); transformer networks with additional head layers for relations between any pair of entities (see, e.g., “BERT-Based Multi-Head Selection for Joint Entity-Relation Extraction”, Weipeng Huang et al., arXiv:1908.05908v2 (2019) and “Joint Learning with Pre-trained Transformer on Named Entity Recognition and Relation Extraction Tasks for Clinical Analytics”, Miao Chen et al., ClinicalNLP@EMNLP 2020, pp. 234-242); and various other NER systems which can identify and label relations between language items in text.
In some embodiments, relations between language entities may be derived by analysis of individual document items, without considering overall document structure, as in the Corpus Processing Service (CPS) system referenced above. In step 43 of
To create edges for parent-child relations in the knowledge graph, the KG generator 26 uses the DSGs to insert an edge between each document item node and the node for its parent document item, as indicated by the parent_index derived by the structure-linker in this embodiment. The structure-linker code can be embedded as a task-type for dataflows here, and an additional “link-properties” task can be provided to create the parent-child edges in the KG.
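The mapping from parent indexes to KG edges can be sketched as follows; the graph representation here is a simple list of (parent, child) pairs, purely for illustration of the link-properties task, and is not the actual KGDB edge format.

```python
def add_parent_child_edges(parent_index):
    """parent_index[i] is the index of item i's parent item, or -1 for the root.

    Returns one (parent, child) structure edge per non-root document item.
    """
    edges = []
    for child, parent in enumerate(parent_index):
        if parent >= 0:  # skip the virtual root; it has no node in the KG
            edges.append((parent, child))
    return edges

# Parent indexes as produced by the structure-linker for a small document:
edges = add_parent_child_edges([-1, 0, 1, 1, 3])
print(edges)  # [(0, 1), (1, 2), (1, 3), (3, 4)]
```

Each resulting pair becomes one parent-child edge between the corresponding document item nodes in the KG.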
The simple example above demonstrates how incorporation of document structure via parent-child edges can significantly increase the amount of information extracted from a document corpus and hence overall information encoded in the KG. Since KGDB 35 searches the KG by traversing edges of the graph, inclusion of parent-child edges allows this additional information to be readily extracted in search operations. The system thus extracts information implicit in a document structure which a human would naturally assimilate when reading the document, and encodes this in the KG. As a result, finding structural context of sentence- or paragraph-level search results is directly possible in the KG. The structural information also allows co-reference resolution. For example, “Permian Basin” may be mentioned in a header, but only referred to as “the basin” in the underlying section text. Embodiments of the invention thus offer more efficient search operations, more accurate and comprehensive search results, and improved operation of the technical applications exploiting these search results.
Additional structure-based edges can be included in the KGs generated by preferred embodiments. For example, in step 43 of
The I/F manager 27 of preferred embodiments provides a mechanism for selecting traversal of edges representing parent-child relations (and ancestral/neighbor relations where provided) between items in search operations for input search queries.
Various other mechanisms can of course be envisaged for selecting traversal of structure edges in user-constructed search queries. As a further example, draggable icons may be provided for different types of nodes, and traversal of different types of structure edges, in workflows constructed by the user in a workflow construction pane of the GUI.
For more complex search tasks, the I/F manager of preferred embodiments provides, in GUI 30, predefined search templates (search workflows), each defining a particular type of search query involving traversal of a structure edge. These structure-traversing workflows can be constructed from basic component operations such as search, edge traversal, filter, intersection, and union.
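A hedged sketch of composing such a structure-traversing workflow from component operations follows; the toy graph model and operation names are hypothetical, chosen only to show how a search step can be chained with a parent-edge traversal step.

```python
def search(graph, predicate):
    """Component operation: return the set of node ids matching a predicate."""
    return {n for n, data in graph["nodes"].items() if predicate(data)}

def traverse_parents(graph, nodes):
    """Component operation: follow parent-child structure edges upward."""
    return {graph["parent"][n] for n in nodes if n in graph["parent"]}

# Toy KG fragment: a paragraph node with a structure edge to its heading.
graph = {
    "nodes": {0: {"label": "section-level-2", "text": "3.5 Great Egret"},
              1: {"label": "paragraph", "text": "Yellow bill, black feet."}},
    "parent": {1: 0},
}

# Workflow: find paragraphs mentioning "bill", then traverse the
# parent-child edge to recover each hit's section heading as context.
hits = search(graph, lambda d: "bill" in d["text"].lower())
context = traverse_parents(graph, hits)
```

Chaining further components (filter, intersection, union) in the same style yields the predefined templates described above.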
The
Where structure-aware NLP models are employed in KG generator 26, these can be applied to derive additional relations between entities in structurally-related document items. The KG generator then includes additional edges explicitly encoding these relations in the KG. For example, edges may be added for the new relations indicated by dotted lines in
In a first implementation, a structure-aware NLP task for “animal-property-value”, applied to the level-2 paragraph in
In a second implementation, a structure-aware NLP task may take the complete structural sequence “3 Heron∥3.5 Great Egret∥Large and slim wading bird, yellow-bill, black feet . . . ” (where “∥” denotes a separation indicator) as a single paragraph that is passed to basic NLP. What happens then depends on the type of basic NLP. If there are three different base NLP models for animals, properties, and values, the overall task will obtain the animal classes “heron”, “bird” and “wading bird”, the properties “bill color” and “foot color”, and the values “yellow” and “black” (and possibly the vaguer properties “large” and “slim”). The overall task may then piece these elements together (e.g., using proximity/grammatical criteria as for basic relation models) into the same triples as in the first implementation above. A more powerful basic NLP model can be trained to find relations directly. If such a model was trained (or at least pretrained) on normal sentences (i.e., without pre-pended headings), the overall task may transform the headers to be closer to normal sentences, e.g., it may strip off the header numbers and input the following to the NLP model: “Heron, great egret, large and slim . . . ”. Alternatively, or in addition, NLP fine-tuning including the header structures can be performed.
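Construction of such a structural input sequence can be sketched as follows, assuming ancestor headings are pre-pended with a separation indicator and leading section numbers are stripped to bring the input closer to a normal sentence; the function names and the exact stripping rule are illustrative assumptions.

```python
import re

def structural_sequence(headings, paragraph, sep=" \u2016 "):
    """Join ancestor headings and a paragraph with a separation indicator."""
    return sep.join(headings + [paragraph])

def strip_header_numbers(text):
    """Remove leading section numbers such as '3' or '3.5' from each segment."""
    return re.sub(r"(^|\u2016 )\s*\d+(\.\d+)*\s+", r"\1", text)

seq = structural_sequence(
    ["3 Heron", "3.5 Great Egret"],
    "Large and slim wading bird, yellow bill, black feet")
print(strip_header_numbers(seq))
```

The stripped sequence "Heron ‖ Great Egret ‖ Large and slim wading bird, . . ." can then be passed to a relation model trained on normal sentences, or used as-is for fine-tuning with header structures included.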
It will be seen that the embodiments described offer significant improvements in information extraction systems. However, numerous changes and modifications can be made to the exemplary embodiments described. For example, I/F manager 27 may provide various other features in GUI 30, such as views representing topology of all, or selected parts, of a KG to show the structure-derived edges. Relation edges in the KG may be weighted in various ways, e.g., language-entity nodes may be weighted according to confidence values output by an NER system. Item-label hierarchies HDI can be defined in any convenient manner to indicate relative hierarchical positions of the item labels, and various other processes can be envisaged for generating the DSGs. Also, while the
Steps of flow diagrams may be implemented in a different order to that shown and some steps may be performed in parallel where appropriate. In general, where features are described herein with reference to a method embodying the invention, corresponding features may be provided in a system/computer program product embodying the invention, and vice versa.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.