Implementing interface for rapid ground truth binning

Information

  • Patent Grant
  • Patent Number
    11,144,337
  • Date Filed
    Tuesday, November 6, 2018
  • Date Issued
    Tuesday, October 12, 2021
  • CPC
  • Field of Search
    • CPC
    • G06N20/00
    • G06N7/005
    • G06N5/02
    • G06F17/27
    • G06F17/2775
    • G06F16/287
    • G06F16/288
    • G06F16/313
    • G06F16/3334
    • G06F16/9038
    • G06F17/274
    • G06F17/277
    • G06F16/245
    • G06F16/24564
    • G06F16/3322
    • G06F16/3323
    • G06F17/272
    • G06F17/2765
    • G06F17/2785
    • G06F16/93
    • G06F17/278
    • G06F16/3346
    • G06F16/951
    • G06F17/2715
    • G06F17/28
    • G06F16/248
    • G06F17/18
    • G06F9/453
    • G06F40/295
    • G06F3/018
    • G06F40/169
    • G06F16/26
    • G06F16/355
    • G06F3/017
    • G06F3/04842
    • B65H2511/20
  • International Classifications
    • G06F3/048
    • G06F9/451
    • G06F3/01
    • G06N5/02
    • G06F16/93
    • G06F40/295
    • G06F40/169
  • Term Extension
    160
Abstract
A method, system and computer program product are provided for implementing an interface for rapid ground truth binning. A set of documents is received wherein each document has at least one entity in a set of entities. A user interface is provided for each received document allowing a user to view passages and select options related to confirming or denying an equivalence between the entity in the received document and an output document entity bin including the entity. Responsive to the user utilizing the user interface and confirming the equivalence, the received document is combined with the output document entity bin with reference to the entity.
Description
FIELD OF THE INVENTION

The present invention relates generally to the data processing field, and more particularly, relates to a method, system and computer program product for implementing an interface for rapid ground truth binning.


DESCRIPTION OF THE RELATED ART

Identifying coreferent entity bins is tedious and time consuming for several reasons: (1) there are potentially a large number of entity bins to review, (2) key information that serves as a good indicator of coreference is not easy to spot when manually reading through a bin's document collection, and (3) keeping track of all the input entity bins that are potentially coreferent can be overwhelming for a human annotator.


A need exists for a mechanism for rapidly developing ground-truth sets needed to train a statistical cross-document coreference model operating over entity bin objects, while reducing complexity of the overall task.


SUMMARY OF THE INVENTION

Principal aspects of the present invention are to provide a method, system and computer program product for implementing an interface for rapid ground truth binning. Other important aspects of the present invention are to provide such method, system and computer program product substantially without negative effects and that overcome many of the disadvantages of prior art arrangements.


In brief, a method, system and computer program product are provided for implementing an interface for rapid ground truth binning. A set of documents is received wherein each document has at least one entity in a set of entities. A user interface is provided for each received document allowing a user to view passages and select options related to confirming or denying an equivalence between the entity in the received document and an output document entity bin including the entity. Responsive to the user utilizing the user interface and confirming the equivalence, the received document is combined with the output document entity bin with reference to the entity.


In accordance with features of the invention, the user interface enables rapidly developing ground-truth sets needed to train a statistical cross-document coreference model operating over entity bin objects, while reducing complexity of the overall task.


In accordance with features of the invention, the user interface displays an entity bin name of the received document and the titles of the documents in a document collection of output document entity bins.


In accordance with features of the invention, the user interface aggregates all related entities found in the document collection of the output document entity bin and displays the related entities found in the document collection above the document text.


In accordance with features of the invention, the user interface highlights the spans in the text passages from both the received document and the document collection of the output document entity bin from which the relationship was identified.


In accordance with features of the invention, the user interface provides visual guides to help the user identify key information and, therefore, speed up the decision-making process. The user interface separately organizes and displays an entity bin of the input document and the output document entity bins.


In accordance with features of the invention, the user interface provides a search capability allowing the user to filter the output document entity bins for certain terms.





BRIEF DESCRIPTION OF THE DRAWINGS

The present invention together with the above and other objects and advantages may best be understood from the following detailed description of the preferred embodiments of the invention illustrated in the drawings, wherein:



FIG. 1 provides a block diagram of an example computer system for implementing an interface for rapid ground truth binning in accordance with preferred embodiments;



FIGS. 2 and 3 are respective flow charts illustrating example system operations to implement an interface for rapid ground truth binning in the example computer system of FIG. 1 in accordance with preferred embodiments; and



FIG. 4 is a block diagram illustrating a computer program product in accordance with the preferred embodiment.





DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In the following detailed description of embodiments of the invention, reference is made to the accompanying drawings, which illustrate example embodiments by which the invention may be practiced. It is to be understood that other embodiments may be utilized, and structural changes may be made without departing from the scope of the invention.


The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.


In accordance with features of the invention, a method and system are provided for implementing an interface for rapid ground truth binning. Ground truth binning refers to information provided by direct observation rather than information provided by inference. Human curators are given a set of naive entity bins, or bins with a single document in their collection, produced from a query. The task is to merge together entity bins referring to the same real-world entity. The invention provides a user interface for rapidly developing the ground-truth sets needed to train a statistical cross-document coreference model operating over entity bin objects.


Having reference now to the drawings, in FIG. 1, there is shown an example system embodying the present invention generally designated by the reference character 100 for implementing an interface for rapid ground truth binning in accordance with preferred embodiments. System 100 includes a computer system 102 including one or more processors 104 or general-purpose programmable central processing units (CPUs) 104. As shown, computer system 102 includes a single CPU 104; however, system 102 can include multiple processors 104 typical of a relatively large system.


Computer system 102 includes a system memory 106 including an operating system 108, a ground truth binning control logic 110 and a cross-document co-reference algorithm 111 in accordance with preferred embodiments. System memory 106 is a random-access semiconductor memory for storing data, including programs. System memory 106 comprises, for example, a dynamic random-access memory (DRAM), a synchronous dynamic random-access memory (SDRAM), a current double data rate (DDRx) SDRAM, non-volatile memory, optical storage, and other storage devices.


Computer system 102 includes a storage 112 including a statistical cross-document co-reference model 114 in accordance with preferred embodiments and a network interface 116. Computer system 102 includes an I/O interface 118 for transferring data to and from computer system components including CPU 104, memory 106 including the operating system 108, ground truth binning control logic 110, cross-document co-reference algorithm 111, storage 112 including statistical cross-document co-reference model 114, and network interface 116, as well as a network 120 and a client system and user interface 122.


In accordance with features of the invention, the ground truth binning control logic 110 enables manually creating ground truth sets to train the cross-document co-reference or disambiguation algorithm 111. The ground truth binning control logic 110 presents a user interface with information in a way that significantly reduces the effort and time to create the ground truth sets, allowing for rapid adaptation of the algorithm to new domains. The input to the ground truth binning control logic 110 is a set of entity bins E (E1, E2, . . . , En). Each entity bin in the set E (E1, E2, . . . , En) is a pair consisting of an entity name and a collection of documents that contain a reference to that entity. The task of the human annotator is to identify which of the input bins are references to the same real-world entity and merge such co-reference bins together, yielding a single entity bin per real-world entity. The ground truth binning control logic 110 supports that task by presenting user interface information in a way that allows the user to rapidly identify to which entity bin a specific document belongs. The ground truth binning control logic 110 implements a process by which the human annotator can quickly iterate over all input entity bins. This reduces the overall complexity of the task to making simple one-to-one comparisons with the generated user interface.
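By way of illustration only, the entity bin described above, a pair of an entity name and a document collection, and the merging of two co-reference bins can be sketched as follows. The names Document, EntityBin and merge_bins are hypothetical and do not appear in the embodiments, and keeping the name of the target bin after a merge is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """A document with a title and the text that references an entity."""
    title: str
    text: str


@dataclass
class EntityBin:
    """An entity bin: an entity name paired with a collection of documents."""
    name: str
    documents: List[Document] = field(default_factory=list)


def merge_bins(target: EntityBin, source: EntityBin) -> EntityBin:
    """Merge a co-reference source bin into the target bin.

    Assumption of this sketch: the merged bin keeps the target's name.
    """
    target.documents.extend(source.documents)
    return target


# Two naive bins, each holding a single document (illustrative data only).
bin_a = EntityBin("Michael Jordan", [Document("Bulls win title", "Michael Jordan scored 45 points.")])
bin_b = EntityBin("Mike Jordan", [Document("Jordan retires", "Mike Jordan announced his retirement.")])

merged = merge_bins(bin_a, bin_b)
print(merged.name, [d.title for d in merged.documents])
```

Merging in place keeps a single output bin per real-world entity, which is the end state the annotator's task is meant to reach.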


Referring to FIGS. 2 and 3, there are shown respective example system operations generally designated by the reference characters 200, and 300 of computer system 102 of FIG. 1, for implementing an interface for rapid ground truth binning in accordance with preferred embodiments.


Referring to FIG. 2, system operations 200 for binning documents start at a block 201 with loading documents as indicated at a block 202, such as receiving a set of entities E (E1, E2, . . . , En) and a set of documents D (D1, D2, . . . , Dm) wherein each document Di has at least one entity Ei in the set of entities E (E1, E2, . . . , En). As indicated at a block 204, a first unbinned document Di is selected. As indicated at a block 206, a user interface (UI) is provided allowing a user to view passages and select options related to confirming or denying an equivalence between the entity Ej in the document Di and the document Dk. As indicated at a decision block 208, a user entry made using the user interface is checked to determine whether the equivalence is confirmed or denied. When a document match is not identified, a new bin is created with the current document as indicated at a block 210. Otherwise, responsive to the user utilizing the UI and confirming the equivalence, the input bin of the document Di is combined or merged with the document Dk in the existing output bin with reference to the entity Ej, and the related entities contained in this output bin are aggregated as indicated at a block 212. As indicated at a decision block 214, checking for more documents Di to bin is performed. When no more documents Di to bin are identified, the operations end as indicated at a block 212, where the process is completed when all input entity bins have been assigned to an output bin. Otherwise, when more documents Di to bin are identified, the operations return to block 204 to continue binning the remaining documents.
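The binning loop of FIG. 2 can be sketched, under simplifying assumptions, as shown below. Each document is reduced to a plain string, and the hypothetical callback choose_bin stands in for the decision the annotator makes in the user interface at decision block 208; neither simplification is part of the described embodiments.

```python
from typing import Callable, List, Optional


def bin_documents(
    documents: List[str],
    choose_bin: Callable[[str, List[List[str]]], Optional[int]],
) -> List[List[str]]:
    """Assign every document to an output bin, one document at a time.

    choose_bin stands in for the annotator: it returns the index of the
    matching output bin, or None when every candidate match is denied.
    """
    output_bins: List[List[str]] = []
    for doc in documents:                     # select the next unbinned document
        match = choose_bin(doc, output_bins)  # confirm or deny an equivalence
        if match is None:
            output_bins.append([doc])         # no match: create a new bin
        else:
            output_bins[match].append(doc)    # match: merge into the existing bin
    return output_bins


def same_surname(doc: str, bins: List[List[str]]) -> Optional[int]:
    """Toy stand-in for the annotator: match on the last word of the name."""
    for index, existing in enumerate(bins):
        if existing[0].split()[-1] == doc.split()[-1]:
            return index
    return None


result = bin_documents(["Michael Jordan", "Mike Jordan", "Larry Bird"], same_surname)
print(result)
```

Because each pass compares one unbinned document against the existing output bins, the annotator only ever makes simple one-to-one decisions, and the loop ends once every input has been assigned.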


Referring to FIG. 3, system operations 300 for the entity match task start at a block 301 with searching output bins for certain terms as indicated at a block 302. At block 302, content of the retrieval step is presented in the UI with visual guides that help the user identify key information and speed up the decision-making process, including: (1) displaying the bin's entity name and the titles of the documents in the bin's document collection; (2) aggregating all related entities found in the document collection and displaying the related entities above the document text; and (3) highlighting the spans in the text passages from which the relationship was identified.
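As a minimal sketch of the operations at block 302, the following example filters output bins by a search term, aggregates the related entities of a bin's document collection, and marks the matching spans in a passage. The dictionary layout, the function names, the sample data and the use of "**" as a highlight marker are all assumptions made for illustration; the embodiments do not prescribe how the highlighting is rendered.

```python
import re
from typing import Dict, List


def filter_bins(bins: List[Dict], term: str) -> List[Dict]:
    """Keep only bins with at least one document whose text contains the term."""
    needle = term.lower()
    return [
        b for b in bins
        if any(needle in doc["text"].lower() for doc in b["documents"])
    ]


def aggregate_related_entities(documents: List[Dict]) -> List[str]:
    """Collect distinct related entities across documents, in first-seen order."""
    seen: List[str] = []
    for doc in documents:
        for entity in doc["related_entities"]:
            if entity not in seen:
                seen.append(entity)
    return seen


def highlight_spans(text: str, terms: List[str], marker: str = "**") -> str:
    """Wrap each case-insensitive occurrence of a term in the marker."""
    if not terms:
        return text
    # Longest terms first so a longer name wins over one of its substrings.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", text)


# Illustrative output bin; the contents are made up for this sketch.
output_bin = {
    "name": "Michael Jordan",
    "documents": [
        {"title": "Bulls win sixth title",
         "text": "Jordan led the Chicago Bulls to another NBA title.",
         "related_entities": ["Chicago Bulls", "NBA"]},
        {"title": "College years",
         "text": "He played for North Carolina before the NBA.",
         "related_entities": ["North Carolina", "NBA"]},
    ],
}

related = aggregate_related_entities(output_bin["documents"])
marked = highlight_spans(output_bin["documents"][0]["text"], related)
print(related)
print(marked)
```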


As indicated at a block 304, the user iterates through the bins: the user can discard an output bin as a candidate match for the current document, allowing the user to quickly parse through the list of output bins. As indicated at a block 306, all entities in all output bins that match related entities of the current document are highlighted. The output bins are sorted based on how many entities match the current document. At block 306, the user interface also highlights how well the name of the source bin matches the name of the target bin, for example, “Mike Jordan” vs. “Michael Jordan.” The user can quickly identify whether there is a good match in the set of output entity bins.
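The sorting and name comparison at block 306 can be illustrated with the sketch below. The description does not specify a similarity measure, so the use of SequenceMatcher from the Python standard library is an assumption, as are the function names, the dictionary layout and the sample bins.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set


def name_similarity(source_name: str, target_name: str) -> float:
    """Return a ratio in [0, 1] describing how closely two bin names match."""
    return SequenceMatcher(None, source_name.lower(), target_name.lower()).ratio()


def rank_output_bins(current_entities: Set[str], output_bins: List[Dict]) -> List[Dict]:
    """Sort output bins by how many related entities match the current document."""
    def overlap(output_bin: Dict) -> int:
        return len(current_entities & set(output_bin["related_entities"]))
    return sorted(output_bins, key=overlap, reverse=True)


# Illustrative candidate bins; the contents are made up for this sketch.
candidates = [
    {"name": "Michael Jordan (professor)", "related_entities": ["UC Berkeley"]},
    {"name": "Michael Jordan", "related_entities": ["Chicago Bulls", "NBA"]},
]

ranked = rank_output_bins({"NBA", "Chicago Bulls"}, candidates)
print([b["name"] for b in ranked])
print(name_similarity("Mike Jordan", "Michael Jordan") > name_similarity("Mike Jordan", "Larry Bird"))
```

Placing the bins with the most shared entities first means the most likely candidates are reviewed early, while the name ratio gives a second, independent visual cue.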


Referring now to FIG. 4, an article of manufacture or a computer program product 400 of the invention is illustrated. The computer program product 400 is tangibly embodied on a non-transitory computer readable storage medium that includes a recording medium 402, such as, a floppy disk, a high capacity read only memory in the form of an optically read compact disk or CD-ROM, a tape, or another similar computer program product. The computer readable storage medium 402, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire. Recording medium 402 stores program means or instructions 404, 406, 408, and 410 on the non-transitory computer readable storage medium 402 for carrying out the methods for implementing an interface for rapid ground truth binning in the system 100 of FIG. 1.


Computer readable program instructions 404, 406, 408, and 410 described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The computer program product 400 may include cloud-based software residing as a cloud application, commonly referred to as Software as a Service (SaaS). The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions 404, 406, 408, and 410 from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.


A sequence of program instructions or a logical assembly of one or more interrelated modules defined by the recorded program means 404, 406, 408, and 410 directs the system 100 for implementing an interface for rapid ground truth binning of the preferred embodiment.


While the present invention has been described with reference to the details of the embodiments of the invention shown in the drawing, these details are not intended to limit the scope of the invention as claimed in the appended claims.

Claims
  • 1. A system for implementing an interface for rapid ground truth binning comprising: a ground truth binning control logic; said ground truth binning control logic tangibly embodied in a non-transitory machine readable medium used to implement rapid ground truth binning; said ground truth binning control logic, receiving a first set of documents wherein each document in the first set has at least one reference to a first entity; said ground truth binning control logic, receiving an unbinned document wherein the unbinned document has at least one reference to a second entity; said ground truth binning control logic, providing a user interface for each received document allowing a user to view passages and select options related to confirming or denying an equivalence between the second entity in the unbinned document and the first entity, wherein providing the user interface comprises: displaying a name of the first entity; displaying a title of each document in the first set of documents; displaying a set of references to the first entity, wherein each reference from the set of references is found in a document in the first set of documents; displaying a similarity between the first entity and the second entity; and said ground truth binning control logic, responsive to the user utilizing the user interface and confirming the equivalence, combining the unbinned document with the first set of documents.
  • 2. The system as recited in claim 1, includes said ground truth binning control logic, responsive to combining the unbinned document with the first set of documents includes aggregating related entities contained in the first set of documents.
  • 3. The system as recited in claim 1, wherein said ground truth binning control logic, providing a user interface enables rapidly developing ground-truth sets needed to train a statistical cross-document coreference model operating over entity bin objects, while reducing complexity of the overall task.
  • 4. The system as recited in claim 1, wherein said ground truth binning control logic, providing a user interface includes highlighting spans in text passages from the unbinned document and a document collection of the first set of documents from which a relationship was identified.
  • 5. The system as recited in claim 1, wherein said ground truth binning control logic, providing a user interface includes providing visual guides to help a user identify key information and, therefore, speed up the decision-making process.
  • 6. The system as recited in claim 5, wherein said key information enables the user to speed up a decision-making process.
  • 7. The system as recited in claim 1, wherein said ground truth binning control logic, providing a user interface includes separately organizing and displaying an entity bin of an input document and the first set of documents.
  • 8. The system as recited in claim 1, wherein said ground truth binning control logic, providing a user interface includes providing a search capability allowing the user to filter the first set of documents for certain terms.
  • 9. A method for implementing an interface for rapid ground truth binning comprising: providing a ground truth binning control logic; said ground truth binning control logic tangibly embodied in a non-transitory machine readable medium used to implement rapid ground truth binning comprising; receiving a first set of documents wherein each document in the first set has at least one reference to a first entity; receiving an unbinned document wherein the unbinned document has at least one reference to a second entity; providing a user interface allowing a user to view passages and select options related to confirming or denying an equivalence between the second entity in the unbinned document and the first entity, wherein providing the user interface comprises: displaying a name of the first entity; displaying a title of each document in the first set of documents; displaying a set of references to the first entity, wherein each reference from the set of references is found in a document in the first set of documents; displaying a similarity between the first entity and the second entity; and responsive to the user utilizing the user interface and denying the equivalence, creating a new output document entity bin for the unbinned document.
  • 10. The method as recited in claim 9, includes providing visual guides enabling a user to identify key information.
  • 11. The method as recited in claim 9, includes providing a search capability allowing a user to filter output document entity bins for certain terms.
  • 12. The system as recited in claim 1, includes: said ground truth binning control logic, receiving a second set of documents, wherein each document in the second set has at least one reference to a third entity; said ground truth binning control logic, determining that the second set of documents contains fewer references to the third entity than the first set of documents contains references to the first entity, and wherein providing the user interface further comprises: displaying a name of the third entity; displaying a title of each document in the second set of documents; displaying a set of references to the third entity, wherein each reference from the set of references is found in a document in the second set of documents; displaying a similarity between the third entity and the second entity; and based on the determining, sorting the first set of documents and the second set of documents.
  • 13. A computer program product, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to: provide a ground truth binning control logic; receive a first set of documents wherein each document in the first set has at least one reference to a first entity; receive an unbinned document wherein the unbinned document has at least one reference to a second entity; provide a user interface allowing a user to view passages and select options related to confirming or denying an equivalence between the second entity in the unbinned document and the first entity, wherein providing the user interface comprises: displaying a name of the first entity; displaying a title of each document in the first set of documents; displaying a set of references to the first entity, wherein each reference from the set of references is found in a document in the first set of documents; displaying a similarity between the first entity and the second entity; and responsive to the user utilizing the user interface and confirming the equivalence, combining the unbinned document with the first set of documents.
CONTRACTUAL ORIGIN OF THE INVENTION

The United States Government has rights in this invention made in the performance of work under a U.S. Government Contract between the United States of America and IBM Division holding the contract GBS Government Agency issuing the Prime Contract: Defense agencies.

Related Publications (1)
Number Date Country
20200142720 A1 May 2020 US