Elements: Towards a Robust Cyberinfrastructure for NLP-based Search and Discoverability over Scientific Literature

Information

  • Award Id
    2104025
  • Award Effective Date
    5/1/2021
  • Award Expiration Date
    4/30/2024
  • Award Amount
    $ 399,566.00
  • Award Instrument
    Standard Grant

This project creates an open platform for accessing and mining information from scientific texts that provides access to an array of software, computing resources, and publication data. Current search technologies typically find many relevant documents, but do not extract and organize the information content of these documents or suggest new scientific hypotheses based on this organized content. Natural Language Processing (NLP) strategies are a recognized means to approach this problem, and this project develops the cyberinfrastructure to support sophisticated search and retrieval from scientific publications, use and augmentation of facilities for advanced and well-established natural language processing and machine learning tools, and extraction and aggregation of data from scientific publications. The project leverages two NSF-funded projects: the Language Applications (LAPPS) Grid, which has already proven to be an effective platform for development of NLP applications; and University of Wisconsin’s xDD (formerly, GeoDeepDive), a scalable, dependable infrastructure capable of rapidly growing a digital library of scientific publications, currently including over 13 million documents from multiple distributed commercial and open-access providers. The effort significantly enhances the value of these existing NSF-funded infrastructures by providing access to services for mining scientific publications and lowering the barriers to entry resulting from licensing, redistribution, and intellectual property issues. Scientists may perform large-scale text retrieval and mining using the University of Wisconsin’s high performance computing (HPC) infrastructure through a web-based interface. Iterative domain adaptation capabilities allow scientists to easily adapt existing services to specialized areas without configuring or installing additional components. 
The cyberinfrastructure is potentially applicable to any community that relies on computational tools for mining large textual datasets, including researchers in sociology, psychology, economics, education, linguistics, digital media, and the humanities.

This project extends the LAPPS Grid to provide access to UW-xDD’s collection of scientific publications and UW’s High Performance Computing facilities, as well as means to rapidly adapt existing, well-established natural language processing and machine learning software tools to new domains and evaluate results. The LAPPS Grid provides a large collection of NLP tools from a wide variety of sources exposed as web services, together with multiple commonly used resources and a front-end document retrieval engine currently configured to access PubMed/PubMedCentral as well as nightly updates of the CORD-19 dataset. The LAPPS Grid is open source, and can be run from the web, on a user’s laptop or desktop, in the cloud, or as a self-contained Docker image when it is necessary to protect sensitive or licensed data, when there is no network connection available, or for deployment on remote HPC facilities. All tools and resources can be used interoperably, eliminating the effort required to convert input and output formats to use a set of tools or resources together. xDD is one of the world’s largest single repositories of scientific publications: it spans all domains of knowledge, incorporates new documents automatically, and updates API endpoints every hour. xDD has accumulated over 13 million publications from multiple commercial and open-access publishers.
The xDD infrastructure is an integral part of the developing UW-COSMOS pipeline, which consists of a suite of services supporting document processing, including ingestion and parsing of PDFs; extraction of individual document objects such as text sections, figures, tables, and captions; and recall, which creates searchable Anserini and Elasticsearch indexes on the contexts and objects to enable retrieval of information. Specific project activities include implementing efficient retrieval and analysis of xDD’s vast holdings of scientific publications; extending the NLP capabilities of the LAPPS Grid for scientific publication mining and domain adaptation; developing full interoperability between the Grid and xDD/COSMOS; scaling LAPPS Grid services to handle the very large textual datasets available from UW-xDD; and surveying visualization techniques and integrating them into the Grid.

This award by the Office of Advanced Cyberinfrastructure is jointly supported by the NSF Division of Information and Intelligent Systems within the Directorate for Computer and Information Science and Engineering, and the NSF Public Access program.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

  • Program Officer
    Amy Walton, awalton@nsf.gov, 7032924538
  • Min Amd Letter Date
    4/12/2021
  • Max Amd Letter Date
    5/21/2021
  • ARRA Amount

Institutions

  • Name
    Brandeis University
  • City
    WALTHAM
  • State
    MA
  • Country
    United States
  • Address
    415 SOUTH ST MAILSTOP 116
  • Postal Code
    024532728
  • Phone Number
    7817362121

Investigators

  • First Name
    Nancy
  • Last Name
    Ide
  • Email Address
    ide@brandeis.edu
  • Start Date
    4/12/2021 12:00:00 AM
  • First Name
    James
  • Last Name
    Pustejovsky
  • Email Address
    pustejovsky@gmail.com
  • Start Date
    4/12/2021 12:00:00 AM

Program Element

  • Text
    NSF Public Access Initiative
  • Code
    7414
  • Text
    Data Cyberinfrastructure
  • Code
    7726
  • Text
    Software Institutes
  • Code
    8004

Program Reference

  • Text
    CSSI-1: Cyberinfr for Sustained Scientif
  • Text
    SMALL PROJECT
  • Code
    7923
  • Text
    Software Institutes
  • Code
    8004