Frameworks: arXiv as an accessible large-scale open research platform

Information

  • Award Id
    2311521
  • Award Effective Date
    1/1/2024
  • Award Expiration Date
    12/31/2028
  • Award Amount
    $ 4,966,530.00
  • Award Instrument
    Standard Grant

arXiv is an open-access repository that has played a leading role in disciplines such as computer science, mathematics and physics for over 30 years. It hosts more than 2 million scientific papers and has a large user community: each month there are approximately 5 million active users and 100 million web accesses. Despite its size and usage, arXiv has very limited search and recommendation functionality. To better serve the arXiv community, this project is building a new generation of search and recommendation functionality and simultaneously creating a research sandbox to reduce reliance on third-party, commercial services. To make arXiv's trove of scientific content accessible to the visually impaired, support is being added for well-structured HTML as well as PDF. Improved discovery of research results provides broad multidisciplinary benefits across areas of science. These include less researcher time wasted browsing through large numbers of irrelevant papers, revelation of "unknown unknowns," and acceleration of research across different subject areas through unexpected synergies. Improved recommendation tools, which can provide unbiased and diverse sources of relevant research results and techniques, are urgently needed to break silos. arXiv will provide improved mechanisms for scientists to find out about important advances, both in their own field of expertise and in adjacent fields.

This project includes four major focus areas: Open A/B Testing, Neural Representations of Scientific Text, arXiv Dynamics, and Security & Privacy.

(1) Open A/B Testing enables arXiv to become a platform for A/B testing of search and recommendation algorithms. In addition to online A/B testing, offline A/B testing is provided using historical data along with counterfactual estimators for policy rewards.

(2) Neural Representations of Scientific Text provides a vector-based representation of scientific texts (documents, paragraphs, and sentences) appropriate for multiple tasks, including citation, author, title, and keyword prediction. Differentiable search indices are investigated due to their potential to provide additional search performance improvements without requiring incremental re-training. Finally, these representations support the construction of a scientific question-answering system which can also be used as a context-sensitive "chat-bot," enabling researchers to converse with it and get a list of recent publications relevant to their interests.

(3) The arXiv Dynamics project investigates how scientific fields grow, shrink, and transform over time. A "trending and emerging arXiv topics" pattern recognition system is being created to predict how interesting current and historical articles are to researchers. Research is investigating methods to remove the "rich-get-richer" effect from this model, to correct the model for the effects of the users' historical interactions with the system, and to track performance and solicit user feedback as these models change over time.

(4) Under Security & Privacy, arXiv's privacy policy is updated so that users are aware of how their (meta-)data may be used and of the safeguards that will be deployed to protect their privacy. A "Layer 1" API allows researchers to make coarse-grained queries on anonymized arXiv weblogs, and a "Layer 2" API allows researchers to securely experiment on arXiv metadata and weblogs. Privacy is preserved by a combination of query restrictions and researcher usage agreements. A machine-learning API layer is being developed which supports differential privacy and allows researchers to investigate the utility of these tools for novel ML-based applications, such as free-form question answering about scientific texts and neural recommender systems.

This award by the Office of Advanced Cyberinfrastructure is jointly supported by the Division of Information and Intelligent Systems in the Directorate for Computer and Information Science and Engineering and the Division of Physics within the Directorate for Mathematical and Physical Sciences.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
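
Illustrative note on offline A/B testing: the abstract mentions "counterfactual estimators for policy rewards" without naming a specific estimator. One standard choice is inverse propensity scoring (IPS), sketched below under stated assumptions. The record layout (`LoggedInteraction`), the function name `ips_estimate`, and the toy data are hypothetical and are not taken from the award or from any arXiv API. The sketch also assumes the logging policy recorded the probability with which it showed each item.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LoggedInteraction:
    """One historical log record (hypothetical layout)."""
    context: str       # e.g. a search query
    action: str        # the item the logging policy showed
    reward: float      # e.g. 1.0 if the user clicked, else 0.0
    propensity: float  # probability the logging policy chose this action


def ips_estimate(
    logs: Sequence[LoggedInteraction],
    target_policy: Callable[[str, str], float],
) -> float:
    """Estimate the average reward of target_policy from logged data.

    target_policy(context, action) returns the probability that the
    new policy would have chosen `action` in `context`.
    """
    if not logs:
        raise ValueError("need at least one logged interaction")
    total = 0.0
    for record in logs:
        if record.propensity <= 0.0:
            raise ValueError("logged propensities must be positive")
        # Reweight each observed reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy(record.context, record.action) / record.propensity
        total += weight * record.reward
    return total / len(logs)


# Toy usage: the logging policy showed "a" or "b" uniformly at random.
toy_logs = [
    LoggedInteraction("query", "a", 1.0, 0.5),
    LoggedInteraction("query", "b", 0.0, 0.5),
]


def always_a(context: str, action: str) -> float:
    """A candidate policy that always shows item "a"."""
    return 1.0 if action == "a" else 0.0


estimate = ips_estimate(toy_logs, always_a)
print(estimate)
```

The estimator is unbiased only when every action the target policy can take had nonzero probability under the logging policy, and its variance grows when propensities are small. Clipped or self-normalized variants are commonly used to trade a little bias for lower variance.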

  • Program Officer
    Varun Chandola, vchandol@nsf.gov, 7032922656
  • Min Amd Letter Date
    9/12/2023
  • Max Amd Letter Date
    9/12/2023
  • ARRA Amount

Institutions

  • Name
    Cornell University
  • City
    ITHACA
  • State
    NY
  • Country
    United States
  • Address
    341 PINE TREE RD
  • Postal Code
    14850-2820
  • Phone Number
    6072555014

Investigators

  • First Name
    Yoav
  • Last Name
    Artzi
  • Email Address
    yoav@cs.cornell.edu
  • Start Date
    9/12/2023 12:00:00 AM
  • First Name
    Sarah
  • Last Name
    Dean
  • Email Address
    sdean@cornell.edu
  • Start Date
    9/12/2023 12:00:00 AM
  • First Name
    Ramin
  • Last Name
    Zabih
  • Email Address
    rdz@cs.cornell.edu
  • Start Date
    9/12/2023 12:00:00 AM
  • First Name
    Vitaly
  • Last Name
    Shmatikov
  • Email Address
    shmat@cs.cornell.edu
  • Start Date
    9/12/2023 12:00:00 AM
  • First Name
    Thorsten
  • Last Name
    Joachims
  • Email Address
    tj@cs.cornell.edu
  • Start Date
    9/12/2023 12:00:00 AM

Program Element

  • Text
    Info Integration & Informatics
  • Code
    7364
  • Text
    PHYSICS AT THE INFO FRONTIER
  • Code
    7553
  • Text
    Software Institutes
  • Code
    8004

Program Reference

  • Text
    INTERDISCIPLINARY PROPOSALS
  • Code
    4444
  • Text
    Software Institutes
  • Code
    8004