Frameworks: arXiv as an accessible large-scale open research platform

Information

NSF Award
2311521

Owner

Cornell University

Award Id
2311521
Award Effective Date
1/1/2024
Award Expiration Date
12/31/2028
Award Amount
$ 4,966,530.00
Award Instrument
Standard Grant

arXiv is an open-access repository that has played a leading role in disciplines such as computer science, mathematics, and physics for over 30 years. It hosts more than 2 million scientific papers and has a large user community: each month there are approximately 5 million active users and 100 million web accesses. Despite its size and usage, arXiv has very limited search and recommendation functionality. To better serve the arXiv community, this project is building a new generation of search and recommendation functionality and simultaneously creating a research sandbox to reduce reliance on third-party, commercial services. To make arXiv's trove of scientific content accessible to the visually impaired, support is being added for well-structured HTML as well as PDF.

Improved discovery of research results provides broad multidisciplinary benefits across areas of science. These include less researcher time wasted browsing through large numbers of irrelevant papers, the revelation of "unknown unknowns," and faster research across subject areas through unexpected synergies. Improved recommendation tools, which can provide unbiased and diverse sources of relevant research results and techniques, are urgently needed to break down silos. arXiv will provide improved mechanisms for scientists to find out about important advances, both in their own field of expertise and in adjacent fields.

This project includes four major focus areas: Open A/B Testing, Neural Representations of Scientific Text, arXiv Dynamics, and Security & Privacy.

(1) Open A/B Testing enables arXiv to become a platform for A/B testing of search and recommendation algorithms. In addition to online A/B testing, offline A/B testing is provided using historical data along with counterfactual estimators for policy rewards.

(2) Neural Representations of Scientific Text provides vector-based representations of scientific texts (documents, paragraphs, and sentences) appropriate for multiple tasks, including citation, author, title, and keyword prediction. Differentiable search indices are investigated for their potential to provide additional search performance improvements without requiring incremental re-training. Finally, this work supports the construction of a scientific question-answering system that can also be used as a context-sensitive "chat-bot," letting researchers converse with it and get a list of recent publications relevant to their interests.

(3) The arXiv Dynamics project investigates how scientific fields grow, shrink, and transform over time. A "trending and emerging arXiv topics" pattern recognition system is being created to predict how interesting current and historical articles are to researchers. Research is investigating methods to remove the "rich-get-richer" effect from this model, to correct the model for the effects of users' historical interactions with the system, and to track performance and solicit user feedback as these models change over time.

(4) Under Security & Privacy, arXiv's privacy policy is updated so that users are aware of how their (meta-)data may be used and of the protections that will be deployed to protect their privacy. A "Layer 1" API allows researchers to make coarse-grained queries on anonymized arXiv weblogs, and a "Layer 2" API allows researchers to securely experiment on arXiv metadata and weblogs. Privacy is preserved by a combination of query restrictions and researcher usage agreements. A machine-learning API layer that supports differential privacy is being developed; it allows researchers to investigate the utility of these tools for novel ML-based applications, such as free-form question answering about scientific texts and neural recommender systems.

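The abstract does not say which differential-privacy mechanism the machine-learning API layer in focus area (4) will use. As a rough illustration of the idea only, the sketch below applies the textbook Laplace mechanism to a counting query. The log rows, the function names, and the choice of epsilon are all hypothetical and are not taken from any arXiv API.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale) by inverse-CDF sampling."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    # Clamp so log() never receives 0 in the very unlikely case u == -0.5.
    magnitude = max(1.0 - 2.0 * abs(u), 1e-300)
    return -scale * math.copysign(1.0, u) * math.log(magnitude)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy at the level of single records.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical anonymized log rows: (subject category, hour of day).
LOG_ROWS = [
    ("cs.LG", 9),
    ("cs.LG", 14),
    ("math.PR", 9),
    ("hep-th", 22),
    ("cs.LG", 23),
]

if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    noisy = private_count(LOG_ROWS, lambda row: row[0] == "cs.LG", 0.5, rng)
    print(f"noisy count of cs.LG accesses: {noisy:.2f}")
```

Smaller values of epsilon add more noise and give stronger privacy. Note that this sketch protects individual log rows; protecting a user who contributes many rows would additionally require bounding each user's contribution.
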
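Focus area (1) mentions offline A/B testing with counterfactual estimators for policy rewards, without naming a specific estimator. One standard choice is inverse propensity scoring (IPS), sketched below under the assumption that the historical log records the probability with which each item was shown. The record fields, the toy log, and the policies are hypothetical illustrations, not the project's actual data model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class LoggedImpression:
    """One historical interaction (all field names are illustrative)."""
    context: str       # e.g. the user's query
    action: str        # the item the logging policy showed
    reward: float      # e.g. 1.0 if clicked, else 0.0
    propensity: float  # probability the logging policy showed `action`


def ips_estimate(
    log: Sequence[LoggedImpression],
    target_prob: Callable[[str, str], float],
) -> float:
    """Estimate the mean reward a target policy would have earned.

    Each logged reward is reweighted by the ratio of the target policy's
    probability of the logged action to the logging policy's probability.
    """
    if not log:
        raise ValueError("log must not be empty")
    total = 0.0
    for imp in log:
        if imp.propensity <= 0:
            raise ValueError("logged propensities must be positive")
        weight = target_prob(imp.context, imp.action) / imp.propensity
        total += weight * imp.reward
    return total / len(log)


def always(item: str) -> Callable[[str, str], float]:
    """Deterministic target policy that always shows `item`."""
    return lambda context, action: 1.0 if action == item else 0.0


# Toy log collected by a policy that showed paper A or B with probability 0.5.
TOY_LOG = [
    LoggedImpression("q", "A", 1.0, 0.5),
    LoggedImpression("q", "A", 0.0, 0.5),
    LoggedImpression("q", "B", 1.0, 0.5),
    LoggedImpression("q", "B", 1.0, 0.5),
]

if __name__ == "__main__":
    print("always A:", ips_estimate(TOY_LOG, always("A")))
    print("always B:", ips_estimate(TOY_LOG, always("B")))
```

On this toy log the estimate is 0.5 for always showing paper A and 1.0 for always showing paper B, which matches the observed click rate of each paper. IPS is unbiased only when the logging policy gave every action the target policy might take a nonzero probability, and its variance grows as propensities shrink.
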
This award by the Office of Advanced Cyberinfrastructure is jointly supported by the Division of Information and Intelligent Systems in the Directorate for Computer and Information Science and Engineering and the Division of Physics within the Directorate for Mathematical and Physical Sciences. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Program Officer
Varun Chandola, vchandol@nsf.gov, 7032922656
Min Amd Letter Date
9/12/2023
Max Amd Letter Date
9/12/2023
ARRA Amount

Institutions

Name
Cornell University
City
ITHACA
State
NY
Country
United States
Address
341 PINE TREE RD
Postal Code
148502820
Phone Number
6072555014

Investigators

First Name
Yoav
Last Name
Artzi
Email Address
yoav@cs.cornell.edu
Start Date
9/12/2023 12:00:00 AM

First Name
Sarah
Last Name
Dean
Email Address
sdean@cornell.edu
Start Date
9/12/2023 12:00:00 AM

First Name
Ramin
Last Name
Zabih
Email Address
rdz@cs.cornell.edu
Start Date
9/12/2023 12:00:00 AM

First Name
Vitaly
Last Name
Shmatikov
Email Address
shmat@cs.cornell.edu
Start Date
9/12/2023 12:00:00 AM

First Name
Thorsten
Last Name
Joachims
Email Address
tj@cs.cornell.edu
Start Date
9/12/2023 12:00:00 AM

Program Element

Text
Info Integration & Informatics
Code
7364

Text
PHYSICS AT THE INFO FRONTIER
Code
7553

Text
Software Institutes
Code
8004

Program Reference

Text
INTERDISCIPLINARY PROPOSALS
Code
4444

Text
Software Institutes
Code
8004
