UniProt: A Protein Sequence and Function Resource for Biomedical Science

Information

Research Project
10267787

ApplicationId
10267787
Core Project Number
U24HG007822
Full Project Number
2U24HG007822-08
Serial Number
007822
FOA Number
PAR-20-097
Sub Project Id

Project Start Date
9/18/2014
Project End Date
5/31/2026
Program Officer Name
PILLAI, AJAY
Budget Start Date
9/17/2021
Budget End Date
5/31/2022
Fiscal Year
2021
Support Year
08
Suffix
Award Notice Date
9/17/2021

Organizations

European Molecular Biology Laboratory

Information

PROJECT SUMMARY/ABSTRACT

This project continues the development of the UniProt Knowledgebase, which aims to provide the scientific community with a comprehensive, high-quality, and freely accessible resource of protein sequences and functional information. Proteins are an essential bridge between human genetics, the environment and phenotype. While human genetics has increasing power to find correlations between genotype and phenotype, knowledge of how proteins function, provided by UniProt, is essential for the mechanistic understanding critical to develop health outcomes through improved and personalized diagnostics, prognostics, and treatments.

Biomedical research is being revolutionized by methods from the field of Artificial Intelligence, particularly Machine Learning (ML) approaches such as Deep Learning (DL). These approaches now outstrip the ability of humans in many fields and are state-of-the-art when sufficient data is available. UniProt provides gold standard training data for hundreds of ML applications in biomedical research. The work in this proposal will enhance the readiness of UniProt for use in ML and will integrate ML methods to enhance our efficiency.

UniProt curators extract and synthesize experimental knowledge of proteins from papers in human and machine-readable forms using a range of standard ontologies. This proposal will further structure protein knowledge in UniProt, developing complete, machine-readable catalogs of the functional impact of human variation and of human protein networks and complexes, essential to understanding human disease. Efficiency of curation will be improved using DL models, developed in collaboration with text mining experts, to automate the identification of relevant papers and accelerate extraction of knowledge. This extracted knowledge will be validated by our expert curators and also the wider research community who will be actively engaged to further scale curation.

ML approaches will also be used to infer annotations for proteins with no experimental characterization, using community challenges to develop faster, more accurate, scalable approaches to annotate the deluge of uncharacterized proteins. UniProt is an exemplar FAIR resource and has served the scientific community with metronomic data releases despite an exponential growth in data volumes. Streamlined production processes will scale efficiently and sustainably with both the growing data volume and complexity. We will explore novel technologies to ensure the continued timely release of data to the community according to the FAIR principles.

UniProt is an international hub of protein data that serves hundreds of thousands of users annually. We will continue using user-centric approaches to develop the UniProt website in response to user needs and new data types. We will engage with our stakeholders and collaborators by introducing an annual strategic partnership meeting. We will engage our communities through webinars, social media, hackathons and attendance at scientific meetings to broaden the efficient and impactful use of our data.

IC Name

NATIONAL HUMAN GENOME RESEARCH INSTITUTE

Activity
U24
Administering IC
HG
Application Type
2

Direct Cost Amount
2955344
Indirect Cost Amount
94656
Total Cost
3050000
Sub Project Total Cost

ARRA Funded
False
CFDA Code
172
Ed Inst. Type
Funding ICs
NCI:500000\NHGRI:1000000\NHLBI:200000\NIDDK:200000\NIGMS:900000\OD:250000\
Funding Mechanism
OTHER RESEARCH-RELATED
Study Section
ZRG1
Study Section Name
Special Emphasis Panel

Organization Name
EUROPEAN MOLECULAR BIOLOGY LABORATORY
Organization Department
Organization DUNS
321691735
Organization City
HEIDELBERG
Organization State
Organization Country
GERMANY
Organization Zip Code
69117
Organization District
GERMANY
