Collaborative Research: IIS: III: MEDIUM: Learning Protein-ish: Foundational Insight on Protein Language Models for Better Understanding, Democratized Access, and Discovery

Information

NSF Award
2310114

Owner

Emory University

Award Id
2310114
Award Effective Date
8/1/2023
Award Expiration Date
7/31/2026
Award Amount
$ 599,871.00
Award Instrument
Standard Grant

Information

Large language models are massive neural networks that learn rich contextual representations of words and use such representations to address a variety of tasks in natural language processing (NLP). These models are a prominent example of generative artificial intelligence and are emerging as promising approaches for distilling and organizing the content of massive biological databases and for predicting a wide range of molecular bio-properties. Yet, we know surprisingly little about what these models capture in their learned representations, why they perform well on some tasks and not on others, and how they can produce deep insight into the relationships describing the biological space. If progress in NLP is any indication, the current trend of improving the performance of language models by drastically increasing the number of their trainable parameters is unsustainable both for our carbon footprint and for ensuring equity/accessibility of research and scholarship in the academic setting.

This project advances algorithmic research at the intersection of information integration and informatics using principled protein language models (PLMs) as computational vehicles for deeper insight into the structural, functional, and evolutionary organization across protein space at varying levels of detail and scale. It also aims to do so in a way that is resource-aware, sustainable, and accessible to all researchers. The research activities are organized in three thrusts: (1) encoding prior biological knowledge in PLMs for joint and resource-aware learning in composite spaces, (2) revealing fundamental properties and organizing the learned representation space to inform and connect what is captured with properties of interest, and (3) enabling PLMs to capture diverse contexts for deeper exploration of the structural, functional, and evolutionary organization across protein space. This interdisciplinary approach contributes to the fields of machine learning, bioinformatics, and molecular biology and provides opportunities at the interface of these disciplines for training under-represented students of all levels. The investigators are determined to bridge communities and disciplines, and they have planned activities to build and galvanize a trans-disciplinary community to further advance their research.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Program Officer
Sorin Draghici, sdraghic@nsf.gov, 7032922232
Min Amd Letter Date
7/18/2023
Max Amd Letter Date
7/18/2023
ARRA Amount

Institutions

Name
Emory University
City
ATLANTA
State
GA
Country
United States
Address
201 DOWMAN DR
Postal Code
30322-1061
Phone Number
4047272503

Investigators

First Name
Yana
Last Name
Bromberg
Email Address
yana.bromberg@emory.edu
Start Date
7/18/2023 12:00:00 AM

Program Element

Text
Info Integration & Informatics
Code
7364

Program Reference

Text
INFO INTEGRATION & INFORMATICS
Code
7364

Text
MEDIUM PROJECT
Code
7924
