Collaborative Research: IIS: III: MEDIUM: Learning Protein-ish: Foundational Insight on Protein Language Models for Better Understanding, Democratized Access, and Discovery

Information

  • NSF Award
  • 2310114
  • Award Id
    2310114
  • Award Effective Date
    8/1/2023
  • Award Expiration Date
    7/31/2026
  • Award Amount
    $ 599,871.00
  • Award Instrument
    Standard Grant

Large language models are massive neural networks that learn rich contextual representations of words and use such representations to address a variety of tasks in natural language processing (NLP). These models are a prominent example of generative artificial intelligence and are emerging as promising approaches for distilling and organizing the content of massive biological databases and for predicting a wide range of molecular bio-properties. Yet, we know surprisingly little about what these models capture in their learned representations, why they perform well on some tasks and not on others, and how they can produce deep insight into the relationships describing the biological space. If progress in NLP is any indication, the current trend of improving the performance of language models by drastically increasing the number of their trainable parameters is unsustainable both for our carbon footprint and for ensuring equity/accessibility of research and scholarship in the academic setting.

This project advances algorithmic research at the intersection of information integration and informatics using principled protein language models (PLMs) as computational vehicles for deeper insight into the structural, functional, and evolutionary organization across protein space at varying levels of detail and scale. It also aims to do so in a way that is resource-aware, sustainable, and accessible to all researchers. The research activities are organized in three thrusts: (1) encoding prior biological knowledge in PLMs for joint and resource-aware learning in composite spaces, (2) revealing fundamental properties and organizing the learned representation space to inform and connect what is captured with properties of interest, and (3) enabling PLMs to capture diverse contexts for deeper exploration of the structural, functional, and evolutionary organization across protein space. This interdisciplinary approach contributes to the fields of machine learning, bioinformatics, and molecular biology and provides opportunities at the interface of these disciplines for training under-represented students of all levels. The investigators are determined to bridge communities and disciplines, and they have planned activities to build and galvanize a trans-disciplinary community to further advance their research.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

  • Program Officer
    Sorin Draghici, sdraghic@nsf.gov, 7032922232
  • Min Amd Letter Date
    7/18/2023
  • Max Amd Letter Date
    7/18/2023
  • ARRA Amount

Institutions

  • Name
    Emory University
  • City
    ATLANTA
  • State
    GA
  • Country
    United States
  • Address
    201 DOWMAN DR
  • Postal Code
    303221061
  • Phone Number
    4047272503

Investigators

  • First Name
    Yana
  • Last Name
    Bromberg
  • Email Address
    yana.bromberg@emory.edu
  • Start Date
    7/18/2023 12:00:00 AM

Program Element

  • Text
    Info Integration & Informatics
  • Code
    7364

Program Reference

  • Text
    INFO INTEGRATION & INFORMATICS
  • Code
    7364
  • Text
    MEDIUM PROJECT
  • Code
    7924