III: Small: Integrated prediction of intrinsic disorder and disorder functions with modular multi-label deep learning

Information

  • NSF Award
  • 2125218
Owner
  • Award Id
    2125218
  • Award Effective Date
    10/1/2021
  • Award Expiration Date
    9/30/2024
  • Award Amount
    $500,000.00
  • Award Instrument
    Standard Grant

Proteins are remarkable biological machines. Hundreds of millions of protein sequences have been decoded over the last two decades, creating a significant knowledge gap: we do not know what most of these proteins do. A common way to decipher protein function relies on the sequence-to-structure-to-function paradigm, in which function is inferred from the protein structure that is derived from the sequence. However, recent research has identified a large family of intrinsically disordered proteins that lack a stable structure under physiological conditions and therefore cannot be characterized using structure-based approaches. These proteins are particularly abundant in eukaryotes and are involved in the pathogenesis of numerous human diseases. The discovery of intrinsically disordered proteins has prompted the development of a new generation of computational methods that predict the presence of intrinsic disorder directly from protein sequences. The recently completed Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment has shown that these methods are fast and provide accurate results. However, while intrinsic disorder can be readily and accurately identified in protein sequences, its function remains a mystery. This project will conceptualize, design, implement, test and deploy an innovative machine learning method that provides highly accurate and integrated predictions of disorder and disorder functions directly from protein sequences. The team will use this method to produce functional annotations of disorder on an unprecedented scale of tens of millions of proteins, addressing the knowledge gap for this protein family. In the long run, this project will advance understanding of fundamental biological processes and related human health issues in the context of intrinsically disordered proteins. This project will also train STEM students and researchers via high-school outreach and multidisciplinary teaching and mentoring of undergraduate and graduate students and postdoctoral researchers, producing highly skilled researchers who are sought after by industry and academia.

The team addresses an interdisciplinary and challenging problem concerning intrinsically disordered proteins at the intersection of the bioinformatics and machine learning fields. Building on expertise in the computational analysis of intrinsic disorder and with a focus on technical innovation, this project will deliver a novel deep sequential multi-label transformer architecture that provides accurate predictions of disorder and disorder functions. The solution will be designed to accommodate the biological underpinnings of protein data, such as inherently multi-label outcomes, imbalanced labels and the sequential nature of protein data. Moreover, this architecture will feature a modular design to facilitate transfer to other areas of protein and nucleic acid bioinformatics. The resulting method will be extensively benchmarked and disseminated to maximize impact. The code will be deposited into relevant public repositories, and pre-computed functional annotations of intrinsic disorder will be made available through modern online resources, such as data repositories and webservers, in order to meet the needs of a broad spectrum of users including biologists, biochemists, biophysicists and bioinformaticians.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

  • Program Officer
    Sylvia Spengler, sspengle@nsf.gov, 7032927347
  • Min Amd Letter Date
    8/31/2021
  • Max Amd Letter Date
    8/31/2021
  • ARRA Amount

Institutions

  • Name
    Virginia Commonwealth University
  • City
    RICHMOND
  • State
    VA
  • Country
    United States
  • Address
    P.O. Box 980568
  • Postal Code
    23298-0568
  • Phone Number
    8048286772

Investigators

  • First Name
    Lukasz
  • Last Name
    Kurgan
  • Email Address
    lkurgan@vcu.edu
  • Start Date
    8/31/2021 12:00:00 AM

Program Element

  • Text
    Info Integration & Informatics
  • Code
    7364

Program Reference

  • Text
    Harnessing the Data Revolution
  • Text
    INFO INTEGRATION & INFORMATICS
  • Code
    7364
  • Text
    SMALL PROJECT
  • Code
    7923