Exploring the unknown protein universe using evolutionary information

Information

  • Research Project
  • 10247669
  • ApplicationId
    10247669
  • Core Project Number
    DP5OD026389
  • Full Project Number
    5DP5OD026389-04
  • Serial Number
    026389
  • FOA Number
    RFA-RM-17-008
  • Sub Project Id
  • Project Start Date
    9/7/2018
  • Project End Date
    8/31/2023
  • Program Officer Name
    MILLER, BECKY
  • Budget Start Date
    9/1/2021
  • Budget End Date
    8/31/2022
  • Fiscal Year
    2021
  • Support Year
    04
  • Suffix
  • Award Notice Date
    9/8/2021
Organizations

Project Summary/Abstract: For billions of years, nature has been conducting the greatest experiment of all time. Imagine one day gaining access to the detailed notes from these experiments. Today, with worldwide expeditions to collect samples from all habitats, single-cell sequencing of unculturable microbes, and the rapid drop in sequencing costs, we can finally tap into nature and gain access to these notes. All that is missing is a Rosetta Stone to interpret this data. The traditional approach to interpreting sequence data is comparison to known information, such as annotated genomes and/or experimentally characterized protein families. Unfortunately, nearly half of metagenomic data (coming from either environmental samples or microbiomes) lacks detectable sequence homology to any protein family, let alone to any isolated genome. Furthermore, the rate at which this "dark matter" is discovered far exceeds the rate at which experiments can be done to characterize it. An alternative approach is to learn a generative, statistical model of the evolutionary process itself. The parameters of this model should, in turn, provide the constraints on natural selection. For protein-coding genes, the constraints include folding, stability, and function. Recently, it was shown that a global statistical model of a protein family that captures both conservation and coevolution patterns in the family possesses this quality. The strength of the coevolution term is correlated with residue-residue contacts in the 3D structure. These contacts have since been used to computationally determine the 3D structures of hundreds of unknown protein families and complexes. These structures, in turn, have been used to predict function by examining the arrangement of conserved residues and structural similarity to known protein structures. Structural matches can occur in the absence of detectable sequence similarity because structural similarity is retained over larger evolutionary distances.
I propose to (1) develop an improved, unified statistical model of protein evolution that takes into account functional and lineage constraints; (2) apply the model to mine metagenomic "dark matter" sequences for new protein families, functions, and protein-protein interactions; and (3) probe the evolution of multicellularity through comparison of structures and interactions in the early tree of life. One result of the research will be a public database of new protein families and their predicted 3D structures and functions. These will be used by structural, molecular, and evolutionary biologists as a reference for future studies into the unknown protein universe.
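
The link the abstract draws between conservation, coevolution, and residue-residue contacts can be illustrated with a minimal sketch. It is not the global statistical model the project describes: it swaps in plain column entropy (for conservation) and pairwise mutual information (for covariation), which are simpler, local measures. The toy alignment and all function names below are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Hypothetical toy alignment: column 1 is fully conserved,
# columns 0 and 3 change together, column 2 varies independently.
TOY_MSA = [
    "AKLD",
    "AKVD",
    "EKLR",
    "EKIR",
    "AKID",
    "EKVR",
]


def column(msa, i):
    """Return the residues found at alignment position i."""
    return [seq[i] for seq in msa]


def entropy(symbols):
    """Shannon entropy in bits; 0 means the position is fully conserved."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def mutual_information(msa, i, j):
    """Covariation between positions i and j: H(i) + H(j) - H(i, j)."""
    a, b = column(msa, i), column(msa, j)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def rank_pairs(msa):
    """Rank all position pairs by mutual information, strongest first."""
    length = len(msa[0])
    scores = {
        (i, j): mutual_information(msa, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_pairs(TOY_MSA)` places the covarying pair of positions at the top of the list, which is the kind of signal that is read as a candidate contact. A global model of the sort the abstract refers to fits all positions jointly, which helps separate direct couplings from correlations that arise only indirectly through a third position; this pairwise sketch does not attempt that.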

IC Name
OFFICE OF THE DIRECTOR, NATIONAL INSTITUTES OF HEALTH
  • Activity
    DP5
  • Administering IC
    OD
  • Application Type
    5
  • Direct Cost Amount
    250000
  • Indirect Cost Amount
    172500
  • Total Cost
    422500
  • Sub Project Total Cost
  • ARRA Funded
    False
  • CFDA Code
    310
  • Ed Inst. Type
    SCHOOLS OF ARTS AND SCIENCES
  • Funding ICs
    NIDCR:1\OD:422499\
  • Funding Mechanism
    Non-SBIR/STTR RPGs
  • Study Section
    ZRG1
  • Study Section Name
    Special Emphasis Panel
  • Organization Name
    HARVARD UNIVERSITY
  • Organization Department
    NONE
  • Organization DUNS
    082359691
  • Organization City
    CAMBRIDGE
  • Organization State
    MA
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    021385319
  • Organization District
    UNITED STATES