Open data-driven infrastructure for building biomolecular force fields for predictive biophysics and drug design

Information

  • Research Project
  • 10412594
  • ApplicationId
    10412594
  • Core Project Number
    R01GM132386
  • Full Project Number
    3R01GM132386-02S1
  • Serial Number
    132386
  • FOA Number
    PA-20-272
  • Sub Project Id
  • Project Start Date
    3/1/2020
  • Project End Date
    2/29/2024
  • Program Officer Name
    LYSTER, PETER
  • Budget Start Date
    3/1/2021
  • Budget End Date
    2/28/2022
  • Fiscal Year
    2021
  • Support Year
    02
  • Suffix
    S1
  • Award Notice Date
    8/30/2021


PROJECT SUMMARY/ABSTRACT Current generation molecular simulation models are insufficiently accurate, and current generation tools for building those models are limited, not automated, and based on aging infrastructure. Our original R01, "Open Data-driven Infrastructure for Building Biomolecular Force Fields for Predictive Biophysics and Drug Design," aims to solve these problems, producing a modern infrastructure for building, applying, and improving accurate molecular mechanics force fields. As part of our NIH-funded project, we have collaborated closely with the Molecular Sciences Software Institute (MolSSI) to use the QCArchive ecosystem to generate and continuously expand very large quantum chemical datasets relevant to biomolecular systems on a variety of supercomputing resources. QCArchive now contains over 42M quantum chemical calculations for over 39M molecules, and has become incredibly popular, with over 1.79M accesses/month. Large quantum chemical datasets relevant to biomolecular systems are incredibly valuable to the AI/ML community. Data is the key element needed for both fundamental research into ML architectures and constructing predictive models for downstream use. Unfortunately, quantum chemical datasets are incredibly expensive to generate, limiting in-house generation of large, useful datasets needed to drive AI/ML research to a few large companies and researchers with access to sufficient computing resources. While AI/ML quantum chemical methods have shown immense promise for biomolecular systems, the limited access to large, curated datasets has greatly hindered researchers from making rapid progress in this area. We aim to bridge this gap by working closely with MolSSI QCArchive developers to address robustness, scalability, and data delivery challenges to meet the needs of the biomolecular AI/ML community requiring access to large quantum chemistry datasets (Aim 1).
Additional software developers will enable improvements to the QCArchive infrastructure to meet the rapidly growing demands of the AI/ML community. As QCArchive is primarily maintained by a single MolSSI Software Scientist, additional developers are needed so that the AI/ML community can take full advantage of the wealth of data generated directly by our NIH-funded project, as well as the data that is rapidly populating QCArchive through the tools our project has engineered for distributed, fault-tolerant quantum chemistry. We will additionally develop interfaces and dashboards to enable facile discovery, retrieval, and import of quantum chemical datasets within popular machine learning frameworks (Aim 2). To ensure our tools are specifically useful for the most promising AI/ML applications, we will collaborate directly with AI researchers in the OpenMM, TorchMD, and SchNetPack communities who are actively developing and deploying quantum machine learning (QML) potentials for biomolecular simulation, with the goal of producing generally useful tools suitable for the wider community yet capable of driving these high-priority applications.

IC Name
NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES
  • Activity
    R01
  • Administering IC
    GM
  • Application Type
    3
  • Direct Cost Amount
    132694
  • Indirect Cost Amount
    44992
  • Total Cost
    177686
  • Sub Project Total Cost
  • ARRA Funded
    False
  • CFDA Code
    859
  • Ed Inst. Type
    BIOMED ENGR/COL ENGR/ENGR STA
  • Funding ICs
    OD:177686
  • Funding Mechanism
    Non-SBIR/STTR RPGs
  • Study Section
  • Study Section Name
  • Organization Name
    UNIVERSITY OF COLORADO
  • Organization Department
    ENGINEERING (ALL TYPES)
  • Organization DUNS
    007431505
  • Organization City
    Boulder
  • Organization State
    CO
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    803031058
  • Organization District
    UNITED STATES