Extrapolative Analyses for Reliable Machine Learning Driven Scientific Discovery

Information

  • Award Id
    2324394
  • Award Effective Date
    9/1/2023
  • Award Expiration Date
    8/31/2026
  • Award Amount
    $ 391,595.00
  • Award Instrument
    Continuing Grant

There has been a promising explosion in the production and analysis of digital data from experimental and observational sources, which presents many opportunities for machine learning (ML) driven scientific discovery in high-impact applications such as chemistry (cheminformatics) and biology (bioinformatics). Unfortunately, current ML methodology often fails to properly characterize data markedly distinct from what was seen during training (i.e., extrapolation). This, in turn, hampers our ability to make scientific discoveries that truly extend past our current knowledge. For example, this is of great consequence in chemical virtual screening campaigns, where one hopes to use ML predictions to guide potential targets for expensive real-world experimentation (e.g., in drug discovery applications). Poor extrapolative power of ML models can result in false positives, wasting time and resources through costly synthesis and experimental testing of novel chemical entities. The work stemming from this award will improve the real-world utility of ML models in scientific domains and prevent the faulty use of model predictions. The project also provides research training opportunities for graduate students.

This project develops various methodologies to more accurately assess the reliability of ML predictions on novel inputs and improve models' extrapolatory capabilities. First, the project develops empirical trials to more accurately evaluate the extrapolative capabilities of ML model fitting procedures on domains that lie beyond the training set distributional support. Second, utilizing extrapolative assessments, the project develops techniques to thoroughly explore the input space of possible extrapolation to anticipate and filter out likely unreliable predictions. Lastly, the project builds methodology to guide the acquisition of new training data that, once trained on, will improve model extrapolation.

This award by the Division of Mathematical Sciences is jointly supported by the NSF Office of Advanced Cyberinfrastructure.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
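
As a rough illustration of the first aim in the abstract (evaluating a model fitting procedure on inputs beyond the training set's support), the sketch below compares a conventional random held-out split with a split that holds out the largest inputs. It is not the project's methodology: the synthetic data, the one-dimensional linear model, and the helper names (fit_line, random_split, extrapolative_split, evaluate) are all assumptions chosen only to make the idea concrete.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b (illustrative model only)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def mse(model, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def random_split(xs, test_fraction, rng):
    """Interpolative split: test points are drawn from the same range as training."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    k = int(len(xs) * test_fraction)
    return idx[k:], idx[:k]  # (train indices, test indices)

def extrapolative_split(xs, test_fraction):
    """Extrapolative split: test points are the largest inputs, outside the training range."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    k = int(len(xs) * (1 - test_fraction))
    return order[:k], order[k:]  # (train indices, test indices)

def evaluate(xs, ys, train_idx, test_idx):
    """Fit on the training indices and report MSE on the test indices."""
    model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    return mse(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])

# Hypothetical synthetic data: a noise-free quadratic target on [0, 3].
rng = random.Random(0)
xs = [rng.uniform(0.0, 3.0) for _ in range(200)]
ys = [x ** 2 for x in xs]

random_mse = evaluate(xs, ys, *random_split(xs, 0.25, rng))
train_idx, test_idx = extrapolative_split(xs, 0.25)
extrap_mse = evaluate(xs, ys, train_idx, test_idx)

print(f"random split MSE:        {random_mse:.3f}")
print(f"extrapolative split MSE: {extrap_mse:.3f}")
```

In this toy setting the linear model looks far better under the random split than under the extrapolative one, which is the kind of gap an extrapolation-aware evaluation is meant to expose before predictions are trusted on novel inputs.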

  • Program Officer
    Yong Zeng, yzeng@nsf.gov, 7032927299
  • Min Amd Letter Date
    8/14/2023
  • Max Amd Letter Date
    8/14/2023
  • ARRA Amount

Institutions

  • Name
    University of North Carolina at Chapel Hill
  • City
    CHAPEL HILL
  • State
    NC
  • Country
    United States
  • Address
    104 AIRPORT DR STE 2200
  • Postal Code
    275995023
  • Phone Number
    9199663411

Investigators

  • First Name
    Junier
  • Last Name
    Oliva
  • Email Address
    joliva@cs.unc.edu
  • Start Date
    8/14/2023 12:00:00 AM
  • First Name
    Alexander
  • Last Name
    Tropsha
  • Email Address
    alex_tropsha@unc.edu
  • Start Date
    8/14/2023 12:00:00 AM

Program Element

  • Text
    CDS&E-MSS
  • Code
    8069
  • Text
    CDS&E
  • Code
    8084

Program Reference

  • Text
    NSCI: National Strategic Computing Initi
  • Text
    Machine Learning Theory
  • Text
    COMPUTATIONAL SCIENCE & ENGING
  • Code
    9263