CAREER: Recovering complex associations from high-dimensional genomic data

Information

  • NSF Award
  • 2243341
  • Award Id
    2243341
  • Award Effective Date
    10/1/2022
  • Award Expiration Date
    3/31/2023
  • Award Amount
    $ 138,428.00
  • Award Instrument
    Continuing Grant

In this proposal, the PI will analyze current genomic data using powerful modern machine learning methods to help make personalized medicine for each patient a reality. Imagine using genomic data and clinical traits to accurately predict risk for preterm birth early in gestation, or a college student's risk for future heart attacks, and tailoring preventative measures to the specific risk mechanisms. This research addresses a core theme of machine learning methods applied to scientific data: how to robustly and efficiently build predictive models using complex hidden structure in high-dimensional data with limited numbers of samples, as is common in genomic and biomedical data. This proposal addresses a number of fundamental questions: How to use correlation among features to share strength across limited numbers of samples? How to test whether an observation in a cell has a causal effect on disease? How to encode biological structure in nonlinear functions? These fundamental questions in applied machine learning and statistical genetics will be addressed through the creation of hierarchical models and methods for computationally tractable analyses. These projects will enable recovery of genomic signals with predictive ability essential for personalized medicine. The PI also plans active engagement with underrepresented minorities in computer science and will make the resulting software publicly available.

This research aims to develop computationally tractable structured hierarchical models to find complex signals in genomic data that are hidden to current methods. These signals will be used to build predictive models from existing genomic study data, and the resulting predictive variants will be used to precisely quantify disease risk for each patient. Success of these goals impacts personalized medicine, enabling a complete understanding of the genetic regulators of disease and making individual-specific disease risk prediction and treatment a reality.
Although linear models have been used to analyze scientific data for 125 years, these methods assume unlimited availability of samples and simple linear structure, and fail to recover variants with more complex associations. In genomic data, predictive signal is often compositional, including linear, sparse, low-rank, or nonlinear structure. This proposed research will drastically shift current scientific data analysis by developing efficient methods that recover predictive genetic variants with complex effects. This research is organized around three integrated projects. 1) High-dimensional correlations. Current methods for correlation do not exploit multiple, correlated traits to improve power to find relationships between two high-dimensional sets of observations. The PI will develop computationally tractable models and robust inference methods for structured latent variable models in the presence of substantial observation noise. 2) Sparse, nonlinear regression for prediction by exploiting nonadditive effects. Standard predictive models for genomics assume that associations are sparse and additive across predictors; nonlinear terms are not regularized appropriately. The PI will develop a predictive model that robustly recovers variants with additive and nonadditive effects. 3) Causal inference to study the mechanism of genetic regulation of disease. Current models of causal inference in genomics make unrealistic assumptions and fail to exploit modern machine learning approaches to nonlinearity, regularization, and approximate inference. The PI will develop a hierarchical model for causal analysis to pinpoint the cellular mechanisms of disease.

  • Program Officer
    Sylvia Spengler, sspengle@nsf.gov, 7032927347
  • Min Amd Letter Date
    10/13/2022
  • Max Amd Letter Date
    10/13/2022
  • ARRA Amount

Institutions

  • Name
    The J. David Gladstone Institutes
  • City
    SAN FRANCISCO
  • State
    CA
  • Country
    United States
  • Address
    1650 OWENS ST
  • Postal Code
    941582261
  • Phone Number
    4157342000

Investigators

  • First Name
    Barbara
  • Last Name
    Engelhardt
  • Email Address
    bee@princeton.edu
  • Start Date
    10/13/2022 12:00:00 AM

Program Element

  • Text
    Info Integration & Informatics
  • Code
    7364

Program Reference

  • Text
    CAREER-Faculty Erly Career Dev
  • Code
    1045
  • Text
    INFO INTEGRATION & INFORMATICS
  • Code
    7364
  • Text
    WOMEN, MINORITY, DISABLED, NEC
  • Code
    9102