CAREER: Recovering complex associations from high-dimensional genomic data

Information

NSF Award
2243341

Owner

THE J. DAVID GLADSTONE INSTITUTE

Award Id
2243341
Award Effective Date
10/1/2022
Award Expiration Date
3/31/2023
Award Amount
$ 138,428.00
Award Instrument
Continuing Grant

In this proposal, the PI will analyze current genomic data using powerful modern machine learning methods to help make personalized medicine for each patient a reality. Imagine using genomic data and clinical traits to accurately predict risk for preterm birth early in gestation, or a college student's risk for future heart attacks, and tailoring preventative measures to the specific risk mechanisms. This research addresses a core theme of machine learning methods applied to scientific data: how to robustly and efficiently build predictive models using complex hidden structure in high-dimensional data with limited numbers of samples, as is common in genomic and biomedical data. This proposal addresses a number of fundamental questions: How can correlation among features be used to share strength across limited numbers of samples? How can one test whether an observation in a cell is causal for disease? How can biological structure be encoded in nonlinear functions? These fundamental questions in applied machine learning and statistical genetics will be addressed through the creation of hierarchical models and methods for computationally tractable analyses. These projects will enable recovery of genomic signals with the predictive ability essential for personalized medicine. The PI also plans active engagement with underrepresented minorities in computer science and will make software publicly available.

This research aims to develop computationally tractable structured hierarchical models to find complex signals in genomic data that are hidden from current methods, to use these signals to build predictive models from existing genomic study data, and to use the resulting predictive variants to precisely quantify disease risk for each patient. Success in these goals impacts personalized medicine, enabling a complete understanding of the genetic regulators of disease and making individual-specific disease risk prediction and treatment a reality.

Although linear models have been used to analyze scientific data for 125 years, these methods assume unlimited availability of samples and simple linear structure, and fail to recover variants with more complex associations. In genomic data, predictive signal is often compositional, including linear, sparse, low-rank, or nonlinear structure. This proposed research will drastically shift current scientific data analysis by developing efficient methods that recover predictive genetic variants with complex effects. This research is organized around three integrated projects.

1) High-dimensional correlations. Current methods for correlation do not exploit multiple, correlated traits to improve power to find relationships between two high-dimensional sets of observations. The PI will develop computationally tractable models and robust inference methods for structured latent variable models in the presence of substantial observation noise.

2) Sparse, nonlinear regression for prediction by exploiting nonadditive effects. Standard predictive models for genomics assume that associations are sparse and additive across predictors; nonlinear terms are not regularized appropriately. The PI will develop a predictive model that robustly recovers variants with additive and nonadditive effects.

3) Causal inference to study the mechanism of genetic regulation of disease. Current models of causal inference in genomics make unrealistic assumptions and fail to exploit modern machine learning approaches to nonlinearity, regularization, and approximate inference. The PI will develop a hierarchical model for causal analysis to pinpoint the cellular mechanisms of disease.

Program Officer
Sylvia Spengler, sspengle@nsf.gov, 7032927347
Min Amd Letter Date
10/13/2022
Max Amd Letter Date
10/13/2022
ARRA Amount

Institutions

Name
The J. David Gladstone Institutes
City
SAN FRANCISCO
State
CA
Country
United States
Address
1650 OWENS ST
Postal Code
941582261
Phone Number
4157342000

Investigators

First Name
Barbara
Last Name
Engelhardt
Email Address
bee@princeton.edu
Start Date
10/13/2022 12:00:00 AM

Program Element

Text
Info Integration & Informatics
Code
7364

Program Reference

Text
CAREER-Faculty Erly Career Dev
Code
1045

Text
INFO INTEGRATION & INFORMATICS
Code
7364

Text
WOMEN, MINORITY, DISABLED, NEC
Code
9102
