Advancing Statistical Methods for Multi-Study Predictions

Information

NSF Award
2113707

Owner

DANA-FARBER CANCER INSTITUTE

Award Id
2113707
Award Effective Date
9/15/2021 - 4 years ago
Award Expiration Date
8/31/2024 - a year ago
Award Amount
$ 116,669.00
Award Instrument
Continuing Grant

Information

Advancing Statistical Methods for Multi-Study Predictions

Scientific predictions are more valuable when their accuracy is stable across studies. Steps towards more replicable predictions would increase public confidence in the scientific process, facilitate dissemination of results, and enhance public engagement with science and technology. As many areas of science and technology are becoming data-rich, multiple datasets are more commonly available for training prediction models. This project aims to develop new general strategies for multi-study ensemble learning and prediction that enhance replicability. The long-term goals are to help improve the well-being of individuals in society (for example, via algorithms for individualized disease prevention and medical treatment), improve national security, and benefit industries that use prediction approaches as key elements of their business plans (including for example finance, marketing, and real estate). The investigators plan to build free, open-source software to implement the successful strategies and to provide research training opportunities to graduate students. <br/><br/>Prior work developed a novel strategy for multi-study prediction, which groups together prediction models, each trained on a single study, and weights them to reward those that perform well outside their training study. This technique, called multi-study ensembling, shows promise to substantially improve prediction replicability. In this project, the investigators plan to generalize this approach in two ways. The first goal is to extend the study to include resampling concepts tailored to the multi-study setting. Specifically, they will consider the study strap ensemble, which fits models to bootstrapped datasets, or "pseudo-studies." These are generated by resampling from multiple studies with a hierarchical resampling scheme that generalizes the randomized cluster bootstrap. The study strap is controlled by a tuning parameter that determines the proportion of observations to draw from each study. When the parameter is set to its lowest value, each pseudo-study is resampled from only a single study. When it is high, the study strap ignores the multi-study structure and generates pseudo-studies by merging the datasets and drawing observations like a standard bootstrap. The second goal is to extend the concept of weight by building statistical models on the weights themselves to both handle high dimensionality and exploit useful structure of the multi-study collection. The work will be carried out within the framework of multi-study stacking, where predictions generated by study-specific models are used as features in a second-stage analysis (typically a regularized regression) performed on the merged dataset collection. Coefficients in this step reflect cross-study replicability. The research will evaluate a range of specific prediction techniques within this paradigm, investigate their statistical properties theoretically and empirically, and compare them to existing alternative multi-study statistical methods.<br/><br/>This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Program Officer
Huixia Wanghuiwang@nsf.gov7032922279
Min Amd Letter Date
9/15/2021 - 4 years ago
Max Amd Letter Date
9/15/2021 - 4 years ago
ARRA Amount

Institutions

Name
Dana-Farber Cancer Institute
City
Boston
State
MA
Country
United States
Address
Office of Grants and Contracts
Postal Code
022155450
Phone Number
6176323940

Investigators

First Name
Giovanni
Last Name
Parmigiani
Email Address
gp@jimmy.harvard.edu
Start Date
9/15/2021 12:00:00 AM

First Name
Lorenzo
Last Name
Trippa
Email Address
ltrippa@jimmy.harvard.edu
Start Date
9/15/2021 12:00:00 AM

Program Element

Text
STATISTICS
Code
1269

Program Reference

Text
Machine Learning Theory

Advancing Statistical Methods for Multi-Study Predictions

Information

Owner

Award Id

Award Effective Date

Award Expiration Date

Award Amount

Award Instrument

Advancing Statistical Methods for Multi-Study Predictions

Program Officer

Min Amd Letter Date

Max Amd Letter Date

ARRA Amount

Institutions

Name

City

State

Country

Address

Postal Code

Phone Number

Investigators

First Name

Last Name

Email Address

Start Date

First Name

Last Name

Email Address

Start Date

Program Element

Text

Code

Program Reference

Text