Advancing Statistical Methods for Multi-Study Predictions

Information

  • NSF Award
  • 2113707
Owner
  • Award Id
    2113707
  • Award Effective Date
    9/15/2021 - 2 years ago
  • Award Expiration Date
    8/31/2024 - 4 months from now
  • Award Amount
    $ 116,669.00
  • Award Instrument
    Continuing Grant

Advancing Statistical Methods for Multi-Study Predictions

Scientific predictions are more valuable when their accuracy is stable across studies. Steps towards more replicable predictions would increase public confidence in the scientific process, facilitate dissemination of results, and enhance public engagement with science and technology. As many areas of science and technology are becoming data-rich, multiple datasets are more commonly available for training prediction models. This project aims to develop new general strategies for multi-study ensemble learning and prediction that enhance replicability. The long-term goals are to help improve the well-being of individuals in society (for example, via algorithms for individualized disease prevention and medical treatment), improve national security, and benefit industries that use prediction approaches as key elements of their business plans (including for example finance, marketing, and real estate). The investigators plan to build free, open-source software to implement the successful strategies and to provide research training opportunities to graduate students. <br/><br/>Prior work developed a novel strategy for multi-study prediction, which groups together prediction models, each trained on a single study, and weights them to reward those that perform well outside their training study. This technique, called multi-study ensembling, shows promise to substantially improve prediction replicability. In this project, the investigators plan to generalize this approach in two ways. The first goal is to extend the study to include resampling concepts tailored to the multi-study setting. Specifically, they will consider the study strap ensemble, which fits models to bootstrapped datasets, or "pseudo-studies." These are generated by resampling from multiple studies with a hierarchical resampling scheme that generalizes the randomized cluster bootstrap. The study strap is controlled by a tuning parameter that determines the proportion of observations to draw from each study. When the parameter is set to its lowest value, each pseudo-study is resampled from only a single study. When it is high, the study strap ignores the multi-study structure and generates pseudo-studies by merging the datasets and drawing observations like a standard bootstrap. The second goal is to extend the concept of weight by building statistical models on the weights themselves to both handle high dimensionality and exploit useful structure of the multi-study collection. The work will be carried out within the framework of multi-study stacking, where predictions generated by study-specific models are used as features in a second-stage analysis (typically a regularized regression) performed on the merged dataset collection. Coefficients in this step reflect cross-study replicability. The research will evaluate a range of specific prediction techniques within this paradigm, investigate their statistical properties theoretically and empirically, and compare them to existing alternative multi-study statistical methods.<br/><br/>This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

  • Program Officer
    Huixia Wanghuiwang@nsf.gov7032922279
  • Min Amd Letter Date
    9/15/2021 - 2 years ago
  • Max Amd Letter Date
    9/15/2021 - 2 years ago
  • ARRA Amount

Institutions

  • Name
    Dana-Farber Cancer Institute
  • City
    Boston
  • State
    MA
  • Country
    United States
  • Address
    Office of Grants and Contracts
  • Postal Code
    022155450
  • Phone Number
    6176323940

Investigators

  • First Name
    Giovanni
  • Last Name
    Parmigiani
  • Email Address
    gp@jimmy.harvard.edu
  • Start Date
    9/15/2021 12:00:00 AM
  • First Name
    Lorenzo
  • Last Name
    Trippa
  • Email Address
    ltrippa@jimmy.harvard.edu
  • Start Date
    9/15/2021 12:00:00 AM

Program Element

  • Text
    STATISTICS
  • Code
    1269

Program Reference

  • Text
    Machine Learning Theory