Predictive Modeling with High-Dimensional Incomplete Data

Information

  • Research Project
  • ApplicationId
    10242955
  • Core Project Number
    R01GM140463
  • Full Project Number
    5R01GM140463-02
  • Serial Number
    140463
  • FOA Number
    PAR-19-001
  • Sub Project Id
  • Project Start Date
    9/1/2020
  • Project End Date
    8/31/2023
  • Program Officer Name
    BRAZHNIK, PAUL
  • Budget Start Date
    9/1/2021
  • Budget End Date
    8/31/2022
  • Fiscal Year
    2021
  • Support Year
    02
  • Suffix
  • Award Notice Date
    9/27/2021

Predictive Modeling with High-Dimensional Incomplete Data

Predictive modeling is the cornerstone of individualized health care. The outcome of interest is most frequently the presence or absence of a health condition, and a large number of predictors are commonly available for model building. Both high-dimensional data and missing data pose great challenges for statistical inference in predictive modeling. The overarching goal of this proposal is to address the methodological challenges of predicting binary outcomes with high-dimensional incomplete data. Specifically, the PIs propose to address these challenges from two perspectives: (1) quantify the uncertainty of risk predictions based on the high-dimensional logistic model; (2) accommodate two study designs in which missingness occurs in a structured way, namely the "positive-only" study design and the two-phase design. Recent years have seen great breakthroughs in statistical inference methods for analyzing high-dimensional data arising from a wide spectrum of scientific fields, with a focus primarily on a single regression coefficient in generalized linear models. Inferential methods for confidence interval construction and hypothesis testing for the predicted probability, which is a function of all regression coefficients, are largely lacking. We develop innovative statistical methods in this proposal to fill this methodological gap in high-dimensional data analysis. Our proposed methods are also innovative because they accommodate the structured incomplete data that arise from important sampling designs. To the best of our knowledge, statistical inference methods for high-dimensional data analysis have to date focused exclusively on complete data arising from cross-sectional study designs. We additionally consider two important study designs with incomplete data: the "positive-only" study design, which arises in EHR phenotyping, and the two-phase design, a cost-effective sampling design that aims to reduce the cost of measuring expensive predictors. We elucidate the methodological challenges of accommodating these missing-data issues in downstream analysis and provide corresponding solutions.

IC Name
NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES
  • Activity
    R01
  • Administering IC
    GM
  • Application Type
    5
  • Direct Cost Amount
    151357
  • Indirect Cost Amount
    32003
  • Total Cost
    183360
  • Sub Project Total Cost
  • ARRA Funded
    False
  • CFDA Code
    859
  • Ed Inst. Type
    SCHOOLS OF ARTS AND SCIENCES
  • Funding ICs
    NIGMS:183360
  • Funding Mechanism
    Non-SBIR/STTR RPGs
  • Study Section
    ZGM1
  • Study Section Name
    Special Emphasis Panel
  • Organization Name
    RUTGERS, THE STATE UNIV OF N.J.
  • Organization Department
    BIOSTATISTICS & OTHER MATH SCI
  • Organization DUNS
    001912864
  • Organization City
    PISCATAWAY
  • Organization State
    NJ
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    088543925
  • Organization District
    UNITED STATES