Imputing single cell RNA sequencing data: Mathematical, statistical and computational challenges

Information

  • Research Project
  • ApplicationId
    10242066
  • Core Project Number
    R01GM135928
  • Full Project Number
    5R01GM135928-03
  • Serial Number
    135928
  • FOA Number
    PAR-19-001
  • Sub Project Id
  • Project Start Date
    9/23/2019
  • Project End Date
    8/31/2022
  • Program Officer Name
    BRAZHNIK, PAUL
  • Budget Start Date
    9/1/2021
  • Budget End Date
    8/31/2022
  • Fiscal Year
    2021
  • Support Year
    03
  • Suffix
  • Award Notice Date
    8/22/2021


Novel single cell RNA sequencing (scRNA-seq) technologies can simultaneously measure the expression levels of all 30,000 genes over thousands to millions of individual cells. The analysis of scRNA-seq data has already led to fundamental advances in biology, including discovery of new cell types, detection of subtle differences between similar cells, and reconstruction of cellular developmental trajectories. Single-cell measurements involve amplification of tiny amounts of RNA and result in extremely sparse data matrices with many zeros. While some of these zeros are due to missing data (dropouts), others represent true biological inactivity. Yet many scRNA-seq imputation methods treat all observed zero entries identically, leading to imputed matrices that often overestimate transcriptional activity. Other methods that do attempt to distinguish biological zeros from dropouts lack rigorous theoretical guarantees. The goals of this proposal are to develop models, supporting mathematical theory, and computational tools that explicitly take the existence of true biological zeros into account. Matrix imputation under this constraint involves both computational challenges and theoretical questions in random matrix theory and high dimensional statistics. These include rank estimation, low rank sparse matrix recovery from partially observed data, and biclustering in the presence of dropouts and zeros. We plan to develop novel approaches based on non-smooth continuous optimization and to derive accompanying statistical guarantees. We also plan to develop ensemble learning approaches that cleverly combine the outputs of multiple imputation algorithms. Finally, we hope to gain important insights regarding recovery from such data via a study of minimax rates and information lower bounds.
To address these challenges, we will build on our promising preliminary results and the joint expertise of the investigators in spectral methods, high dimensional statistics, matrix analysis, numerical optimization, and genomics.
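
As an illustration of the "low rank matrix recovery from partially observed data" problem named above, the following is a minimal sketch of a generic soft-thresholded SVD imputation loop (often called soft-impute). It is not the investigators' method. The function name `soft_impute`, the parameters `lam`, `n_iter`, and `tol`, and above all the `observed` mask are assumptions made for this example: in real scRNA-seq data it is not known which zeros are dropouts and which are true biological zeros, which is exactly the difficulty the proposal targets.

```python
import numpy as np

def soft_impute(X, observed, lam=0.1, n_iter=200, tol=1e-6):
    """Fill entries where observed is False with a low-rank estimate.

    X        : (cells x genes) expression matrix
    observed : boolean mask, True where the entry is trusted
               (ASSUMPTION: the mask is given; in practice it is unknown)
    lam      : shrinkage applied to the singular values (hypothetical default)
    """
    X = np.asarray(X, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    Z = np.where(observed, X, 0.0)  # initial guess: zeros at untrusted entries
    for _ in range(n_iter):
        # Keep trusted entries fixed; use the current estimate elsewhere.
        filled = np.where(observed, X, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)  # soft-threshold the singular values
        Z_new = (U * s) @ Vt
        converged = np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1.0)
        Z = Z_new
        if converged:
            break
    # Expression levels cannot be negative, so clip the imputed values.
    return np.where(observed, X, np.maximum(Z, 0.0))

# Synthetic usage example: a rank-2 nonnegative matrix with ~30% hidden entries.
rng = np.random.default_rng(0)
truth = rng.random((50, 2)) @ rng.random((2, 40))
mask = rng.random(truth.shape) > 0.3
imputed = soft_impute(np.where(mask, truth, 0.0), mask)
```

Treating every zero as unobserved would correspond to passing `observed = X > 0`, which ignores true biological zeros and tends to overestimate expression, as the abstract notes.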

IC Name
NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES
  • Activity
    R01
  • Administering IC
    GM
  • Application Type
    5
  • Direct Cost Amount
    199947
  • Indirect Cost Amount
    23319
  • Total Cost
    223266
  • Sub Project Total Cost
  • ARRA Funded
    False
  • CFDA Code
    859
  • Ed Inst. Type
  • Funding ICs
    NIGMS:223266
  • Funding Mechanism
    Non-SBIR/STTR RPGs
  • Study Section
    ZGM1
  • Study Section Name
    Special Emphasis Panel
  • Organization Name
    NORTH CAROLINA STATE UNIVERSITY RALEIGH
  • Organization Department
  • Organization DUNS
    042092122
  • Organization City
    RALEIGH
  • Organization State
    NC
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    276957514
  • Organization District
    UNITED STATES