BIGDATA: Mid-Scale: DA: ESCE: Collaborative Research: Scalable Statistical Computing for Emerging Omics Data Streams

Information

  • NSF Award
  • 1247813
Owner
  • Award Id
    1247813
  • Award Effective Date
    8/1/2013
  • Award Expiration Date
    7/31/2015
  • Award Amount
    $ 1,200,000.00
  • Award Instrument
    Standard Grant


Bioinformatic data sets are large and complicated. Marshalling and managing the necessary resources (e.g., hardware, computer time, and programmer time) requires significant skill. Effective analysis and comprehension require sophisticated statistical understanding. Domains of application and available data types change rapidly, requiring flexible and familiar programming environments. Collaborations involve diverse research groups of heterogeneous size and expertise. This project develops and disseminates new and efficient approaches to solving present and emerging problems in statistical analysis and interpretation of very large data. The project combines the strengths of two very widely used and complementary bioinformatics projects, Bioconductor and Galaxy.

The project has three components. The first, providing scalable access, develops R programming paradigms appropriate for scalable analysis. R/Bioconductor software will be developed for efficient reduction of large data to statistical descriptions by iterating data through transformation kernels. Bioconductor will be deployed for use in an accessible cloud-based environment, and will be integrated into the Galaxy deployment scheme. The second component provides statistical methods for big genomic data by developing high-performance statistical methodologies for analysis of large bioinformatics data. This applies the initial technical achievements to specific requirements of statistical analysis in genomics. Domains of application include: quality assessment and normalization of very large raw data; data reduction and uncertainty measure calculation for downstream interrogation; and discovery, reporting, and auditing of novel biological findings. Developments require novel computational approaches that avoid all-data-in-memory computational models (prevalent in current algorithm implementations), and that re-express monolithic algorithms as concurrently executable independent components. This approach emphasizes extensible and composable elements to yield a richer toolkit for statistical genomics. The aim leverages R's strength as a language for rapid development of statistical methodologies, and emphasizes areas of proven strength in the Bioconductor project. The third component addresses decision making. This aspect integrates R/Bioconductor workflows into Galaxy. We will deploy key results from Aim 2 as Galaxy workflows. New real-time feedback for streaming analytics will be introduced to Galaxy, and leveraged by Bioconductor.

The project includes very significant capacity building. The Bioconductor project successfully solicits, tests, and disseminates over 600 R packages for the statistical analysis and comprehension of high-throughput genomic data. All packages include extensive documentation, including vignettes describing intent, function, and interoperability. Packages reflect contributions from a broad scientific community, and enable national and international graduate, post-graduate, and commercial research activities in statistical, bioinformatic, and computational domains. This project furthers the capacity-building impact of Bioconductor by addressing memory and performance limitations to statistical analysis of large and complicated bioinformatic data. Galaxy enables broad access to computational resources for data-intensive biomedical research. This project enhances the capacity-building impacts of Galaxy by providing scalable processing of big bioinformatic data, and enabling exploratory analysis by a broad bioinformatic community. The coupling of Bioconductor and Galaxy provides significant synergy, facilitating rapid translation of statistical and bioinformatic research developed in R to broad use through Galaxy.
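
As an illustration only, the abstract's idea of reducing large data to statistical descriptions by iterating chunks through a transformation kernel, without holding all data in memory, might be sketched as follows. The sketch is in Python rather than R, and every name in it (Summary, summarize, merge, chunks, streaming_summary) is hypothetical; it is not taken from Bioconductor or Galaxy. It assumes the statistic of interest is a mean and sample variance, and combines per-chunk summaries with the pairwise update formula of Chan et al.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Summary:
    """Mergeable description of a numeric stream (hypothetical name)."""
    n: int = 0          # number of observations seen
    mean: float = 0.0   # running mean
    m2: float = 0.0     # sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def summarize(chunk: List[float]) -> Summary:
    """Kernel: reduce one chunk that fits in memory to a Summary."""
    n = len(chunk)
    if n == 0:
        return Summary()
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return Summary(n, mean, m2)


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two independent summaries (pairwise update of Chan et al.)."""
    n = a.n + b.n
    if n == 0:
        return Summary()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` values from any iterable."""
    buf: List[float] = []
    for v in values:
        buf.append(v)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


def streaming_summary(values: Iterable[float], size: int = 1000) -> Summary:
    """Summarize each chunk independently, then merge the results."""
    return reduce(merge, map(summarize, chunks(values, size)), Summary())
```

Because each call to summarize depends only on its own chunk, the map step could in principle be handed to a pool of workers, which is one way to read "concurrently executable independent components"; only the small Summary objects need to be brought together at the end.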

  • Program Officer
    Sylvia J. Spengler
  • Min Amd Letter Date
    7/23/2013
  • Max Amd Letter Date
    1/24/2014
  • ARRA Amount

Institutions

  • Name
    Fred Hutchinson Cancer Research Center
  • City
    Seattle
  • State
    WA
  • Country
    United States
  • Address
    1100 FAIRVIEW AVE N J6-300
  • Postal Code
    981094433
  • Phone Number
    2066674868

Investigators

  • First Name
    Martin
  • Last Name
    Morgan
  • Email Address
    mtmorgan@fhcrc.org
  • Start Date
    7/23/2013 12:00:00 AM

Program Element

  • Text
    INFORMATION TECHNOLOGY RESEARC
  • Code
    1640
  • Text
    INFO INTEGRATION & INFORMATICS
  • Code
    7364
  • Text
    Big Data Science & Engineering
  • Code
    8083

Program Reference

  • Text
    INFORMATION TECHNOLOGY RESEARC
  • Code
    1640
  • Text
    CyberInfra Frmwrk 21st (CIF21)
  • Code
    7433
  • Text
    MEDIUM PROJECT
  • Code
    7924
  • Text
    Big Data Science & Engineering
  • Code
    8083