BIGDATA: Mid-Scale: DA: ESCE: Collaborative Research: Scalable Statistical Computing for Emerging Omics Data Streams

Information

NSF Award
1247813

Owner

Fred Hutchinson Cancer Research Center, Inc.

Award Id
1247813
Award Effective Date
8/1/2013
Award Expiration Date
7/31/2015
Award Amount
$1,200,000.00
Award Instrument
Standard Grant

Information

Bioinformatic data sets are large and complicated. Marshalling and managing the necessary resources (e.g., hardware, computer time, and programmer time) requires significant skill. Effective analysis and comprehension involve sophisticated statistical understanding. Domains of application and available data types change rapidly, requiring flexible and familiar programming environments. Collaborations involve diverse research groups of heterogeneous size and expertise. This project develops and disseminates new and efficient approaches to solving present and emerging problems in the statistical analysis and interpretation of very large data. The project combines the strengths of two very widely used and complementary bioinformatics projects, Bioconductor and Galaxy.

The project has three components. The first, providing scalable access, develops R programming paradigms appropriate for scalable analysis. R/Bioconductor software will be developed for efficient reduction of large data to statistical descriptions by iterating data through transformation kernels. Bioconductor will be deployed for use in an accessible cloud-based environment, and will be integrated into the Galaxy deployment scheme.

The second component is to provide statistical methods for big genomic data by developing high-performance statistical methodologies for the analysis of large bioinformatics data. This applies the initial technical achievements to specific requirements of statistical analysis in genomics. Domains of application include: quality assessment and normalization of very large raw data; data reduction and uncertainty measure calculation for downstream interrogation; and discovery, reporting, and auditing of novel biological findings. Developments require novel computational approaches that avoid all-data-in-memory computational models (prevalent in current algorithm implementations), and that re-express monolithic algorithms as concurrently executable independent components. This emphasizes extensible and composable elements to yield a richer toolkit for statistical genomics. The aim leverages R's strength as a language for rapid development of statistical methodologies, and emphasizes areas of proven strength in the Bioconductor project.

The third component addresses decision making. This aspect provides integration of R/Bioconductor workflows into Galaxy. We will deploy key results from Aim 2 as Galaxy workflows. New real-time feedback for streaming analytics will be introduced to Galaxy, and leveraged by Bioconductor.

The project includes very significant capacity building. The Bioconductor project successfully solicits, tests, and disseminates over 600 R packages for the statistical analysis and comprehension of high-throughput genomic data. All packages include extensive documentation, including vignettes describing intent, function, and interoperability. Packages reflect contributions from a broad scientific community, and enable national and international graduate, post-graduate, and commercial research activities in statistical, bioinformatic, and computational domains. This project furthers the capacity-building impact of Bioconductor by addressing memory and performance limitations to statistical analysis of large and complicated bioinformatic data. Galaxy enables broad access to computational resources for data-intensive biomedical research. This project enhances the capacity-building impacts of Galaxy by providing scalable processing of big bioinformatic data, and enabling exploratory analysis by a broad bioinformatic community. The coupling of Bioconductor and Galaxy provides significant synergy, facilitating rapid translation of statistical and bioinformatic research developed in R to broad use through Galaxy.
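
The abstract's idea of reducing large data to statistical descriptions by iterating it through transformation kernels, without holding all data in memory, can be illustrated with a small sketch. This is not the project's software: it is written in Python rather than R purely for illustration, and every name in it (`Summary`, `summarize_chunk`, `merge`, `chunks`, `streaming_summary`) is hypothetical. It assumes the statistic of interest is a mean and sample variance. Each chunk is reduced independently to a small summary, and summaries are then combined with the standard pairwise update for mean and variance, so the per-chunk step could in principle run concurrently.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Summary:
    """Small statistical description of the data seen so far (hypothetical type)."""
    n: int = 0          # number of observations
    mean: float = 0.0   # running mean
    m2: float = 0.0     # sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def summarize_chunk(chunk: List[float]) -> Summary:
    """Kernel: reduce one chunk to a Summary using Welford's online update."""
    s = Summary()
    for x in chunk:
        s.n += 1
        delta = x - s.mean
        s.mean += delta / s.n
        s.m2 += delta * (x - s.mean)
    return s


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two independently computed summaries (pairwise update)."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` values from any iterable."""
    buf: List[float] = []
    for v in values:
        buf.append(v)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


def streaming_summary(values: Iterable[float], size: int = 1000) -> Summary:
    """Only one chunk plus one Summary is held in memory at any time."""
    total = Summary()
    for chunk in chunks(values, size):
        total = merge(total, summarize_chunk(chunk))
    return total
```

For example, `streaming_summary(range(1, 11), size=3)` processes the values 1 through 10 in chunks of three and yields the same count, mean, and variance as a single pass over all the data. Because `merge` does not depend on the order in which chunk summaries arrive, the chunk-level work is a candidate for the concurrent execution the abstract describes.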

Program Officer
Sylvia J. Spengler
Min Amd Letter Date
7/23/2013
Max Amd Letter Date
1/24/2014
ARRA Amount

Institutions

Name
Fred Hutchinson Cancer Research Center
City
Seattle
State
WA
Country
United States
Address
1100 FAIRVIEW AVE N J6-300
Postal Code
98109-4433
Phone Number
2066674868

Investigators

First Name
Martin
Last Name
Morgan
Email Address
mtmorgan@fhcrc.org
Start Date
7/23/2013 12:00:00 AM

Program Element

Text
INFORMATION TECHNOLOGY RESEARC
Code
1640

Text
INFO INTEGRATION & INFORMATICS
Code
7364

Text
Big Data Science & Engineering
Code
8083

Program Reference

Text
INFORMATION TECHNOLOGY RESEARC
Code
1640

Text
CyberInfra Frmwrk 21st (CIF21)
Code
7433

Text
MEDIUM PROJECT
Code
7924

Text
Big Data Science & Engineering
Code
8083
