Scalable Learning with Ensemble Techniques and Parallel Computing

Information

Research Project
7433144

ApplicationId
7433144
Core Project Number
R44GM083965
Full Project Number
1R44GM083965-01
Serial Number
83965
FOA Number
PAR-07-160
Sub Project Id

Project Start Date
5/1/2008
Project End Date
11/30/2008
Program Officer Name
COUCH, JENNIFER A
Budget Start Date
5/1/2008
Budget End Date
11/30/2008
Fiscal Year
2008
Support Year
1
Suffix
Award Notice Date
4/24/2008

Organizations

Insightful Corporation

Information

DESCRIPTION (provided by applicant): The ability to conduct basic and applied biomedical research is becoming increasingly dependent on data produced by new and emerging technologies. This data has an unprecedented amount of detail and volume. Researchers are therefore dependent on computing and computational tools to be able to visualize, analyze, model, and interpret these large and complex sets of data. Tools for disease detection, diagnosis, treatment, and prevention are common goals of many, if not all, biomedical research programs. Sound analytical and statistical theory and methodology for class prediction and class discovery lay the foundation for building these tools, of which the machine learning techniques of classification (supervised learning) and clustering (unsupervised learning) are crucial. Our goal is to produce software for analysis and interpretation of large data sets using ensemble machine learning techniques and parallel computing technologies.

Ensemble techniques are recent advances in machine learning theory and methodology leading to great improvements in accuracy and stability in data set analysis and interpretation. The results from a committee of primary machine learners (classifiers or clusterers) that have been trained on different instance or feature subsets are combined through techniques such as voting. The high prediction accuracy of classifier ensembles (such as boosting, bagging, and random forests) has generated much excitement in the statistics and machine learning communities. Recent research extends the ensemble methodology to clustering, where class information is unavailable, also yielding superior performance in terms of accuracy and stability. In theory, most ensemble techniques are inherently parallel. However, existing implementations are generally serial and assume the data set is memory resident. Therefore current software will not scale to the large data sets produced in today's biomedical research.

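
The committee-and-vote idea described above can be illustrated with a minimal bagging sketch. This is not the applicant's software: the threshold "stump" learner, the toy one-feature data set, and the names `train_stump`, `bagging_fit`, and `bagging_predict` are all assumptions made for illustration. Each learner is trained on a bootstrap resample of the instances, and predictions are combined by majority vote.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-feature threshold classifier: predict 1 when x > threshold.

    Candidate thresholds are the x values seen in the sample; the one with
    the fewest training errors wins (first one found on ties).
    """
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(data, n_learners, rng):
    """Train each stump on a bootstrap resample (drawn with replacement)."""
    return [
        train_stump([rng.choice(data) for _ in data])
        for _ in range(n_learners)
    ]

def bagging_predict(thresholds, x):
    """Combine the committee's predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: the label is 1 when x >= 5.0.
rng = random.Random(0)
data = [(x / 10, int(x >= 50)) for x in range(100)]

# An odd committee size avoids tied votes in the two-class case.
ensemble = bagging_fit(data, n_learners=25, rng=rng)
print(bagging_predict(ensemble, 9.0), bagging_predict(ensemble, 1.0))
```

Because each call to `train_stump` depends only on its own resample, the learners could be trained independently, which is the sense in which such ensembles are inherently parallel.
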
We propose to take two approaches to scale ensemble techniques to large data sets: data partitioning approaches and parallel computing. The focus of Phase I will be to prototype scalable classifier ensembles using parallel architectures. We intend to: establish the parallel computing infrastructures; produce a preliminary architecture and software design; investigate a wide range of ensemble generation schemes using data partitioning strategies; and implement scalable bagging and random forests based on the preliminary design. The focus of Phase II will be to complete the software architecture and implement the scalable classifier ensembles and scalable clusterer ensembles within this framework. We intend to: complete research and development of classifier ensembles; extend the classification framework to clusterer ensembles; research and develop a unified interface for building ensembles with differing generation mechanisms and combination strategies; and evaluate the effectiveness of the software on simulated and real data.

PUBLIC HEALTH RELEVANCE: The common goals of many, if not all, biomedical research programs are the development of tools for disease detection, diagnosis, treatment, and prevention. These programs often rely on new types of data that have an unprecedented amount of detail and volume. Our goal is to produce software for the analysis and interpretation of large data sets using ensemble machine learning techniques and parallel computing technologies, so that researchers who depend on computational tools can visualize, analyze, model, and interpret these large and complex sets of data.
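
The two scaling approaches named in the Phase I plan, data partitioning and parallel computing, can be sketched together as follows. This is a hedged illustration only, not the proposed architecture: the nearest-centroid learner, the round-robin partitioning scheme, the toy data, and every function name here are assumptions. A thread pool stands in for whatever parallel backend would actually be used; in CPython it shows the structure rather than a real speedup for CPU-bound training.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_parts):
    """Split rows into n_parts disjoint chunks, assigned round-robin."""
    return [rows[i::n_parts] for i in range(n_parts)]

def train_centroids(chunk):
    """Nearest-centroid learner: store the mean feature vector per class."""
    sums, counts = {}, Counter()
    for features, label in chunk:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_one(centroids, features):
    """Return the label whose centroid is closest in squared distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def fit_partitioned(rows, n_parts, max_workers=4):
    """Train one learner per chunk; each worker sees only its own chunk."""
    chunks = partition(rows, n_parts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train_centroids, chunks))

def vote(models, features):
    """Combine the per-chunk learners by majority vote."""
    ballots = Counter(predict_one(m, features) for m in models)
    return ballots.most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes in two dimensions.
rows = [((float(i % 5), float(i % 3)), "low") for i in range(60)]
rows += [((10.0 + i % 5, 10.0 + i % 3), "high") for i in range(60)]

models = fit_partitioned(rows, n_parts=5)
print(vote(models, (1.0, 1.0)), vote(models, (12.0, 11.0)))
```

In this arrangement no single learner needs the whole data set in memory, which addresses the memory-resident assumption the description identifies in existing implementations.
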

IC Name

NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES

Activity
R44
Administering IC
GM
Application Type
1

Direct Cost Amount
Indirect Cost Amount
Total Cost
25548
Sub Project Total Cost

ARRA Funded
CFDA Code
859
Ed Inst. Type
Funding ICs
NIGMS:25548
Funding Mechanism
Study Section
ZRG1
Study Section Name
Special Emphasis Panel

Organization Name
INSIGHTFUL CORPORATION
Organization Department
Organization DUNS
150683779
Organization City
SEATTLE
Organization State
WA
Organization Country
UNITED STATES
Organization Zip Code
98109
Organization District
UNITED STATES
