Scalable Learning with Ensemble Techniques and Parallel Computing

Information

  • Research Project
  • ApplicationId
    7433144
  • Core Project Number
    R44GM083965
  • Full Project Number
    1R44GM083965-01
  • Serial Number
    83965
  • FOA Number
    PAR-07-160
  • Sub Project Id
  • Project Start Date
    5/1/2008
  • Project End Date
    11/30/2008
  • Program Officer Name
    COUCH, JENNIFER A
  • Budget Start Date
    5/1/2008
  • Budget End Date
    11/30/2008
  • Fiscal Year
    2008
  • Support Year
    1
  • Suffix
  • Award Notice Date
    4/24/2008

DESCRIPTION (provided by applicant): The ability to conduct basic and applied biomedical research is becoming increasingly dependent on data produced by new and emerging technologies. These data have an unprecedented amount of detail and volume. Researchers therefore depend on computing and computational tools to visualize, analyze, model, and interpret these large and complex data sets. Tools for disease detection, diagnosis, treatment, and prevention are common goals of many, if not all, biomedical research programs. Sound analytical and statistical theory and methodology for class prediction and class discovery lay the foundation for building these tools; the machine learning techniques of classification (supervised learning) and clustering (unsupervised learning) are crucial among them. Our goal is to produce software for analysis and interpretation of large data sets using ensemble machine learning techniques and parallel computing technologies.

Ensemble techniques are recent advances in machine learning theory and methodology that lead to great improvements in accuracy and stability in data set analysis and interpretation. The results from a committee of primary machine learners (classifiers or clusterers) that have been trained on different instance or feature subsets are combined through techniques such as voting. The high prediction accuracy of classifier ensembles (such as boosting, bagging, and random forests) has generated much excitement in the statistics and machine learning communities. Recent research extends the ensemble methodology to clustering, where class information is unavailable, also yielding superior performance in terms of accuracy and stability.

In theory, most ensemble techniques are inherently parallel. However, existing implementations are generally serial and assume the data set is memory resident. Therefore, current software will not scale to the large data sets produced in today's biomedical research.
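To make the committee idea above concrete, here is a minimal sketch of bagging with majority voting. It is an illustration only, not the applicant's software: the decision-stump base learner and the names `train_stump`, `bagging`, and `vote` are assumptions chosen for brevity, and the data are assumed to fit in memory.

```python
import random
from collections import Counter


def train_stump(rows):
    """Fit a decision stump (one feature, one threshold) to (features, label) rows.

    Hypothetical base learner: any classifier could take its place.
    """
    best = None
    n_features = len(rows[0][0])
    for f in range(n_features):
        for x, _ in rows:
            t = x[f]  # candidate threshold taken from the data
            left = [y for xs, y in rows if xs[f] <= t]
            right = [y for xs, y in rows if xs[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            # If nothing falls to the right, reuse the left label.
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda x: left_label if x[f] <= t else right_label


def bagging(rows, n_learners=25, seed=0):
    """Train each learner on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        sample = [rng.choice(rows) for _ in rows]
        learners.append(train_stump(sample))
    return learners


def vote(learners, x):
    """Combine the committee's predictions by majority vote."""
    return Counter(learner(x) for learner in learners).most_common(1)[0][0]
```

Each call to `train_stump` is independent of the others, which is the sense in which such ensembles are parallel in principle even when an implementation runs them in a serial loop, as this one does.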
We propose to take two approaches to scale ensemble techniques to large data sets: data partitioning and parallel computing. The focus of Phase I will be to prototype scalable classifier ensembles using parallel architectures. We intend to: establish the parallel computing infrastructures; produce a preliminary architecture and software design; investigate a wide range of ensemble generation schemes using data partitioning strategies; and implement scalable bagging and random forests based on the preliminary design. The focus of Phase II will be to complete the software architecture and implement the scalable classifier ensembles and scalable clusterer ensembles within this framework. We intend to: complete research and development of classifier ensembles; extend the classification framework to clusterer ensembles; research and develop a unified interface for building ensembles with differing generation mechanisms and combination strategies; and evaluate the effectiveness of the software on simulated and real data.

PUBLIC HEALTH RELEVANCE: Common goals of many, if not all, biomedical research programs are the development of tools for disease detection, diagnosis, treatment, and prevention. These programs often rely on new types of data that have an unprecedented amount of detail and volume. Our goal is to produce software for the analysis and interpretation of large data sets using ensemble machine learning techniques and parallel computing technologies, so that researchers who depend on computational tools can visualize, analyze, model, and interpret these large and complex data sets.
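The two scaling approaches named in the Phase I plan, data partitioning and parallel training, can be sketched together as follows. This is a hedged illustration under stated assumptions, not the proposed architecture: the nearest-centroid base learner, the strided partitioning, and the names `partition`, `train_centroids`, `train_partitioned`, and `vote` are all hypothetical. Threads are used only to keep the sketch self-contained; CPU-bound training in CPython would more likely use `ProcessPoolExecutor`, which offers the same `map` interface.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts disjoint chunks by striding through the data.

    Assumes n_parts <= len(rows), so that no chunk is empty.
    """
    return [rows[i::n_parts] for i in range(n_parts)]


def train_centroids(rows):
    """Fit a nearest-centroid classifier: one mean vector per class label."""
    sums, counts = {}, Counter()
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] += 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict_one(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))


def train_partitioned(rows, n_parts=4, executor_cls=ThreadPoolExecutor):
    """Train one model per partition, with the partitions handled concurrently."""
    chunks = partition(rows, n_parts)
    with executor_cls(max_workers=n_parts) as pool:
        return list(pool.map(train_centroids, chunks))


def vote(models, x):
    """Combine the per-partition models by majority vote."""
    return Counter(predict_one(m, x) for m in models).most_common(1)[0][0]
```

Because each worker sees only its own chunk, no single learner needs the full data set in memory, which is the property the partitioning approach relies on.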

IC Name
NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES
  • Activity
    R44
  • Administering IC
    GM
  • Application Type
    1
  • Direct Cost Amount
  • Indirect Cost Amount
  • Total Cost
    25548
  • Sub Project Total Cost
  • ARRA Funded
  • CFDA Code
    859
  • Ed Inst. Type
  • Funding ICs
    NIGMS:25548
  • Funding Mechanism
  • Study Section
    ZRG1
  • Study Section Name
    Special Emphasis Panel
  • Organization Name
    INSIGHTFUL CORPORATION
  • Organization Department
  • Organization DUNS
    150683779
  • Organization City
    SEATTLE
  • Organization State
    WA
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    98109
  • Organization District
    UNITED STATES