Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters

Information

  • Award Id
    2403090
  • Award Effective Date
    10/1/2024
  • Award Expiration Date
    9/30/2027
  • Award Amount
    $ 150,000.00
  • Award Instrument
    Standard Grant


Machine Learning (ML) and Deep Learning (DL) workloads, and more specifically Deep Neural Network (DNN) workloads, are beginning to dominate the High-Performance Computing (HPC) arena. Today, massive computational resources are required to train even a single state-of-the-art deep learning model (e.g., a large language model, or LLM). As the need for training massive DNN models grows and expands from the private sector to NSF-supported scientists and engineers (who are more likely to use shared computing resources), efficient checkpointing is emerging as a critical need. Checkpointing not only helps deal with failures but also provides more scheduling flexibility on shared HPC resources, because a very long-running job can be broken into several shorter ones. The premise of the CropDL project is that efficient and automated application-level checkpoint and restart will be critical to facilitating the use of shared HPC clusters for long-running ML training tasks, drastically increasing the number of researchers who can successfully train large ML models for various applications. This project also contributes to education and diversity in multiple ways, for example: 1) introducing courses (or course material) that bring attention to ML-related workloads in undergraduate and graduate computer systems education; 2) integrating research tasks from this project with synergistic research programs at universities to increase the participation of women and underrepresented minority groups; and 3) supporting and training PhD students in their research, creating momentum for systems and cyberinfrastructure research related to emerging ML workloads and popularizing integrative research that combines the properties of these workloads with the complexities of modern HPC hardware.

The overarching goal of CropDL is to support application-level checkpoint/restart of deep learning applications for better resiliency, faster average completion time, and higher resource utilization. In particular, several properties of DL workloads (as compared to scientific computations) create distinct sets of opportunities and challenges for checkpointing: 1) limited communication patterns during parallel execution, which can enable efficient coordinated checkpoints; 2) many unique opportunities for compressing checkpoints, and possibly for taking uncoordinated checkpoints; and 3) malleable execution, where restarting on a different number of nodes is possible. Based on these observations, the first direction of this project is to exploit the properties of the DNN model(s) being trained during checkpointing. This includes asynchronous versioned checkpointing for DL applications under a wide variety of parallelism models, as well as content-based data reduction (compression and sparsification) techniques to reduce checkpoint volumes. The second direction focuses on using the resources of current and upcoming HPC systems efficiently while checkpointing. It formulates the tasks, data, and I/O requirements of DL applications as DAG representations and develops methods to schedule them. It also supports efficient I/O for deep learning applications on emerging I/O platforms. The last direction is to automate checkpointing through a compilation system based on the computational graph of DL workloads. All these efforts consider a variety of parallelization schemes for DNNs, i.e., data, model, and/or pipelined parallelism.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
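The abstract's central idea, that application-level checkpoint/restart lets one long training run be split into several shorter jobs, can be illustrated with a minimal sketch. This is not the CropDL implementation: the function names (`save_checkpoint`, `load_latest_checkpoint`, `train_segment`), the JSON file layout, and the toy "weight" update are all assumptions made for illustration, and the sketch covers only synchronous versioned checkpoints on a single process.

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_dir, state):
    """Write a versioned checkpoint: temp file first, then rename into place."""
    os.makedirs(ckpt_dir, exist_ok=True)
    # Zero-padded step numbers keep file names in step order when sorted.
    final_path = os.path.join(ckpt_dir, f"ckpt_{state['step']:08d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # The rename means a crash mid-write never leaves a partial checkpoint
    # under the final name.
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir):
    """Return the saved state with the highest step, or None if none exists."""
    if not os.path.isdir(ckpt_dir):
        return None
    names = sorted(
        n for n in os.listdir(ckpt_dir)
        if n.startswith("ckpt_") and n.endswith(".json")
    )
    if not names:
        return None
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        return json.load(f)


def train_segment(ckpt_dir, total_steps, steps_per_job, ckpt_every):
    """Run at most steps_per_job steps, resuming from the latest checkpoint."""
    state = load_latest_checkpoint(ckpt_dir) or {"step": 0, "weight": 0.0}
    budget_end = min(total_steps, state["step"] + steps_per_job)
    while state["step"] < budget_end:
        # Toy update standing in for a real optimizer step (hypothetical).
        state["weight"] += 0.5
        state["step"] += 1
        # Checkpoint periodically and always at the end of this job's budget.
        if state["step"] % ckpt_every == 0 or state["step"] == budget_end:
            save_checkpoint(ckpt_dir, state)
    return state
```

Calling `train_segment` repeatedly with the same directory mimics submitting several short jobs to a shared cluster: each call picks up where the previous one stopped, and a call made after `total_steps` has been reached does no further work.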

  • Program Officer
    Varun Chandola, vchandol@nsf.gov, 7032922656
  • Min Amd Letter Date
    4/15/2024
  • Max Amd Letter Date
    4/15/2024
  • ARRA Amount

Institutions

  • Name
    University of Georgia Research Foundation Inc
  • City
    ATHENS
  • State
    GA
  • Country
    United States
  • Address
    310 E CAMPUS RD RM 409
  • Postal Code
    306021589
  • Phone Number
    7065425939

Investigators

  • First Name
    Wei
  • Last Name
    Niu
  • Email Address
    wniu@uga.edu
  • Start Date
    4/15/2024 12:00:00 AM

Program Element

  • Text
    OAC-Advanced Cyberinfrast Core

Program Reference

  • Text
    NSCI: National Strategic Computing Initi
  • Text
    SMALL PROJECT
  • Code
    7923