Collaborative Research: Elements: VLCC-States: Versioned Lineage-Driven Checkpointing of Composable States

Information

  • Award Id
    2411386
  • Award Effective Date
    10/1/2024
  • Award Expiration Date
    9/30/2027
  • Award Amount
    $ 300,000.00
  • Award Instrument
    Standard Grant

Checkpointing is a fundamental pattern used by a variety of scientific applications at both small and large computing scales. Widely adopted for resilience purposes by long-running applications (i.e., checkpoint-restart), it has seen an explosion of additional use cases that directly help applications progress faster and reduce time-to-solution even in the absence of failures. Adjoint computations (essential in financial modeling, weather prediction, computational fluid dynamics, seismic imaging, and control theory) need to capture a history of checkpoints in a forward pass, which are then revisited in a backward pass. Training artificial intelligence models, increasingly used by scientific applications, often results in trajectories that do not converge or that lead to undesirable patterns, prompting the need to backtrack to an earlier checkpoint of the learning model and try an alternative. Transfer learning and fine-tuning from a previous checkpoint of a learning model can adapt the training more quickly, avoiding expensive training from scratch. Many other use cases are important in scientific computing: suspend-resume (e.g., to preempt a long-running job in favor of a higher priority job), migration (checkpoint on one machine, restart on another), debugging (replay a problematic code region to reproduce errors without starting from scratch), and reproducibility (checkpoint and compare intermediate data during repeated runs). Despite broad applicability, current state-of-the-art solutions lack the flexibility, performance, and scalability needed to address these scenarios efficiently. The Versioned Lineage-Driven Checkpointing of Composable States (VLCC-States) project aims to fill this gap. It will streamline the development and use of checkpointing patterns for scientific applications, which simplifies integration efforts and improves their reusability across different communities, improves awareness of the multitude of checkpointing scenarios, reduces development effort and cost, and enables flexible customization to extract the best performance and scalability for the desired application scenario.

VLCC-States provides technical innovation in three areas. First, it introduces composable providers of intermediate states, which hide the complexity of capturing and assembling checkpoints of distributed data structures and their transformations across different modules and programming languages while optimizing their layout to eliminate redundancies, reduce sizes, and improve performance. Second, it provides multi-level co-optimized caching and prefetching techniques, which enable scalable management of the life cycle of checkpoints for interleavings of capture and reuse operations on heterogeneous storage stacks under concurrency. Third, it develops specialized checkpointing tools for large Artificial Intelligence models, with a focus on integration with PyTorch and DeepSpeed, to enable users to transparently take advantage of high-performance and scalable checkpointing using a familiar API. This project will engage partners in industry and national research laboratories to co-design VLCC-States, tune its capabilities, and evaluate its implementation. This project will undertake educational and broadening participation activities to improve community awareness and understanding of challenges in scientific data management.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

  • Program Officer
    Marlon Pierce, mpierce@nsf.gov, 7032927743
  • Min Amd Letter Date
    4/30/2024
  • Max Amd Letter Date
    4/30/2024
  • ARRA Amount

Institutions

  • Name
    University of Chicago
  • City
    CHICAGO
  • State
    IL
  • Country
    United States
  • Address
    5801 S ELLIS AVE
  • Postal Code
    606375418
  • Phone Number
    7737028669

Investigators

  • First Name
    Bogdan
  • Last Name
    Nicolae
  • Email Address
    bogdan.nicolae@acm.org
  • Start Date
    4/30/2024 12:00:00 AM

Program Element

  • Text
    Software Institutes
  • Code
    800400