EAGER: Exploring Automatic Optimization of Multi-tiered HPC Storage Systems via Practical Reinforcement Learning

Information

  • Award Id
    2412345
  • Award Effective Date
    7/1/2024
  • Award Expiration Date
    6/30/2025
  • Award Amount
    $ 133,980.00
  • Award Instrument
    Standard Grant

Scientific discovery increasingly involves generating and analyzing large amounts of data. These data-intensive scientific applications pose significant challenges to the storage systems of high-performance computing (HPC) clusters, which are heterogeneous and extremely complex. Scientists who need high-speed data access often struggle to use these heterogeneous storage options effectively. There is a need to build the long-missing automated HPC I/O (Input/Output) middleware that transparently helps scientists achieve optimal data access performance without manual effort. Designing such middleware for large-scale, heterogeneous, and shared HPC storage systems is an extremely challenging task. The researchers supported by this grant plan to leverage machine learning techniques to understand I/O requests and the current system status, and to schedule and coordinate those requests intelligently and adaptively. The outcomes of this research are expected to work with existing storage components and to minimize the impact on both scientific applications and HPC systems.

This project plans to tackle this grand challenge by exploring practical reinforcement learning (RL)-based methods and building the relevant software infrastructure in an HPC environment. The project has two main focuses: 1) RL-based data placement for high storage utilization, and 2) RL-based I/O coordination for shared storage. Both tasks depend on identifying effective reinforcement learning methods and integrating them into HPC systems. To achieve this goal, a novel, system-centric reinforcement learning framework will be developed. Moreover, in each research focus, various RL algorithms, deep neural network designs, and reward-shaping approaches will be proposed, implemented, rigorously benchmarked, and compared with state-of-the-art solutions.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

  • Program Officer
    Almadena Chtchelkanova, achtchel@nsf.gov, 7032927498
  • Min Amd Letter Date
    2/9/2024
  • Max Amd Letter Date
    2/9/2024
  • ARRA Amount

Institutions

  • Name
    University of North Carolina at Charlotte
  • City
    CHARLOTTE
  • State
    NC
  • Country
    United States
  • Address
    9201 UNIVERSITY CITY BLVD
  • Postal Code
    282230001
  • Phone Number
    7046871888

Investigators

  • First Name
    Dong
  • Last Name
    Dai
  • Email Address
    dai@udel.edu
  • Start Date
    2/9/2024 12:00:00 AM

Program Element

  • Text
    Software & Hardware Foundation
  • Code
    779800

Program Reference

  • Text
    EAGER
  • Code
    7916
  • Text
    HIGH-PERFORMANCE COMPUTING
  • Code
    7942