Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability

Information

  • NSF Award
  • 2405142
Owner
  • Award Id
    2405142
  • Award Effective Date
    10/1/2023 - a year ago
  • Award Expiration Date
    5/31/2025 - 4 months from now
  • Award Amount
    $ 42,696.00
  • Award Instrument
    Standard Grant

Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability

Advances throughout science and engineering have for several decades been driven by High Performance Computing (HPC), with the pace of discovery accelerating in concert with continued innovation in computing capabilities. But as semiconductor technology now faces fundamental physical limits, even while large-scale systems are reaching warehouse scales, new approaches are becoming essential to achieving efficient use of computing resources. In particular, given this divergence of scales, HPC systems have necessarily become more distributed and asynchronous (in the sense that system clocks are asynchronous), resulting in increasingly variable and unpredictable execution. While these effects are recognized as critical hindrances to HPC performance, the mechanisms are not yet fully understood. What is known, however, is that much HPC infrastructure is tasked with dealing with inefficiency derived from asynchrony, variability, and unpredictability, leading to a deep and complex hardware/software support stack. The project team's hypothesis is that while each stack element provides a local solution, it may also exacerbate the global problem: that complexity has resulted in more variability, not less, and made determining its causes more difficult. This project explores the possibility of reversing the trend of ever-increasing complexity by removing and simplifying support layers. This strategy’s achievable gains remain limited, however, while the underlying cause, execution asynchrony, remains unaddressed. The approach begins by leveraging recently developed technology that enables clocks to remain extremely accurate even when distributed on a planetary scale. Such accurate, distributed clocks serve to underpin a virtuous cycle where synchrony establishes baseline predictability, which, in turn, reduces variability, and at each stage of the cycle enables reduction in the complexity of the support stack. A benefit of this approach is that the individual steps are largely simple and can be applied directly to existing software systems. <br/><br/>This one-year project aims to obtain early findings and practical demonstrations for the importance of synchrony and predictability to increase HPC compute efficiency and thereby improve large-scale program execution. Five tasks are conducted. The first is to demonstrate the feasibility of accurate clock distribution by augmenting existing HPC network infrastructure. The second is to demonstrate the application of synchrony in the establishing a virtuous cycle enabling simplifications to the software/system support stack. The third is to devise mechanisms to model, measure, and validate systems using the proposed methods. The fourth is to investigate the relative benefits of applying the synchrony-based virtuous cycle with respect to various application classes. The fifth is to demonstrate the overall efficacy of the proposed approach through a case study involving a production application. Overall, the project works to determine whether added synchronization through accurate clocks enables significant improvements to HPC computations in terms of how efficiently they use computational resources.<br/><br/>This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

  • Program Officer
    Almadena Chtchelkanovaachtchel@nsf.gov7032927498
  • Min Amd Letter Date
    11/24/2023 - a year ago
  • Max Amd Letter Date
    11/24/2023 - a year ago
  • ARRA Amount

Institutions

  • Name
    Tennessee Technological University
  • City
    COOKEVILLE
  • State
    TN
  • Country
    United States
  • Address
    1 WILLIAM L JONES DR
  • Postal Code
    385050001
  • Phone Number
    9313723374

Investigators

  • First Name
    Anthony
  • Last Name
    Skjellum
  • Email Address
    askjellum@tntech.edu
  • Start Date
    11/24/2023 12:00:00 AM

Program Element

  • Text
    Software & Hardware Foundation
  • Code
    779800

Program Reference

  • Text
    EAGER
  • Code
    7916
  • Text
    HIGH-PERFORMANCE COMPUTING
  • Code
    7942
  • Text
    REU SUPP-Res Exp for Ugrd Supp
  • Code
    9251