Collaborative Research: SHF: Small: Learning Fault Tolerance at Scale

Information

  • NSF Award
  • 2135310
Owner
  • Award Id
    2135310
  • Award Effective Date
    1/1/2022
  • Award Expiration Date
    12/31/2024
  • Award Amount
    $ 199,647.00
  • Award Instrument
    Standard Grant

In computer-aided design and analysis of engineered systems such as automobiles or semiconductor chips, computational models are simulated on high-performance computers to characterize and evaluate key attributes. The sheer scale of such high-performance computing systems, e.g., over 20 billion transistors in Summit (one of the world's fastest supercomputers), increases the likelihood of transient hardware faults from events such as cosmic radiation or processor-chip voltage fluctuations. Because such simulations are typically long-running, both the likelihood of these errors and their negative impact increase further: the corruption of a single data field or variable may require weeks to months of recomputation before critical decisions can be made. This project will develop automated approaches that make such applications tolerant of hardware faults. These applications are widely used not only across multiple industrial sectors but also to increase the predictive power of climate and weather models, aiding critical decision making.

Traditional fault-tolerant schemes are either application-specific, requiring significant programmer effort to redesign or customize large-scale software, or application-agnostic, where all or most data are redundantly stored at regular intervals to allow for recovery. The significant memory and processing overheads of the latter limit their scalability. This project seeks to address these limitations by providing a theoretical foundation for a new class of fault-tolerant schemes suitable for the broad array of applications based on iterative numerical simulations that evolve over time on discretized spatial domains.

This project is based on the premise that in such physics-based applications, the rate of change of the solution vector components across time steps (iterations) and spatial domains is a key metric for automatically identifying the critical computational variables, monitoring their evolution, and dynamically selecting the type of safeguarding techniques that should be applied. The investigators will pursue three key directions: (i) characterizing the intrinsic resiliency of the application by developing resiliency gradient metrics, (ii) developing and testing fault-tolerance schemes that adapt the level and type of protection to the resiliency gradient, with the goal of reducing computational overheads and increasing scalability, and (iii) constructing an automatic online decision-based learning framework for adaptively selecting fault-tolerance methods in relation to the system's ability to use approximate computing and co-scheduling techniques. The investigators will also work closely with application and runtime system developers to seek broader use of this fault tolerance framework, develop specialized undergraduate and graduate curricula for student training, and offer research experiences to high school students.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
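
The premise stated in the abstract, that the rate of change of solution vector components across time steps can indicate which variables deserve protection, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the toy 1D diffusion solver, the function names `step`, `rate_of_change`, and `choose_protection`, the threshold value, and the rule that faster-changing components receive heavier protection. None of it is taken from the project's actual metrics or schemes.

```python
# Hypothetical sketch: the solver, names, threshold, and protection rule
# are illustrative assumptions, not the project's method.

def step(u, alpha=0.1):
    """One explicit finite-difference step of 1D heat diffusion with fixed ends."""
    nxt = u[:]
    for i in range(1, len(u) - 1):
        nxt[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return nxt


def rate_of_change(prev, curr):
    """Absolute change of each solution component between two time steps."""
    return [abs(c - p) for p, c in zip(prev, curr)]


def choose_protection(rates, threshold=0.05):
    """Assumed rule: fast-changing components are checkpointed, others only monitored."""
    return ["checkpoint" if r > threshold else "monitor" for r in rates]


u0 = [0.0] * 4 + [1.0] + [0.0] * 4  # initial heat spike in the middle of the domain
u1 = step(u0)
u2 = step(u1)

rates = rate_of_change(u1, u2)      # per-component change across one iteration
plan = choose_protection(rates)     # per-component protection choice

for i, (r, p) in enumerate(zip(rates, plan)):
    print(f"component {i}: change={r:.3f} -> {p}")
```

On this toy input only the components next to the initial spike change quickly enough to cross the assumed threshold, so the rest of the domain would be monitored rather than redundantly stored. A real scheme would need a principled metric and threshold, which is what the project proposes to develop.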

  • Program Officer
    Almadena Chtchelkanova, achtchel@nsf.gov, 7032927498
  • Min Amd Letter Date
    8/31/2021
  • Max Amd Letter Date
    8/31/2021
  • ARRA Amount

Institutions

  • Name
    University of Alabama in Huntsville
  • City
    Huntsville
  • State
    AL
  • Country
    United States
  • Address
    301 Sparkman Drive
  • Postal Code
    358051911
  • Phone Number
    2568242657

Investigators

  • First Name
    Joshua
  • Last Name
    Booth
  • Email Address
    jdbst21@gmail.com
  • Start Date
    8/31/2021 12:00:00 AM

Program Element

  • Text
    Software & Hardware Foundation
  • Code
    7798

Program Reference

  • Text
    SMALL PROJECT
  • Code
    7923
  • Text
    HIGH-PERFORMANCE COMPUTING
  • Code
    7942