OAC Core: Improving Data Integrity for HPC Datasets using Sparsity Profile

Information

  • NSF Award
  • 2312982
Owner
  • Award Id
    2312982
  • Award Effective Date
    6/1/2023
  • Award Expiration Date
    5/31/2026
  • Award Amount
    $ 600,000.00
  • Award Instrument
    Standard Grant

Scientists rely on large-scale simulations to achieve breakthroughs in scientific domains such as climate, energy, and quantum physics. As system complexity increases, future large-scale systems, and the data they generate, process, store, and transmit, will experience soft errors and silent data corruption at increasing rates. This corruption may go undetected because current High-Performance Computing (HPC) software stacks largely lack mechanisms to inform scientists of silent data corruption that could compromise the integrity of their scientific interpretation. To combat silent data corruption in HPC systems, this project introduces efficient, cost-effective mechanisms to monitor and detect soft errors. Through unsupervised error detection, the project increases scientists' confidence in extreme-scale simulations and data analyses, which advance the data-intensive scientific discovery needed to solve complex contemporary problems such as predicting severe weather, designing new materials, and making new energy sources practical. The methodologies are also applicable to general-purpose computing systems, improving security and reliability on traditional computing and Internet of Things devices.

This research applies compressive sensing and machine learning, in particular unsupervised learning, to accurately detect soft and hardware errors in current and future HPC systems. A compact representation of the original dataset is obtained efficiently through compressive sensing coupled with a hardware-assisted data collection mechanism that requires no changes to existing infrastructure. This representation feeds a spatiotemporal anomaly detection model that characterizes, in situ, soft errors and errors caused by hardware malfunction by detecting values that deviate from acceptable ranges. The approach is built into the scientific workflow and operates alongside the application without requiring application modification or customization. Validating the mechanism across multiple HPC platforms using scientific workflows allows scientists to analyze and verify their datasets with greater trust.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
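The detection idea described in the abstract can be illustrated with a small sketch. It is not the project's implementation: it assumes a software random Gaussian projection as a stand-in for the hardware-assisted compressive sensing step, and a simple temporal check (a step change larger than a multiple of the median change) as a stand-in for the spatiotemporal anomaly detection model. All function names, the synthetic data, the injected corruption, and the `factor` threshold are hypothetical.

```python
import math
import random
import statistics


def make_sensing_matrix(m, n, seed=0):
    """Random Gaussian sensing matrix with m rows and n columns (m << n)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]


def compress(phi, x):
    """Compact representation y = Phi x of one data snapshot."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


def step_changes(compact):
    """Euclidean distance between consecutive compact representations."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(prev, curr)))
        for prev, curr in zip(compact, compact[1:])
    ]


def flag_out_of_range(changes, factor=10.0):
    """Indices of transitions whose change exceeds factor * median change.

    The 'acceptable range' here is an assumed heuristic, not the award's model.
    """
    limit = factor * statistics.median(changes)
    return [k for k, c in enumerate(changes) if c > limit]


def demo():
    n, m, steps = 512, 32, 20
    phi = make_sensing_matrix(m, n)
    # Synthetic, smoothly evolving field standing in for simulation output.
    snapshots = [
        [math.sin(0.01 * i + 0.05 * t) for i in range(n)] for t in range(steps)
    ]
    # Hypothetical silent corruption: one value in snapshot 12 is perturbed,
    # as a high-order bit flip might do.
    snapshots[12][100] += 1000.0
    compact = [compress(phi, x) for x in snapshots]
    return flag_out_of_range(step_changes(compact))
```

Under these assumptions, `demo()` flags the transitions into and out of the corrupted snapshot (indices 11 and 12), while only the 32-value compact representation of each 512-value snapshot is inspected.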

  • Program Officer
    Juan Li, jjli@nsf.gov, 7032922625
  • Min Amd Letter Date
    5/25/2023
  • Max Amd Letter Date
    5/25/2023
  • ARRA Amount

Institutions

  • Name
    University of Massachusetts Lowell
  • City
    LOWELL
  • State
    MA
  • Country
    United States
  • Address
    600 SUFFOLK ST STE 212
  • Postal Code
    01854-3624
  • Phone Number
    9789344170

Investigators

  • First Name
    Seung Woo
  • Last Name
    Son
  • Email Address
    SeungWoo_Son@uml.edu
  • Start Date
    5/25/2023 12:00:00 AM
  • First Name
    Orlando
  • Last Name
    Arias
  • Email Address
    orlando_arias@uml.edu
  • Start Date
    5/25/2023 12:00:00 AM

Program Element

  • Text
    OAC-Advanced Cyberinfrast Core

Program Reference

  • Text
    NSCI: National Strategic Computing Initi
  • Text
    SMALL PROJECT
  • Code
    7923