Collaborative Research: CISE: Large: Cross-Layer Resilience to Silent Data Corruption

Information

  • NSF Award
  • 2321491
Owner
  • Award Id
    2321491
  • Award Effective Date
    10/1/2023
  • Award Expiration Date
    9/30/2028
  • Award Amount
    $ 187,500.00
  • Award Instrument
    Continuing Grant

Abstract

Hyperscalers (i.e., large cloud service providers) are reporting frequent silent data corruptions (or SDCs) within their datacenter infrastructures. SDCs are software errors for which the only symptom is an incorrect result. Remarkably, SDCs at scale exhibit error occurrence rates on the order of one thousand faults per one million devices. Meanwhile, hardware manufacturers strive to achieve one hundred and close to zero defective parts per million for the commercial and automotive domains, respectively. This discrepancy between manufacturers’ goals and hyperscalers’ observations suggests that SDCs are a real threat to the reliability of all modern computing systems, and by extension their security and sustainability. This project explores whether it is possible to cooperatively design testing, detection, and mitigation approaches for SDCs that minimize performance impact on software applications, as well as additional carbon footprint expenditures associated with manufacturing and running computing systems. The project’s key novelties include: (1) leveraging reoccurring computational primitives in software (e.g., matrix multiplication in popular machine learning applications) and modern special-purpose hardware (e.g., Artificial Intelligence processors) to design domain-specific SDC solutions; (2) exploiting the fact that SDC testing can be performed throughout a device’s lifetime in the datacenter rather than for a few seconds to minutes, a strict limitation on the manufacturing test floor; (3) considering sustainability and carbon footprint as core design metrics. This project’s core impact will be a critical improvement in reliability and security for the countless applications to which we entrust computing systems today. A secondary core impact is an improvement in the longevity of computing devices, which has significant positive implications for sustainable computing. The research team will also train students and work with industry partners.

To address the SDC challenge, the research team pursues four synergistic research thrusts that cut across diverse domains: Silicon Devices, Computer Architecture, Software, and Algorithms. Within each thrust, the team will study the SDC challenge through the lenses of: Testing, Detection, Mitigation, and Security implications. Thrust 1 explores device-level testing through novel test pattern metrics and continuous scan test deployment. Thrust 2 studies system-level testing (improving error detection latency and test coverage and adapting tests to be more representative of datacenter workloads), core-specific testing, defect characterization, hardware support for testing and mitigation, and system security implications. Thrust 3 investigates software detection and mitigation through (partial) redundancy, appropriate scan and system-level test scheduling, test-application fusion (where applications test themselves), and software security hardening against defect-induced vulnerabilities. Thrust 4 pursues algorithmic detection and mitigations with a particular emphasis on enabling robust non-linear computation for important datacenter workloads, like neural networks.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

  • Program Officer
    Danella Zhao, dzhao@nsf.gov, 7032924434
  • Min Amd Letter Date
    9/6/2023
  • Max Amd Letter Date
    9/6/2023
  • ARRA Amount

Institutions

  • Name
    Carnegie-Mellon University
  • City
    PITTSBURGH
  • State
    PA
  • Country
    United States
  • Address
    5000 FORBES AVE
  • Postal Code
    152133815
  • Phone Number
    4122688746

Investigators

  • First Name
    Ronald
  • Last Name
    Blanton
  • Email Address
    blanton@ece.cmu.edu
  • Start Date
    9/6/2023 12:00:00 AM

Program Element

  • Text
    CISE Core: Large Projects

Program Reference

  • Text
    LARGE PROJECT
  • Code
    7925