Scientists rely on large-scale simulations to achieve breakthroughs in many scientific domains, including climate, energy, and quantum physics. As system complexity increases, future large-scale systems and the data they generate, process, store, and transmit are increasingly subject to soft errors or silent data corruption. Importantly, this silently compromised data may go undetected because current High-Performance Computing (HPC) software stacks largely lack mechanisms to inform scientists of silent data corruption that could adversely affect the integrity of their scientific interpretation. To combat silent data corruption in HPC systems, this project introduces highly efficient and cost-effective mechanisms to monitor and detect soft errors. Through the use of unsupervised error detection, this project increases scientists’ confidence in extreme-scale scientific simulations and data analyses, which advance the data-intensive scientific discovery needed to solve some of the world’s most complex contemporary problems, such as predicting severe weather conditions, designing new materials, and making new energy sources practical. The methodologies of this project are also applicable to general-purpose computing systems, increasing security and reliability on traditional computing and Internet of Things devices.

This research applies compressive sensing and machine learning, especially an unsupervised approach, to accurately detect soft and hardware errors in current and future HPC systems. A compact representation that corresponds to the original dataset is efficiently obtained through compressive sensing coupled with a hardware-assisted data collection mechanism that requires no changes to existing infrastructure.
This representation is used with a spatiotemporal anomaly detection model for in situ characterization of soft errors and errors caused by hardware malfunction, detecting anomalies that deviate from acceptable ranges. The approach is built into the scientific workflow and operates seamlessly with the application without requiring application modification or customization. Validation of the mechanism across multiple HPC platforms using scientific workflows allows scientists to analyze and verify their datasets with increased levels of trust.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.