According to maintenance logs gathered from large data centers over two years, hard disk drive failures account for 70% of server hardware failures [1]. Major causes of hard drive failure include mechanical failure, firmware or logical issues, power surges, overheating, and manufacturing defects. To prevent data loss, data centers employ various preventive maintenance measures. These measures include monitoring temperature and power supply, conducting regular data backup and disk replacement, and implementing a Redundant Array of Independent Disks (RAID) configuration. RAID 6 is a commonly used RAID configuration in the industry. It distributes data across multiple hard drives and employs dual parity information, safeguarding against the failure of up to two drives within the drive array [2]. However, permanent data loss can occur if more than two drives fail in a drive array. When replacing a single drive, the whole array needs to be put into standby mode. The time it takes for RAID reconfiguration can vary from a few hours to several days, depending on the storage capacity of the drive being replaced.
Monitoring the health status of individual disks and predicting potential failures can reduce the risk of data loss and optimize maintenance planning. Modern hard drives are equipped with a built-in health monitoring system known as SMART (Self-Monitoring, Analysis, and Reporting Technology) [3]. SMART attributes record information such as model, serial number, workload, temperature, spindle performance, and read/write error rate. The operating system continuously monitors these attributes and triggers an alarm when any value exceeds a predefined threshold. The thresholds are determined by domain experts, and the model implementation is straightforward. However, the built-in threshold method has four limitations: (1) The threshold method provides poor prediction performance, where the failure detection rate is around 3-10% when the false alarm rate is 0.1% [4]. The predefined threshold cannot capture all possible failure modes. Also, it is challenging to achieve optimal failure detection while minimizing false alarms. (2) The threshold method does not provide remaining time before a failure occurs once an alarm is set off. This time-in-advance information is desirable to plan proactive actions. (3) The threshold method fails to leverage the time series characteristics of the SMART attributes. Valuable predictive information in the attribute value trends remains untapped. (4) The threshold method overlooks the potential benefits of population statistics. Data centers host masses of hard drives operating under a controlled environment and population statistics can improve anomaly detection.
Over the past two decades, researchers have developed data-driven solutions to overcome these limitations. Initially, statistical approaches were used to model and distinguish SMART attribute distributions from operational and failure classes. The rank-sum test and reverse arrangement test were among the most effective methods [5, 6], and became the basis for feature selection in later studies. As open-source hard drive datasets became more accessible, modern machine learning techniques have outperformed statistical methods in prediction accuracy.
The most effective machine learning models for this particular use case, such as random forests [7] and Long Short-Term Memory (LSTM) recurrent neural networks [8], are often referred to as black-box models due to their lack of transparency and interpretability. This poses a challenge for the maintenance of the hard drives because it must be understood how the model comes up with a prediction to identify new failure modes and enhance maintenance actions.
Described herein are methods and systems for a hard drive failure prediction system that predicts whether a hard drive will fail within a time period determined by a time-in-advance value. The hard drive failure prediction system includes a machine-learning model, such as an XGBoost model, trained using historical hard drive operational data from a data center. The historical hard drive operational data corresponds to SMART attributes. Statistical analysis is performed to determine a set of SMART attributes most indicative of hard drive failure. The machine-learning model is trained using the historical hard drive operational data corresponding to the determined set of SMART attributes indicative of hard drive failure. The accuracy results of the machine-learning model are used to determine a time-in-advance value that balances accuracy with failure lead time. Operational hard drive data for a particular hard drive, spanning a predetermined signal length, is provided as input to the trained machine-learning model, and an output of the model indicates whether the hard drive will fail within the time period of the time-in-advance value.
In one aspect, a computer-implemented method for predicting hard drive failure is provided. The method includes receiving operational data for a hard drive. The method also includes selecting a portion of the operational data corresponding to a time period based on a predetermined signal length. The method also includes extracting, from the portion of the operational data, input data representing a set of features indicative of hard drive health. The method also includes determining, using the input data as input to a trained hard drive failure prediction model, hard drive health output data, wherein the trained hard drive failure prediction model is trained using training data from a plurality of hard drives and wherein the training data corresponds to the predetermined signal length and a predetermined time-in-advance value. The method also includes extracting, from the hard drive health output data, a hard drive failure indicator. The method also includes determining whether the hard drive failure indicator represents failure. The method also includes, in response to determining that the hard drive failure indicator represents failure, generating an alert indicating a time span, based on the time-in-advance value, until failure of the hard drive.
In some embodiments, the method also includes, in response to determining that the hard drive failure indicator represents failure, analyzing the set of features indicative of hard drive health corresponding to the hard drive to generate an explainable dataset and generating a graphical user interface using the explainable dataset. In some embodiments, analyzing the set of features indicative of hard drive health includes using Shapley additive explanations (SHAP) analysis. In some embodiments, the operational data are Self-Monitoring, Analysis, and Reporting Technology (SMART) attributes and received from a SMART monitoring system. In some embodiments, the method also includes, prior to receiving the operational data, performing statistical analysis of the training data, determining, based on the statistical analysis, a set of SMART attributes correlated to hard drive failure, and storing the set of SMART attributes as the set of features indicative of hard drive health. In some embodiments, the set of SMART attributes includes at least one of: number of bad sectors, number of data blocks with uncorrectable errors, number of parity errors, number of unrecoverable errors using hardware error correction code, and number of sectors waiting to be remapped due to unrecoverable read errors.
In some embodiments, the method also includes, prior to receiving the operational data, determining the predetermined time-in-advance value by: executing the trained hard drive failure prediction model using the training data and a time-in-advance value, determining an accuracy value for the execution of the trained hard drive failure prediction model, determining the accuracy value exceeds an accuracy threshold value, and storing the time-in-advance value as the predetermined time-in-advance value. In some embodiments, the trained hard drive failure prediction model is an extreme gradient boosting (XGBoost) model.
In another aspect, a system for predicting hard drive failure is provided. The system includes at least one processor and at least one memory. The memory includes instructions that, when executed by the at least one processor, cause the system to receive operational data for a hard drive. The memory also includes instructions that, when executed by the at least one processor, cause the system to select a portion of the operational data corresponding to a time period based on a predetermined signal length. The memory also includes instructions that, when executed by the at least one processor, cause the system to extract, from the portion of the operational data, input data representing a set of features indicative of hard drive health. The memory also includes instructions that, when executed by the at least one processor, cause the system to determine, using the input data as input to a trained hard drive failure prediction model, hard drive health output data, wherein the trained hard drive failure prediction model is trained using training data from a plurality of hard drives and wherein the training data corresponds to the predetermined signal length and a predetermined time-in-advance value. The memory also includes instructions that, when executed by the at least one processor, cause the system to extract, from the hard drive health output data, a hard drive failure indicator. The memory also includes instructions that, when executed by the at least one processor, cause the system to determine the hard drive failure indicator represents failure. The memory also includes instructions that, when executed by the at least one processor, cause the system to, in response to determining the hard drive failure indicator represents failure, generate an alert indicating a time span, based on the time-in-advance value, until failure of the hard drive.
In some embodiments, the memory also includes instructions that, when executed by the at least one processor, cause the system to, in response to determining the hard drive failure indicator represents failure, analyze the set of features indicative of hard drive health corresponding to the hard drive to generate an explainable dataset and generate a graphical user interface using the explainable dataset. In some embodiments, analyzing the set of features indicative of hard drive health includes using Shapley additive explanations (SHAP) analysis. In some embodiments, the operational data are Self-Monitoring, Analysis, and Reporting Technology (SMART) attributes and received from a SMART monitoring system. In some embodiments, the memory also includes instructions that, when executed by the at least one processor, cause the system to, prior to receiving the operational data, perform statistical analysis of the training data, determine, based on the statistical analysis, a set of SMART attributes correlated to hard drive failure, and store the set of SMART attributes as the set of features indicative of hard drive health. In some embodiments, the set of SMART attributes includes at least one of: number of bad sectors, number of data blocks with uncorrectable errors, number of parity errors, number of unrecoverable errors using hardware error correction code, and number of sectors waiting to be remapped due to unrecoverable read errors.
In some embodiments, the memory also includes instructions that, when executed by the at least one processor, cause the system to, prior to receiving the operational data, determine the predetermined time-in-advance value, which further includes instructions to: execute the trained hard drive failure prediction model using the training data and a time-in-advance value, determine an accuracy value for the execution of the trained hard drive failure prediction model, determine the accuracy value exceeds an accuracy threshold value, and store the time-in-advance value as the predetermined time-in-advance value. In some embodiments, the trained hard drive failure prediction model is an extreme gradient boosting (XGBoost) model.
In another aspect, a computing device for predicting hard drive failure is provided. The computing device is configured to connect to a pod of hard drives in a data center. The computing device is also configured to operate as an edge computing device. The computing device is also configured to receive, using message queuing telemetry transport (MQTT) protocol, Self-Monitoring, Analysis, and Reporting Technology (SMART) data from the pod of hard drives. The computing device is also configured to select a portion of the SMART data corresponding to a time period based on a predetermined signal length. The computing device is also configured to extract, from the portion of the SMART data, input data representing a set of SMART attributes indicative of hard drive health. The computing device is also configured to determine, using the input data as input to a trained hard drive failure prediction model, hard drive health output data, wherein the trained hard drive failure prediction model is trained using training data from a plurality of hard drives and wherein the training data corresponds to the predetermined signal length and a predetermined time-in-advance value. The computing device is also configured to generate, using the hard drive health output data, a graphical user interface with a failure indicator corresponding to each hard drive from the pod of hard drives. In some embodiments, the SMART data is received in real-time from the pod of hard drives. In some embodiments, the graphical user interface further includes a visualization of SMART attributes corresponding to the failure indicator of each hard drive from the pod of hard drives.
In some embodiments, the set of SMART attributes indicative of hard drive health includes at least one of: number of bad sectors, number of data blocks with uncorrectable errors, number of parity errors, number of unrecoverable errors using hardware error correction code, and number of sectors waiting to be remapped due to unrecoverable read errors.
Additional features and aspects of the technology include the following:
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The methods and techniques described herein include a model framework that addresses the limitations of existing models. As shown in
The hard disk drive failure prediction problem falls under the standard scope of the Prognostics and Health Management (PHM). The PHM lifecycle may be divided into three phases, namely design, development, and decision (DE3) [9].
The design phase includes three stages: system function design, framework design, and verification & validation design [9]. The design phase may include a thorough analysis of the PHM system function, its interaction with other systems, and the quantitative benefits it provides. For hard drive failure prediction, the PHM system function may be one of the three types: failure detection, health status classification, or remaining useful life (RUL) prediction. The modeling framework is then established by selecting inputs, machine learning algorithms, and designing hardware infrastructure. Finally, during the verification & validation stage, specific indicators are designed to evaluate the effectiveness of the system.
As part of the system function design,
The hard drive failure prediction system works in conjunction with other subsystems as part of data center information technology (IT) infrastructure. The threshold-based model framework utilizes SMART sensor signals as input to determine and then display the health status on the operating system. Recent publications have improved upon this model framework and enhanced its modeling capacity. An enhanced model is usually trained offline with data from a drive population to capture the SMART attribute distribution. The trained model then scores incoming data on the operating system. This enhanced framework does not require additional hardware. The computational requirement of the operating system depends on the complexity of the trained model. Novel frameworks have been developed to further enhance modeling capability. One group of novel frameworks facilitates online learning to prevent model aging. One such online learning framework uses online random forests (ORF) to address the disk class imbalance problem [15]. Its online bagging strategy selects records from the failed class to update the model more frequently than those from the operational class. Concept drift may also be observed in the hard drive population, where the statistical patterns of SMART attributes change over time and the change is unpredictable [16]. To detect concept drift, the online learning model Stream disk failure prediction (StreamDFP) was developed with state-of-the-art incremental learning algorithms. The other group of frameworks involves combining system-level signals with SMART attributes. In one instance, on Microsoft Azure, workload performance counters were combined with SMART attributes, which improved prediction accuracy in experiments [17]. However, there were difficulties in implementing the system due to the need for third-party data and hardware divergence in production.
In another instance, a Cloud Disk Error Forecasting (CDEF) system incorporated system-level signals, such as Windows events and file system operation errors, with SMART attributes as input [18]. By combining these signals, the prediction accuracy increased. Overall, retrofitting the hard drive failure prediction system leads to a tradeoff between the framework novelty and implementation difficulty. These novel frameworks demand real-time memory and processing capacity, which challenges their scalability at a large-scale data center.
To validate the prediction accuracy and to compare with other published models, failure prediction models have been trained and tested on open-source data. The SMART dataset from the Center for Magnetic Recording Research (CMRR) at the University of California San Diego (UCSD) [4] is the first open-source SMART dataset. The dataset contains 369 hard drives, out of which 191 have failed. The SMART attributes were sampled every two hours, and the most recent 25 days of records were kept. In some instances, the operational disk data was from a reliability demonstration test run and the failed disk data was collected from the field. Thus, in some instances, a model trained to differentiate between the two classes may do so based on the difference between operating environments instead of disk performance. A common dataset in publications is the Backblaze SMART dataset. The Backblaze data center has been releasing, on a quarterly basis, SMART data collected from more than 100,000 hard drives of various models [19]. The large data volume allows researchers to extract subsets that match their modeling goals. Other non-open-source data from commercial data centers include Microsoft Azure [17, 18], EMC Corporation [20], Baidu [21, 22], Tencent [23], and others [24]. A small-scale run-to-failure test may be performed to identify hard drive failure indicators [20]. When evaluating the performance of hard drive failure prediction models, the Failure Detection Rate (FDR) and False Alarm Rate (FAR) are the preferred metrics. A Markov model may be used to demonstrate that even with a relatively low detection rate of around 40%-50%, the reliability of the overall system in RAID configurations can be significantly improved [26]. Therefore, the focus is on achieving the highest possible FDR and on controlling the FAR to avoid unnecessary disk replacements. Other performance evaluation metrics focus on data migration, which often comes into play when a disk failure occurs.
Migration rate and mis-migration rate are proposed to measure protected and lost data, but this may not apply to disks that are configured with RAID [27]. A metric was developed that estimates the virtual machine migration cost at a data center by considering both unnecessary migration cost and data loss cost [18]. A cost-sensitive ranking model was trained to detect hard drive failure.
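As an illustration of the FDR and FAR metrics discussed above, the following minimal sketch computes both from per-drive outcomes. It assumes one boolean ground-truth label and one boolean prediction per drive; the function name and the sample values are hypothetical and are not taken from any cited study.

```python
def detection_and_false_alarm_rates(actual_failed, predicted_failed):
    """Return (FDR, FAR) for per-drive boolean labels and predictions.

    FDR = correctly flagged failed drives / all failed drives.
    FAR = healthy drives flagged as failing / all healthy drives.
    """
    if len(actual_failed) != len(predicted_failed):
        raise ValueError("labels and predictions must have the same length")

    pairs = list(zip(actual_failed, predicted_failed))
    true_alarms = sum(1 for actual, predicted in pairs if actual and predicted)
    false_alarms = sum(1 for actual, predicted in pairs if not actual and predicted)
    n_failed = sum(1 for actual in actual_failed if actual)
    n_healthy = len(actual_failed) - n_failed

    # Guard against empty classes so the sketch never divides by zero.
    fdr = true_alarms / n_failed if n_failed else 0.0
    far = false_alarms / n_healthy if n_healthy else 0.0
    return fdr, far


# Hypothetical population: 4 failed drives followed by 6 healthy drives.
actual = [True] * 4 + [False] * 6
predicted = [True, True, True, False, True, False, False, False, False, False]
fdr, far = detection_and_false_alarm_rates(actual, predicted)
```

In this hypothetical population, three of the four failed drives are detected and one of the six healthy drives raises a false alarm.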
In the development phase, technical details of the designed architecture are established and tested. In the hard drive failure prediction system, the focus of this phase is on training the machine learning model, with approaches ranging from Failure Modes, Mechanisms, and Effects Analysis (FMMEA) to data-driven deep learning models.
Failure mechanisms refer to the ways by which physical, electrical, and mechanical stressors can lead to failures. The indicators of these stressors can help predict potential failures. An analysis of the UCSD hard drive dataset using the FMMEA data sheet identified high-risk failure modes such as head crashes, disk scratches, and head offtrack [28]. SMART features that were correlated with these high-risk failure mechanisms were chosen, and a Mahalanobis-distance-based model was applied to detect anomalies [29]. Similarly, SMART data from 433 failed hard drives in a real-life data center was analyzed, and the failure modes were classified as logical, bad sector, or read/write head failures [24]. Then the failure degradation signatures were calculated for each failure mode and used to predict the degradation stage of the hard drive. Additionally, sector errors were demonstrated to be highly correlated with whole-disk error through domain expertise and extensive data analysis [20].
The distribution of SMART attributes from the operational class and the failed class was initially modeled using statistical models. The Naïve Bayes Expectation-maximization (NBEM) model achieved a 30% failure detection rate, which was three times better than the threshold-based model at the same false positive rate [30]. Non-parametric statistical tests, namely the rank-sum test and the reverse arrangements test, may be used to capture SMART attributes that display significant trends for most failed drives and close to no trends for operational drives [4, 5, 6]. Hidden Markov Models (HMMs) and Hidden Semi-Markov Models (HSMMs) may be fitted to both operational and failed disk SMART time series [31]. A test record may then be classified based on which model fits better. Logistic regression with L1 regularization may be utilized, with enhanced prediction performance observed after incorporating SMART time series features [22]. In some instances, decision trees are preferable in this application because they may capture the combined effects of multiple SMART features and are inherently interpretable. Failure prediction systems were proposed based on classification trees that achieved a high failure detection rate of over 90% with a low false alarm rate of less than 1% on real-life large-scale datasets [12, 32, 33]. Specifically, both trees used “Power-on Hours” as a predictive feature. “Power-on Hours” was found to be the dominant feature, with a feature importance measurement five times higher than that of the second most important feature [12]. While these findings reflect genuine data patterns, the observed inverse correlation between age and failure likelihood is counterintuitive and requires a nuanced explanation. Based on analysis of the Backblaze dataset, a classification tree indicates that the top SMART features that distinguish the two classes include 197 (Current Pending Sector Count), 187 (Reported Uncorrectable Errors), 5 (Reallocated Sectors Count), and 188 (Command Timeout) [34].
Similarly, an optimal survival tree applied to a year of Backblaze data may indicate that the most important features are SMART 5, 187, 3 (Spin-up Time), 190 (Temperature Difference), 7 (Seek Error Rate), and 188 [35]. These results align well with the failure mechanism analysis. Given many successes in accurate failure prediction with basic models, ensemble models provide accurate predictions because they aggregate results from basic models. Some successful ensemble model structures for hard drive failure prediction include a Combined Bayesian Network (CBN) consisting of four basic classifiers [111], a two-stage model combining a decision tree and a logistic regression model [17], AdaBoost with transfer learning capacity [23], and random forests [7].
Moreover, neural network models have been investigated for predicting hard disk drive failures. The use of a convolutional neural network and LSTM (CNN+LSTM) model emphasized the importance of integrating server performance and location information [36]. A recurrent neural network (RNN)-based model that treats SMART attributes as sequence data was shown to outperform sequence-independent models [10]. A temporal CNN model that combines daily SMART input and recent SMART time series features as input [37] performed better than LSTM models due to its resilience to noise. During the development phase of the hard drive failure prediction system, two major challenges arise: the data imbalance challenge and the model update/generalization challenge.
In response to the data imbalance challenge, both undersampling and oversampling techniques have been explored. Undersampling methods may reduce the number of samples from the majority class to balance the dataset [16, 23, 38]. On the other hand, oversampling techniques may inflate the number of samples from the minority class [12, 39]. Another approach to addressing the data imbalance challenge is the use of an improved loss function. In one instance, a loss function combines the binary cross-entropy with a sign function [37]. The modified loss function allocates more loss and gradients to false alarm samples, thereby reducing the impact of false positives and improving the model's ability to distinguish between healthy and failing disks. Moreover, techniques to enhance data volume and extract features have been explored. A SMART-GAN method has been proposed that leverages generative adversarial networks (GANs) to increase the data volume for both classes [40]. By generating synthetic samples, SMART-GAN expanded the dataset and improved the performance of the random forest model.
In response to the model update/generalization challenge, various strategies have been proposed to enhance model training and generalization capabilities. One approach is the adoption of the instances map to disk algorithm (IMDA) [23]. IMDA classifies the health state of a disk as a failure if any of the past 14 days of records are classified as failures. By considering historical information, IMDA provides a more comprehensive assessment of disk health and aids in timely failure prediction. Furthermore, online training techniques are proposed specifically for random forest models [15]. Poisson distributions are introduced to model the sequential arrival of positive and negative samples, giving a smaller chance for negative samples to be selected for model updates. This selective updating strategy ensures that negative samples are balanced with positive samples during training, improving the overall performance of the model.
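The IMDA rule described above (a disk is treated as failing if any of the past 14 days of records is classified as a failure) can be sketched as follows. This is a minimal illustration, assuming one boolean record-level classification per day ordered from oldest to newest; the function name and default argument are hypothetical and are not taken from [23].

```python
def imda_disk_is_failing(daily_record_predictions, window_days=14):
    """Map record-level predictions to one disk-level health state.

    daily_record_predictions: booleans, oldest first, True meaning the
    record for that day was classified as a failure.
    """
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    # Only the most recent `window_days` records influence the decision.
    recent = daily_record_predictions[-window_days:]
    return any(recent)


# A failure-classified record that is 15 days old no longer counts.
stale_alarm = [True] + [False] * 14
# A failure-classified record from today does.
fresh_alarm = [False] * 13 + [True]
```

Taking the logical OR over the window trades a higher false alarm rate for earlier detection; a stricter rule, such as requiring several failure-classified records in the window, is an alternative that the source does not describe.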
Another approach uses active learning and semi-supervised learning models. The StreamDFP framework incorporates active learning and semi-supervised learning to select suitable samples for training [23]. The learners predict samples in learning windows, and the results are used as references for online labeling. This approach leverages the iterative process of active learning and semi-supervised learning to improve the model's classification ability and adapt to evolving data patterns. To address the model generalization challenge, techniques such as normalizing SMART attribute distributions across different drive models have been considered [37]. This normalization process enables the training of a single model that can be applied to various drive models, thereby improving the model's generalization capability. Additionally, transfer learning methods have been investigated to achieve cross-model training [23]. By selecting a majority model with an attribute distribution similar to that of the minority model, knowledge transfer may be facilitated, enabling effective training on minority drive models.
In the final phase of the proposed PHM framework, the focus shifts to making informed maintenance decisions based on the outcomes predicted by the model. This decision-making process leverages the accumulated knowledge from the earlier stages to optimize maintenance actions and ensure efficient resource allocation. One approach to inform maintenance actions is the adoption of a rule-based disk replacement policy [38]. This policy sets guidelines for determining when a disk should be replaced based on predefined rules. By considering factors such as failure predictions, historical data, and specific thresholds, this policy assists in making timely replacements and reducing the risk of catastrophic failures. To further optimize maintenance actions, the speed of the scrubber may be adjusted based on the predicted errors [41]. However, this approach has two limitations: it may slow down the system's overall performance and introduce new errors. Despite these limitations, this adaptive approach offers the opportunity to fine-tune maintenance actions based on real-time error predictions, thereby enhancing system reliability and minimizing downtime.
Moreover, the utilization of the optimal survival tree offers an interpretable path to failure detection [35]. This tree-based model predicts the time to failure and provides insights into the factors and patterns contributing to the failure. Another interpretable approach is to use a model-agnostic interpretation method that decodes the prediction results from a non-interpretable model. Shapley additive explanations (SHAP) and local interpretable model-agnostic explanations (LIME) may be used to analyze the SMART feature importance [43]. The model-agnostic approach leaves space to integrate crafted features and model training techniques to create a more customized and nuanced solution. This may have the desired real-world result of maintenance staff trusting the model outcomes and confidently making maintenance decisions based on the insights provided.
The explored techniques highlight the need for a model framework that can be retrofitted into an existing data center management system, predicts failure accurately with a low false alarm rate, and offers transparency and interpretability. The methods and techniques described herein are for a hard drive failure prediction system that is configured to receive hard drive operational data (e.g., SMART data) and extract key features for processing by a trained machine learning model (e.g., XGBoost model) to determine a hard drive failure prediction with explainable reasoning behind the prediction.
At operation 410, the explainable features are extracted from the SMART data and the data is labeled. The feature extraction is both data-driven and failure-mechanism driven. The operation 410 may include a statistical analysis to correlate the features extracted by catch-22 from the SMART time series with failures and identify the features that have the most relevance to explaining a hard drive failure. The operation 410 may further include failure mode analysis to guide toward features related to mechanical failures, such as high temperature, logical failures, and sector errors. Additionally, the historical hard drive data may be labeled, such as an indication of operational or failed, for training the machine-learning model.
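One way the statistical analysis in operation 410 could be sketched is with the rank-sum test mentioned in the background discussion. The following is a minimal illustration only: it assumes one summary value per drive for a single candidate feature, uses the normal approximation without a tie correction, and uses made-up sample values. The function names are hypothetical, and the source does not state which statistical test is actually applied.

```python
import math


def rank_sum_z(failed_values, healthy_values):
    """Wilcoxon rank-sum z statistic (normal approximation, no tie correction)."""
    n_failed, n_healthy = len(failed_values), len(healthy_values)
    if n_failed == 0 or n_healthy == 0:
        raise ValueError("both samples must be non-empty")

    # Pool the samples, remembering which class each value came from.
    pooled = sorted(
        [(value, "failed") for value in failed_values]
        + [(value, "healthy") for value in healthy_values]
    )

    # Sum the ranks of the failed class, giving tied values their average rank.
    rank_sum_failed = 0.0
    start = 0
    while start < len(pooled):
        stop = start
        while stop < len(pooled) and pooled[stop][0] == pooled[start][0]:
            stop += 1
        average_rank = (start + 1 + stop) / 2.0  # ranks start+1 .. stop
        failed_in_tie = sum(1 for k in range(start, stop) if pooled[k][1] == "failed")
        rank_sum_failed += average_rank * failed_in_tie
        start = stop

    total = n_failed + n_healthy
    expected = n_failed * (total + 1) / 2.0
    std_dev = math.sqrt(n_failed * n_healthy * (total + 1) / 12.0)
    return (rank_sum_failed - expected) / std_dev


def two_sided_p_value(z):
    """Two-sided p-value of a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Made-up per-drive values of one candidate feature (e.g. a reallocated-sector count).
failed_drives = [10, 12, 15, 20, 30]
healthy_drives = [0, 0, 1, 0, 2]
z_score = rank_sum_z(failed_drives, healthy_drives)
p_value = two_sided_p_value(z_score)
```

A feature whose p-value falls below a chosen significance level would be kept as a candidate input. With samples this small the normal approximation is rough, so an exact test would be more appropriate in practice.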
At operation 415, the machine-learning model is trained using the extracted and labeled hard drive operational data. As described below, the extreme gradient boosting (XGBoost) model may be selected as the model. The signal length and the time-in-advance value selected for training thus determine the time period over which the machine-learning model may perform failure predictions. For example, if the signal length is thirty days, then the machine-learning model may be trained with thirty days of historical data, and failure predictions may then be based on a failure occurring within a time period determined by the time-in-advance.
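The relationship between signal length, time-in-advance, and training labels can be illustrated with a short sketch. It assumes one SMART value per day per drive, a known failure day for failed drives, and a labeling rule in which a window is positive when the drive fails within the time-in-advance after the window ends. That rule, the function name, and the parameters are illustrative assumptions, not the exact procedure of the described system.

```python
def make_training_samples(daily_values, failure_day, signal_length, time_in_advance):
    """Slice one drive's daily series into labeled training windows.

    daily_values: one value per day, index 0 being the first recorded day.
    failure_day: index of the day the drive failed, or None if it never failed.
    Returns a list of (window, label) pairs, with label 1 for "will fail".
    """
    if signal_length <= 0 or time_in_advance < 0:
        raise ValueError("signal_length must be positive, time_in_advance non-negative")

    samples = []
    for end in range(signal_length - 1, len(daily_values)):
        window = daily_values[end - signal_length + 1 : end + 1]
        if failure_day is None:
            label = 0
        else:
            days_to_failure = failure_day - end
            if days_to_failure < 0:
                break  # ignore any records after the failure
            label = 1 if days_to_failure <= time_in_advance else 0
        samples.append((window, label))
    return samples


# Hypothetical drive with ten days of data that fails on the last day.
failed_samples = make_training_samples(list(range(10)), failure_day=9,
                                       signal_length=3, time_in_advance=2)
healthy_samples = make_training_samples(list(range(10)), failure_day=None,
                                        signal_length=3, time_in_advance=2)
```

Under this assumption, a longer time-in-advance marks more windows as positive, which gives more lead time but asks the model to recognize earlier and weaker degradation signatures.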
At operation 420, based on the training and performance of the machine-learning model, time window slicing and feature selection may be further determined. The time window slicing and feature selection may be provided back to operation 410 to further refine the feature extraction and then further train the machine-learning model at operation 415 with more historical data based on the refined feature extraction.
At operation 425, the prediction results generated may be analyzed and used to determine the impact on drive failures. A SHAP analysis may be performed to explain the failure prediction. SHAP analysis may reveal which features contribute the most to the failure prediction on both the population level and the individual hard drive level.
At operation 510, the preset SMART features may be extracted from the operational hard drive data. In the model training process, such as operation 410, the SMART data was analyzed to determine the features with the most relevance to hard drive failure. Thus, at operation 510, the selected SMART features may be extracted from the operational hard drive data (e.g., SMART time series data) to be used as the input for the model.
At operation 515, the selected SMART features of the operational hard drive data may be provided to the trained machine-learning model as input and the model may determine failure predictions corresponding to the hard drives of the input data. The model input may be in the form of a data table, where each row represents a hard drive and each column includes a feature. The trained XGBoost model may take the input table and classify each hard drive with a failure prediction or an operational prediction. When a hard drive is classified with a failure prediction, the data indicates that the hard drive will fail within the number of days determined by the time-in-advance. For example, if the time-in-advance is set to seven days, then the hard drive is likely to fail within seven days. For real-world applications, the time-in-advance is adjusted to provide the most accurate predictive results while also providing sufficient time for the hard drive to be replaced.
At operation 520, analysis is performed to generate a visualization of the reasoning for the predicted hard drive failure. After the execution of the model with the operational hard drive data as input, a SHAP analysis may be performed to explain the prediction result, such as identifying the SMART features that were most indicative of the predicted failure. For example, a maintenance crew of a data center may run the SHAP analysis to explain the prediction results.
A computing device, such as the NVIDIA Jetson Nano, may be configured with the hard drive failure prediction system. The hard drive failure prediction system may scale the computational infrastructure (e.g., the computing device) based on the size of the hard drive population (e.g., data center). In some embodiments, the hard drive failure prediction system may be trained and tested through cloud services.
The hard drive failure prediction system may be implemented on an edge computing device. Implementing on an edge computing device instead of a central server may provide advantages in scalability, data privacy, and operational resilience.
In common data centers, hard drives may be organized in pods, and pods of hard drives are connected to form larger units. If the hard drive failure prediction system were configured as part of a centralized server or the cloud, the SMART data from each hard drive would have to be transmitted constantly to the central server. By contrast, edge computing devices may be small yet include powerful processors and may be embedded in each pod or shelf, so that the failure prediction executes locally without burdening central resources. This approach enhances data security and enables continuous local operation, even during network disruptions.
The edge computing device executing the hard drive failure prediction system may collect SMART data from the operational hard drives of a pod using message queuing telemetry transport (MQTT) as the publish/subscribe architecture. The MQTT protocol is a communication protocol standardized by the Organization for the Advancement of Structured Information Standards (OASIS) and the International Organization for Standardization (ISO) that provides a scalable and reliable way to connect devices with a small code footprint and minimal network bandwidth. The hard drives of the pods are configured as the publishers and the edge computing device as the subscriber. The edge computing device receives SMART data from the hard drives in real time.
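The subscriber side of this arrangement can be illustrated with a minimal sketch. The topic layout ("pods/<pod_id>/drives/<serial>/smart"), the JSON payload fields, and the SmartBuffer class are hypothetical and are not specified in this description; the sketch only shows how an edge device might buffer incoming daily SMART records per drive, and it omits the MQTT client connection itself.

```python
import json
from collections import defaultdict

# Hypothetical topic layout: "pods/<pod_id>/drives/<serial>/smart".
# "+" is the MQTT single-level wildcard a subscriber could use.
TOPIC_FILTER = "pods/+/drives/+/smart"


def parse_topic(topic):
    """Return (pod_id, serial) from a topic of the assumed layout."""
    parts = topic.split("/")
    if (len(parts) != 5 or parts[0] != "pods"
            or parts[2] != "drives" or parts[4] != "smart"):
        raise ValueError(f"unexpected topic: {topic}")
    return parts[1], parts[3]


class SmartBuffer:
    """Keeps the most recent `max_days` (>= 1) daily SMART records per drive."""

    def __init__(self, max_days):
        self.max_days = max_days
        self.records = defaultdict(list)

    def on_message(self, topic, payload):
        """Subscriber-side handler; `payload` is the raw bytes of one message.

        Assumed payload: JSON object with "date" and "smart" (attribute -> raw value).
        """
        _pod_id, serial = parse_topic(topic)
        reading = json.loads(payload.decode("utf-8"))
        series = self.records[serial]
        series.append((reading["date"], reading["smart"]))
        del series[:-self.max_days]  # drop readings older than the window
```

In such a design, the buffer length would naturally be tied to the signal length used for feature extraction, so that the edge device stores only the window the model needs.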
Extreme gradient boosting (XGBoost) is an implementation of a gradient boosting model that utilizes decision trees as its base learner [44]. The performance of the gradient boosting model is based on training the base learner sequentially to minimize the loss function. The XGBoost model may be used for its performance and the ability to scale on many commercial computation systems. The data processing pipeline transforms raw time series data into a tabular format, where each row represents a hard drive, and each column represents a crafted time series feature. The XGBoost binary classifier may be used to model hard drive failure data and quantify the significance of features through SHAP analysis. To avoid data dredging and ensure the validity of the results, data may be used from Q1 and Q2 2023 for training and Q3 2023 for testing. Given that the data is imbalanced, a random undersampling is applied to the operational class in the training set to prevent bias in the model towards the operational class.
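The random undersampling step can be sketched as follows. The function name, the `ratio` parameter (operational rows kept per failed row), and the label encoding (1 for failed, 0 for operational) are assumptions made for illustration; the description does not state the ratio that was used.

```python
import random


def undersample_operational(labels, ratio=1.0, seed=0):
    """Return sorted row indices that keep every failed drive (label 1) and a
    random subset of operational drives (label 0).

    `ratio` is a hypothetical parameter: operational rows kept per failed row.
    """
    failed = [i for i, y in enumerate(labels) if y == 1]
    operational = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(operational), int(round(ratio * len(failed))))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = rng.sample(operational, n_keep)
    return sorted(failed + kept)
```

Undersampling would be applied to the training quarters only, so that the test quarter keeps its natural class imbalance and the reported false alarm rate stays meaningful.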
When the training dataset is reduced in size through undersampling, there is an increased risk of the model overfitting. The XGBoost classifier has several hyper-parameters that are adjustable to control the regularization effect. A grid search combined with cross-validation is applied to find the optimal combination of hyper-parameters.
The XGBoost model generates a feature importance graph that ranks features based on their weights. However, the feature importance does not directly account for the magnitude or direction of a feature's effect, and the feature ranking does not provide information about individual predictions. To examine both the global effect of each feature and the reasoning behind individual predictions, the Shapley additive explanations (SHAP) analysis is used as the model interpretation method.
SHAP is a model interpretation method that applies the classic Shapley value from cooperative game theory to assign importance scores among individual features based on their marginal contribution to the prediction result [45]. Given a prediction model f, the classic Shapley regression trains the model on all feature subsets S⊆F, where F is the set of all features [46]. The prediction for a specific input x from a model trained with feature i is fS∪{i}(xS∪{i}), and the prediction for x from a model trained with feature i withheld is fS(xS). The difference between two prediction results evaluates the marginal effect of feature i. The Shapley value of feature i is calculated as a weighted average of prediction differences for all possible subsets S⊆F\{i}:
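The equation itself does not appear in this text. The standard form of the classic Shapley regression value, written with the symbols defined above and presumably corresponding to equation 1, is:

```latex
\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
  \Bigl[ f_{S \cup \{i\}}\bigl(x_{S \cup \{i\}}\bigr) - f_S\bigl(x_S\bigr) \Bigr]
  \tag{1}
```

The factorial weight is the fraction of feature orderings in which exactly the features in S precede feature i.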
Shapley values satisfy three important properties: local accuracy, consistency, and missingness, which in turn result in a single unique solution of score attribution. The local accuracy property (shown in equation 2) states that the Shapley values of all features sum to the model prediction f(x), ensuring the attribution accuracy for any specific input.
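Equation 2 is not reproduced in this text. In the SHAP literature, local accuracy is commonly written as follows, where the base value ϕ0(f), the expected model output when no feature is known, is counted as part of the sum (this form is an assumption about what equation 2 contained):

```latex
f(x) \;=\; \phi_0(f) + \sum_{i \in F} \phi_i(f, x)
  \tag{2}
```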
The consistency property states that the Shapley value of feature i will not decrease when this feature's contribution to the prediction increases or stays the same as the model changes. Given models f and f′, if the condition in equation 3 is satisfied for all feature subsets S, the importance score attribution retains a consistent trend, that is, ϕi(f′,x)≥ϕi(f,x).
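Equation 3 is likewise missing from this text. A common statement of the consistency condition in the SHAP literature, assumed here to match equation 3 and using fx(S) for the model output when only the features in S are known, is:

```latex
f'_x(S) - f'_x\bigl(S \setminus \{i\}\bigr) \;\ge\; f_x(S) - f_x\bigl(S \setminus \{i\}\bigr)
  \quad \text{for all } S \subseteq F
  \tag{3}
```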
The missingness property states that a feature with no effect on the predicted value should have a Shapley value of 0. Thus, if fx(S∪i)=fx(S) for all feature subsets S, then ϕi(f,x)=0. This property ensures that non-contributing features do not receive undue importance, which could otherwise lead to misleading predictions and operational decisions.
SHAP evolves from the classic Shapley values by approximating fS(xS) in equation 1 with the conditional expectation function E[fS(x)|xS]. The iterative calculation starts with setting all feature values to zero. Then the features in S are introduced one at a time to calculate the conditional expectation and record their contribution to the prediction. The SHAP value of a feature i is the average conditional expectation value over all feature orderings.
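The averaging over feature orderings can be illustrated with a brute-force sketch. This is not TreeExplainer and is exponential in the number of features; the toy value function is invented for illustration and stands in for the conditional expectation of the model output given a feature subset.

```python
from itertools import permutations


def shapley_values(value, n_features):
    """Exact Shapley values by averaging marginal contributions over all
    feature orderings. `value(S)` maps a frozenset of feature indices to the
    model output expected when only the features in S are known."""
    phi = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        known = frozenset()
        for i in order:
            with_i = known | {i}
            phi[i] += value(with_i) - value(known)  # marginal contribution of i
            known = with_i
    return [p / len(orderings) for p in phi]


# Toy value function (illustrative only): feature 0 adds 2, features 0 and 1
# together add an extra 1, and feature 2 has no effect.
def toy_value(S):
    out = 0.5  # base value: output when no feature is known
    if 0 in S:
        out += 2.0
    if 0 in S and 1 in S:
        out += 1.0
    return out


phi = shapley_values(toy_value, 3)
```

In this toy case the interaction bonus is split evenly between features 0 and 1, the unused feature 2 receives zero (missingness), and the values sum to the difference between the full prediction and the base value (local accuracy).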
In addition, SHAP analysis quantifies pairwise feature interactions through Shapley interaction index [47]. Given feature i and j, the Shapley interaction value ϕi,j is:
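The formula is not reproduced in this text. The form commonly given for the SHAP interaction value, for i≠j and with fx(S) denoting the model output when only the features in S are known, is shown below as an assumption about the omitted equation:

```latex
\phi_{i,j} \;=\; \sum_{S \subseteq F \setminus \{i,j\}}
  \frac{|S|!\,\bigl(|F| - |S| - 2\bigr)!}{2\,\bigl(|F| - 1\bigr)!}\;
  \nabla_{ij}(S),
\qquad
\nabla_{ij}(S) \;=\; f_x\bigl(S \cup \{i,j\}\bigr) - f_x\bigl(S \cup \{i\}\bigr)
  - f_x\bigl(S \cup \{j\}\bigr) + f_x(S)
```

The factor of 2 in the denominator splits the interaction effect evenly between the two features.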
The interaction value ϕi,j is equal to ϕj,i, and the total interaction effect between i and j is ϕi,j+ϕj,i.
Computational complexity is a constraint for SHAP calculations, especially with a large input dataset with many features. Since SHAP values are model-specific, approximation methods are used for various model types to calculate SHAP values efficiently. For the system described herein, the TreeExplainer [48] method was implemented to complement the XGBoost model. This may be performed in a Python 3.8 environment. The Scikit-learn library [49] and the XGBoost [44] and SHAP [48] packages are utilized for model training, evaluation, and interpretation.
A fundamental step towards developing an artificial intelligence (AI) system that aligns with human values is to prioritize explainability in system design and implementation [49]. In some embodiments, the hard drive failure prediction system prioritizes explainability on three levels: the feature, the prediction, and the policy.
At the feature level, the SMART attribute trends are analyzed to derive time series features that are indicators of hard drive deterioration. These features are rooted in established domain knowledge, offering interpretable metrics for domain experts.
At the prediction level, the XGBoost model is used for its balance between prediction accuracy and implementation complexity, and the model outcome is supplemented with SHAP analysis, which disentangles each prediction into contributions from individual features. Through examination of these contributions, the model's rationale is compared with ground truth and human judgment. If the model emphasizes the right features in its predictions, this indicates that greater trust can be placed in the system. However, if discrepancies arise, SHAP analysis is used to identify the issues and adjust the model to better align with human understanding and interests. This comparison serves as a validation method, ensuring that the model's predictions resonate with domain expertise and industry knowledge.
At the policy level, rigorous sensitivity analyses on both the signal length and the time-in-advance parameters are used to identify a region with high performance stability. This region guarantees that the signal segment contains the key time series features related to a failure and that the model performance tolerates changes in the data preprocessing parameters.
The quality of data directly impacts prediction and insights. Input data for hard drive failure analysis usually comes from three data sources: data centers, retail merchants, and accelerated degradation experiments. In this instance, the training data comes from the Backblaze data center. As a cloud data storage company, Backblaze monitors the SMART attributes of large numbers of hard drives in controlled operating conditions. The SMART attributes of the hard drives are recorded daily and published quarterly, making the Backblaze dataset an extensive public dataset of hard drive SMART time series. The historical statistics indicate that the failure rate differs greatly across models, and some SMART attributes have inconsistent meanings across brands. Therefore, records may be grouped by model; in this instance, the Seagate ST4000DM000 was used to analyze the prognostic features. Representing 8% of the total drive count, the ST4000DM000 contributed the most disk failures among all models in Q3 2023 for Backblaze. Despite its relatively high failure rate, it is the fourth most used drive model in Backblaze data centers due to its affordability [50]. In addition, this selection allows model performance comparison with many existing hard drive prognostic models.
Additionally, the timeframe for analysis is limited to three quarters (Q1, Q2, and Q3 2023), which contain the records of 18802 drives. By the end of Q3 2023, 18331 drives were functional. Of these functional drives, 98% operated through all three quarters, and the rest were deployed during these three quarters. Meanwhile, 471 drives failed or were removed by the maintenance crew.
In real-world data centers, it is impractical to store lifetime data for each hard drive. As a result, the available time series length of hard drives is usually variable. Therefore, a cropping strategy may be used to ensure time series length uniformity without losing failure-related information.
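One way to picture the cropping strategy is the sketch below. It assumes that each series is ordered oldest to newest and that, for a failed drive, the last entry is the day of failure; the function name and the handling of short series are illustrative choices rather than details taken from this description.

```python
def crop_window(series, signal_length, time_in_advance):
    """Return the `signal_length` observations that end `time_in_advance`
    days before the last recorded day, or None if the series is too short.

    `series` is a list of daily values ordered oldest to newest; for a failed
    drive the last entry is assumed to be the day of failure.
    """
    end = len(series) - time_in_advance
    start = end - signal_length
    if start < 0:
        return None  # not enough history for this parameter combination
    return series[start:end]
```

For example, with a signal length of 30 days and a time-in-advance of 7 days, a drive needs at least 37 days of recorded history, and the model sees only data that ends a week before the final record.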
Sensitivity analysis is performed on signal length and time-in-advance to analyze the impact of time series slicing strategy on the prediction performance. Another goal is to identify a range of variable values that optimize predictive performance and model robustness. This analysis ties back to the goal of making clear and explainable policy decisions.
The chosen Seagate drive model has 23 SMART attributes in both raw and normalized time series format. Only attributes in raw format are considered to avoid information loss from the manufacturer-specified normalization calculation. Therefore, each labeled hard drive record contains 23 even-length time series. Then 24 features are extracted from each SMART time series, including mean, variance, and catch22 highly comparative features [51].
A total of 20 SMART attribute-feature pairs were chosen as degradation indicators because they show a clear trend as hard drive performance degrades over time.
Table 1 summarizes the description of the selected SMART attributes, and Table 2 summarizes the time series features that are extracted from each SMART attribute. Most selected SMART attributes are error indicators that naturally increase toward the end of drive life. Therefore, time-domain features are selected that reflect the magnitude and persistence of increase. In one instance, the long stretch of incremental decrease feature from the original catch22 was modified to measure incremental increase. The mean and variance are computed from the raw time series, and the long stretch of incremental increase and pNN40 [51] are computed from the z-score standardized time series.
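A minimal sketch of the four time series features follows. It is a simplified stand-in and not the catch22 implementation: the use of population variance, the strict-increase rule, and the default pNN40 threshold of 0.4 standard deviations (taken from the description of pNN40 in this text; the catch22 reference implementation may scale the threshold differently) are all assumptions.

```python
from statistics import fmean, pstdev, pvariance


def zscore(series):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)  # constant series: no variation to scale
    return [(v - mu) / sigma for v in series]


def longest_increase_stretch(series):
    """Longest run of consecutive strictly positive day-to-day differences."""
    best = run = 0
    for prev, cur in zip(series, series[1:]):
        run = run + 1 if cur > prev else 0
        best = max(best, run)
    return best


def pnn(series, threshold):
    """Fraction of absolute successive differences larger than `threshold`."""
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(d > threshold for d in diffs) / len(diffs) if diffs else 0.0


def extract_features(raw, pnn_threshold=0.4):
    """Features for one SMART attribute; `pnn_threshold` is an assumed default."""
    z = zscore(raw)
    return {
        "mean": fmean(raw),                                   # raw series
        "variance": pvariance(raw),                           # raw series
        "long_stretch_increase": longest_increase_stretch(z),  # z-scored series
        "pnn40": pnn(z, pnn_threshold),                       # z-scored series
    }
```

Applying `extract_features` to each selected SMART attribute of a drive and concatenating the results would yield one row of the tabular model input.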
The performance of the hard drive failure prediction system is evaluated based on the model performance using failure detection rate (FDR) and false alarm rate (FAR), and the feature importance and individual predictions are interpreted using user-friendly visualizations.
The core metrics used to evaluate model performance are failure detection rate (FDR) and false alarm rate (FAR). FDR, which is equivalent to recall or sensitivity in classic classification metrics, measures the proportion of positive records (failed drives) that are correctly identified. FAR is the complement of specificity (1 − specificity), and it measures the proportion of negative records (operational drives) that are incorrectly classified as positive. The objective of the modeling is to improve the FDR while minimizing the FAR.
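The two metrics follow directly from the confusion-matrix counts, as the sketch below shows. The label encoding (1 for failed, 0 for operational) is an assumption, and the function expects both classes to be present in `y_true`.

```python
def fdr_far(y_true, y_pred):
    """Return (failure detection rate, false alarm rate).

    Labels: 1 = failed drive, 0 = operational drive (assumed encoding).
    Both classes must be present in `y_true`.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fdr = tp / (tp + fn)  # recall / sensitivity
    far = fp / (fp + tn)  # 1 - specificity
    return fdr, far
```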
The process of slicing SMART time series data requires consideration of two key variables: signal length and time-in-advance. Many factors weigh in when deciding on the values of these two variables, such as information richness, storage requirement, processing capacity, predictive accuracy, and maintenance flexibility. To gauge the impact of these two variables on predictive performance, a sensitivity analysis may be conducted on exhaustive combinations of these variables within a practical range. The feature extraction calls for time series as input, so the range for signal length begins at 2 and concludes with 50 days. The range for time-in-advance spans from 0 to 35 days. This indicates that the model may issue an alarm up to a month in advance, which is considered a reasonable upper limit. For each combination, XGBoost models were optimized for 1000 trials in a space of various model parameters like tree depth, booster type, regularization weights, learning rate, minimum loss reduction, and tree growth policies. This optimization process was executed for 1764 combinations, resulting in a total of 1,764,000 independent trials.
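The size of the sensitivity grid can be checked with a few lines. The variable names are illustrative; the ranges and trial count are the ones stated above.

```python
from itertools import product

signal_lengths = range(2, 51)      # 2..50 days inclusive -> 49 values
times_in_advance = range(0, 36)    # 0..35 days inclusive -> 36 values
TRIALS_PER_COMBINATION = 1000      # hyper-parameter trials per grid point

# Every (signal length, time-in-advance) pair evaluated in the sensitivity analysis.
grid = list(product(signal_lengths, times_in_advance))
total_trials = len(grid) * TRIALS_PER_COMBINATION

# Longest history any grid point requires from a drive (50 + 35 days).
max_history_days = max(sl + tia for sl, tia in grid)
```

The grid contains 49 × 36 = 1764 combinations, which at 1000 trials each gives the 1,764,000 trials cited above.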
The XGBoost model is re-fit on the training set, and Table 3 presents the classification performance on the test set, which contains Backblaze hard drive data from 2023 Q3. The proposed model achieves 74.7% failure detection rate and a false alarm rate of 0.73%. The model outperforms the interpretable optimal survival tree in both metrics. This demonstrates that the crafted and selected features from the raw SMART time series data effectively represent the predictive power of the model while maintaining explainability.
The SHAP summary plot in
Multiple studies have shown that the value of SMART 187 is a key predictor for hard drive failure [10, 11]. The data shown in the summary plot of
The mean value of SMART 197 also displays consistently high SHAP values. However, unlike the mean of SMART 187, when its SHAP value is positive, the correlation between the feature value and its SHAP value is not distinct. The pNN40 feature, which represents the percentage of successive differences exceeding 40% of the standard deviation, can complement the predictive power of the mean of SMART 197. The results from the summary plot suggest that a higher SMART 197 pNN40 value, that is, a higher proportion of large successive changes, increases the predicted failure risk.
On the other hand, the impact of the raw variance of SMART 183 and the pNN40 of SMART 5 on the likelihood of failure is different from that of other features. Higher values of these features correspond to a negative SHAP value, which indicates a reduced risk of failure. This seemingly counterintuitive result in fact helps lower the false alarm rate of the predictive model. It is important to note that the objective of this study is to predict the potential hard drive failure within a relatively short time-in-advance, compared with the average lifetime of a hard drive. An increase in these features indicates a decline in hard drive performance but does not necessarily mean that the drive will fail in the short term.
A crucial aspect of the alignment strategy is using SHAP analysis to make the model's predictions transparent and interpretable. With SHAP, each prediction may be broken into individual feature contributions, and the model's reasoning may then be compared with ground truth and human judgment. In some embodiments, a prediction result interface with a dual-graph representation may be used, which includes a waterfall graph and a radial plot.
The waterfall graph shows the sequential contributions of different features towards a specific prediction. The features are analyzed from the bottom going up to see how each feature contributes additively to the likelihood of failure. The radial plot is placed adjacent to the waterfall graph and charts the normalized SMART values over a set monitoring interval. Each radial axis corresponds to a distinct SMART attribute and displays the temporal evolution of that attribute. The silhouette surrounding the plot represents the evolution of SMART attributes over the monitoring period, where the time series from this period serves as model input.
For the hard drive identified by the serial number S300WQBL, the waterfall graph from
For the hard drive identified by the serial number S301KMSG, the key failure indicator is the mean value of SMART 184, as shown in
The hard drive S300WDVC is a typical operational drive with no SMART attribute change over the monitoring period. In the waterfall graph shown in
In conclusion, the methods and techniques described herein for predicting hard drive failure through the extraction of explainable features from SMART time series data show promising results. The SHAP feature importance analysis revealed that higher mean values of SMART 187 and 197, as well as a higher pNN40 value of SMART 197, increased the predicted likelihood of failure. On the other hand, higher values of the raw variance of SMART 183, and of the pNN40 and the long stretch of incremental difference of SMART 5, corresponded to a reduced risk of failure in the short term. The re-fit XGBoost model achieved a 74.7% failure detection rate and a 0.73% false alarm rate, outperforming the interpretable optimal survival tree in both metrics.
The hard drive failure prediction system may deliver accurate, interpretable, and actionable predictions for disk health monitoring. By leveraging explainable time series features, conducting sensitivity analyses, and utilizing explainable machine learning models, the models enhance the effectiveness of disk prognostics, enabling proactive maintenance and optimization of resources in various implementation environments such as data center pod operations and NAS pods. Additionally, the possibility of deploying these models of the hard drive failure prediction system on edge computing devices further extends their application potential, enabling real-time monitoring and localized decision-making for enhanced disk system management.
As used herein, “consisting essentially of” allows the inclusion of materials or steps that do not materially affect the basic and novel characteristics of the claim. Any recitation herein of the term “comprising”, particularly in a description of components of a composition or in a description of elements of a device, can be exchanged with “consisting essentially of” or “consisting of”.
While the present invention has been described in conjunction with certain preferred embodiments, one of ordinary skill, after reading the foregoing specification, will be able to effect various changes, substitutions of equivalents, and other alterations to the compositions and methods set forth herein.
This application claims the priority of U.S. Provisional Application No. 63/464,171 filed May 4, 2023 and entitled “Hard Disk Drive Analysis Method”, the whole of which is hereby incorporated by reference.
Number | Date | Country
---|---|---
63464171 | May 2023 | US