FORECASTING DISK FAILURE
Physical drives used in many data centers fail at an average annual rate of around 2%. These sudden crashes contribute to maintenance and downtime costs. In this case study, we aim to predict disk failure events several weeks in advance. This enables reasonable predictive maintenance, such as proactive replacement of disks and rebuilding of the RAID array. The model we present is purely data-driven, but besides providing predictions it also uncovers correlations between individual S.M.A.R.T. attributes and their relative importance in determining possible disk failures.
We used the dataset provided by the Backblaze data storage company [BB1, BB2], covering nearly 100,000 monitored physical drives. These data made it possible to train an effective classification model that predicts whether a disk is likely to crash in the near future. Forecasting disk failures is a typical application of machine learning in the growing area of predictive maintenance [simafore, Arimo].
Each HDD as well as SSD medium monitors and regularly records so-called S.M.A.R.T. parameters [wiki-SMART], which represent various physical quantities reflecting the actual state of the device [BB3, BB4]. Parameters include temperature, number of bad sectors (reallocation counts), probational counts, power-on hours, read/write error rate, spin-up time, power cycle count, head stability and many more (often less easily interpretable) features. These attributes are provided on a daily basis by Backblaze, along with information on whether each disk failed (or was proactively replaced by Backblaze) on the following day.
A major problem with S.M.A.R.T. parameters is that their availability varies widely across individual models, since each vendor provides a different collection of attributes. This naturally leads to the idea of training and using a separate model for each medium type. Another reason for a separate model per disk type is that the recorded S.M.A.R.T. values may vary in meaning (or physical unit) depending on the drive manufacturer. Last but not least, some S.M.A.R.T. attributes may affect different disk types in different ways (e.g. temperature was found to affect Seagate HDDs more than other types [BB5]). For the purpose of our study, we chose to investigate the Seagate ST4000DM000 model with more than 35,000 physical drives.
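The per-model extraction step described above can be sketched as follows, assuming pandas and the column naming of the published Backblaze daily CSVs (date, serial_number, model, failure, smart_<id>_raw); the helper name is ours:

```python
import pandas as pd

def extract_model(day_df, model="ST4000DM000",
                  smart_ids=(5, 9, 187, 193, 197, 240, 242)):
    """From one daily Backblaze snapshot, keep only rows for the chosen
    drive model and the raw values of the listed S.M.A.R.T. attributes.
    Column names are assumed to follow the published daily-CSV schema."""
    cols = (["date", "serial_number", "failure"]
            + [f"smart_{i}_raw" for i in smart_ids])
    return day_df.loc[day_df["model"] == model, cols]

# Typical use on one day's CSV:
# day = pd.read_csv("2017-01-01.csv")
# st4000 = extract_model(day)
```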
A snapshot of how the parameters may behave before a disk crashes is illustrated in Fig. 1. Visual inspection of the actual dataset guided our ideas about how to properly preprocess the data.
Fig. 1 – (Upper picture) Visualization of raw S.M.A.R.T. parameters for several failed and healthy disks. (Lower picture) Detail of two randomly chosen crashed disks – the vertical line separates the two individual serials, with time running from left to right (one time point = one disk day). The disk on the left shows a clear change in one S.M.A.R.T. attribute (violet dots) before the crash, while the one on the right shows no signs of the upcoming crash at all. The orange 1/0 bi-valued function indicates whether there are more (0) or fewer (1) than n=12 days before the disk failed.
Processing the dataset
During the construction of the input matrix ready to be fed into a machine learning model, all S.M.A.R.T. attributes underwent several preprocessing steps. First, we analyzed the correlations between the attributes and an upcoming crash, as well as among the attributes themselves. By this procedure, we isolated the 7 S.M.A.R.T. parameters most correlated with a crash (#5, #9, #187, #193, #197, #240 and #242 for Seagate models) and included them as features. We then created an additional set of features by transforming the original ones and added them to the common input dataset.
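The correlation-based selection step could be sketched roughly as follows; the function and column names are illustrative assumptions, not the actual pipeline:

```python
import pandas as pd

def select_top_correlated(df, target="will_fail", k=7):
    """Rank feature columns by absolute correlation with the crash label
    and return the k strongest. `df` holds one row per disk day with a
    binary `will_fail` column (names are ours, for illustration)."""
    corr = df.corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False).head(k).index.tolist()
```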
The output vector was created by recording 0 for disks on days when they were healthy, and 1 on days when they were going to crash within n days. The number n is how many days in advance we wish the disk crash to be forecast, and it de facto serves as a tunable model hyperparameter. Several values between 1 and 6 weeks were tested and the best value was found to be around 4 weeks.
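A minimal sketch of this labeling rule, assuming 0-based consecutive day indices per disk (the helper is hypothetical):

```python
def label_disk_days(n_days_observed, fail_day, n=28):
    """Label each observed day of one disk: 1 if the disk fails within
    the next n days, else 0. fail_day is the (0-based) index of the day
    the disk crashes, or None for a disk that stays healthy."""
    labels = []
    for day in range(n_days_observed):
        if fail_day is not None and 0 <= fail_day - day <= n:
            labels.append(1)
        else:
            labels.append(0)
    return labels
```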
Choosing the model & evaluating performance
We trained several machine learning algorithms, e.g. logistic regression, a simple neural network (multilayer perceptron with a few hidden layers), Naive Bayes and random forests; the random forests ensemble model proved to perform best.
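The comparison can be sketched with scikit-learn on synthetic data standing in for the real feature matrix; this is an illustration of the procedure, not the actual training setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed S.M.A.R.T. feature matrix:
# 7 features, rare positive ("will crash") class.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 7))
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)   # test-set accuracy
```

In practice accuracy alone is misleading on such imbalanced labels, which is why the report evaluates precision, recall, F1 and Cohen's kappa below.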
Performance was measured in terms of true and false positive/negative predictions (TP, FP, TN, FN), where positives refer to disks that are predicted to crash (within the n-days criterion) and negatives to the majority of healthy disk days. Performance indicators were evaluated on a validation set and include Acc – accuracy, P – precision, R – recall (sensitivity), F1-score, κ – Cohen's kappa, as well as sensitivity S (equal to recall) and specificity Sp, representing accuracies for the two individual classes. The indicators are defined by the following formulas:
Acc = (TP + TN) / (AP + AN)
P = TP / (TP + FP)
R = TP / (TP + FN)
S = TP / AP
Sp = TN / AN
F1 = 2PR / (P + R)
κ = (p0 - pe) / (1 - pe), where
p0 = (TP + TN) / (AP + AN)
pe = ((TP + FP)(TP + FN) + (FN + TN)(FP + TN)) / (AP + AN)^2
Here AP = TP + FN and AN = TN + FP are abbreviations for all positives and all negatives, respectively.
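The indicators above can be computed directly from the four confusion-matrix counts; a small sketch (our helper, not part of the original pipeline):

```python
def macro_metrics(TP, FP, TN, FN):
    """Compute Acc, P, R, Sp, F1 and Cohen's kappa from confusion counts,
    following the formulas above (AP = all positives, AN = all negatives)."""
    AP, AN = TP + FN, TN + FP
    acc = (TP + TN) / (AP + AN)
    p = TP / (TP + FP)
    r = TP / (TP + FN)                 # = sensitivity S
    sp = TN / AN                       # specificity
    f1 = 2 * p * r / (p + r)
    p0 = acc
    pe = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / (AP + AN) ** 2
    kappa = (p0 - pe) / (1 - pe)
    return {"Acc": acc, "P": p, "R": r, "Sp": sp, "F1": f1, "kappa": kappa}
```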
In the mathematical formulation of our predictive model, a single sample represents one disk on one day. However, the performance indicators presented below are not evaluated on such microscopic examples; instead, they aim to provide a useful summary of how the model performs in predicting the health status of individual disks. We call these macroscopic accuracy indicators, where true/false positive and negative predictions refer to correctly/incorrectly predicted healthy and failed disks (one sample is represented by one disk over its entire history).
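One possible way to collapse per-day predictions into such a disk-level outcome is to flag a disk as "predicted to fail" if any of its days was classified positive; this aggregation rule is our assumption for illustration, not necessarily the one used in the study:

```python
def disk_level_outcome(day_preds, failed):
    """Collapse the per-day 0/1 predictions of one disk into a single
    macroscopic outcome (TP/FP/TN/FN). A disk counts as 'predicted to
    fail' if any of its days was flagged positive."""
    predicted_fail = any(day_preds)
    if failed:
        return "TP" if predicted_fail else "FN"
    return "FP" if predicted_fail else "TN"
```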
Results – macroscopic accuracy indicators
The results of the random forests model applied to the ST4000DM000 medium type are provided in the table below. The model was trained and tested on disks randomly drawn from the 2016-2017 data, which contain around 36,000 healthy disks and ca. 1,800 crashed ones. Training the model on more examples should yield even better results. The optimal hyperparameter configuration found for "days before crash considered as crash" is n = 4 weeks.
Tab. 1 – Macroscopic skills for the random forests model with n=28 days. Train and test-set results are practically identical, indicating that our model is not overfitted.
Results on the test set show that the random forests model is able to correctly catch nearly 60% of disks that are going to fail in the near future. However, previous research demonstrated that only a fraction of disk crashes is predictable and can be linked to changes in the S.M.A.R.T. parameters [Pinhero, wiki-SMART]. The proportion of predictable crashes is not known exactly (and will vary across drive models), but the estimated value of 64% [Pinhero] suggests that the results of our model are close to the theoretical limit, since around one third of all disk failures may be caused by unpredictable events. At the same time, the model maintains a low number of false alarms, kept under 3% (fewer than 500 FP out of over 16,000 disks), which is a fairly good result for such highly imbalanced classes.
The time evolution of disk parameters, along with the real and predicted remaining disk lifetime functions, is illustrated in Fig. 2. For this particular disk, the real (orange) and predicted (blue) functions correspond very well.
Fig. 2 – Visualization of one disk from the ST4000DM000 series in time, showing S.M.A.R.T. features and the output functions. The orange function shows the real remaining lifetime of the disk – value 0 means the disk is healthy (there are more than n=9 days before it crashes), while value 1 means there are fewer than n=9 days before the crash (the disk crashes on the day where the second vertical line is placed). The blue dashed-dotted line is the corresponding prediction function, which would ideally be equal to the orange line. The green dashed curve is the predicted continuous probability function representing the actual "health status" of the disk. We can see that the probability of an upcoming crash starts to rise from 0 about 2 weeks before the crash and reaches the 0.5 threshold ca. 1 week before the crash, causing the binary prediction function (blue) to switch to 1 and predict that this particular disk is going to crash.
We note that our problem deals with very strongly imbalanced classes, where only about 2% of drives fail annually, and such a strong imbalance is always challenging for any predictive model. Moreover, it is well known that many crash events are caused by sudden electronic malfunctions or unpredictable catastrophic mechanical failures, which simply cannot be predicted. The obtained results, especially for the class of failed disks, are therefore very satisfactory, considering that the proportion of disks whose failures can actually be predicted is probably very close to our results.
Moreover, it must be emphasized that the model described in this short report is very simplified, and there are many options for improving it or for solving the problem completely differently. Possible improvements range from using larger datasets, downsampling the negative class in the training set or imputing missing data, through using different classification algorithms (e.g. more complex neural networks), to employing completely different approaches such as treating the problem as anomaly detection or incorporating the vendor-provided threshold values for certain S.M.A.R.T. parameters into the model. More advanced techniques may also include training ensembles of classifiers or using penalized models. Nevertheless, the results presented here demonstrate the ability of even a simple machine learning model to predict a majority of disk crashes. In the continuation of our study, we will present results with some of the above improvements included.
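As one example, downsampling the negative class mentioned above could be sketched as follows (an illustrative helper, not the method used in this study):

```python
import random

def downsample_negatives(samples, labels, ratio=5, seed=0):
    """Keep all positive (crash) samples and a random subset of negatives,
    so the negative:positive ratio is at most `ratio`. Reduces the class
    imbalance of the training set at the cost of discarding healthy days."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(neg), ratio * len(pos)))
    idx = sorted(pos + keep_neg)
    return [samples[i] for i in idx], [labels[i] for i in idx]
```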
The greatest challenge, besides improving the presented model for Seagate drives, is the development of a general (model-agnostic) predictive model that would be able to predict the crash of any type of HDD or SSD medium. This will require, alongside some R&D, gathering a very large amount of data over an extended period of time, which is something that Backblaze is still doing for us!
[BB1] Hard drive data and stats. https://www.backblaze.com/b2/hard-drive-test-data.html
[BB2] Reliability Data Set For 41,000 Hard Drives Now Open Source. https://www.backblaze.com/blog/hard-drive-data-feb2015/
[wiki-SMART] S.M.A.R.T. https://en.wikipedia.org/wiki/S.M.A.R.T.
[simafore] Manufacturing analytics: predictive models for machine failure. http://www.simafore.com/blog/bid/103702/Manufacturing-Analytics-predictive-models-for-machine-failure
[Arimo] Manufacturing downtime cost reduction with predictive maintenance. https://arimo.com/machine-learning/2016/manufacturing-downtime-cost-reduction-predictive-maintenance/
[BB3] Hard Drive SMART Stats. https://www.backblaze.com/blog/hard-drive-smart-stats/
[BB4] What SMART Stats Tell Us About Hard Drives. https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/
[BB5] Hard Drive Temperature – Does It Matter? https://www.backblaze.com/blog/hard-drive-temperature-does-it-matter/
[Pinhero] Pinheiro, E., Weber, W.-D., and Barroso, L. A. Failure trends in a large disk drive population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST'07), 2007.
[Botezatu] Botezatu, M., et al. Predicting Disk Replacement towards Reliable Data Centers. Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining, San Francisco, USA, 2016.
[El-Shimi] El-Shimi, A. Predicting Storage Failures. VAULT-Linux Storage and File Systems Conference, Cambridge, 2017.
[Li] Li, W., Suarez, I., Camacho, J. Proactive Prediction of Hard Disk Drive Failure.