HDD Health Analysis

Every day I get a report from my NAS. It includes a bunch of data about ZFS datasets and general machine health. However, one thing was missing: I didn't really capture hard disk SMART errors.

As a disk will report a bunch of values in SMART, I first had to decide which ones to use. A great help here came from Backblaze, as they publish hard drive test data and stats. It is a wealth of information and I recommend reading it all. If you decide on a shortcut, one of the links contains the SMART stats they've found to indicate drive failure quite reliably.

The first one is Reallocated Sectors Count (5). It is essentially a counter of bad sectors found during the drive's operation. Ideally you want this number to be 0. As soon as it starts increasing, one should think about replacing the drive. All my drives so far have this value at 0.

The second attribute I track is Reported Uncorrectable Errors (187). This one shows the number of errors that could not be corrected internally using ECC and that resulted in an OS-visible read failure. Interestingly, only my SSD cache supports this attribute.

One I decided not to track is Command Timeout (188) as, curiously, none of my drives actually report it. Looking into Backblaze's data, it seems that this one is also the most unreliable of the bunch, so no great loss here.

I do track the Current Pending Sector Count (197) attribute. While this one doesn't necessarily mean anything major is wrong and it is transient in nature (i.e. its value can change between some number and 0), I decided to track its value as it indicates potential issues with the platter, even if the data can be read at a later time. This attribute is present (and 0) on my spinning disks while the SSD doesn't support it.

The fifth attribute they mentioned, Uncorrectable Sector Count (198), I do not track. While its value could indicate potential issues with the platters and disk surface, it is updated only via an offline test. As I don't do those, this value will never actually change. Interestingly, my SSD doesn't even support this attribute.

I additionally track Power-On Hours (9). I do not have an actual threshold, nor do I plan to replace the drive when it reaches a certain value, but it will definitely come in handy in correlation with other (potential) errors, as all my disks support this attribute. Interestingly, Backblaze found that failure rates rise significantly after three years. I do expect my drives to last significantly longer as my NAS isn't stressed nearly as much as Backblaze's data center.

Lastly, I track Temperature (194). Again, I track it only to see if everything is OK with cooling. All my drives support it and, as expected, the SSD's temperature is about 10 degrees higher than that of the spinning drives.
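As a rough sketch of how the attributes above could feed a daily report, the raw values can be folded into a single status word. The function names (is_positive, disk_status) and the status words are made up for illustration, and treating any non-zero Reported Uncorrectable Errors value as a warning is an assumption rather than a rule stated above. Attributes that a drive does not report can be passed as "n/a" or left empty.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds only for a whole number greater than zero.
# Empty or non-numeric input (e.g. "n/a") counts as "not reported".
is_positive() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}

# Hypothetical helper: fold three raw values into one status word.
#   $1 = Reallocated Sectors Count (5)
#   $2 = Reported Uncorrectable Errors (187)
#   $3 = Current Pending Sector Count (197)
disk_status() {
    if is_positive "$1" || is_positive "$2"; then
        echo "WARN"     # bad sectors or OS-visible read failures seen
    elif is_positive "$3"; then
        echo "WATCH"    # pending sectors are transient, so only keep an eye on it
    else
        echo "OK"
    fi
}

disk_status 0 n/a 0    # spinning disk with no issues
disk_status 0 n/a 8    # pending sectors present
disk_status 3 0 0      # reallocated sectors present
```

Power-On Hours and Temperature are left out on purpose, since neither has a threshold here; they are more useful as context printed next to the status.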

Here is a small and incomplete bash example of the commands I use to capture these stats on NAS4Free:

DEVICE=ada0
# Run smartctl once and reuse its output for every attribute
DISK_SMART_OUTPUT=$(smartctl -a "/dev/$DEVICE" 2> /dev/null)
# Column 10 is the raw value; cut drops a trailing "h..." suffix some drives add to the hours
DISK_REALLOCATED=$(echo "$DISK_SMART_OUTPUT" | grep -E "^ *5 Reallocated_Sector_Ct" | awk '{print $10}' | cut -dh -f1)
DISK_HOURS=$(echo "$DISK_SMART_OUTPUT" | grep -E "^ *9 Power_On_Hours" | awk '{print $10}' | cut -dh -f1)
DISK_UNCORRECTABLE=$(echo "$DISK_SMART_OUTPUT" | grep -E "^187 Reported_Uncorrect" | awk '{print $10}' | cut -dh -f1)
DISK_TEMPERATURE=$(echo "$DISK_SMART_OUTPUT" | grep -E "^194 Temperature_Celsius" | awk '{print $10}' | cut -dh -f1)
DISK_PENDING=$(echo "$DISK_SMART_OUTPUT" | grep -E "^197 Current_Pending_Sector" | awk '{print $10}' | cut -dh -f1)

Note that I capture the whole smartctl output into a variable instead of making multiple calls. This is just a bit of a time saver and there is no issue (other than speed) with simply calling smartctl multiple times. If you do decide to call it only once, do not forget the quotes around the "echoed" variable, as they make bash preserve whitespace, including the line breaks that the line-anchored patterns depend on.
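The repeated grep, awk, and cut pipeline could also be folded into one small function. This is only a sketch: smart_raw_value is a hypothetical name, the "n/a" fallback for unsupported attributes is an assumption, and the sample text below is made-up data in the usual smartctl attribute-table layout so the function can be tried without a real disk.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the raw value (column 10) of one SMART attribute.
#   $1 = attribute ID, $2 = full text captured from "smartctl -a"
# Prints "n/a" when the drive does not report the attribute.
smart_raw_value() {
    value=$(printf '%s\n' "$2" | awk -v id="$1" '
        # Match on the ID column, so leading padding does not matter
        $1 == id && NF >= 10 {
            sub(/h.*/, "", $10)   # drop an "h+MMm..." suffix, like cut -dh -f1
            print $10
            exit
        }')
    echo "${value:-n/a}"
}

# Made-up sample lines, for illustration only
SAMPLE='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       9142h+12m+03.5s
194 Temperature_Celsius     0x0022   118   108   000    Old_age   Always       -       32
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0'

smart_raw_value 5 "$SAMPLE"      # prints 0
smart_raw_value 9 "$SAMPLE"      # prints 9142
smart_raw_value 187 "$SAMPLE"    # prints n/a (attribute not in the sample)
```

On a real system the second argument would be the captured smartctl output, for example smart_raw_value 194 "$DISK_SMART_OUTPUT". The quotes matter here for the same reason as with echo.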

PS: For the curious, the drives I use are 2x WD Red 4 TB (3.5"), 2x Seagate 2 TB (2.5"), and a Mushkin 120 GB (mSATA) SSD cache.

[2018-07-22: NAS4Free has been renamed to XigmaNAS as of July 2018]
