One of my systems recently marked a ZFS pool as DEGRADED because one of the mirror discs had too many errors.
I suspect it failed during a routine scrub over the weekend, and then the system attempted to resilver the mirror after I rebooted for other reasons. It looked like this:
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: resilvered 276G in 0 days 15:39:02 with 0 errors on Mon Dec 14 22:55:58 2020
config:

	NAME        STATE     READ WRITE CKSUM
	rpool       DEGRADED     0     0     0
	  mirror-0  DEGRADED     0     0     0
	    sda2    FAULTED      0    12   211  too many errors
	    sdb2    ONLINE       0     0     0
	cache
	  sde       ONLINE       0     0     0
This pool is configured as a simple RAID-1 mirror of two Western Digital RED drives (model WD40EFRX-68WT0N0) with a small 128GB Micron SSD acting as a read cache. This configuration just proved itself by keeping the system running when one of the discs failed.
The disc itself has been powered on for over 42300 hours, which at roughly 4.8 years is close to its expected useful life of 5 years powered on. (These drives have a 5 year warranty, so I'll have to take that up with wherever I bought them from 5 years ago.) I've ordered some replacement drives, but while I wait for them to show up, I thought I'd try to repair the errors, if possible.
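For the curious, that conversion is just hours divided by hours-per-year. A quick sketch; the 42300 figure is hard-coded here from my drive's Power_On_Hours SMART attribute, which on a live system you could pull from `smartctl -A /dev/sda` instead:

```shell
#!/bin/sh
# Convert the drive's Power_On_Hours SMART attribute into years.
POWER_ON_HOURS=42300
awk -v h="${POWER_ON_HOURS}" 'BEGIN { printf "%.1f years\n", h / (24 * 365.25) }'
# prints "4.8 years"
```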
These discs are S.M.A.R.T. enabled so they can run some self-tests that will help us figure out where the problems are.
smartctl -t short /dev/sda
will run a short self-test, which stops at the first error. After a couple of minutes, we can run
smartctl --log=selftest /dev/sda
to see the results. They look like this:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       10%     42389         116522400
# 2  Short offline       Completed: read failure       10%     42390         116522400
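The LBA_of_first_error column is the bit we care about. Since smartctl prints a fixed-column table, the last field of a failing entry can be pulled out with a one-line awk filter for use in later commands. A sketch, run here against a sample line from the log above rather than a live disc:

```shell
#!/bin/sh
# Extract the failing LBA from smartctl self-test log output.
# On a live system, feed it `smartctl --log=selftest /dev/sda` instead.
log='# 1  Short offline       Completed: read failure       10%     42389         116522400'
echo "$log" | awk '/read failure/ { print $NF }'
# prints "116522400"
```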
I found this page about handling bad blocks detected by smartmontools quite helpful in explaining some of the steps you can take, but I found I could skip most of the maths related to calculating block offsets for filesystems and just use hdparm to talk to the disc directly.
Dangerous Repair Instructions
DANGER! Before we go any further, be aware that these hints here include some very dangerous and destructive commands. Do not try this on a system you care about and don’t have known-good and working backups for.
If you’re feeling brave and/or foolish, read on!
In the smartctl self-test results above, we found the Logical Block Address (LBA) of the first error was 116522400. In my case, this corresponds directly to the sector number on the disc, which means we can verify it using hdparm like this:
hdparm --read-sector 116522400 /dev/sda
Which gives us output like this:
/dev/sda:
reading sector 116522401: SG_IO: bad/missing sense data, sb[]:  70 00 03 00 00 00 00 0a 40 51 e6 01 11 04 00 00 00 a1 00 00 00 00 00 00 00 00 00 00 00 00 00 00
succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
...
This confirms a bad read. A good read looks like this:
/dev/sda:
reading sector 3302449: succeeded
7433 3264 3574 664e 6c47 4d73 584c 4c46
364e 7666 5453 5365 6e50 6f63 7078 6363
7357 0a36 3733 6370 316e 6d54 6c5a 6e46
3372 7572 4766 346b 3538 7245 784c 6f39
3631 7264 3763 3958 6b78 7938 4764 5a6e
5565 354e 3838 5257 6954 6336 747a 4675
3253 4d62 3672 706a 6752 6c30 7062 740a
...
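That wall of hex is the sector's raw contents, printed as 16-bit words. If you're curious what's actually in there, you can decode a few words back into bytes; note that because hdparm prints 16-bit words, the two bytes within each word may appear swapped relative to the on-disc byte order, so treat this as a peek rather than an exact reconstruction:

```shell
#!/bin/sh
# Decode a few of hdparm's hex words back into bytes.
echo '7433 3264 3574 664e' | xxd -r -p
# prints "t32d5tfN"
```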
Forcing Sector Reallocation
We can try to force the disc to reallocate this sector by writing to it with hdparm. Hard disc controllers have all sorts of fancy computers and software in them to make the physical hardware pretend it is vastly more stable and functional than it really is. They hide all kinds of errors from us most of the time, so we can try to take advantage of their helpful nature.
The command we want is
hdparm --write-sector 116522400 /dev/sda
which gives us this:
/dev/sda:
Use of --write-sector is VERY DANGEROUS.
You are trying to deliberately overwrite a low-level sector on the media.
This is a BAD idea, and can easily result in total data loss.
Please supply the --yes-i-know-what-i-am-doing flag if you really want this.
Program aborted.
I told you this was dangerous.
Because we are totally YOLO-ing this we will lie to the computer that we know what we’re doing.
hdparm --yes-i-know-what-i-am-doing --write-sector 116522400 /dev/sda
Which gives us this result:
/dev/sda:
re-writing sector 116522400: succeeded
Hooray!
Let’s check this worked correctly by reading the sector again:
hdparm --read-sector 116522400 /dev/sda
/dev/sda:
reading sector 116522400: succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
...
Success!
We can now repeat the self-test check to see if there are any other bad sectors. Turns out, yes. #sadface
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       10%     42390         116522401
# 2  Short offline       Completed: read failure       10%     42390         116522400
# 3  Short offline       Completed: read failure       10%     42389         116522400
Breaking Things At Scale
Rather than waiting 2 minutes for each self-test, I figured I'd just check that each nearby sector could be read with hdparm, and fix the ones with errors.
And because I like to let computers help me make mistakes at scale, I wrote a short bash script to automate the process:
#!/bin/bash
#
DISC="/dev/sda"
START_SECTOR="116522400"
MAX_SECTORS=10
END_SECTOR=$(( START_SECTOR + MAX_SECTORS ))

echo "Checking sectors ${START_SECTOR} to ${END_SECTOR}"
for i in $(seq "${START_SECTOR}" 1 "${END_SECTOR}"); do
    echo "Checking sector ${i}"
    result=$(hdparm --read-sector "${i}" "${DISC}" 2>&1 | grep "reading sector")
    echo "Got result: ${result}"
    if echo "${result}" | grep -q "bad/missing sense data"; then
        echo "Bad sector found. Attempting to correct..."
        hdparm --yes-i-know-what-i-am-doing --write-sector "${i}" "${DISC}"
    else
        echo "Sector seems okay."
    fi
done
There were about six bad sectors in this region to reallocate, plus another couple in a second region, but after a few iterations of this procedure I was able to get the self-test to pass.
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     42391         -
# 2  Short offline       Completed: read failure       10%     42390         116526024
# 3  Short offline       Completed: read failure       10%     42390         116522401
# 4  Short offline       Completed: read failure       10%     42390         116522400
# 5  Short offline       Completed: read failure       10%     42389         116522400
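While iterating, the drive's own SMART attributes are also worth watching: Reallocated_Sector_Ct counts sectors remapped to spares, and Current_Pending_Sector counts those still awaiting a rewrite. A sketch parsing sample `smartctl -A` lines; the raw values here are made up for illustration, and on a live system you would pipe in `smartctl -A /dev/sda` instead:

```shell
#!/bin/sh
# Report the reallocation-related SMART attributes.
# Sample lines in the format smartctl -A prints; the final field is the raw value.
attrs='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       6
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0'
echo "$attrs" | awk '/Reallocated_Sector_Ct|Current_Pending_Sector/ { print $2, $NF }'
```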
Now we can mark the disc as repaired so the pool can resilver:
zpool clear rpool
Now I just need to wait for the pool to resilver over the next 18 hours or so:
  pool: rpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Dec 15 08:12:08 2020
	856G scanned at 296M/s, 148G issued at 51.1M/s, 3.27T total
	118G resilvered, 4.40% done, 0 days 17:49:53 to go
config:

	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sda2    ONLINE       0     0     0  (resilvering)
	    sdb2    ONLINE       0     0     0
	cache
	  sde       ONLINE       0     0     0

errors: No known data errors
Update: Success!
  pool: rpool
 state: ONLINE
  scan: resilvered 3.21T in 0 days 14:25:27 with 0 errors on Tue Dec 15 22:37:35 2020
config:

	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sda2    ONLINE       0     0     0
	    sdb2    ONLINE       0     0     0
	cache
	  sde       ONLINE       0     0     0

errors: No known data errors
Happy disc repairing!