steamsprocket.org.uk

ZFS: One or more devices has experienced an unrecoverable error

I’m using ZFS (via ZFS-FUSE), and at one point a zpool status gave me this rather scary report:

zpool status
  pool: srv
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME                                                      STATE     READ WRITE CKSUM
        srv                                                       ONLINE       0     0     0
          disk/by-id/usb-Samsung_STORY_Station_0000002CE0C43-0:0  ONLINE       0     0     1
          disk/by-id/scsi-SATA_ST3500630AS_9QG3JRW0               ONLINE       0     0     0
          disk/by-id/scsi-SATA_SAMSUNG_HD103UJS13PJDWS516679      ONLINE       0     0     0

errors: No known data errors

‘Unrecoverable error’, eh? Crap. Wait, how can applications be unaffected by an unrecoverable error? How can there be ‘no known data errors’? Also, how is there a checksum error, but no read or write errors? What other operation could there be? Better check that link.

…Well apparently ‘the device has experienced a read I/O error, write I/O error, or checksum validation error’. I guess that implies an answer to the last question: ‘READ’ and ‘WRITE’ refer specifically to disk I/O errors. Not to errors in reading or writing in general, just those where the disk itself has detected an error and reported it back. The way that information is presented is pretty appalling, as you just need to know that peculiarity in order to interpret it, but okay, let’s press on.
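To make that reading of the three columns concrete, here is a minimal Python sketch that pulls the READ/WRITE/CKSUM counters out of zpool status text and lists the devices with a non-zero count. The function names (parse_error_counters, devices_with_errors) are my own, and the sketch assumes the config table has exactly the header shown above and that every counter is a plain integer; it is not a robust parser for every layout zpool can print.

```python
def parse_error_counters(status_text):
    """Return {device_name: (read, write, cksum)} from `zpool status` text.

    Assumption: the config table starts with the header
    NAME STATE READ WRITE CKSUM and ends at the next blank line.
    """
    counters = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
        elif in_table and not fields:
            break  # a blank line ends the config table
        elif in_table and len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            counters[fields[0]] = tuple(int(f) for f in fields[2:5])
    return counters


def devices_with_errors(counters):
    """Keep only the entries where at least one counter is non-zero."""
    return {name: c for name, c in counters.items() if any(c)}


# Abridged copy of the report from this post.
SAMPLE = """\
  pool: srv
 state: ONLINE
config:

        NAME                                                      STATE     READ WRITE CKSUM
        srv                                                       ONLINE       0     0     0
          disk/by-id/usb-Samsung_STORY_Station_0000002CE0C43-0:0  ONLINE       0     0     1
          disk/by-id/scsi-SATA_ST3500630AS_9QG3JRW0               ONLINE       0     0     0
          disk/by-id/scsi-SATA_SAMSUNG_HD103UJS13PJDWS516679      ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    flagged = devices_with_errors(parse_error_counters(SAMPLE))
    for name, (read, write, cksum) in flagged.items():
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

Run against the report above, this flags only the USB Samsung disk, and only in the CKSUM column: the disk never reported an I/O failure, it just handed back data that didn’t match its checksum.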

‘Because the device is part of a mirror or RAID-Z device, ZFS was able to recover from the error and subsequently repair the damaged data’. What? No it isn’t. This is just a striped pool: no mirroring or RAID-Z involved. And anyway, you said it was unrecoverable. Now I’m beginning to worry. The documentation claims that ZFS was able to recover from an unrecoverable error using data replication that doesn’t exist; WTF does that mean? Well obviously it means that the documentation was written by an imbecile, but what’s the message they’re clumsily trying to get across?

After some googling led me to this thread, I did eventually work this out. You get the message about an unrecoverable error if and only if (and this part’s genius) ZFS was able to recover. How very… special. If it wasn’t able to recover, you’ll instead be told that ‘a file or directory could not be read due to corrupt data’. No mention of the word ‘unrecoverable’ there.
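If you wanted to encode that translation so you never have to work it out again, a rough Python sketch might look like this. The function translate_status is hypothetical, and it only knows the two wordings quoted in this post; anything else is reported as unknown rather than guessed at.

```python
def translate_status(status_text):
    """Map the two messages quoted in this post to what they turned out to mean."""
    # Collapse the line wrapping and extra spaces that zpool adds.
    text = " ".join(status_text.split()).lower()
    if "could not be read due to corrupt data" in text:
        return "not recovered: some data is actually damaged"
    if "unrecoverable error" in text and "applications are unaffected" in text:
        return "recovered: ZFS corrected the error"
    return "unknown: not one of the two messages covered here"


if __name__ == "__main__":
    scary = (
        "status: One or more devices has experienced an unrecoverable error.  An\n"
        "        attempt was made to correct the error.  Applications are unaffected."
    )
    print(translate_status(scary))
```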

But wait, how could it recover from it if there’s no data redundancy, and why does it think it’s a mirrored or RAID-Z device? The answer to that would appear to be that, through some happenstance, the error corrupted some metadata rather than actual file data. Since metadata always has at least one redundant copy, it corrected it as if it were mirrored. Phew.
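The general idea can be sketched in a few lines of Python. This is a toy model and emphatically not ZFS’s actual code or on-disk format: the RedundantBlock class, the use of SHA-256 and the in-memory ‘copies’ are all assumptions made for illustration. The point is just that the checksum is kept apart from the data, a read verifies a copy against it, and a good copy is used to overwrite any bad one found along the way.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when no stored copy matches the expected checksum."""


class RedundantBlock:
    """Toy block: several copies of the data plus one trusted checksum."""

    def __init__(self, data: bytes, copies: int = 2):
        self.checksum = hashlib.sha256(data).digest()
        self.copies = [bytearray(data) for _ in range(copies)]
        self.cksum_errors = 0  # loosely analogous to the CKSUM column

    def read(self) -> bytes:
        bad = []
        for index, copy in enumerate(self.copies):
            if hashlib.sha256(copy).digest() == self.checksum:
                # Found a good copy: repair the bad ones seen so far.
                for b in bad:
                    self.copies[b] = bytearray(copy)
                return bytes(copy)
            self.cksum_errors += 1
            bad.append(index)
        raise ChecksumError("no copy matches the checksum")


if __name__ == "__main__":
    # Metadata-like block with a spare copy: the error is counted and repaired.
    metadata = RedundantBlock(b"directory entry", copies=2)
    metadata.copies[0][0] ^= 0xFF  # simulate silent corruption of one copy
    print(metadata.read(), metadata.cksum_errors)

    # File-data-like block on a plain stripe: nothing to recover from.
    file_data = RedundantBlock(b"file contents", copies=1)
    file_data.copies[0][0] ^= 0xFF
    try:
        file_data.read()
    except ChecksumError as exc:
        print("unreadable:", exc)
```

In this model the two-copy block plays the part of my metadata (error counted, data fine), while the single-copy block is what would have happened had the same corruption landed in ordinary file data.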

So to recap: so far as ZFS is concerned, the alarming phrase ‘unrecoverable error’ means ‘error from which ZFS has successfully recovered’. Thanks for that Sun.


3 Responses to “ZFS: One or more devices has experienced an unrecoverable error”

Morten, April 6th, 2010 at 13:07

And once more it is demonstrated why hardware engineers should not be allowed anywhere near software development… 🙂

Michael Mol, July 27th, 2010 at 16:18

Ew, ew, ew. A SAMSUNG HD103* hard drive.

I’ve had two such drives. I bought an HD103UI when it was on sale on Newegg during some Black Friday hardware purchase binge. It failed within a few months. Found bugs evident in the firmware by looking at it via smartctl (test result reporting and logging was just screwed *up*), RMA’d drive with attached note about buggy firmware. Got another HD103UI back from them. Ran smartctl on it, noticed same firmware revision. Twigged out on me within a couple months. Attempted to RMA, never heard back.

As for your ZFS error, it was reporting a failed *device*, which is different from filesystem failure. Think of it like a device dropping out of a redundant RAID setup (which ZFS shares some functionality with); yeah, the device died. Throw in a replacement, rebuild (or apparently “zpool replace”, in the case of ZFS), move on.

If you’ve ever used a redundant RAID setup like 1, 5 or 6, or possibly even LVM, you probably got used to a distinction between the filesystem layer and the logical volume the filesystem sat on. ZFS blurs those boundaries by (optionally) using enough redundancy across enough of your physical volumes that, should one fail, you shouldn’t see an operational difference unless you lose more before replacing it.

Interesting parts of that status message: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

No offense intended, but you’re a sysadmin, using a filesystem targeted at data centers and people who are paid a hefty salary to focus on keeping servers running. Don’t be surprised if there’s a learning curve.

nye, July 31st, 2010 at 12:00

Do you happen to know what firmware version you had trouble with? Currently that’s the only drive I still haven’t been able to afford to mirror, so data loss on that one would be particularly bad. That said, it has been in constant use for over a year now without any errors, so I’m not *too* worried. Rather more worried about that godawful Seagate (which is now mirrored with an identical drive) – Seagate used to be good once upon a time :-(.

Anyway, I still firmly believe that the message is stupid. It’s not even *correct* – in reality there was no device error; there was a checksum error. The most likely reason for that is that the block in question (or its checksum) was written incorrectly in the first place, due to a spot of background radiation or whatever.

Not bothering to think about decent user interfaces and documentation just because it’s targeted at the highly paid is a poor excuse.

The fact that software intended for enterprisey use is almost universally unpleasant to use due to needlessly poor design is another rant…
