ZFS: One or more devices has experienced an unrecoverable error

I’m using ZFS (via ZFS-FUSE), and at one point a zpool status gave me this rather scary report:

zpool status
  pool: srv
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME                                                      STATE     READ WRITE CKSUM
        srv                                                       ONLINE       0     0     0
          disk/by-id/usb-Samsung_STORY_Station_0000002CE0C43-0:0  ONLINE       0     0     1
          disk/by-id/scsi-SATA_ST3500630AS_9QG3JRW0               ONLINE       0     0     0
          disk/by-id/scsi-SATA_SAMSUNG_HD103UJS13PJDWS516679      ONLINE       0     0     0

errors: No known data errors

‘Unrecoverable error’, eh? Crap. Wait, how can applications be unaffected by an unrecoverable error? How can there be ‘no known data errors’? Also, how is there a checksum error, but no read or write errors? What other operation could there be? Better check that link…

…Well apparently ‘the device has experienced a read I/O error, write I/O error, or checksum validation error’. I guess that implies an answer to the last question: ‘READ’ and ‘WRITE’ refer specifically to disk I/O errors. That is, not errors in reading or writing in general, just those where the disk itself has detected an error and reported it back; a checksum error is the case where the disk returned data without complaint but the data didn’t match its checksum. The way that information is presented is pretty appalling, as you just have to know that peculiarity in order to interpret it, but okay, let’s press on.
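For what it’s worth, that reading of the three columns can be sketched in a few lines of Python. This is only an illustration, assuming the plain-integer counters shown in the output above (newer ZFS releases may abbreviate large counts, which this does not handle); `device_errors` and `suspect_devices` are hypothetical names for the example, not part of any ZFS tooling.

```python
import re

# The config table from the report above, pasted in as sample input.
SAMPLE_STATUS = """\
        NAME                                                      STATE     READ WRITE CKSUM
        srv                                                       ONLINE       0     0     0
          disk/by-id/usb-Samsung_STORY_Station_0000002CE0C43-0:0  ONLINE       0     0     1
          disk/by-id/scsi-SATA_ST3500630AS_9QG3JRW0               ONLINE       0     0     0
          disk/by-id/scsi-SATA_SAMSUNG_HD103UJS13PJDWS516679      ONLINE       0     0     0
"""

# name, state, then three integer counters: READ, WRITE, CKSUM.
# The header row does not match because its last three fields are not digits.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def device_errors(status_text):
    """Return one dict per pool/device row with its error counters."""
    rows = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            name, state, read, write, cksum = m.groups()
            rows.append({
                "name": name,
                "state": state,
                "read": int(read),    # I/O errors the disk reported on reads
                "write": int(write),  # I/O errors the disk reported on writes
                "cksum": int(cksum),  # data came back "fine" but failed its checksum
            })
    return rows


def suspect_devices(status_text):
    """Names of rows where any of the three counters is non-zero."""
    return [
        r["name"]
        for r in device_errors(status_text)
        if r["read"] or r["write"] or r["cksum"]
    ]


if __name__ == "__main__":
    for name in suspect_devices(SAMPLE_STATUS):
        print("check this one:", name)
```

Run against the table above, the only row flagged is the USB Samsung STORY Station, on the strength of its single checksum error.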

‘Because the device is part of a mirror or RAID-Z device, ZFS was able to recover from the error and subsequently repair the damaged data’. What? No it isn’t. This is just a striped pool: no mirroring or RAID-Z involved. And anyway, you said it was unrecoverable. Now I’m beginning to worry. The documentation claims that ZFS was able to recover from an unrecoverable error using data replication that doesn’t exist; WTF does that mean? Well obviously it means that the documentation was written by an imbecile, but what’s the message they’re clumsily trying to get across?

After some googling led me to this thread, I did eventually work this out. You get the message about an unrecoverable error if and only if (and this part’s genius) ZFS was able to recover. How very… special. If it wasn’t able to recover, you’ll instead be told that ‘a file or directory could not be read due to corrupt data’. No mention of the word ‘unrecoverable’ there.

But wait, how could it recover from it if there’s no data redundancy, and why does it think it’s a mirrored or RAID-Z device? The answer to that would appear to be that, through some happenstance, the error corrupted some metadata rather than actual file data. Since metadata always has at least one redundant copy, it corrected it as if it were mirrored. Phew.

So to recap: so far as ZFS is concerned, the alarming phrase ‘unrecoverable error’ means ‘error from which ZFS has successfully recovered’. Thanks for that, Sun.
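That recap can be written down as a rough decision rule. The sketch below is Python working on the text of a report; the phrases it matches are only the ones quoted in this post, the exact wording may differ between ZFS versions, and `classify_report` is a hypothetical helper rather than anything ZFS provides.

```python
def classify_report(status_text):
    """Very rough reading of a zpool status report, per the recap above.

    Returns "data-lost", "recovered", or "unknown".
    """
    # The status text wraps across lines, so collapse all whitespace first.
    text = " ".join(status_text.split()).lower()

    # Wording quoted in the post for the case where ZFS could NOT recover.
    if "could not be read due to corrupt data" in text:
        return "data-lost"

    # The scary wording plus a clean 'errors:' line means ZFS did recover.
    if "unrecoverable error" in text and "no known data errors" in text:
        return "recovered"

    return "unknown"


if __name__ == "__main__":
    report = """
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    errors: No known data errors
    """
    print(classify_report(report))
```

It is deliberately crude: anything it doesn’t recognise comes back as "unknown" rather than being guessed at.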

Michael Mol • July 27th, 2010 at 16:18

Ew, ew, ew. A SAMSUNG HD103* hard drive.

I’ve had two such drives. I bought an HD103UI when it was on sale on Newegg during some Black Friday hardware purchase binge. It failed within a few months. Found bugs evident in the firmware by looking at it via smartctl (test result reporting and logging was just screwed *up*), RMA’d the drive with an attached note about buggy firmware. Got another HD103UI back from them. Ran smartctl on it, noticed the same firmware revision. It twigged out on me within a couple of months. Attempted to RMA, never heard back.

As for your ZFS error, it was reporting a failed *device*, which is different from filesystem failure. Think of it like a device dropping out of a redundant RAID setup (which ZFS shares some functionality with); yeah, the device died. Throw in a replacement, rebuild (or apparently “zpool replace”, in the case of ZFS), move on.

If you’ve ever used a redundant RAID setup like 1, 5 or 6, or possibly even LVM, you probably got used to a distinction between the filesystem layer and the logical volume the filesystem sat on. ZFS blurs those boundaries by (optionally) using enough redundancy across enough of your physical volumes that, should one fail, you shouldn’t see an operational difference unless you lose more before replacing it.

Interesting parts of that status message: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

No offense intended, but you’re a sysadmin, using a filesystem targeted at data centers and people who are paid a hefty salary to focus on keeping servers running. Don’t be surprised if there’s a learning curve.

steamsprocket.org.uk

3 Responses to “ZFS: One or more devices has experienced an unrecoverable error”

Morten • April 6th, 2010 at 13:07

nye • July 31st, 2010 at 12:00
