ZFS: One or more devices has experienced an unrecoverable error
I’m using [[ZFS]] (via ZFS-FUSE), and at one point a zpool status
gave me this rather scary report:
zpool status pool: srv state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: none requested config: NAME STATE READ WRITE CKSUM srv ONLINE 0 0 0 disk/by-id/usb-Samsung_STORY_Station_0000002CE0C43-0:0 ONLINE 0 0 1 disk/by-id/scsi-SATA_ST3500630AS_9QG3JRW0 ONLINE 0 0 0 disk/by-id/scsi-SATA_SAMSUNG_HD103UJS13PJDWS516679 ONLINE 0 0 0 errors: No known data errors
‘Unrecoverable error’, eh? Crap. Wait, how can applications be unaffected by an unrecoverable error? How can there be ‘no known data errors’? Also, how is there a checksum error, but no read or write errors? What other operation could there be? Better check that link…
…Well apparently ‘the device has experienced a read I/O error, write I/O error, or checksum validation error’. I guess that implies an answer to the last question: ‘READ’ and ‘WRITE’ refer specifically to disk I/O errors. Not to errors in reading or writing in general, just those where the disk itself has detected an error and reported it back. The way that information is presented is pretty apalling as you just need to know that peculiarity in order to interpret it, but okay, let’s press on.
‘Because the device is part of a mirror or RAID-Z device, ZFS was able to recover from the error and subsequently repair the damaged data’. What? No it isn’t. This is just a striped pool: no mirroring or RAID-Z involved. Any anyway, you said it was unrecoverable. Now I’m beginning to worry. The documentation claims that ZFS was able to recover from an unrecoverable error using data replication that doesn’t exist; WTF does that mean? Well obviously it means that the documentation was written by an imbecile, but what’s the message they’re clumsily trying to get across?
After some googling led me to this thread, I did eventually work this out. You get the message about an unrecoverable error if and only if (and this part’s genius) ZFS was able to recover. How very… special. If it wasn’t able to recover, you’ll instead be told that ‘a file or directory could not be read due to corrupt data’. No mention of the word ‘unrecoverable’ there.
But wait, how could it recover from it if there’s no data redundancy, and why does it think it’s a mirrored or RAID-Z device? The answer to that would appear to be that, through some happenstance, the error corrupted some metadata rather than actual file data. Since metadata always has at least one redundant copy, it corrected it as if it were mirrored. Phew.
So to recap: so far as ZFS is concerned, the alarming phrase ‘unrecoverable error’ means ‘error from which ZFS has successfully recovered’. Thanks for that Sun.