2024-04-15 - ZFS Pool Suspended - Part 2
What went wrong?
The ZFS pool was suspended.
All media-related services were unavailable and could not recover automatically.
Timeline
Time | Event |
---|---|
05:16 BST | Pool suspended |
05:17 BST | Page received after transmission service crashed due to I/O issues |
08:50 BST | 2 disks in "FAULTED (corrupted)" state in VDev 1, 2 disks in "FAULTED (1 corrupted, 1 invalid label)" state in VDev 2 |
08:55 BST | Cables and drives for VDev 1 and 2 were moved to another row in the backplane. Server restarted. No change in state |
18:30 BST | Drives checked one by one, 2 identified as "dying" |
19:20 BST | 138K data errors on Pool |
20:00 BST | 512K data errors on Pool |
20:13 BST | Lost hope |
April 20th - 20:43 BST | Server taken offline for reseating connectors |
April 20th - 20:50 BST | While reconnecting the HBA to the motherboard, a bent PCI bracket was found. Screwing the PCI bracket down causes the HBA to tilt upwards, which breaks its connections |
April 20th - 20:51 BST | London-B rebooted without HBA being screwed down |
April 20th - 20:55 BST | All VDevs reporting healthy in ZFS |
April 20th - 21:05 BST | 1 drive from VDev 2 in "REPAIRING" state. Checksum errors for sde increasing. (This is normal while ZFS is repairing a drive) |
April 20th - 22:10 BST | Incident declared over |
Resolution
The ZFS pool was suspended due to a misaligned PCI connector on the HBA. A bent PCI bracket caused the card to tilt upwards at the back when the card was screwed in.
After the card was reseated without the bracket being screwed in, the pool was brought back online.
1 drive from VDev 2 was in "REPAIRING" state. The drive has since stabilized and is now reporting no errors.
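
The pool state described above can also be checked from a script rather than by eye. The following is a minimal sketch, not the tooling used during this incident: it assumes the `zpool` binary is on the PATH and relies on `zpool list -H -o name,health`, which prints one pool per line with its health value. The function names and the pool names in the usage note are hypothetical.

```python
import subprocess
import sys


def unhealthy_pools(zpool_list_output):
    """Return {pool_name: health} for every pool whose health is not ONLINE.

    Expects the text produced by `zpool list -H -o name,health`:
    one pool per line, name and health separated by whitespace.
    """
    problems = {}
    for line in zpool_list_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            # Skip blank or unexpected lines instead of guessing at their meaning.
            continue
        name, health = fields
        if health != "ONLINE":
            problems[name] = health
    return problems


def read_pool_health():
    """Run zpool and return its raw output (assumes zpool is installed)."""
    completed = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout


if __name__ == "__main__":
    problems = unhealthy_pools(read_pool_health())
    for name, health in problems.items():
        print(f"pool {name} is {health}")
    # A non-zero exit status lets cron or a monitoring agent raise an alert.
    sys.exit(1 if problems else 0)
```

With illustrative input such as `"tank\tSUSPENDED\nbackup\tONLINE\n"`, `unhealthy_pools` returns only the `tank` entry. Treating every value other than `ONLINE` as a problem is a deliberate simplification, so degraded and faulted pools are reported the same way as a suspended one.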
Corrective Actions
- HBA PCI bracket to be replaced
- HBA PCI bracket to be screwed in after replacement