
2024-04-15 - ZFS Pool Suspended - Part 2

What went wrong?

The ZFS pool was suspended.

All media-related services were unavailable and unable to recover automatically.

Timeline

Time Event
05:16 BST Pool suspended
05:17 BST Page received after transmission service crashed due to I/O issues
08:50 BST 2 disks in "FAULTED (corrupted)" state in VDev 1, 2 disks in "FAULTED (1 corrupted, 1 invalid label)" state in VDev 2
08:55 BST Cables and drives for VDev 1 and 2 were moved to another row in the backplane. Server restarted. No change in state
18:30 BST Drives checked one by one, 2 identified as "dying"
19:20 BST 138K data errors on Pool
20:00 BST 512K data errors on Pool
20:13 BST Lost hope
April 20th - 20:43 BST Server taken offline for reseating connectors
April 20th - 20:50 BST While reconnecting the HBA to the motherboard, a bent PCI bracket was found. Screwing the PCI bracket down causes the HBA to tilt upwards, breaking its connections
April 20th - 20:51 BST London-B rebooted without HBA being screwed down
April 20th - 20:55 BST All VDevs reporting healthy in ZFS
April 20th - 21:05 BST 1 drive from VDev 2 in "REPAIRING" state. Checksum errors for sde increasing. (This is normal while ZFS is repairing)
April 20th - 22:10 BST Incident declared over

Resolution

The ZFS pool was suspended because of a misaligned PCI connector on the HBA. A bent PCI bracket caused the card to tilt upwards at the back when it was screwed in.

After the card was reseated without the bracket being screwed in, the pool came back online.

1 drive from VDev 2 was in "REPAIRING" state. The drive has since stabilized and is now reporting no errors.
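Checking whether any device is still unhealthy or still accumulating checksum errors can be scripted. The following is a minimal sketch, not the tooling used during this incident: it parses text in the general shape of `zpool status` output and lists rows that are not ONLINE or that show a non-zero CKSUM count. The sample text, the pool name `tank`, the device names other than `sde`, and the function name `unhealthy_devices` are all illustrative assumptions.

```python
import re

# Hypothetical sample in the general shape of `zpool status` output.
# Pool name, vdev names and counts are illustrative, not from the incident.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

    NAME        STATE     READ WRITE CKSUM
    tank        DEGRADED     0     0     0
      raidz1-0  ONLINE       0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     0     0
      raidz1-1  DEGRADED     0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0    12

errors: No known data errors
"""

# One config row: indented name, upper-case state, then READ/WRITE/CKSUM.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)")


def unhealthy_devices(status_text):
    """Return (name, state, cksum) for rows that are not ONLINE
    or that report a non-zero checksum error count."""
    found = []
    in_config = False
    for line in status_text.splitlines():
        if line.strip().startswith("NAME"):
            # Header row marks the start of the device table.
            in_config = True
            continue
        if line.startswith("errors:"):
            # The device table ends before the errors summary.
            in_config = False
        if not in_config:
            continue
        m = ROW.match(line)
        if not m:
            continue
        name, state, _read, _write, cksum = m.groups()
        # Counts are compared as strings because large values may be
        # abbreviated (for example with a K suffix).
        if state != "ONLINE" or cksum != "0":
            found.append((name, state, cksum))
    return found


if __name__ == "__main__":
    for name, state, cksum in unhealthy_devices(SAMPLE):
        print(f"{name}: state={state} cksum={cksum}")
```

On a live system the text could instead come from running `zpool status`, for example with `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout`. The column layout assumed by the regular expression should be verified against the installed OpenZFS version before relying on it for paging.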

Corrective Actions

Related Images

Bent PCI Bracket