Incident report 20250212-20250213

Around 1500Z on Wednesday, routine maintenance work was taking place on the “procyon” server, which hosts the Calpol Matrix server and some miscellaneous other personal services.

Technical details

As part of our ongoing commitment to service reliability, we were adding a redundant copy of metadata to the filesystem on the server. The filesystem is a Linux btrfs filesystem, which previously had its metadata in SINGLE mode. The task was to convert it to DUP mode, where it stores a second copy of each metadata block elsewhere in case one copy is corrupted.
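For illustration, a conversion like this is normally requested as a btrfs balance with a metadata convert filter. The sketch below only builds and prints such a command rather than running it. The mount point /mnt/data and the helper name are hypothetical, and the exact command the admins ran is not recorded in this report.

```python
import shlex


def metadata_convert_command(mountpoint: str, profile: str = "dup") -> list[str]:
    """Build the argv for a balance that converts metadata to `profile`.

    Hypothetical helper: it constructs the command but does not execute it.
    """
    return ["btrfs", "balance", "start", f"-mconvert={profile}", mountpoint]


# /mnt/data is a placeholder mount point, not the real path on "procyon".
cmd = metadata_convert_command("/mnt/data")
print(shlex.join(cmd))
```

Printing the command first, and only then running it by hand, leaves a natural pause for checking the filesystem's state.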

Unfortunately, due to historical space issues on the filesystem, the ample free space was all allocated to DATA block groups, leaving no unallocated space in which the rebalance operation could create new metadata chunks. As a result, the balance ran out of space, which triggered the issues described below.
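A pre-flight check along these lines might have caught the problem before the balance was started. This is a rough sketch under stated assumptions: the figures are hypothetical (only the 80G filesystem size is taken from this report), the class and function names are made up for illustration, and the rule of wanting about twice the current metadata allocation plus a margin in unallocated space is a conservative guess rather than btrfs's exact accounting.

```python
from dataclasses import dataclass

GIB = 1024 ** 3
MIB = 1024 ** 2


@dataclass
class ChunkAllocation:
    """Bytes of raw device space handed out to each btrfs block group type."""
    device_size: int
    data: int
    metadata: int
    system: int

    @property
    def unallocated(self) -> int:
        # Space not yet assigned to any block group; new chunks come from here.
        return self.device_size - (self.data + self.metadata + self.system)


def enough_room_for_dup(alloc: ChunkAllocation, margin: int = GIB) -> bool:
    """Assumed heuristic: want 2x the current metadata allocation plus a margin."""
    return alloc.unallocated >= 2 * alloc.metadata + margin


# Hypothetical figures: free space exists, but nearly all of it sits inside DATA chunks.
before = ChunkAllocation(device_size=80 * GIB, data=78 * GIB, metadata=1 * GIB, system=32 * MIB)
# The same filesystem after mostly-empty DATA chunks have been compacted.
after = ChunkAllocation(device_size=80 * GIB, data=50 * GIB, metadata=1 * GIB, system=32 * MIB)

print(enough_room_for_dup(before))  # False
print(enough_room_for_dup(after))   # True
```

Mostly-empty data chunks can usually be compacted with a balance that uses a data usage filter (for example `btrfs balance start -dusage=50 <mountpoint>`), which returns their space to the unallocated pool.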

Timeline

~20250212T1500Z

The rebalance operation is started.

~20250212T1530Z

Server admins notice that the filesystem has been remounted as read-only due to errors (later discovered to be the out-of-space issue described above). Attempts to remount read-write fail, so the server is rebooted to try to resolve the issues.

The server does not boot, freezing after the bootloader. A ticket is opened with the hosting provider to establish whether this is a wider issue affecting other customers. A swift response eliminates that possibility, meaning that the issues are limited to the “procyon” server.

~20250212T1545Z

Booting from the rescue image at the hosting provider reveals that the issue is the btrfs filesystem running out of space. Initial attempts to resolve the issue are unsuccessful, and third-party support is sought from btrfs experts on IRC, which allows us to mount the partition in read-only mode on the rescue image.

~20250212T1600Z

Server admins realise they have to finish their actual paid job, so they leave the server at the bootloader screen to see if it sorts itself out on its own. Subsequently, other important tasks like cycling home, eating dinner and playing Deadlock with friends mean that little progress is possible.

~20250212T2230Z

With the other important tasks completed, attention can return to the issues on “procyon”. An apparent kernel bug prevents mounting the filesystem in read-write mode to continue recovery, so the decision is made to copy the 80G filesystem to local hardware and fix it there. A transfer is initiated to run overnight.

~20250212T2300Z

Server admins go to bed.

20250213T0830Z

The transfer has completed successfully. Unfortunately, the same kernel bug appears to be present on the recovery system, so the copied filesystem can only be mounted read-only. The decision is made not to continue trying to recover this filesystem directly, but to reimage the server and copy the data back from the read-only image.

20250213T1030Z

Filesystem recreated on VPS and transfer of files initiated back from local system.

20250213T1520Z

File transfer completed. Testing reboot into new filesystem.

20250213T1550Z

Successfully booted; the system seems to be properly restored with no data loss. Services may take a while to return to normal.

20250213T1615Z

System rebooted again to complete kernel upgrade, but seems otherwise stable. Declaring this incident resolved. All affected users are entitled to a full service refund and will receive a complimentary month of JoshOps Gold™.