Around 1500Z on Wednesday, routine maintenance work was taking place on the “procyon” server, which hosts the Calpol Matrix server and some miscellaneous other personal services.
As part of our ongoing commitment to service reliability, we were adding a redundant copy of metadata to the filesystem on the server. The filesystem is a Linux btrfs filesystem, which previously had its metadata in SINGLE mode. The task was to convert it to DUP mode, where it stores a second copy of each metadata block elsewhere in case one copy is corrupted.
Unfortunately, due to historical space issues on the filesystem, all of the apparently ample free space had already been allocated to the DATA block group, leaving none unallocated for the rebalance operation to create new metadata chunks. As a result, the balance ran out of space, which is where the problems began.
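The space problem described above can be sketched as a small pre-flight check. This is only an illustration, not what was run on “procyon”: the function names and all of the figures are hypothetical, and the rule of thumb (wanting two raw copies of the in-use metadata available as unallocated space) is a conservative assumption rather than a documented btrfs requirement.

```python
GIB = 1024 ** 3  # bytes per GiB


def unallocated_bytes(device_size, allocated_by_group):
    """Raw bytes on the device not yet assigned to any block group."""
    return device_size - sum(allocated_by_group.values())


def dup_conversion_fits(device_size, allocated_by_group, metadata_used):
    """Conservative pre-flight check for a SINGLE -> DUP metadata conversion.

    Assumption: the balance writes metadata into freshly allocated DUP
    chunks, so we ask for two raw copies of the metadata currently in use
    to be available as unallocated space before starting.
    """
    return unallocated_bytes(device_size, allocated_by_group) >= 2 * metadata_used


# Hypothetical figures for an 80 GiB filesystem (not taken from procyon).
device_size = 80 * GIB
metadata_used = 1 * GIB

# Every spare byte is tied up in DATA block groups, even though much of
# that data allocation is empty.
before = {"data": 78 * GIB, "metadata": 2 * GIB, "system": 0}

# The same filesystem after unused data chunks have been handed back.
after = {"data": 45 * GIB, "metadata": 2 * GIB, "system": 0}

print(dup_conversion_fits(device_size, before, metadata_used))  # False
print(dup_conversion_fits(device_size, after, metadata_used))   # True
```

On a real system the inputs would come from something like `btrfs filesystem usage`, and mostly empty data chunks can usually be returned to the unallocated pool with a filtered balance (for example `btrfs balance start -dusage=...`) before attempting the conversion.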
The rebalance operation is started.
Server admins notice that the filesystem has been remounted read-only due to errors (later discovered to be the out-of-space issue described above). Attempts to remount read-write fail, so the server is rebooted to try to resolve the issue.
The server does not boot, freezing after the bootloader. A ticket is opened with the hosting provider to establish whether this is a wider issue affecting other customers. A swift response eliminates that possibility, meaning that the issue is limited to the “procyon” server.
Booting from the hosting provider's rescue image reveals that the issue is the btrfs filesystem running out of space. Initial attempts to resolve the issue are unsuccessful, and third-party support is sought from btrfs experts on IRC, which allows us to mount the partition in read-only mode on the rescue image.
Server admins realise they have to finish their actual paid jobs, so they leave the server at the bootloader screen to see if it sorts itself out on its own. Subsequently, other important tasks like cycling home, eating dinner and playing Deadlock with friends mean that little progress is possible.
With the important other tasks completed, attention can return to the issues on “procyon”. An apparent kernel bug prevents mounting the filesystem in read-write mode to continue recovery, so the decision is made to copy the 80G filesystem to local hardware to fix there. A transfer is initiated to run overnight.
Server admins go to bed.
The transfer has successfully completed. Unfortunately, the same kernel bug appears to be present on the recovery system so it can only be mounted read-only. The decision is made not to continue to recover this filesystem directly, but to reimage the server and copy the data back from the read-only image.
Filesystem recreated on VPS and transfer of files initiated back from local system.
File transfer completed. Testing reboot into new filesystem.
Successfully booted, system seems to be properly restored with no data loss. May take a while to return to normal.
System rebooted again to complete kernel upgrade, but seems otherwise stable. Declaring this incident resolved. All affected users are entitled to a full service refund and will receive a complimentary month of JoshOps Gold™.