Over two years ago I discovered a kernel bug #205833 where a close
operation on a frozen xfs
mount would trigger the trim operation.
This means that the filesystem can truncate away blocks beyond EOF of the file that it has reserved previously.
This is a write operation and the operation correctly blocks your process.
However, the truncate operation should not have been started in the first place on the mount the has the fsfreeze
active.
I stumble over this bug while I was working for a client in 2019/2020 on their kubernetes backup solution that I built for them with Velero and Restic.
The goal was: Backing up the data volumes of applications that manage a folder atomically and do not have an export command.
With xfs
however, this lead to the bug and hanging backup jobs.
Two years later: I verify
Today, I switched away from Ubuntu completely. Arch for development and Debian with systemd on the servers has been a great pattern for me. On the latest Arch Linux I am no longer able to reproduce the error where the second process hangs.
While in the last two years the default appears to have shifted from xfs
to ext4
most tutorials seem to still favour xfs
Testing
To verify the problems I have created two projects that can run on any Linux. As of my own testing Debian 11 and Arch Linux since August 2022 no longer suffer from this bug.
If you are running a kubernetes cluster and your backup freezes for no apparent reason, this might be it.
Feel free to use my two tools and consider supporting my work. Thank you!
- Quick test: https://gitlab.com/dns2utf8/xfs_fsfreeze_test/
- Heavy load: https://gitlab.com/dns2utf8/multi_file_writer/