April 9, 2014

Upcoming Btrfs Features for Linux Containers

Brandon Philips is CTO at CoreOS, a new Linux distribution that has been rearchitected to provide features needed to run massive server deployments.

Containers have made huge progress in the last year with the addition of user namespaces to the kernel, the introduction of Docker, the release of LXC 1.0, and the maturing of Checkpoint/Restore in Userspace (CRIU). At the annual Linux Foundation Collaboration Summit last month, a number of people were talking about containers and their application in the Linux ecosystem.

In the hallway track I had a chance to catch up with two btrfs developers: Chris Mason and Zach Brown. With containers on my mind I asked them about two important features in btrfs and how they see them developing and being used.

First, I asked about subvolumes and snapshots; in particular the workloads that Docker puts btrfs through. Docker application containers have the very useful property that every time you start one it gets a fresh clean filesystem. A simple way to implement this would be to copy the "gold master container" into a new directory and start the new container there. If you had a lot of running containers this duplication would get expensive in both time to copy and disk space. Instead of this naive approach, Docker can use btrfs subvolumes and snapshots.

By using a btrfs snapshot of the "gold master container", Docker can make a new playground for each container with a single syscall and avoid the cost of duplicating all of that data into another directory. It sounds like the perfect use of the feature. Still, I wanted to hear it from Chris himself, so I asked him what he thought of the "Docker workload". In his words: "I really want to see this use of btrfs and its features to be successful, please let me know if you run into any problems. I want Docker's workload to work great."
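To make the snapshot idea concrete, here is a minimal sketch that drives the standard `btrfs subvolume snapshot` command from Python. It is not how Docker itself is implemented; the function names and the paths are hypothetical and chosen only for illustration, and the gold master is assumed to already be a btrfs subvolume.

```python
import subprocess
from pathlib import Path


def snapshot_command(gold_master: Path, container_root: Path) -> list[str]:
    """Build the `btrfs subvolume snapshot` invocation for a new container root."""
    return ["btrfs", "subvolume", "snapshot", str(gold_master), str(container_root)]


def create_container_root(gold_master: Path, container_root: Path) -> None:
    """Snapshot the gold master instead of copying it file by file.

    Assumes gold_master is a btrfs subvolume and that the caller has
    sufficient privileges; raises CalledProcessError if btrfs fails.
    """
    subprocess.run(snapshot_command(gold_master, container_root), check=True)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = snapshot_command(Path("/var/lib/containers/gold-master"),
                           Path("/var/lib/containers/web-1"))
    print(" ".join(cmd))
```

Because the snapshot shares its data blocks with the gold master until one side is modified, creating it costs roughly the same no matter how large the gold master is, which is the contrast with the naive copy described above.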

It was great to hear this sort of affirmation from the maintainer of btrfs. It was icing on the cake since we had, a few weeks earlier, made the decision to make btrfs the root filesystem for CoreOS too.

The second topic was cryptographic hashes of the filesystem data. Currently, btrfs uses CRC checksums, which are great for catching data corruption. But a CRC can't be used like a SHA hash to cryptographically verify that the contents weren't changed. Extending btrfs checksumming to support this would open up interesting possibilities, such as mounting a filesystem only if it matches the exact hash you expected.
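The difference between the two kinds of checksum can be sketched in a few lines of Python. This is an illustration under stated assumptions, not btrfs code: `zlib.crc32` is the common CRC-32 rather than the CRC32C variant btrfs uses, and `xor_bytes` and `matches_expected` are hypothetical helpers. The point is that a CRC has algebraic structure anyone can exploit, while a SHA-256 digest can serve as an "expected hash" to check data against.

```python
import hashlib
import hmac
import zlib


def xor_bytes(*blocks: bytes) -> bytes:
    """XOR equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def matches_expected(data: bytes, expected_sha256_hex: str) -> bool:
    """Return True only if data hashes to the digest we expected."""
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, expected_sha256_hex)


# Three equal-length blocks standing in for filesystem data.
a, b, c = b"block-A!", b"block-B!", b"block-C!"
mixed = xor_bytes(a, b, c)

# CRC-32 is affine over XOR: the checksum of the combined block can be
# computed from the other checksums without ever seeing the data.
crc_predictable = zlib.crc32(mixed) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)

# A cryptographic hash is used the other way around: record the digest
# you expect, then refuse anything that does not match it.
expected = hashlib.sha256(a).hexdigest()
print(crc_predictable, matches_expected(a, expected), matches_expected(mixed, expected))
```

That predictability is harmless when the only goal is to catch accidental bit flips, but it is why a CRC cannot stand in for a cryptographic hash when the concern is deliberate tampering.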

Zach and Chris hope to start work on this feature this year. They said that btrfs was designed with this sort of use case in mind: the metadata space for checksums is 256 bits, with the possibility to expand.

This feature would be useful to ensure that your distro partition wasn't tampered with by an attacker, or to verify that the copy of the files you have on disk is the exact version you expect. On CoreOS this would make it very straightforward for us to use btrfs exclusively and verify that our read-only updates were applied correctly.

With all of this progress on containers it is great to know that we have a filesystem that plans to keep up. Thanks to Chris and Zach for explaining their plans and aspirations for btrfs and container goodness.
