October 15, 2010

Weekend Project: Get Started with Btrfs


The B-tree file system Btrfs is a next-generation filesystem for Linux, and although it is still undergoing rapid development, you can use it for day-to-day tasks. Even if you are not prepared to migrate your production servers over to Btrfs, you should take some time to explore what it can do. It offers significant time- and space-efficiency improvements over ext3/ext4 — not to mention considerably simpler volume management.

For those unfamiliar, Btrfs is a clean break from the approach used in Linux's ext filesystems in years past. It uses b-trees to store generic "items" of varying data types in a single, unified data structure. Items are sorted by their 136-bit key, which groups related items together via a shared key prefix (and thus automatically optimizes the filesystem for large read and write operations). Small files can be stored directly in the tree leaves, while large files are written in extents, which lowers the overhead and reduces fragmentation.

Nodes in the tree are also check-summed, and include both reference counts and back-references, which makes checking for correctness and moving or resizing the filesystem simpler. Finally, the system uses a copy-on-write strategy that writes changed data to disk first, then updates the references in the tree. This crash-proofs the filesystem, but without the overhead of maintaining a journal.

There are still more advantages at the filesystem tools level. Btrfs includes built-in support for RAID, including balancing multiple devices and recovering from corruption, and it supports on-line resizing, device addition, and device removal. This essentially rolls much of the functionality of Linux's multi-device (MD) and logical volume manager (LVM) tools into the filesystem itself. Btrfs can also use transparent compression, create filesystem snapshots, and create subvolumes, which do away with much of the need for having separate disk partitions.

Getting Started

Most of the major Linux distributions have enabled at least experimental support for Btrfs in their recent releases. Because development on the filesystem is rapid, however, it is recommended that you run at least kernel 2.6.33 if possible. You may also have to install the Btrfs userspace utilities in a separate package, such as btrfs-progs.

Filesystem creation is performed with the mkfs.btrfs command. The main control program (used for manipulating snapshots, subvolumes, and to inspect the filesystem) is called btrfs. You may still find references to an older version of this tool that was called btrfsctl; if so, be sure to consult the Btrfs documentation before following older tutorials, as options or syntax may have changed.

There is also a btrfsck that can run filesystem checks on unmounted Btrfs filesystems, and a few other utilities used for troubleshooting and debugging. For example, btrfs-image can dump an image of your filesystem with the actual data zeroed out; you can send this to Btrfs developers when asking for help debugging a problematic filesystem.
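As a rough sketch, a check-and-dump session on an unmounted filesystem might look like the following. The device name /dev/sdX and the image path are placeholders, and the RUN variable is a dry-run convenience invented for this example: by default the commands are only printed, not executed.

```shell
# Dry run by default: commands are printed, not executed.
# Start the script with RUN= (empty) in the environment to run them as root.
RUN=${RUN-echo}

# check_and_dump DEVICE IMAGE_FILE
# Checks an unmounted filesystem, then writes an image of it with the
# file data zeroed out, suitable for sending to the developers.
check_and_dump() {
    $RUN btrfsck "$1" &&
    $RUN btrfs-image "$1" "$2"
}

check_and_dump /dev/sdX /tmp/btrfs-metadata.img   # hypothetical names
```

Because the two commands are chained with &&, the image is only written if the check itself exits cleanly; drop the && if you want the dump regardless.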

Btrfs has its own mount-specific options as well, but you do not need to install a different version of mount in order to use them.

Basic Operations: Creating Filesystems, Multi-Device Arrays, and Resizing

The basic command for creating a Btrfs filesystem on a device is simply mkfs.btrfs device_name. This creates a new filesystem on the device, at maximum capacity. You can specify a smaller size with -b size_in_bytes. You can also specify a non-default leaf size by adding -l size_in_bytes to the command, or a sector size with -s size_in_bytes.
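To make that concrete, here is a minimal sketch of the two most common invocations. The helper name make_btrfs, the device /dev/sdX, and the RUN dry-run switch are all made up for illustration; only mkfs.btrfs and its -b option come from the text above.

```shell
# Dry run by default: commands are printed, not executed.
# Start the script with RUN= (empty) in the environment to run them as root.
RUN=${RUN-echo}

# make_btrfs DEVICE [SIZE_IN_BYTES]
# Creates a filesystem on DEVICE, optionally limited to SIZE_IN_BYTES via -b.
make_btrfs() {
    if [ -n "${2-}" ]; then
        $RUN mkfs.btrfs -b "$2" "$1"
    else
        $RUN mkfs.btrfs "$1"
    fi
}

# Two alternatives; pick one when running for real.
make_btrfs /dev/sdX                 # use the whole (hypothetical) device
make_btrfs /dev/sdX 10737418240     # limit the filesystem to 10 GiB
```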

The real fun, though, comes when creating a RAID array. The syntax is mkfs.btrfs one_device_name another_device_name yet_another_device_name. That's right; to create a RAID array, you simply provide all of the block devices in a single command; Btrfs does the rest. By default, this will stripe all of the data evenly between the disks (as in RAID 0), and mirror the metadata on every disk (as in RAID 1). You can specify a different profile by appending a -m profile argument to mkfs.btrfs for the metadata behavior, or a -d profile for the data. Currently, raid0, raid1, raid10, and single (i.e., no RAID) are the only accepted values.
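The profile options are easier to see spelled out. The sketch below wraps mkfs.btrfs in a small helper (make_array is a name invented here, as is the RUN dry-run switch) that takes the data profile, the metadata profile, and then any number of devices.

```shell
# Dry run by default: commands are printed, not executed.
# Start the script with RUN= (empty) in the environment to run them as root.
RUN=${RUN-echo}

# make_array DATA_PROFILE METADATA_PROFILE DEVICE...
make_array() {
    data=$1
    meta=$2
    shift 2
    $RUN mkfs.btrfs -d "$data" -m "$meta" "$@"
}

# Two alternatives; pick one when running for real.
make_array raid0 raid1 /dev/sda /dev/sdb   # the default layout, written out
make_array raid1 raid1 /dev/sda /dev/sdb   # mirror the data as well
```

The device names follow the examples used in this article; substitute disks you can afford to erase.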

You mount a Btrfs filesystem with mount -t btrfs device mountpoint. For RAID arrays, you only need to specify one of the devices used in the array; Btrfs will find the rest and mount them together automatically. For instance, if you created a two-disk array with mkfs.btrfs /dev/sda /dev/sdb, you could mount it with mount -t btrfs /dev/sda /mnt/bigarray.
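If you want the array mounted at boot, the same single-device naming carries over to /etc/fstab. The following is only a sketch: the helper names and the RUN dry-run switch are invented for this example, and the "defaults 0 0" fields are a conventional choice rather than anything Btrfs requires.

```shell
# Dry run by default: commands are printed, not executed.
# Start the script with RUN= (empty) in the environment to run them as root.
RUN=${RUN-echo}

# mount_array DEVICE MOUNTPOINT
# Any one member device is enough to name the whole array.
mount_array() {
    $RUN mount -t btrfs "$1" "$2"
}

# fstab_entry DEVICE MOUNTPOINT
# Prints a line you could paste into /etc/fstab; it never edits the file.
fstab_entry() {
    printf '%s %s btrfs defaults 0 0\n' "$1" "$2"
}

mount_array /dev/sda /mnt/bigarray
fstab_entry /dev/sda /mnt/bigarray
```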

This is especially helpful if you want to add additional drives to the array: you can keep the relevant line in /etc/fstab the same. To add a third disk to the array, run btrfs device add /dev/sdc /mnt/bigarray. This must be run on a mounted filesystem. After you add the new disk, you can tell Btrfs to redistribute the array's data across all three disks with btrfs filesystem balance /mnt/bigarray. Obviously, this could take a bit of time if the array is large.
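Adding a disk and rebalancing naturally go together, so it can be handy to pair them. In this sketch, grow_array and the RUN dry-run switch are hypothetical conveniences; the two btrfs commands are the ones described above.

```shell
# Dry run by default: commands are printed, not executed.
# Start the script with RUN= (empty) in the environment to run them as root.
RUN=${RUN-echo}

# grow_array NEW_DEVICE MOUNTPOINT
# Adds NEW_DEVICE to the mounted array, then spreads existing data onto it.
grow_array() {
    $RUN btrfs device add "$1" "$2" &&
    $RUN btrfs filesystem balance "$2"
}

grow_array /dev/sdc /mnt/bigarray
```

The balance step only starts if the device was added successfully.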

In the event that one disk in your RAID array fails or becomes corrupted, you can still bring the array up by passing the degraded option to "mount" and naming one of the healthy disks, e.g. mount -t btrfs -o degraded /dev/sda /mnt/bigarray, which lets the mount proceed even though a member device is missing or unreliable. You can then remove the bad disk from the array with btrfs device delete /dev/sdb /mnt/bigarray, which will move all file data off onto the remaining disks (assuming there is space; if not you will need to add another drive first).
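A recovery session might be sketched like this, assuming /dev/sda is a healthy member and /dev/sdb is the one being retired. The helper name retire_disk and the RUN dry-run switch are invented for the example.

```shell
# Dry run by default: commands are printed, not executed.
# Start the script with RUN= (empty) in the environment to run them as root.
RUN=${RUN-echo}

# retire_disk HEALTHY_DEVICE BAD_DEVICE MOUNTPOINT
# Mounts the array in degraded mode via a healthy member, then removes
# the bad device. If the bad disk has vanished entirely, pass the special
# name "missing" as BAD_DEVICE.
retire_disk() {
    $RUN mount -t btrfs -o degraded "$1" "$3" &&
    $RUN btrfs device delete "$2" "$3"
}

retire_disk /dev/sda /dev/sdb /mnt/bigarray
```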

A filesystem can be resized with btrfs filesystem resize size mountpoint; the size comes first, followed by the path where the filesystem is mounted. You have three options for the size argument: a specific size (such as 1024M or 7G), an increment or decrement value (such as +200M or -2G), or "max," which will expand the filesystem to fill all of the available space on the underlying device or partition.
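The three forms of the size argument look like this in practice. Note that the btrfs tool expects the size before the path of the mounted filesystem. The resize_fs helper and the RUN dry-run switch are hypothetical; the sizes are the ones mentioned above.

```shell
# Dry run by default: commands are printed, not executed.
# Start the script with RUN= (empty) in the environment to run them as root.
RUN=${RUN-echo}

# resize_fs SIZE MOUNTPOINT
# SIZE is absolute (7G), relative (+200M, -2G), or the word max.
resize_fs() {
    $RUN btrfs filesystem resize "$1" "$2"
}

# Alternatives; pick whichever matches what you need.
resize_fs 7G    /mnt/bigarray   # set an exact size
resize_fs +200M /mnt/bigarray   # grow by 200 MB
resize_fs -2G   /mnt/bigarray   # shrink by 2 GB
resize_fs max   /mnt/bigarray   # fill the underlying device
```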

Essentially, all of these basic filesystem manipulation commands are self-explanatory: Btrfs simply makes good default choices and educated guesses to save you the trouble of having to provide extra parameters. That is because plain vanilla filesystems (even RAID arrays) do not stray much from the time-tested model used in most other familiar Linux filesystems. To really see something new, we will have to take a look at Btrfs subvolumes.

Subvolumes, Snapshots, and Conversion

Subvolumes in Btrfs are sub-trees of the primary Btrfs filesystem tree. They are created in-place in the existing filesystem, but can be treated like separate filesystems, with their own mount point, options, and policy. Unlike creating multiple disk partitions, however, subvolumes do not require allocating additional space on the disk; they are just empty directories until you begin adding files to them, at which point they grow to fit. Not only is that space-efficient, it also means that you can create all of the subvolumes you need in a single filesystem, and add additional storage to it whenever it fills up, regardless of which subvolumes take up the most room.

In essence, then, you might think of a subvolume as a directory that can be mounted as if it were a device, or a virtual disk image in a VM. You create one with btrfs subvolume create path/if/necessary/volume_name. If you leave off the path, it will be created in the current directory. You can then mount the subvolume anywhere you like by supplying the subvolume option to the mount command. For example, if you created a subvolume named "mysubvolume" in /mnt/bigarray, you could mount it with mount -t btrfs -o subvol=mysubvolume /dev/sda /mnt/notsobig. If you forget precisely where you've created your various subvolumes, btrfs subvolume list /mnt/bigarray will list them for you. To delete one, run btrfs subvolume delete subvolume_name.
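Put together, the life cycle of a subvolume could be sketched as follows. The helper names and the RUN dry-run switch are invented for illustration, and the umount before deletion is a cautious assumption on my part rather than something the tools are known to demand.

```shell
# Dry run by default: commands are printed, not executed.
# Start the script with RUN= (empty) in the environment to run them as root.
RUN=${RUN-echo}

# add_subvolume TOP_MOUNTPOINT NAME DEVICE NEW_MOUNTPOINT
# Creates subvolume NAME under TOP_MOUNTPOINT and mounts it separately.
add_subvolume() {
    $RUN btrfs subvolume create "$1/$2" &&
    $RUN mkdir -p "$4" &&
    $RUN mount -t btrfs -o "subvol=$2" "$3" "$4"
}

# remove_subvolume TOP_MOUNTPOINT NAME MOUNTPOINT
# Unmounts the subvolume's own mount point, then deletes the subvolume.
remove_subvolume() {
    $RUN umount "$3" &&
    $RUN btrfs subvolume delete "$1/$2"
}

add_subvolume /mnt/bigarray mysubvolume /dev/sda /mnt/notsobig
$RUN btrfs subvolume list /mnt/bigarray
remove_subvolume /mnt/bigarray mysubvolume /mnt/notsobig
```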

In practice, then, you can create as many subvolumes as you need, all within one Btrfs filesystem. But simply creating separate mount points is not all that subvolumes are good for; Btrfs supports a specific type of subvolume useful for system maintenance, the snapshot.

The syntax is virtually identical; just add "snapshot" to the btrfs command in place of "create," e.g. btrfs subvolume snapshot /mnt/bigarray /mnt/backups/October15. This creates a subvolume in /mnt/backups/October15 (which must reside on the same Btrfs filesystem as the source) that is a snapshot of /mnt/bigarray, which you can then write to removable storage and place in the fire safe (or whatever your backup strategy dictates).
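If you take snapshots regularly, naming them by date saves some typing. This is a minimal sketch: take_snapshot, the RUN dry-run switch, and the /mnt/bigarray/snapshots directory are all hypothetical, and the directory is assumed to exist already inside the same Btrfs filesystem as the source.

```shell
# Dry run by default: commands are printed, not executed.
# Start the script with RUN= (empty) in the environment to run them as root.
RUN=${RUN-echo}

# take_snapshot SOURCE DEST_DIR [LABEL]
# Snapshots SOURCE into DEST_DIR/LABEL. LABEL defaults to today's date
# in YYYY-MM-DD form.
take_snapshot() {
    label=${3:-$(date +%Y-%m-%d)}
    $RUN btrfs subvolume snapshot "$1" "$2/$label"
}

take_snapshot /mnt/bigarray /mnt/bigarray/snapshots October15   # explicit label
take_snapshot /mnt/bigarray /mnt/bigarray/snapshots             # dated label
```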

The nice part is that Btrfs creates this snapshot not by duplicating the file data, but by creating a duplicate b-tree pointing to the same data. If you don't alter any of the files in /mnt/bigarray, the existence of the snapshot consumes no extra space. If you do alter any of the files in /mnt/bigarray, however, only then does Btrfs write changes to disk, by preserving the original copy in the snapshot, and writing the new data in the main filesystem.

This is the essence of copy-on-write. Most of the time, the majority of files will not be touched, so the snapshots are extremely space-efficient. There is another interesting case that makes use of this property, though: converting an existing ext3 (or ext4) filesystem to Btrfs.

The btrfs-convert utility can create a Btrfs filesystem in place on top of an existing ext3/4 filesystem, by reading the ext filesystem and creating the necessary b-trees in the free space. Much like making a snapshot, this second filesystem takes up no additional space if no files are altered. When a file is changed, the original version of the ext filesystem is preserved, so you can even roll back the entire conversion process and restore the filesystem to its pre-Btrfs state.

You should first run fsck on your ext filesystem to check for corruption. When satisfied, run btrfs-convert device to convert the device, then mount -t btrfs device the_btrfs_mountpoint to mount the newly-minted Btrfs filesystem. Your original ext filesystem is preserved in a snapshot named "ext2_saved" (even if it was ext3 or ext4 format). You can even mount the snapshot with mount -t btrfs -o subvol=ext2_saved device /mnt/ext2_saved. If the novelty wears off, you can roll back to the original ext filesystem snapshot (including undoing all changes) with btrfs-convert -r device.
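The conversion and its rollback can be sketched as a pair of helpers. The names convert_ext and rollback_ext, the device /dev/sdX1, the mount point, and the RUN dry-run switch are placeholders for this example. Using e2fsck -f as the fsck step is my choice of checker; the device is assumed to be unmounted before either helper runs its first command.

```shell
# Dry run by default: commands are printed, not executed.
# Start the script with RUN= (empty) in the environment to run them as root.
RUN=${RUN-echo}

# convert_ext DEVICE MOUNTPOINT
# Forces a check of the ext filesystem, converts it in place, and mounts
# the result. Each step runs only if the previous one exited cleanly.
convert_ext() {
    $RUN e2fsck -f "$1" &&
    $RUN btrfs-convert "$1" &&
    $RUN mount -t btrfs "$1" "$2"
}

# rollback_ext DEVICE MOUNTPOINT
# Unmounts the Btrfs filesystem and restores the saved ext filesystem.
rollback_ext() {
    $RUN umount "$2" &&
    $RUN btrfs-convert -r "$1"
}

convert_ext /dev/sdX1 /mnt/converted
rollback_ext /dev/sdX1 /mnt/converted
```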

Extra Credit: Mount Options

The compress and compress-force options enable transparent data compression in the filesystem, with the force option attempting to compress even files that typically do not compress well (such as compressed audio and video formats). The ssd option is useful for those users with solid-state disks; it turns on several optimizations that increase performance for these already-speedy devices.
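These options are passed with -o like any other mount option, and several can be combined with commas. In the sketch below, mount_tuned and the RUN dry-run switch are invented for illustration; the option names are the ones described above.

```shell
# Dry run by default: commands are printed, not executed.
# Start the script with RUN= (empty) in the environment to run them as root.
RUN=${RUN-echo}

# mount_tuned OPTIONS DEVICE MOUNTPOINT
# OPTIONS is a comma-separated list such as compress or compress-force,ssd.
mount_tuned() {
    $RUN mount -t btrfs -o "$1" "$2" "$3"
}

# Alternatives; pick one when running for real.
mount_tuned compress           /dev/sda /mnt/bigarray
mount_tuned compress-force,ssd /dev/sda /mnt/bigarray
```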

Btrfs is still undergoing rapid development; support for additional RAID configurations, deduplication, and online filesystem checks is still planned. In the meantime, consider how the merging of partitions, arrays, and logical volumes into one filesystem could simplify your system administration, and how snapshots could change your backup plan. You might not feel like waiting.
