November 25, 2010

Weekend Project: Linux Filesystem Tune-up


If the thought of getting up at 3AM on "Black Friday" and dragging yourself across town to stand in line for sales doesn't fill you with the holiday spirit, why not spend your weekend doing something more meaningful, like cleaning up your Linux filesystems? To be sure, a modern Linux file server probably isn't in need of being torn down and rebuilt, but if you're like a lot of us, you partitioned those disks several releases ago with lofty intentions of leveraging extents, delayed allocation, B+ trees, and all sorts of other advanced features, only to let them languish at their default settings instead. Well, the time to tune the filesystem is now: grab a storage medium, a terminal, and optionally a turkey leg, and let's get to work.

We can break down the task by looking at each advanced filesystem in turn. You may run Ext4 on all of your disks, but you may also have a mix of other modern filesystems in there, too. In order to not get confused, run mount -l and jot down the device name (e.g., /dev/sda6) and filesystem (whatever is listed right after "type" in the output) for each mounted disk, then consult the appropriate section.
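
If you have a lot of disks, a small filter can do the jotting down for you. The following is a minimal sketch, assuming the usual "device on mountpoint type fstype (options)" layout of mount -l output; the function name list_fs_types and the two sample lines are made up for illustration, and mount points containing spaces are not handled.

```shell
# Reduce `mount -l` style lines to "device mountpoint fstype".
list_fs_types() {
    awk '$2 == "on" && $4 == "type" { print $1, $3, $5 }'
}

# Hypothetical sample input; on a real system run:
#   mount -l | list_fs_types
list_fs_types <<'EOF'
/dev/sda6 on /home type ext4 (rw,relatime)
/dev/sdb1 on /srv/media type xfs (rw,noatime) [media]
EOF
```

Lines that do not follow the expected layout are skipped silently rather than guessed at.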

XFS

The XFS filesystem, originally written for IRIX at Silicon Graphics, is designed to be high-performance, especially when dealing with large files. That makes it a good choice for your media server, perhaps, but not the best option for hosting a public source code repository with its many small files. XFS is included by default in most distros, but make sure you also install the main XFS utilities, the "xfsprogs" package.

One of the easiest things you can do is defragment an XFS filesystem. Because XFS is optimized for high disk throughput, it slows down when files are split into many extents scattered across the disk. XFS's xfs_fsr tool can defragment a mounted partition without interrupting other work. Just run sudo xfs_fsr -v /your/mount/point to start the process. The utility makes multiple passes through the partition, picking out the most fragmented files on each pass and reorganizing the top 10% of them each time. By default it runs for two hours, though you can change this with the -t numberofseconds flag. You get the best results the first time you run it; on subsequent runs there will be fewer and fewer fragmented files to consolidate.
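
Since -t takes seconds, it is easy to get the arithmetic wrong when you only want a short run. Here is a small sketch that builds the command for a time limit given in minutes and prints it for review instead of running it; the function name fsr_command is hypothetical.

```shell
# Print (not run) an xfs_fsr command limited to a number of minutes.
fsr_command() {
    minutes=$1
    mount_point=$2
    seconds=$((minutes * 60))   # xfs_fsr -t expects seconds
    printf 'sudo xfs_fsr -v -t %d %s\n' "$seconds" "$mount_point"
}

fsr_command 30 /your/mount/point
```

Leaving out -t altogether gives the two-hour default described above.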

Moving forward, you can improve performance by changing the pre-allocation chunk size when you mount the filesystem. Edit /etc/fstab and add the allocsize=X option to the options list for the filesystem. X must be a power of two between the page size (typically 4 KB) and 1 GB; a large value, like 1G, will give you the most contiguously allocated files.
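
Because the kernel only accepts power-of-two values from the page size up to 1 GB, it can be worth checking a number before it goes into /etc/fstab. The check below is a sketch that assumes a 4096-byte page size; valid_allocsize is a made-up helper name, and the fstab line in the closing comment uses placeholder device and mount point names.

```shell
# Succeed if the byte count is a power of two between 4 KiB and 1 GiB.
valid_allocsize() {
    n=$1
    [ "$n" -ge 4096 ] && [ "$n" -le 1073741824 ] && [ $((n & (n - 1))) -eq 0 ]
}

for size in 65536 100000 1073741824; do
    if valid_allocsize "$size"; then
        echo "$size: ok"
    else
        echo "$size: rejected"
    fi
done

# A matching /etc/fstab entry might look like this (placeholders):
# /dev/sda7  /srv/media  xfs  defaults,allocsize=64m  0  2
```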

Another helpful feature of XFS that is not enabled by default is disk quotas. You can assign quotas on a per-user, per-group, or per-directory basis (the latter is referred to officially as "project quotas") with the xfs_quota utility. Using this feature you can rein in troublesome processes by, for example, setting a quota on /var/spool/mycrazyserver. You would do this by assigning a project number to the chosen directory in /etc/projects (such as 101:/var/spool/mycrazyserver), then mounting the disk with quota support turned on: mount -o prjquota /dev/sda9 /var.

Finally, activate the quota with xfs_quota -x -c 'project -s 101' /var; xfs_quota -x -c 'limit -p bhard=20g 101' /var. Note that the path argument names the mount point of the filesystem, not the project directory. The "bhard" setting indicates a hard limit of 20 gigabytes for the specified directory. You can also specify a soft limit with bsoft, which can be exceeded temporarily and is only enforced once a grace period runs out. That might be a nicer option to start off with for user limits.
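
The one step here that is easy to repeat by accident is the /etc/projects entry. The sketch below adds an entry only if it is not already present; add_project is a hypothetical helper, and its optional third argument exists only so you can try it on a scratch file first. The remaining commands appear as comments because they need root and a filesystem mounted with prjquota, and they pass the mount point as the manual's own example does.

```shell
# Append "id:directory" to the projects file unless it is already there.
add_project() {
    id=$1
    dir=$2
    file=${3:-/etc/projects}
    entry="$id:$dir"
    grep -qxF "$entry" "$file" 2>/dev/null || echo "$entry" >> "$file"
}

# Typical sequence, run as root:
#   add_project 101 /var/spool/mycrazyserver
#   xfs_quota -x -c 'project -s 101' /var
#   xfs_quota -x -c 'limit -p bhard=20g 101' /var
#   xfs_quota -x -c 'report -p' /var      # show usage against the limit
```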

Although XFS is a journaling filesystem, that does not free you from the responsibility of making backups and restoring from them. But you do not have to rely on external tools like rsync; instead, xfsdump can back up an entire filesystem (complete with all of its extended attributes). Xfsdump can write to tape for those with traditional backup drives, but it works just as well copying to another directory when you chain it to xfsrestore. Just run xfsdump -J - /directory/to/backup | xfsrestore -J - /destination/directory; the lone "-" tells each tool to use standard output or standard input instead of a dump file.

If you are backing up a multi-user system or a running server, make sure that you suspend access to it first by running xfs_freeze -f /directory/to/backup, and "thaw" it out again once you are finished with xfs_freeze -u /directory/to/backup. XFS will temporarily suspend write access to the filesystem while it is frozen, but not destructively so — as soon as it thaws out, all frozen disk operations will continue.
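
The main risk with freezing is forgetting to thaw, for instance when the copy fails halfway through. The following sketch wraps the freeze, copy, and thaw steps so that the thaw always runs; backup_frozen is a made-up name, and the function assumes that the destination is on a different filesystem, since writes to the frozen one would block.

```shell
# Freeze an XFS filesystem, copy it, and always thaw it again.
# Must run as root.
backup_frozen() {
    src=$1     # mount point of the XFS filesystem to copy
    dest=$2    # existing directory on another filesystem

    xfs_freeze -f "$src" || return 1

    # Stream the dump straight into xfsrestore; -J skips the dump inventory.
    xfsdump -J - "$src" | xfsrestore -J - "$dest"
    status=$?   # exit status of xfsrestore, the last command in the pipe

    # Thaw whether or not the copy succeeded.
    xfs_freeze -u "$src"
    return "$status"
}

# Usage (as root): backup_frozen /directory/to/backup /destination/directory
```

This does not protect against the script itself being killed mid-copy; if that happens, run xfs_freeze -u by hand.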

Ext4

At the moment, Ext4 is the default filesystem for several popular desktop and server Linux distributions. It does not have as much in the way of fancy features as XFS, but there are still options at your disposal to tune performance. For example, there are three available journaling modes that you can select between at mount time.

You can add data=journal, data=ordered, or data=writeback to your mount options to change the journaling behavior. The "journal" mode is the most reliable, and the slowest; it logs both data and metadata in the journal before writing them to their final location. The "ordered" mode, which is the default, journals only metadata, but forces each file's data out to disk before the metadata describing it is committed. The "writeback" mode also journals only metadata but makes no promise about ordering, which makes it the fastest option, at the expense of the possibility that files written shortly before a crash contain stale data afterward.
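
Before changing anything, it helps to know which mode a filesystem is using now. The helper below is a sketch that pulls the data= setting out of a comma-separated option string, such as the options field in /proc/mounts or /etc/fstab, and falls back to "ordered" when none is given; journal_mode is a hypothetical name.

```shell
# Print the journaling mode named in a mount option string.
journal_mode() {
    mode=$(printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^data=//p' | tail -n 1)
    echo "${mode:-ordered}"   # no data= option means the default mode
}

journal_mode "rw,relatime,data=writeback"
journal_mode "rw,noatime"
```

The first call prints writeback and the second prints ordered.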

Ext4 also supports pre-allocation via the "reservation" option; just add reservation to your mount command, such as mount -t ext4 -o reservation /dev/sdb4 /my/data. This allows the kernel to reserve space for a newly created file, rather than forcing an application (such as a P2P file download tool or a database) to do it; the result is that the filesystem can keep the newly reserved space in one contiguous extent rather than scattering it. The option is inherited from Ext3, so check man mount for your kernel: newer Ext4 code may accept it only for compatibility and rely on delayed allocation instead.

In case you're curious, you can check how fragmented an Ext4 filesystem's free space is with e2freefrag /dev/your/filesystem, but unfortunately there is no online defragmentation tool yet; there was an e2defrag several years ago, but it was never updated to handle the journaling features introduced by Ext3. Online defragmentation is slated to appear in future updates to Ext4.

JFS

The Journaled File System (JFS) also sports some features that you must study in order to take full advantage of them. The filesystem is a standard part of the kernel these days, but you will also want to make sure you have installed the JFS tools package, named jfsutils.

One of JFS's most interesting options is to store the filesystem journal on a separate device. This can improve both performance and reliability, because the journal is naturally the most written-to section of the disk, and is almost never physically adjacent to the blocks being changed by the filesystem operation.

If you are newly creating your JFS filesystem, you can specify the journal location with the -J option. To select a different device, first create the journal with jfs_mkfs -J journal_dev /heres/my/external/journal, then create the filesystem to use it with jfs_mkfs -J device=/heres/my/external/journal /dev/sdb5.

But what if you didn't realize that you had such an option when you first created the filesystem? That's where jfs_tune comes in. With jfs_tune, you can move the existing journal to a different device. You first need to create the external journal with the first jfs_mkfs command from the paragraph above. But then instead of creating a new filesystem, you simply tell the existing JFS filesystem to use it, with jfs_tune -J device=/heres/my/external/journal /dev/sdb4.
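
The two paths differ only in the second command, which is easy to mix up. This sketch prints the pair of commands for either case so you can review them before running anything as root; jfs_journal_plan and its "new" and "existing" keywords are invented for illustration.

```shell
# Print the commands for putting a JFS journal on an external device.
# Nothing is executed.
jfs_journal_plan() {
    mode=$1        # "new" or "existing" filesystem
    journal=$2     # device that will hold the external journal
    fs=$3          # device holding (or about to hold) the filesystem

    case $mode in
        new)      tool=jfs_mkfs ;;
        existing) tool=jfs_tune ;;
        *)        echo "unknown mode: $mode" >&2; return 1 ;;
    esac

    echo "jfs_mkfs -J journal_dev $journal"
    echo "$tool -J device=$journal $fs"
}

jfs_journal_plan existing /heres/my/external/journal /dev/sdb4
```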

ReiserFS

ReiserFS version 3 is still popular in many Linux deployments, despite the uncertainty over the Reiser4 filesystem. Like Ext4, it can use writeback mode to speed up journal performance. The syntax is the same as Ext4's; just add data=writeback to the mount command. Alternatively, choose data=journal to use the slowest, but safest, journaling option.

There is also a nolog option that disables writes to the journal entirely. It is the fastest possible option, but of course it comes at the cost of quick recoverability in the event of a crash.

ReiserFS does some interesting tricks to avoid wasting space, such as "tail packing": storing small files, and the tail ends of larger ones, directly within the filesystem's tree rather than in separate blocks. You can disable this feature with the notail option; doing so generally costs some disk space but speeds up reads and writes, since packing and unpacking tails takes extra work.

Anything Else

If you are using anything other than the modern filesystems mentioned here, then although there are tools to help you tune many of the available options, your time might be better spent migrating your files to a new filesystem altogether. Certainly there are exceptions, such as special-purpose encrypted filesystems like EncFS or distributed filesystems like Coda, but the current work is all on even newer options like Btrfs, which may eventually supersede even the venerable XFS and Ext4.

Nevertheless, if you need to get started, always begin by reading the manual page for mount with man mount. The section pertaining to your filesystem will list all of the supported mount-time options in the current Linux kernel. For the most part, you can try out these different options without worrying about rendering your system unbootable — although you still risk messing things up. From there, you can look for the filesystem's home page, which will delve deeper into the available utilities. Following that, a good bet is to search the Web for online resources for tweaking server performance — Red Hat, Novell, and most of the commercial distros all have such information available, and provide good coverage for older filesystems.

Of course, if the filesystem you are interested in happens to be Ext3, then you have it even easier: just update the filesystem in-place to Ext4. Ext4 is backward-compatible with Ext3, so an existing Ext3 filesystem can be mounted as Ext4 without losing any information. You can even enable most of Ext4's features, such as extents; however, once you do so, you can never go back to mounting it as Ext3. But why would you do that anyway?
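
To see whether a filesystem already has extents turned on, you can look at the feature list that tune2fs -l prints. The sketch below assumes that the list appears on a line starting with "Filesystem features:" and that the extents feature shows up there as "extent"; has_feature and the sample line are made up for illustration. The conversion commands in the closing comment follow the commonly documented procedure and must be run on an unmounted filesystem.

```shell
# Succeed if the named feature appears in `tune2fs -l` output on stdin.
has_feature() {
    grep '^Filesystem features:' | tr -s ' ' '\n' | grep -qxF "$1"
}

# Hypothetical feature line from an unconverted Ext3 filesystem.
sample='Filesystem features:      has_journal ext_attr resize_inode dir_index filetype sparse_super large_file'

if printf '%s\n' "$sample" | has_feature extent; then
    echo "extents enabled"
else
    echo "extents not enabled"
fi

# On a real, unmounted filesystem (device name is a placeholder):
#   tune2fs -l /dev/sdXN | has_feature extent
#   tune2fs -O extents,uninit_bg,dir_index /dev/sdXN
#   e2fsck -fD /dev/sdXN
```

Files written before the conversion keep their old block mapping; only new files use extents.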