This article is excerpted from the recently published book Linux Power Tools.
Beyond picking a filesystem, you should be familiar with various filesystem tools. Filesystem creation options and performance enhancing tools can improve disk throughput, and partition resizers enable you to grow or shrink a partition to better suit your storage needs. Filesystems sometimes become corrupted, and fixing these problems is critical when they occur. Finally, one very common problem is that of accidentally or prematurely deleted files. Knowing how to recover such files can save you or your users a lot of time and effort.
Picking the right filesystem
When you installed Linux, the installation program gave you options relating to the filesystems you could use. Most distributions that ship with 2.4.x and later kernels support ext2fs, ext3fs, and ReiserFS. Some also support JFS and XFS. Even if your distribution doesn't support JFS or XFS, though, you can add that support by downloading the appropriate kernel patches or prepatched kernels from the JFS or XFS sites and compiling this support as a module or into the kernel proper. You can then convert a partition from one filesystem to another by backing up, creating the new filesystem, and restoring. (Support for JFS has been added to the 2.4.20 and 2.5.6 kernels, and XFS has been added to the 2.5.36 kernel. Thus, these filesystems are likely to become options for most distributions at install time.)
Unfortunately, the best filesystem to use is not always obvious. For many installations, it's not even terribly important, but for some applications it is. Filesystem design differences mean that some perform some tasks better than others. Varying support tools also mean that advanced filesystem features differ. This section describes the pros and cons of the popular Linux filesystems in several different areas, such as filesystem portability, disk check times, disk speed, disk space consumption, support for large numbers of files, and advanced security features.
Maximizing filesystem portability
Ext2fs is the most portable native Linux filesystem. Drivers and access tools for ext2fs are available in many different OSs, meaning that you can access ext2fs data from many non-Linux OSs. Unfortunately, most of these tools are limited in various ways -- for instance, they may be access utilities rather than true drivers, they may not work with the latest versions of ext2fs, they may be able to read but not write ext2fs, or they may run a risk of causing filesystem corruption when writing to ext2fs. Therefore, ext2fs's portability is limited.
Ext3fs is a journaling extension to ext2fs. (The next section, "Reducing Disk Check Times," describes journaling in more detail.) As such, many of the ext2fs access tools can handle ext3fs, although some disable write access on ext3 filesystems.
IBM wrote JFS for its AIX OS, and later ported it to OS/2. IBM then open sourced the OS/2 JFS implementation, leading to the Linux JFS support. This heritage makes JFS a good choice for systems that multiboot Linux and OS/2. There are compatibility issues, though. Most importantly, you must use 4,096-byte clusters to enable both OSs to use the same JFS partitions. There are also filename case-retention issues -- OS/2 is case-insensitive, whereas Linux is case-sensitive. You can use JFS in a case-insensitive way from Linux, but this is only advisable on dedicated data-transfer partitions.
XFS, from Silicon Graphics' (SGI's) IRIX, is another migrant filesystem. Linux/IRIX dual-boot systems are rare, but you might want to use XFS as a compatibility filesystem on removable disks that move between Linux and IRIX systems. You can also use Linux's XFS support to read hard disks that originated on IRIX systems.
ReiserFS is currently the least portable of the major Linux-native filesystems. There is a BeOS version, but versions for other platforms have yet to appear. Therefore, you should avoid ReiserFS if you need cross-platform compatibility.
Reducing disk check times
All filesystems necessarily write data in chunks. In the event of a power outage, system crash, or other problem, the disk may be left in an unstable condition as a result of a half-completed operation. The result can be lost data and disk errors down the line. In order to head off such problems, modern filesystems support a dirty bit. When Linux mounts a filesystem, it sets the dirty bit, and when it unmounts the filesystem, Linux clears the dirty bit. If Linux detects that the dirty bit is set when mounting a filesystem, the OS knows that the filesystem was not properly unmounted and may contain errors. Depending on mount command options, Linux may run fsck on the filesystem when its dirty bit is set. This program, described in more detail in the upcoming section, "Recovering from Filesystem Corruption," checks for disk errors and corrects them whenever possible.
Unfortunately, a complete disk check on a traditional filesystem such as ext2fs takes a long time, because the computer must scan all the major disk data structures. If an inconsistency is found, fsck must resolve it. The program can often do this on its own, but it sometimes requires help from a person, so you may have to answer bewildering questions about how to fix certain filesystem problems after a crash or other system failure. Even without answering such questions, disk checks of multigigabyte hard disks can take many minutes, or potentially even hours. This characteristic may be unacceptable on systems that should have minimal down time, such as many servers.
Over the past decade, journaling filesystems have received increasing attention as a partial solution to the disk check time problem. A journaling filesystem keeps an on-disk record of pending operations. When the OS writes data to the disk, it first records a journal entry describing the operation; then it performs the operation; and then it clears the journal. In the event of a power failure or crash, the journal contains a record of all the operations that might be pending. This information can greatly simplify the filesystem check operation; instead of checking the entire disk, the system can check just those areas noted in the journal as having pending operations. The result is that a journaling filesystem takes just a few seconds to mount after a system crash. Of course, some data might still be lost, but at least you won't wait many minutes or hours to discover this fact.
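The record, perform, clear sequence described above can be sketched as a toy model. This is only an illustration under simplifying assumptions: the JournaledStore class and its method names are hypothetical, everything lives in memory, and real journaling filesystems typically journal metadata in transactions rather than individual writes.

```python
class JournaledStore:
    """Toy in-memory model of a journaled write path (not a real filesystem)."""

    def __init__(self):
        self.blocks = {}    # block number -> data, standing in for the disk
        self.journal = []   # pending operations, also assumed to be on disk

    def write(self, block, data, crash_before_apply=False):
        # 1. Record a journal entry describing the operation.
        self.journal.append((block, data))
        if crash_before_apply:
            # Simulated power failure: the entry survives, the data write does not.
            return
        # 2. Perform the operation.
        self.blocks[block] = data
        # 3. Clear the journal entry.
        self.journal.remove((block, data))

    def recover(self):
        """Replay only the pending entries instead of scanning every block."""
        replayed = len(self.journal)
        for block, data in self.journal:
            self.blocks[block] = data
        self.journal.clear()
        return replayed


store = JournaledStore()
store.write(1, "a")
store.write(2, "b", crash_before_apply=True)
print(store.recover())  # number of journal entries that had to be replayed
```

The point of the sketch is that recovery cost depends on the length of the journal, not on the size of the disk, which is why mounting after a crash takes seconds rather than minutes.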
Linux supports four journaling filesystems:
Ext3fs -- This filesystem is basically just ext2fs with a journal added. As such, it's quite reliable, because of the well-tested nature of the underlying ext2fs. Ext3fs can also be read by an ext2fs driver; however, when it's mounted in this way, the journal will be ignored. Ext3fs also has another advantage: As described in the upcoming section, "Converting Ext2fs to Ext3fs," you can convert an existing ext2 filesystem into an ext3 filesystem without backing up, repartitioning, and restoring.
ReiserFS -- This filesystem was the first journaling filesystem added to the Linux kernel. As such, it's seen a lot of testing and is very reliable. It was designed from the ground up as a journaling filesystem for Linux, and it includes several unusual design features, such as the ability to pack small files into less disk space than is possible with many filesystems.
JFS -- IBM's JFS was developed in the mid-1990s for AIX, then it found its way to OS/2 and then to Linux. It's therefore well tested, although the Linux version hasn't seen much use compared to the non-Linux version or even ext3fs or ReiserFS on Linux.
XFS -- SGI's XFS dates from the mid-1990s on the IRIX platform, so the filesystem fundamentals are well tested. It's the most recent official addition to the Linux kernel, although it has been a fairly popular add-on for quite a while. XFS comes with more ancillary utilities than does any filesystem except ext2fs and ext3fs. It also comes with native support for some advanced features, such as ACLs (see the upcoming section, "Securing a Filesystem with ACLs"), that aren't as well supported on most other filesystems.
For the most part, I recommend using a journaling filesystem; the reduced startup time makes these filesystems beneficial after power outages or other problems. Some of these filesystems do have drawbacks, though. Most importantly, some programs rely upon filesystem quirks in order to work. For instance, as late as 2001, programs such as NFS servers and the Win4Lin emulator had problems with some of these journaling filesystems. These problems have been disappearing, though, and they're quite rare as of the 2.5.54 kernel. Nonetheless, you should thoroughly test all your programs (especially those that interact with disk files in low-level or other unusual ways) before switching to a journaling filesystem. The safest journaling filesystem from this perspective is likely to be ext3fs, because of its close relationship to ext2fs.
ReiserFS and JFS are also somewhat deficient in terms of support programs. For instance, neither includes a dump backup utility. XFS's dump utility (xfsdump) is available from the XFS development site but isn't shipped with the xfsprogs 2.2.1 package, although some distributions ship it in a separate xfsdump package. The xfsdump and the ext2fs/ext3fs dump programs create incompatible archives, so you can't use these tools to back up one filesystem and restore it to another.
Maximizing disk throughput
One question on many people's minds is which filesystem yields the best disk performance. Unfortunately, this question is difficult to answer because different access patterns, as created by different uses of a system, favor different filesystem designs. In Linux Filesystems (Sams, 2001), William von Hagen ran many benchmarks and found that every Linux filesystem won several individual tests. As a general rule, though, XFS and JFS produced the best throughput with small files (100MB), while ext2fs, ext3fs, and to a lesser extent JFS did the best with larger files (1GB). Some benchmarks measure CPU use, which can affect system responsiveness during disk-intensive operations. At small file sizes, results were quite variable; no filesystem emerged as a clear winner. At larger file sizes, ext3fs and JFS emerged as CPU-time winners.
Unfortunately, benchmarks are somewhat artificial and may not reflect real-world performance. For instance, von Hagen's benchmarks show ext2fs winning file-deletion tests and ReiserFS coming in last; however, von Hagen comments that this result runs counter to his subjective experience, and I concur. ReiserFS seems quite speedy compared to ext2fs when deleting large numbers of files. This disparity may be because von Hagen's tests measured CPU time, whereas we humans are more interested in a program's response time. The moral is that you shouldn't blindly trust a benchmark. If getting the best disk performance is important to you, try experimenting yourself. Be sure to run tests using the same hardware and partition; wipe out each filesystem in favor of the next one, so that you're testing using the same disk and partition each time. Install applications or user files, as appropriate, and see how fast the system is for your specific purposes. If this procedure sounds like it's too much effort to perform, then perhaps the performance differences between filesystems aren't all that important to you, and you should choose a filesystem based on other criteria.
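If you do experiment yourself, it helps to record both CPU time and wall-clock (response) time, since the two can tell different stories. The following is a minimal sketch, not a rigorous benchmark: the function name, file count, and file size are arbitrary assumptions, and because it does not flush data to disk, caching will hide much of the real disk cost.

```python
import os
import tempfile
import time


def time_create_and_delete(directory, count=1000, size=1024):
    """Create then delete `count` files of `size` bytes; return (wall, cpu) seconds."""
    payload = b"x" * size
    wall_start = time.perf_counter()   # elapsed time as a person experiences it
    cpu_start = time.process_time()    # CPU time charged to this process
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"bench_{i}.tmp")
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    for path in paths:
        os.remove(path)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    # Point this at a directory on the filesystem under test instead of a
    # temporary directory to compare filesystems on the same partition.
    with tempfile.TemporaryDirectory() as d:
        wall, cpu = time_create_and_delete(d)
        print(f"wall-clock: {wall:.3f}s  CPU: {cpu:.3f}s")
```

Running the same script against each candidate filesystem, on the same partition, gives numbers that are at least comparable with one another, even if they say little about other workloads.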
Minimizing space consumption
Most filesystems allocate space to files in blocks, which are typically power-of-two multiples of 512 bytes in size (that is, 2^1 x 512, 2^2 x 512, 2^3 x 512, and so on). Common block sizes for Linux filesystems range from 1KB to 4KB (the range for ext2fs and ext3fs). XFS supports block sizes ranging from 512 bytes to 64KB, although in practice block size is limited by CPU architecture (4KB for IA-32 and PowerPC; 8KB for Alpha and Sparc). ReiserFS and Linux's JFS currently support only 4KB blocks, although JFS's data structures support blocks as small as 512 bytes. The default block size is 4KB for all of these filesystems except ext2fs and ext3fs, for which the default is based on the filesystem size.
You can minimize the space used by files, and hence maximize the number of files you can fit on a filesystem, by using smaller block sizes. This practice may slightly degrade performance, though, as files may become more fragmented and require more pointers to completely describe the file's location on the disk.
ReiserFS is unusual in that it supports storing file tails -- the ends of files that don't occupy all of an allocation block -- from multiple files together in one block. This feature can greatly enhance ReiserFS's capacity to store many small files, such as those found on a news server's spool directory. XFS uses a different approach to achieve a similar benefit -- it stores small files entirely within the inode (a disk structure that points to the file on disk, holds the file's time stamp, and so on) whenever possible.
None of these features has much impact when average file sizes are large. For instance, saving 2KB by storing file tails in a single allocation block won't be important if a filesystem has just two 1GB files. If the filesystem has 2,000,000 1KB files, though, such space-saving features can make a difference between fitting all the files on a disk or having to buy a new disk.
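The arithmetic behind this comparison can be worked through in a few lines. This is a sketch that counts only whole-block allocation and ignores inodes, journals, and other metadata; the function names are illustrative.

```python
KB = 1024
GB = 1024 ** 3


def allocated_bytes(file_size, block_size):
    """Space consumed by one file when space is handed out in whole blocks."""
    blocks = (file_size + block_size - 1) // block_size  # round up
    return blocks * block_size


def total_slack(file_size, file_count, block_size):
    """Bytes allocated but unused across `file_count` files of equal size."""
    return (allocated_bytes(file_size, block_size) - file_size) * file_count


# 2,000,000 files of 1KB each on a filesystem with 4KB blocks:
print(total_slack(1 * KB, 2_000_000, 4 * KB))  # 6144000000 bytes wasted

# The same files with 1KB blocks waste nothing:
print(total_slack(1 * KB, 2_000_000, 1 * KB))  # 0

# Two 1GB files with 4KB blocks waste at most one partial block each:
print(total_slack(1 * GB, 2, 4 * KB))          # 0
```

With 4KB blocks, the small files occupy about four times the space their contents require, which is the situation in which smaller blocks, tail packing, or in-inode storage pays off.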
Another aspect of disk space consumption is the space devoted to the journal. On most disks, this isn't a major consideration; however, it is a concern on small disks, such as Zip disks. On a 100MB Zip disk, ReiserFS devotes 32MB to its journal and ext3fs and XFS both devote 4MB. JFS devotes less space to its journal initially, but it may grow with use.
Ext2fs and ext3fs suffer from another problem: By default, they reserve five percent of their disk space for emergency use by root. The idea is to give root space to work in case a filesystem fills up. This may be a reasonable plan for critical filesystems such as the root filesystem and /var, but for some it's pointless; for instance, root doesn't need space on /home or on removable media. The upcoming section, "Creating a Filesystem for Optimal Performance," describes how to reduce the reserved space percentage.
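To see how much these overheads matter on a small disk, the journal sizes and the default reserve quoted above can be combined into a rough estimate. This sketch assumes the reserve is taken as a percentage of the whole disk and ignores all other filesystem metadata, so the results are approximations only.

```python
def usable_mb(disk_mb, journal_mb, reserved_pct=0):
    """Rough space left for ordinary users after journal and root reserve."""
    reserved_mb = disk_mb * reserved_pct / 100
    return disk_mb - journal_mb - reserved_mb


# A 100MB Zip disk, using the journal sizes quoted in the text:
print(usable_mb(100, 32))                 # ReiserFS: 68.0
print(usable_mb(100, 4))                  # XFS: 96.0
print(usable_mb(100, 4, reserved_pct=5))  # ext3fs with default reserve: 91.0
```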
Supporting the maximum number of files
To some extent, storing the maximum number of files on a partition is an issue of the efficient allocation of space for small files, as described in the preceding section, "Minimizing Space Consumption." Another factor, though, is the number of available inodes. Most filesystems support a limited number of inodes per disk. These inodes limit the number of files a disk can hold; each file requires its own inode, so if you store too many small files on a disk, you'll run out of inodes. With ext2fs and ext3fs, you can change the number of inodes using the -i and -N options to mke2fs when you create the filesystem. These options set the bytes-per-inode ratio (typically 2 or 4; increasing values decrease the number of inodes on the filesystem) and the absolute number of inodes, respectively. With XFS, you can specify the maximum percentage of disk space that may be allocated to inodes with the maxpct option to mkfs.xfs. The default value is 25, but if you expect the filesystem to have very many small files, you can specify a larger percentage.
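The relationship between a bytes-per-inode ratio and the resulting inode count is simple division, as the following sketch shows. The ratios of 4,096 and 8,192 bytes and the 1GB filesystem size are chosen purely for illustration, and an actual filesystem creation tool may round the result, so treat the figures as estimates.

```python
GB = 1024 ** 3


def estimated_inodes(filesystem_bytes, bytes_per_inode):
    """Approximate inode count: one inode per `bytes_per_inode` bytes of space."""
    return filesystem_bytes // bytes_per_inode


print(estimated_inodes(1 * GB, 4096))  # 262144
print(estimated_inodes(1 * GB, 8192))  # 131072
```

Doubling the ratio halves the number of inodes, and therefore halves the maximum number of files the filesystem can hold.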
ReiserFS is unusual in that it allocates inodes dynamically, so you don't need to be concerned with running out of inodes. This fact also means that the -i option to the df utility, which normally returns statistics on used and available inodes, returns meaningless information about available inodes on ReiserFS volumes.
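The same inode statistics can be read programmatically. The sketch below uses Python's os.statvfs call, which is available on Unix-like systems; the inode_usage function name is illustrative. On filesystems that allocate inodes dynamically, the reported totals may be zero or otherwise uninformative.

```python
import os


def inode_usage(path="/"):
    """Return (total, free, used) inode counts for the filesystem holding `path`."""
    st = os.statvfs(path)
    total = st.f_files   # total inodes the filesystem reports
    free = st.f_ffree    # inodes still available
    return total, free, total - free


if __name__ == "__main__":
    total, free, used = inode_usage("/")
    print(f"inodes: total={total} used={used} free={free}")
```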
Securing a filesystem with ACLs
Linux, like Unix in general, has traditionally used file ownership and permissions to control access to files and directories. Some of the tools for handling these features are described in Chapter 5, "Doing Real Work in Text Mode." Another way to control access to files is by using access control lists (ACLs). ACLs provide finer-grained access control than do ownership and permissions. ACLs work by attaching additional information -- a list of users or groups and the permissions to be granted to each -- to the file. For instance, suppose you have a file that contains confidential data. This data must be readable and writeable by you and readable by a particular group (say, readers). You give the file ownership and permissions such that only you can read or write the file and that anybody in readers can read it (0640, or -rw-r-----). You need to share this file with just one other user, though, and for purposes of security for other files, this user should not be a member of the readers group. ACLs enable you to do this by giving read permission to this one user, independently of the readers group. Without ACLs, you would need to create a new group (say, readers2) that contains all of the members of readers plus the one extra user. You'd then need to maintain this extra group. Also, ordinary users can manipulate ACLs, but this isn't usually the case for groups, so ACLs can greatly simplify matters if users should be able to give each other access to specific files while still maintaining restricted access to those files for others.
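The example above can be modeled in a few lines to show where an ACL entry fits relative to the ordinary owner, group, and other permissions. This is a deliberately simplified sketch: it handles only read access, ignores the ACL mask and named-group entries, and the can_read function and the user names are hypothetical.

```python
def can_read(user, user_groups, owner, group, mode, acl_read_users=()):
    """Simplified read check: owner, then named-user ACL entries, then group, then other."""
    if user == owner:
        return bool(mode & 0o400)   # owner read bit
    if user in acl_read_users:
        return True                 # a named-user ACL entry grants read
    if group in user_groups:
        return bool(mode & 0o040)   # group read bit
    return bool(mode & 0o004)       # other read bit


# File owned by "you", group "readers", mode 0640 (-rw-r-----).
# "pat" is a user who is not a member of readers.
print(can_read("pat", {"staff"}, "you", "readers", 0o640))                          # False
print(can_read("pat", {"staff"}, "you", "readers", 0o640, acl_read_users={"pat"}))  # True
```

The ACL entry grants access to the one extra user without touching the readers group, so no second group has to be created or maintained.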
Few Linux-native filesystems support ACLs directly; this honor belongs only to XFS. If you need ACLs, though, you can obtain add-on packages for ext2fs, ext3fs, and JFS. No matter what filesystem you use, you'll also need support utilities, which are available from the same site. These tools enable you to define and modify ACLs. For instance, getfacl displays a file's ACLs, and setfacl changes a file's ACLs.
ACLs are still quite new in Linux. As such, you may run into peculiar problems with specific programs or filesystems. Chances are you don't need ACLs on a typical workstation or a small server. If you're administering a multiuser system with a complex group structure, though, you might want to investigate ACLs further. You might be able to simplify your overall permissions structure by switching to a filesystem that supports ACLs.
In our next article we'll talk about optimizing filesystems.