December 15, 2004

SysAdmin to SysAdmin: Parallel I/O for clusters using PVFS

Author: Brian Jones

When you're trying to share files and perform I/O-intensive operations across 100+ nodes of a Beowulf cluster, the old model of a central NFS file server handling every client request begins to break horribly. What many cluster admins do instead is limit NFS to distributing home directories across the cluster, and using some form of parallel I/O model for doing "real work."

Of course, cluster administrators and users alike crave the benefits of an NFS-like system; admins (in many cases) want logins only to the head node of the cluster, and users want to have all of their output in one place so they don't have to collect it from the various local disks of the individual compute nodes. However, it has been proven time and again that NFS is ill-suited for the kind of pounding it would take in a large, I/O-intensive cluster environment. The answer to this problem has come from researchers, open source projects, and the private sector, in the form of a parallel I/O model.

PVFS basics

The Parallel Virtual File System (PVFS) is one solution for creating a parallel I/O environment for your compute nodes to play in. The model is simple when you look at it from a high level: instead of an all-encompassing server that handles every part of the operation requested by the client, the server's jobs are split among various servers. In a PVFS environment, there is a single metadata server that keeps track of where existing data lives, and which machines are available for performing actual I/O operations. The I/O daemons do the heavy lifting of performing the reads and writes, and they are the actual data stores.

While there is only one metadata server in use per PVFS filesystem, there can be many IODs (I/O daemons). While theory might suggest that the number of IODs could scale indefinitely, reality is that at some point you'll see a decreasing return due to limitations either in the network infrastructure, the metadata server's ability to handle all of the requests, or the client-side utilities, libraries, and network hardware. Though I have yet to go over 16 IODs to a PVFS filesystem, other cluster admins I've spoken to have indicated that IODs can scale up to 32 per filesystem with no problems. Developers who work with PVFS say that 100 IODs have been run without issues in testing environments, though they note that, at some point, a breakpoint is reached where your ability to increase I/O performance is outweighed by the cost increase of metadata operations when an IOD is added. For the record, much work has been done to address these issues in PVFS2, the next generation of PVFS.

The observant among you will note that I've implied that there can be more than one PVFS filesystem per cluster. This is absolutely true. However, a metadata server can handle requests for only a single filesystem, and to my knowledge, IODs answer only to a single metadata server, which also limits them to one filesystem. Technically, you can create two PVFS filesystems using the same physical machines by running two instances of the metadata server on one machine, and running two instances of the IODs on the I/O nodes. Of course, this would affect the scalability, performance, and reliability of both filesystems, not to mention perhaps your own sanity and other mental faculties, but the daemons communicate with each other on configurable port numbers, so in non-production environments, this is perfectly feasible.

Another detail you might wonder about is how storage capacity is determined. Storage capacity is a function of how much space is dedicated to the IODs on the I/O nodes. If you have a 50GB partition dedicated to the IODs on each of two IO nodes, you'll have something slightly less than 100GB available for use by the clients. Some space is taken up by directories set up by the IODs to help them find data more efficiently without using, say, a separate database component.

A PVFS I/O operation up close

There is already great documentation on getting started and setting up a basic PVFS filesystem in the form of a PVFS Quickstart guide. There is also a PVFS-users mailing list, which is very helpful while not being very high-traffic. While I won't duplicate the effort put into that guide, I will briefly discuss the basic mechanics of PVFS.

In this example, I have one metadata server, two IODs, and one client. The client, it should be noted, mounts a directory from the metadata server, since it is the authoritative source of information regarding what the filesystem looks like at any given time (nice, as it hides the details and allows you to scale IODs in a way that is transparent to the clients). I've mounted my PVFS partition to /mnt/pvfs, and I'm going to copy a file from my local disk to that directory.

Let's look at a simple dump of this operation from the client side to see things from the client's point-of-view. My network uses a address scheme. Node 2 ( is the metadata server. Node 252 is the local machine, and 253-254 are the I/O nodes. For clarity in reading the dump, port 7000 is the port used by the I/O nodes, and 3000 is used by the metadata server. I've stripped some most of the data to focus on the conversation (and to be nice to my editors!)

[root@compute-0-0 root]# tcpdump -nn -i eth0 host \( pvfs-0-0 or pvfs-0-1 or
frontend-0-0 \) and port \(7000 or 3000\)
tcpdump: listening on eth0 >
16:16:00.146923 >
16:16:00.147008 >
16:16:00.147242 >
16:16:00.149613 >
16:16:00.149738 >
16:16:00.153163 >
16:16:00.192986 >
16:16:00.194201 >
16:16:00.194250 >

To start, the client first contacts the metadata server, and requests a read operation on the /mnt/pvfs directory, to make sure that the file doesn't already exist. The metadata server replies with the contents of the directory (in this case, there's nothing in the directory, so that didn't take long). The client also requests a write operation, and the metadata server returns the list of IODs to contact to handle that request. From here, most of the traffic is between the client (252), and the IODs (253 and 254).

16:16:00.194386 >
16:16:00.194433 >
16:16:00.195631 >
16:16:00.195664 >
16:16:00.195670 >
16:16:00.195701 >
16:16:00.195898 >
16:16:00.197730 >
16:16:00.197884 >
16:16:00.202981 >
16:16:00.203261 >
16:16:00.203346 >
16:16:00.203700 >
16:16:00.203730 >
16:16:00.205020 >
16:16:00.205217 >
16:16:00.205300 >
16:16:00.206463 >
16:16:00.206542 >
16:16:00.206734 >
16:16:00.209134 >
16:16:00.242995 >
16:16:00.243010 >
16:16:00.243022 >

This traffic consists almost entirely of data-moving operations, from the clients to the IODs. Dumps on the IODs and metadata server of the same operation show that there is also communication betweeen these two components, since the IODs have to report to the metadata server about the state of the directory, which enables the metadata server to provide a consistent view of the directory. Though it's still a bit fuzzy to me, as best I can tell, the traffic interspersed among the last half of this conversation is likely an effort to propagate this updated view of the directory to the client.

PVFS is just one solution out there for parallel I/O. Others include Lustre and IBM's GPFS, to name just two available for Linux. I hope this article has helped give an understanding of how PVFS works, and as always, I encourage you to share your experiences and wisdom on this topic by posting in the comments.

Click Here!