September 8, 2004

SysAdmin to SysAdmin: Rocks tames Beowulf clusters

Author: Preston St. Pierre

I once set up a two-node Beowulf cluster at my house, but back then I had no clue what to do with it. It's tough for an administrator to make effective use of a
cluster, since there's no usable parallelized password cracker that I know of. Now I'm in an environment where other people can make use of a cluster, so I volunteered to set one up using Rocks 3.2.0.

System administration and cluster administration are
very different things. The tools are different, the problems are different
(though many are related to system-level stuff that we can sink our teeth
into), and the options for implementing a cluster are many and varied.

You can grab a vanilla distribution and build your tools of choice
from source on every node in your cluster. Then when new software, or new
user account information, or a new mount directive, or a new host entry, or
whatever, needs to be distributed to every node in the cluster, you can figure
out how to do it and then automate it so that next time it'll be a piece of
cake. Eventually (or maybe that should be "occasionally"), you'll get to a
point where you're not really sure whether the software is in sync on all of the
nodes, or a small set of nodes is behaving inconsistently with the rest.
Then you can just blow the whole thing away and start the whole
process over from scratch. Believe it or not, this appears to work wonderfully
for many cluster administrators.
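
To make the "is everything still in sync?" question concrete, here is a minimal sketch of how you might spot drifted nodes by hand. It assumes you have already collected a set of installed package names from each node by whatever means you like; the function name, the node names, and the package strings are all made up for illustration and are not part of any cluster toolkit.

```python
from collections import Counter


def find_drifted_nodes(packages_by_node):
    """Return {node: (missing, extra)} for nodes that differ from the majority.

    packages_by_node maps a node name to an iterable of package names.
    The most common package set is treated as the reference (an assumption
    of this sketch; you may prefer to name a reference node explicitly).
    """
    if not packages_by_node:
        return {}
    # Count identical package sets; frozenset makes them hashable.
    counts = Counter(frozenset(pkgs) for pkgs in packages_by_node.values())
    reference, _ = counts.most_common(1)[0]
    drift = {}
    for node, pkgs in packages_by_node.items():
        pkgs = frozenset(pkgs)
        if pkgs != reference:
            missing = sorted(reference - pkgs)  # in the reference, not on this node
            extra = sorted(pkgs - reference)    # on this node, not in the reference
            drift[node] = (missing, extra)
    return drift


# Hypothetical inventory, for illustration only.
inventory = {
    "node01": {"gcc-3.2", "mpich-1.2.5"},
    "node02": {"gcc-3.2", "mpich-1.2.5"},
    "node03": {"gcc-3.2", "mpich-1.2.4"},
}
print(find_drifted_nodes(inventory))
```

Even a small check like this is one more thing to write, schedule, and trust, which is the kind of overhead that adds up when the cluster is not your only job.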

Of course, these are probably full-time cluster admins. For those of us who
have an environment to maintain and administer outside the cluster room, it'd
be nice to have all of this stuff "just work." This, my friends, is where Rocks can save your sanity and time!

What the heck is Rocks?

Rocks is one of a few available cluster toolkits. Cluster toolkits come in
various shapes and sizes, all offering different mixes of control and
convenience, which are somewhat inversely proportional to each other.
Some are layers on top of a vanilla distribution, others are just packages of
scripts to make some maintenance tasks a bit easier. Rocks is an entire
cluster implementation, complete with all of the tools, software, and (gasp!)
documentation you need to get a cluster off the ground in minutes.

No, really. Minutes.

Since clusters have different purposes, calling for different toolsets, the
Rocks distribution is downloadable in the form of "rolls." A roll is basically
another ISO image that adds a certain desirable toolset to your cluster
automagically. So, for example, in addition to the Rocks "base" and "HPC" rolls
(which are required in order to use the cluster as, well, a cluster) I knew
from my research that I wanted to use the Maui Scheduler for my cluster
scheduling interface, and Torque for my resource management (Torque is based
on the wildly popular OpenPBS resource manager). I was able to do
this by adding the Rocks "PBS/Maui" roll.

Getting the cluster going is pretty much dependent on getting the head node up
and running. Pop in the Rocks base CD, fire it up, and you're prompted with
a menu with some obvious options. Choose the one that says you want to install a front-end
node. At that point, you'll have a few questions to answer, which are simple if
you've ever installed a Linux distribution. The installer then asks if you
want to add any rolls to your installation. Keep adding rolls to your heart's
content, and when you're done, just tell the installer you have no more rolls
to add. At that point, it'll go about its merry way, and in just about no time
you'll have a head node.

And what a head node it is! It contains a PXE server
pre-configured to kickstart the rest of the nodes, in accordance with the rolls
you added during your head node installation. In my case, I used the PBS/Maui
roll, so all of my compute nodes were set up to be PBS compute nodes which
report to a server on the head node. If you ever decide that a
compute node is so badly out of whack that you want to start over, you just log
into the head node and type shoot-node nodename, and the node will be immediately rebooted,
reinstalled, and brought back up for use in about 10
minutes on any decent hardware.
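
If several nodes are out of whack at once, you could wrap that command in a few lines of script. The following is only a sketch under some assumptions: it is meant to be run on the head node, it calls shoot-node with a single node name exactly as described above, the helper name is my own, and the node names are hypothetical. It defaults to a dry run so nothing gets reinstalled by accident.

```python
import subprocess


def reinstall_nodes(node_names, dry_run=True):
    """Build (and optionally run) one 'shoot-node <name>' command per node.

    Returns the list of commands so the caller can inspect or log them.
    """
    commands = [["shoot-node", name] for name in node_names]
    for cmd in commands:
        if dry_run:
            # Show what would happen without touching any node.
            print("would run:", " ".join(cmd))
        else:
            # check=True raises CalledProcessError if the command fails.
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical node names, for illustration only.
reinstall_nodes(["compute-0-3", "compute-0-7"])
```

Passing dry_run=False runs the commands one after another and stops at the first failure, which seemed the safer choice for something that wipes a machine.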

Adding new nodes is a matter of logging into the head node, running
insert-ethers, booting the compute node you want to add, telling it to boot
to the network, and that's it. The head node adds all of the information about
the new node to a local database and kickstarts it.

Need to add user accounts? Just add 'em to the head node, and it'll spit
'em out to the rest of the cluster using a Rocks-specific multicast-based
service called 411, which is used as a replacement for NIS and seems to work
flawlessly thus far. (There are tools to whip it into submission, just in case,
but I have yet to need them.) The same service handles host information, so
that your entire cluster is kept in sync at all times. The service works so
well that I wonder why nobody has tried to implement 411 outside the context of
a cluster -- maybe because of the multicast dependency?

In addition to adding compute nodes, the head node can also add different
kinds of nodes. For example, I currently have 16 compute nodes, which will grow
to roughly 140. Sometime before node 140 gets added to the cluster, I/O
performance for large, long-running jobs will become a problem. Rocks,
however, can bring up a dedicated Parallel Virtual File System (PVFS) I/O node for you, and make sure that your nodes actually use it. PVFS can completely alleviate the need for NFS within
your cluster, and we all know NFS is an enormous source of performance issues,
administrative overhead, and downtime.

If your cluster nodes don't have PXE or even a CD drive, no problem -- just check the Rocks documentation for alternative solutions suitable to your situation.

Support and docs

The Rocks documentation is great for getting an initial Rocks configuration up and running (and tested) -- and keep in mind this is coming from the guy
who got so fed up with the state of open source documentation that he wrote an
article about it. The Rocks team has done a good job of
addressing the possible issues that can arise during a configuration, and how
to avoid common pitfalls. They explain their take on the problem of cluster
configuration up front, state their assumptions, and clearly explain how to get
started.

The Rocks discussion mailing list is amazing. I've posted a few questions, and
even caught one bug, and there has been friendly help and encouragement along
the way. I generally avoid adding yet another mailing list to my collection,
since most of them seem to be frequented by new users who don't understand why
their hostname keeps showing up as "localhost," but this one was well worth it.
None of my questions has gone unanswered yet, and the problems I've had have
been solved in record time.

In conclusion

If you don't know much about clusters, but you want to know more, it's
sometimes easier to learn on a system that is already in some form of
"known-working" state. Rocks makes getting to this point a breeze, after which
you can start kicking tires and playing with the tools provided. From there,
it's just a short hop to production. Rocks has done an excellent job of
abstracting the problems associated with maintaining a cluster, and boiling
them all down to a set of very usable, very efficient tools.
