The cluster fest is neither a commercial event like LinuxWorld, nor a typical Linux community endeavor. For that reason it came into Austin under the radar, not drawing any attention on the local LUG mailing list. Although it's partially sponsored by Dell and IBM, there were no vendor booths set up from which to hawk their wares. There was a table with free issues of Linux Journal (another sponsor of the event), and of course there were coffee and pastries to get the morning sessions started.
But there is a sense of community here, very much like you would expect in a large LUG meeting. Conversations going on outside the meeting rooms are as likely as not to include the previous -- or the next -- speaker. It's relaxed and friendly, peer to peer rather than guru to the unwashed multitude. It's also more focused on computing than any other show I've attended. Not just any computing, however. At a minimum it needs to be high-performance, and of course it's even better if it has high-availability. My guess is that a large majority of the attendees are from academia.
I would estimate that fewer than 100 people were present for this morning's feature presentations on cluster health. Chokchai Leangsuksun of Louisiana Tech University started today's presentations with a talk entitled "A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster." Jim Prewett of the HPC center at the University of New Mexico followed with a talk entitled "Listening to Your Cluster with LoGS."
Chokchai claims that HA-Oscar has already made impressive gains in the HA game of "how many nines?" Oscar sports a downtime of roughly thirty hours a year. Primarily by making it dual-headed (see the images of Tux below) with failover on failure, HA-Oscar is estimated to reduce hours to minutes. Thirty-six minutes a year, to be precise. That doubles the nines from two (about 99.66% availability) to four (about 99.993%).
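For readers who want to check the "how many nines?" arithmetic, here is a minimal Python sketch. The function names are my own, and it assumes a 365-day year; it only converts the downtime figures quoted in the talk into availability percentages and counts the nines.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760; assumes a non-leap year


def availability_percent(downtime_hours_per_year: float) -> float:
    """Availability as a percentage, given yearly downtime in hours."""
    return (1.0 - downtime_hours_per_year / HOURS_PER_YEAR) * 100.0


def nines(downtime_hours_per_year: float) -> int:
    """Count the leading nines in the availability figure.

    An unavailability between 1e-2 and 1e-3 is "two nines",
    between 1e-4 and 1e-5 is "four nines", and so on.
    """
    if downtime_hours_per_year <= 0:
        raise ValueError("downtime must be positive to count nines")
    unavailability = downtime_hours_per_year / HOURS_PER_YEAR
    return math.floor(-math.log10(unavailability))


# Figures quoted in the talk: about 30 hours a year versus 36 minutes a year.
oscar_nines = nines(30.0)
ha_oscar_nines = nines(36.0 / 60.0)

print(f"Oscar:    {availability_percent(30.0):.2f}% ({oscar_nines} nines)")
print(f"HA-Oscar: {availability_percent(0.6):.3f}% ({ha_oscar_nines} nines)")
```

Thirty hours out of 8,760 is a little over a third of a percent of the year, which is why Oscar lands at two nines; thirty-six minutes is under a hundredth of a percent, which gets HA-Oscar to four.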
Jim Prewett "listens to his clusters" by analyzing their logs. Building on Logsurfer and swatch, he's written a new free software tool in Lisp designed to help administer the complexities associated with large clusters. It's called LoGS, and it is designed to be extendable and highly flexible.
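To give a feel for the general idea of rule-based log watching, here is a small Python sketch. It is not LoGS (which is written in Lisp) and does not reproduce its rule format or that of Logsurfer or swatch; the `Rule` class, the `scan` function, the patterns, and the sample log lines are all made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Rule:
    """A hypothetical rule: a regex plus an action run on each match."""
    name: str
    pattern: "re.Pattern[str]"
    # The action returns an alert string, or None to silently drop the line.
    action: Callable[[str, "re.Match[str]"], Optional[str]]


def scan(lines: Iterable[str], rules: List[Rule]) -> List[str]:
    """Apply the first matching rule to each line and collect the alerts."""
    alerts: List[str] = []
    for line in lines:
        for rule in rules:
            match = rule.pattern.search(line)
            if match is None:
                continue
            result = rule.action(line, match)
            if result is not None:
                alerts.append(result)
            break  # first matching rule wins; later rules are skipped
    return alerts


# Made-up rules: flag disk errors, ignore routine cron chatter.
RULES = [
    Rule("disk-error", re.compile(r"I/O error on (\w+)"),
         lambda line, m: f"ALERT disk {m.group(1)}"),
    Rule("ignore-cron", re.compile(r"CRON"),
         lambda line, m: None),
]

# Made-up log lines, not taken from any real cluster.
SAMPLE_LINES = [
    "node07 kernel: I/O error on sda",
    "node03 CRON[411]: session opened",
    "node12 sshd: Accepted publickey for root",
]

alerts = scan(SAMPLE_LINES, RULES)
print(alerts)
```

On a large cluster the interesting part is the rule set rather than the loop: most lines are noise to be discarded, and only a handful should ever reach an administrator.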
Unfortunately, I didn't have time to spend two or three days at the conference, but I did learn a lot about clusters, performance, and availability in the two talks I heard. In fact, Chokchai caused me to do a double-take the first time he mentioned checkpoint/restart, a feature they want to implement in HA-Oscar. It's a term I hadn't heard since my jobs ran on big iron and REFBACKs in my JCL were the bane of operations.
If you're into Linux clusters, this is a show you'll want to attend should you ever get the chance.