Are Your Linux Skills Right for HPC Jobs?

Do you have what it takes for that Linux job with an HPC vendor you’ve got your eye on? Brent Welch, the director of software architecture at Panasas, talks about the role Linux plays in HPC at Panasas and the in-demand technical skills supercomputing suppliers need from job applicants.

Last year, Panasas, a provider of high performance parallel storage solutions for technical applications and big data workloads, moved into new corporate headquarters in Sunnyvale, California, and expanded its team by more than 50 percent in areas such as engineering and sales. Panasas hasn’t been the only supercomputing-focused company growing and hiring recently. In fact, high performance computing (HPC) vendors across the industry are hiring, but they are running up against a shortage of skilled talent.

At the end of 2011, Dan Lyons looked at this disparity between demand and supply in a Daily Beast article called The U.S. Is Busy Building Supercomputers, but Needs Someone to Run Them. “Scientists refer to the talent shortage as the ‘missing middle,’ meaning there are enough specialists to run the handful of world-beating supercomputers that cost a few hundred million dollars, and plenty of people who can manage ordinary personal computers and server computer—but there are not nearly enough people who know how to use the small and mid-sized high-performance machines that cost anywhere from $1 million to $10 million,” Lyons wrote.

He examined efforts to prepare students for the in-demand jobs in HPC, including the Virtual School of Computational Science and Engineering. Thom Dunning, director of the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, told Lyons that 1,000 students participated in the virtual school in 2011, compared to 40 students in 2008, when the program began.

In her 2011 Recap blog post, Faye Pairman, Panasas President and CEO, said that the hiring momentum would continue in 2012. Recently, I contacted Brent Welch, the director of software architecture at Panasas, to discuss what his company is working on and the technical skills employers like Panasas need now. Welch has a long history in I.T., including positions at Xerox-PARC and Sun Microsystems Laboratories. While working on his Ph.D. at UC Berkeley, he designed and built the Sprite distributed file system. Welch is also the creator of the TclHttpd web server and the exmh email user interface, and the author of Practical Programming in Tcl and Tk.

Linux at Panasas

Welch says that Panasas is primarily a product for Linux HPC environments. “We have a DirectFlow kernel file system driver for Linux that knows how to speak to our PanFS parallel file system,” he explains, adding, “I like to say, ‘We support a lot of operating systems, as long as they are Linux.’”

Panasas builds and certifies their kernel module for more than 300 variants of Linux, mostly Red Hat and SUSE. Welch says the company is starting to support Ubuntu and Debian, too. “We depend on a packaging system that has good versioning support so we know that our sophisticated network file system driver will work properly in a particular Linux kernel. We never require any kernel patches,” he says.

Panasas also exports their file system via NFS and CIFS for “legacy” file system support, but Welch says the performance and fault tolerance advantages of their product are best realized in the Linux environment. “Our customers include enterprise HPC customers in a wide range of commercial applications, from manufacturing to finance. Anyone that has a non-trivial Linux compute cluster is an ideal Panasas customer,” he explains.

Panasas has been working with the IETF standards group and the Linux open source community to develop the pNFS (parallel NFS) standard as part of NFS v4.1, and is putting a lot of effort into a standard Linux client that can replace their proprietary DirectFlow client and still interoperate with the Panasas file system as well as competitors’ offerings. “Currently our DirectFlow client provides a great advantage over traditional NFS implementations, but even with a standard parallel file system client, it will be our back-end that provides value to our customers,” Welch says.

When it comes to big data, Panasas supports single namespace file systems measured in petabytes, which is larger than most definitions of big data. “Petabytes of storage simultaneously accessed by thousands of Linux hosts working in concert on a shared application,” Welch explains. “That’s big data, and that has been our target environment for the life of the company,” he adds. Panasas has been supporting production customers since 2004.

In-demand Skills

So what skills does an HPC company like Panasas need? Great programmers.

“You’d be surprised at the number of folks that are in the job market that can barely program themselves out of a paper bag,” Welch responds. “We need folks that understand threads, concurrency, and distributed systems. Folks that are fearless and ready to dive into a large, sophisticated code base. We have all sorts of interesting problems to solve, and need smart people capable of diving in and making a difference,” he says.

According to Welch, Linux kernel experience is a plus because it tends to expose some of the harder problems. “Distributed systems – network protocol design and failover systems are some of the hardest problems out there,” Welch says. “Our product composes storage, which is mission critical and has a zero tolerance for serious bugs, and distributed systems, which makes everything harder. We have lots of sophisticated infrastructure, and have a very demanding product. We need help from folks with experience building real systems,” he explains. Welch says the company screens job candidates on phone calls and uses programming exercises to make sure applicants “aren’t blowing smoke.”

Welch sees computing heading in two directions: the fun personal device, such as the front-end phone or tablet, and the ultra-large back-end “cloud” that composes massive numbers of computing systems into something extremely capable.

“Yesterday’s fringe supercomputers are evolving into mainstream HPC deployments,” he notes, adding, “Even so, I think HPC systems are still in their infancy. I often think about the global phone system that connects millions of devices with a non-stop service. There are all sorts of lessons in reliability – and billing – that need to be relearned by the rest of us. I always say that if you have a reliable and scalable architecture, then you can get whatever performance you need by building a larger system. But I think you have to start with the built-in reliability and mechanisms for fault detection and automatic system recovery.”

He says that if you get fault tolerance right, you can build a large, powerful system, but if you start with performance and skimp on the hard problems of fault tolerance, you’ll never have a system that really works for your customer.

“HPC systems have been transitioning from the labs that put up with all sorts of cruft, into enterprise applications that demand non-stop, mission-critical performance from very large collections of computing, storage, and networking resources,” Welch explains. “The most massive applications in Google and Amazon and Microsoft are all hand-rolled. There is a second tier – everyone else – that just want to get the job done, and they don’t have the time to hand-roll their solution.” He says that this is the focus for Panasas. “We solve the HPC file system problem for the large scale. That’s what we do, and I think we do a great job.”

Panasas will be attending SCALE10x this month in Los Angeles. Stop by their booth to see sample storage and director blades.