Shared vs. distributed memory in large Linux clusters

With Linux clusters becoming more commonplace in business IT settings, it seemed helpful to learn concepts that keep popping up in discussions of using clusters to solve problems. One such concept is the difference between shared and distributed memory.

The two architectures are suited to solving different kinds of problems, according to Bob Ciotti, Terascale Project Lead at NASA Ames Research Center, and Jason Pettit, Altix product manager at SGI. Understanding the problem at hand can help IT staff make decisions about the kind of computing resources needed to attack advanced problems efficiently.

A shared memory machine has a single node, system disk, and network connection. It may use 128 or 256 or more processors, but it looks to the end user like a single Linux machine — albeit a monstrously powerful one — that exhibits high throughput.

Modeling the commodity cluster

Distributed memory describes the model of the commodity cluster, which has a large number of nodes, each with its own processor(s), system disk, and networking. A cluster has as many system images as it does nodes and looks to the user like a number of independent computers on a network.

Shared memory machines are best suited to what Pettit calls ‘fine-grained’ parallel computing jobs, where all of the pieces of the problem are dependent on the results of the other processes. An example is weather simulation, in which a system is broken up into cells. In each time slice, the pressure, temperature, and other parameters of each cell are computed based on the properties of the other cells — one stuck or slow cell slows the whole process.

Distributed memory machines are best suited to “coarse-grained” problems, where each node can compute its piece of the problem with less-frequent communication with adjacent nodes. Mining large databases is an example of a problem that can be adapted to a coarse-grained approach. Each machine is given a data set to search and returns results as quickly as it can in the form of a message.

Fine-grained problems tend to be bound by the need for nodes to communicate, rather than processing power, since each node is dependent on the work being done by all the other nodes. Coarse-grained problems are bound more by the processing power of each node, since communication, in the form of message passing, is less frequent.

Some industrial-strength examples

An example of a distributed memory architecture is the $5.2 million Terascale Project cluster at Virginia Tech. 1,100 dual-processor Power Macintosh G5 computers are networked using both InfiniBand and a secondary Gigabit Ethernet network. Each machine has two 64-bit IBM PPC970 processors running at 2 GHz, 4 GB of RAM and a 160 GB hard drive and runs BSD-based Mac OS X.

NASA’s 256-processor SGI Altix 3000 is an example of shared memory architecture and is the machine that Ciotti oversees at NASA Ames Research Center. The Altix, recently upgraded to 64-bit Intel 1.5 GHz Itanium II processors, has 2 terabytes of RAM and runs a version of Red Hat Linux using kernel 2.4.21 with high-performance computing patches.

The Altix is being used by ECCO program oceanographers to model the behavior of earth’s oceans in near-real time. The Altix crunches large sets of data from satellites and other instrumentation to compute current, wind, and other simulations. ECCO is a consortium of scientists from JPL, MIT, and the Scripps Institute of Oceanography.

JPL’s Ichiro Fukumori explains why Altix simulations are important: “ECCO’s analysis will provide a basis for understanding mechanisms of ocean circulation … such understanding is fundamental to improving models to forecast weather and climate.”

Each 64-bit processor sees the machine’s RAM as a single contiguous memory space and can load data from any address it needs directly. The Altix achieves this by using a modular approach: The machine consists of a backplane and “bricks.” Besides a single brick that holds the system disk, DVD drive, etc., each brick contains either four processors, two memory controllers and part of the RAM or a router that interconnects the processor bricks. The system uses SGI’s proprietary NUMALink — for Nearly Uniform Memory Access — architecture to achieve memory access latencies as low as 5 nanoseconds, the time it takes for light to go 5 feet.

One advantage of this approach is that oceanographers, physicists, and other scientists don’t have to be computer science experts: the machine architecture takes care of parallelizing the problem. Pettit says this suits scientists on the leading edge who are trying to perfect new methods. They can concentrate on models or algorithms without learning how to break up their problems to run efficiently on a cluster of machines.

Breaking up problems to solve them

And that’s a drawback of the commodity cluster. To run efficiently, problems have to be broken up so that they will run efficiently in the RAM, disk, networking, and other resources on each node. If nodes have a gigabyte of RAM but the problem’s data sets don’t easily break into pieces that run in a gig, the problem could run inefficiently.

Another issue with distributed memory clusters is message passing. Since each node can only access its own memory space, there needs to be a way for nodes to communicate with each other. Beowulf clusters use MPI — Message Passing Interface — to define how nodes communicate. An issue with MPI, however, is that there are two copies of data — one is on the node, and the other has been sent to a central server or other node. The cluster needs a way to be sure that the data each node is using is the latest.

Still, commodity clusters are cheaper, sometimes orders of magnitude cheaper, than single-node supercomputers. Advances in networking technologies and software can help level the field between massively parallel clusters and purpose-built supercomputers. The University of Kansas at Wichita uses both a commodity cluster and an Altix, and uses custom queueing software to route the right sort of problem to the right computer.

Shared versus distributed? It all depends on the problem you’re trying to solve.

Chris Gulker, a Silicon Valley-based freelance technology writer, has authored more than 130 articles and columns since 1998. He shares an office with 7 computers that mostly work, an Australian Shepherd, and a small gray cat with an attitude.

RELATED ARTICLESMORE FROM AUTHOR

How to Deploy Lightweight Language Models on Embedded Linux with LiteLLM

Automating Compliance Management with UTMStack’s Open Source SIEM & XDR

Using OpenTelemetry and the OTel Collector for Logs, Metrics, and Traces

Xen 4.19 is released

Advancing Xen on RISC-V: key updates

RELATED ARTICLES MORE FROM AUTHOR