The Cassandra database is designed so that large clusters of systems can hold massive amounts of data. So why is a University of Dundee lecturer running it on the tiny $25 ARM-based Raspberry Pi?
At first glance, you may think that the Linux-running Pi, with its 700MHz ARM processor, 512MB of memory and SD card boot, is too anaemic to usefully run the big-data-oriented, Java-based database Cassandra. Cassandra is an Apache project originally contributed to the foundation by Facebook, and it is actively used by organizations including Netflix, eBay, Twitter and CERN to process large amounts of data using powerful servers in multiple data centers. It uses clusters of connected servers loaded with disk and RAM to store that data and spreads the load over the cluster. Those clusters can also be connected over more constrained links to provide an internationally reliable and resilient database.
The challenge for Andy Cobley, who has been working out how to run Cassandra on multiple Raspberry Pis, is to make it possible for students to experience the database running on multiple Ethernet-connected computers without building data centers and server racks.
He happily notes that Cassandra is built to write data to disk quickly, and that while a typical laptop can manage 12,000 write operations, a single Pi writing to the SD card can manage only 200 in the same amount of time. Adding an external USB drive actually slows it down: the Pi's Ethernet port shares the same bus as the external USB port and the SD card and, as Cassandra is very network-centric, any disk performance improvement is overwhelmed by reduced network performance. That in itself is a useful lesson about how performance can be affected by the routes data takes through a system.
Cassandra on Pi
Cobley uses four or eight Raspberry Pis attached to an Ethernet switch and powered from one or two USB hubs. Each of the Pis runs the Debian Linux variant Raspbian and, although Cobley says this couldn't run the then-current Oracle JDK, it could run Cassandra using OpenJDK. This was just one of the complications of getting Cassandra to run, though working through some of them resulted in bug fixes for Cassandra, such as making the startup script resilient to being told there are no CPU cores in the system.
Cassandra uses compression to boost performance, so another complication was having to avoid data compression schemes that use native methods, like Google's Snappy compressor, which was Cassandra's default. Instead, the slower, Java-based Deflate compression was used, with a penalty in write performance. One Raspberry Pi-specific optimization was to ensure as much memory as possible was available to the CPU; the Pi evenly splits memory between the CPU and GPU, but using the Pi's raspi-config tool you can change that balance, and the more memory the CPU has, the better Cassandra can run.
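To get a feel for what Deflate does to repetitive data, here is a minimal Python sketch. It is an illustration of the algorithm only, not Cassandra's own compressor: it uses Python's standard zlib module as a stand-in, and the sample payload and the deflate_round_trip helper are hypothetical names made up for this example.

```python
import zlib
from typing import Tuple


def deflate_round_trip(payload: bytes, level: int = 6) -> Tuple[bytes, float]:
    """Compress payload with Deflate (via zlib) and return (compressed, ratio).

    The ratio is compressed size divided by original size, so smaller is better.
    """
    if not payload:
        raise ValueError("payload must not be empty")
    compressed = zlib.compress(payload, level)
    # Check that decompression restores the original bytes exactly.
    if zlib.decompress(compressed) != payload:
        raise ValueError("round trip failed")
    return compressed, len(compressed) / len(payload)


# Hypothetical row data: highly repetitive text, which Deflate handles well.
sample = b"sensor=pi-node-1;reading=21.5;" * 200
compressed, ratio = deflate_round_trip(sample)
print(f"{len(sample)} bytes -> {len(compressed)} bytes (ratio {ratio:.3f})")
```

Raising the level argument asks zlib to work harder for a smaller result, which is the same kind of CPU-versus-size trade-off that costs the Pi write performance.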
In a cluster with three or four nodes, Cassandra 1.1 on the Raspberry Pi manages around 700 writes, still nowhere near the capability of a single laptop, but the objective of the exercise is not to create a production Cassandra network. Part of the idea for the students is that scaling down the platform without rewriting it, something that the combination of Linux and Java on the Raspberry Pi makes easy, makes the problems of scaling up much easier to examine, in the same way a scale model of a bridge lets architects see physical stress in action.
Students are introduced to the ideas around Cassandra, given the Raspberry Pis and software and asked to go and build their own clusters. For around £200, a student can build their own eight-node Cassandra cluster, learn how to administer it and use it as a testbed. "Most of them move on to Cassandra on other platforms, either their own machines, Amazon servers or currently Azure servers," said Cobley. The skills they acquire using Linux, Java and Cassandra on the testbed scale up to other cloud platforms.
Future plans for the Cassandra on Pi project include getting Cassandra 2.0 running and using deliberately crippled hubs between two clusters to mimic the distance and potential unreliability of a transnational link. That setup would support experiments in replication as part of the MSc in Data Science postgraduate course at the University of Dundee. With novel tools like Cassandra on Raspberry Pi/Linux clusters, the students are sure to gain valuable skills and insights into one of the fastest growing areas of information technology.
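The write figures quoted in this article can be put side by side with a little arithmetic. The Python sketch below is a rough comparison, assuming that the laptop, single-Pi and cluster numbers were all measured over the same time window, which the article does not state; the ratio and per_node helpers are hypothetical names for this example.

```python
# Figures quoted in the article. The time window is assumed to be the
# same for all three measurements; the article does not give it.
LAPTOP_WRITES = 12_000
SINGLE_PI_WRITES = 200
PI_CLUSTER_WRITES = 700


def ratio(faster: float, slower: float) -> float:
    """Return how many times larger the first figure is than the second."""
    if slower <= 0:
        raise ValueError("slower must be positive")
    return faster / slower


def per_node(cluster_writes: float, nodes: int) -> float:
    """Return the average number of writes handled by each node."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return cluster_writes / nodes


print("laptop vs single Pi: ", ratio(LAPTOP_WRITES, SINGLE_PI_WRITES))
print("laptop vs Pi cluster:", round(ratio(LAPTOP_WRITES, PI_CLUSTER_WRITES), 1))
print("cluster vs single Pi:", ratio(PI_CLUSTER_WRITES, SINGLE_PI_WRITES))
for nodes in (3, 4):
    print(f"average per node with {nodes} nodes:",
          round(per_node(PI_CLUSTER_WRITES, nodes), 1))
```

On these numbers the cluster does better than a lone Pi but stays well short of the laptop, which fits the point that the exercise is about learning rather than production throughput.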