October 19, 2015

Exclusive Interview: Max Ogden of HyperOS

HyperOS is a nifty solution for those who want to run their own containerized environment on desktops or laptops for development purposes. HyperOS supports Linux and Mac, with Windows support coming soon, and is intended to be used primarily as an end-user CLI tool on workstations. We reached out to Max Ogden, who leads the development team.

Can you tell us about yourself? What do you do, and where do you live?

I'm a computer programmer from Portland, OR. Since 2014, I've led a small grant-funded team called Dat to build tools to bring more reproducibility to scientific data sharing and analysis. Our work is 100% open source and funded entirely through grants, and we are housed inside a not-for-profit organization called US Open Data.

One way to think about Dat is as an independent software research + development team. I'd say our main focus is to try and introduce ideas from the open source software world into the world of scientific computing + data, which doesn't prioritize and/or fund very many general purpose software tools. Our current major funder, the Alfred P. Sloan Foundation, has been doing a lot of grant funding in this area recently to try and get universities to start investing in data science tools and infrastructure since it impacts so many academic disciplines now.

How are you associated with the Node.js project?

I'm not involved with the Node.js core project directly but am active in the community. Back in 2013, I helped start NodeSchool, which has grown into an amazing community of over 130 chapters globally with thousands of people learning and teaching Node.js with open source curriculum.

What other open source projects are you involved with?

I use Node.js for all sorts of weird things, from distributed systems to 3D graphics, and have published nearly 300 modules to npm over the last four years.

Let's talk a bit about HyperOS. What is it?

HyperOS is a distribution of TinyCore Linux that we created to support our use case of running containers on top of a version-controlled distributed filesystem. We found existing tools like Docker great at deploying your code to a server, but very difficult to use if you wanted someone to run your container on their laptop.

HyperOS came out of the Dat project, which is a dataset version control tool. Our goal with Dat is to make it easy to share datasets, big or small, over a network. We realized it would be really powerful to be able to version control the machine environment (a Linux VM) and the dataset (raw data files like CSVs, etc.) together in the same repository.

A huge problem with scientific software, and to a lesser extent open source software in general, is getting someone else’s code to run on your machine. Sometimes it's easy, but sometimes it takes hours or days of debugging someone's Makefile to make it work on your system. There's a saying in science, "works on my machine," which is the typical dismissive answer you receive upon asking someone why the code to reproduce the results in their scientific paper didn't work. We're trying to address this problem, and HyperOS is one of our ideas about how to do so.

Who are your target users?

Our main audience is scientists, because we are currently paid to write software for scientists. However, we are making sure none of our tools are specific to scientific use cases. We would love to get feedback on our tools from anyone who is interested in using containers to do software dependency management (fulfilling a need similar to apt-get or Homebrew).

Can you give us some use cases where HyperOS makes sense?

Say you have a 2GB CSV file and a Python script that imports it into a PostgreSQL database, runs a query, and generates a PNG chart with gnuplot. You now want to package this up in a container and have a colleague reproduce your results on their laptop. You're both running Mac OS X.

Option A is to tell them to install PostgreSQL, Python and gnuplot manually, download the CSV file, and run your Python script. This might not sound that hard to some, but there are so many different variables in play that could cause the entire process to fail. You might be using a different version of Python, PostgreSQL, or gnuplot. The CSV URL might return a 404. Your operating system might not have a distribution of one of the software dependencies available.
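The Option A workflow can be sketched in a few lines. This is a hedged, self-contained stand-in: sqlite3 replaces PostgreSQL, a tiny inline string replaces the 2GB CSV, and the gnuplot step is left as a comment, since none of those ship with Python.

```python
# Sketch of the manual (Option A) workflow: import a CSV into a database,
# run a query, then hand the result to a plotting tool.
# sqlite3 stands in for PostgreSQL so the sketch stays self-contained.
import csv
import io
import sqlite3

csv_text = "year,count\n2013,10\n2014,25\n2015,40\n"  # stand-in for the 2GB CSV

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (year INTEGER, count INTEGER)")
reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
db.executemany("INSERT INTO results VALUES (?, ?)", reader)

rows = db.execute("SELECT year, count FROM results ORDER BY year").fetchall()
for year, count in rows:
    print(year, count)
# A real pipeline would now write rows to a data file and invoke gnuplot,
# e.g. subprocess.run(["gnuplot", "plot.gp"]), to render the PNG chart.
```

Every line of this depends on your colleague's machine having compatible versions of each tool installed, which is exactly the fragility described above.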

Option B is to use Docker. You could install the Docker Toolbox on your machine (currently ~175MB), which includes VirtualBox. Then you could open the Docker Quickstart Terminal app it installs, create a Dockerfile, and finally build and publish a Docker image to the Docker Hub. Your colleague would also have to install the Docker Toolbox, open the Docker Quickstart Terminal, and pull your image. Once the entire image has downloaded, they can run your container in the terminal.

To get the 2GB CSV, you have two choices. You could put a script in the container that uses curl to download it when the container runs, which means hoping the URL where the CSV lives never 404s (a common occurrence in science). Or you could include the 2GB file inside the built Docker image, which means rebuilding the image every time the CSV changes.

We think this kind of flow involves too many complicated steps for scientists. For example, scientists have long favored flat file formats over complicated databases, even if the databases are more powerful, because at the end of the day they only care about the science -- the code itself is just a means to that end and they aren't willing to invest in what might turn out to be technical debt.

Most people use Docker for secure cloud deployments, and it's great for that. We think containers could also be very useful for local software dependency management.

Option C is to use HyperOS. It downloads 14MB and runs a Linux VM; the whole process usually takes less than a minute. Then it can do what we call "live booting," which is where we mount a lazy virtual filesystem and spawn a container on top of it. The filesystem is managed on your host OS by a tool we wrote called HyperFS, which you can think of as a version-controlled distributed filesystem. The main defining feature of HyperOS is that its actual filesystem is immutable. This is partly thanks to the way TinyCore Linux works, but also because we run 100% of user code inside HyperOS on top of virtual filesystems that are persisted into volumes on the host OS.
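The immutable-base idea can be illustrated with a minimal copy-on-write overlay sketch. This is a conceptual model only (dicts stand in for real filesystems, and OverlayFS is a hypothetical name, not HyperFS's actual API): writes land in an upper layer that can be persisted as a host volume while the base image stays pristine.

```python
# Minimal copy-on-write overlay: user code never mutates the base image;
# its writes go to an upper layer that can be persisted on the host OS.
class OverlayFS:
    def __init__(self, base):
        self.base = dict(base)   # immutable base image (never written to)
        self.upper = {}          # writable layer, persisted as a host volume

    def read(self, path):
        # The upper layer shadows the base, like a union mount.
        return self.upper.get(path, self.base.get(path))

    def write(self, path, data):
        self.upper[path] = data  # copy-on-write: base stays untouched

fs = OverlayFS({"/etc/hostname": b"hyperos\n"})
fs.write("/etc/hostname", b"mybox\n")
print(fs.read("/etc/hostname"))           # the overlaid value
print(fs.base["/etc/hostname"])           # the base image is unchanged
```

Because the base never changes, it can be shared, cached, and verified by hash, which is what makes the version-control features below possible.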

We don't have to download the entire filesystem in order to live boot it; we just have to get the filesystem metadata (the filenames, lengths, and permissions). When Linux needs to read a file like /bin/bash, we fetch it on demand from the remote data source (which could be a single place like the Docker Hub, or a P2P network like BitTorrent). This means you only download the data you actually use in the container: instead of downloading 600MB to run a shell script, we can live boot a container to a bash prompt with only 50MB.
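A minimal sketch of that fetch-on-demand idea, with a plain dict standing in for the remote data source and invented file names and contents for illustration:

```python
# Lazy "live boot": file names and lengths are known up front (metadata),
# but a file's bytes are fetched from the remote store only on first read.
REMOTE = {"/bin/sh": b"#!/bin/sh fake binary", "/etc/motd": b"welcome\n"}  # stand-in remote

class LazyFS:
    def __init__(self, metadata):
        self.metadata = metadata          # path -> length, downloaded at boot
        self.cache = {}                   # contents fetched so far
        self.bytes_fetched = 0            # network traffic counter

    def read(self, path):
        if path not in self.cache:        # fetch on demand, exactly once
            self.cache[path] = REMOTE[path]
            self.bytes_fetched += len(self.cache[path])
        return self.cache[path]

fs = LazyFS({p: len(b) for p, b in REMOTE.items()})
fs.read("/etc/motd")                      # only this file crosses the wire
print(fs.bytes_fetched)                   # far less than the full image size
```

The container sees a complete filesystem, but only the 8 bytes of /etc/motd were actually transferred; /bin/sh stays remote until something reads it.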

Another huge difference between Docker and HyperOS is that we can do version control for containers. Since everything runs on top of our version-controlled filesystem, it lets us explore exciting new possibilities for containers such as forking someone’s container, installing or modifying some software and sending them a diff.

HyperOS seems to use a different way to install Linux on Mac OS X. Can you explain the “npm install linux” project?

We are using the new Hypervisor.framework that Apple released as part of OS X Yosemite. It is an operating system level hypervisor that is built into Mac OS. We're using the xhyve project to interface with it, which is a C port of the bhyve hypervisor API from FreeBSD.

What's the reason behind using npm for it? Any clear advantages (any obvious ones against using a virtualized environment)?

We use npm to install our command-line tools and to download the 14MB HyperOS distribution. We find npm to be an excellent choice for writing and distributing command-line tools.

Can you talk a bit about the install process? How does it work?

When you run “npm install linux -g”, it downloads HyperCore, which is our custom TinyCore build. It includes the base TinyCore rootfs plus OpenSSH, OpenSSL, and hyperfused, our virtual filesystem mounting utility. It also downloads the TinyCore vmlinuz64 64-bit Linux kernel binary.

We include a 250KB xhyve binary, compiled for Mac OS 64-bit. The rest of the “linux” package is our command-line interface that lets you spawn hypercore + xhyve and execute commands inside the VM over SSH.

I am curious why you targeted OS X (although Windows support is coming). Is OS X more popular among Linux users?

We found xhyve/Hypervisor.framework to be pretty simple since it's built into the OS and has a completely programmatic API. To support Windows, we need to integrate with Hyper-V, which will involve some manual setup steps for the user and potentially some PowerShell scripting on our part. We could also go the route of bundling VirtualBox, which is what Docker does, but we really want to try and keep the dependencies as simple as possible.

What are the core components of the project?

We took the Merkle Directed Acyclic Graph design used by Git and built a distributed filesystem on top of it. This lets us have version control capabilities on the filesystem. Since Linux containers are just filesystems (because everything in Linux is a file), we can replicate and version control containers. The last piece is a way to execute the containers, which requires running Linux. The “linux” module on npm is our way to do that on Mac OS and eventually Windows using the operating system hypervisors. Linux users won't need any special hypervisor software.
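The Merkle DAG design can be sketched with content hashing alone. This is an illustrative model, not Dat's actual on-disk format: files are addressed by the hash of their bytes, directories by the hash of their sorted (name, child-hash) entries, so identical subtrees share one hash (and one copy in storage) and comparing two container versions reduces to comparing hashes.

```python
# Toy Merkle DAG over a filesystem, in the style of Git's blob/tree objects.
import hashlib

def blob_hash(data: bytes) -> str:
    """A file is addressed by the hash of its contents."""
    return hashlib.sha256(data).hexdigest()

def tree_hash(entries: dict) -> str:
    """A directory is addressed by the hash of its sorted child entries."""
    encoded = "\n".join(f"{name} {h}" for name, h in sorted(entries.items()))
    return hashlib.sha256(encoded.encode()).hexdigest()

bin_dir = tree_hash({"bash": blob_hash(b"fake elf bytes")})

v1 = tree_hash({"bin": bin_dir,
                "data.csv": blob_hash(b"year,count\n2015,40\n")})
v2 = tree_hash({"bin": bin_dir,
                "data.csv": blob_hash(b"year,count\n2015,41\n")})

# /bin is byte-identical in both versions, so its subtree hash (and its
# storage) is shared; only the changed data.csv needs to be transferred,
# which is what makes "send a diff of a container" cheap.
print(v1 == v2)  # the root hashes differ because data.csv changed
```

Forking someone's container is then just starting a new chain of root hashes from theirs, and a diff is the set of blobs reachable from one root but not the other.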

What are your plans to integrate HyperOS with other tools?

We're in the process of integrating HyperOS with our Dat command-line tool to simplify this workflow down to a single command (e.g. "dat run"). We want to make sure HyperOS is a standalone project in the spirit of the Unix philosophy, and Dat is just one project that uses it. We also want to use as many existing container standards as we can (e.g. we are looking into being able to run Docker containers inside HyperOS).

Will there be any commercial offering based on it?

Not at this time. Luckily, we are grant funded and can hopefully continue to write open source software with future grants. If you would like to talk to us about supporting our work with grants, please contact me! It's difficult to find funding for these kinds of fundamental open source public good utilities, and we are very interested in continuing down this path.

What's in it for enterprise customers? Does it benefit them?

I think enterprise customers will still use tools like Docker to deploy their containers into production. I see HyperOS as more of a developer tool for local development.

The Dat project is in early stages. What are your long-term goals?

Dat is currently in Beta, and our vision for the 1.0 is to combine our filesystem version control with the HyperOS container runtime. We want to bring an end to the "works on my machine" excuse by providing a version control tool that can share reproducible code and data workflows between collaborators.

To get involved with the project, check out our website or the #dat channel on irc.freenode.net.
