Diffs and the Power of the Docker Layering Model
Recently I’ve been working more with the sophisticated tool that is Docker, and it hasn’t escaped me that the foundation of the DevOps world is essentially composed of layer after layer of diffs.
For those readers who aren’t hard-core hackers, a diff in back-in-the-day Unix terms simply means a difference. As a Unix utility, diff has been around since the early 1970s. The command lets you compare files or directories so that it’s easier to spot any differences between them. All modern-day Linux boffins will attest to the fact that it’s still a highly useful command, which frequently saves the day (if you’re curious, the GNU version ships as part of the GNU Diffutils package).
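To make the idea concrete, here is a minimal sketch that uses Python's standard difflib module rather than the Unix diff utility itself. The file names and contents are made up purely for illustration; the point is only that the output lists what changed between two versions, with removed lines prefixed by a minus sign and added lines by a plus sign.

```python
import difflib

# Two hypothetical versions of the same small file, as lists of lines.
old = ["apple\n", "banana\n", "cherry\n"]
new = ["apple\n", "blueberry\n", "cherry\n"]

# unified_diff yields header lines followed by the changed lines,
# each with a little unchanged context around it.
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt")
)

print("".join(diff_lines), end="")
```

Only the line that differs is flagged as removed or added; the unchanged lines around it appear as context.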
Of course, any self-respecting coder will have been using revision control software for years. There are several available, such as the superpopular Git, originally written by Linus Torvalds himself. As a coder, once you have performed a commit (or save) of your first version of a new piece of code, whether that be one or a thousand lines long, Git doesn’t need to keep a complete fresh copy of everything for each future version that you commit. Strictly speaking, each commit records a snapshot, but files that haven’t changed are reused as they are and, when Git packs a repository, similar versions of a file are stored as deltas against one another. The upshot is that what lands on disk is close to just the differences between versions.
By mostly dealing with what has changed, this process becomes uber-efficient, meaning that restoring previous versions can be done at breakneck speed and, as you’d imagine, storing even hundreds of thousands of lines of precious code in your repositories is kind to disk space.
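The general idea of keeping only what changed between two versions can be sketched in a few lines of Python. This is a toy illustration, not a description of how Git is implemented (its object store and packfiles are considerably more elaborate), and the helper names make_delta and apply_delta are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Record only what is needed to turn `old` into `new`."""
    delta = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: remember where it lives in the old version.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Only genuinely new lines are stored in the delta.
            delta.append(("add", new[j1:j2]))
        # A "delete" needs no entry: deleted lines are simply never copied.
    return delta


def apply_delta(old, delta):
    """Rebuild the newer version from the old one plus the delta."""
    result = []
    for op in delta:
        if op[0] == "copy":
            result.extend(old[op[1]:op[2]])
        else:
            result.extend(op[1])
    return result


v1 = ["line a", "line b", "line c", "line d", "line e"]
v2 = ["line a", "line b", "line X", "line d", "line e"]

delta = make_delta(v1, v2)
stored_lines = sum(len(op[1]) for op in delta if op[0] == "add")

print(apply_delta(v1, delta) == v2)
print(stored_lines)
```

The second version has five lines, but the delta only has to carry the single line that changed; everything else is a reference back to the first version.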
The Layering Model
For the uninitiated, somewhat surprisingly, Docker doesn't work too differently. Its inherent layering model affords Docker images the luxury of being lightweight and exceptionally performant and, to my mind at least, the construction of Docker images is a thing of beauty.
Once you’ve settled on a base layer (such as Debian’s), whether downloaded as is or adjusted to your liking, then with a little tweaking it’s perfectly possible to run your customized applications using a remarkably thin slice of disk space on top of that base layer.
There are no gold stars being handed out for immediately guessing how that might work.
Correct. To all intents and purposes, the intelligent Docker also uses diffs. Whenever you make a change to an already existing image, you’re effectively adding a layer which simply sits on top of any existing layers. If you’re generating too many layers to keep track of, then a simple way to reduce their number is to chain commands together.
For example, run as separate build steps, the following two commands would create two different layers because they’re two distinct adjustments to the underlying layer(s). Chained together with the two ampersands, they form a single step and so produce just one layer:
$ apt-get update && echo "Chris says hello"
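The effect of chaining on the layer count can be sketched with a toy Python model. It rests on one simplifying assumption, namely that every build step adds exactly one layer, and both the build_image helper and the "debian-base" label are hypothetical names used only for illustration.

```python
def build_image(base_layers, steps):
    """Return the list of layers produced by a build.

    Toy assumption: every step adds exactly one layer, recorded here
    simply as the command string that produced it.
    """
    layers = list(base_layers)
    for command in steps:
        layers.append(command)
    return layers


base = ["debian-base"]  # hypothetical label for the base layer

# Two separate steps: two new layers on top of the base.
separate = build_image(base, ["apt-get update", 'echo "Chris says hello"'])

# The same commands chained with && into one step: one new layer.
chained = build_image(base, ['apt-get update && echo "Chris says hello"'])

print(len(separate) - len(base))  # layers added when the steps are separate
print(len(chained) - len(base))   # layers added when the steps are chained
```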
This layering model dramatically reduces the amount of detail that Docker needs to remember, and of course by that I actually mean save to disk. When there are a few Debian containers residing on a host, Docker simply treats the base layer as a shared dependency and effectively applies the other changes, which are found within the diffs, on top of it as each container is launched. By way of an example, one base layer would serve your web, database, and SMTP servers as three distinct containers, with a few hundred megabytes of diffs in total being the only difference between them.
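The saving from a shared base layer can be sketched as a toy content-addressed store in Python. This is a deliberately simplified model rather than Docker's actual storage code: the LayerStore class, the file paths, and the file contents are all invented for illustration, and each layer is reduced to a dictionary of changed files.

```python
import hashlib
import json


class LayerStore:
    """Toy content-addressed store: identical layers are kept only once."""

    def __init__(self):
        self.layers = {}

    def add(self, changes):
        # The layer ID is derived from the layer's content, so adding the
        # same base layer again returns the existing ID instead of a copy.
        payload = json.dumps(changes, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        self.layers.setdefault(digest, changes)
        return digest

    def view(self, layer_ids):
        # Stack the diffs bottom-up; later layers override earlier ones.
        merged = {}
        for digest in layer_ids:
            merged.update(self.layers[digest])
        return merged


store = LayerStore()
base = {"/etc/os-release": "Debian", "/bin/sh": "shell binary"}

# Three hypothetical images, each one the shared base plus a small diff.
images = {}
for name, extra in [
    ("web", {"/usr/sbin/webserver": "web server binary"}),
    ("database", {"/usr/sbin/dbserver": "database binary"}),
    ("smtp", {"/usr/sbin/mailserver": "SMTP binary"}),
]:
    images[name] = [store.add(base), store.add(extra)]

print(len(store.layers))                   # distinct layers actually stored
print(sorted(store.view(images["web"])))   # files visible in the web image
```

Three images are built, yet only four layers are stored: the base once, plus one small diff per server. Each image still sees the full set of files when its layers are stacked.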
A story for another day is how copy-on-write (COW) works with Docker images, but aside from that complexity, the undeniably excellent layering model employed by Docker is remarkably simple.
The next time you approach a complex problem, I encourage you to follow the lead of the super-slick Git and lightning-fast Docker and flex your lateral thinking muscles before meekly committing to a decision.
Simplicity, after all, is key in this brave new world.
Chris Binnie is a Technical Consultant with 20 years of Linux experience and a writer for Linux Magazine and Admin Magazine. His new book Linux Server Security: Hack and Defend teaches you how to launch sophisticated attacks, make your servers invisible and crack complex passwords.