How to Set Up External Service Discovery with Docker Swarm Mode and Træfik

March 27, 2017

2090

In my previous post, I showed how to use service discovery built into Docker Swarm Mode to allow containers to find other services in a cluster of hosts. This time, I’ll show you how to allow services outside the Swarm Mode cluster to discover services running in the cluster.

It turns out this isn’t as easy as it used to be. But first, allow me to talk about why one would want external service discovery, and why this has become more difficult to achieve.

Why External Service Discovery?

For most of us, we are not running 100 percent of our applications and services in containers. Maybe some are, but they’re running them in two or more Swarm Mode clusters. Perhaps a large group are constantly deploying and working on containers. In these situations, it can become trying to have to update configuration files or DNS entries every time a service is published or changes location.

What changed?

Those of us who use Docker heavily are familiar with Docker’s “move fast and break things” philosophy. While the “break” part happens less frequently than in Docker’s early days, rollouts of significant new features such as Swarm Mode can be accompanied by a requirement to retool how one uses Docker. With earlier versions of Docker, my company used a mixture of HashiCorp’s Consul as a key/value pair, along with Gilder Labs’ Registrator to detect and publish container-based service information into Consul. With this setup, Consul provided us DNS-based service discovery – both within and external to the Swarm (note: Swarm, not Swarm Mode) cluster.

While Docker 1.12 brought Swarm Mode and extreme ease of use to building a cluster of Docker hosts, Swarm Mode architecture is not really compatible with Registrator. There are some workarounds to get Registrator working on Swarm Mode, and after a good amount of experimentation I felt the effort didn’t justify the result.

Taking a step back, what’s wanted out of external service discovery? Basically, the ability to allow an application or person to easily and reliably access a published service, even as the service moves from host to host, or cluster to cluster (or across geographic areas, but we’ll cover that in a later post). The question I asked myself was “how can I combine Swarm Mode’s built-in service discovery with something else so I could perform name-based discovery outside the cluster?” One answer to this question would be to use a proxy that can do name-based HTTP routing, such as Træfik.

Using Træfik

For this tutorial, we’ll build up a swarm cluster using the same Vagrant setup from my previous post. I’ve added a new branch with some more exposed TCP ports for this post. To grab a copy, switch to the proper branch and start the cluster, follow below:

$ git clone https://github.com/jlk/docker-swarm-mode-vagrant.git

Cloning into 'docker-swarm-mode-vagrant'...

remote: Counting objects: 23, done.

remote: Total 23 (delta 0), reused 0 (delta 0), pack-reused 23

Unpacking objects: 100% (23/23), done.

$ cd docker-swarm-mode-vagrant/

$ git checkout -b traefik_proxy origin/traefik_proxy

$ vagrant up

If this is the first time you’re starting this cluster, this takes about 5 minutes to update and install packages as needed.

Next, let’s fire up a few WordPress containers – again, similar to the last post, but this time we’re going to launch two individual WordPress containers for different websites. While they both use the same database, you’ll notice in the docker-compose.yml file I specify different table prefixes for each site. Also in the YML you’ll see a definition for a Træfik container, and a Træfik network that’s shared with the two WordPress containers.

Let’s connect into the master node, check out the code from GitHub, switch to the appropriate branch, and then start the stack up:

$ vagrant ssh node-1

$ git clone http://github.com/jlk/traefiked-wordpress.git

$ cd traefiked-wordpress

$ docker stack deploy --compose-file docker-compose.yml  traefiked-wordpress

Finally, as this example has Træfik using hostname-based routing, you will need to create a mapping for beluga and humpback to an IP address in your hosts file. If you’re not familiar with how to do this, Rackspace has a good page covering how to do this for various operating systems. If you’re running this example locally, 127.0.0.1 should work for the IP address.

Once that’s set up, you should be able to browse to http://beluga or http://humpback in your browser, and see two separate WordPress setup pages. Also, you can hit http://beluga:8090 (or humpback, localhost, etc) and see the dashboard for Træfik.

NDVq3Jiai4T9CSvBqx3Kyt9wOa3K_p4mC0bAxyxB

jdBff81VuyDesgkYenpyCv0VdIoDEcYtnMYT_9QY

An Added Benefit, But with a Big Caveat

One of the things which drew me to Træfik is it comes with Let’s Encrypt built in. This allows free, automatic TLS certificate generation, authorization, and renewal. So, if beluga had a public DNS record, you could hit https://beluga.test.com and after a few seconds, have a valid, signed TLS certificate on the domain. Details for setting up Let’s Encrypt in Træfik can be found here.

One important caveat that I learned the hard way: When Træfik receives a signed certificate from Let’s Encrypt, it is stored in the container. Unless specified in the Træfik configuration, this file will be stored on ephemeral storage, being destroyed when the container is recreated. If that’s the case, each time the Træfik container is re-created and a proxied TLS site is accessed, it will send a new certificate signing request to Let’s Encrypt, and receive a newly signed certificate. If this happens often enough within a small period of time, Let’s Encrypt will stop signing requests from that top-level domain for 7 days. If this happens in production, you will be left scrambling. The important line you need to have in your traefik.toml is…

       storage = "/etc/traefik/acme.json"

…and then make sure /etc/traefik is a volume you mount in the container.

Now we understand external, DNS-based service discovery for Swarm Mode. In the final part of this series, we’ll add high availability and failover to this mixture.

Learn more about container networking at Open Networking Summit 2017. Linux.com readers can register now with code LINUXRD5 for 5% off the attendee registration.

John Kinsella has long been active in open source projects – first using Linux in 1992, recently as a member of the PMC and security team for Apache CloudStack, and now active in the container community. He enjoys mentoring and advising people in the information security and startup communities. At the beginning of 2016 he co-founded Layered Insight, a container security startup based in Silicon Valley where he is the CTO. His nearly 20-year professional background includes datacenter, security and network operations, software development, and consulting.

Why External Service Discovery?

What changed?

Using Træfik

An Added Benefit, But with a Big Caveat

RELATED ARTICLESMORE FROM AUTHOR

How to Deploy Lightweight Language Models on Embedded Linux with LiteLLM

Automating Compliance Management with UTMStack’s Open Source SIEM & XDR

Using OpenTelemetry and the OTel Collector for Logs, Metrics, and Traces

Xen 4.19 is released

Advancing Xen on RISC-V: key updates

RELATED ARTICLES MORE FROM AUTHOR