Running Distributed Applications at Scale on Mesos from Twitter and CloudBees


A recurring theme in our MesosCon North America 2016 series is solving difficult resource provisioning problems. The days of investing days or even weeks in spec’ing, acquiring, and setting up hardware and software to meet increased workloads are long gone. Now we see vast provisioning adjustments taking place in seconds.

Twitter invented something they call “magical operability sprinkles” to handle Twitter’s wildly varying workload demands, which spike from little activity to millions of tweets per minute. These magic sprinkles are built on Finagle, linkerd, and Apache Mesos, and magically provide both massive scalability and reliability.

CloudBees took on the challenge of building a giant Jenkins cluster, possibly the largest one in existence, using Docker and Mesos. They run Jenkins masters in Docker containers and spin up Jenkins slaves on demand. It is a clever structure that solves the difficult problems of scaling Jenkins and of providing isolation for multiple discrete users.

What happens when a container exceeds its memory quota.

Finagle, linkerd, and Apache Mesos: Magical Operability Sprinkles for Microservices

Oliver Gould, CTO of Buoyant

Back in the olden days of Twitter (2010), getting more hardware resources involved bribes of whiskey to the keeper of the hardware, because resources were so scarce. It was an acute problem; Twitter was growing rapidly and was not keeping up with the growth. Consequently, they suffered frequent outages, to the point that during the 2010 World Cup Twitter staff were chanting “Please no goals. Please no goals.” Giant spikes could happen at any time, so their most pressing problem was “How do you provision for these peaks in a way that doesn’t cost you way too much money, and still keep the site up when this happens?”

Oliver Gould explains Twitter’s approach to building both scalability and reliability. “This is a quote from a colleague of mine at Twitter, Marius Eriksen. ‘Resilience is an imperative. Our software runs on the truly dismal computers we call data centers. Besides being heinously complex …they are unreliable and prone to operator error.’ Think about this a second. If we could run a mainframe, one big computer, and it never would fail, whywouldn’t we use that?…We can’t do that. That’s way too expensive to do. Instead, we’ve built these massive data centers, with all commodity hardware and we expect it to fail continuously. That’s the best we can do in computing. We can build big data center computers out of crappy hardware, and we’re going to make that work.”

Another problem is slowness. “’It’s slow’ is absolutely the hardest problem you’ll ever debug. How do we think about slowness in a distributed system? Here, we have 5 services talking to 4 services, or however many. When one of these becomes slow, this isn’t proportional to the slowness downstream. This spreads like wildfire…Load balancing is probably the sharpest tool we have for this,” Gould says.

Microservices and Finagle are key to solving these problems. A microservice is not necessarily small in size, and it may require a lot of CPU or memory. Rather, it is small in scope, doing only one thing and doing it well. So, instead of writing giant complex applications, Twitter engineers can quickly write, test, and deploy microservices. Finagle is a high-concurrency framework that manages scheduling, service discovery, load balancing, and all the other tasks that are necessary to orchestrate all these microservices.

Watch the complete presentation below:

CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

Carlos Sanchez, CloudBees

The Jenkins Continuous Integration and Continuous Delivery automation server is a standard tool in shops everywhere. Jenkins is very adaptable for all kinds of workloads. For example, a software company could integrate Jenkins with Git, GitHub, and their download servers to automate building and publishing their software, its documentation, and their web site.

Scaling Jenkins, Sanchez notes, involves tradeoffs. You can use a single master with multiple build agents, or multiple masters. With a single master “The problem is that the master is still a single point of failure. There’s a limit on how many build agents can be attached. Even if you have more and more slaves, there’s going to be a point where the master is not on the scale or you’re going to have a humongous master… The other option is having more masters. The good thing is that you can have multiple organizations, departments with our own Jenkins master. They can be totally independent. The problem obviously is you need now single sign on. You need central configuration and operation. You need a view over how to operate all these Jenkins Masters that you run.”

“What we built was something that it was like the best of both worlds…We have the CloudBees Jenkins Operation Center with multiple masters, and also dynamic slave creation only to master.”

The CloudBees team built their Jenkins Operation Center with Mesosphere Marathon, and installed the Mesos cluster with Terraform. Other components are Amazon Web Services, Packer for building the machine images, OpenStack, Marathon for container management, and several more tools. They had to solve permissions management, storage management, memory management, and several other complexities. The result is a genuine Jenkins cluster for multiple independent users that scales on demand.

Watch the complete presentation below:

Mesos Large-Scale Solutions

You might enjoy these previous articles about MesosCon:

4 Unique Ways Uber, Twitter, PayPal, and Hubspot Use Apache Mesos

How Verizon Labs Built a 600 Node Bare Metal Mesos Cluster in Two Weeks

And, watch this spot for more blogs on ingenious and creative ways to hack Mesos for large-scale tasks.


MesosCon Europe 2016 offers you the chance to learn from and collaborate with the leaders, developers and users of Apache Mesos. Don’t miss your chance to attend! Register by July 15 to save $100.

Apache, Apache Mesos, and Mesos are either registered trademarks or trademarks of the Apache Software Foundation (ASF) in the United States and/or other countries. MesosCon is run in partnership with the ASF.