How Twitter Avoids the Microservice Version of “Works on My Machine”


Apache Mesos and Apache Aurora initially helped Twitter engineers to implement more sophisticated DevOps processes and streamline tooling, says software engineer David McLaughlin. But over time a whole new class of bespoke tooling emerged to manage deployment across multiple availability zones as the number of microservices grew.

“As the number of microservices grows and the dependency graph between them grows, the confidence level you achieve from unit tests and mocks alone rapidly decreases,” McLaughlin says in the interview below. “You end up in the microservice version of ‘works on my machine.’”

David McLaughlin, software engineer at Twitter, will speak at MesosCon Europe in Amsterdam Aug. 31 – Sept. 2, 2016.
McLaughlin will talk this month at MesosCon Europe about these challenges, as well as the system Twitter built to support its CI/CD pipeline and close the gaps in deploy tooling.

Here, he describes application testing and deployment in a microservices architecture; how Twitter approaches it; and what he’s learned about DevOps in the process.

What is the challenge with orchestration in a microservices architecture?

David McLaughlin: One of the biggest challenges for service owners is trying to build the confidence that code changes are going to work in production. As the number of microservices grows and the dependency graph between them grows, the confidence level you achieve from unit tests and mocks alone rapidly decreases. You end up in the microservice version of “works on my machine.”

One way we’ve built up confidence is to build pipelines where services are deployed and tested end to end against real upstream and downstream services before going to production. At a given size, you also have the issue of having multiple availability zones and finding yourself having to repeat all these testing steps for each zone. If this process involves humans in any way, that becomes a lot of time and money being spent just deploying code. This is obviously not a good position to be in when you fully embrace microservices and start to have hundreds or even thousands of services being managed.

How did your team initially try to address the challenge?

David: Mesos and Aurora make it really easy for engineers at Twitter to deploy their service to multiple environments and clusters. Aurora comes with the ability to schedule a service to only run when capacity is available, and be evicted in favor of a production service during peak loads. This allows engineers to use more resources in pre-production environments without worrying about the cost to the company – they are simply taking advantage of the extra capacity that is required for things like disaster recovery or peak events.
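To illustrate the preemption behavior described above, here is a toy sketch in Python. It is not Aurora's scheduler or its API: the `Task` and `Cluster` classes, the CPU-only capacity model, and the eviction order are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cpus: int
    production: bool  # hypothetical flag: False means the task may be evicted


@dataclass
class Cluster:
    total_cpus: int
    running: list = field(default_factory=list)

    def free_cpus(self) -> int:
        return self.total_cpus - sum(t.cpus for t in self.running)

    def schedule(self, task: Task) -> list:
        """Place `task` and return the tasks evicted to make room for it.

        Only production tasks may evict others, and only non-production
        tasks can be evicted. Raises RuntimeError if the task does not fit.
        """
        evicted = []
        if task.production:
            victims = [t for t in self.running if not t.production]
            # Evict preemptible tasks one at a time until the new task fits.
            while self.free_cpus() < task.cpus and victims:
                victim = victims.pop()
                self.running.remove(victim)
                evicted.append(victim)
        if self.free_cpus() < task.cpus:
            # Not enough room even after evictions: restore and give up.
            self.running.extend(evicted)
            raise RuntimeError(f"no capacity for {task.name}")
        self.running.append(task)
        return evicted
```

In this sketch a pre-production task runs only while spare capacity exists, and a later production task reclaims that capacity by evicting it.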

However, orchestrating the deploy pipeline across each step was still left to users. This was done through complex CI job configurations, with bespoke deploy tooling, or even worse – completely manually.

How are you handling orchestration and tooling now?

David: We’ve built a tool to handle the release of a code change from development to testing to production across multiple clusters. It is built to support (if not encourage!) automation via an API, but also supports manual orchestration via a UI or even a hybrid approach where everything except the final production push can be automated. This allows users to adopt the tooling even if their current testing practices don’t provide the confidence to fully automate the whole process.

What did you learn about DevOps in the process?

David: I think the biggest lesson I’ve taken away from working in this space is that when it comes to DevOps, often the best user experience is having no experience at all. The vast majority of developers just want to build stuff and ship it. So the focus should be on enabling that, instead of putting more and more tools in between them and the satisfaction of having their work shipped.

Can you give one example of a more sophisticated DevOps process that resulted from your experimentation with tooling?

David: Performing a rollback used to mean retracing steps to find some known-good package version or gitsha, manually injecting it into your configuration and then applying to all your different production clusters. Now it’s as simple as clicking a button (or making an API call) on a previously successful deploy. It greatly reduces the stress of backing out from problems in production.
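The release flow McLaughlin describes (promotion through successive environments, an optional manual gate before production, and rollback to a previously successful deploy) can be sketched in a few lines of Python. This is a minimal illustration, not Twitter's tool: `Stage`, `Deploy`, `release`, `rollback`, and the `verify` and `approve` callbacks are hypothetical names, and real deploys and end-to-end tests are replaced by a simple function call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Deploy:
    version: str
    succeeded: bool = False


@dataclass
class Stage:
    name: str                       # e.g. "staging" or "prod"
    verify: Callable[[str], bool]   # stand-in for end-to-end tests
    manual_gate: bool = False       # require approval before deploying
    history: list = field(default_factory=list)

    def deploy(self, version: str) -> bool:
        """Record a deploy of `version` and whether verification passed."""
        record = Deploy(version)
        self.history.append(record)
        record.succeeded = self.verify(version)
        return record.succeeded


def release(version: str, stages: list,
            approve: Callable[[Stage], bool] = lambda stage: False) -> str:
    """Promote `version` through `stages` in order, stopping at the first
    unapproved gate or failed verification."""
    for stage in stages:
        if stage.manual_gate and not approve(stage):
            return f"waiting for approval: {stage.name}"
        if not stage.deploy(version):
            return f"failed: {stage.name}"
    return "released"


def rollback(stage: Stage) -> Optional[str]:
    """Redeploy the most recent successful version that differs from the
    current one; return it, or None if there is nothing to roll back to."""
    current = stage.history[-1].version if stage.history else None
    for record in reversed(stage.history):
        if record.succeeded and record.version != current:
            stage.deploy(record.version)
            return record.version
    return None
```

Because each stage keeps its own deploy history, a rollback in this sketch needs no manual search for a known-good version: it reuses a record that already passed verification on that stage.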


Join the Apache Mesos community at MesosCon Europe on Aug. 31 – Sept. 1, 2016! Look forward to 40+ talks from users, developers and maintainers deploying Apache Mesos including Netflix, Apple, Twitter and others.

Apache, Apache Mesos, and Mesos are either registered trademarks or trademarks of the Apache Software Foundation (ASF) in the United States and/or other countries. MesosCon is run in partnership with the ASF.