As a preview to MesosCon, we spoke with Chris Pinkham, VP of Engineering at Twitter, about some of the issues involved with running “one of the largest single Mesos clusters known” and why open source technology is critical to Twitter’s success.
In his keynote presentation, “Platform Infrastructure at Twitter: The Past, Present and Future” on Thursday, June 2, Pinkham will provide an overview of the company’s current platform infrastructure, explain some of the challenges of operating at scale, and describe his team’s vision for hybrid cloud services.
Linux.com: Please tell us briefly about Twitter’s platform infrastructure. What are some of the problems you are working to solve?
Chris Pinkham: The platform infrastructure consists of the large-scale services behind the scenes that power Twitter’s websites and apps. Major components include the compute clusters, key-value, graph and object storage systems, search infrastructure, data platform, traffic management and load distribution services as well as the actual tweet services, user services, and the social graph services linking them all together.
At a high level, the biggest problems we’re working on are how to operate, efficiently and reliably, a huge messaging infrastructure connecting hundreds of millions of users globally, and at the same time give Twitter’s engineers the ability to make continuous improvements to our products without negatively impacting our users’ experience.
Can you give us an example of a major problem you had to solve when massively scaling services? How did you approach that?
We regularly have to invent our way out of large-scale distributed systems problems. There are many examples, some of which we have made available in open source (such as Mesos and Aurora in our compute infrastructure) and some we haven’t (Manhattan, our internal key-value service, and Omnisearch, a new information retrieval system).
In all cases, we are concerned with efficiency of operations in the sense that, when dealing with many thousands of machines, we are continually faced with defects. Building operational intelligence deeply into the systems allows the relatively small development teams to operate reliable services with minimal human intervention, but this only works if you assume that component failure is ongoing.
What is Twitter working on in regard to Mesos specifically?
Our immediate focus area is scalability and reliability of the compute platform. We run one of the largest single Mesos clusters known and are often the first to encounter a variety of performance issues given the scale. We are also focusing on making the compute platform more efficient.
We built a chargeback system that also leverages Mesos container stats to provide accurate utilization and cost reporting per job, project, and team. This not only drove a more than 30 percent improvement in overall resources utilized across our clusters but also provided a framework to evaluate ROI when compared to other alternatives (including public cloud services). We are now in the process of enabling resource oversubscription (revocable offers) that will open up a “spot” market where customers will have the flexibility to return resources and avoid being charged for them.
Finally, we are working to expose other compute resources such as network bandwidth and GPU for low latency, high throughput, and machine learning use-cases.
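Editor’s note: the chargeback idea described above can be illustrated with a minimal sketch. This is not Twitter’s implementation and it does not call any Mesos API. The `UsageSample` record, the `chargeback` function, and both unit rates are hypothetical, and the sketch assumes per-job reserved and used resource figures have already been collected from container statistics.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical unit prices, chosen only for illustration.
CPU_RATE_PER_CORE_HOUR = 0.25
MEM_RATE_PER_GB_HOUR = 0.125


@dataclass(frozen=True)
class UsageSample:
    """One job's resource figures for a billing period (hypothetical schema)."""
    team: str
    job: str
    cpu_reserved_core_hours: float
    cpu_used_core_hours: float
    mem_reserved_gb_hours: float


def chargeback(samples):
    """Return per-team cost (billed on reservations) and CPU utilization."""
    totals = defaultdict(lambda: {"cost": 0.0, "reserved": 0.0, "used": 0.0})
    for s in samples:
        t = totals[s.team]
        # Bill on what was reserved, so unused reservations still cost money.
        t["cost"] += (s.cpu_reserved_core_hours * CPU_RATE_PER_CORE_HOUR
                      + s.mem_reserved_gb_hours * MEM_RATE_PER_GB_HOUR)
        t["reserved"] += s.cpu_reserved_core_hours
        t["used"] += s.cpu_used_core_hours

    report = {}
    for team, t in totals.items():
        # Utilization is the fraction of reserved CPU that was actually used.
        utilization = t["used"] / t["reserved"] if t["reserved"] else 0.0
        report[team] = {"cost": t["cost"], "cpu_utilization": utilization}
    return report
```

Billing on reservations rather than on usage is one possible design choice. It makes over-provisioning visible in each team’s report, which is the kind of incentive a chargeback system is meant to create.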
What role does open source play in your platform infrastructure strategy?
Open source is critical to Twitter’s success and our ability to scale the business effectively. Starting about 15-20 years ago, the largest web companies realized that traditional vendor relationships weren’t going to help with their most important problems — the traditional vendors were too slow, opaque and expensive, and didn’t scale to support their needs well. So, we collectively set about solving our most important problems ourselves and then shared the results so that we didn’t all have to be working on everything.
Many important contributions have come from other sources, of course, but a disproportionate number of the popular scalable compute, storage, and data systems in the open source world have come from companies such as Twitter and other large web companies. We rely on open source to make progress, and we feed the system by helping out with our own contributions whenever possible, especially when we have a particularly compelling point of view.
Can you give us a quick preview of your upcoming talk at MesosCon?
I will be giving an overview of Twitter’s current platform infrastructure, pointing to some of the major open source contributions we have made and are continuously making, and discussing how we are moving towards a compute infrastructure that combines private and public services in a seamless self-service platform that engineers can use to operate their own services at developer speed, unlimited by the infrastructure.
Anything else that you’d like to share?
Check out Twitter’s engineering blog at https://blog.twitter.com/engineering to keep up to date on some of the most interesting happenings behind the scenes at Twitter.
In case you won’t be able to attend MesosCon in person, The Linux Foundation will be offering free live video streaming of all keynote sessions. You can see the full agenda of keynotes here, and sign up for the livestream now.