4 Unique Ways Uber, Twitter, PayPal, and HubSpot Use Apache Mesos

June 13, 2016

You know the saying: fast, cheap, or good, pick two. Uber, Twitter, PayPal, and HubSpot show that you can have all three with Apache Mesos.

Apache Mesos is a cluster manager; it sits between the application layer and the operating system, and deploys and manages applications in large-scale clustered environments. But this dry description doesn’t convey its vast scope for creative and ingenious solutions to large-scale problems.

- Uber uses it to run a cluster of geographically diverse datacenters.
- Twitter built a system to precisely track and cost resource usage.
- PayPal is building a self-sustaining, dynamically sized cluster, in effect using Mesos to run itself.
- HubSpot uses Mesos to dynamically manage 200 Nginx load balancers.

These are all use cases that go far beyond simple clustering. Watch videos, below, of each company’s use case, presented June 1-2 at MesosCon North America 2016. And see all 55+ recorded sessions for more from MesosCon North America.

Uber: Running Cassandra on Apache Mesos Across Multiple Datacenters

Dr. Abhishek Verma, Uber

Dr. Abhishek Verma, first author of the Google Borg paper, describes using the Apache Cassandra database and Apache Mesos to build a fluid, efficient cluster of geographically diverse datacenters. The goals of this project are five-nines reliability, low cost, and reduced hardware requirements. Mesos allows such flexible resource management that you can co-locate services on the same machine.

Mesos does not support multi-datacenter deployments, but Cassandra does. Cassandra has a mechanism called seeds. Seeds are special nodes that tell new nodes how to join the cluster. A seed node “talks” to other nodes in the Cassandra cluster to get the membership and topology information. Dr. Verma’s team created a custom seed provider, which has its own URL. New nodes access this URL and get the information they need to start up and join the cluster.
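The lookup a new node performs against such a seed URL can be sketched in a few lines of Python. This is only an illustration of the idea, not Uber's code and not Cassandra's own seed provider interface, which is Java. The JSON shape `{"seeds": [...]}`, the function names, and the limit of three contact points are all assumptions made for the example.

```python
import json
from typing import List


def parse_seed_response(body: str) -> List[str]:
    """Extract seed addresses from the body returned by the seed URL.

    The {"seeds": [...]} payload shape is an assumption for this sketch.
    """
    payload = json.loads(body)
    seen = set()
    seeds = []
    for raw in payload.get("seeds", []):
        address = raw.strip()
        # Skip blanks and duplicates while keeping the original order.
        if address and address not in seen:
            seen.add(address)
            seeds.append(address)
    return seeds


def choose_contact_points(seeds: List[str], limit: int = 3) -> List[str]:
    """Pick the first few seeds a new node should contact to join."""
    if not seeds:
        raise ValueError("no seeds available; the node cannot join yet")
    return seeds[:limit]


if __name__ == "__main__":
    # A canned response stands in for the HTTP call to the seed URL.
    body = '{"seeds": ["10.0.0.1", "10.0.0.2", "10.0.0.1", " 10.1.0.1 "]}'
    print(choose_contact_points(parse_seed_response(body)))
```

In a real deployment the response body would come from an HTTP request to the seed provider's URL; a canned string is used here so the sketch runs without a network.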

Dr. Verma explained, “We are trying to move away from our in-house custom deployment system to everything running on Mesos. All our services, all our stateful frameworks, everything is going to run on top of Mesos. We can gain these efficiencies because we co-locate services on the same machine. If you have a CPU intensive service, you can co-locate it with a storage engine which is just requiring a lot of SSD space or a lot of memory.

“This can lead to 30 percent fewer machines as we explored in the Google Borg Paper. In this, what we did is we looked at a shared cluster and we looked at the number of machines in this shared cluster. Then it partitioned the workload into production workload and batch workload. We estimated if we just had one cluster, which was serving the production workload, and one which was just serving the batch workload, how many more machines would we need? It turned out that we would need almost 30 percent more machines in the median cell. We can save all of these machines if we co-locate all of these different services on the same machines.” Watch the full talk, below.

https://www.youtube.com/watch?v=U2jFLx8NNro
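The co-location argument from the talk can be illustrated with a toy packing calculation. The machine size and task shapes below are invented for the example and have nothing to do with the measurements in the Borg paper; the first-fit placement is a simple stand-in, not how Mesos or Borg actually schedule.

```python
from typing import List, Tuple

Task = Tuple[int, int]  # (cores, memory in GB)


def machines_needed(tasks: List[Task], cores: int, mem_gb: int) -> int:
    """Place tasks with first-fit and return how many machines were used."""
    free = []  # one [free_cores, free_mem_gb] entry per machine
    for task_cores, task_mem in tasks:
        if task_cores > cores or task_mem > mem_gb:
            raise ValueError("task does not fit on an empty machine")
        for slot in free:
            if slot[0] >= task_cores and slot[1] >= task_mem:
                slot[0] -= task_cores
                slot[1] -= task_mem
                break
        else:
            # No existing machine has room, so start a new one.
            free.append([cores - task_cores, mem_gb - task_mem])
    return len(free)


if __name__ == "__main__":
    cpu_heavy = [(12, 8)] * 10   # hypothetical CPU-bound service
    mem_heavy = [(2, 48)] * 10   # hypothetical memory-bound storage engine
    separate = (machines_needed(cpu_heavy, 16, 64)
                + machines_needed(mem_heavy, 16, 64))
    shared = machines_needed(cpu_heavy + mem_heavy, 16, 64)
    print(f"separate clusters: {separate}, shared cluster: {shared}")
```

With these made-up shapes, each machine can host one task of each kind, so the shared cluster needs half as many machines as two dedicated ones. Real workloads are far less tidy, which is why the measured saving quoted above is a median figure rather than a guarantee.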

Twitter: How We Built a Chargeback System for Twitter’s Compute Platform

Micheal Benedict and Jeyappragash Jeyakeerthi, Twitter

Twitter’s compute platform manages thousands of containers across thousands of hosts. But they had no way of knowing if containers were sized properly and using hardware efficiently, couldn’t tell who owned which containers and services, and couldn’t track resource use by container. Micheal Benedict and Jeyappragash Jeyakeerthi (JJ for short) built their Chargeback system on Mesos to answer these questions. Then they make this data available to users, and put a realistic dollar cost on it. This shows users that there is a real cost, and real money that is wasted when they schedule, for example, a 24-core job to do the work of a 4-core job.

https://www.youtube.com/watch?v=QaaEjJVVd44

Benedict describes their goals as “…to capture all the different resources, every infrastructure system provided. One of the first things we want to talk about is we have three big problems over here. The first one is called resource fluidity. What this essentially means is for example, Aurora Mesos has the notion of the three basic resources. You have cores, memory and disk and users get to define that when they’re launching a job. What’s important over here is to note that there’s no notion of time. When you’re actually doing something like Chargeback, you want to charge people for the resources they use over a period of time. This doesn’t capture it. We launch a job, the job gets killed or it continues to run. One of the first things is we wanted to track the fluidity of these resources across time and we had to do that pretty much for every other infrastructure as well. We also wanted to support additional resources as they get added…”

“The next step is to actually put a unit price or a dollar next to these resources. When Aurora is offering cores as a resource to the customer, that I can acquire 1,000 cores, I need to actually put a dollar next to it so people get an understanding of what’s the impact. This dollar needs to actually be as close to the real cost as possible and this is really where we invested a lot of our time in because there’s always this notion of having funny money which really doesn’t work. People are under the assumption, ‘It’s just funny money. Why do I care?’”
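The two ideas Benedict describes, tracking resources over time and attaching a unit price, combine into a simple calculation. The sketch below is not Twitter's Chargeback system; the record fields, the prices, and the 720-hour month are hypothetical values chosen only to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class UsageRecord:
    """Resources a job reserved, plus how long it held them."""
    owner: str
    cores: float
    memory_gb: float
    disk_gb: float
    hours: float


# Hypothetical unit prices in dollars per resource-hour.
PRICES: Dict[str, float] = {"cores": 0.03, "memory_gb": 0.005, "disk_gb": 0.0001}


def cost(record: UsageRecord, prices: Dict[str, float] = PRICES) -> float:
    """Charge for reserved resources multiplied by the time they were held."""
    hourly = (record.cores * prices["cores"]
              + record.memory_gb * prices["memory_gb"]
              + record.disk_gb * prices["disk_gb"])
    return hourly * record.hours


if __name__ == "__main__":
    # Same workload, sized two ways, reserved for a 720-hour month.
    oversized = UsageRecord("team-a", cores=24, memory_gb=32, disk_gb=100, hours=720)
    right_sized = UsageRecord("team-a", cores=4, memory_gb=32, disk_gb=100, hours=720)
    wasted = cost(oversized) - cost(right_sized)
    print(f"oversized job costs ${wasted:.2f} more per month")
```

Charging on what was reserved rather than what was consumed is the design choice that makes an oversized job visible: the 24-core reservation costs more even if the job only ever keeps 4 cores busy.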

PayPal: Efficient Usage of Apache Mesos Using Container Hibernation

Kamalakannan Muralidharan, PayPal

One of the challenges of operating a Mesos cluster is efficiently allocating your resources. One approach is to schedule applications on demand: take down applications when they’re not used, and bring them back up on demand. Set a timeout, and when the application has not been used for that period of time, send it into hibernation. Then, when a user request comes in, spin it back up. Integration with a software load balancer provides the necessary capabilities, such as events based on traffic patterns and traffic hits.
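The timeout-and-wake cycle described above amounts to a small state machine. The following is a minimal sketch under stated assumptions, not PayPal's implementation: the `App` class, the function names, and the use of plain timestamps in place of load balancer events are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    last_request_at: float  # timestamp in seconds of the latest request
    running: bool = True


def check_idle(app: App, now: float, idle_timeout: float) -> str:
    """Hibernate a running app that has seen no traffic for idle_timeout."""
    if app.running and now - app.last_request_at >= idle_timeout:
        app.running = False
        return "hibernate"
    return "none"


def on_request(app: App, now: float) -> str:
    """Record traffic and wake the app if it was hibernating."""
    app.last_request_at = now
    if not app.running:
        app.running = True
        return "wake"
    return "none"


if __name__ == "__main__":
    app = App("reports", last_request_at=0.0)
    print(check_idle(app, now=600.0, idle_timeout=900.0))   # still within timeout
    print(check_idle(app, now=1000.0, idle_timeout=900.0))  # idle too long
    print(on_request(app, now=1200.0))                      # traffic arrives
```

In a real system, `on_request` would be driven by the traffic events the load balancer emits, and the "hibernate" and "wake" results would translate into requests to the scheduler to stop or start the container.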

https://www.youtube.com/watch?v=OJgmfog0e1w

In his MesosCon presentation, Kamalakannan Muralidharan discusses the flexibility and agility that is possible with Mesos: “Even though Mesos is good at handling a cluster, somebody has to provision and make it available for Mesos to use the cluster. We built a system which can automatically create a cluster and monitor the cluster, make sure that this capacity is there for applications to run. If it finds that it says less capacity, it scales up; if it finds there are not many applications running and capacity is underutilized, it will shrink the cluster. We build a complete, closed-loop system to manage Mesos and its Mesos’ resources itself. After once the resource is given to Mesos, Mesos manages it for applications and application deployments.”
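The closed loop Muralidharan describes (observe capacity, grow when short, shrink when underutilized) can be reduced to one decision function. This is a sketch only; the utilization thresholds and names are assumptions, and the talk does not specify how PayPal's system makes the decision.

```python
def scaling_decision(used_cpus: float, total_cpus: float,
                     high: float = 0.8, low: float = 0.3) -> str:
    """Decide whether the cluster should grow, shrink, or stay as it is.

    `high` and `low` are hypothetical utilization thresholds.
    """
    if total_cpus <= 0:
        # No capacity at all: the only sensible move is to add nodes.
        return "scale_up"
    utilization = used_cpus / total_cpus
    if utilization > high:
        return "scale_up"
    if utilization < low:
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    # Each pair is (CPUs in use, CPUs offered to Mesos) at one observation.
    for used, total in [(90, 100), (50, 100), (10, 100)]:
        print(used, total, scaling_decision(used, total))
```

Keeping a gap between the two thresholds is a deliberate choice: it stops the loop from adding and removing nodes on every small fluctuation in load.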

HubSpot: Migrating 200+ Load Balancers into Apache Mesos

Stephen Salinas, HubSpot

HubSpot wanted a better way to manage their 200+ Nginx load balancers. They wanted more efficient use of resources, reliability, easier management, easier monitoring so they could know exactly what was happening anywhere at any time, service discovery, failover, configuration validation, and very fast provisioning. Right, and why not some unicorns as well? But they did it with Mesos and other open source tools, including ZooKeeper and Docker.

https://www.youtube.com/watch?v=JvuJQmIVdGs

Stephen Salinas discusses some of the benefits of their Mesos-based dynamic load-balancer system: “We condensed I think it was about 225 was the highest limit of that down to less than 10 servers, saving us about 24 grand a month in server costs, which was a pretty cool thing. Then on top of that, now that this deploys through our scheduler, it deploys like any other app would, so upgrading Nginx is the same as deploying a new build of an API, so we can leverage all those same deploy processes, putting blockers on deploys, rolling things back if they go bad, and I could do an Nginx upgrade in under a half an hour instead of going into every server, take it out, upgrade it, put it back in, taking hours and hours and hours at a time, so it’s very fast and very easy to operate on this stuff.

“Then kind of the unseen benefit of this is it gave us a bit of a new view on things as to what we could and couldn’t run in Mesos. Obviously there’s challenges to each thing, but if something very, very static like a load balancer could run in Mesos, why not just everything? Why not cache servers? Why not, obviously people are looking at running SQL in Mesos. There’s lots of things out there that can really benefit from everything that Mesos has to offer as long as you do it in the right way.”

MesosCon North America

The Mesos slogan is “Program against your datacenter like it’s a single pool of resources.” The project has excellent documentation, including how to get started testing Mesos.

The call for speaking proposals is open for MesosCon Europe. The Apache Mesos community wants to hear from you! Share your knowledge, best practices and ideas with the rest of the community. Submit your proposal now. The deadline to submit proposals is June 17, 2016.

Apache, Apache Mesos, and Mesos are either registered trademarks or trademarks of the Apache Software Foundation (ASF) in the United States and/or other countries. MesosCon is run in partnership with the ASF.
