
New Tools and Techniques for Managing and Monitoring Mesos

It wasn’t that long ago that the idea of managing petabytes of data and monitoring giant, busy computing clusters running thousands of services was something for the future, not the now. As it turned out, that was a mighty short future, and it’s all happening now. These talks from MesosCon North America show how two different companies are solving configuration and data management problems with Apache Mesos and other tools.

Activision Publishing, a computer games publisher, uses a Mesos-based platform to manage the vast quantities of data it collects from players, which drives much of its games’ behavior. To address a critical configuration management problem, James Humphrey and John Dennison built a rather elegant solution that puts all configurations in a single place, and they named it Pheidippides.

Drew Gassaway explains how Bloomberg Vault used Mesos to build a scalable system to aggregate and manage petabytes of data and provide custom analytics — without asking the customer to change anything.

All Marathons Need a Runner — Introducing Pheidippides

James Humphrey and John Dennison, Analytics Services, Activision Publishing

Activision Publishing makes computer games, lots and lots of them. Activision collects giant quantities of data from players, and according to James Humphrey uses it to “make millions of small decisions…stuff like anti-cheating, match-making, economy simulations, balancing of items in the games.” Analytics Services relies on Docker, Mesos, Marathon, and Jenkins to support the services that process all of this data.

Configuration management became a critical problem. The Docker, Mesos, Marathon, and Jenkins infrastructure makes it very easy to deploy large numbers of containers, services, and Jenkins jobs, all merrily proliferating. The good news is this system is so flexible you can do virtually whatever you want. The bad news is you have many ways to poke yourself in the eye. Humphrey and Dennison built a rather elegant solution to the problem of configuration management and named it Pheidippides, or Pheidip for short. (Pheidippides was supposedly the first marathon runner.)

Dennison explains their approach of making the framework the configuration manager and thinking of configurations as infrastructure rather than loading up containers with individual configurations: “The choice is either put it inside the container and make it a smart container. Or keep it out, and make it dumb. Every time, we have repeatedly learned that you’ve got to keep your containers dumb. Because, you want to switch out the logging system, and now you have a whole bunch of logging logic. Plugins running all over your cluster, and they’re written by different people in different teams…You can always put pieces of your infrastructure in your containers, but don’t.”

This puts all configurations in a single place, rather than scattered throughout hundreds of containers. “The ability to declare something once and only once across many services is very key for us.”
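
The talk doesn’t walk through Pheidippides’ internals, but the “dumb container” pattern it advocates is easy to sketch. Below is a minimal, hypothetical example of a container entrypoint written this way: the only knob baked into the deployment is a service name, and everything else is fetched from a central configuration endpoint at startup. The CONFIG_SERVER address and /v1/config route are illustrative assumptions, not Pheidippides’ actual API.

```python
# Sketch of the "dumb container" pattern: the container ships no
# configuration of its own. At startup it asks a central store
# (a hypothetical Pheidippides-style HTTP endpoint) for the
# configuration belonging to its service name.
import json
import os
import urllib.request

CONFIG_SERVER = os.environ.get("CONFIG_SERVER", "http://config.internal:8080")
SERVICE_NAME = os.environ["SERVICE_NAME"]  # the only value baked into the deploy

def load_config() -> dict:
    """Fetch this service's configuration, declared once, centrally."""
    url = f"{CONFIG_SERVER}/v1/config/{SERVICE_NAME}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    config = load_config()
    print(f"{SERVICE_NAME} starting with log sink {config.get('log_sink')}")
```

With this shape, switching out the logging system means changing one central record rather than rebuilding and redeploying hundreds of containers.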

Watch the complete presentation below:

https://www.youtube.com/watch?v=XBEvamRP3KU?list=PLGeM09tlguZQVL7ZsfNMffX9h1rGNVqnC

A Seamless Monitoring System for Apache Mesos Clusters

Drew Gassaway, Bloomberg LP

Bloomberg Vault provides managed data services, currently about four petabytes’ worth for all of its customers. Drew Gassaway’s team was tasked with adding data and file analytics, along with features such as trade reconstruction and financial and compliance products. They had to build a new platform for these new products: “We were in a situation that’s similar to a lot of people. You have an existing infrastructure. You have a lot of static VMs. You are doing mostly static resource allocation. We wanted to get away from that. We had two main goals we wanted to achieve with this. We wanted it to act as a base for some of these brand new analytics applications we wanted to build. We wanted to have a platform that we could end up putting the majority of our team’s Vault applications onto, so we would have a lot of room to grow. We wanted to start pretty quickly, but keep it as scalable as possible.” In other words, a typical Mesos-based solution: fast, cheap, and scalable.

Gassaway’s team had to figure out how to aggregate and manage a large and diverse assortment of logging data from Mesos tasks, syslogs, HAProxy, and other logs captured from a large number of nodes. They wanted to support dashboards and alerting. They didn’t want to modify existing applications or require their customers to change anything. They also had to be mindful of protecting potentially sensitive customer data. The platform also had to support growth: “We wanted to collect at the node level and push data to the center of our topology…we didn’t want to begin in a situation where we are struggling to keep up pulling from an expanding cluster from the center.”

A primary goal was keeping as much as possible inside Mesos. “We have a 3-pronged approach where we have a Mesos task on every node. One is collecting logs, one is collecting metric data, and another one is for anything else we might want to actively scrape…we also added some throttling and quality-of-service behavior, so we can control if certain apps were generating an excessive or overwhelming amount of logs.”
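
The talk doesn’t publish the implementation, but the throttling idea maps naturally onto a per-application token bucket sitting in front of the log forwarder. Here is a minimal sketch under that assumption; the forward() target is a hypothetical stand-in for the push to the central collectors.

```python
# Per-app log throttling sketch: a token bucket in front of the
# forwarder drops lines from any app that exceeds its rate, so one
# noisy service cannot overwhelm the central pipeline.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def ship(app: str, line: str) -> None:
    bucket = buckets.setdefault(app, TokenBucket(rate_per_sec=100, burst=500))
    if bucket.allow():
        forward(app, line)  # hypothetical push to the central collectors
    # else: drop (or count) the line; this is the quality-of-service behavior

def forward(app: str, line: str) -> None:
    print(f"[{app}] {line}")  # stand-in for the real network push
```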

The platform includes a large number of tools including Logstash, Elasticsearch, InfluxDB, and Kibana.

Watch the complete presentation below:

https://www.youtube.com/watch?v=8OahXeQhNPY?list=PLGeM09tlguZQVL7ZsfNMffX9h1rGNVqnC

Mesos Large-Scale Solutions

Please enjoy the previous blogs in this series, and watch this spot for more blogs on ingenious and creative ways to hack Mesos for large-scale tasks.


Apache, Apache Mesos, and Mesos are either registered trademarks or trademarks of the Apache Software Foundation (ASF) in the United States and/or other countries. MesosCon is run in partnership with the ASF.

Building Serverless Apps with Docker

Every now and then, there are waves of technology that threaten to make the previous generation of technology obsolete. There has been a lot of talk about a technique called “serverless” for writing apps. The idea is to deploy your application as a series of functions, which are called on-demand when they need to be run. You don’t need to worry about managing servers, and these functions scale as much as you need, because they are called on-demand and run on a cluster.

But serverless doesn’t mean there is no Docker – in fact, Docker is serverless. You can use Docker to containerize these functions, then run them on-demand on a Swarm. Serverless is a technique for building distributed apps and Docker is the perfect platform for building them on.
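
As a minimal sketch of that pattern (not any particular framework’s API), each function is packaged as an image and each call runs as a short-lived container whose standard output is the return value. The example/resize-image image name and payload are hypothetical.

```python
# "Functions as containers" sketch: every invocation starts a fresh,
# short-lived container and captures its stdout as the result, leaving
# scheduling and scaling to Docker (or a Swarm).
import subprocess

def invoke(function_image: str, payload: str) -> str:
    """Run one containerized function call and return its output."""
    result = subprocess.run(
        ["docker", "run", "--rm", function_image, payload],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(invoke("example/resize-image", "photo-1234"))
```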

Read more at Docker Blog

What a Virtual Network Looks Like: Planning

Virtual networks make things easier for the user at the planning level… at least in theory.

Network services don’t spring up unbidden from the earth but rather they’re coerced out of infrastructure in response to business and consumer opportunities. Every operations and management paradigm ever proposed for networking includes an explicit planning dimension to get the service-to-infrastructure and service-to-user relationships right. On the surface, virtualization would seem to help planning by reducing inertia, but don’t you then have to plan for virtualization? How the planning difficulties and improvements balance out has a lot to do with how rapidly we can expect virtualization to evolve.

What virtual networks do is disconnect “service” from “network” in at least some sense. They can do this by laying a new protocol layer on top of existing layers (the Nicira/VMware or software-defined WAN model), or by disconnecting traffic forwarding and network connectivity from legacy adaptive protocols (OpenFlow SDN and white-box switches).
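
The first of those models is easy to make concrete. An overlay such as VXLAN wraps the original Ethernet frame in a small header carrying a 24-bit virtual network identifier (VNI), and the result travels over the ordinary IP/UDP underlay. A minimal sketch of the encapsulation step, following the RFC 7348 header layout (the frame itself is a placeholder):

```python
# Overlay sketch: a VXLAN-style header (RFC 7348) wraps an original
# Ethernet frame so it can ride over an ordinary IP/UDP underlay.
# The 24-bit VNI is what separates one virtual network from another
# on the same physical infrastructure.
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prefix an 8-byte VXLAN header carrying the virtual network ID."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # flags (1 byte) + reserved (3) + VNI (3) + reserved (1)
    header = struct.pack("!B3x", VXLAN_FLAG_VNI_VALID) + vni.to_bytes(3, "big") + b"\x00"
    return header + inner_frame  # carried as a UDP payload (port 4789)

frame = b"\x00" * 14  # placeholder Ethernet header
packet = vxlan_encapsulate(frame, vni=5001)
assert len(packet) == 8 + len(frame)
```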

Read more at No Jitter

Xen 4.7 Open Source Linux Hypervisor Arrives with Non-Disruptive, Live Patching

Xen 4.7 arrives eight months after the release of the previous version, Xen 4.6, and it appears to be yet another major release. We expected no less from the leading open-source virtualization system, which is currently used by many of the world’s best-known cloud hosting services, including AWS (Amazon Web Services), Rackspace Public Cloud, and Verizon Cloud, powering more than 10 million users.

Release highlights of Xen 4.7 include a new XL command-line interface that has been designed to allow the use of PVUSB devices for PV guests, as well as to enable hot-plugging of USB devices and QEMU disk backends for HVM guests…

Read more at Softpedia

OPNFV Project Scales Up Network Functions Virtualisation Ecosystem

The OPNFV Project, the Linux Foundation’s open source network functions virtualisation (NFV) platform development organisation, has announced an expansion of its R&D capabilities and an internship programme to help further develop the worldwide NFV ecosystem.

The organisation, which is currently hosting its annual OPNFV Summit in Berlin, bringing together developers, end-users, and upstream communities, said it was seeing clear momentum for NFV around the world.

Read more at ComputerWeekly

CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

https://www.youtube.com/watch?v=XVE3uCRtHVs?list=PLGeM09tlguZQVL7ZsfNMffX9h1rGNVqnC

In this presentation Carlos Sanchez will share his experience running Jenkins at scale, using Docker and Apache Mesos to create one of the biggest (if not the biggest) Jenkins clusters to date.

Finagle, linkerd, and Apache Mesos: Magical Operability Sprinkles for Microservices

https://www.youtube.com/watch?v=VGAFFkn5PiE?list=PLGeM09tlguZQVL7ZsfNMffX9h1rGNVqnC

Finagle and Mesos are two core technologies used by Twitter and many other companies to scale application infrastructure to high traffic workloads. This talk describes how these two technologies work together to form applications that are both highly scalable and resilient to failure.

Successful DevOps Deployment Involves Shift in Culture and Processes

To create sustained high performance, organizations must invest as much in their people and processes as they do in their technology, according to Puppet’s 2016 State of DevOps Report.

The 50+ page report, written by Alanna Brown, Dr. Nicole Forsgren, Jez Humble, Nigel Kersten, and Gene Kim, aimed to better understand how the technical practices and cultural norms associated with DevOps affect IT and organizational performance as well as ROI.

According to the report, which surveyed more than 4,600 technical professionals from around the world, the number of people working in DevOps teams has increased from 16 percent in 2014 to 22 percent in 2016.

Six key findings highlighted in the report showed that:

  • High-performing organizations decisively outperform low-performing organizations in terms of throughput.

  • They have better employee loyalty.

  • High-performing organizations spend 50 percent less time on unplanned work and rework.    

  • They spend 50 percent less time remediating security issues.

  • An experimental approach to product development can improve IT performance.

  • Undertaking a technology transformation initiative can produce sizeable returns for any organization.    

Specifically, in terms of throughput, high IT performers reported routinely doing multiple deployments per day and saw:

  • 200 times more frequent code deployments        

  • 2,555 times faster lead times        

  • 24 times faster mean time to recover        

  • 60 times lower change failure rate

Shift Left

Lean and agile product management approaches, which are common in DevOps environments, emphasize product testing and building in quality from the beginning of the process. In this approach, also known as “shifting left,” developers deliver work in small batches throughout the product lifecycle.

“Think of the software delivery process as a manufacturing assembly line. The far left is the developer’s laptop where the code originates, and the far right is the production environment where this code eventually ends up. When you shift left, instead of testing quality at the end, there are multiple feedback loops along the way to ensure that high-quality software gets delivered to users more quickly,” the report states.

This idea also applies to security, as an integral part of continuous delivery. “Continuous delivery improves security outcomes,” according to the report. “We found that high performers were spending 50 percent less time remediating security issues than low performing organizations.”

For companies just getting started with DevOps, the move involves other changes as well.

“Adopting DevOps requires a lot of changes across the organization, so we recommend starting small, proving value, and using the trust you’ve gained to tackle bigger initiatives,” said report co-author Alanna Brown, Senior Product Marketing Manager at Puppet, in an interview.

“We also think it’s important to get alignment across the organization by shifting the incentive structure so that everyone in the value chain has a single incentive: to produce the highest quality product or service for the customer,” Brown said. Employee engagement is key, as “companies with highly engaged workers grew revenues two and a half times as much as those with low engagement levels.”

In this year’s survey, according to Brown, most respondents reported beginning their DevOps journey with deployment automation, infrastructure automation, and version control — or all three.

“We see these practices as the foundation of a solid DevOps practice because automation gives engineers cycles back to work on more strategic initiatives, while the use of version control gives you assurance that you can roll back quickly should a failure occur,” she said. “Without these two practices in place, you can’t implement continuous delivery, provide self-service provisioning, or adopt many of the new technologies and methodologies such as containers and microservices.”

Build a Foundation

Ultimately, however, to be successful, DevOps must overcome “political and cultural inertia,” Brown said. “It can’t be a top-down dictate, nor can it be a purely grassroots effort.”                

The 2016 report offers some steps that can make a difference in your organization’s performance. Once you have your foundation in place, Brown said, “you’ll see all the opportunities that exist to automate manual processes… And, of course, there will be the bigger initiatives like moving workloads to a public cloud, building out a self-service private cloud, and spreading DevOps practices to other parts of the organization.”  

 

Docker Launches a New Marketplace for Containerized Software

At its developer conference in Seattle, Docker today announced the private beta of the Docker Store, a new marketplace for trusted and validated dockerized software.

The idea behind the store is to create a self-service portal for Docker’s ecosystem partners to publish and distribute their software through Docker images — and for users to make it easier to deploy these applications.

While Docker already offers its own registry for containers, the Docker Store is specifically geared toward the needs of enterprises. The store will provide enterprises “with compliant, commercially supported software from trusted and verified publishers, that is packaged as Docker images,” the company says, and will feature both free and paid software.

Read more at TechCrunch

ClusterHQ’s Mohit Bhatnagar Talks Flocker, Docker, and the Rise of Open Source

Container technology remains very big news, and if you bring up the topic almost everyone immediately thinks of Docker. But, there are other tools that can compete with Docker, and tools that can extend it and make it more flexible. CoreOS’s Rkt, for example, is a command-line tool for running app containers. And, ClusterHQ has an open source project called Flocker that allows developers to run their databases inside Docker containers, leveraging persistent storage, and making data highly portable.

Each of the emerging tools in the container space has unique appeal. ClusterHQ’s Flocker is especially interesting because it marries scalable, enterprise-grade container functionality with persistent storage. Many organizations working with containers are discovering that they need dependable, scalable storage solutions to work in tandem with their applications.

“At ClusterHQ we are building the data layer for containers, enabling developers and operations teams to run not just their stateless applications in containers, but their databases, queues and key-value stores as well,” the company’s CEO Mark Davis has said.

We caught up with ClusterHQ’s Vice President of Products Mohit Bhatnagar for an interview. He notes that containers don’t handle persistent data well on their own, and that companies need to run their critical data services inside containers to realize the full speed and quality benefits of a fully containerized architecture. He also weighed in on the prominence that open source software is gaining relative to proprietary tools.

A ‘Git-for-Data’

“We are working on expanding the capabilities of Flocker to support our growing user base for sure, but we’re also expanding beyond just production operations of stateful containers,” Bhatnagar said. “We’ve heard from our users that they want to be able to manage their Docker volumes as easily on their laptop as they can in production with Flocker.  To serve these needs, we’re working on creating ‘git-for-data,’ where a user can version control their data and push and pull it to a centralized Volume Hub. As they say, watch this space.”

Modern applications are being built from both stateless and stateful microservices, and Flocker makes it practical for entire applications, including their state, to be containerized in order to leverage the portability and massive per-server density benefits inherent in containers.

“Flocker is the leading volume manager for Docker because it is the most advanced technically and has the broadest integrations,” Bhatnagar added. “Flocker is used at scale, from enterprises such as Swisscom to innovative startups disrupting their spaces. Our customers love Flocker because it works with all the major container managers like Docker Swarm, Kubernetes and Mesos, and integrates with all the major storage systems including Amazon, Google, EMC, NetApp, Dell, HPE, VMware, Ceph, Hedvig, Pure and more.”

Bhatnagar also discussed increasing competition in the container space. “As always, competition is great for consumers,” he said. “It leads to more choice and better products. We are excited to see standardization projects like OCI bring together Docker and CoreOS, and CNCF bring together Docker and Google Kubernetes to make sure that this competition doesn’t lead to a situation where differing standards hinder adoption.”

The Rise of Open Source

One topic that Bhatnagar is passionate about is the steady rise of open source, and its increasing popularity relative to proprietary technology.

“The open source stack and all that it engenders is driving the closed source, or proprietary, stack to be less relevant and less economically feasible,” he notes. “Take, for example, the great success of Docker with containers and its resulting ecosystem. Its popularity isn’t simply due to the fact that it’s a cool company. After all, in Silicon Valley there are lots of cool companies. It is, rather, largely a result of its open source model that reflects the ascendance of software engineers in the creation and deployment of software. And open source Docker is giving closed source VMware a headache as a result.”

“For the first time in our information technology age, we can now build an entire infrastructure stack composed of x86 architecture, commodity components, and an open source stack,” he added. “The fastest growth, as we know, is happening among open source companies. Developers today play a far more influential role in application development as the monolithic architectures break down. Demand for microservices, developer-centric workflows, containers, open source, and big data are all part of the larger current driving information technology today.”