Site Reliability Engineering for Cloud-Native Operations

June 27, 2017

Developers want to ship changes as soon as they can, while operations teams remain apprehensive that changes will break stuff. To reconcile these two drives, Google forged the path of site reliability engineering (SRE), an emerging practice for maintaining complex computing systems that need to run with high reliability. As Ben Treynor, the founder of Google’s SRE team, put it: SRE is “what happens when a software engineer is tasked with what used to be called operations.”

SRE dates back to 2003, when Treynor joined Google to manage a team of engineers running a production environment. The practice proved to be a success, and the company now has 1,500 engineers working in SRE. Apple, Oracle, Microsoft, Twitter, Dropbox, IBM, and Amazon have all implemented their own SRE teams as well.
