Root Cause: How Complex Web Systems Fail

October 27, 2016

Distributed web-based systems are inherently complex. They’re composed of many moving parts — web servers, databases, load balancers, CDNs, and many more — working together to form an intricate whole. This complexity inevitably leads to failure. Understanding how this failure happens (and how we can prevent it) is at the core of our job as operations engineers.

In his influential paper How Complex Systems Fail, Richard Cook shares 18 sharp observations on the nature of failure in complex medical systems. The nice thing about these observations is that most of them hold true for complex systems in general. Our intuitive notions of cause and effect, where each outage is attributable to a direct root cause, are a poor fit for the reality of modern systems.

In this post, I’ll translate Cook’s insights into the context of our beloved web systems and explore how they fail, why they fail, how you can prepare for outages, and how you can prevent similar failures from happening in the future.
