Distributed web-based systems are inherently complex. They’re composed of many moving parts (web servers, databases, load balancers, CDNs, and more) that work together to form an intricate whole. This complexity inevitably leads to failure. Understanding how this failure happens (and how we can prevent it) is at the core of our job as operations engineers.
In his influential paper “How Complex Systems Fail,” Richard Cook shares 18 sharp observations on the nature of failure in complex medical systems. The nice thing about these observations is that most of them hold true for complex systems in general. Our intuitive notions of cause and effect, where each outage is attributable to a direct root cause, are a poor fit for the reality of modern systems.
In this post, I’ll translate Cook’s insights into the context of our beloved web systems and explore how they fail, why they fail, how you can prepare for outages, and how you can prevent similar failures from happening in the future.
Read more at Scalyr Blog