My life is full of mistakes. They’re like pebbles that make a good road. — Beatrice Wood
You know all the catchphrases and inspirational quotations about failure: fail fast, succeed quicker; fail forward; embrace failure; fail fast, fail often, fail everywhere. As creators of the bleeding edge of technology, we know that if we’re not failing, we’re not trying hard enough, and we’re not learning. But merely failing a lot doesn’t lead to progress. Anyone can fail all the time; the trick is converting failure to success. Ilan Rabinovitch of Datadog tells us, in his LinuxCon North America presentation, how to intelligently learn from our failures, and how to progress from failure to success.
The key to converting failure to success is to collect and analyze useful metrics, and to conduct formal post-mortems (or call them reviews or retrospectives if you don’t care for “post-mortem”). This needs to be part of your core process. As Rabinovitch puts it, “The monitoring systems that we engage with these days are distributed and complex, more so than ever… All the pieces interact in ways that are much more complex than they might have been 10 years ago when you had a very clear three-tier architecture or static website that you interacted with. There are lots more pieces that can break or interact in unintentional ways.”
There are enough new mistakes to make; we don’t need to repeat the old ones. — Ilan Rabinovitch
Your reviews are definitely not about blame and punishment. Instead, Rabinovitch says, “We need to go back and see why was I able to do that, why did I make that mistake, why did I think that was the right actions to take. Put away the pitchforks, it should never be about the blame.” He reminds us that “Culture is this idea that we’re working together, we’re seeing the problem as the enemy, not each other… Sharing this idea that we’re going to take our learnings back and help each other be more successful in the future.”
So how do you approach this? We’re already drowning in data, and yet Rabinovitch advises us to “Collect as much [data] as you can. If you don’t, it’s going to be expensive to generate again later, going back and trying to recreate the events of a security incident or a technical outage or what you’ve said or didn’t say on a control call.” The next step is to categorize your metrics into three buckets: work metrics, resource metrics, and events. Then what do you do?
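Before getting to that answer, here is a minimal sketch of what sorting collected data into the three buckets might look like. The article does not define the buckets, so the reading used here is an assumption: work metrics describe what a system produces (requests, errors, latency), resource metrics describe what it consumes (CPU, memory, disk), and events are discrete occurrences such as a deploy. The metric names, the prefix rules, and the `kind` field are all hypothetical and are not taken from the presentation or from any Datadog API.

```python
from collections import defaultdict

# Hypothetical naming rules: these prefixes are illustrative assumptions,
# not a convention described in the talk.
WORK_PREFIXES = ("requests.", "errors.", "latency.")
RESOURCE_PREFIXES = ("cpu.", "memory.", "disk.")


def bucket_for(record):
    """Return the bucket name for one collected record."""
    # Discrete occurrences (deploys, alerts) are treated as events.
    if record.get("kind") == "event":
        return "events"
    name = record["name"]
    if name.startswith(WORK_PREFIXES):
        return "work"
    if name.startswith(RESOURCE_PREFIXES):
        return "resources"
    # Anything the rules do not recognize is kept, not discarded.
    return "uncategorized"


def categorize(records):
    """Group record names by bucket."""
    buckets = defaultdict(list)
    for record in records:
        buckets[bucket_for(record)].append(record["name"])
    return dict(buckets)
```

Keeping an explicit `uncategorized` bucket fits the advice to collect as much as you can: data that matches no rule is still retained for a later review.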
Watch the complete presentation (below) for insights on what to look for, which tools and processes can help you make sense of what happened, and how to move forward.