How to Lead a Disaster Recovery Exercise For Your On-Call Team

August 3, 2018

A disaster recovery exercise is a fire drill for your on-call team. The exercise is most useful when it is as realistic as possible. In a well-designed exercise, engineers search through your production codebase for the tools they need to operate on a production-like environment.

Our disaster recovery exercises follow four basic principles:

All on-call engineers are gathered in one room
Sterilized environment (like prod, but not prod)
Clear objective
Timeboxed recovery

At SigOpt, we run on AWS, so our first exercise was to spin up an API from scratch in our backup region. Our sterilized environment was us-east-1, with no access to AMIs, instances, or databases in our production region. Our objective was to hit dr-api.sigopt.com and serve an API request. Our timebox was 4 hours, which we chose from an engineering OKR.
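One way to make the objective and the timebox concrete is a small script that polls the recovery endpoint until it answers or the clock runs out. The sketch below is illustrative only and is not the tooling we used: the function names, the 30-second polling interval, the https scheme, and the choice to treat any 2xx response as success are all assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def endpoint_is_up(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status (assumed success criterion)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and 4xx/5xx responses.
        return False


def wait_for_recovery(
    probe: Callable[[], bool],
    timebox_s: float,
    interval_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it succeeds or the timebox expires.

    Returns the elapsed seconds on success, or None if the timebox ran out.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while clock() - start < timebox_s:
        if probe():
            return clock() - start
        sleep(interval_s)
    return None


if __name__ == "__main__":
    # Hypothetical invocation: host from the exercise, scheme assumed.
    url = "https://dr-api.sigopt.com"
    elapsed = wait_for_recovery(lambda: endpoint_is_up(url), timebox_s=4 * 60 * 60)
    if elapsed is None:
        print("Timebox expired before the API answered.")
    else:
        print(f"API answered after {elapsed / 60:.1f} minutes.")
```

Started at the beginning of the exercise, a script like this gives the room an unambiguous signal that the objective has been met and records how much of the timebox was used.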

Disaster Recovery Exercise as an Infrastructure Diagnostic

We ran our original disaster recovery exercise to diagnose holes in our ability to recover our infrastructure. True to our goal, the exercise produced a few months of projects to work on.
