How to Lead a Disaster Recovery Exercise For Your On-Call Team


A disaster recovery exercise is a fire drill for your on-call team. The exercise is most useful when it is as realistic as possible. A well-designed exercise has engineers searching through your production codebase for the tools they need to operate on a production-like environment.

Our disaster recovery exercises follow four basic principles:

  • All on-call engineers are gathered in one room
  • Sterilized environment (like prod, but not prod)
  • Clear objective
  • Timeboxed recovery

At SigOpt, we run on AWS, so our first exercise was to spin up an API from scratch in our backup region. Our sterilized environment was us-east-1, with no access to AMIs, instances, or databases in our production region. Our objective was to hit dr-api.sigopt.com and service an API request. Our timebox was 4 hours, a figure we took from an engineering OKR.
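A clear objective and a timebox can both be checked mechanically. The following is a minimal sketch, not the tooling SigOpt used: it polls an endpoint until it answers with a 2xx status or the timebox expires. The function names `api_is_up` and `wait_for_recovery`, the `https://` scheme, the root path, and the 30-second poll interval are all assumptions made for illustration; only the hostname and the 4-hour timebox come from the exercise described above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def api_is_up(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a 4xx/5xx response
        # all count as "not recovered yet".
        return False


def wait_for_recovery(
    probe: Callable[[], bool],
    timebox_s: float,
    poll_interval_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or the timebox runs out.

    `clock` and `sleep` are injectable so the loop can be tested
    without actually waiting.
    """
    deadline = clock() + timebox_s
    while True:
        if probe():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, remaining))


# Example usage (assumed URL; uncomment to run against a real endpoint):
# url = "https://dr-api.sigopt.com/"
# met = wait_for_recovery(lambda: api_is_up(url), timebox_s=4 * 60 * 60)
# print("objective met" if met else "timebox expired")
```

Passing the probe in as a function keeps the timebox logic separate from what "recovered" means, so a stricter check (for example, an authenticated API call) could be swapped in without touching the loop.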

Disaster Recovery Exercise as an Infrastructure Diagnostic

We ran our original disaster recovery exercise to diagnose holes in our ability to recover our infrastructure. True to our goal, the exercise produced a few months of projects to work on.
