On Sept 25th, 2014, AWS notified users about an EC2 Maintenance in which “a timely security and operational update” needed to be performed, requiring the reboot of a large number of instances (around 10%). On Oct 1st, 2014, AWS sent an update on the status of the reboot and XSA-108.
While we’d love to claim that we weren’t concerned at all given our resilience strategy, the reality was that we were on high alert given the potential impact to our services. We discussed different options, weighed the risks, and monitored our services closely. We observed that our systems handled the reboots extremely well thanks to the resilience measures we had in place. Unforeseen events like these reinforce that regular, controlled chaos and continued investment in chaos engineering are necessary. In fact, Chaos Monkey was mentioned as a best practice in the latest EC2 Maintenance update.
Read more at the Netflix blog.