Weathering the WAR Zone in AWS

Ten years ago I wrote on LinkedIn about surviving the failure of a single AWS Availability Zone during a sizable storm cell in Sydney: Weathering the Cloud Storm: AWS AP-Southeast-2 LSE, 5 June 2016.

I have not seen many AZ failures in the last decade, but there is one this week in the UAE Region.

Let’s not talk about the cause of the power outage; there are a lot of stories and not many verified facts. But the power went off, and it was an incident.

It is always worth checking the AWS Health Dashboard for your account in these situations, and rechecking it regularly. Many people think they will instantly reach out to their AWS account team, but in the heat of a Large Scale Event (LSE), the account team is playing catch-up as well, working from the same limited information. The health dashboard is a scalable way to get operational information out to a large number of clients.

It is an interesting approach, and one that many organisations should look at when developing their own crisis communications. Rather than having people spend time on the phone during an issue, direct them to a rapidly updated dashboard, and have more people fixing the issue rather than talking about it!

As was the case 10 years ago, workloads that are deployed following the Well-Architected Framework, and that use multiple Availability Zones in an active-active fashion, have a good chance of surviving an AZ outage. EC2 Auto Scaling groups should rebalance their instances into the remaining configured Availability Zones.

Of course, if this advice is heeded by many other AWS cloud users, then all of those clients will have Auto Scaling groups making similar demands for capacity within the same small window. The best chance you have is to ensure that you are deployed across three Availability Zones (if there are three in your operational Region), so that less of your fleet is lost with any one zone and the replacements are spread across two surviving zones rather than one.
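To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes a fleet spread evenly across its configured Availability Zones and a single zone going offline; the function name and the 12-instance fleet are purely illustrative, and real Auto Scaling behaviour also depends on available capacity and group settings.

```python
def relaunch_load(total_instances: int, az_count: int) -> tuple[float, float]:
    """Estimate the relaunch demand when one Availability Zone is lost.

    Assumes instances are spread evenly across az_count zones (an
    illustrative simplification). Returns (instances lost, replacement
    instances each surviving zone must absorb).
    """
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    lost = total_instances / az_count          # share of the fleet in the failed AZ
    per_surviving_az = lost / (az_count - 1)   # replacements spread over the rest
    return lost, per_surviving_az


# Hypothetical 12-instance fleet, deployed across two versus three AZs.
for azs in (2, 3):
    lost, per_az = relaunch_load(12, azs)
    print(f"{azs} AZs: {lost:.0f} lost, {per_az:.0f} to relaunch per surviving AZ")
```

With two zones, half the fleet has to be relaunched into a single zone; with three, only a third is lost and each surviving zone takes half of that. That is a much smaller ask in a window where everyone else is asking too.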

The same goes for any RDS instances you have configured; your standbys must be able to run from any of the operational Availability Zones. The same situation exists if you have configured only two AZs when three exist: an automatic and immediate race for resources when an AZ goes offline.
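One way to find those two-AZ configurations before an outage finds them for you is a simple coverage check. The sketch below assumes you have already exported, by whatever means you prefer, the list of AZs each Auto Scaling group or RDS subnet group is configured for; the resource names, the inventory itself and the AZ names are hypothetical examples, and nothing here calls an AWS API.

```python
def missing_az_coverage(
    resources: dict[str, set[str]], region_azs: set[str]
) -> dict[str, list[str]]:
    """Return, for each resource, the Region's AZs it is NOT configured to use.

    resources maps a resource name to the set of AZs it is configured for.
    Resources that already cover every AZ in region_azs are omitted.
    """
    gaps = {}
    for name, configured in resources.items():
        missing = region_azs - configured
        if missing:
            gaps[name] = sorted(missing)
    return gaps


# Hypothetical inventory for a three-AZ Region (AZ names are illustrative).
region = {"me-central-1a", "me-central-1b", "me-central-1c"}
inventory = {
    "web-asg": {"me-central-1a", "me-central-1b", "me-central-1c"},
    "orders-db-subnet-group": {"me-central-1a", "me-central-1b"},
}
print(missing_az_coverage(inventory, region))
```

Anything that appears in the result is a candidate for that race for resources: if one of its two configured zones fails, everything it runs has to fit into the single zone that remains.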

For those who followed Well-Architected principles and my advice from 10 years ago, this should have been an informational event, not a critical issue. If you were impacted, then you now know what your technical debt is. Perhaps you’ve been using cloud as “just another data centre”. Perhaps you got caught up in a migration that did “the bare minimum to get to cloud”. Now is the time to review, and learn.