Changing the definition of High Availability

I was recently speaking with someone about high availability. Their approach to high availability was two hand crafted instances (servers), in each in a separate Availability Zones on AWS, behind an ELB.

My approach to high availability is two (or more) instances in two (or more) Availability Zones behind an ELB, auto-created (and replaced upon failure) by AutoScale, with health-checks to ensure the instances are processing their workloads correctly, and replacing those that are not.

I extend this to include:

  • rolling updates so when the need arises, such as replacing the underlying OS every 6 – 12 months I can do so as seamlessly as possible
  • applying critical security updates daily without human intervention
  • ensuring that logs are sent off the instance to persistent storage and analytics where they can’t be overwritten or tampered with
  • ensuring data workloads are on object storage, or managed database platforms

Of course the ASG provisioning approach to EC2 deployment requires that everything during install is scripted and repeatable.And it gives me the freedom to size my ASG to zero instances outside of service hours, and ensure that the system deploys every single time.

I’ve done this resize ASGs to zero in non production times for many years now, across hundreds of instances per day. You can bet its pretty well tested, has saved hundreds of thousands of dollars, and ensured that the process works.

I bake my own AMIs, and re-baseline these semi-regularly using automation to do so – a process that takes 10 minutes of wall time, and about 60 seconds of my time. I treat AMIs as artefacts, as precious and as important as the code they run.

I have to stop myself sometimes when others rely upon pre-cloud approaches to high availability.

The snowflakes must stop. Cattle, not pets.