Ops to the Nines

A primer on fault tolerance, high availability, and disaster recovery

You’re going to need a new ride. (Dawn Armfield)

“The Great US-EAST-1 Outage of 2017” is how it shall be known henceforth. An Amazon Web Services operations engineer attempted to remove from service some instances that were causing an unexpected problem in the AWS billing services. A mistyped command-line script accidentally took a bunch of file servers offline and borked the Simple Storage Service (S3) in the US-EAST-1 region.

Cries rang out across the Twittersphere: “OMG Amazon is down!” The S3 outage had far-reaching consequences, with a significant chunk of all internet traffic (and a lot of major sites and web apps) impacted to some extent. But it didn’t have to be this way. You can insulate your apps from practically any outage given a bit of know-how, some forethought, and a calculated approach to balancing your costs and risks.

Amazon Web Services uses both regions (geographically isolated services) and availability zones (separate physical locations within a region) to mitigate outages, but we’ve come to treat S3 specifically as infallible.

Although AWS comprises an amazingly reliable collection of services, using AWS does not allow us to ignore the essential operational concerns of high availability, fault tolerance, and disaster recovery scenarios. Although my examples will cover the AWS ecosystem specifically, the same concepts apply to any public cloud provider whether it’s AWS, Microsoft Azure, Google Cloud Platform, or another cloud provider running OpenStack.

Architecture of Cloud Computing

What is cloud computing anyway besides a buzzword? The concept of cloud computing arose as a reaction to the costs to provision, operate, and maintain on-premises servers. A brief history of cloud computing will help us see where we went wrong.

First, we went toward virtual private server (VPS) configurations with multiple virtual machines running on top of a single server or cluster of servers sharing a common physical location. Virtualization allowed us to save time and money by provisioning servers from software images instead of physically assembling and racking them. We also saved money by utilizing more of the servers’ compute capacity. The cost savings from virtualization were wonderful, but the real magic was the ability to automatically heal our infrastructure. In virtualized computing, a failure event triggers provisioning of a new virtual instance. This is the core of fault tolerance: problems with an individual instance heal automatically.
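To make the auto-healing idea concrete, here is a minimal sketch in plain Python. It is a toy simulation, not AWS code: the `Instance` class, the `heal` function, and the instance IDs are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical ID generator for the simulation


@dataclass
class Instance:
    healthy: bool = True
    instance_id: str = field(default_factory=lambda: f"i-{next(_ids):04d}")


def heal(pool, desired_count):
    """Drop unhealthy instances and provision replacements up to desired_count."""
    survivors = [inst for inst in pool if inst.healthy]
    replacements = [Instance() for _ in range(desired_count - len(survivors))]
    return survivors + replacements


pool = [Instance(), Instance(), Instance()]
pool[1].healthy = False              # simulate a failure event
pool = heal(pool, desired_count=3)   # a fresh instance takes its place
```

The point is that nobody repairs the broken instance. It is thrown away and a new one is provisioned from the same image.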

Virtualization also gives us the ability to automatically scale capacity for a specific service up or down to maintain a desired quality of service. If, all of a sudden, a bunch of people need to use Software A, our infrastructure can de-prioritize Software B and reallocate additional virtualized computing capacity to Software A. Resilient applications use event monitoring to maintain quality of service standards without human intervention.
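One common way to turn a quality-of-service target into an instance count is to scale in proportion to how far a metric sits from its target. The sketch below assumes a single metric (say, average CPU percent) and hypothetical minimum and maximum pool sizes. It illustrates the idea only and is not the exact algorithm any provider uses.

```python
import math


def desired_capacity(current, metric_value, target_value, minimum, maximum):
    """Scale the instance count in proportion to metric/target, clamped to bounds."""
    if current <= 0 or target_value <= 0:
        raise ValueError("current and target_value must be positive")
    wanted = math.ceil(current * metric_value / target_value)
    return max(minimum, min(maximum, wanted))


# Four instances running at 90% CPU against a 60% target: grow the pool.
busy = desired_capacity(4, metric_value=90, target_value=60, minimum=2, maximum=10)
# The same pool idling at 15% CPU: shrink, but never below the minimum.
quiet = desired_capacity(4, metric_value=15, target_value=60, minimum=2, maximum=10)
```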

The concept that takes us from virtualized computing to cloud computing is the use of multiple physical locations with virtualized instances moving between them in a way that’s transparent to system users.

Amazon Web Services is the dominant provider of cloud computing with 31% global market share.

AWS provides the tools to create your dream cloud-computing infrastructure, but sadly it’s not out-of-the-box magic. Leveraging AWS services like EC2 for computing and S3 for storage saves us money, but if we don’t design for fault tolerance and high availability, we’re not utilizing AWS to its fullest.

Designing for Fault Tolerance

The first step toward resilient systems is to design for fault tolerance. If you are using Elastic Compute Cloud (EC2) for virtualized servers, you need to use some method of auto-healing to protect against downtime in the event of an instance failure.

The primary way to architect auto-healing for EC2 is to use an autoscaling group and an Elastic Load Balancer (ELB). These services let us choose where to place instances. Most organizations choose the US-EAST-1 region to achieve low latency and low costs. This region has several availability zones (geographically isolated data centers). An autoscaling group can spread your instances across these availability zones.

So you now have a few redundant server instances in multiple availability zones. The ELB will share load across these instances, and an autoscaling group defines the minimum and maximum number of instances you’d like to run across the availability zones. This allows you to choose a smaller instance size for each of the availability zones and still have the combined capacity of your instance pool keep your application running nicely. If one availability zone becomes unavailable for some reason, the autoscaling group will rebalance your pool across the remaining availability zones within the region. The minimum pool size should keep your application running at some baseline level of performance. The maximum size prevents you from paying out the wazoo in the case of a major traffic spike or denial-of-service (DoS) attack. For these cases of ultra-high traffic, you’ll want to look at some kind of DoS mitigation and static-site serving through CloudFront, with Route 53 handling DNS.
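The rebalancing behavior is easy to picture with a small simulation. This sketch assumes a pool of six instances and four zone names chosen purely for illustration. A real autoscaling group makes this decision for you.

```python
def spread(instance_count, zones):
    """Distribute instances as evenly as possible across the given zones."""
    if not zones:
        raise ValueError("no availability zones left in the region")
    base, extra = divmod(instance_count, len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


zones = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]  # illustrative
before = spread(6, zones)
# One zone goes dark: the same six instances land on the three survivors.
after = spread(6, [zone for zone in zones if zone != "us-east-1b"])
```

Note that the total stays at six in both cases. Losing a zone costs you placement options, not capacity, as long as the remaining zones have room.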

The Joy of Managed Services

Another way to improve your resilience and fault tolerance is to opt for managed services instead of instance services. If you’re using the Simple Storage Service (S3), then you already are using a managed service! You don’t have access to the operating system, nor do you have to decide which availability zone to place your files in. AWS handles all of that for you. In addition to operational simplification, leveraging a managed service means that you have greater fault tolerance to any lower-level infrastructure failure. Instances may fail, but you’ll never notice. Your managed resources automatically migrate to another host or availability zone without your intervention.

While you can run your databases on EC2 instances, I suggest using a managed service like the Relational Database Service (RDS), which can be spread across instances and availability zones, giving us more protection against AWS infrastructure problems. It’s a bit more expensive than installing PostgreSQL directly on an EC2 instance, but the managed service takes care of operating system patching and gives you a great interface for taking snapshots and creating read replicas to scale your database. Unless you have so many databases that you warrant a full-time database administrator on your team, I’ll bet dollars to doughnuts you’ll save money (and you’ll definitely save operational hassle) by using RDS. If you can’t tolerate losing any writes, definitely look at Aurora. Aurora is a managed database engine within RDS that keeps six copies of your data across three availability zones and automatically handles all of the failovers for you.

Managed services and autoscaling groups help achieve fault tolerance within a region, but what if AWS itself has a problem across an entire region? What if all the instances in a data center fail and there are no available resources to move to? This is what happened during the recent S3 outage. While it is rare for an AWS managed service to have an outage, it’s not a matter of if but when. Always be prepared.

High Availability Prepares You to Withstand Regional Outages

AWS managed services take care of high availability within a region, but the vast majority of managed services operate within a single region due to latency requirements. Route 53 (for DNS) and Identity and Access Management (IAM) are two obvious exceptions, but any service related to compute, data, or storage won’t automatically be available in multiple regions.

S3 provides cross-region replication, which you should enable for every bucket that is critical. Keep all of the CSS and JavaScript necessary to run your site in a bucket with cross-region replication enabled. I also suggest replicating another bucket for all images and other static assets used by articles and graphics published within the last two weeks. Pages with older content may be missing some static assets, but the vast majority of your site is going to operate normally even if S3 in US-EAST-1 fails completely. To reduce your S3 costs in your backup region, set up an S3 lifecycle policy to delete any items that are older than two weeks. Your mileage may vary on the age of the lifecycle policy, but this is at least a good starting point.
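As a sketch of what the two-week cleanup could look like, the snippet below builds a lifecycle configuration document in the shape S3 accepts, to the best of my understanding (for example, as the `LifecycleConfiguration` argument to boto3’s `put_bucket_lifecycle_configuration`). The rule ID is a made-up name, and the empty prefix is an assumption meaning “every object in the bucket.” Check the current S3 documentation before applying it.

```python
import json


def expire_after_days(days, rule_id="expire-replicated-assets"):
    """Build a lifecycle configuration that deletes objects after `days` days."""
    return {
        "Rules": [
            {
                "ID": rule_id,               # hypothetical rule name
                "Filter": {"Prefix": ""},    # assumption: apply to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


lifecycle = expire_after_days(14)
print(json.dumps(lifecycle, indent=2))
```

Apply a rule like this to the backup-region bucket only, so the primary bucket keeps its full history.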

RDS also gives you one-click configuration to add cross-region replication. A read-replica instance will be set up in another region, and this read replica will stay mostly current with your primary database. If you can tolerate losing a few hundred milliseconds of database writes during a region outage, you’re good to go. In the event of a regional outage, you can promote the read replica to be the master database. Don’t forget to add the appropriate number of read replicas again, so that your overall capacity remains stable.
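The promote-then-rebuild sequence can be sketched as a small function. I’m assuming here that `rds_client` behaves like a boto3 RDS client, which offers `promote_read_replica` and `create_db_instance_read_replica`. The database identifiers are hypothetical, and the `RecordingClient` stand-in exists only so the sketch runs without AWS credentials.

```python
def promote_and_rebuild(rds_client, replica_id, new_replica_ids):
    """Promote a read replica, then request fresh replicas that read from it."""
    # Promotion turns the replica into a standalone, writable instance.
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    # A real run would wait for the promoted instance to become available
    # before requesting new replicas.
    for new_id in new_replica_ids:
        rds_client.create_db_instance_read_replica(
            DBInstanceIdentifier=new_id,
            SourceDBInstanceIdentifier=replica_id,
        )
    return replica_id


class RecordingClient:
    """Stand-in client that records calls instead of talking to AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return record


client = RecordingClient()
promote_and_rebuild(client, "app-db-replica", ["app-db-read-1", "app-db-read-2"])
```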

EC2 is a bit trickier to configure for multi-region redundancy. To utilize autoscaling groups, you have already created an Amazon Machine Image (AMI) that is used to create new EC2 instances, but an AMI is only available in the region where it was created. If your server image rarely changes, then you can just periodically copy AMIs from one region to another using the AWS command line tool. If your server does change often, you can schedule instance snapshots and copy each one to your backup region.
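If you’d rather script the periodic AMI copy than run it by hand, something like the following would do. It assumes `ec2_client` is a boto3 EC2 client created in the backup region (the copy is requested from the destination side) and uses its `copy_image` call. The image ID and name are placeholders, and `FakeEc2` is a stand-in so the sketch runs offline.

```python
def copy_image_to_backup_region(ec2_client, image_id, source_region, name):
    """Ask the backup region to copy an AMI from the primary region."""
    response = ec2_client.copy_image(
        Name=name,
        SourceImageId=image_id,
        SourceRegion=source_region,
    )
    return response["ImageId"]   # ID of the new image in the backup region


class FakeEc2:
    """Stand-in for a real EC2 client; returns a predictable image ID."""

    def copy_image(self, **kwargs):
        self.last_request = kwargs
        return {"ImageId": "ami-copy-of-" + kwargs["SourceImageId"]}


backup_client = FakeEc2()
new_image_id = copy_image_to_backup_region(
    backup_client, "ami-placeholder", "us-east-1", "web-server-backup"
)
```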

When deciding on your backup region, pick one far enough from your primary region that the two are not behind the same major internet routers. If you’re used to US-EAST-1, look at CA-CENTRAL-1 as your backup. Its latency is not far off from US-EAST-1, and it’s completely hydroelectric powered!

Disaster Recovery Patterns

AWS provides tooling and documentation to help implement common disaster-recovery patterns. I’ll take you through a high-level introduction to the four major patterns and provide some relevant use cases for each. At the core, these four patterns provide you with a comprehensive range of options to consider based on your needs for data continuity and quality of service. Not surprisingly, costs increase as the required speed of recovery increases.
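As a rough map of the four patterns covered in this section, here is a small decision helper. The 10-minute and 60-minute thresholds are my own simplifying assumptions, loosely based on the recovery times described for each pattern. Treat them as placeholders for your own requirements.

```python
def suggest_pattern(max_recovery_minutes,
                    recovery_must_be_automatic=False,
                    degraded_service_acceptable=True):
    """Suggest the cheapest pattern that meets the stated recovery needs."""
    if not degraded_service_acceptable:
        return "multisite"            # full-size copy always running
    if recovery_must_be_automatic or max_recovery_minutes < 10:
        return "warm standby"         # minimal copy running, automated shift
    if max_recovery_minutes < 60:
        return "pilot light"          # image and database ready, manual start
    return "backup and restore"       # cheapest, slowest


archive_site = suggest_pattern(240)
news_site = suggest_pattern(30)
election_night = suggest_pattern(5, degraded_service_acceptable=False)
```

The checks run from most to least demanding, so each return is the lowest-cost pattern that still satisfies the inputs.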

Backup & Restore

You do keep backups, right? Of course you do! Your phone and laptop are probably configured to perform backups on a schedule or every time you plug your phone into your computer. The backup and restore pattern is the oldest and most familiar of disaster-recovery patterns: You snapshot your data at some frequency and then have it available to restore your data when needed. While backups are most frequently used to restore archival data, either due to accidental deletion or an audit, you can use the same systems to set up a basic disaster recovery plan.

AWS provides a managed backup service for most types of resources, so use these if possible and enable cross-region replication. You will be paying double for S3 storage costs (once per region) but you now can recover within an hour or so in a second region. If you are using EC2 compute instances and Elastic Block Store (EBS) data volumes, enable EBS backups. Since this is block-level storage, the first snapshot will be the full disk and subsequent backups will just include data that has changed, so costs stay low, and you don’t have to worry about lifecycle management. For S3, enable cross-region replication on any buckets required to keep your application running as well as any machine images (AMIs) or other artifacts that your organization needs to keep working.
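A little arithmetic shows why incremental snapshots keep costs low. The numbers below are invented: a volume with 100 GB in use and a week of daily changes. The model is deliberately simple and ignores compression and deleted snapshots.

```python
def snapshot_storage_gb(volume_used_gb, changed_gb_per_snapshot):
    """First snapshot stores every used block; later ones store only changes."""
    return volume_used_gb + sum(changed_gb_per_snapshot)


daily_changes = [2, 5, 1, 3, 2, 4, 2]            # hypothetical GB changed per day
incremental = snapshot_storage_gb(100, daily_changes)
full_copies = 100 * (1 + len(daily_changes))     # if every snapshot were a full copy
```

Under these assumptions, eight snapshots take 119 GB of storage instead of 800 GB.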

Be sure to routinely test your backups. If you haven’t practiced restoring in a few months, assume you won’t be able to when disaster strikes.
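One lightweight way to keep yourself honest is a check that nags when the last restore drill is too old. The 90-day limit is my reading of “a few months,” and the dates are made up for the example.

```python
from datetime import date, timedelta

MAX_DRILL_AGE = timedelta(days=90)   # assumption: "a few months"


def restore_drill_overdue(last_drill, today):
    """Return True when the last successful restore test is older than the limit."""
    return today - last_drill > MAX_DRILL_AGE


overdue = restore_drill_overdue(date(2017, 1, 1), today=date(2017, 6, 1))
recent = restore_drill_overdue(date(2017, 5, 1), today=date(2017, 6, 1))
```

Wire a check like this into whatever monitoring you already have, so that a stale drill shows up as an alert rather than a surprise.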

Pilot Light

The pilot-light pattern takes its name from a home furnace, where a small flame is always burning and ready to ignite the gas. For software disaster recovery, the pilot light consists of a machine image and a database. Keep up-to-date AMIs and other necessary source files in your secondary region via S3 cross-region replication. Leverage RDS for your databases, so that you have a redundant copy of your data in the secondary region. When disaster strikes, create an autoscaling group from your AMI and then update your Route 53 DNS.

Pilot light has similar costs to backup and restore, with the added cost of however much data is in your database. Although you could save some money by only keeping a subset of your data in the secondary database, the administrative headaches are way more trouble than the cost savings for most organizations. Just enable read-replicas for your database in a secondary region and rest easier. Recovery takes a few minutes of manual configuration, and then your application might still feel sluggish for another few minutes as your caches are warming.

Warm Standby

The warm standby is my go-to pattern because we can and should configure it to be totally automated. For this pattern, we need to already have our autoscaling group set up in the secondary region with the minimal instance count that allows the application to run correctly. All caches and support systems also need to be turned on and ready to go. Your autoscaling configuration should be set to scale up to your predicted instance count when traffic starts coming. Within Route 53, add failover routing with a health check between your primary region and secondary region. Assuming things are acting normally in your primary region, you’ll be sending 100 percent of traffic there. If the region becomes unavailable (as determined by failed health checks), 100 percent of your traffic will automatically shift to the secondary region. The warm standby pattern will have close to zero downtime and get you back to full speed a few minutes later, once your autoscaling group has finished expanding.
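One way to get the all-or-nothing traffic shift described above is Route 53’s failover routing policy paired with a health check on the primary. The sketch below builds a change batch in the shape I understand Route 53 to accept (for example, via boto3’s `change_resource_record_sets`). The domain, IP addresses, set identifiers, and health check ID are all placeholders. In practice you would more likely point alias records at load balancers than use raw IP addresses.

```python
def failover_record(name, ip_address, role, set_id, health_check_id=None):
    """Build one UPSERT change for a failover-routed A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": 60,   # short TTL so clients notice the switch quickly
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Changes": [
        failover_record("app.example.com.", "192.0.2.10", "PRIMARY",
                        "primary-us-east-1", "placeholder-health-check-id"),
        failover_record("app.example.com.", "198.51.100.10", "SECONDARY",
                        "secondary-ca-central-1"),
    ]
}
```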

Additionally, the automated recovery is worth the additional costs to mitigate your worst-case scenario of your operations guru being sick, asleep, on vacation, or otherwise unavailable.


Multisite

Can’t bear the thought of any downtime or degraded service level? The multisite pattern consists of an identical topology in two regions.

Everything is running all the time, whether it is serving traffic or not. Since all resources in the secondary region are scaled to production size already, you should see zero degradation of service when Route 53 shifts traffic to the secondary region.

You’ll be paying double for your infrastructure 24/7, but in return you can provide faster service to all of your users all of the time with latency-based routing. Route 53 can send traffic to whichever region has the lowest latency for each user. You might end up with 50 percent of traffic to each region or maybe 90 percent to one and 10 percent to the other. Regardless, you can stop thinking in terms of a primary region and a secondary region and instead think of regions similar to availability zones: put enough resources in each region to withstand an outage in any single region.
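The routing decision itself is simple enough to simulate. This toy function picks the healthy region with the lowest measured latency for one user. The latency figures are invented, and the real decision happens inside Route 53, not in your code.

```python
def route_by_latency(latency_ms_by_region, healthy_regions):
    """Pick the healthy region with the lowest latency for this user."""
    candidates = {
        region: ms
        for region, ms in latency_ms_by_region.items()
        if region in healthy_regions
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


toronto_user = {"us-east-1": 28, "ca-central-1": 12}   # hypothetical latencies
normal = route_by_latency(toronto_user, {"us-east-1", "ca-central-1"})
during_outage = route_by_latency(toronto_user, {"us-east-1"})
```

During an outage the user simply lands on the next-best region, which is why each region needs enough capacity to carry the full load.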

What About Threat Modeling?

Although I’m primarily discussing disaster recovery as protection against provider outages, you should certainly consider threat scenarios including direct attacks against your application and data. The tactics I’ve described above to enable disaster recovery also shift a ton of the attack surface to the edge of the AWS ecosystem. The combination of Route 53 and CloudFront provides a lot of distributed denial-of-service (DDoS) protection, so you should be leveraging these two services even if you’re just running small, internal, or secondary systems whose interruption would merely be annoying. The costs for this protection are factored into your hourly rates, so you might as well use it.

If you are storing personally identifiable information or other sensitive information related to confidential sources, I strongly recommend you add a virtual private cloud (VPC) and intrusion detection system (IDS) into your architecture. Combined, these two services will allow you to tightly regulate all traffic coming into your application topology and automatically respond against adversaries. The common architecture pattern is to use the VPC to wrap all of your services, have a public subnet solely to hold web servers, then have a private subnet for databases, caching, and any other kinds of workers that you need. Not every AWS service can be placed within a VPC (e.g. Lambda and SQS), so keep this in mind if you have regulatory or security mandates that require VPC placement for various types of data.
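Here is a minimal sketch of the public/private subnet layout using Python’s standard `ipaddress` module. The 10.0.0.0/16 range and the /24 subnet size are arbitrary choices for illustration, and `may_connect` is a toy stand-in for the rules you would express with security groups or network ACLs.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")   # hypothetical VPC address range
blocks = vpc.subnets(new_prefix=24)         # carve the VPC into /24 subnets
public_subnet = next(blocks)                # web servers live here
private_subnet = next(blocks)               # databases, caches, workers


def may_connect(source, destination_subnet):
    """Public tier accepts anyone; private tier accepts only in-VPC tiers."""
    source_ip = ipaddress.ip_address(source)
    if destination_subnet == public_subnet:
        return True
    return source_ip in public_subnet or source_ip in private_subnet


internet_to_web = may_connect("203.0.113.9", public_subnet)
internet_to_db = may_connect("203.0.113.9", private_subnet)
web_to_db = may_connect("10.0.0.12", private_subnet)
```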

What About Continuous Integration Servers?

If you are using Jenkins or some other self-hosted integration server on EC2, then you can follow the same disaster-recovery pattern for build servers as you use for application servers. Many news apps teams use a managed continuous-integration service like TravisCI or CircleCI to shift operational responsibilities and complexities to a SaaS provider, but be aware that doing so tightly couples you to the disaster-recovery strategy of that provider. This isn’t necessarily bad, but it’s something to consider. How will you build and deploy your code if TravisCI becomes unavailable? You must factor this into your disaster recovery planning.

One way to more loosely couple with TravisCI is to keep all of your build scripts in your source-code repository in some generic form such as a Makefile. Assuming all of your scripts are defined in a Makefile, TravisCI becomes just a virtual machine to run your builds, and the TravisCI web portal is a place to manage environment variables. This pattern makes it really easy to build manually during a disaster-recovery event. You can follow your existing procedures for storing and sharing production keys, load these secrets onto a local machine, and then execute build scripts locally until ready for ad-hoc deployments. In a larger organization, you’ll need to coordinate this secret management with the appropriate IT team to follow your formal or informal statement of operational controls (SOC) policy.
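A local fallback can be as small as a helper that merges your secrets into the environment and hands the work to `make`. The `deploy` target and the `DEPLOY_KEY` variable are hypothetical names. How you load the secrets onto the machine is up to your existing key-handling procedures.

```python
import os


def local_build_command(target, secrets, base_env=None):
    """Return the make command and environment for a manual build."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(secrets)               # secrets override any inherited values
    return ["make", target], env


command, env = local_build_command(
    "deploy",                         # hypothetical Makefile target
    secrets={"DEPLOY_KEY": "loaded-from-your-secret-store"},
    base_env={"PATH": "/usr/bin"},
)
# To run it for real: subprocess.run(command, env=env, check=True)
```

Because the CI service and your laptop both call the same Makefile target, the manual path stays exercised by every normal build.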


AWS gives you great infrastructure and tools, but the onus is on you to understand how to use these tools and deal with their associated risks and costs. I’ve tried to lay out the major topics and patterns in this article, but it’s impossible to exhaustively cover everything. Use this as a starting point for your internal teams to think strategically and tactically about fault tolerance and disaster recovery:

  • Do our backups exist in at least two regions?

  • What disaster-recovery pattern do we follow?

  • When did we last test our disaster-recovery plan?

AWS publishes a ton of excellent whitepapers, so be sure to read some for more tips on building fault-tolerant applications and planning for disaster-recovery situations. If operations is in your job description, definitely get the AWS SysOps Admin Associate certification. The materials from ACloudGuru, LinuxAcademy, and CloudAcademy are all great and absolutely worth your time and money.



  • Dave Stanton

    Dave Stanton is a software architect and technical coach. Currently he works with teams to integrate quality, scalability, security, and accessibility into enterprise mobile and cloud projects. He earned a Ph.D. from the University of Florida by researching the behavioral and cognitive effects of interface design.

