1. AWS Disaster Recovery Introduction
Disaster Recovery (DR) is about preparing for and recovering from a disaster. Any event that has a negative impact on your business continuity or finances could be termed a disaster. This could be hardware or software failure, a network outage, a power outage, physical damage to a building like fire or flooding, human error, or some other significant disaster.
Depending on your Recovery Time Objective (RTO) and Recovery Point Objective (RPO), two commonly used industry terms when building a DR strategy, you have the flexibility to choose the approach that fits your budget. The approach could be as minimal as backup and restore from the cloud, or as extensive as a full-scale multi-site solution deployed both on-site and in AWS with data replication and mirroring.
Recovery Point Objective (RPO) — The acceptable amount of data loss measured in time. For example, if a disaster occurs at 12:00 PM (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 AM. Data loss will span only one hour, between 11:00 AM and 12:00 PM (noon).
Recovery Time Objective (RTO) — The time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA). For example, if a disaster occurs at 12:00 PM (noon) and the RTO is eight hours, the DR process should restore the business process to the acceptable service level by 8:00 PM.
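The two examples above are simple time arithmetic, which can be sketched as follows. This is only an illustration: the function name `recovery_targets` and the calendar date are assumptions, while the noon disaster time, the one-hour RPO, and the eight-hour RTO come from the examples.

```python
from datetime import datetime, timedelta

def recovery_targets(disaster_at, rpo, rto):
    """Return (oldest acceptable recovery point, restore deadline)."""
    recovery_point = disaster_at - rpo    # data written before this must be recoverable
    restore_deadline = disaster_at + rto  # service must be restored by this time
    return recovery_point, restore_deadline

# Hypothetical date; the disaster occurs at 12:00 PM (noon).
disaster = datetime(2024, 1, 15, 12, 0)
point, deadline = recovery_targets(disaster, timedelta(hours=1), timedelta(hours=8))

print(point.strftime("%H:%M"))     # 11:00
print(deadline.strftime("%H:%M"))  # 20:00
```

A shorter RPO pushes the recovery point closer to the disaster, which in practice means more frequent backups or continuous replication.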
With Amazon Web Services (AWS), your company can scale up its infrastructure on an as-needed, pay-as-you-go basis. You get access to the same highly secure, reliable, and fast infrastructure that Amazon uses to run its own global network of websites. AWS also gives you the flexibility to quickly change and optimize resources during a DR event, which can result in significant cost savings.
2. Disaster Recovery Scenarios with AWS
There are four main DR architectures: backup and restore, pilot light, warm standby, and multi-site.
AWS enables you to cost-effectively operate each of these DR strategies. If your application is already running on AWS, then multiple regions can be employed and the same DR strategies will still apply.
Backup and Restore
With this method, restoring your system after a disruption or disaster can take a long time. Amazon S3 is an ideal destination for backup data that might be needed quickly to perform a restore.
For systems running on AWS, you can also back up into Amazon S3. Snapshots of Amazon EBS volumes, Amazon RDS databases, and Amazon Redshift data warehouses can be stored on Amazon S3. Alternatively, you can copy files directly into Amazon S3, or you can create backup files and copy those to S3. Many backup solutions store data directly on Amazon S3, and these can be used from Amazon EC2 systems as well.
The following diagram shows how you can quickly restore a system from Amazon S3 backups to Amazon EC2.
- Simple to get started
- Takes a long time to restore
- Involves many manual steps
- Take a backup of current systems (EC2, RDS, Web server, API…) and schedule these backups.
- Use Amazon S3 to store these backups
- Retrieve backup from S3
- Bring up the required infrastructure
- EC2 instances with API, CMS.
- Load Balancing…
- Restore System from Backup
- Switch over to the new system.
- RTO: long; it depends entirely on how complex your system architecture is.
- RPO: the time elapsed since the last backup.
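Because the RPO for backup and restore is measured from the last backup, the backup schedule directly determines how much data you can lose. The sketch below illustrates this with a hypothetical six-hour schedule; the function names and all timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta

def latest_backup_before(backup_times, disaster_at):
    """Return the newest backup taken at or before the disaster, or None."""
    usable = [t for t in backup_times if t <= disaster_at]
    return max(usable) if usable else None

def data_loss_window(backup_times, disaster_at):
    """Return how much recent data is lost when restoring the latest backup."""
    restore_point = latest_backup_before(backup_times, disaster_at)
    if restore_point is None:
        raise ValueError("no backup exists before the disaster")
    return disaster_at - restore_point

# Hypothetical schedule: one backup every 6 hours (00:00, 06:00, 12:00, 18:00).
start = datetime(2024, 1, 15, 0, 0)
backups = [start + timedelta(hours=6 * i) for i in range(4)]
disaster = datetime(2024, 1, 15, 16, 30)

loss = data_loss_window(backups, disaster)
print(loss)  # 4:30:00
```

In the worst case the loss approaches the full interval between backups, so the interval should not exceed the RPO you have agreed on.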
Pilot Light
Pilot light is used to describe a DR scenario in which a minimal version of an environment is always running in the cloud.
The idea of pilot light is an analogy that comes from the gas heater. In a gas heater, a small flame that’s always on can quickly ignite the entire furnace to heat up a house.
This scenario is similar to a backup-and-restore scenario. With AWS, you can maintain a pilot light by configuring and running the most critical core elements of your system in AWS. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core.
Infrastructure elements for the pilot light typically include the database servers which would replicate to EC2 or RDS. Depending on your system, there might be other critical data outside of the database that needs to be replicated to AWS.
This is the critical core of the system (pilot light) around which all other infrastructure pieces in AWS can quickly be provisioned to restore the complete system.
To provision the remainder of the infrastructure to restore business-critical services, you would typically have some pre-configured servers bundled as Amazon Machine Images (AMIs), which are ready to be started up at a moment’s notice.
When starting recovery, instances from these AMIs come up quickly with their pre-defined role (API, CMS, etc.) within the deployment around the pilot light.
From the networking point of view, you have two main options for provisioning:
- Use Elastic IP addresses, which can be pre-allocated and identified in the preparation phase for DR, and associate them with your instances.
- Use Elastic Load Balancing (ELB) to distribute traffic to multiple instances. You would then update your DNS records to point at your EC2 instance, or point to your load balancer using a CNAME. We recommend this option for web-based applications.
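For the load balancer option, the DNS update can be prepared ahead of time as a Route 53 change batch. The following is a minimal sketch, not a definitive implementation: the helper name `cname_change_batch`, the record name, the load balancer DNS name, and the hosted zone ID are all hypothetical. The dictionary layout follows the `ChangeBatch` structure accepted by the Route 53 `ChangeResourceRecordSets` API.

```python
def cname_change_batch(record_name, target_dns_name, ttl=60):
    """Build a Route 53 ChangeBatch that points record_name at target_dns_name."""
    return {
        "Comment": "DR switchover",
        "Changes": [{
            "Action": "UPSERT",  # create the record, or overwrite it if it exists
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,  # a short TTL lets clients pick up the change quickly
                "ResourceRecords": [{"Value": target_dns_name}],
            },
        }],
    }

# Hypothetical names, for illustration only.
batch = cname_change_batch(
    "www.example.com.",
    "dr-elb-123456.us-west-2.elb.amazonaws.com",
)

# Assuming boto3 is installed and credentials are configured, the batch
# could then be submitted with:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="Z_HYPOTHETICAL", ChangeBatch=batch)
```

Building the change batch separately from sending it makes the switchover easy to review and rehearse before a real event.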
The pilot light method gives you a quicker recovery time than the backup-and-restore method because the core pieces of the system are already running and are continually kept up to date.
AWS enables you to automate the provisioning and configuration of the infrastructure resources, which can be a significant benefit to save time and help protect against human errors.
However, you will still need to perform some installation and configuration tasks to recover the applications fully.
The following figure shows the preparation phase, in which you need to have your regularly changing data replicated to the pilot light, the small core around which the full environment will be started in the recovery phase.
Your less frequently updated data, such as operating systems and applications, can be periodically updated and stored as AMIs.
- Reduces RTO and RPO
- Resources just need to be turned on
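- Costs more than backup and restore
- Enable replication of data across regions (for example, databases on EC2 or RDS cross-region read replicas; RDS Multi-AZ covers failures within a single region).
- Automate the provisioning of services in the backup region, with all supporting custom software packages available in AWS.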
- Switch over DNS to other regions when downtime occurs.
- Recovery phase:
- Start application in EC2 instance from custom AMIs.
- Resize existing database or data store instances to process the increase in traffic.
- Turn on RDS Multi-AZ to improve resilience.
- Change DNS (Route 53) to point at the new EC2 servers (or ELB).
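The first recovery step, starting the application from custom AMIs, can be planned as one launch request per role. The sketch below is an assumption-laden illustration: the AMI IDs, instance type, and helper name `launch_requests` are hypothetical, and the dictionary keys mirror the parameters of the EC2 `RunInstances` API.

```python
def launch_requests(role_to_ami, instance_type="t3.medium", count_per_role=2):
    """Build one parameter set per role, shaped like EC2 RunInstances arguments."""
    requests = []
    for role, ami_id in sorted(role_to_ami.items()):
        requests.append({
            "ImageId": ami_id,
            "InstanceType": instance_type,
            "MinCount": count_per_role,
            "MaxCount": count_per_role,
            # Tag each instance with its role so it is easy to find later.
            "TagSpecifications": [{
                "ResourceType": "instance",
                "Tags": [{"Key": "Role", "Value": role}],
            }],
        })
    return requests

# Hypothetical AMI IDs for the API and CMS roles.
amis = {"api": "ami-00000000000000001", "cms": "ami-00000000000000002"}
plan = launch_requests(amis)

for params in plan:
    print(params["TagSpecifications"][0]["Tags"][0]["Value"], params["ImageId"])

# Assuming boto3 is installed and credentials are configured, each entry
# could be launched with:
#   boto3.client("ec2").run_instances(**params)
```

Keeping the role-to-AMI mapping in one place also gives you a single list to refresh whenever you rebuild the AMIs.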
Warm Standby Solution in AWS
Warm Standby is used to describe a DR scenario in which a scaled-down version of a fully functional environment is always running in the cloud.
A warm standby solution extends the pilot light elements and preparation. It further decreases recovery time because some services are always running. By identifying your business-critical systems, you can fully duplicate these systems on AWS and have them always on.
These servers can run on a minimum-sized fleet of EC2 instances of the smallest sizes possible.
This solution is not scaled to take a full-production load, but it’s fully functional. It can be used for non-production work, such as testing and internal use.
In a disaster, the system is scaled up quickly to handle the production load. In AWS, this can be done by adding more instances to the load balancers and by resizing the small-capacity servers to run on larger EC2 instance types. Horizontal scaling is generally preferred over vertical scaling.
Key steps for preparation:
- Set up Amazon EC2 instances to replicate or mirror data
- Create and maintain AMIs
- Run the application using a minimal footprint of EC2 instances or AWS infrastructure
- Patch and update software and configuration files in line with your live environment
Key steps for recovery:
- Increase the size of the Amazon EC2 fleets in service with the load balancer (horizontal scaling).
- Start applications on larger EC2 instance types as needed (vertical scaling).
- Manually change the DNS records or use Route 53 automated health checks so that all traffic is routed to the AWS environment.
- Consider using Auto Scaling to right-size the fleet or accommodate the increased load.
- Add resilience or scale up your database.
- Reduces RTO and RPO
- Resources are already running at a reduced size and only need to be scaled up
- Can handle production traffic
- Higher cost
- All necessary components run 24/7
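The horizontal scale-up during recovery comes down to estimating how many instances the production load needs and raising the fleet size accordingly. Here is a rough sketch under stated assumptions: the traffic figures, the 25% headroom, the group name, and both function names are hypothetical, and the dictionary keys mirror the Auto Scaling `UpdateAutoScalingGroup` parameters.

```python
import math

def instances_needed(peak_rps, rps_per_instance, headroom=0.25):
    """Return the instance count that serves peak_rps with spare headroom."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return max(1, math.ceil(peak_rps * (1 + headroom) / rps_per_instance))

def scale_up_params(group_name, desired):
    """Shape the result like Auto Scaling UpdateAutoScalingGroup arguments."""
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": desired,
        "MaxSize": desired * 2,  # assumed ceiling so the fleet can still grow
        "DesiredCapacity": desired,
    }

# Hypothetical load: 1800 requests/s at peak, 250 requests/s per instance.
needed = instances_needed(peak_rps=1800, rps_per_instance=250)
params = scale_up_params("dr-web-fleet", needed)
print(needed)  # 9

# Assuming boto3 is installed and credentials are configured:
#   boto3.client("autoscaling").update_auto_scaling_group(**params)
```

Measuring the per-instance capacity during regular DR tests keeps this estimate realistic.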
Multi-Site Solution Deployed on AWS
A multi-site solution runs in AWS as well as on your existing on-site infrastructure, in an active-active configuration.
The cost of this scenario is determined by how much production traffic is handled by AWS during normal operation. In the recovery phase, you pay only for what you use for the duration that the DR environment is required at full scale. You can further reduce cost by purchasing Amazon EC2 Reserved Instances for your “always on” AWS servers.
Key steps for preparation:
- Set up your AWS environment to duplicate your production environment.
- Set up DNS weighting (Route 53 weighted routing) or a similar traffic routing technology to distribute incoming requests to both sites.
- Configure automated failover to re-route traffic away from the affected site.
Key steps for recovery:
- Manually or by using DNS failover, change the DNS weighting so that all requests are sent to the AWS site.
- Have application logic for failover to use the local AWS database servers for all queries.
- Consider using Auto Scaling to automatically right-size the AWS fleet.
- Reduces RTO and RPO (less than 10)
- All necessary components run 24/7
- Very high cost
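With weighted DNS routing, each site receives a share of traffic equal to its weight divided by the sum of all weights, and failover amounts to setting the failed site's weight to zero. The sketch below models that arithmetic only; the site names, the 90/10 split, and the function names are assumptions.

```python
def traffic_shares(weights):
    """Return each site's fraction of traffic: weight / sum of all weights."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one site needs a non-zero weight")
    return {site: weight / total for site, weight in weights.items()}

def fail_over(weights, failed_site):
    """Return new weights that send no traffic to the failed site."""
    return {site: (0 if site == failed_site else weight)
            for site, weight in weights.items()}

# Hypothetical normal operation: 90% on-site, 10% on AWS.
normal = {"on-site": 90, "aws": 10}
print(traffic_shares(normal))  # {'on-site': 0.9, 'aws': 0.1}

# The on-site environment fails, so all requests go to AWS.
recovery = fail_over(normal, "on-site")
print(traffic_shares(recovery))  # {'on-site': 0.0, 'aws': 1.0}
```

The share AWS handles during normal operation is also what drives the cost of this scenario, as noted above.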
3. AWS services DR checklists
To establish a strong DR plan, work through some basic checklists to identify the best strategies for implementing DR.
RDS DR Checklist (AWS recommended)
- Set up production instances in a multi-AZ architecture
- Enable RDS daily backups
- Create a cross-region read replica in the recovery region of your choice
- Maintain DB settings in both regions
- Design system for failure
- Place servers in multiple AZs (or regions)
- Use a decoupled (3-tier) architecture to reduce single points of failure
- Use load balancers, Auto Scaling, and health monitoring for HA
- Maintain up-to-date AMIs
- Back up critical data to S3
- Test the plan regularly.
DevOps Lead at Rainmaker-Labs
AWS Certified Solutions Architect – Associate
AWS Certified SysOps Administrator – Associate