As usual, this article is more of my own notes to my future self but there is so much useful information it’d be rude for me to keep it to myself. If you find any errors in the article, please highlight them and mash that red “R” button on the right side of the screen.
Who is this guide for?
This guide is for any DevOps, Site Reliability engineers, or developers looking to:
- Not spend off-hours triaging/fixing problems with their technology stack
- Improve their skill-set in AWS or even prep for the AWS Certified Solutions Architect – Associate exam
- Design, deploy and maintain reliable applications on AWS
What topics are covered?
- What are Amazon’s Guiding Principles for SRE?
- Determining SRE Objectives
- Error Budgets
- Designing Acceptable Failure
- AWS Global Architecture for networking
- Network Availability and Resilience
- SRE Friendly Databases ( RDS & DynamoDB)
- EC2 & Lambda High Availability
- Kubernetes VS ECS on Amazon Web Services
- Accepting failure in a Multi-Tier Application
- Reliability Patterns
- Failures at Scale
- Extra helpful information
- Glossary of Terms
What are the objectives?
At the end of this article you will be able to answer a ton of questions such as:
- What is an SRE?
- What does an SRE do?
- How do I deploy reliable applications and what does that really mean?
- How to test the health of an application
- How to deploy applications on AWS
- Reliable Architecture Patterns
This isn’t intended to be a replacement for existing documentation on SRE as a practice but more for my own notes combining multiple sources and hopefully provide you with more information about this as a career path. It should walk you through what an SRE is, does, and how it all works with AWS infrastructure.
This article may also help with taking the AWS Certified Solutions Architect – Associate exam.
What is an SRE?
There is no standard job description online for a Site Reliability Engineer (SRE), no standard team structure, and no standard tooling (though there are tons to choose from). So what is it?
In 2003, Benjamin Treynor from Google created “Site Reliability Engineering” as a discipline. The role applies software engineering practices to infrastructure and operational problems, and its main task is to create and maintain highly reliable and scalable systems. Google has an approach located in this guide but it’s based on their infrastructure and that may not necessarily apply to your application(s).
In 2021, the average salary in the US for a Site Reliability Engineer was $123,733 annually, with outliers making upwards of $180K annually (source).
SREs are responsible for Production Readiness Reviews of application design, builds/tests, releases, and operations through engagement, analysis, continuous improvement, refactoring, training, and onboarding new applications.
They will verify a service meets standards for production and operational readiness for development through release activities. They will also verify service owners are prepared to work with SRE teams to improve the reliability in production and/or reduce the impact/severity of incidents that might occur.
In a nutshell, SREs are a “go-between” for developers, DevOps teams, and release management to ensure that when code is deployed, it’s done in the most reliable manner possible. Really, SRE is an opinionated way of implementing DevOps. More than anything, SRE is a mindset of “ownership” and a “questioning attitude” rather than a specific set of tools/tasks.
What are the guiding principles of the DevOps and SRE roles?
The main goals of DevOps teams are to:
- Reduce organizational silos – SREs share ownership with developers to create shared responsibility and reduce organizational silos.
- Implement gradual changes – SREs implement Release Management
- Accept failures as normal business – SLOs and SLIs are the SRE’s implementation of Error Budgets
- Automate leveraging various tools – SREs do this by eliminating toil
By contrast, an SRE team’s main goals include the DevOps practices above, plus a focus on:
- Release Engineering – Fundamentally, SREs see system operations as a software problem, not a hardware problem.
- SLO/SLI – Accepting failures as normal is a big part of this job. Service Level Indicators and Service Level Objectives are a big part of that.
- Error Budgets – plan and accept failures. Measure Everything – SRE defines ways to measure Key indicators (typically SLOs/SLIs).
- Post-mortems – SREs mandate blameless post-mortems
- Eliminate Toil – Automate anything that requires manual intervention. Toil is referred to as “menial tasks that could be automated”.
- Incident Response – Measure Everything – SRE defines ways to measure Key indicators (typically SLOs/SLIs) in a prescriptive manner.
Core Principles of Site Reliability Engineering
What is Availability?
When people talk about Reliability, they normally mean Availability. For example, some web applications boast “99.999% uptime”, also known as “five-nines”. This is just the probability that the application is operational at a given time, based on previous experience. 99.999% uptime allows for about 5.26 minutes of unexpected downtime per year.
As you add more “nines” to the uptime percentage, the system gets more complicated, more expensive, and harder to maintain.
In relation to Availability, you will often hear about RTO or Recovery Time Objective which is a target for returning services online after a failure or disruption.
What is Reliability?
Reliability is one of the key pillars of a “Well-Architected Framework” and the primary focus of the SRE role.
Reliability is how much you can trust a system. More commonly, “Reliability” is the likelihood that a system produces correct outputs for some percentage of the time, and it is improved by tools that avoid, detect, and repair faults. It can be defined mathematically as:
$$\text{Reliability \%} = \frac{\text{\# of Successful Requests}}{\text{\# of Total Requests}} \times 100$$
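The ratio above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the request counts are made up for the example, and in practice the counts would come from whatever metrics source you already have.

```python
def reliability_percent(successful_requests: int, total_requests: int) -> float:
    """Return the share of requests that succeeded, as a percentage."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= successful_requests <= total_requests:
        raise ValueError("successful_requests must be between 0 and total_requests")
    return successful_requests / total_requests * 100


# Hypothetical counts, for example taken from a day of load balancer logs.
print(f"{reliability_percent(999_532, 1_000_000):.3f}%")
```

The same function works for any window of time, which is what makes it usable as an SLI later on.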
What does Reliability mean?
Reliability is the probability that a system will produce expected outputs up to a given amount of time using features that avoid, detect, and repair faults.
Reliability is a much broader term that encompasses:
What are Amazon’s Guiding Principles for SRE?
Amazon crafted the “AWS Well-Architected Framework” which covers a lot of ground such as “Operational Excellence”, “Security”, “Reliability“, “Performance Efficiency”, & “Cost Optimization” but our focus is solely on Reliability.
The Reliability Pillar includes the ability of a system to recover from infrastructure or service disruptions, dynamically acquire more computing resources to meet the demand, and mitigate any disruptions caused by network problems or misconfigured systems. It covers the same topics as the Google implementation but with different language, and Availability is the key tenet of the AWS model outlined below:
- Stop Guessing – Monitor the demand, capacity, utilization and size of our application using tools like CloudWatch.
- Scale-out – Horizontally scale your application to improve reliability. 3 is 2, 2 is 1, 1 is none. Dynamically acquire computing resources to meet the demand you are monitoring.
- Automate Change – Use automation to deploy, develop, and modify your application. Manual steps lead to poor results, so reduce toil wherever possible. Implement change management in a way that de-conflicts potential changes.
- Automate Recovery & fallback – mitigate disruptions
- Test – Create test failure and recovery procedures.
The main premise is to try to answer these 9 questions about your infrastructure in the most reliable way.
To achieve reliability, you must start with the foundations
- An environment where service quotas and network topology accommodate the workload.
- The workload architecture of the distributed system must be designed to prevent and mitigate failures.
- The workload must handle changes on-demand or requirements, and it must be designed to detect a failure and automatically heal itself.
- Before architecting any system, foundational requirements that influence Reliability should be in place.
- For example, you must have sufficient network bandwidth to your data center.
- These requirements are sometimes neglected (because they are beyond a single project’s scope).
- This neglect can have a significant impact on the ability to deliver a reliable system. In an on-premises environment, these requirements can cause long lead times due to dependencies and therefore must be incorporated during initial planning. — (from AWS – Derek Belt)
- How are you managing and monitoring account limits?
- How do you plan out your network topology?
- Can your system adapt to change on demand?
- How do you log, manage and monitor your resources?
- How do you implement and manage changes?
- How do you back up your data? (storage solutions)
- How do you test resilience?
- Is your system able to withstand component failures?
- Do you have a DR (Disaster Recovery) plan?
Determining SRE Objectives
Imagine for a second that you are experiencing a cascading failure. A cascading failure is a catastrophic failure by nature that hasn’t necessarily happened yet. It grows over time as a result of a single or slightly less significant failure. Generally, the earliest failure puts pressure on another service or highlights a misconfiguration then causes increased latency (or even a network failure for one sub-service) and then causes users to not be able to log in for example leading to larger and larger failures. So is it possible to rule out all failures within the stack?
Is 100% reliability really required? Would your users even notice or be largely impacted by 99.99% uptime? Well, 100% is not necessary for most applications but let’s consider for a second that your goal is to be 100% reliable.
Without serious investment across an entire technology stack to achieve 100% reliability, you’d never reach 100%. The more “nines” you add to 99.9% availability, the more expensive it will be. If any single layer of your tech stack doesn’t support 100% reliability, then no layer in your stack can.
How can you predict snow storms that take out data centers in Texas?
If you want to have a reliable service, you have to first define what a reliable service is. Enter SLIs and SLOs.
Service Level _____
What are SLAs?
Most platforms that provide an SLA (service level agreement) with 100% uptime guarantees aren’t actually seeing 100% uptime. These are just agreements, or legal terms, that only grant credits or refunds if the SLA isn’t met. This doesn’t change the fact that your users had a bad experience. As site reliability engineers, we aren’t focused on SLAs. These are typically written by legal teams and are pretty hard to change, but those SLAs are impacted by SLOs/SLIs, which we can influence.
AWS, for example, provides SLAs for their services (not on their global infrastructure because an outage in Africa doesn’t necessarily affect your service). An important distinction here is that AWS’s SLAs are not an SLA for your applications. You are responsible for the things you deploy on AWS, and if you misconfigure something, that isn’t AWS’s responsibility.
|Service Commitment||Definitions||Service Credits||Exclusions|
|Defines the availability objective such as 99.99% uptime.||Specifies the usage terms. Most importantly defines how to measure the availability of a service||States how AWS Compensates customers affected by missed availability objectives||Defines which circumstances are not covered by the service commitment|
So what’s an SLO?
An SLO (Service Level Objective) is a goal. It sets the aim for an SLI that a Product Owner wants to reach.
Ok, so what’s an SLI?
An SLI (Service Level Indicator) – this is a measurement for the SLO.
What is an Error Budget?
Error budgets are the way SRE engages with a development team. This is done by influencing the release management aspect by balancing the reliability with feature releases. The error budget forms a control mechanism for diverting attention to stability as needed.
Error budgets are 1-SLO, or 100% minus your service level objective. If your SLO is 99.9%, then your error budget is 100% – 99.9% = 0.1%. There are 525,600 minutes in a year, so your error budget is 525.6 minutes per year (about 8.76 hours), or roughly 43.8 minutes per month of downtime. That’s a fair amount to work with.
Another example to drive this point home – If your SLO is 99.999% (five nines), then your error budget is 0.001%, or 5.26 minutes per year, or about 26 seconds per month of downtime.
Imagine again for a moment that you didn’t have automated measures to monitor, alert, and roll back changes. Imagine that you had to wait for a customer to call into your busy phone centers, then wait for the customer service rep to ask his team, the team to notice a trend, then the call center escalates to a manager who then notifies the DevOps team. Your 26 seconds were long gone. Sounds crazy? Unfortunately, this is how some organizations without SREs or a robust DevOps team actually handle monitoring faults.
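The error budget arithmetic is easy to get wrong by a factor of ten, so here is a small sketch that computes it. The function name is my own, and the 525,600-minute year is the same figure used in this article; treat it as an illustration rather than a standard tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, the figure used in this article


def error_budget_minutes(slo_percent: float,
                         period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the allowed downtime in minutes for a given SLO and period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return (100 - slo_percent) / 100 * period_minutes


for slo in (99.0, 99.9, 99.99, 99.999):
    per_year = error_budget_minutes(slo)
    print(f"SLO {slo}%: {per_year:.2f} min/year, {per_year / 12:.2f} min/month")
```

Each extra “nine” divides the budget by ten, which is why the cost and effort climb so quickly.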
How do error budgets apply to SREs?
Our primary directive as an SRE is to avoid toil at all costs and automate as much as possible with the aim of keeping the application/services online within our SLOs – this includes monitoring for those errors and alerting on them.
If the Application is meeting our SLO, then the feedback to product owners is that they can keep working towards new features while the SREs support the application.
If the Application has used up its Error budget, then the feedback to the product owner is to SLOw down (get it, “slo” down) feature releases and focus on fixing the issues that are causing SLI/SLO failures. At this point, the SRE would then start stepping up to “owning” the application until the issues are resolved.
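As a rough sketch of that feedback loop, the snippet below measures how much of a request-based error budget is left and turns it into a release decision. The function names, the decision strings, and the request counts are all hypothetical; a real policy would be agreed with the product owner.

```python
def remaining_budget_fraction(slo_percent: float,
                              total_requests: int,
                              failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exactly spent, negative means overspent.
    """
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be in (0, 100)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (100 - slo_percent) / 100 * total_requests
    return 1 - failed_requests / allowed_failures


def release_decision(slo_percent: float,
                     total_requests: int,
                     failed_requests: int) -> str:
    """Map the remaining budget onto the two outcomes described above."""
    remaining = remaining_budget_fraction(slo_percent, total_requests, failed_requests)
    if remaining > 0:
        return "keep shipping features"
    return "slow down and fix reliability"


# Hypothetical month: 1,000,000 requests against a 99.9% SLO.
print(release_decision(99.9, 1_000_000, 400))
print(release_decision(99.9, 1_000_000, 1_500))
```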
Typically we can error budget more for front-end requests than we can for backend and even less budgeting is available for analytics/monitoring systems.
Designing Acceptable Failure
Why does failure happen in traditional (non-cloud/non-SRE) setups?
Failure is inevitable, accepting that and then designing around it is key.
Bare Metal Infrastructure Reliability
Traditionally, infrastructure reliability focused on reducing failures and increasing the average time between failures (“Mean-Time-Between-Failures”).
It did leverage redundancy, but the core architectural consideration was to build something able to withstand the “peak” so most of the hardware is sitting idle most of the time. This is an expensive way to build infrastructure. This does work well for servers at the metal level of the cloud and this is how it is implemented but it doesn’t work at all on cloud platforms hosting your application.
Applications Before SRE
Application Reliability relied on resilient infrastructure (network, clustering, storage) and relies heavily on a DR (disaster recovery) strategy. It also heavily relied on infrastructure security and everything was split into tiers.
What are some of the problems with failures and traditional applications and their implementation of DevOps?
There isn’t a ton of security or reliability baked into this traditional application structure because the application assumes that is the responsibility of the infrastructure. The application focuses on performance. The infrastructure and reliability are simply not the application team’s problem – AKA “Not my problem”. Enter the blame game and silos. This is due to organizational structuring and in some cases, a culture of “not my problem” or the creation of knowledge silos.
Typically, IT organizations will build a solution and there is a lot of “blind faith” in the solutions working. Then when it’s done, there is no focus on continuous improvement because the team who created it is forward-facing. Not necessarily a personality fault but the problem lies in how they are focused solely on “New” things/features, not issues they have already “solved”.
Additionally, the product owners simply don’t have the budget or time to optimize solutions and are structured so that they have a bonus (in some cases) aligned with reaching certain feature milestones.
SRE’s aren’t necessarily aligned with either of these mindsets.
In this scenario, when things fail, bad things happen. People including end-users stop using or don’t trust the solution, symptoms are fixed but not the root cause of the problem and lastly, blame and shade are thrown which isn’t healthy in any organization.
From the application level what is Reliability?
The core tenets of application reliability boil down to these core questions:
- What happens when a component or components fail? (System Architecture and inter-service dependencies.)
- What do we measure or monitor to make sure our services are reliable? (Performance metrics such as availability, latency, and efficiency indicators)
- How will we know if something is less than healthy? (Instrumentation, metrics, and monitoring)
- What do we do when we have a problem or incident? (Emergency Response)
- How do we make sure the problem is fixed or that we release error-free code? (Change management / Release Management)
How do SREs fit into the fold?
Above are just some of the challenges that an SRE will have to overcome and as an SRE, there is limited ability to effect changes to the organization. So with that being said and if things are going to fail no matter what, what do we focus on? The answer is a question: How do we recover from those situations quickly?
We could focus on a few things:
- Refining monitoring and alerting
- Refine SLIs
- Integrate monitoring with recovery activities
- Join triages for post mortem opportunities
- Automate where possible
- Plan out failure recovery not avoidance
- Make the application reliable, not the infrastructure
- Run simulations and game days
- Intentionally fail components
- Learning – One of my favorites 😃
- Chaos testing – Netflix implemented “Chaos Monkey”, which is worth a read.
- Blameless post-mortems
- Write up and publish failure information to the public
These are typically the activities an SRE will be involved in to some varying degree.
AWS Global Architecture for networking
What is the difference between global and Zonal resources on AWS and everything in between?
Let’s talk about the Global, Regional and Zonal resources on AWS.
Your account is the main building block of AWS. It is a boundary for your costs, users, security policies, and sub-accounts. Inside the account, there are regions and edge locations.
What is Regional Availability?
As of writing this, there are 210+ Edge locations and 12 Regional Edge caches, or 220+ Points of Presence (PoPs). These tend to offer a subset of services related to networking – specifically CloudFront and Route 53.
Within a Region (a geo-location-based grouping of resources), you have Availability zones.
What is an Availability Zone (Zonal)?
Inside each Region are multiple Availability Zones (AZs). As of writing, there are 77 AZs across 24 Regions, so roughly 3 per Region. Each AZ is made up of one or more datacenters.
Not leveraging the edge locations and only serving your application from a Zonal resource could add latency to your end user which could cause failures downstream.
Network Availability and Resilience
How do we implement Network Resilience?
Within a region we typically think of network resilience.
Within a region we have a VPC (Virtual Private Cloud) – this is your network. The VPC contains your subnets, and each subnet lives in a single Availability Zone (AZ). To get the most reliable network, you want at least 3 subnets, each mapped to a different AZ.
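To make the “one subnet per AZ” idea concrete, here is a small planning sketch using Python’s standard `ipaddress` module. It only does the address math. The CIDR block, the AZ names, and the `plan_subnets` helper are assumptions for illustration, and it does not call any AWS API.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, availability_zones: list, new_prefix: int) -> dict:
    """Assign one non-overlapping subnet of the VPC range to each AZ."""
    vpc = ipaddress.ip_network(vpc_cidr)
    candidates = vpc.subnets(new_prefix=new_prefix)  # generator of subnets
    plan = {}
    for az in availability_zones:
        try:
            plan[az] = str(next(candidates))
        except StopIteration:
            raise ValueError("VPC range is too small for that many subnets")
    return plan


# Hypothetical VPC range and AZ names.
plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"], 20)
for az, cidr in plan.items():
    print(az, cidr)
```

Losing one AZ then takes out one subnet, and the other two keep serving traffic.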
Some AWS-managed services in some cases (like RDS) take advantage of those subnets. Services like DynamoDB are already resilient in that they span across multiple AZ so you don’t have to worry about those as much from an Architectural stance, just how they link up to your corporate datacenter.
For connecting your Corporate Datacenter use Direct Connect. Direct Connect gives you access to Amazon Web Services as well as your VPC. Having multiple Direct Connects would horizontally scale so that in the event one has failed, your corporate datacenter isn’t disconnected from your AWS services and vice versa.
What is Transit Gateway and how does Multi-Region Network Availability work?
Transit Gateway is used to connect your VPCs and on-prem networks. You can create a Transit Gateway peering connection between transit gateways in different regions.
Before Transit Gateway, connecting VPCs across accounts was challenging and the design was very complicated. In the image above, in order to get VPC1 & VPC2 to talk to your data center, you had to use peering, and it wouldn’t work very well because VPC peering is not transitive. Transit Gateway allows much broader coverage from your VPC/Corporate network across accounts/regions.
What are Global Accelerators?
What about access for your users? Traditionally, if your user wanted to connect to a service in Region X (image above), they’d go through their local ISP and then hop across multiple networks to eventually reach the Region. On a good day, this isn’t an issue, but every hop your requests make adds latency, and if any one of those hops takes longer than usual, your end users have a subpar experience. This is where Edge locations come in.
The Edge locations are like adding a server connected more closely to your ISP, generally in your local area, which improves that latency. AWS Global Accelerator builds on this: it lets you advertise your services at an edge location for quicker routing through the AWS network to reach that service. Health checks are also built into the routing.
Global Accelerator lets you optimize the path for end-users from an edge location to your services and, according to AWS, can achieve up to ~60% better performance. The result is a more resilient network.
What kind of Storage Solutions does AWS offer and how Resilient are they?
What is Amazon S3?
S3 is an object storage solution much like Google Cloud Storage. Here are some key takeaways:
- Storage classes & durability – Storage classes
- Standard S3 – cheaper for upload/download
- Standard Infrequently Accessed (IA) – more expensive for access
💡 To give you virtually unlimited storage, S3 is implemented as a distributed system. You always want to ask what the consistency model is on distributed systems: is it strongly consistent or eventually consistent? Historically, S3 was strongly consistent for reads of newly created objects but eventually consistent for overwrites and deletes, so if you updated a file and read it right after, you could get the old version of the object. Since December 2020, S3 provides strong read-after-write consistency for all objects.
- Encryption (data at rest)
- Encryption (data in transit) – HTTPS
- Versioning – this way you can see the previous version of files and will protect against accidental deletion.
- Access control
- Multi-part upload
- Internet-API accessible
- Virtually unlimited capacity
- Regional Availability
- Highly Durable – 99.999999999% (a lot of nines)
- Cross Region Replication:
What is EC2 Instance Store?
- Instance store is ephemeral. When the instance is terminated, the data on it is deleted.
- Only certain EC2 instances
- Fixed capacity
- Disk type and capacity depends on the EC2 instance type
- Application-level durability.
💡 Generally you want to use the instance store for caching or temporary data that you are storing somewhere else
What is Elastic Block Store (EBS)
- Some EBS volume types can be configured for provisioned IOPS
- A volume generally attaches to only one EC2 instance at a time
- Different volume types
- Encryption
- Snapshots
- Provisioned capacity
- Lifecycle independent of the EC2 instance
- Multiple volumes can be striped to create large volumes
- EBS Hardware types:
- SSD-backed – higher IOPS, lower throughput (good for random access); more expensive
- HDD-backed – higher throughput but lower IOPS (good for sequential access)
💡 Think of EBS as durable attachable storage for your EC2 instances
What is Amazon EFS?
- File storage in the AWS Cloud
- Shared storage
- Petabyte-scale file system
- Elastic capacity (grows up and down with your needs)
- Supports NFS v4.0 and 4.1 (NFSv4) protocol
- Compatible with Linux-based AMIs for Amazon EC2; not supported on Windows
What is Amazon Glacier?
Ideal for long term storage
- Data backup and archive storage
- Vaults and archives
- Encryption by default
- Amazon S3 object lifecycle policy can move data into Glacier after a set period of time.
- Regional availability
- Highly durable – 99.999999999%
SRE Friendly Databases ( RDS & DynamoDB)
How does AWS implement SQL or NoSQL?
|SQL (RDS or Aurora)||vs||NoSQL (DynamoDB)|
|Scale Up – to improve performance you add more memory, more CPU, better disks||vs||Scale horizontally – add more instances|
|High availability by Multi-master or Read Replicas||vs||Resilience using sharding and replication|
|Utilizes Multi-AZ and Cross Regional replication||vs||Utilizes Multi-AZ availability and Cross Regional replication|
|Typically synchronous for standbys and asynchronous for read replicas||vs||Multiple consistency models, eventually consistent is an option|
💡 RDS = relational database
More information about the table above can be found here: – DynamoDB
What is AWS’s method for implementing Multi-AZ deployments of RDS?
RDS (not including Aurora) supports the following server Database Engines:
- Microsoft SQL
It boasts a monthly Uptime of 99.95+% during any billing cycle. Read more on the SLA….
AWS creates a single primary in a single AZ and a standby node in a 2nd AZ. These 2 Databases leverage Synchronous updates to keep in sync.
How do Read Replicas work with RDS to provide High Availability in Multi-Regional Design?
The Read Replica is updated asynchronously from the primary database, so there is some lag between updates. You can set up a replica in a different Region using the same method.
How does High Availability work with DynamoDB?
DynamoDB is a key-value and document NoSQL database. A table is a collection of items, each item is a collection of attributes, and a table is regional. DynamoDB uses the partition key of the Primary Key to shard the data and an optional “sort key” to order items within a partition.
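The sharding idea can be illustrated with a toy in-memory table. To be clear, this is not how DynamoDB is implemented internally: the MD5 hash, the fixed partition count, and the `ToyTable` class are stand-ins I chose so the sketch stays runnable. It only shows the two roles, where the partition key picks the shard and the sort key orders items inside it.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical; real partition counts are managed by the service


def partition_for(partition_key: str) -> int:
    """Hash the partition key to pick a shard (MD5 is only a stand-in here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


class ToyTable:
    """A toy table: items are sharded by partition key, ordered by sort key."""

    def __init__(self):
        self._partitions = [dict() for _ in range(NUM_PARTITIONS)]

    def put_item(self, partition_key: str, sort_key: str, attributes: dict) -> None:
        shard = self._partitions[partition_for(partition_key)]
        shard.setdefault(partition_key, {})[sort_key] = attributes

    def query(self, partition_key: str) -> list:
        """Return (sort_key, attributes) pairs for one key, in sort-key order."""
        shard = self._partitions[partition_for(partition_key)]
        items = shard.get(partition_key, {})
        return [(sort_key, items[sort_key]) for sort_key in sorted(items)]


table = ToyTable()
table.put_item("user#1", "2021-03-02", {"event": "login"})
table.put_item("user#1", "2021-03-01", {"event": "signup"})
print(table.query("user#1"))
```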
DynamoDB boasts a whopping 99.999% uptime during any monthly billing cycle for global tables and 99.99% for a standard SLA. Read more on the SLA: https://aws.amazon.com/dynamodb/sla.
How do Read Replicas work with DynamoDB to provide High Availability in Multi-Regional Design?
Global tables allow you to use an existing table (a prototype table) that is then replicated in other regions. This is all kept in sync with the global table. This is a very resilient way to manage databases in DynamoDB across multiple regions.
EC2 & Lambda High Availability
Fault Tolerant Computation with EC2
What is the difference between virtual machines and serverless?
|Virtual Machine (EC2)||vs||Serverless (Lambda)|
|You specify RAM/Disk/OS/Network/Security||vs||You specify the framework, RAM, and Permissions|
|You configure OS, SDK, and Code||vs||You configure code and deploy|
|You pay for server||vs||You pay for code execution|
|You manage, patch, and secure your server||vs||You run your code|
What is EC2 and how do I make it resistant to failure?
Elastic Compute Cloud, or EC2, is the basic compute building block of AWS. It is a virtual machine that supports a number of operating systems, including Linux distributions such as Ubuntu, Windows, macOS, and more.
It boasts a 99.99% uptime during a billing cycle. More on EC2 SLAs…
For more information about EC2, check out the EC2 User Guide.
Basically, there are 3 things you can do to improve the resilience of your EC2
- Manage the network using elastic IPs and ENI
- Manage your backups using EBS backups
- Think about making full backups of your machine.
Fault Tolerant Computation with Lambda
How reliable are Lambdas?
Lambdas scale automatically in line with any concurrency limits that may have been set. Lambda has a slightly lower availability commitment than EC2, at 99.95% in any billing cycle, and it is calculated based on requests. More on Lambda SLAs…
How do you improve the Lambda Database connection Resilience?
EC2 Load Balancing
How to improve resilience on EC2 using Load Balancing
There are 2 main kinds of Load Balancers:
- Network Load Balancers – Network Load Balancers are best suited for TCP/UDP and Transport Layer Security traffic where performance is key.
- Application Load Balancers – Application Load Balancers are best suited for HTTP and HTTPS traffic – this is more for path-based routing. This method is not as performant but necessary for some applications.
How does autoscaling work for EC2?
It’s not particularly complicated. The main premise is “I have a Minimum Size, a Desired Size, and a Maximum Size”. The autoscaling group defines how you scale out. Paired with an ALB/NLB, the group can use the load balancer’s health checks to decide when instances need replacing.
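The minimum/desired/maximum premise can be sketched as a simple calculation. This is a simplified illustration of proportional scaling toward a target metric. The formula and the `desired_capacity` name are my own assumptions and not the exact algorithm AWS uses.

```python
import math


def desired_capacity(current_instances: int,
                     current_metric: float,
                     target_metric: float,
                     min_size: int,
                     max_size: int) -> int:
    """Scale proportionally toward the target, clamped to [min_size, max_size]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    if not 0 <= min_size <= max_size:
        raise ValueError("need 0 <= min_size <= max_size")
    wanted = math.ceil(current_instances * current_metric / target_metric)
    return max(min_size, min(max_size, wanted))


# Hypothetical group: 4 instances at 90% average CPU, aiming for 60%.
print(desired_capacity(4, 90, 60, min_size=2, max_size=10))
```

The clamp is the important part: the minimum keeps you redundant when traffic is low, and the maximum caps your bill when traffic spikes.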
How does autoscaling work for Lambdas?
Lambdas scale automatically to meet demand. Lambda is event-driven and can handle an initial burst of between 500 and 3,000 concurrent instances, depending on the Region. After that, it scales by an additional 500 instances per minute. From a cost perspective, this can easily rack up expenses. There is a concurrency limit you can set to cap the number of instances created.
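Here is a rough model of that scaling behaviour: an initial burst, then a fixed step per minute, capped by whatever concurrency limit you set. The burst size varies by Region, so it is a parameter here, and the function name and example numbers are assumptions for illustration only.

```python
def max_concurrency(minutes_elapsed: int,
                    burst_limit: int,
                    concurrency_cap: int,
                    per_minute_step: int = 500) -> int:
    """Upper bound on concurrent instances some minutes into a traffic spike."""
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must be non-negative")
    uncapped = burst_limit + per_minute_step * minutes_elapsed
    return min(uncapped, concurrency_cap)


# Hypothetical function with a 500-instance burst and a cap of 5,000.
for minute in (0, 3, 20):
    print(minute, max_concurrency(minute, burst_limit=500, concurrency_cap=5000))
```

Setting the cap lower is the cost-control lever mentioned above, at the price of throttled requests once you hit it.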
There are three event sources Lambda can read from using event source mapping: an SQS queue, a Kinesis stream, or a DynamoDB stream. This is by default an asynchronous process unless you use an SDK/CLI and explicitly invoke the synchronous type – the invocation type helps determine the retries and where they are handled.
Kubernetes VS ECS on Amazon Web Services
What is the difference between Kubernetes and ECS?
|Kubernetes||vs||Elastic Container Services (ECS)|
|Open Source Project||vs||AWS Technology|
|Container Orchestration||vs||Container Orchestration|
|You pay for EKS cluster and EC2/EBS||vs||You pay for EC2/EBS|
|Integrates into a variety of clouds and providers||vs||Integrates into AWS|
|SLA of 99.95% for control plane and 99.99% for data plane (EC2)||vs||SLA of 99.99%|
How is Kubernetes implemented on AWS for Resiliency?
If you aren’t already familiar with Kubernetes, I created an exhaustive guide to teach you everything I know about it, which I recommend reading. You can find that article here.
Amazon has their own flavor of hosted Kubernetes called EKS. I’ll go into further detail on EKS in a later section but you can read more on EKS here...
What is ECS (Elastic Container Service) and how is it implemented on AWS for Resiliency?
ECS provides an option to use Fargate. If you aren’t sure what Fargate is, you can see a video overview here: https://youtu.be/4CHu1ErN51o.
You can deploy a number of EC2 instances in or out using an autoscaling group, each one runs on an ECS agent.
You use ECS to describe your container’s workload using “tasks” and you can deploy and scale it using services effectively providing resilience with ECS like providing resilience with EC2 Autoscaling.
It leverages a lot of AWS services under the hood from APIs, EC2, Autoscaling Groups, and AMIs. I’m not sure as of writing this, how likely ECS is to stick around with tooling like EKS.
Accepting failure in a Multi-Tier Application
What is a 3-tier (N-tier architecture) and what are the problems with it associated with Cloud infrastructure?
When you start considering containers and microservices, the 3-tier approach fails. This approach isn’t necessarily a bad thing.
Let’s look at micro-services in detail.
What challenges are there with the 3-tier approach to Resilience for our Microservices?
The main challenge with this architecture type is that all layers tend to be tightly coupled, so code releases happen across all layers.
As tiers are segmented, tracing an individual request/transaction becomes very challenging from a monitoring perspective.
This can work great for a single web application but as your ecosystem grows so do the networking and security control management challenges as well as cost.
As you begin to deploy microservices and containers, you will see why this architecture isn’t ideal for resiliency.
How do we design infrastructure resilience for our Microservices and what is an “app mesh”?
You can see in the diagram the individual tiers of our application and, in grey, the Amazon SLA uptimes. The lowest infrastructure SLA is 99.9% (about 43.83 minutes of allowed downtime per month); however, the combined SLA may be lower.
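To make the downtime figure and the “combined SLA” remark concrete, here is a minimal sketch of the arithmetic. It assumes the tiers sit in series (a request needs all of them) and fail independently, which is a simplification. The tier values in `tiers` are hypothetical and are not taken from the diagram.

```python
from math import prod

# Average month length in minutes (365.25 days / 12 months).
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Minutes per average month a service can be down and still meet its SLA."""
    return MINUTES_PER_MONTH * (1 - sla_percent / 100)


def combined_sla_percent(sla_percents: list[float]) -> float:
    """Availability of a serial chain: every component must be up at the same time."""
    return prod(p / 100 for p in sla_percents) * 100


# Hypothetical tiers, for illustration only.
tiers = [99.99, 99.9, 99.95]
combined = combined_sla_percent(tiers)

print(round(allowed_downtime_minutes(99.9), 2))  # 43.83
print(combined < min(tiers))                     # True
```

The point of the second print is that a chain is always a little less available than its weakest link, so the number you can promise your users is lower than any single SLA on the diagram.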
Let’s make some infrastructure choices around ALB, Kinesis, and Aurora that will provide a higher SLA and be balanced with the needs of the application.
One thing to consider is that DynamoDB streams and App mesh have no SLA. So how do we cope with that?
How a single micro-service availability directly impacts reliability.
Remember our conversation about cascading failures from the Accepting Failure section? Well, that’s precisely the challenge we are going to look into here: how one service being “down” degrades the whole system’s reliability.
One thing that SREs can do to effect change here is to apply communication standards (logging standards) across services.
In summary, the SRE can improve availability and resilience, and mitigate the risks of “interservice” communication between micro-services, by using proven cloud design patterns to address key areas:
- Rate limits
- Circuit breakers
- Health checks
- Improve CI/CD pipelines so that you can release code reliably
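Of the areas listed above, rate limiting is the easiest to show in a few lines. Below is a minimal token-bucket sketch, one common way to implement a rate limit. The class name `TokenBucket` and its parameters are my own for illustration, and it is not thread-safe as written.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the call may proceed, False if it should be rejected."""
        now = self.clock()
        # Add the tokens earned since the last call, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


limiter = TokenBucket(rate=5, capacity=10)
print(limiter.allow())  # True: the bucket starts full
```

A caller that gets `False` back would typically return an HTTP 429 or queue the request instead of passing the load on to a struggling downstream service.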
What is a service mesh?
- The Service Mesh embeds a proxy inside your pods. Your app container no longer communicates directly with the network; it uses the “sidecar” proxy, which also handles auto-discovery of other containers.
- This traffic can also be secured with something like mutual TLS, so there is some certificate signing and management within your mesh.
- This is beneficial because the sidecar proxy sits on the network path, so it can send “Telemetry” from across the stack.
What is App Mesh?
Amazon has their own flavor of a service mesh called “App Mesh”, which is supported across ECS, EKS, and Kubernetes on EC2 and uses the Envoy proxy as a sidecar.
On Kubernetes, it leverages custom resource definitions to build custom resources inside the API layer. The App Mesh Controller works with 3 Kubernetes custom resource definitions:
- Mesh
- Virtual Service
- Virtual Node
App Mesh can span container platforms, whether on EKS or on-prem. This allows you to control the access that services have and to apply policy controls without changing the application itself.
Other popular solutions include Kong Mesh and Apigee.
How do we manage state in a Multi-Tier Application?
Managing “State” is complicated so I’m not going to go in super depth here but I am going to touch on 4 key questions.
- How does a platform manage service state reliably?
- How do we manage service configurations reliably?
- How do we manage a User session reliably?
- How do we manage application data reliably?
1. How does a platform manage service state reliably?
Platform State – this is covered in more detail in my Kubernetes section on Stateful Sets, which I recommend reading, but I’ll touch on it here for the purposes of this article.
Platform state in a stateful application is managed using a “StatefulSet”. It manages the deployment and scaling of individual pods and guarantees the ordering and uniqueness of those pods.
These StatefulSets are useful for clusters where order, network names, and storage must be consistent.
Like a deployment, StatefulSets manage pods that are based on identical container specs but unlike a Deployment, a StatefulSet maintains a sticky identity for each of the pods.
StatefulSets are valuable for the following application requirements/applications:
- You need stable and unique network identifiers, where the Pod name or IP might otherwise change
- You need stable and persistent storage
- You need ordered, graceful deployments and scaling
- You need ordered and automated rolling updates
StatefulSets require a headless service which does not contain a ClusterIP – instead it creates endpoints that are used to produce DNS records that are individually bound to a pod.
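To show what that “sticky identity” looks like in practice, here is a small sketch that builds the per-pod DNS names Kubernetes produces for a StatefulSet behind a headless service. The helper name `statefulset_pod_dns` is mine, and the StatefulSet name `web` and service name `nginx` are only example values. The default cluster domain `cluster.local` is an assumption that your cluster may override.

```python
def statefulset_pod_dns(statefulset: str, service: str, replicas: int,
                        namespace: str = "default",
                        cluster_domain: str = "cluster.local") -> list[str]:
    """Return the stable DNS name of each pod in a StatefulSet.

    Pods are named <statefulset>-<ordinal>, with ordinals starting at 0, and
    the headless service gives each pod its own DNS record.
    """
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


for name in statefulset_pod_dns("web", "nginx", replicas=3):
    print(name)
# web-0.nginx.default.svc.cluster.local
# web-1.nginx.default.svc.cluster.local
# web-2.nginx.default.svc.cluster.local
```

Because the names are derived from the ordinal rather than from the pod IP, a replacement pod for `web-1` comes back with the same name and can reattach to the same storage.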
2. How do we manage service configurations reliably?
When we start configuring those containers, we can use platform stores like Kubernetes Secrets and ConfigMaps.
Kubernetes Secrets are not encrypted by default; they are simply encoded in base64, and anyone with access can decode them. They also only work within Kubernetes, so if you are using Lambda functions, for example, you’d need an external configuration service like HashiCorp Vault, AWS Secrets Manager, or, if you wanted to invest in it, a custom configuration store. Amazon also offers AWS Systems Manager as a solution.
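If the base64 point sounds abstract, this short sketch shows how little protection the encoding gives. The secret value is made up for the example; the encoded string stands in for what you would see under `data:` in a Secret manifest.

```python
import base64

# Hypothetical secret value, encoded the way it would appear in a Secret manifest.
encoded = base64.b64encode(b"s3cr3t-password").decode("ascii")

# Anyone who can read the manifest can reverse the encoding with no key at all.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded)  # s3cr3t-password
```

Base64 is an encoding, not encryption, so access control on who can read Secrets matters far more than the encoding itself.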
Things to consider:
- Ability to automatically rotate keys
- Number of requests
- The variety of container platforms
- Operational overhead – who is going to run/own it?
- Framework/SDK support
3. How do we manage a User session reliably? (When should you externalize your session state?)
You want to externalize your session state when:
- Throughput/number of requests is getting high
- Cost of the platform vs the cost of building a solution yourself
- You are using multiple container platforms
- The costs are manageable
- You have support for your SDK/framework with the application
4. How do we manage application data reliably?
As a general principle, micro-services use decentralized data management, where each micro-service encapsulates its own data. That leads to a number of challenges.
Typically in a large enterprise, you may have very large databases, so who is going to pay/budget for breaking up existing databases and data warehouses?
Another challenge is that copying data can lead to inconsistencies in your application data and lastly, distributed systems have a lot of components.
One solution is a Database Proxy which:
- Follows an operational pattern
- Allows you to effectively reuse network connections
- Adds caching
- No proxy needed because it’s scaling horizontally
Another solution is using change data capture, either implicitly through DynamoDB Streams, SQS queues, or through Kinesis. This is a great fit for event sourcing, which provides not only the current state but also an ordered history of the events that led to the problem you ran into.
It can help provide an audit source, help build compensating transactions if something goes wrong, and help with CQS.
One of the interesting bits in the diagram above is we have a micro-service E that’s talking to an RDS data source that is being populated through change data captures. So every time something changes that can be written into a separate datastore which is then queried by the Micro-service.
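As a rough sketch of that idea, the code below applies a stream of change events, in order, to a separate read model that a micro-service could then query. The `ChangeEvent` shape and the `apply_changes` helper are hypothetical; real DynamoDB Streams or Kinesis records look different, and a plain dict stands in for the separate datastore.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    sequence: int                 # position in the ordered change history
    key: str                      # identifier of the record that changed
    op: str                       # "put" or "delete"
    value: Optional[dict] = None  # new state of the record for "put" events


def apply_changes(read_model: dict, events: list[ChangeEvent]) -> dict:
    """Replay change events in sequence order to bring the read model up to date."""
    for event in sorted(events, key=lambda e: e.sequence):
        if event.op == "put":
            read_model[event.key] = event.value
        elif event.op == "delete":
            read_model.pop(event.key, None)
    return read_model


# Events may arrive out of order; the sequence number restores the history.
events = [
    ChangeEvent(2, "order-1", "put", {"status": "shipped"}),
    ChangeEvent(1, "order-1", "put", {"status": "created"}),
    ChangeEvent(3, "order-2", "put", {"status": "created"}),
    ChangeEvent(4, "order-2", "delete"),
]
print(apply_changes({}, events))  # {'order-1': {'status': 'shipped'}}
```

Keeping the events themselves, not just the resulting state, is what gives you the audit trail and the raw material for compensating transactions.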
One solution to avoid is a centralized application data store, since it breaks the first principle of having data encapsulated within each micro-service. Nevertheless, you can use it. Read more on the problems associated with a centralized data service.
This isn’t a comprehensive list of data patterns, but if you’d like to read more on data architectural patterns, you can do so here.
Reliability Patterns
What are typical things you need to do within your micro-service or application to alleviate the infrastructure and data reliability issues we may see?
We will look at how you manage health, timeouts, and retries.
We will also look at typical implementation of things like circuit breakers and bulkheads that help you scope and manage reliability.
Finally we will look at compensating transactions.
How do you determine if your application is healthy?
Imagine for a second you had an endpoint, hosted on your service, that reports the health of your application. Would we want to query that endpoint every few seconds if each query checked the database directly? Probably not, because you may cause cascading problems that make an unhealthy state worse. Basically a DoS (Denial of Service) attack on yourself!
Instead, we run the health check from inside our application. In this case, get_db_health() runs as a separate thread, and all it does is get the database health at regular intervals and write the state to a variable, some sort of in-memory structure, or a file. Then all your get health method does is query that in-memory data structure. There is no direct relationship with the dependency itself, so you can hit that health endpoint as many times as you want and it’ll only ever return the status from the last poll.
The other benefit is that health is multidimensional. While you always want to return a 200, it’s probably also important to include the payload details on the other aspects of the service, network, database, or cache.
One thing to note about health endpoints is that they are usually not authenticated so you want to make sure the information provided in terms of payload be a code that can be looked up or at the least not sensitive in nature.
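Here is a minimal sketch of that cached health check using a background thread. `get_db_health()` is a stand-in that always reports `"ok"`; in a real service it would run a cheap probe against the database. The names `poll_health` and `get_health` and the 5 second interval are my own choices for the example.

```python
import threading
import time

# Last known health, shared between the poller thread and the endpoint handler.
_health = {"database": "unknown", "checked_at": None}
_lock = threading.Lock()


def get_db_health() -> str:
    """Hypothetical probe; replace with a cheap real check against your database."""
    return "ok"


def poll_health(interval_seconds: float, stop: threading.Event) -> None:
    """Refresh the cached health state at a fixed interval until asked to stop."""
    while not stop.is_set():
        try:
            status = get_db_health()
        except Exception:
            status = "error"
        with _lock:
            _health["database"] = status
            _health["checked_at"] = time.time()
        stop.wait(interval_seconds)  # sleeps, but wakes early if stop is set


def get_health() -> dict:
    """What the health endpoint returns: a copy of the cache, never a live check."""
    with _lock:
        return dict(_health)


stop = threading.Event()
threading.Thread(target=poll_health, args=(5.0, stop), daemon=True).start()
```

However often a load balancer calls `get_health()`, the database only sees one probe per interval, and the payload can grow to include the network or cache status in the same dictionary.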
How do you handle retries and Timeouts?
In this example, imagine that micro-service A is talking to micro-service B and something happens at micro-service B and the call fails. So what you will want to do is retry.
What happens when other services are retrying at the same time? If all of them use the same timeouts and the same retry interval, the situation can cascade by heavily overloading micro-service B.
Instead, you can use an exponential back-off for better flow control where you introduce randomness to your wait time. Most exponential back-off solutions use jitter (randomized delay) to prevent collisions of requests in succession to a micro-service. AWS does this on their services and recommends it.
For timeouts, it’s important to introduce a specific deadline to terminate requests. This will also have a significant impact on the reliability of a micro-service.
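A minimal sketch of retries with exponential back-off, jitter, and an overall deadline might look like this. The function name `call_with_retries` and the default values are assumptions for illustration, and the jitter strategy shown (a random wait between zero and the current back-off ceiling) is just one of several variants.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      deadline_seconds=10.0, sleep=time.sleep):
    """Call `operation`, retrying failures with exponential back-off and jitter.

    Gives up when attempts run out or when the next wait would pass the deadline.
    """
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The ceiling doubles each attempt; the random draw spreads callers out.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            delay = random.uniform(0, ceiling)
            if time.monotonic() - start + delay > deadline_seconds:
                raise  # waiting again would blow the overall deadline
            sleep(delay)


attempts = {"count": 0}


def flaky_call():
    """Hypothetical call to micro-service B that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("micro-service B is unavailable")
    return "ok"


print(call_with_retries(flaky_call))  # ok
```

In a real client you would only retry errors that are safe to retry (timeouts, throttling) and only for idempotent requests.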
What is a circuit Breaker?
Setting retries and timeouts with hard deadlines is important, but getting them set up with the correct values is rather hard. This is why circuit breakers are important: they remove some of the guesswork.
These work in the same way an electrical circuit breaker would: when excessive failures are observed downstream, the circuit breaker will “trip” and take preventative action to isolate its service or dependencies and avoid cascading failures.
In the image above, while the circuit breaker is closed it counts failures, and micro-service A can talk to micro-service B. When a threshold of failures is reached, the circuit to micro-service B is opened.
Open means the downstream service is treated as offline; closed means business as usual. Half-open means it’s sending, but in a reduced form, to test whether the downstream service has recovered.
This means you can leverage retries and timeouts to reduce the impacts of a downstream service in a degraded state.
This is an excellent way of automating fail-overs for micro-services.
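The three states can be sketched in a few dozen lines. This is a simplified, single-threaded illustration; the class name `CircuitBreaker`, the threshold of 3 failures, and the 30 second reset timeout are assumptions, and production libraries add locking, metrics, and per-error policies.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream marked unhealthy; failing fast")
            self.state = "half-open"  # timeout elapsed: let a trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Success: reset the failure count and close the circuit.
        self.failures = 0
        self.state = "closed"
        return result


breaker = CircuitBreaker()
print(breaker.call(lambda: "response from micro-service B"))
```

While the breaker is open, micro-service A gets an immediate `CircuitOpenError` instead of waiting on timeouts, which is the point where it can fall back to a cache or a degraded response.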
What is a bulkhead and what do walls have to do with resiliency?
A bulkhead is what we called walls in the Navy but really it’s a boundary that isolates one section of the ship from another in case of flooding. In infrastructure, we use this concept to isolate sections of the application.
We group microservices together so we can manage the network, throttling, routing, and failure detection to that grouping. Normally this is done with a service mesh.
This is a good process to go through to think of the policies and controls you’d want in place to avoid cascading failures.
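A service mesh applies bulkheads at the network level, but the same idea can be sketched inside a single process by capping how many concurrent calls one dependency may consume. The `Bulkhead` class below is a hypothetical illustration of that in-process variant, not how a mesh implements it.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slots left."""


class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot exhaust shared capacity."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Reject immediately instead of queueing, so callers are never blocked.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment is full; rejecting call")
        try:
            return operation()
        finally:
            self._slots.release()


# One compartment per downstream grouping (the names here are made up).
payments = Bulkhead(max_concurrent=10)
reporting = Bulkhead(max_concurrent=2)
print(payments.call(lambda: "charged"))  # charged
```

If the reporting dependency hangs, at most two callers are stuck in its compartment, and the payments path keeps its own ten slots.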
What are Compensating Transactions?
Compensating transactions are an important part of a large distributed application. Each microservice has a very specific scope of tasks, and often those tasks are designed to fit into a workflow: a list of tasks that have to be done by different microservices.
Let’s consider that each micro-service uses eventually consistent data sources or data synchronizations. How would you manage the transactional integrity of your loosely coupled microservices?
These workflows are typically modeled as a step function: you make calls to micro-services A through D in sequence. Failures earlier in the process are easier to roll back than a failure at the final step, for example.
Each step has a create action for a specific micro-service running, for example, on Lambda or EKS. In this case, if micro-service D fails, we want to roll back through micro-services C, B, and eventually A.
So in each one of your steps, you can create failure actions for each one of your microservices.
Each step can have partial or full compensating workflows which you can specify in your workflow engine, jBPM, etc. You could even use event sourcing to figure out what the compensating pattern is based on the state changes.
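To make the A through D example concrete, here is a minimal sketch of a workflow runner that undoes completed steps in reverse order when a later step fails. The `run_workflow` function and the step tuples are hypothetical; a real workflow engine would persist progress and handle a compensation that itself fails, which this sketch does not.

```python
from typing import Callable

# (step name, action, compensating action)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_workflow(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"{name} failed")
            # Undo the most recent successful step first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name} compensated")
            return log
        completed.append((name, compensate))
        log.append(f"{name} done")
    return log


def noop():
    pass


def fail():
    raise RuntimeError("micro-service D is unavailable")


steps = [("A", noop, noop), ("B", noop, noop), ("C", noop, noop), ("D", fail, noop)]
print(run_workflow(steps))
# ['A done', 'B done', 'C done', 'D failed', 'C compensated', 'B compensated', 'A compensated']
```

Note that a compensating action is a new business operation (refund the charge, release the reservation), not a database rollback, which is why each micro-service has to define its own.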
Failures at Scale
What can we do operationally to survive failures at scale?
The SREs will help the Product Owners and developers by providing subject matter expertise in terms of the AWS infrastructure, CI/CD, and incident management processes to remove information silos, improve communication, and overall reliability.
Aside from adding resilience in our application infrastructure, what are some of the changes we may need to implement to support a more global application?
Start with a good operating model: use at least 2 SREs alternating morning and evening shifts, typically Monday – Friday 6a – 2p and 2p – 10p, who support the AWS platform and automate all aspects to eliminate toil. Just as important is formulating a plan for extended hours, such as on-call coverage through PagerDuty.
At this point, it’s a great idea to focus on SLOs/SLIs to help identify areas of improvement and provide meaningful metrics to the Product Owners and developers so that they can continue to reliably release new features that support the business needs. This allows the Product Owners the ability to be engaged and have a stake in the overall performance of the application while providing necessary oversight for critical dependencies.
This will lead to informed release engineering processes.
After your foundation for a release engineering process is complete, focus on incident management processes and tooling. One of the most important jobs of the SRE is producing blameless post mortems. Here is a simple template you could use:
|Timeline of events||Fact-based timeline of events that led to the incident|
|Impact||Quantify how many sessions, users, or services were impacted and what the impact was|
|Root Cause||Identify the main cause(s) of the problem|
|Trigger||What event was the catalyst to this incident|
|Resolution||Steps taken to reconcile the issue|
|Where we got lucky||Identify any lucky breaks or events that made the impact less severe|
|What went wrong||List the specific non-compliant steps that led to the failure/incident|
The SRE will really be tasked with running with the solutions identified in the Post Mortems and bringing them to life.
Lastly, as more SREs are introduced to the application teams, modify the team structure to support individual applications / micro-services and specialize specific roles for individual teams.
In this article we learned that failure of applications is inevitable, that SRE is an essential and prescriptive methodology for implementing DevOps that minimizes the impact of those failures on an organization, and how to do it all on AWS infrastructure.
While this isn’t an exhaustive guide on the implementation of SRE on every cloud platform (and certainly isn’t the final word on how to go about it), it does cover most of the bases on AWS and should provide a framework to use as a guide for implementing SRE best practices when considering AWS infrastructure.
I’d love to hear what you think and learn more about how your organization has implemented Site Reliability Engineering within AWS.
Extra helpful information
- Learn Kubernetes
- Google’s Free Site Reliability Books you can read online
- Google’s SRE Landing Page
- SLO Adoption and Usage in SRE
- Chaos Monkeys
- EC2 User Guide
- Architecture patterns
- Stateful Sets from Kubernetes
- My Kubernetes section on Stateful Sets
- Problems associated with a centralized data service
- Compensating Transactions
- API retries
Glossary of Terms
- ACID: (atomicity, consistency, isolation, durability) is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. In the context of databases, a sequence of database operations that satisfies the ACID properties (which can be perceived as a single logical operation on the data) is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction, [source](https://en.wikipedia.org/wiki/ACID).
- Availability: The ability of a system to recover from infrastructure failure/disruptions. The amount of time an application/server is operating as a percentage of total time it should be operating.
- Availability Zone: An availability zone is a logical data center in a region available for use by any AWS customer. Each zone in a region has redundant and separate power, networking and connectivity to reduce the likelihood of two zones failing simultaneously. A common misconception is that a single zone equals a single data center.
- Control Plane: In network routing, the control plane is the part of the router architecture that is concerned with drawing the network topology, or the information in a routing table that defines what to do with incoming packets. Control plane functions, such as participating in routing protocols, run in the architectural control element. In most cases, the routing table contains a list of destination addresses and the outgoing interface(s) associated with each. Control plane logic also can identify certain packets to be discarded, as well as preferential treatment of certain packets for which a high quality of service is defined by such mechanisms as differentiated services, [source](https://en.wikipedia.org/wiki/Control_plane).
- Correctness: In theoretical computer science, correctness of an algorithm is asserted when it is said that the algorithm is correct with respect to a specification. Functional correctness refers to the input-output behavior of the algorithm (i.e., for each input it produces the expected output) source: https://en.wikipedia.org/wiki/Correctness_(computer_science).
- CQS: Command–query separation (CQS) is a principle of imperative computer programming. It was devised by Bertrand Meyer as part of his pioneering work on the Eiffel programming language. It states that every method should either be a command that performs an action, or a query that returns data to the caller, but not both. In other words, asking a question should not change the answer. More formally, methods should return a value only if they are referentially transparent and hence possess no side effects, source.
- Disaster Recovery: Disaster Recovery involves a set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. Disaster recovery focuses on the IT or technology systems supporting critical business functions, as opposed to business continuity, which involves keeping all essential aspects of a business functioning despite significant disruptive events. Disaster recovery can therefore be considered a subset of business continuity. Disaster Recovery assumes that the primary site is not recoverable (at least for some time) and represents a process of restoring data and services to a secondary survived site, which is opposite to the process of restoring back to its original place, source.
- Error Rate: The measurement of the effectiveness of a communications channel. It is the ratio of the number of erroneous units of data to the total number of units of data transmitted, source.
- Eventually Consistent: Eventual consistency is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Eventual consistency, also called optimistic replication, is widely deployed in distributed systems, and has origins in early mobile computing projects. A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence. Eventual consistency is a weak guarantee – most stronger models, like linearizability, are trivially eventually consistent, but a system that is merely eventually consistent does not usually fulfill these stronger constraints, source.
- Freshness: The proportion of the data that was updated more recently than some time threshold. Ideally this metric counts how many times a user accessed the data, so that it most accurately reflects the user experience, [source](https://sre.google/workbook/implementing-slos/).
- jBPM: jBPM is a toolkit for building business applications to help automate business processes and decisions. jBPM originates from BPM (Business Process Management) but it has evolved to enable users to pick their own path in business automation. It provides various capabilities that simplify and externalize business logic into reusable assets such as cases, processes, decision tables and more, source.
- Latency: Latency is the delay between a user’s action and a web application’s response to that action, often referred to in networking terms as the total round trip time it takes for a data packet to travel, source.
- Load Balancer
- Multi-Regional: This solution deploys a reference architecture that models a serverless active/passive workload with asynchronous replication of application data and failover from a primary to a secondary AWS Region. To verify that regional failover is working, a sample photo-sharing web application can also be deployed, serving as a visual demonstration for the backend layers. This solution allows for a 15-minute Recovery Point Objective (RPO) and a Recovery Time Objective (RTO) of a few seconds, source.
- Product Owner: In agile teams, this is the person who plans along with a scrum master their product team’s workload.
- POP locations: Points of presence – this is generally the number of locations of worldwide edge server locations. A higher number means there is likely a closer edge location to the end-users reducing latency.
- Primary Keys: In the relational model of databases, a primary key is a specific choice of a minimal set of attributes that uniquely specify a tuple in a relation. Informally, a primary key is “which attributes identify a record”, and in simple cases are simply a single attribute, source.
- Reliability: How much you can trust a system. Reliability is the likelihood that a system produces correct outputs for some percentage of the time, and it is improved by tools that avoid, detect, and repair faults. It can be defined mathematically as:
Reliability % = (# of Successful Requests / # of Total Requests) x 100
- RDS: Relational Database Service
- RTO or Recovery Time Objective: a target for returning services to online after a failure or disruption.
- Service Level Agreement: This is more of a legal term. It determines what recourse you have, such as getting your money back or receiving credits, if an agreed-upon service availability isn’t met.
- Shard: A database shard, or simply a shard, is a horizontal partition of data in a database or search engine. Each [shard](#shard) is held on a separate database server instance, to spread the load. Some data within a database remains present in all shards, but some appear only in a single shard, source.
- SRE: In a nutshell, SREs are a “go-between” for developers, DevOps teams, and release management to ensure when code is deployed, it’s done in the most reliable manner possible. Really, SRE is an opinionated way of implementing DevOps. More than anything, SRE is a mindset of “ownership” and a “questioning attitude” rather than a specific set of tools/tasks.
- Telemetry: Telemetry is the in situ collection of measurements or other data at remote points and their automatic transmission to receiving equipment (telecommunication) for monitoring. The word is derived from the Greek roots tele, “remote”, and metron, “measure”. Systems that need external instructions and data to operate require the counterpart of telemetry, telecommand. Although the term commonly refers to wireless data transfer mechanisms (e.g., using radio, ultrasonic, or infrared systems), it also encompasses data transferred over other media such as a telephone or computer network, optical link or other wired communications like power line carriers. Many modern telemetry systems take advantage of the low cost and ubiquity of GSM networks by using SMS to receive and transmit telemetry data, source.
- Two-Phase Commit: In transaction processing, databases, and computer networking, the two-phase commit protocol (2PC) is a type of atomic commitment protocol (ACP). It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction on whether to commit or abort (rollback) the transaction (it is a specialized type of consensus protocol). The protocol achieves its goal even in many cases of temporary system failure (involving either process, network node, communication, etc. failures), and is thus widely used. However, it is not resilient to all possible failure configurations, and in rare cases, manual intervention is needed to remedy an outcome. To accommodate recovery from failure (automatic in most cases) the protocol’s participants use logging of the protocol’s states. Log records, which are typically slow to generate but survive failures, are used by the protocol’s recovery procedures. Many protocol variants exist that primarily differ in logging strategies and recovery mechanisms. Though usually intended to be used infrequently, recovery procedures compose a substantial portion of the protocol, due to many possible failure scenarios to be considered and supported by the protocol, source.