- Resiliency is the ability to maintain acceptable service levels through, and beyond, severe disruptions.
- Cloud-based systems are only inherently more resilient if they are designed to be so.
- One way to improve resiliency is to have the correct levels of redundancy in place.
- Whilst cloud-based systems may have some redundancy built in, there will be single points of failure within the solution unless you ensure there aren’t!
- Businesses should define their Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs), and systems should be designed in accordance with them.
What do we mean by resilience and redundancy?
When building mission-critical technology for demanding users, resilience and redundancy must be understood.
Resilience means that your service will continue to work whilst issues are ongoing. You can think of this concept in a similar way to a run-flat tyre. You can still drive, usually more slowly, although in some cases at normal speed.
You can achieve resilience by using redundancy. This means having additional provisions in place to take over from others in the event of a failure. The takeover might happen instantly, in a similar manner to planes being able to run on fewer engines, and this is where redundancy adds to resilience. It is really important for a plane to have this instant redundancy built in, but in other cases a delay might be acceptable.
To continue the tyre analogy, you might have a spare tyre. You’ll have to stop and change to the spare, but then you can carry on your journey. In some circumstances not having a spare tyre could also be fine because you are happy to rely on roadside assistance.
To make the appropriate choice for you, weigh each of these options against your risk appetite and your ability to cope with a delay.
Redundancy can support the creation of a highly resilient system, but you have to make sure it is quick enough for your needs in the event of failure.
Both of these concepts help in the event of failure or disaster but are applicable in different situations depending on business needs.
What should businesses be thinking about?
When we perform due diligence assessments, one of the things we look for is that the business’s expectations or assumptions are in line with the technology’s capability.
Considerations about resilience and redundancy need to be balanced against costs and business expectations, and should support a well-understood and well-considered disaster recovery plan.
Your plan should contain the definition of the business’s Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs).
Understanding how critical your applications are to business processes is a good starting point for defining RTOs and RPOs. Consider questions such as:
- How much downtime can we tolerate?
- Are there alternatives that can be used? Pen & Paper? Manual processing?
- Will activities queue up and is that ok?
- How much money do we lose if this isn’t working?
Once you understand how fast you need to bring a system back to operation, you will be able to make an informed decision about your choices of architecture and infrastructure to support the required levels of resilience and redundancy.
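Comparing agreed objectives with what the current setup can deliver could be sketched roughly as follows. This is a minimal illustration, not a definitive method: the `RecoveryObjectives` structure, the `check_objectives` function and all the figures are hypothetical, and it assumes that worst-case data loss equals the backup interval and that the restore time comes from an actual recovery test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets agreed with the business for one system (hypothetical structure)."""

    rto_minutes: float  # longest acceptable time to restore service
    rpo_minutes: float  # longest acceptable window of lost data


def check_objectives(objectives, backup_interval_minutes, tested_restore_minutes):
    """Return a list of gaps between the objectives and the current setup.

    Assumption: the worst case for data loss is a failure just before the
    next backup, so the potential loss equals the backup interval.
    """
    gaps = []
    if backup_interval_minutes > objectives.rpo_minutes:
        gaps.append(
            f"RPO missed: up to {backup_interval_minutes} min of data could be lost, "
            f"target is {objectives.rpo_minutes} min"
        )
    if tested_restore_minutes > objectives.rto_minutes:
        gaps.append(
            f"RTO missed: the last tested restore took {tested_restore_minutes} min, "
            f"target is {objectives.rto_minutes} min"
        )
    return gaps


# Hypothetical figures: a 1-hour RTO and 15-minute RPO, hourly backups,
# and a restore that took 45 minutes in the last recovery test.
order_system = RecoveryObjectives(rto_minutes=60, rpo_minutes=15)
for gap in check_objectives(order_system, backup_interval_minutes=60, tested_restore_minutes=45):
    print(gap)
```

With these made-up numbers the restore time is within the RTO, but hourly backups cannot satisfy a 15-minute RPO, which is the kind of mismatch between expectation and capability that is worth finding before an incident.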
Reviewing your expected RTO and RPO against any third-party services you use is important. Some suppliers describe availability using terms like “Five Nines”, or 99.999% availability, which in reality allows for about 5.26 minutes of downtime a year. Achieving these levels of uptime and availability in your own platforms will require effort, planning and design, so it is also worth asking whether you actually need them.
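The conversion from an availability percentage to permitted downtime is simple arithmetic, sketched below. The function name is illustrative, and the calculation assumes a 365-day year and that downtime is measured across the whole year, which may differ from how a particular supplier's SLA defines its measurement period.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes_per_year(availability_percent, days_per_year=365):
    """Minutes of downtime a year permitted by an availability percentage.

    Assumption: a 365-day year by default; pass 366 for a leap year.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return days_per_year * MINUTES_PER_DAY * unavailable_fraction


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} minutes per year")
```

On these assumptions, 99.9% allows about 525.6 minutes (a little under nine hours) a year and 99.99% about 52.6 minutes, so each extra nine cuts the allowance by a factor of ten.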
Even in the cloud, you need to prepare for disaster. Testing your disaster recovery plans and ensuring that they meet the expectations of the business is just as important as having them. An untested plan is not useful in an emergency! If a disaster has already happened and that’s when you start thinking about resilience, redundancy, and your disaster recovery plan for the first time, you might have left it a little too late.
So what does this mean for the cloud?
A lot of people see the 99.99% in the SLAs provided by cloud providers such as Azure or AWS, rely on it, and do not architect for the resilience they really need as a business. Outages are the cloud provider’s problem… right? Wrong.
As we’ve seen in recent times, zone or regional outages, although rare, can happen in AWS, Azure or GCP. If all your applications are in a single zone or region, you are carrying a risk. If there is a failure, you are at the mercy of your cloud provider to bring it back up.
Cloud service providers such as Amazon Web Services (AWS) intrinsically enable various aspects of resiliency. Some companies outsource their resilience and redundancy by using SaaS services such as Dynamics CRM 365 and signing up to full geo-resilience. However, they may neglect to add the same controls around all the systems that feed into those services. A chain is only as strong as its weakest link, after all.
During diligence processes, we tend to identify these weak links as single points of failure. An example could be a single Virtual Machine (VM), hosted in a physical data centre, running a business-critical application. Some organisations think that running a single VM in the cloud is inherently more resilient than running it in a data centre, but this is not the case.
Even a Platform-as-a-Service (PaaS) service such as an Azure App Service running in a single zone is only negligibly better. It is true that PaaS does offer an increased level of fault tolerance versus Infrastructure-as-a-Service (IaaS) offerings like virtual machines. Using a PaaS service means that the cloud provider is responsible for ensuring that the service is up and running should issues be identified. However, it still needs to be configured appropriately, for example by setting useful health checks.
If your self-managed virtual machine breaks, it’s down to you to fix it wherever it is hosted. All of these hosting options still result in a single point of failure which should be mitigated by increased redundancy or a more resilient hosting option, such as enabling multiple availability zone support.
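The effect of a weak link, and of adding redundancy to it, can be estimated with some simple probability. The sketch below is a rough model, not a statement about any provider's SLA: the function names and availability figures are made up for illustration, and it assumes that components fail independently of one another, which real zones and services do not always do.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumption: component failures are independent.
    """
    return prod(availabilities)


def redundant_availability(single, copies):
    """Availability of several identical copies when any one copy is enough.

    Assumption: copies fail independently and failover is instant.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single) ** copies


# Hypothetical figures: a geo-resilient SaaS service fed by a single VM.
saas = 0.9999
single_vm = 0.995

chain_with_single_vm = serial_availability([saas, single_vm])
chain_with_two_vms = serial_availability([saas, redundant_availability(single_vm, 2)])

print(f"single VM feeding the SaaS service: {chain_with_single_vm:.6f}")
print(f"two redundant VMs feeding it:       {chain_with_two_vms:.6f}")
```

With these assumed figures, the chain as a whole is slightly less available than the single VM on its own, however resilient the SaaS service is. Adding a second, independent copy of the weak link lifts the whole chain much closer to the availability of its strongest component.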
Ensuring your software has the appropriate level of resilience and redundancy for your business is something which needs to be thought about at all levels. Business owners need to be involved in defining and identifying what their expectations are for RTO and RPO and these should be tested regularly to demonstrate they are achievable.