Cloud outages – How to prepare

by Alexander Weiß

The cloud never fails. At least that’s what cloud providers want us to believe. Reality has shown several times that even well-managed clouds can and do fail. The problem, however, is not that they fail, but that most people aren’t prepared for such failures because they think the cloud never fails. In this post you’ll find some tips on how to prepare for cloud outages.

Although cloud services are known for their high availability, they are still bound by Murphy’s law: “Anything that can go wrong will go wrong”. So it’s not surprising that even the best cloud services have had outages. Whether we talk about Amazon’s AWS, Microsoft’s Azure, Googlemail or any other cloud service, most of them have failed in the past and most of them will fail again at some point in the future.

Do these outages make the cloud a less reliable option when you need highly available services? Absolutely not. When I talk about outages, I don’t mean that every part of the cloud was down. The very few AWS outages, for example, were mostly limited to a single Availability Zone. That’s also why some web services were affected during the last AWS outage in October while others weren’t. But even some web services that should have been affected showed no sign of malfunction. It seems they had a working failover system.

How to prepare for cloud outages

Although the cloud provider does not implement the failover system for you, most providers offer failover features. So even if one of the cloud’s data centers goes down completely, there should always be another one that can continue to run your services. This is what makes the cloud a very good option if you need highly available services.
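
To make the idea of failing over to another data center a bit more concrete, here is a minimal sketch in Python. It assumes the same service is reachable in two zones and simply picks the first one whose health check passes. The endpoint URLs and the `fake_health_check` function are hypothetical placeholders, not part of any provider’s API; in a real setup the check would be an actual probe of your service, and the switch would usually happen in DNS or a load balancer rather than in application code.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose health check passes, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated like a failed check.
            continue
    return None


# Hypothetical endpoints in two zones, listed in order of preference.
ENDPOINTS = ["https://zone-a.example.com", "https://zone-b.example.com"]


def fake_health_check(endpoint: str) -> bool:
    # Assumption for the demo: zone A is down. Replace with a real probe.
    return "zone-a" not in endpoint


active = pick_healthy_endpoint(ENDPOINTS, fake_health_check)
print(active)  # prints https://zone-b.example.com
```

The order of the list encodes the preference: the primary zone comes first, and the standby is only used when the primary fails its check.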

6 tips for cloud outage preparation

The key to preparing for outages and keeping your service running is proper planning. Here are six tips that should help make your web services resilient to cloud outages:

  1. The most important part: Don’t believe the marketing promise that clouds never fail. They will. That’s why you should examine the cloud provider’s infrastructure closely: How does the provider prepare for the failure of components or even complete zones?
  2. A cloud outage often begins with the failure of a single component. But this single failure often starts a chain reaction, and more often than not the result is that whole zones collapse.
  3. What will you do if you realize that your web service is down? You’ll probably want to fix it. But a cloud outage usually affects not only you, but a lot of people. All of them will do what you are about to do: they will try to fix the problem. A server that is booting up needs many more resources than one that is running under normal load. Now imagine that not only your server is booting up, but a thousand servers are: this can cause serious delays in the startup process.
    And what do you do if your servers are not starting as fast as you are used to? Well, you try to start them again, adding more stress to the infrastructure. But the problem isn’t solved yet, so you start the server again and again and again. This behavior can cause huge load spikes and can be another source of failure.
  4. Cloud providers don’t plan for failover of your services. They just provide the tools. So it is your job to plan and implement your own failover system.
  5. If you have a failover plan, you have to test it. By testing I don’t mean a quick check to see if the failover works. You have to test the failover system under real-world conditions. In particular, the workload has to be of the same magnitude as during peak hours. Only then will you know how long it takes until the failover kicks in and what problems may come up. One tool to simulate traffic is Apache JMeter.
  6. You have to do the test under real-world conditions not only to see if everything is technically alright. The test is also a very good lesson for the people who are involved. They will learn how to cope with the situation, and it will be easier for them to follow your plans in an emergency if they have done so before. That’s the same reason why schools practice fire drills regularly. The formula “if you do things more often, you will get better at mastering them” also applies to failover planning. So arrange a failover test at least once, and preferably on a regular schedule.
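
The restart storm described in tip 3 can be softened on your side by not retrying immediately. One common technique is exponential backoff with random jitter: each failed attempt is followed by a randomized wait whose upper bound doubles, so a thousand clients don’t all retry at the same moment. The following Python snippet is only a sketch under assumptions: `start_server` stands for whatever call starts your instance and is hypothetical, as are the default values for `base`, `cap` and `attempts`.

```python
import random
import time
from typing import Callable, Iterator, Optional


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one randomized delay in seconds per retry ("full jitter")."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The upper bound doubles with each attempt but never exceeds the cap.
        upper = min(cap, base * (2 ** attempt))
        yield rng.uniform(0, upper)


def start_with_backoff(start_server: Callable[[], bool], attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call start_server up to `attempts` times, waiting between failures."""
    delays = backoff_delays(attempts - 1)
    for _ in range(attempts):
        if start_server():
            return True
        delay = next(delays, None)
        if delay is None:
            break  # No retries left, so don't wait after the last failure.
        sleep(delay)
    return False


# Demo with a hypothetical server that only comes up on the third attempt.
state = {"calls": 0}


def fake_start() -> bool:
    state["calls"] += 1
    return state["calls"] >= 3


# The sleep function is stubbed out so the demo doesn't actually wait.
print(start_with_backoff(fake_start, sleep=lambda seconds: None))  # prints True
```

The jitter is the important part here: plain exponential backoff without randomness would still let all affected customers retry in synchronized waves.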

History has taught us that cloud outages will happen. But this doesn’t mean that the services you are running in the cloud will have outages, too. If you have a working failover plan and your employees are trained adequately, the cloud service usually offers the right instruments to prevent an outage of your service. All you have to do is create a failover plan and test it diligently.

However, planning and testing cost time and money. You have to weigh that against the damage an outage would do. But even if the web service is not your primary way to generate revenue, an outage will damage your image and may turn clients away. If you want to run your business professionally, outages shouldn’t happen. So start preparing for failure now.
