Cloudy, with a chance of outage

18 January 2016

The cloud — it’s great, you know.

It’s resilient, dependable, adaptable, elastic and best of all, it’s cheap. It’s generally provided as a service so you don’t have to worry about the plumbing and it is maintained by an army of minions even more cleverer than those wee yellow chaps with the goggles and the denim.

At least that’s the theory.

However, a cursory glance at the records for the past year shows that hardly a month went past without an outage from either a major service provider or a major operator of such services.

The year started with a planned outage by US giant Verizon that would take services down over a weekend for some 40 hours. Despite the displeasure of some users, and many a raised eyebrow from the wider community, the reason given was infrastructure upgrades that would strengthen their offerings and, ironically, prevent future downtime.

But month after month, services from Google’s Compute Engine to Microsoft’s Azure and Amazon’s AWS and EC2 fell prey to unexpected outages and downtime.

Here at home, April saw Three suffer a data centre problem that took down service for its own users and for those of its customer networks, affecting around 1.5 million in total. Close to home too, November saw Telecity Group suffer a power issue in its London data centres.

See the table for a highlight-picking, very selective, not very comprehensive, but fairly representative catalogue of outages.

However, one thing in common across all of the instances quoted is that they were unplanned failures. They were not malicious attacks nor hacks nor sabotage. These were fat-finger, failed hardware, botched configuration, run of the mill, just-cocked-it-up incidents. Stuff blew up, out, over or whatever. In the classic manner of unexpected incidents everywhere since the beginning of time, stuff happened that no one was expecting.

Now, in many instances there were unforeseen consequences of these failures too. Sometimes, as with the Amazon failure in September that was described as a ‘cascading’ event and probably started with a database problem, an incident can spread due to dependencies.

But, I hear you say, aren’t these services supposed to be resilient, to fail over, around, across and generally overcome such holes in the infrastructure?

Well yes, and no.

They are when behaving in a manner that has been foreseen and handled by the intelligence built into the systems. They are not when something fundamental happens, such as configuration errors, incorrect base information (such as dates and times) or the failure of supposedly fault-tolerant or redundant systems.

As has been seen with several major outages, something happens and a cluster of circumstances, none of which alone might be that unusual or of any great import, comes together with an actual failure to brew up a storm.

The resulting outage then becomes a source of major interest for the likes of this old hack, who make much of it and report it across the wires to all and sundry, tut-tutting at the failure of a cloud service!

Although Netflix released it to the public a couple of years ago, and many a pundit has advocated the approach, if not the tool itself, few organisations use Chaos Monkey to actively and randomly test the resilience of their infrastructure. For those not familiar, it is a software tool that randomly roams infrastructure taking out nodes and generally having an effect like random failures, just to see if things are as resilient, redundant and generally cloudy as they are supposed to be. Now, Netflix itself is not immune to the random failure or fat-finger issue either, as evidenced by its own glitch in November, which it acknowledged affected streaming services but for which it gave little technical detail. But it has certainly strengthened its architecture, and there are even tutorials about to allow you to build your own.
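The idea of taking out nodes at random can be tried on paper before it is tried on production. What follows is a toy simulation in Python, not Netflix's tool or anything resembling its API: the service layout, the function names and the rule that a service stays up while each tier keeps at least one node are all assumptions made for illustration.

```python
import random


def kill_random_nodes(nodes, kill_count, rng):
    """Pick `kill_count` nodes at random and return the set of survivors."""
    victims = set(rng.sample(list(nodes), kill_count))
    return {node for node in nodes if node not in victims}


def service_is_up(survivors, tiers):
    """Assumed rule: the service is up if every tier still has one live node."""
    return all(any(node in survivors for node in group) for group in tiers.values())


def chaos_trial(tiers, kill_count, trials=1000, seed=0):
    """Return the fraction of random-kill trials that the service survives."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    nodes = [node for group in tiers.values() for node in group]
    survived = 0
    for _ in range(trials):
        survivors = kill_random_nodes(nodes, kill_count, rng)
        if service_is_up(survivors, tiers):
            survived += 1
    return survived / trials


if __name__ == "__main__":
    # Hypothetical layout: the cache tier has no redundancy at all.
    tiers = {
        "web": ["web-1", "web-2", "web-3"],
        "db": ["db-1", "db-2"],
        "cache": ["cache-1"],
    }
    print("survival rate with one random kill:", chaos_trial(tiers, kill_count=1))
```

In this made-up layout the only way a single kill brings the service down is by hitting the lone cache node, which is exactly the sort of hole such random testing is meant to expose.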

But how many times have you heard of a major cloud service provider being sued for an outage? Not very often. And why? Because even with something like a five 9s uptime SLA (99.999%), even large outages in many cases do not actually break SLAs. It is often the scale, geographically or in terms of the number affected, that grabs the headlines and gives the impression of severity. In actual fact, for individual organisations, something like the February Compute Engine outage of 40 minutes or so may not have come close to breaking the terms and conditions.
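The arithmetic behind an uptime percentage is worth a quick look, because the downtime budget depends heavily on the percentage and on the window it is measured over. Here is a minimal sketch, assuming downtime is simply the total outage minutes in a 30-day month or a 365-day year. Real SLAs tend to define and measure downtime far more narrowly, often per region or per customer, which is one reason a headline outage need not breach them. The function names are illustrative only.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent, period_days):
    """Downtime budget, in minutes, for an uptime target over a period."""
    total_minutes = period_days * MINUTES_PER_DAY
    return total_minutes * (1 - uptime_percent / 100)


def breaches(outage_minutes, uptime_percent, period_days):
    """True if a single outage alone exceeds the budget (naive definition)."""
    return outage_minutes > allowed_downtime_minutes(uptime_percent, period_days)


if __name__ == "__main__":
    # Budgets for a few common targets over a 30-day month and a 365-day year.
    for target in (99.9, 99.95, 99.99, 99.999):
        month = allowed_downtime_minutes(target, 30)
        year = allowed_downtime_minutes(target, 365)
        print(f"{target}%: {month:.2f} min per month, {year:.2f} min per year")
```

On these assumptions, a 40-minute outage sits inside a 99.9% monthly budget (about 43 minutes) but outside a 99.95% one (about 21.6 minutes), so the exact figure and the measurement window in the contract matter a great deal.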

That is little comfort to an organisation that is paying for an always-on, ubiquitously available service when it fails. But in reality, it is probably still more reliable, faster, more available and of a generally higher quality than most organisations can provide for themselves. So while expectations may be high, higher perhaps than SLAs, the reality is that few outages from the major cloud service or platform providers actually take from their promised or contracted commitments.

Over the coming years, as these services mature, we will certainly see improvements: cloud outages will become rarer, and the reporting of an outage will become more of an event. This will perhaps skew perceptions so that, like the reporting around a plane crash, it gives the impression of greater frequency than is actually the case. That’ll be our fault again.

What one wonders is, if one were on the inside of one of these magic factories, what incidents occur that are handled by the system, the engineers and the general minions. I, for one, would love to hear about the disasters that were averted. There’d be a script or two in that. Probably.

2015 Cloud outages

January – Verizon planned outage for infrastructure work to prevent future outages – 40 hours

February – Google Compute Engine outage – 40 minutes

March – Google Compute Engine outage – configuration issue

March – Apple iCloud – DNS error

March – Azure, Central US – 2-hour outage

April – Three: power failure in DC hits own and Tesco Mobile network

April – Bloomberg: combined hardware and software failure

May – iCloud again, 500m users affected globally for up to 7 hours

July – NYSE, United Airlines and WSJ: ‘technical issues’

July – Office365 hosted Exchange outage during WPC

August – AWS outage