Holes in the cloud
1 November 2012
Many of the IT industry’s biggest names in cloud services have seen their reputations tarnished by service outages in the last twelve months.
From Microsoft to Google, Apple to Amazon, they’ve all fallen over at some point, illustrating in the process the danger for cloud-using customers of depending too much on any one remotely delivered service. At the same time, the business case for adopting cloud-based services remains attractive – reduce overheads, move capital expenditure to operating expenditure, improve flexibility and adapt to market conditions quickly – and Irish companies haven’t been slow to jump on board.
"There’s much more choice now when it comes to cloud-based offerings and companies now have access to lots of interesting new ways to get to the market quickly and provide IT back to the business in the most cost effective manner," said Tony Quinn, sales director of IT services company Arkphire.
"We’re increasingly seeing that when the business case for a project goes up to the board for approval, it almost invariably gets bounced back with the instruction to see if a more cost effective way to deliver it using cloud providers can be found. We respond to that by helping companies to understand which applications they should manage themselves, and which they should look for cost effective remote suppliers such as cloud service providers."
According to Quinn, an important first step is for the IT department in an organisation to take a hard look at the applications they are supporting, and ask which ones are revenue generating and are really critical to the business, and which ones sit around the edges, like payroll, CRM and email.
"These applications round out the IT service but aren’t core. What we’re seeing is that if they can put these out into the cloud, they are," he said.
While opting to use cloud-based services and infrastructure undoubtedly offers many business benefits, there are some potential problems too, and foolish is the company that thinks these don’t represent significant threats. The most pressing issue, of course, is what happens if and when the service falls over.
From broken fibre and copper cable to security issues, there are many reasons why service outages take place. But the bigger question is how companies should handle an outage. The standard answer is to make sure that a service level agreement (SLA) is in place between the service provider and consumer, but in the opinion of many industry experts, far too much faith is placed in SLAs – effectively pieces of paper on which a service provider promises to make things good if there is a problem.
"We’ve come across people who think that because they’ve outsourced a service to a third party, it becomes the third party’s problem if there’s a service outage. In effect they’ve washed their hands of it," said Michael Conway, director of Renaissance Contingency Services.
"But while it may be someone else’s problem, if there is a failure, it will be the end user that does the suffering. When people are outsourcing to the cloud, they need to be clear about what they’re doing and they need to be sure they’re outsourcing to somebody who’s doing things properly."
Renaissance specialises in business continuity planning, insulating companies from the worst effects of threats that might disturb their ability to do business.
"That means anything from the loss of traditional internal IT services to the loss of a core supplier of raw materials. It’s about looking at the supply chain, and if a cloud service powers a core part of what you do, then you really need to take a long hard think about what becoming reliant on that service means," said Michael Conway.
"Our experience is that many companies in that situation don’t actually look very aggressively at their supply chain and that’s something that they really need to do. They need to have a resilient infrastructure in place and they need to have some serious SLAs in place that actually mean something."
According to Conway, companies need to make sure that the services and suppliers they work with are truly up to speed on what to do if there is a major service interruption.
This is a serious issue, according to Microsoft’s Martin Cullen, who suggests that the first question a company thinking of adopting a cloud-delivered service should ask itself is "will moving to the cloud improve or reduce the potential uptime of its services?"
"To be honest, this hasn’t been a significant issue for us. If you look at our infrastructure, customers are getting enterprise-class architecture used to provide the service to them. How does that stand up against your existing environment? For most people, that’s a major improvement of the service level," he said.
In effect, his point is that when it comes to the major players in the cloud business, the Microsofts, Amazons, Googles and so on, the sheer scale they can offer means that they can provide a more reliable service than all but the largest multinationals could build from scratch for themselves. When problems occur, the resources they can bring to bear in repairing them are enormous. In short, for most normal-sized companies, associating with brands like this will improve potential uptime, not diminish it.
"We offer a service that’s based on a data centre that’s built to the highest specifications, to the highest worldwide standards. There’s a huge amount of confidence in that for the customer. We can’t afford to have our brand tarnished by poor service levels, so not having outages happen is extremely important to us," said Cullen, who is director of small and mid-sized business and partners for Microsoft.
Despite this, it remains a fact that even the biggest global players in this area have fallen down, and when that happens companies left clutching SLAs need to know they’re worth something. With this in mind, Cullen argues that not all SLAs are the same, and that companies should scrutinise the deals they’re signed up to.
"I’ve had customers tell me they have existing SLAs that guarantee them a 75% uptime and the provider in question is beating that uptime comprehensively. That sounds good, but a 75% uptime equates to a potential downtime of 1.25 days a week. They’re beating their targets because their targets are so atrocious. What business would be happy with that?" said Cullen. "You have to be in the 99.9% uptime range to be viable. That’s the range we’re in and that we’re committed to delivering."
Downtime is typically measured using the so-called ‘nines’ scale: ‘one nine’ equals 90% uptime or, conversely, 36.5 days of downtime a year. Two nines (99%) equals 3.65 days offline a year, three nines (99.9%) equals 8.76 hours, four nines (99.99%) equals 52.56 minutes and five nines (99.999%) equals 5.26 minutes of downtime per calendar year.
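For readers who want to check an SLA figure for themselves, the arithmetic behind the scale can be sketched in a few lines of Python. This is only an illustration: the function name `downtime_hours` and the assumption of a 365-day, non-leap year are ours, chosen to match the figures quoted above, and a real SLA may measure availability over a month or a quarter instead.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap calendar year


def downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the maximum downtime, in hours, that an availability
    percentage permits over a period of the given length."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100 - availability_pct) / 100 * period_hours


# Print the permitted downtime per year for one to five nines.
for pct in (90, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours(pct)
    print(f"{pct}%: {hours / 24:.2f} days = {hours:.2f} hours "
          f"= {hours * 60:.2f} minutes")
```

Passing a different period gives the weekly picture. `downtime_hours(75, 7 * 24)` works out at 42 hours, or 1.75 days of a seven-day week, while `downtime_hours(75, 5 * 24)` gives 30 hours, or 1.25 days, which suggests the 1.25-day figure quoted earlier is based on a five-day working week.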
"If you go to the mass cloud providers, the Amazons, the Googles, the Microsofts, what you tend to find is that they offer service levels that wouldn’t be acceptable for core services that are crucial to the business," said Quinn of Arkphire.
"You won’t get five nines availability, you’ll get the three nines availability and for some applications that’s okay. It’s much more cost effective than doing it yourself and the risk is acceptable. But it’s important to find the right service level agreements that are commiserate with the importance of the service being farmed out."
"For revenue generating applications, you need to be absolutely guaranteed that it’s available all the time. Any company thinking of going to the cloud needs to be confident that they can get at least as good a service level as they could if they kept it in house and managed it themselves. The service provider has to be able to provide the expertise, resilience of environment and of communications infrastructure that’s necessary and for the mission-critical and revenue generating applications, the service level necessary is much more higher than for some of the peripheral applications."
When outages happen, no amount of good intentions will help the end user left in the lurch. For this reason, only the most foolish of cloud users neglect to scrutinise their SLAs forensically or, better still, to test their providers.
"What exactly does your SLA say? Who is the SLA with? Have you tested it? How effective is it?" said Mark Kellet, CEO, Magnet Networks.
Kellet suggests that there is a long line of questions that any competent chief information officer needs to know the answer to. Not being able to answer these is a worrying sign of unpreparedness.
"How senior is your contact in your service provider? If the service goes down and you need to get access to someone who can fix the problem, can you actually get that person on the phone? If they’re not around, who’s your next contact in the escalation path?" he said.
"What’s the remedy if something does happen? In the case of a network problem, who’s going to actually come out to fix the copper or fibre if it’s broken? What’s your service provider’s incentive for fixing the problem? Are there real penalties for them, or is it a token discount on your bill?"
Magnet Networks provides the infrastructure for some of Ireland’s biggest IT service providers, as well as for many SMEs. The kind of outage insurance each company needs can vary, but according to Kellet, companies relying on remotely-delivered IT services should ideally have more than one data connection.
"In general, larger companies will have a fall back connection just in case, to handle things like voice services and other crucial data. But this also has to be done right. Companies in this situation should make sure they’ve seen a network map and identified where it’s weak. Make sure that your resilient back up connection isn’t arriving into your business by the same route as your main connection," he said.
"If a JCB digs up the road outside your office, is it going to take out all your connections? If you haven’t checked this, the answer is quite possibly."
Magnet recommends that companies with a high degree of dependence on cloud-based services make sure they have a back-up system that’s totally independent from their day to day system. This effectively means that they recommend that their clients also do business with their competition.
"If you’re not having that conversation with a customer then you’re not being pragmatic-these things happen and rather than living in the land of wishful thinking, it’s better to have plans in place to deal with problems. For example, a wireless connection from someone like Airspeed can effectively avoid this problem," said Kellet.
"The truth is that outages happen. People dig up cables accidentally or even steal it on purpose – in the UK, BT had 60 exchanges down last month because of copper theft. Copper is a valuable metal and there have been cases of people mocking up BT vans and uniforms, stopping traffic and putting out traffic cones in the middle of the road in order to dig up copper lines in broad daylight."
"They know that BT will come along and replace it and there have even been cases of the same thieves coming back a few weeks later to steal copper in the same way from the same place again."