Confounding with simplicity

10 July 2018

The cloud: elastic, adaptive, secure, resilient, reliable.

That has been the refrain we have constantly heard in recent times, admonishing us to give up our old infrastructure in favour of the cloud, where domain experts and specialists, from the boutique to the exascale giants, can provide better management, protection and security than we could ever hope to. Except we still hear about outages in cloud services with alarming regularity.

Of course, any such outage must be taken with a grain of salt, and in context. For example, were we to report that the Joe Bloggs Online Novelties company had experienced a service outage that affected its ability to take orders from nine global regions for almost seven hours, we would hardly bat an eyelid. In fact, more notable would be how we had heard of it at all, as it would simply not have been newsworthy.

But if the Joe Bloggs Online Novelties company had experienced a service outage while being hosted on a regional node of an exascale cloud services provider, also affecting a hundred others, well, that might be, or in fact usually is, news.

I must admit to still being fascinated by such things, as there always seems to be an interesting explanation behind them. Gone are the days when a server, storage unit or switch simply fell over, as active monitoring, cross connections, load balancing and failsafe designs tend to mean that a failure rarely gets to be an outage. Therefore, when an outage does occur, in the usual manner of an unforeseen situation, it tends to be a conjunction of circumstances, any one of which, on its own, might have been just a plain old PIA, as opposed to an outright disaster. Disaster, say, as experienced by Visa payments in June, when the company estimates that 10% of some 51 million transactions were refused due to a data centre glitch.

The payments industry is heavily regulated and the requirement to maintain secure systems means that resilience is a key concern, not to mention the requirement to be able to facilitate a public service. So when an outage caused issues for some 10 hours on Friday 1 June, many were no doubt intrigued.

But the explanation was somewhat prosaic: a switch failure where a back-up failed to kick in.

Backlog
European boss Charlotte Hogg said that the company’s two data centres, both of which are UK based, constantly process transactions and, as one would expect, each can handle all of Visa’s European transactions alone, leaving ample failover capability in the event of a TUS*. To do so, however, requires monitoring and message coordination between the two, with a synchronised exchange.

Hogg said, in this instance, the switch being used in the primary data centre experienced a very rare, partial failure, which impacted the secondary site and prevented it from automatically processing all transactions, as it should have.

What?

“This created a backlog of messages at the secondary data centre, which, in turn, slowed down that site’s ability to process incoming transactions,” said Hogg.
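As a rough illustration of how a backlog like this can slow a site that, on paper, has capacity to spare, here is a toy sketch in Python. It is not a model of Visa's systems: the function name, the tick-based timing and every number in it are hypothetical, chosen only to show the mechanism Hogg describes.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, stray_sync_per_tick, ticks):
    """Return the queue length at one site after each tick.

    All parameters are hypothetical illustration values:
      arrivals_per_tick   - genuine transactions arriving each tick
      capacity_per_tick   - messages the site can process each tick
      stray_sync_per_tick - extra coordination messages caused by a
                            partial failure at the other site
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        # Genuine transactions and stray coordination traffic share one queue.
        backlog += arrivals_per_tick + stray_sync_per_tick
        # The site clears as much of the queue as its capacity allows.
        backlog -= min(backlog, capacity_per_tick)
        history.append(backlog)
    return history


# Healthy case: capacity is double the genuine load, so nothing queues.
healthy = simulate_backlog(100, 200, 0, 5)

# Partial failure: stray messages push the total load past capacity,
# so the queue grows every tick even though genuine load is unchanged.
degraded = simulate_backlog(100, 200, 150, 5)
```

In this sketch the healthy site never queues anything, while the degraded one falls further behind on every tick, despite being sized to carry the whole genuine load on its own.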

It would appear from this that there was no back-up for the back-up, as the system meant to coordinate resources between the two sites failed. In a matter of business continuity and disaster recovery such as this, it is worth asking whether the whole truth of the story is really coming out.

While the details are starkly prosaic, and sound plausible, could a company as security and availability conscious as this really have committed such a terrible oversight?

Non-intel
Another story which piqued one’s interest in a similar manner is the recent, somewhat unexpected, departure of Intel CEO Brian Krzanich. Krzanich has departed the chip giant for what can only be described as a historical misdemeanour, but one for which enforcement has reportedly been patchy.

There has been a longstanding ordinance within the company that one should not have a romantic engagement with anyone who reports directly, or indirectly, to you. That sounds fair enough, but people will, after all, be people.

It has been reported in various august organs that when such instances arise, one simply goes to HR, declares one’s love, and the object of it (though not necessarily from the rooftops), and stuff gets sorted. Apparently issues only arise where there is some level of deceit or concealment.

Observers close to the issue, and the organisation, have noted that even in the past when instances were discovered that had not been actively concealed, minor reprimands were made, and reporting lines clarified to avoid any conflict of interest or responsibility.

Therefore, when an investigation that has been cited as running for quite some time suddenly results in the removal of a CEO, one wonders if there isn’t something more at work.

When the explanation for an unexpected and shocking outcome is starkly simple, it invites speculation.

The tendency to overstate is just as apparent, as in no one was ever hit with a plain old malicious attack, it was always a sophisticated, advanced and persistent one. But the same applies to the overly simple.

Wherefores
In the case of Visa, why would a back-up system not have better resilience and greater redundancy built into it? In the case of Intel, if the rules for which Krzanich fell on his sword are not often enforced so readily, why do it now with the top banana?

Surely any media advisor in such an instance would warn of the fact that these reactions would be natural and predictable and should be handled in case speculation gets out of hand, causing even more reputational damage than the precipitating incident?

One would think so, anyway, but apparently not.

So on the one hand, it leaves a major payments company looking like it left its BC and DR planning to a second-year computer student and, on the other, a major global technology vendor looking like it needed an excuse to clear house.

Either way, neither situation reflects well on the respective sufferers and no doubt they will be feeling the effects for some time. In today’s world of unsubtle communication, it may well be that both were advised to stick to the facts and the facts only, but we all know that between two facts there can be a gap big enough to hide a lie (by omission) capable of bringing the whole house down.

I’m not for a moment intimating that this is what happened in either instance. I’m merely saying that the handling of the respective stories leaves too much room for interpretation, which leads to speculation, and that can lead anywhere but where you want it to go.

*TUS: a widely used description of a complete technical outage
