Thoughts on a plane…
2 June 2017
It is being reported today that the British Airways IT outage that grounded flights and discommoded some 75,000 passengers was caused by a contractor switching off a UPS.
The Times UK reports that the UPS was inadvertently switched off by the contractor and that caused an immediate power loss to the system. Furthermore, when the unit was switched back on, it caused physical damage which meant that the IT systems did not restart as they should, compounding the problem.
Though the reports are unconfirmed, they would appear to deflect blame away from the outsourcing deal and inexperienced foreign support staff who had come under criticism in the wake of the debacle.
However, what it does highlight is the danger of the ID10T threat, or basic human error.
There was an old joke about automation that had a guard dog in the cockpit of a plane with a pilot. A journalist interviewing the pilot about the wonder of a fully automated airliner asks the pilot how it works.
The pilot says the plane is pre-programmed and flies itself with no need for human intervention. So, of course, the journalist asks why the dog is there. The pilot cheerfully responds that the dog is there to bite the pilot if he tries to touch anything. The journalist then asks the inevitable next question: why is the pilot there? To feed the dog, the pilot replies.
The episode highlights a few issues in relation to the BA outage.
Firstly, sometimes even highly trained staff can have a brain fart and do something stupid. Hitting the wrong button, the one on the left instead of the one on the right, elbowing a switch or tripping over a cable—we’ve all done it.
The point is that systems should be engineered so that even sabotage, as well as sheer stupidity or Inspector Clouseau-like awkwardness, cannot easily take them out. Protected routines, automated overrides, assurance delays and the like have been around for some time. I once had an under-the-desk UPS with a three-step shutdown that asked for at least two confirmations before complying. It also had bracketed power connectors to prevent inadvertent disconnection of power.
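To make the idea of a protected routine concrete, here is a minimal sketch of a shutdown that only takes effect after a request plus two separate confirmations. It is purely illustrative: the `GuardedShutdown` class and its method names are hypothetical, and it does not describe how any real UPS, least of all the one at BA, is actually controlled.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedShutdown:
    """Hypothetical power switch that needs a request plus N confirmations."""

    required_confirmations: int = 2
    confirmations: int = field(default=0, init=False)
    requested: bool = field(default=False, init=False)
    powered: bool = field(default=True, init=False)

    def request(self) -> None:
        # Step 1: arm the shutdown. Nothing is switched off yet.
        self.requested = True
        self.confirmations = 0

    def confirm(self) -> bool:
        # Steps 2 and 3: each confirmation is counted separately.
        # A stray confirmation without a prior request is ignored.
        if not self.requested:
            return False
        self.confirmations += 1
        if self.confirmations >= self.required_confirmations:
            self.powered = False
            self.requested = False
            return True  # power has now been cut
        return False

    def cancel(self) -> None:
        # Backing out at any point leaves the power on.
        self.requested = False
        self.confirmations = 0


ups = GuardedShutdown()
ups.confirm()   # stray button press with no request: ignored
ups.request()
ups.confirm()   # first confirmation: still powered
ups.confirm()   # second confirmation: power is cut
```

A single elbow on a single switch gets nowhere with a design like this; it takes three deliberate actions in the right order to cut the power.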
As well as that, in these days of cloud-enabled everything, one would have thought that any such critical system would be running in a mirrored or active-active configuration to allow immediate, graceful failover, even if it is only to another internal site or set of resources.
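As a rough illustration of what failover means at its simplest, the sketch below routes work to the first site that reports itself healthy. The site names, the `health` dictionary and the `pick_site` function are all assumptions made up for this example; a real active-active setup would also need health checks, data replication and much more besides.

```python
from typing import Dict, List, Optional


def pick_site(sites: List[str], health: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy site in preference order, or None."""
    for site in sites:
        # A site with no health report is treated as unavailable.
        if health.get(site, False):
            return site
    return None


# Hypothetical site names, in order of preference.
sites = ["primary-dc", "secondary-dc"]

# Normal running: the primary takes the load.
normal = pick_site(sites, {"primary-dc": True, "secondary-dc": True})

# Primary loses power: traffic moves to the secondary.
failed_over = pick_site(sites, {"primary-dc": False, "secondary-dc": True})

# With only one site listed, a single failure leaves nowhere to go.
stranded = pick_site(["primary-dc"], {"primary-dc": False})
```

The last case is the uncomfortable one: with no second site in the list, one failure is enough to leave the function with nothing to return.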
Despite seemingly getting BA off the hook for its cost cutting and IT outsourcing in recent years, this new development does not, by any means, exonerate BA’s IT.
The fact that a UPS could be accessed inadvertently and shut down by someone—contractor or no—is still highly questionable.
Added to all of this, when the UPS was brought back online (and we have no information about when this was done, apart from the Times article that says it happened “in an unplanned and uncontrolled fashion”), it caused damage, which again shows a lack of planning and forethought.
Usually, a UPS sits between a back-up generator and a power distribution system. It can act as a final source of power in the event of a total power loss, allowing systems to shut down gracefully or fail over safely. In this case, it seems that the loss of the UPS, without either a redundant system to take the load or another connection to take over, brought everything down: the classic single point of failure.
One would imagine that this has led to quite a bit of soul-searching at BA and the wider IAG group. With Aer Lingus having escaped the ravages of the WannaCry outbreak, one would imagine it is keen not to have a similar bib-dirtying event to its group partner.