It’s only a disaster if you don’t recover

17 November 2014

In broad and admittedly crude terms, there are only two types of ICT disaster: the systems kind and the physical. And to be even more superficial for a moment, recovery from a physical event like fire, explosion, flood or typhoon is likely to be easier on all fronts. Why so? Partly for technical reasons, but largely because of human nature. When a building, whether HQ, factory or store, is knocked out completely, the event is clear, visible and often well publicised, or at least can be. So the organisation has the immediate sympathy of its public, paying or otherwise, and invariably the willing help of its partners. On the ICT side, this is precisely what BC&DR plans are for and particularly good at.

There really should be no problem handling the PR after a ‘natural’ disaster. For a start, everyone understands what happened and, to some degree, what coping with the consequences must be like. There will always be something of a surge of goodwill, however temporary it may prove, and it will help with all constituencies. Customers and clients will be more likely to forgive delays, although they are also quite likely to go elsewhere, depending on the product or service and the degree of rapid fulfilment required. Some of that churn will be caused by the perceived or anticipated delay rather than the actual one, which means that a constant flow of information and updates is a key part of the response. But all of that communication stuff is basic to any business continuity plan, or should be.

That is because BC&DR is by no means entirely an ICT domain. In fact, much of it in practical and effective terms is very people centred: where and how do staff resume work, what will happen for customers, what should suppliers do? What will you tell your customers and trading partners? How? All of those are policy, planning and communications issues, although of course the solutions will involve ICT.

But from the point of view of the CIO and IT department, and in the sympathetic environment of TechPro, a systems disaster is a real catastrophe. For a start, it is invisible. OK, so it was literally sparked by a failed component. Good luck with pinning the blame on that. We know that systems failures are human failures at least as often as software or hardware glitches, but that does not help very much either. There are no upsides to a systems crash that brings down business operations in whole or in part. The frantic 24-hour shifts by caffeine-fuelled systems people do not begin to compare with the visible heroics of firefighters or floodlit cranes and trucks and pumps or shirt-sleeved senior management embracing photo opportunities on the customer-facing front lines.

‘It should not have happened’ will be the popular cry, and the media theme if you are big enough, and there is just no answer to that. Of course it shouldn’t, but neither should car or space vehicle crashes, train signalling failures or mains water leaks. There is no sympathy and no reservoir of goodwill for systems failures. The position is compounded by two elements that almost inevitably accompany them: the actual cause is probably not identifiable straight off, and the time to full restoration is almost impossible to forecast. From the point of view of public, partners and customers, that is not good enough. Human nature and tabloid hacks demand simple answers to complex questions; whether those answers are understood is moot anyway. As for ‘When will it be fixed?’, an answer is required now, both so that anyone affected can decide what to do and, for some, to furnish ammunition for sniping.

There are very few simple prescriptions for what to do after a major systems breakdown in regard to communications and continuous contingency planning and re-planning. BC&DR manuals and best practice concentrate on the technical and procedural aspects. Understandably, because these are the things that can be planned for. There is a famous military saying, taught in all the academies, to the effect that no battle plan survives the first encounter with the enemy. Any disaster is going to be like that. The ABC procedural aspects of the recovery plan will probably apply OK, at least broadly, but the encounter with gritty, messy and unpredictable realities is likely to challenge the plan and require flexible leadership. Having drilled consistency and the unanimity of teamwork into its soldiers, military training genuinely promotes and cultivates initiative on top of that disciplined foundation. It is an approach that should also serve disaster recovery very well.

Which is all very well in theory. One essential of any emergency plan is that it has to be tested, to see whether it will survive that encounter with reality and to learn from the exercise. We have fire drills, total evacuation trials, major incident plans and security breach exercises. Sometimes in IT we may even try restoring from back-up media. In fairness, we should not be sarcastic: that is a cumbersome exercise, since traditionally it would have meant restoring from tape or disk, kept as separate as possible from the real production systems. Where there is a smart failover solution, whether on-premise, in a data centre or in the cloud, testing becomes much easier since you can promptly fail back if something goes wrong. It may be a tad on the expensive side, but that has always been the price of systems redundancy.
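
Purely by way of illustration, the sketch below shows what a minimal scripted failover-and-failback drill might look like. It is an assumption rather than a recipe: the host names, the /health endpoint and the promote() and demote() hooks are hypothetical placeholders standing in for whatever mechanism (DNS switch, load balancer rule, database promotion) a given organisation actually uses.

```python
"""Minimal sketch of a scripted failover/failback drill.

Assumptions (hypothetical, not tied to any real product):
  - primary and standby sites expose an HTTP health endpoint at /health
  - promote()/demote() wrap whatever the real failover mechanism is
    (DNS switch, load-balancer rule, database promotion, etc.)
"""
import time
import urllib.request

PRIMARY = "https://primary.example.internal"   # placeholder host
STANDBY = "https://standby.example.internal"   # placeholder host


def healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the site answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def promote(base_url: str) -> None:
    """Placeholder: point traffic at this site (DNS, LB rule, etc.)."""
    print(f"[drill] promoting {base_url} to active")


def demote(base_url: str) -> None:
    """Placeholder: take this site out of the active role."""
    print(f"[drill] demoting {base_url} to standby")


def run_drill() -> None:
    # 1. Confirm the standby is actually usable before touching anything.
    if not healthy(STANDBY):
        raise SystemExit("standby unhealthy: abort drill, nothing changed")

    # 2. Fail over, give it a moment to settle, then run the same check.
    demote(PRIMARY)
    promote(STANDBY)
    time.sleep(30)
    if not healthy(STANDBY):
        # 3. The whole point of a failover setup: fail straight back.
        demote(STANDBY)
        promote(PRIMARY)
        raise SystemExit("smoke check failed on standby: failed back to primary")

    # 4. Drill complete: fail back in a controlled way and record the result.
    demote(STANDBY)
    promote(PRIMARY)
    print("[drill] failover and failback completed successfully")


if __name__ == "__main__":
    run_drill()
```

The value of scripting a drill like this is less the automation itself than the fact that the fail-back path gets exercised as routinely as the failover, which is precisely what makes the test cheap enough to repeat.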

But what is very seldom tested or rehearsed, to the best of this columnist’s knowledge, is the range of non-IT actions in the BC&DR plan. Do key staff ever actually move to the alternative base of operations, for example, and run things from there for perhaps a working day? A common element of a recovery plan in today’s connected world (well, in most parts of Ireland) is for staff to be officially asked to work from home and on mobile devices for the duration. Everyone does that individually from time to time, but has a total abandonment of the premises for a day ever been tried? Other types of emergency planning involve full testing of likely scenarios and invariably discover unforeseen gaps. A huge civic exercise in a British city some years ago discovered that no provision had been made for feeding hundreds of emergency workers, who ended up besieging the small catering van for the film crew covering the exercise.

Your systems are down and have brought most of the business to a standstill: what is the communications strategy? How soon do you concede to customers and the public that business is suspended? What do you ask them to do? Do you even acknowledge that the time to restoration is uncertain? What are staff briefed to say to enquiries from customers, partners, service providers and suppliers, whoever they deal with regularly? There are different audiences and constituencies, with different concerns and nuances. That does not imply a set of spins. But a tech-savvy supplier that is already linked to you in an electronic supply chain needs a different answer from the one a customer gets, for example. A consumer business or a public utility will need a different strategy from a company in a B2B market.

The salient point for this column is that the CIO has the primary role whatever the nature of the disaster. If it is natural and physical, fire, flood or Ebola lockdown, the contingency planning and restoration of operations will be largely or even entirely in his or her remit. If it is a systems crash, the problem is entirely in CIO territory and carries the additional burden of blame.

Two approaches suggest themselves. The first, by no means novel, is that the relationship of the CIO and IT with the business leaders is crucial. Business recovery is above all else a team job. Real life has demonstrated often enough that it sometimes takes an emergency, or even a disaster, to pull an organisation together. That unfortunate distinction between a real, ‘natural’ event and an invisible systems breakdown still obtains, but pulling together in time of crisis demands leadership.

Another approach that may be worth trying is related to the testing of BC&DR plans. Normally, any such test is pure cost, and unpopular with senior management for precisely that reason. So maybe this is initiative and lateral thinking time. Much ICT marketing in recent times has been about ‘drop-in solutions’, while in real business it is now very common for a new branch or operation to begin with a standard model of systems and applications, right down to user permissions, dropped in or virtually spun up ready to use.

Is there any reason in principle why the essential testing process for BC&DR might not be combined with the testing of such model or reference systems, or of the proposed next iterations of any technology in the organisation? This does not mean prototyping or experiment, just testing of solutions in an environment uniquely close to real-life practice. Yes, it would extend the planning burden and add some costs. But it would also deliver a different and positive return on the overall cost of BC&DR testing, otherwise an investment and activity that is really dead money.
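
To make that suggestion slightly more concrete, here is one hedged sketch of how the combination might work: the same smoke-test checklist is simply run against both the DR environment and the candidate reference build, so a single exercise yields two useful results. The environment names and check paths below are invented for illustration only.

```python
"""Sketch: one smoke-test pass over both the DR site and a candidate build.

The environments and checks are hypothetical examples; the point is only
that a single BC&DR exercise can double as a test of the next iteration.
"""
import urllib.request

# Hypothetical environments: the DR failover site and the proposed
# next-iteration "reference" system, spun up from the standard model.
ENVIRONMENTS = {
    "dr-failover": "https://dr.example.internal",
    "reference-candidate": "https://ref.example.internal",
}

# Hypothetical checks every environment must pass during the exercise.
CHECK_PATHS = ["/health", "/login", "/orders/recent"]


def check(base_url: str, path: str) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def run_exercise() -> None:
    """Run the shared checklist against every environment and report."""
    for name, base_url in ENVIRONMENTS.items():
        results = {path: check(base_url, path) for path in CHECK_PATHS}
        verdict = "PASS" if all(results.values()) else "FAIL"
        print(f"{name}: {verdict} {results}")


if __name__ == "__main__":
    run_exercise()
```

Nothing in the sketch is specific to any product; the point is only that the marginal cost of adding the reference system to an exercise that is happening anyway is small.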

 
