Everything does not have to be resilient and highly available
TechFire hears that a risk-based approach and pragmatic application are best for resilience and continuity
8 October 2019
Everything in the organisation does not have to be highly available and highly resilient. That was the combined advice of the panel at the recent TechFire, in the context of questions from both public sector and non-profits about how to allocate resources, build in margins for contingency and lean processes.
Dr Sandra Bell, head of Resilience Consulting (Europe), Sungard AS, had made the point that lean processes have little leeway for reliability or resilience.
She elaborated when asked whether there was an inverse relationship between lean and resilient.
“There is an inverse relationship between lean and reliable, because if you cut back and cut back, you will introduce single points of failure and lose the ability to duplicate elsewhere,” said Dr Bell.
“Though it does not necessarily mean that [you will] be less resilient,” she said.
Dr Bell cited the example of Jaguar Land Rover (JLR) and its just-in-time production process. She said there were only eight minutes of slack in the production schedule. But from an efficiency perspective, this suited JLR, despite occasional costly actions to ensure minimal downtime.
The customer interviewee, Frank Moran, IT service continuity manager, Bank of Ireland, added that JLR had announced it would suspend production for one week after the Brexit deadline to relieve pressure on suppliers and facilities.
“It goes back to the minimum business continuity objective. Not everything has to be highly resilient and highly available.”
Both Dr Bell and Moran said that a risk-based approach, with a holistic understanding of what was critical and what wasn’t, was key to determining the resource level necessary to ensure resilience and continuity.
With regard to risk, Dr Bell cited the annual Allianz Risk Barometer, which found that cyber incidents have joined business interruption as a leading global risk for companies for the first time. In the eighth annual survey on top business risks from 86 countries, business interruption scenarios are becoming more diverse and complex, with costs rising, and cyber incidents now outpace fire and natural catastrophes.
As Dr Bell warned about planning and risk, she also said that incidence did not necessarily reflect impact. Of the most likely incidents in the risk survey, the order was (descending):
- Cyber incident
- Fire, explosion
- Natural catastrophes
- Lean processes – leaner, can often mean less resilient
- Plant/machine breakdown
But fire and explosion were far more likely to cause damage and disruption than a cyber incident.
For the first time at a TechFire, an interactive scenario was presented, with a fictional supermarket group experiencing a weather event and an IT systems failure. This helped support the point that human behaviour is a critical factor in incident response, as attendees responded in real time to the unfolding scenario.
A show of hands from the audience indicated that around 15% of attendees had previously participated in a gamed scenario as part of either training or testing.
Games and reality
When asked about performance in training and gamed scenarios against real life, Dr Bell said there are different levels involved.
“You do exercises for different reasons; for educational purposes, so that people are aware of what the processes are, and how to manage them. But when you get into the crisis situation, it is less about the process and more about the cadence, and that is where it gets really interesting.”
“What we find are two things,” Dr Bell explained.
“Number one is that there has to be a realisation that there is no book for this – it is all about behaviours. We find this is almost easy to deal with once you give people a framework for dealing with information under stress, how to cope with it etc, but what we are now moving into is putting people realistically under stress.”
When under stress conditions, people can often act differently to how they normally would, and to how they have been trained, said Dr Bell.
“One of the biggest things they will have to do is crisis media events. Lots of people do media training where they are in front of people, but very few people do that in a hostile environment, and it is where people often fall down,” she said.
A further show of hands indicated that only around 12% of attendees had actually invoked an aspect of a continuity or incident response plan.
A question from the audience asked about the nature, extent and frequency of resilience and continuity plans.
While there was no one-size-fits-all answer, there were general guidelines by which organisations might answer the question for themselves.
“What I tend to do,” said Dr Bell, “coming from a utilities background, are operations, exercises, testing and awareness programmes that focus on little and often.”
Different aspects should be tested regularly, with perhaps one big exercise once a year, she said.
However, she also cited a disturbing statistic: the average success rate for recovery tests in IT is 32%. “This is a hard message to push out to the CEO when it fails,” she said.
“If you want it to work, you really need to be practising it,” argued Dr Bell.
Moran added that if your testing is 100% successful, you may need to re-evaluate your tests.
Plan B, and C
Returning to the theme of people and their ability to perform in a crisis, Dr Bell said that while it is vital to have a contingency plan, there were other needs too. First of all, she emphasised that if dealing with a cyber incident where there may have been network intrusion, the likelihood is that the attackers will have taken a copy of your contingency plan and know your every planned move.
However, more importantly, she said it is vital to also have a plan B, C and D to cope when plan A, or any subsequent plan, fails.
She highlighted the example of the London Bridge attacks, where one organisation found that not only was one of its workplaces within what was designated a crime scene, but that the extent of the designation also covered one of its fallback sites. This seriously impacted its ability to enact its continuity plan.
Furthermore, Dr Bell warned that the failure of an incident response plan can have a psychological impact on people and on their ability to deal with it adequately. People need to be trained and prepared for those eventualities when a primary response fails.
She went on to give some general advice with regard to the value of cross training.
“The very best thing you can do to ensure resilience is to invest in cross skills training, and everyone understanding how the whole system works. So that if the system fails, you’ve got all of that brain power to innovate and work slightly differently,” said Dr Bell.
“If you rely only on having tolerances in there, it will probably fail and the thing that is going to make your organisation work is not the machines, it is the people, and their knowledge.”