Tech outages in 2016 and prevention for 2017
3 January 2017
In 2016, major downtime events led to lost revenue for a number of highly recognisable brands, dealing severe blows to their reputations and to consumer confidence. One of the most common causes of outages is an unplanned configuration change to a system, often when an immediate fix for a bug or potential vulnerability unintentionally creates a much larger problem.
Before examining the steps recommended for avoiding unexpected downtime, let us review some of the computer- and server-related outages of the past 12 months.
About 836 Southwest Airlines flights were delayed in October in what was described as a problem related to the airline's technology systems. Employees worked around issues with primary systems and used back-up procedures to get customers and their checked luggage to their destinations, the airline said.
Delta Air Lines confirmed in an update that a power outage in Atlanta that started at 02:30 Eastern US time had affected its computer systems and operations worldwide, leading to flight delays. It warned of large-scale flight cancellations and said that airport screens and other flight-status systems were incorrectly showing flights as on time. The five-hour outage is estimated to have led to 2,000 cancelled flights and some $150 million (€143.3 million) in lost revenue.
Cloud applications company Salesforce said on its website that a disruption lasting more than 12 hours was the result of a database failure on its NA14 instance, which introduced a file integrity issue in the NA14 database.
Revenue impact was estimated at approximately $20 million (€19.1 million).
In June, Apple internet services such as iCloud, the App Store, iTunes and Apple TV were down for nine hours, and in early December users again could not access their iCloud accounts.
Also in June, 3 million users lost access to Slack for two hours after its web servers were overwhelmed.
Identify what is mission critical: To avoid unexpected downtime, IT management specialist BigPanda recommends that IT Ops teams tier their services and identify the systems that are mission critical to the business. Top-tier applications should include those that are directly linked to the success or failure of the business, such as point-of-sale, ticketing, or billing.
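In code, that tiering step amounts to little more than an explicit lookup that every alerting and escalation path can consult. The sketch below is illustrative only; the service names and tier assignments are hypothetical, not BigPanda's actual methodology.

```python
# Hypothetical service-tier registry. Tier 1 = directly revenue-linked
# (point-of-sale, ticketing, billing); higher numbers are less critical.
SERVICE_TIERS = {
    "point-of-sale": 1,
    "ticketing": 1,
    "billing": 1,
    "internal-wiki": 3,
    "staging-ci": 3,
}

def tier_of(service: str, default: int = 2) -> int:
    """Return a service's tier; unclassified services default to mid-tier."""
    return SERVICE_TIERS.get(service, default)

def is_mission_critical(service: str) -> bool:
    """Tier-1 services are those whose downtime hits the business directly."""
    return tier_of(service) == 1
```

The useful property is the explicit default: a service nobody has classified is treated as mid-tier rather than silently ignored.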
Develop an ironclad failover plan for top-tier systems: Offering a high level of availability is not something that happens by chance. It must be carefully planned for in every aspect of the system architecture. Top-tier systems should be bolstered by an ironclad failover plan — one that carefully plans for load capacity to handle unexpected spikes.
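A minimal sketch of that idea, assuming a primary/backup pair and a simple headroom rule: route traffic to the first healthy endpoint that can absorb twice the expected peak load, and fall back to any healthy endpoint in degraded mode. The endpoint names, the 2x headroom figure, and the selection logic are all illustrative assumptions, not a production design.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    healthy: bool
    capacity_rps: int  # requests per second the endpoint can absorb

def pick_endpoint(endpoints, expected_peak_rps):
    """Pick a target for traffic; list order encodes preference (primary first)."""
    # Plan for unexpected spikes: require 2x headroom over expected peak.
    for ep in endpoints:
        if ep.healthy and ep.capacity_rps >= 2 * expected_peak_rps:
            return ep
    # Degraded mode: any healthy endpoint is better than none.
    for ep in endpoints:
        if ep.healthy:
            return ep
    return None  # total outage; nothing left to fail over to
```

Because the capacity check runs on every selection, the plan covers both the failover case (primary unhealthy) and the load-spike case (primary healthy but short of headroom).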
Invest in a best-of-breed monitoring stack: You cannot protect against what you cannot see coming. In the age of continuous integration and continuous delivery, the only way to ensure that you have an accurate pulse on the health of your IT systems is to implement the best monitoring tool for each layer of your stack (e.g. systems monitoring, application monitoring, web and user monitoring, logging, error tracking, etc.). The industry is rapidly replacing monolithic monitoring architectures with this “best-of-breed” approach to better serve increasingly complex and dynamic IT systems.
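The per-layer idea can be sketched as a registry of one dedicated check per stack layer, aggregated into a single health report. The layer names and the hard-coded check results below are invented for illustration; in practice each entry would call out to that layer's own monitoring tool.

```python
# One purpose-built check per stack layer (best-of-breed), rather than a
# single monolithic monitor. Results here are hard-coded stand-ins.
LAYER_CHECKS = {
    "systems":     lambda: {"ok": True},
    "application": lambda: {"ok": True},
    "web/user":    lambda: {"ok": False, "detail": "p95 latency high"},
    "logging":     lambda: {"ok": True},
}

def health_report():
    """Run every layer's dedicated check and collect any failures."""
    failures = {}
    for layer, check in LAYER_CHECKS.items():
        result = check()
        if not result.get("ok"):
            failures[layer] = result.get("detail", "unhealthy")
    return failures
```

The point of the structure is that adding or swapping a tool for one layer touches one entry, not the whole monitoring pipeline.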
Implement alert correlation to distinguish signal from noise: More tools — monitoring more moving parts — leads to more noise. It’s a simple fact. In order to efficiently identify, triage, and remedy potential issues before they have the chance to do real damage, IT teams require a way to properly separate the signal (e.g. “the real problem”) from the many sources of noise. By implementing an alert correlation solution, IT teams can see how alerts from their various monitoring tools are related, allowing them to quickly filter non-critical issues and focus on what matters most.
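One common correlation technique is to group alerts from different tools by a shared attribute within a time window, so five tools firing about the same host produce one incident instead of five pages. The sketch below assumes a simple alert shape (host, tool, timestamp) and a 300-second window; both are illustrative, not a description of any particular product.

```python
WINDOW_SECONDS = 300  # assumed correlation window

def correlate(alerts):
    """Group alerts by host: alerts within WINDOW_SECONDS of the first
    alert on that host join the same incident; later ones open a new one."""
    incidents = []
    window_start = {}   # host -> timestamp of the open incident's first alert
    open_incident = {}  # host -> alert list for the open incident
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        host = alert["host"]
        if host not in window_start or alert["ts"] - window_start[host] > WINDOW_SECONDS:
            window_start[host] = alert["ts"]
            open_incident[host] = []
            incidents.append(open_incident[host])
        open_incident[host].append(alert)
    return incidents
```

With this grouping, a database host that trips both its system monitor and its APM tool within the window shows up as one incident, and the on-call engineer triages incidents rather than raw alerts.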
IDG News Service