Trouble with troubleshooting: network-management tools are letting IT pros down
Troubleshooting is perhaps the most vital responsibility of a network operations team. When IT services are interrupted or degraded, engineers and admins race to diagnose and remediate the problem. Every minute counts, because transactions, employee productivity, and customer satisfaction all suffer until the network team resolves the issue.
Given the stakes, network management tools must have well-defined workflows and technical functionality to support the troubleshooting process. Unfortunately, many tools are letting network managers down.
Root-cause analysis (RCA) is the critical aspect of network troubleshooting. Network engineers must form a theory of the problem and test that theory. Only after they have confirmed their theory of the problem can they move forward confidently with a solution.
Over the years, network managers have told Enterprise Management Associates (EMA) that RCA is one of the most time-consuming aspects of their job. Because network-management tools are failing to support this task, engineers and admins must perform complex analysis themselves. The tools often present dashboards with vast arrays of alerts and time-series graphs that show patterns and indicators of a possible problem, but no clear definition of what the problem actually is. As a result, IT pros have to infer the root cause by looking for patterns of cause and effect. This is no easy task, especially since network managers told EMA that 42.7% of the alerts produced by their tools are false alarms that do not indicate an actionable problem.
Problem isolation and identification is the other troubleshooting task that tools support least. Before network managers can theorise a root cause, they need to find the problem, so they spend their days looking at their tools, which display red and yellow alerts and charts that reveal mysterious spikes and dips in traffic and device metrics. Engineers have to sift through this information and figure out which data are tied to an actual problem. Trouble tickets may offer clues, but isolating the source of a problem is not easy.
A better toolset
In conversations with network-management vendors, I see signs that help is on the way. Tool developers are working to define better workflows for troubleshooting. This has not always been the case. As an engineer who procures and implements network-management tools for a large North American government agency once told me: “I see vendors who do not quite understand or research how the product is going to be used, whether it will be an engineering tool or an operations tool. [They fail to ask] how do I make this product fit into a workflow?”
Some of my confidential conversations with vendors have shifted toward a focus on problem isolation and RCA. These conversations are about specific workflows to support the process, but also about presenting data in a new way that makes it easier to find and work with essential information. Vendors frequently talk about ease of use, and when they do, they usually mean making their tools useful to tier-1 admins, not just the elite engineers who are the last line of defence during a fire drill.
AIOps is also starting to deliver results. Many commercial network-management vendors are adding new features that use machine learning and big-data technology to make their tools smarter. Rather than simply presenting data, these tools offer natural-language explanations of a problem, possible root causes and recommended fixes.
These AIOps features are still maturing, and not every vendor offers them, but there is progress. The question is whether IT organisations will be willing to pay for them. Vendors are investing heavily in this technology, and many of them are asking themselves, “Is this going to be worth it?”
Some vendors see AIOps as a competitive differentiator that can help them win new business, so they do not charge extra for it. Others charge a premium for AIOps products and services that enrich the core tool. Network managers need to ask themselves whether a tool that is better at solving complex problems is worth paying for. After all, these tools will deliver tremendous business benefits, from a better end-user experience to improved IT productivity.
Shamus McGillicuddy is vice president of research networking at Enterprise Management Associates