AI and self-driving data centres
Early adopters are driving the use of AI to optimise power and cooling systems, automate predictive maintenance, and improve workload distribution in enterprise data centres
11 August 2020
Most of the buzz around artificial intelligence (AI) centres on autonomous vehicles, chatbots, digital-twin technology, robotics, and the use of AI-based ‘smart’ systems to extract business insight out of large data sets. But AI and machine learning (ML) will one day play an important role down among the server racks in the guts of the enterprise data centre.
AI’s potential to boost data-centre efficiency – and by extension improve the business – falls into four main categories:
- Power management: AI-based power management can help optimise heating and cooling systems, which can cut electricity costs, reduce headcount, and improve efficiency. Representative vendors in this area include Schneider Electric, Siemens, Vertiv and Eaton Corp.
- Equipment management: AI systems can monitor the health of servers, storage, and networking gear, check to see that systems remain properly configured, and predict when equipment is about to fail. According to Gartner, vendors in the AIOps IT infrastructure management (ITIM) category include OpsRamp, Datadog, Virtana, ScienceLogic and Zenoss.
- Workload management: AI systems can automate the movement of workloads to the most efficient infrastructure in real time, both inside the data centre and, in a hybrid-cloud environment, between on-prem, cloud and edge environments. There are a growing number of smaller players offering AI-based workload optimisation, including Redwood, Tidal Automation and Ignio. Heavyweights like Cisco, IBM and VMware also have offerings.
- Security: AI tools can ‘learn’ what normal network traffic looks like, spot anomalies, prioritise which alerts require the attention of security practitioners, help with post-incident analysis of what went wrong, and provide recommendations for plugging holes in enterprise security defences. Vendors offering this capability include Vectra AI, Darktrace, ExtraHop and Cisco.
Put it all together and the vision is that AI can help enterprises create highly automated, secure, self-healing data centres that require little human intervention and run at high levels of efficiency and resiliency.
“AI automation can scale to interpret data at levels beyond human capacity, gleaning imperative insights needed for optimising energy use, distributing workloads and maximising efficiency to achieve higher data-centre asset utilisation,” explains Said Tabet, distinguished engineer in the global CTO office at Dell Technologies.
Of course, much like the promise of self-driving cars, the self-driving data centre isn’t here yet. There are significant technical, operational, and staffing barriers that stand in the way of AI breakthroughs in the data centre. Adoption is nascent today, but the potential benefits will keep enterprises looking for opportunities to move the needle.
Power management taps into server workload management
Data centres are estimated to consume 3% of the global electric supply and cause about 2% of greenhouse gas emissions, so it’s no surprise that so many enterprises are taking a hard look at data-centre power management, both to save money and to be environmentally responsible.
Daniel Bizo, senior analyst at 451 Research, says AI-based systems can help data-centre operators understand current or potential cooling issues, such as insufficient cold air delivery due to, for example, a high-density cabinet that’s blocking the air flow, an underperforming HVAC unit, or inadequate air containment between hot and cold aisles.
AI promises to deliver benefits “beyond what’s possible with simply good facilities design,” Bizo says. AI systems “can learn a facility by correlating HVAC systems data and environmental sensory readings” on the data-centre floor.
Power management is the low-hanging fruit, adds Greg Schulz, founder of IT advisory and consultancy firm StorageIO. “Today, it’s about productivity, about getting more work done per BTU, more work done per watt of energy, which means working smarter and getting the equipment to work smarter.”
There’s also a capacity planning angle. In addition to looking for hot spots and cool spots, AI systems can make sure data centres are powering the right number of physical servers and also have the available capacity to spin up (and spin down) new physical servers if there’s a temporary burst in demand.
Schulz adds that power management tools are developing hooks up into the systems that manage equipment and workloads. If sensors detect that a server is running too hot, for example, the system might quickly and automatically move workloads to an underutilised server in order to avoid a potential outage that might impact mission critical applications. The system could then investigate the cause of the server overheating – it might be a fan that failed (an HVAC issue), a physical component that is about to crash (an equipment issue), or maybe the server has just been overloaded (a workload issue).
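As a rough illustration of the scenario Schulz describes, the sketch below flags a hot server, picks the least-utilised cool server as a migration target, and makes a first guess at the cause. It is a rule-based stand-in, not an AI model, and every name and threshold in it (`TEMP_LIMIT_C`, `MIN_FAN_RPM`, `OVERLOAD_UTIL`, the `Server` fields) is a hypothetical assumption rather than anything taken from a vendor product.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
TEMP_LIMIT_C = 85.0    # assumed temperature at which a server counts as "too hot"
MIN_FAN_RPM = 1000     # assumed minimum healthy fan speed
OVERLOAD_UTIL = 0.90   # assumed CPU utilisation that counts as overloaded


@dataclass
class Server:
    name: str
    temp_c: float
    fan_rpm: int
    cpu_util: float        # 0.0 to 1.0
    hardware_errors: int   # e.g. an error count reported by telemetry


def pick_target(servers, hot):
    """Return the least-utilised server that is not itself running hot."""
    candidates = [s for s in servers if s is not hot and s.temp_c < TEMP_LIMIT_C]
    return min(candidates, key=lambda s: s.cpu_util, default=None)


def likely_cause(server):
    """First-pass triage into the three causes named in the article."""
    if server.fan_rpm < MIN_FAN_RPM:
        return "cooling"      # failed fan or airflow problem
    if server.hardware_errors > 0:
        return "equipment"    # component that may be about to fail
    if server.cpu_util >= OVERLOAD_UTIL:
        return "workload"     # server is simply overloaded
    return "unknown"


def handle_overheat(servers):
    """Return (hot server, migration target, suspected cause) for each hot server."""
    actions = []
    for s in servers:
        if s.temp_c >= TEMP_LIMIT_C:
            target = pick_target(servers, s)
            actions.append((s.name, target.name if target else None, likely_cause(s)))
    return actions
```

In practice the thresholds would be learned from telemetry rather than hard-coded, which is where the ML component would come in.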
AI-driven health monitoring, configuration management oversight
Data centres are full of physical equipment that needs regular maintenance. AI systems can go beyond scheduled maintenance and help with the collection and analysis of telemetry data that can pinpoint specific areas that require immediate attention. “AI tools can sniff through all of that data and spot patterns, spot anomalies,” Schulz says.
“Health monitoring starts with checking if equipment is configured correctly and performing to expectations,” Bizo adds. “With hundreds or even thousands of IT cabinets with tens of thousands of components, such mundane tasks can be labour intensive, and thus not always performed in a timely and thorough fashion.”
He points out that predictive equipment-failure modelling based on vast amounts of sensory data logs can “spot a looming component or equipment failure and assess whether it needs immediate maintenance to avoid any loss of capacity that might cause a service outage.”
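To make the idea of spotting anomalies in telemetry concrete, here is a minimal sketch that compares each sensor reading against the mean and standard deviation of the readings just before it. A rolling z-score is a simple statistical stand-in for the ML models the analysts have in mind, and the function name, window size and threshold are illustrative assumptions.

```python
from statistics import mean, stdev


def find_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings that deviate sharply from the preceding window.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` readings.
    """
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat window: skip to avoid dividing by zero
        if abs(readings[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

For example, in a series of fan-inlet temperatures hovering around 40°C, a single reading of 55°C would be flagged, while the normal readings around it would not.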
Michael Bushong, vice president of enterprise and cloud marketing at Juniper Networks, argues that enterprise data-centre operators should ignore some of the overpromises and hype associated with AI, and focus on what he calls “boring innovations.”
Yes, AI systems may one day “tell me what’s wrong and fix it,” but at this point, many data-centre operators would settle for “if something goes wrong, tell me where to look,” Bushong says.
Dependency mapping is also an important, but not especially exciting, area where AI can be useful. If data-centre managers are making policy changes to firewalls or other devices, what might the unintended consequences be? “If I propose a change, it’s useful to know what might be inside the blast radius,” Bushong says.
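One way to picture the “blast radius” is as a walk over a dependency graph: start at the device being changed and collect everything that depends on it, directly or indirectly. The sketch below assumes the dependency map already exists as a plain dictionary; the function name and the device names in the usage note are made up for illustration.

```python
from collections import deque


def blast_radius(dependents, changed):
    """Return every component that depends, directly or transitively, on `changed`.

    `dependents` maps a component name to the list of components that
    rely on it. The walk is a breadth-first search.
    """
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in seen and dep != changed:
                seen.add(dep)
                queue.append(dep)
    return seen
```

With a map such as `{"fw-1": ["lb-1", "db-1"], "lb-1": ["web-1", "web-2"], "db-1": ["web-1"]}`, a proposed change to `fw-1` would put both load balancer and database, and the web servers behind them, inside the blast radius. The hard part in a real data centre is building and maintaining that map, which is where automated discovery would help.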
Another important aspect of keeping equipment running smoothly and safely is controlling something called configuration drift, a data-centre term for the way that ad hoc configuration changes over time can add up to create problems. AI can be used as “an additional safety check” to identify impending configuration-based data-centre issues, Bushong says.
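At its simplest, checking for configuration drift means comparing a device's current settings with an approved baseline and reporting every difference. The sketch below does only that; the function name, the `ABSENT` marker and the setting names in the usage note are hypothetical, and deciding which differences actually matter is the part an AI-based safety check would add.

```python
ABSENT = "<absent>"  # marker for a setting that exists on only one side


def config_drift(baseline, current):
    """Return {setting: (baseline value, current value)} for every setting that differs."""
    drift = {}
    for key in baseline.keys() | current.keys():
        old = baseline.get(key, ABSENT)
        new = current.get(key, ABSENT)
        if old != new:
            drift[key] = (old, new)
    return drift
```

Given a baseline of `{"ssh_root_login": "no", "mtu": 9000}` and a current state of `{"ssh_root_login": "yes"}`, the result would report that root login was switched on and that the MTU setting has disappeared.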
AI and security
According to Bizo, AI and machine learning “can simplify event handling (incident response) by performing rapid classification and clustering of events to identify important ones and separate them from the noise. Quicker root-cause analysis helps human operators make informed decisions and act on them.”
AI can be particularly useful in real-time intrusion detection, adds Schulz. AI-based systems can detect, block and isolate threats and can then go back and conduct a forensic investigation to determine exactly what happened and what vulnerabilities the hacker was able to exploit.
Security professionals working in a security operations centre (SOC) are oftentimes overloaded with alerts, but AI-based systems can scan through vast amounts of telemetry data and log information, clearing mundane tasks off the deck, so that security pros are freed up to handle deeper types of investigations.
AI-based workload optimisation
At the application layer, AI has the potential to automate the movement of workloads to the appropriate landing spot, whether that’s on-premises or in the cloud. “AI/ML should in the future make real-time decisions on where to place workloads against the multitude of specifications for performance, cost, governance, security, risk and sustainability,” Bizo says.
For example, workloads could be automatically moved to the most power-efficient servers, while making sure that the servers operate at peak efficiency, which would be 70-80% utilisation. AI systems could integrate performance data into the equation, so time-sensitive apps are running on the high-efficiency servers, while at the same time making sure that excess energy is not being burned on applications that don’t require fast execution, Bizo says.
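The placement idea Bizo describes can be sketched as a greedy heuristic: rank servers by power efficiency and put each workload on the most efficient server that would stay at or below a utilisation ceiling. This is a simplified illustration rather than an AI scheduler; the 80% cap comes from the range quoted above, while the function name, the `perf_per_watt` field and the use of whole-number percentages are assumptions.

```python
def place_workloads(servers, workloads, max_util=80):
    """Greedy placement onto the most power-efficient server with headroom.

    servers:   {name: {"perf_per_watt": number, "util": current utilisation in %}}
    workloads: {name: demand in % of one server}
    Returns (placement, utilisation after placement). A workload that fits
    nowhere is mapped to None.
    """
    util = {name: spec["util"] for name, spec in servers.items()}
    # Most efficient servers first.
    ranked = sorted(servers, key=lambda n: servers[n]["perf_per_watt"], reverse=True)
    placement = {}
    # Place the largest workloads first so they get first pick of the headroom.
    for wl, demand in sorted(workloads.items(), key=lambda kv: kv[1], reverse=True):
        placement[wl] = None
        for name in ranked:
            if util[name] + demand <= max_util:
                util[name] += demand
                placement[wl] = name
                break
    return placement, util
```

A production scheduler would also weigh the performance, cost, governance and risk criteria Bizo lists, which is what makes the problem a candidate for ML rather than a fixed rule.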
AI-based workload optimisation has caught the eye of MIT researchers, who announced last year that they had developed an AI system that automatically learns how to schedule data-processing operations across thousands of servers.
But, as Bushong points out, the reality is that workload optimisation today is the province of the hyperscalers like Amazon, Google and Microsoft, not the average enterprise data centre. And there are a number of reasons for that.
The challenges of implementing AI
Optimising and automating the data centre is an integral part of ongoing digital transformation initiatives. Dell’s Tabet adds that “with COVID-19, many companies are now looking at further automation, pushing the ideas of ‘digital data centres’ that are AI-driven and capable of self-healing.”
Google announced in 2018 that it had turned over control of its cooling systems in several of its hyperscale data centres to an AI program, and the company reported that the recommendations provided by the AI algorithm delivered a 40% reduction in energy usage.
But, for companies not named Google, AI in the data centre is “largely aspirational,” Bizo says. “Some AI/ML features are available in event handling, infrastructure health and cooling optimisation. But it will take more years before AI/ML models achieve more visible breakthroughs beyond what’s possible with standard Data Centre Infrastructure Management (DCIM) today. Much like with autonomous vehicle development, early stages may be interesting, yet far from the breakthrough economics/business case it ultimately promises.”
Some of the barriers, according to Tabet, are that “the right people need to either be hired or trained to manage the system. Another issue to be aware of is the need for data standards and relevant architectures.”
Gartner puts it this way: “AIOps platform maturity, IT skills and operations maturity are the chief inhibitors. Other emerging challenges for advanced deployments include data quality, and lack of data science skills” within IT infrastructure and operations teams.
Bushong adds that the biggest barrier is always the people. He points out that going out and hiring data scientists is a challenge for many enterprises, and training existing employees is also a hurdle.
Plus, there’s a long history of employees resisting technologies that take control out of their hands, Bushong says. He notes that software-defined networking (SDN) has been around for a decade, yet more than three-fourths of IT operations are still CLI-driven.
“We have to believe that operators across all manner of infrastructure are prepared to give up control to AI,” Bushong says. “If a group of people don’t yet trust controllers to make decisions, how do you train, educate, and comfort a group of people to make a transition of this magnitude when the prevailing attitude in the industry is that, if I do this, I will lose my job?”
That is why Bushong suggests that enterprises take those small, boring steps toward AI and not get caught up in the hype that so often surrounds a new technology.
IDG News Service