Shutdown or meltdown: DC dilemma

7 September 2018

I’m always fascinated by cloud outages.

With all the arguments for resilience and uptime, redundancy and adaptability, there are still times when the whole thing takes a dive.

Just such an incident occurred for Microsoft in Texas in recent weeks. As reported by The Register under the delicious headline “Thunderstruck”, lightning at a DC facility caused voltage surges that took out cooling equipment.

“This resulted in a power voltage increase that impacted cooling systems,” said Microsoft. “Automated datacenter procedures to ensure data and hardware integrity went into effect and critical hardware entered a structured power down process.”

It was deemed preferable to have a controlled shutdown rather than an uncontrolled meltdown.

As a result, some Azure and Visual Studio Team Services offerings were affected.

But, even in Texas, how likely would a meltdown have been?

What if the facility had been able to simply use outside air to cool the data centre? Would it have faced the same set of choices?

Texas weather for the last week or so has been in the range of 27 to 33°C. Not scorching, and not even hot by Texas standards.

Plus, by the time air is drawn in and circulated, the motion and manipulation usually result in a bit of a drop, so even at the upper end of that range there should still have been a cooling effect.

Current thinking
Current American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) guidelines recommend 18 to 27°C, with individual vendors varying from Oracle on the low side (21 to 23°C) to HPE, IBM, Cisco and Dell on the high end, also at 27°C.

However, since 2008, experiments have shown that these guidelines are very conservative, not just for survivability but for reliability.

Intel ran experiments in the New Mexico desert in which servers were cooled with essentially free outside air, with little filtering and variable humidity. The equipment failure rate rose by less than one percentage point, from 3.83% to 4.64%, even though temperatures there climbed to 33°C.
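To put those two figures in context, it helps to separate the absolute change from the relative one. The short Python sketch below uses only the two failure rates quoted above; the function and variable names are illustrative assumptions, not anything from Intel's report.

```python
# Failure rates as quoted in the article (percent of servers failing).
BASELINE_RATE = 3.83
FREE_AIR_RATE = 4.64


def absolute_increase(before: float, after: float) -> float:
    """Difference between the two rates, in percentage points."""
    return after - before


def relative_increase(before: float, after: float) -> float:
    """Increase expressed as a percentage of the starting rate."""
    return (after - before) / before * 100


points = absolute_increase(BASELINE_RATE, FREE_AIR_RATE)
relative = relative_increase(BASELINE_RATE, FREE_AIR_RATE)

print(f"Absolute increase: {points:.2f} percentage points")  # 0.81
print(f"Relative increase: {relative:.1f}%")                 # 21.1%
```

So the rise is under one percentage point in absolute terms, although it is roughly a fifth higher relative to the starting rate. Which of the two matters more depends on what the cooling savings are being weighed against.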

Not long after, Microsoft ran a server farm under canvas in Redmond, through the summer and into autumn. Again, variable humidity and ambient temperatures did little to affect either reliability or availability, with zero failures and 100% uptime.

Since then, more scientific experiments have been carried out, and all show that inlet temperatures can be safely raised, with inherent savings.

Reliability
In 2012, the University of Toronto looked specifically at component reliability in the data centre under raised temperature conditions. The conclusion was that “the effect of temperature on hardware reliability is weaker than commonly thought.”

These kinds of studies do not go unnoticed, and the hyperscale providers reacted. Facebook in 2011 acknowledged that it was exceeding ASHRAE’s recommendations with inlet temperatures of 26 to 29°C.

Google acknowledged something similar, with temperatures around 26.6°C (80°F), as it “helps with efficiency”.

Schneider Electric too has conducted studies to see what effect raising temperatures and using essentially unfiltered, unmodified air has on cooling data centres. In white paper #221, it concluded that there were efficiency benefits to be had without raising failure rates, but that the benefits are not linear.

The paper concludes that the cooling architecture and geographic location (specifically the temperature profile of the climate) have a significant impact on the optimal IT temperature set point, and that the shape of the server fan and CFM (cubic feet per minute) curve is a key driver.

It says that while raising temperatures improves chiller efficiency (by increasing economiser hours), the savings can be offset by increased energy consumption in the IT equipment and the air handlers.

Operating conditions such as percent load and computer room air handler (CRAH) oversizing/redundancy influence whether savings or a cost penalty are seen.
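The non-linear trade-off described here can be sketched with a toy model. Everything numeric below is a hypothetical assumption chosen purely for illustration, not data from the Schneider paper: the chiller is assumed to save a fixed fraction of power per degree of set point, while fans are assumed to hold a base speed up to a "knee" temperature and then ramp up. The only physical rule used is the standard fan affinity law, under which fan power rises with the cube of fan speed.

```python
def chiller_kw(setpoint_c: float) -> float:
    """Hypothetical chiller power: 100 kW at 20°C, falling 3% per degree above that."""
    base_kw, saving_per_deg, ref_c = 100.0, 0.03, 20.0
    return max(base_kw * (1 - saving_per_deg * (setpoint_c - ref_c)), 0.0)


def fan_kw(setpoint_c: float) -> float:
    """Hypothetical fan power: 20 kW at base speed, ramping above a 25°C knee.

    Speed is assumed to rise 4% per degree past the knee; power scales
    with speed cubed (fan affinity law).
    """
    base_kw, knee_c, speed_gain_per_deg = 20.0, 25.0, 0.04
    speed = 1.0 + speed_gain_per_deg * max(setpoint_c - knee_c, 0.0)
    return base_kw * speed ** 3


def total_kw(setpoint_c: float) -> float:
    """Combined cooling-related power under the assumptions above."""
    return chiller_kw(setpoint_c) + fan_kw(setpoint_c)


setpoints = range(20, 36)  # candidate set points, 20°C to 35°C
best = min(setpoints, key=total_kw)

for t in setpoints:
    marker = "  <- lowest total" if t == best else ""
    print(f"{t}°C: chiller {chiller_kw(t):6.1f} kW, "
          f"fans {fan_kw(t):6.1f} kW, total {total_kw(t):6.1f} kW{marker}")
```

With these made-up parameters the total falls as the set point rises, bottoms out part-way through the range, and then climbs again as the cubic fan term overtakes the linear chiller saving. Change the knee, the fan ramp or the chiller saving and the optimum moves, which is the sense in which the benefit depends on architecture, climate and load.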

The paper concludes that it should not be assumed that raising the temperature is always a good thing.

However, failure rates were almost entirely unaffected.

Conservatism
There is little doubt that conservatism rules when it comes to determining data centre operating temperatures, but repeated experimentation and study have shown that IT equipment is far more resilient and resistant to failure than has generally been thought.

Temperature, humidity and even dust levels far higher than would be recommended can be withstood by IT equipment with little or no impact on reliability or survivability.

There is a broad window, confirmed by various vendors and researchers, in which savings can be made with higher temperatures and less manipulated air.

With respect to the Microsoft Texas incident, the damage may have been such that the fans simply would not turn. However, it would appear that pure free cooling should be a backup option when chiller units fail, as the tolerance for such practices has been proven.

The evidence also suggests that whoever takes the plunge to raise temperatures and reduce conditioning will benefit. But what is there to say if a failure occurs?

How sympathetic would a CEO be if such a DC regime experienced an incident?

The jury is still out.