Windows Azure outage due to performance update
A service interruption on Azure storage services late Tuesday was caused by Microsoft’s attempts to roll out a performance update that had been earlier tested for several weeks.
An issue was discovered as part of the update to Azure Storage that resulted in reduced capacity across services using Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services, wrote Jason Zander, corporate vice president of the Microsoft Azure team in a blog post Wednesday.
The interruption impacted storage services across the US, Europe and parts of Asia. A limited subset of customers were still affected by intermittent issues on Wednesday, according to the company. On the Azure status page, Microsoft was still reporting problems in the West Europe region on Wednesday evening.
The company had tested the performance update in a subset of its customer-facing storage service for Azure Tables, a process known as ‘flighting,’ when the company identifies issues before broad deployment of an update.
“During the roll-out we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting,” wrote Zander. The front ends were as a result unable to handle further traffic, which affected other services built on top, he added.
The change was rolled back once the issue was discovered but a restart of the storage front ends was required in order to fully undo the update, according to Zander.
The blog will be updated with a root cause analysis “to ensure customers understand how we have addressed the issue and the improvements we will make going forward,” Zander said. Customers are currently being assisted by both engineering and support teams.
Interruptions like the one starting Tuesday are likely to affect Microsoft’s pitch that it is easier to rent computer capacity in the cloud rather than set up systems in-house. A number of websites of businesses and Microsoft’s own Xbox Live were hit in the interruption this week, according to reports.
Users complained that the rollout of the update should have been done in stages rather than across all regions at one go. The Microsoft Azure team admitted that it had not followed the standard protocol of applying production changes in incremental batches. One person commented on the Azure blog that as a result of the outage, her company had decided to move all its sites to rival Amazon Web Services.
John Ribeiro, IDG News Service