Amazon's biggest problem is digital

So now you know what it takes to bring down loads of websites in one fell swoop. The answer is not a lot.

We all know about the outage at AWS on 28 February, which affected its US-EAST-1 region and disrupted access to a number of websites including Netflix, Slack, Tinder, Airbnb, the US SEC, Expedia and Prime Instant Video.

The extent of the disruption, which lasted for five hours, reflected Amazon’s dominant position in the cloud market, where it has a share of more than 40%. It also showed that many of its customers had placed all of their data in that one region instead of spreading it across a number of locations to provide redundancy.
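For readers wondering what spreading data across locations looks like in practice, here is a minimal sketch in Python. It assumes the same object has already been copied to more than one region and simply tries each in turn. The names `fetch_with_failover` and `AllRegionsFailed`, and the idea of passing in one fetch function per region, are illustrative assumptions rather than anything from AWS or its SDKs.

```python
from typing import Callable, Dict


class AllRegionsFailed(Exception):
    """Raised when none of the configured regions could serve the read."""


def fetch_with_failover(key: str, fetchers: Dict[str, Callable[[str], bytes]]) -> bytes:
    """Try each region in the order given and return the first successful read.

    `fetchers` maps a region name to a function that reads `key` from that
    region. Both are hypothetical stand-ins for a real storage client.
    """
    errors = {}
    for region, fetch in fetchers.items():
        try:
            return fetch(key)
        except Exception as exc:  # real code should catch the client's specific errors
            errors[region] = exc
    raise AllRegionsFailed(f"could not read {key!r} from any region: {errors}")
```

The ordering of the dictionary decides which region is preferred, so the primary goes first and the fallbacks after it. Failover on reads only helps if the data was replicated beforehand, which is the part many of the affected sites appear to have skipped.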

Some people reading Amazon’s summary of the S3 service disruption might be surprised at just how brittle the system appears to be.

According to Amazon, the S3 team “was debugging an issue causing the S3 billing system to progress more slowly than expected”. An S3 team member “using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended”.
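To illustrate how a mistyped input might be caught before it does any damage, here is a rough sketch in Python of a guard around a capacity-removal command. It is not Amazon's tooling. The function name `plan_removal`, the `UnsafeRemoval` exception and the `min_capacity` and `max_batch` limits are all hypothetical, chosen only to show the idea of refusing a request that is too large or that would leave too few servers running.

```python
from typing import List


class UnsafeRemoval(Exception):
    """Raised when a removal request fails a safety check."""


def plan_removal(active: List[str], requested: List[str],
                 min_capacity: int, max_batch: int) -> List[str]:
    """Validate a removal request and return the servers that may be removed.

    All limits are illustrative assumptions, not values from any real system.
    """
    to_remove = set(requested)
    unknown = to_remove - set(active)
    if unknown:
        raise UnsafeRemoval(f"not active servers: {sorted(unknown)}")
    if len(to_remove) > max_batch:
        raise UnsafeRemoval(
            f"asked to remove {len(to_remove)} servers, limit per command is {max_batch}")
    remaining = len(set(active)) - len(to_remove)
    if remaining < min_capacity:
        raise UnsafeRemoval(
            f"only {remaining} servers would remain, minimum is {min_capacity}")
    return sorted(to_remove)
```

With ten active servers, a minimum of eight and a batch limit of two, a request for two servers passes while a request for five is rejected outright, which is the kind of check that turns a typo into an error message rather than an outage.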

And then, like a house of cards or a Jenga tower or a chorus of Dem Bones, everything started to unravel in a most unwelcome sequence. In Amazon’s words: “The servers that were inadvertently removed supported two other S3 subsystems.” Both subsystems required a full restart and, while they were being restarted, “S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable”.

Further, Amazon revealed it had not “completely restarted the index subsystem or the placement subsystem in our larger regions for many years”. This was compounded by the massive growth of S3 “over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected”.

Amazon apologised to customers and pledged to make changes to ensure the problem didn’t happen again. “While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further,” it concluded.

Let’s hope so, because a lot of companies and customers rely on AWS for their cloud activities and the outage demonstrates the potential pitfalls of doing so. As people are left to ponder just how reliable and robust their cloud services are, perhaps they can afford a wry grimace at the thought that, if nothing else, the service disruption brings a whole new meaning to “giving the Internet the finger”.