Amazon details S3 outage
Specifically, a command to remove servers as part of a billing system debug operation was not accurate enough in its parameters and this led to more servers being removed than intended.
“The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected,” said the statement. “At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.”
An affected system was one that carried metadata and location information for “all S3 objects in the region”. This took out a number of services relying on these servers, hence the extent of the outage.
The statement goes on to say, “While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”
Amazon said that while these systems have been in place for some years, “we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years”.
The statement says that steps have been taken to address the issue, preventing such an operation error from being able to have such an impact in the future.
“We are making several changes as a result of this operational event,” says the statement.
“While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.”
“We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem.”