Amazon blames human error for cloud-service disruption


The hours-long Amazon Web Services incident that knocked major sites offline and caused problems for several others on Tuesday was caused by a typo, AWS reported Thursday.

Amazon.com (AMZN) on Thursday blamed human error for an outage at its cloud-services unit that caused widespread disruption to internet traffic across the US earlier this week.

Among the affected systems was Amazon's own status page, which for most of the outage showed that everything was running smoothly, even though around 20% of all Internet sites were affected, according to an estimate by Shawn Moore, CTO at Solodev. At 12:37 p.m. EST an employee executed a command that was meant to remove a small number of servers for one of the S3 subsystems used in the billing process. With a few mistaken keystrokes, the employee wound up knocking out subsystems that other AWS systems depend on to work properly.

Because of sluggish performance in the S3 billing system, an employee decided to debug the service by taking some billing servers offline, entering what they thought was a routine command to remove servers from an S3 subsystem. The mistake forced a full restart of the affected subsystems. "While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years", AWS said.

If you've ever rebooted an older computer and noticed it chugging on start-up, you'll understand the feeling AWS must have had while waiting for the system to come back.

"We understand that the SHD [Service Health Dashboard] provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions", AWS said.

In a note posted to customers today, Amazon revealed the cause of the problem: a typo.

Affected organisations included Business Insider, Expedia, Coursera, Quora, and Slack. Amazon had to carry out a series of safety checks to make sure that no stored files were corrupted in the process, and it took the company four hours and 17 minutes to get its systems running once again. "We will do everything we can to learn from this event and use it to improve our availability even further", the company said. To prevent a simple human error from causing another blackout, Amazon is taking steps such as adding "safeguards" that keep engineers from removing so much server capacity at once.
