Furthermore, one of the affected services was Amazon's own status page, which for most of the outage showed that everything was running smoothly, even though around 20% of all Internet sites were affected, according to an estimate by Shawn Moore, CTO at Solodev. At 12.37pm EST an employee executed a command that was meant to remove a small number of servers for one of the S3 subsystems used in the billing process. With a few mistaken keystrokes, the employee instead knocked out systems that other parts of AWS depend on to work properly.
Because a handful of AWS's S3 servers were running sluggishly, one of the employees decided to debug the service by taking some billing servers offline, entering what they thought was a routine command to remove servers from an S3 subsystem. The command removed far more servers than intended, forcing a full restart of the affected subsystems. "While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years", AWS said.
If you've ever rebooted an older computer and noticed it chugging on start-up, you'll understand the feeling AWS must have had while waiting for the system to come back.
"We understand that the SHD [Service Health Dashboard] provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions", AWS said.
In a note posted to customers today, Amazon revealed the cause of the problem: a typo.
Affected organisations included Business Insider, Expedia, Coursera, Quora, and Slack. Amazon had to carry out a series of safety checks to make sure that no stored files were corrupted in the process, and it took the company four hours and 17 minutes to get its systems running once again. "We will do everything we can to learn from this event and use it to improve our availability even further", AWS said. To prevent a simple human error from causing another blackout, Amazon is taking steps such as adding "safeguards" that keep engineers from removing so much server capacity at once.