If you weren't trying to use your EditMe site on April 21, consider yourself lucky. For those of you who were, this post attempts to explain what happened in layman's terms.
On April 21 at 4AM, all EditMe sites went down. We quickly identified a problem with the service our hosting provider uses to supply disk storage to our servers. To put it simply, reading and writing data on the disks that store everything needed to run the EditMe service ground to a halt. Since disk storage is a fundamental requirement for a computer system to run, this locked up everything.
For some background, EditMe is hosted "in the cloud" using a service owned by Amazon (yes, the online retailer) called Amazon Web Services (AWS). I get into why we use AWS in some detail below.
We checked the AWS service status page to see if they were aware of the problem. Since there was no indication there that they were, we notified AWS. A few Twitter searches told us right away that lots of AWS customers were experiencing the same problem.
The huge benefit of using a service like Amazon Web Services (AWS) is that service can usually be recovered quickly when something goes wrong. When a problem affects a particular server, that server can be moved to another working set of hardware within minutes, without any intervention from AWS support staff or data center personnel. In a traditional non-cloud data center, a server failure requires manual intervention and usually means several hours of downtime for that server. In this case, the problem was affecting all of our servers, so we realized it was not a problem at the server level.
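For the technically curious, here's a minimal sketch of what that single-server recovery can look like today, using the Python boto3 SDK purely for illustration (it didn't exist in 2011, and our actual tooling was different); the instance ID is hypothetical. Stopping and then starting a server backed by Amazon's disk storage lets AWS place it on different physical hardware.

    # Minimal sketch: recover an EBS-backed instance onto new hardware.
    # The instance ID below is hypothetical.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"  # hypothetical

    # Stop, then start (not reboot): a stop/start cycle lets AWS move the
    # instance to a different physical host.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])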
When the problem is more widespread, there is still opportunity for fairly speedy self-recovery, which is another huge benefit of this infrastructure. AWS segments its massive data centers into independently operating sections called Availability Zones. Each zone has separate power, networking and other services to avoid a single point of failure. When a serious problem brings down service in one zone, AWS customers can fairly easily and quickly migrate their service to another zone. Again, this is done without having to involve AWS staff or feet on the ground.
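To make that concrete, here's a minimal sketch of launching a replacement server in a different Availability Zone, again using the modern boto3 SDK just for illustration; the machine image ID, instance type and zone name are all hypothetical.

    # Minimal sketch: launch a replacement instance in a different,
    # healthy Availability Zone. IDs and names are hypothetical.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",               # hypothetical machine image
        InstanceType="m1.large",                       # era-appropriate instance type
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": "us-east-1b"},  # a zone that's still healthy
    )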
After seeing that Amazon had recognized the problem and was working on it by 6:30AM, we decided to start migrating service to another zone. Unfortunately, we were unable to. It's still unclear exactly why, but the problem not only brought down service in our Availability Zone; attempts to restore service in other zones were failing as well. That meant our hands were tied. Our service was down and we had no way to restore it until AWS restored service to at least one part of their data center.
As we monitored the problem into the morning, it became clear that it was massive in scope. By mid-morning we learned that it had brought down several major internet sites, and the news from the AWS status page was not encouraging. The little detail they gave was that their storage service had encountered a problem, to which it responded by recreating mirrors (the redundant disk copies that protect against disk failure). That effort is not easily stopped once started, and the performance impact of such a massive remirroring was devastating.
One of the challenges of cloud computing is that when there is a problem like this, each customer responds by trying to recreate their environment in a working section of the data center. This compounds a performance problem by doubling or tripling the resources required to serve existing hosting customers. We believe the disk performance issue, combined with the massive additional load generated by customers attempting to recover service, is what was behind the outage.
AWS has promised to perform a thorough postmortem and publish details of what happened, what they have learned from it and what will be done to prevent such a failure in the future. We are eagerly awaiting this information and will share it here when it becomes available.
By about 6PM on April 21, AWS had restored service to the point where we were able to begin migrating the EditMe service to one of the other Availability Zones that was working. This process was completed between 8 and 10PM, and all sites were restored to service. Over the following 24 hours, performance issues continued to plague the service, causing intermittent, though mostly brief, outages.
In order to restore service this way, we needed to start from the backup taken prior to the outage. This meant that changes made between 10PM ET on 4/20/11 and 4AM ET on 4/21/11 (when the outage began) would be lost. Given that this is a very slow period for EditMe site changes, and that we were confident the old volumes were not lost forever, we decided it was better to restore from these backups and resume service than to wait an unknown amount of time for the volume recovery process. We posted a detailed message to the editing screen of all EditMe sites notifying Administrators of the situation and explaining their options.
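For those interested in the mechanics, restoring this way amounts to creating a fresh disk volume from the last pre-outage backup snapshot in a healthy Availability Zone and attaching it to a replacement server. A minimal sketch using the modern boto3 SDK for illustration (all IDs and the device name are hypothetical):

    # Minimal sketch: restore the last pre-outage backup into a new volume
    # in a healthy Availability Zone and attach it to a replacement
    # instance. All IDs and the device name are hypothetical.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    snapshot_id = "snap-0123456789abcdef0"  # backup taken before the outage
    healthy_zone = "us-east-1b"             # a zone unaffected by the problem
    instance_id = "i-0123456789abcdef0"     # replacement instance in that zone

    # Create a fresh volume from the snapshot in the healthy zone.
    volume = ec2.create_volume(SnapshotId=snapshot_id,
                               AvailabilityZone=healthy_zone)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

    # Attach it to the replacement instance so service can resume.
    ec2.attach_volume(VolumeId=volume["VolumeId"],
                      InstanceId=instance_id,
                      Device="/dev/sdf")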
We're glad we did restore from backup because, as of this writing, four days after the outage began, we are still unable to recover two of the four volumes affected by the outage. We continue to work with AWS to restore them. We're also glad because very few customers have come forward to say they entered important content during that lost time frame.
This outage received a lot of press coverage, most of it questioning the viability of cloud computing and warning against "putting all your eggs in one basket". Most of these pieces were written by journalists who don't have a solid understanding of the technology behind a service like AWS. The massively redundant nature of AWS data centers means that hosting with them is hardly putting all your eggs in one basket. I recall a 24-hour outage in 2004 at the data center where EditMe was first hosted, caused by a fire that did significant physical damage to the facility. I bring this up only as an example of the fallibility of any hosting environment.
Cloud hosting is fairly new. There are unparalleled benefits to the kind of service AWS provides, and we think that retreating to a traditional dedicated hosting solution would be an unwise reaction to this outage. Clearly this has been a painful episode for AWS and their customers. We expect AWS will learn from these mistakes and become a more stable and reliable service because of them. If there's a silver lining to last week's crisis, that's it.
AWS was the pioneer in this space, and they continue to run the biggest and most complete suite of cloud computing infrastructure services. We could run to one of their competitors, but it's not likely the grass is greener with less experienced companies. In short, we feel that the benefits of cloud computing substantially outweigh the risks, and that AWS is still one of the best, if not the best, providers in this space.
Thanks to all of you who offered words of understanding and encouragement during these trying days. We're terribly sorry for the very real disruption this outage caused our customers.