Amazon’s EC2 & EBS outage
Unless you are living under a rock, you’ve probably heard, if not felt Amazon’s outage (you can read more about it all over the web, e.g. Cade Metz at the register or Julianne Pepitone at CNN money, [edit Apr26] Todd Hoff ‘s list of posts on the subject) This incident is very interesting in the technical sense as well as very disturbing.
First off it is a major event since Amazon controls almost 60% of the Infrastructure as a Service market (per WSJ) and an incident like this is bad publicity for the whole cloud concept. After all if Amazon is fledgling what does that mean for Rackspace and the other IaaS players – not to mention PaaS players like Google and Microsoft (since PaaS solutions require more complicated software and higher integration with end-solutions.
It is also worth mentioning that this isn’t the major outages for Amazon. One notable “availability event” occurred in 2008 where S3 had major problem for about half a day and a few minor problem with EC2 in 2009. What’s stands out here is that the availability zones features that was supposed to isolate this type of breakdown to smaller areas broke and only sites that had data-center redundancy (like Netflix for example) managed to handle the interruption while sites like foursquare, reddit and even, it seems, a company monitoring cardiac arrests
were all harmed.
Another alarming behavior on the part of Amazon is the lack of transparency / bad crisis management in handling the outage. For example Keith Smith CEO of BigDoor (one of the startups affected by the outage) writes
“Starting at 1:41 a.m. PST, Amazon’s updates read as if they were written by their attorneys and accountants who were hedging against their stated SLA rather than being written by a tech guy trying to help another tech guy.”
When a big supplier fumbles a lot of companies are affected and it is bound to get big press. Also, systems, esp. complex ones, have bugs and its understandable that things may break from time to time (and I am sure that Amazon, which evidently has top talent would find a way to prevent this from recurring). However, Cloud providers need to understand that “with great power comes great responsibility” and the technical offering need to be strengthened with great support and transparency.
On the technical level, it also means that, while moving to the cloud carries a lot of benefits and can save a lot on operations costs, the responsibility for our application’s up time is still our responsibility and depending on the cost of failure (on the scale from few dollars to lives of people) we should also architect for disaster regardless of vendor claims. When we build closed systems we may look at the Mean-Time Between Failure (MTBF and MTBCF) advertised by hardware manufacturers but we’d also add software based reliability mechanisms – for the cloud that may mean cross data-center (region) deployments, cross cloud provider deployments, or even on-premise backup – This is the same measures you’d take when your dealing with the electric company, if it is important enough, you’d install UPSs and generators and alternate sites and whatnot, you just have to figure out how important business continuousness is
I guess we’ll all wait to see how all this unfolds and what will be the after-effects of this outage on Cloud-computing. I personally think that it is still a good move in many cases, however this incident, help focus that the responsibility for our application’s wellness is still ultimately ours
Edit Apr. 24th:
I just read a post in Coding Horror which refers to a year old post in Netflix’s blog called “5 lessons we’ve learned using AWS [Amazon WebSevices]“. Netflix, in case you’re wondering survived Amazon’s outage and indeed, in lesson #3 they explain that if you want to survive failures you have to plan and constantly test for it:
3. The best way to avoid failure is to fail constantly.We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.