Unless you are living under a rock, you’ve probably heard of, if not felt, Amazon’s outage (you can read more about it all over the web, e.g. Cade Metz at The Register or Julianne Pepitone at CNNMoney; [edit Apr 26] Todd Hoff’s list of posts on the subject). This incident is very interesting technically, as well as very disturbing.
First off, it is a major event, since Amazon controls almost 60% of the Infrastructure as a Service market (per the WSJ), and an incident like this is bad publicity for the whole cloud concept. After all, if Amazon falters, what does that mean for Rackspace and the other IaaS players – not to mention PaaS players like Google and Microsoft (since PaaS solutions require more complicated software and tighter integration with end solutions)?
It is also worth mentioning that this isn’t Amazon’s first major outage. One notable “availability event” occurred in 2008, when S3 had a major problem for about half a day, and there were a few minor problems with EC2 in 2009. What stands out here is that the availability zones feature that was supposed to isolate this type of breakdown to smaller areas broke down, and only sites that had data-center redundancy (like Netflix, for example) managed to handle the interruption, while sites like Foursquare, Reddit and even, it seems, a company monitoring cardiac arrests were all harmed.
Another alarming behavior on Amazon’s part is the lack of transparency and poor crisis management in handling the outage. For example, Keith Smith, CEO of BigDoor (one of the startups affected by the outage), writes:
“Starting at 1:41 a.m. PST, Amazon’s updates read as if they were written by their attorneys and accountants who were hedging against their stated SLA rather than being written by a tech guy trying to help another tech guy.”
When a big supplier fumbles, a lot of companies are affected and it is bound to get big press. Also, systems, especially complex ones, have bugs, and it’s understandable that things may break from time to time (and I am sure that Amazon, which evidently has top talent, will find a way to prevent this from recurring). However, cloud providers need to understand that “with great power comes great responsibility”, and the technical offering needs to be strengthened with great support and transparency.
On the technical level, it also means that while moving to the cloud carries a lot of benefits and can save a lot on operations costs, our application’s uptime is still our responsibility, and depending on the cost of failure (on a scale that runs from a few dollars to human lives) we should also architect for disaster regardless of vendor claims. When we build closed systems we may look at the Mean Time Between Failures (MTBF and MTBCF) advertised by hardware manufacturers, but we’d also add software-based reliability mechanisms. For the cloud that may mean cross data-center (region) deployments, cross cloud-provider deployments, or even an on-premise backup. These are the same measures you’d take when you’re dealing with the electric company: if it is important enough, you’d install UPSs, generators, alternate sites and whatnot. You just have to figure out how important business continuity is.
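To make this concrete, here is a minimal sketch of the kind of client-side failover check you could put in front of a cross-region or cross-provider deployment. The endpoint URLs and timeout below are made up for illustration; in a real system this logic usually lives in DNS or a load balancer rather than in application code, but the principle is the same.

```python
# Minimal failover sketch: try each deployment in order and use the first
# one that answers its health check. All URLs below are hypothetical.
import urllib.request

ENDPOINTS = [
    "https://us-east.example.com/health",             # primary region
    "https://eu-west.example.com/health",             # secondary region (or another provider)
    "https://backup.on-premise.example.com/health",   # on-premise fallback
]

def first_healthy_endpoint(endpoints, timeout=2):
    """Return the first endpoint whose health check answers with HTTP 200, or None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue  # unreachable or too slow - move on to the next deployment
    return None       # nothing is up: fall back to degraded mode

# Example: route traffic to whichever deployment is currently alive.
# print(first_healthy_endpoint(ENDPOINTS))
```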
I guess we’ll all wait to see how this unfolds and what the after-effects of this outage on cloud computing will be. I personally think that moving to the cloud is still a good move in many cases; however, this incident helps bring into focus that the responsibility for our application’s well-being is still ultimately ours.
Edit Apr. 24th:
I just read a post on Coding Horror which refers to a year-old post on Netflix’s blog called “5 lessons we’ve learned using AWS [Amazon Web Services]”. Netflix, in case you’re wondering, survived Amazon’s outage, and indeed, in lesson #3 they explain that if you want to survive failures you have to plan and constantly test for it:
3. The best way to avoid failure is to fail constantly. We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends. If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
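To give a rough feel for what lesson #3 means in code, here is a small sketch of that kind of graceful degradation (my own illustration, not Netflix’s code; the recommendations service below is a stand-in that simply fails):

```python
# Graceful degradation sketch: if the (stand-in) recommendations service
# fails or times out, answer with popular titles instead of failing outright.

POPULAR_TITLES = ["Title A", "Title B", "Title C"]  # cheap, always-available fallback

def get_personalized_picks(user_id):
    """Stand-in for a network call to a recommendations service that may fail."""
    raise TimeoutError("recommendations service did not answer in time")

def recommendations_for(user_id):
    try:
        return get_personalized_picks(user_id)
    except (TimeoutError, ConnectionError):
        # The recommendations system is down or slow: degrade the quality of
        # the response (popular titles instead of personalized picks), but still respond.
        return POPULAR_TITLES

print(recommendations_for(user_id=42))  # prints the popular-titles fallback
```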
I think my “Should we go Microsoft Azure or AWS EC2?” question has just been answered, and the answer would be maybe EC2 but never Azure. Netflix didn’t go down because EC2 was part of their solution with backup data centers. However, if I’m on a Platform as a Service, what is my backup? Another Platform as a Service provider for the same platform? How do the users of a PaaS like Heroku cross data centers? It seems like an EC2-style cloud allows for cross data-center redundancy, whereas with a PaaS all your eggs are in one basket. Our SaaS, TrackResults, went down for a day with the EC2 outage and we are now planning to leverage another cloud provider as part of our redundant data center strategy. That is something that would be difficult on a PaaS, if it wouldn’t defeat the purpose of using one entirely.
Tyler, a PaaS solution can provide geo-replication across data centers as part of the platform. Azure Storage, for instance, supports geo-replication with seamless failover to another data center. Having another layer of disaster recovery above what’s built into the PaaS solution seems somewhat paranoid.
PaaS with geo-replication was surely not the case with Heroku’s PaaS, which was built on top of Amazon’s EC2 and went down for more than 24 hours. Where does that leave their clients? Also, Amazon was supposed to have physical separation between zones, but that didn’t stop this single service provider from deploying what, in my opinion, was a software bug that corrupted the system as a whole. I think Amazon has opened a window onto what can happen with a single provider, in spite of an amazing reputation and how things are supposed to work.
@Tyler As I said in the post, the solution you should consider depends on the cost of downtime – i.e. what it means for your company to be out of service for 2 seconds / 5 minutes / 4 hours / 1 day, etc.