In the last few years, we see the advent of highly distributed systems. Systems that have clusters with lots of servers are no longer the sole realm of the googles’ and facebooks’ of the world and we begin to see multi-node and big data systems in enterprises. e.g. I don’t think a company such as Nice (the company I work for) would release an Hadoop based analytics platform and solutions, something we did just last week, 5-6 years ago.
So now that large(r) clusters are more prevalent, I thought it would be a good time to reflect on the fallacies of distributed computing and how/if they are relevant; should they be changed.
If you don’t know about the fallacies you can see the list and read the article I wrote about them at the link mentioned above. In a few words I’d just say that these are statement, originally drafted by Peter Deutsch, Tom Lyon and others in in 1991-2, about failed assumptions we are tempted to make when working on distributed systems which turn out as fallacies and cost us dearly.
So the fallacies help keep in mind that distributed systems are different, and they do seem to hold, even after the 20 years that passed. I think, however, that working with larger cluster we should also consider the following 3 as fallacies we’re likely to assume
- Instances are free
- Instances have identities
- Map/Reduce is a panacea
Instances are free
A lot of the new technologies of the big-data and noSQL era bring with them the promise of massive scalability. If you see a performance problem, you can just (a famous lullaby word) add another server. In most cases that is even true, you can indeed add more servers and get better performance. What these technologies don’t tell you is that instances have costs. More instances mean increased TCO starting from management effort monitoring, configuring etc, as well as operations cost either for the hardware; the rented space and electricity in a hosted solution or the usage by hours in a cloud environment. So from the development side of the fence the solution is easy – add more hardware. In reality sometimes it is better to make the effort and optimize your code/design. Just the other week we had a more than a 10 fold improvement in query performance by removing query parts that were no longer needed after a change in the data flow of the system – that was way cheaper than adding 2-3 more nodes to achieve the same results.
Instances have identities
I remember, sometime in Jurassic age, when I set up a network for the first time (A Novell Netware 3.11 if you must ask) it had just one server. Naturally that server was treated with a lot of respect. It had a a printer connected, it had a name, nobody could touch it but me. One server to rule all them clients. Moving on I had server farms, so just a list of random names began to be a problem so we started to use themes like gods, single malts (“can you reboot the Macallan please”) etc. Anyway, that’s all nice and dandy and if you are starting small with a (potentially) big data project you might be tempted to do something similar. If you are tempted – don’t. When you have tens of servers (and naturally even worst when you have hundreds or thousands) you no longer care about the individual server. You want to look at the world as pools of server types. you have a pool of data nodes in your Hadoop cluster, a pool of application servers , a pool of servers running configuration x and another with configuration y. You’d need tools like chef, Ansible or similar products to manage this mess. But again, you won’t care much about XYZ2011 server and even it runs tomcat today, tomorrow it may make more sense to make it part of the Cassandra cluster. What matters are the roles in the pools of resources and that the pool sizes will be enough to handle the capacity needed.
Map/Reduce is a panacea
Hadoop seems to be the VHS of large clusters. It might not be the ultimate solution, but it does seem to be the one that gets the most traction – a lot of vendors old (like IBM, Microsoft, Oracle etc.) and new (Hortonworks, Cloudera, Pivotal etc.) offer Hadoop distros and many other solutions offer Hadoop adaptors (Mongodb, Casandra, Vertica etc.) and Hadoop, well hadoop is about the distributed file system and, well, map/reduce.
Map/Reduce, which was introduced in 2004 by Google is an efficient algorithm for going over a large distributed data set without moving the data (map) and then producing aggregated or merged of results (reduce). Map/Reduce is great and it is a very useful paradigm applicable for a large set of problems.
However it shouldn’t be the only tool in your tool set as map/reduce is inefficient when there’s a need to do multiple iterations on the data (e.g. grpah processing) or when you have to do many incremental updates to the data but don’t need to touch all of it. Also there’s the matter of ad-hoc reports (which I’ll probably blog about separately) Google solved these in pregel, percolator and dremel in 2009/2010 and now the rest of the world is playing catchup as it did with map/reduce a few year ago – but even if the solutions are not mature yet, you should keep in mind that they are coming
Instances are free; Instances have identities; and map/reduce is a panacea – these are my suggested additions to the fallacies of distributed computing when talking about large clusters. I’d be happy to hear what you think and/or if there are other things to keep in mind that I’ve missed