I’ve been working with Hadoop for a few years now, and the platform and its ecosystem have been advancing at an amazing pace, with new features and additional capabilities appearing on an almost daily basis. Some changes are small, like better scheduling in Oozie; some are still progressing, like support for NFS; some are cool, like full support for CPython in Pig; but, in my opinion, the most important change is the introduction of YARN in Hadoop 2.0.
Hadoop was created with HDFS, a distributed file system, and the Map/Reduce framework, a distributed processing platform. With YARN, Hadoop moves from being a distributed processing framework to being a distributed operating system.
“Operating system”: that sounded a little exaggerated when I wrote it, so, just for fun, I picked up the copy of Tanenbaum’s “Modern Operating Systems”* I have lying around from my days as … Read More »
In the last few years, we have seen the advent of highly distributed systems. Clusters with lots of servers are no longer the sole realm of the Googles and Facebooks of the world, and we are beginning to see multi-node and big data systems in enterprises. For example, I don’t think a company such as Nice (the company I work for) would have released a Hadoop-based analytics platform and solutions 5-6 years ago, something we did just last week.
So now that large(r) clusters are more prevalent, I thought it would be a good time to reflect on the fallacies of distributed computing and ask whether they are still relevant or should be changed.
If you don’t know about the fallacies, you can see the list and read the article I wrote about them at the link mentioned above. In a few words … Read More »
One of our team leaders approached me in the hall today and asked if I could lend a hand in troubleshooting something. He and our QA lead were configuring one of our test Hadoop clusters after an upgrade, and they had a problem with one table they were trying to set up:
When they tried to create the table in the HBase shell, they got an error that the table exists.
When they tried to delete the table, they got an error that the table does not exist.
HBase ships with a health-check and fix utility called hbck (run it with hbase hbck; see here for details) – they ran it, and it reported that everything was fine and dandy.
Hmm. The first thing I tried to do was look at the .META. table. This is where HBase keeps the tables and the regions they use. I … Read More »
By now you’ve probably heard something about Apple’s new iOS 6 maps app. In case you’ve been living under a rock, it turns out that the new and shiny application that replaces Google Maps in the new iOS release produces a lot of inaccuracies, mangled graphics, navigation errors and whatnot (just like the image you see on the left – for more examples you can see this site). Kidding (or gloating) aside, this debacle carries with it a few important lessons that anyone who is building a big data project should keep in mind.
Apple took data from various sources like Waze, TomTom, Yelp and others to build their database, thinking that it is all just geographical data using the same coordinate system, so everything should be just fine. Well, it doesn’t work like that – our first and probably most important … Read More »
My Twitter feed spewed out a very good list of distributed computing related papers (compiled by Dan Creswell). There are links to a lot of papers there. A few of my favorites are “The fallacies of distributed computing” by Peter Deutsch (you may also want to check out the paper I wrote explaining them); “Life beyond distributed transactions: an apostate’s opinion” by Pat Helland; “The Byzantine generals problem” by Leslie Lamport, Robert Shostak and Marshall Pease; “A note on distributed computing” by Samuel C. Kendall, Jim Waldo, Ann Wollrath and Geoff Wyant; and “Harvest, yield, and scalable tolerant systems” by Armando Fox and Eric A. Brewer, which I mentioned before in “10 papers every architect should read”.
There are also a few additional papers that are not in that list and that I found illuminating:
“Architectural Styles and the Design … Read More »
The internet is ablaze with posts and articles on HP’s abandonment of WebOS and the PC business. I don’t have a lot to add to this discussion (it’s a pity, I was considering a Pre3 device, blah blah). I am personally more interested in the last bit of their announcement, which talked about the $10.3bn acquisition of Autonomy. If we add this to HP’s acquisition of Vertica just 6 months ago and the (supposedly) failed attempt to acquire Tibco, we can see that HP is making a play for the whole unstructured/big-data/analytics field.
Vertica is a columnar database optimized for analytics. In a nutshell, columnar databases allow fast aggregation of data as well as holding a lot of columns per row – both traits are helpful when trying to report on data (but very wasteful when you try to do … Read More »
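To make the aggregation point a bit more concrete, here is a toy Python sketch of the same three records held row-wise and column-wise. It is only an illustration of the general idea and says nothing about how Vertica actually stores data; the field names (region, units, price) and both helper functions are made up for the example.

```python
# Toy illustration only: the same records in a row layout and a column layout.
# Field names and helpers are hypothetical, not taken from any real database.
rows = [
    {"region": "east", "units": 3, "price": 10.0},
    {"region": "west", "units": 5, "price": 12.0},
    {"region": "east", "units": 2, "price": 11.0},
]

# Column layout: one list per column, all values of a column stored together.
columns = {name: [record[name] for record in rows] for name in rows[0]}


def total_units_row_layout(records):
    # Walks every whole record just to pick out a single field from each.
    return sum(record["units"] for record in records)


def total_units_column_layout(cols):
    # Reads only the one column the aggregate needs; the others are untouched.
    return sum(cols["units"])


print(total_units_row_layout(rows), total_units_column_layout(columns))
```

Both functions return the same total, but the column version never looks at region or price, which is why adding more columns per row does not slow this kind of report down.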