Category: Big Data
Fallacies of massively distributed computing
In the last few years, we see the advent of highly distributed systems. Systems that have clusters with lots of servers are no longer the sole realm of the googles’ and facebooks’ of the world and we begin to see multi-node and big data systems in enterprises. e.g. I don’t think a company such as Nice (the company I work for) would release an hadoop based analytics platform and solutions, something we did just last week, 5-6 years ago.
So now that large(r) clusters are more prevalent, I thought it would be a good time to reflect on the fallacies of distributed computing and how/if they are relevant; should they be changed.
If you don’t know about the fallacies you can see the list and read the article I wrote about them at the link mentioned above. In a few words … Read More »
Herding Apache Pig – using pig with perl and python
the past week or so we got some new data that we had to process quickly . There are quite a few technologies out there to quickly churn map/reduce jobs on Hadoop (Cascading, Hive, Crunch, Jaql to name a few of many) , my personal favorite is Apache Pig. I find that the imperative nature of pig makes it relatively easy to understand what’s going on and where the data is going and that it produces efficient enough map/reduces. On the down side pig lacks control structures so working with pig also mean you need to extend it with user defined functions (UDFs) or Hadoop streaming. Usually I use Java or Scala for writing UDFs but it is always nice to try something new so we decided to checkout some other technologies – namely perl and python. This post highlights some of … Read More »
Killing the HBase zombie table
One of our team leaders approached me in the hall today and asked if I could land a hand in troubleshooting something. He and our QA lead were configuring one of our test Hadoop clusters after an upgrade and they had a problem with one table they were trying to set up:
When they tried to create the table in HBase shell they got an error that the table exists
When they tried to delete the table they got an error that the table does not exist
HBase ships with a health-check and fix util called hbck (use: hbase hbck to run. see here for details) – they’ve run hbase reports everything is fine and dandy
Hmm, The first thing I tied to do is to look at the .META. table. This is where HBase keeps the tables and the regions they use. I … Read More »
The NoSQL landscape in diagrams
Here’s the NoSQL landscape in 3 slides (and hey, at least mine looks different :) )
451 research published their view of the NoSql/NewSql world in a unified diagram.
Infochimps published a similar diagram
And here’s mine from SOA Patterns chapter 10 (discussing “SOA & big data”)
5 lessons big-data projects should learn from the iOS6 map debacle
By now you’ve probably heard something about Apple’s new iOS6 maps app. In case you’ve been living under a rock, it turns out the new and shiny application that replaces Google maps in the new iOS release produces a lot of inaccuracies, mangled graphics, navigation errors and what not (just like the image you see on the left – for more examples you can see this site). Kidding (or gloating) aside, this debacle carries with it a few important lessons that anyone who is building a big data project should keep in mind.
Apple took data from various sources like Waze, Tomtom, yelp and others to build their database. thinking that it is all just geographical data using the same coordinate system so everything should be just fine. Well, it doesn’t work like that – out first and probably most important … Read More »

