and with YARN the game changes

Posted on October 17th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. 2 comments

I’ve been working with Hadoop for a few years now, and the platform and its ecosystem have been advancing at an amazing pace, with new features and additional capabilities appearing almost daily. Some changes are small, like better scheduling in Oozie; some are still progressing, like support for NFS; some are cool, like full support for CPython in Pig; but, in my opinion, the most important change is the introduction of YARN in Hadoop 2.0.

Hadoop was created with HDFS, a distributed file system, and the Map/Reduce framework, a distributed processing platform. With YARN, Hadoop moves from being a distributed processing framework to being a distributed operating system.
“Operating system” sounded a little exaggerated when I wrote it, so just for fun I picked up a copy of Tanenbaum’s “Modern Operating Systems”*, which I have lying around from my days as … Read More »

Fallacies of massively distributed computing

Posted on April 29th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. 5 comments

In the last few years we have seen the advent of highly distributed systems. Systems with clusters of many servers are no longer the sole realm of the Googles and Facebooks of the world, and we are beginning to see multi-node and big data systems in enterprises. For example, I don’t think a company such as Nice (the company I work for) would have released a Hadoop-based analytics platform and solutions, something we did just last week, 5-6 years ago.

So now that large(r) clusters are more prevalent, I thought it would be a good time to reflect on the fallacies of distributed computing: are they still relevant, and should they be changed?
If you don’t know the fallacies, you can see the list and read the article I wrote about them at the link mentioned above. In a few words … Read More »

Herding Apache Pig – using pig with perl and python

Posted on March 4th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. No Comments

The past week or so we got some new data that we had to process quickly. There are quite a few technologies out there for quickly churning out map/reduce jobs on Hadoop (Cascading, Hive, Crunch and Jaql, to name a few of many); my personal favorite is Apache Pig. I find that the imperative nature of Pig makes it relatively easy to understand what’s going on and where the data is going, and that it produces efficient enough map/reduces. On the downside, Pig lacks control structures, so working with Pig also means you need to extend it with user-defined functions (UDFs) or Hadoop streaming. Usually I use Java or Scala for writing UDFs, but it is always nice to try something new, so we decided to check out some other technologies – namely Perl and Python. This post highlights some of … Read More »
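As a sketch of what a Python UDF for Pig can look like, here is a minimal, hypothetical example (the function name and logic are mine, not from the post); in an actual Pig script the file would be registered with REGISTER … USING jython (or streaming_python for CPython), and the function would be decorated with @outputSchema so Pig knows its return type:

```python
# Hypothetical Python UDF sketch: cleans up a name field from a CSV column.
# In a real Pig script you would register this file and annotate the function
# with @outputSchema('name:chararray'); here it is shown as plain Python.

def normalize_name(name):
    """Trim surrounding whitespace and title-case the value."""
    if name is None:
        return None
    return name.strip().title()

print(normalize_name("  new york  "))  # -> New York
```

Once registered, the function can be called from Pig Latin just like a built-in, which is what makes UDFs a natural way to add the control logic Pig itself lacks.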

Introducing H-Rider

Posted on December 10th, by Arnon Rotem-Gal-Oz in Blog. 4 comments

In the last year and a half or so (since I joined Nice Systems) we’ve been hard at work building our big data platform based on a lot of open source technologies, including Hadoop, HBase and quite a few others. Building on open source brings a lot of benefits and helps cut development time by building on the knowledge and effort of others.

I personally think that this has to be a two-way street: as a company benefits from open source, it should also give something back. This is why I am very happy to introduce Nice’s first (hopefully the first of many) contribution back to the open source community: a UI dev tool for working with HBase called h-rider. H-rider offers a convenient user interface to poke around data stored in HBase, which our developers find very useful both for development and debugging.

h-rider … Read More »

Develop Map/Reduce with reduced assumptions

Posted on April 16th, by Arnon Rotem-Gal-Oz in Blog. No Comments

It all started with this odd bug…

One of our teams is writing a service that, among other things, runs map/reduce jobs built as Pig scripts with Java UDFs. The script accepts CSV files which have a text header followed by lines of data. It performs some grouping and then calls a UDF which essentially filters, enriches and transforms the data, outputting another CSV with a new header followed by data – something like the following:

input = load '/data/INPUT2.dat' using PigStorage(',')…
grouped = GROUP input BY key;
results = FOREACH grouped GENERATE evaluateUDF(input) as output;
STORE results…

and all was well. The job spawns a few map tasks that partition and group the data, then runs a single reduce where the actual evaluation happens.

Then someone wanted it to run faster. We can do that by adding reducers, which we can do by adding PARALLEL X to the … Read More »