Herding Apache Pig – using Pig with Perl and Python
Over the past week or so we got some new data that we had to process quickly. There are quite a few technologies out there for quickly churning out map/reduce jobs on Hadoop (Cascading, Hive, Crunch and Jaql, to name a few), but my personal favorite is Apache Pig. I find that the imperative nature of Pig makes it relatively easy to understand what's going on and where the data is going, and that it produces efficient enough map/reduce jobs. On the downside, Pig lacks control structures, so working with Pig also means you need to extend it with user-defined functions (UDFs) or Hadoop streaming. Usually I use Java or Scala for writing UDFs, but it is always nice to try something new, so we decided to check out some other technologies, namely Perl and Python. This post highlights some of the pitfalls we met and how to work around them.
Yuval, who was working with me on this mini-project, likes Perl (to each his own, I suppose), so we started with that. Searching for Pig and Perl examples, we found something like the following:
A = LOAD 'data';
B = STREAM A THROUGH `stream.pl`;
The first pitfall here is that the Perl script name is surrounded by backticks (the character on the tilde (~) key) and not by single quotes: in the script above, 'data' is in single quotes while `stream.pl` is in backticks.
The second pitfall was that the code above works nicely when you use Pig in local mode (pig -x local), but it failed when we tried to run it on the cluster. It took some head scratching and some trial and error, but eventually Yuval came up with the following:
DEFINE CMD `perl stream.pl` SHIP('/PATH/stream.pl');
A = LOAD 'data';
B = STREAM A THROUGH CMD;
Basically we're telling Pig to ship the Perl script to the cluster so that it is accessible on all the nodes.
So, Perl worked pretty well, but since we're using Hadoop Streaming and get the data via stdin, we lose all the context that Pig has about the data. We also need to emulate the textual representations of bags and tuples so that the returned data is available to Pig for further work. This is all workable but not fun to work with (in my opinion, anyway).
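To make the "emulate the textual representations" point concrete, here is a minimal sketch of the output side of a streaming script. It is written in Python rather than Perl purely for illustration, and it assumes Pig's default text form for streamed data (tab-separated fields, tuples written as (a,b), bags as {(a),(b)}). The names format_value, process_line and stream, and the transformation itself (first field kept as a key, the rest packed into a bag), are hypothetical.

```python
import sys


def format_value(value):
    """Render a Python value in the assumed Pig text form.

    Tuples become (a,b), lists are treated as bags and become {(a),(b)},
    and anything else is written with str().
    """
    if isinstance(value, tuple):
        return "(" + ",".join(format_value(v) for v in value) + ")"
    if isinstance(value, list):
        return "{" + ",".join(format_value(v) for v in value) + "}"
    return str(value)


def process_line(line):
    """Hypothetical transformation: keep the first field, bag up the rest."""
    fields = line.rstrip("\n").split("\t")
    key = fields[0]
    bag = [(field,) for field in fields[1:]]  # a bag of one-field tuples
    return key + "\t" + format_value(bag)


def stream(lines, out):
    """Read input records line by line and write one output record per line."""
    for line in lines:
        out.write(process_line(line) + "\n")
```

A real streaming script would finish with a call such as stream(sys.stdin, sys.stdout). Note that nothing in the script knows the field names or types; it only sees text, which is exactly the loss of context described above.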
So I decided to write the Pig UDFs in Python. Python can be used with streaming, like Perl above, but it also integrates more tightly with Pig via Jython (i.e. the Python UDF is compiled to Java bytecode and shipped to the cluster as part of the jar Pig generates for the map/reduce job anyway).
Pig UDFs are better than streaming because you get Pig's schema for the parameters and you can tell Pig the schema of the output you return. UDFs in Python are especially nice because the code is almost 100% regular Python and Pig does the mapping for you (for instance, a bag of tuples in Pig is translated to a list of tuples in Python). The only real difference is that if you want Pig to know about the data types you return from the Python code, you need to annotate the function with @outputSchema. For example, here is a simple UDF that gets the month as an int from a date string in the format YYYY-MM-DD HH:MM:SS:
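from datetime import datetime

@outputSchema("num:int")
def getMonth(strDate):
    try:
        # Drop any fractional seconds after the "." before parsing
        dt, _, _ = strDate.partition(".")
        return datetime.strptime(dt, "%Y-%m-%d %H:%M:%S").month
    except AttributeError:
        # strDate is not a string (e.g. a null field)
        return 0
    except IndexError:
        return 0
    except ValueError:
        # The string does not match the expected date format
        return 0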
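Since the UDF is almost plain Python, it can also be exercised outside Pig before it goes anywhere near the cluster. The sketch below assumes that @outputSchema is only defined when Pig loads the script, so it adds a hypothetical do-nothing stand-in for the decorator when the name is missing. The stand-in is for local testing only and is not part of Pig.

```python
from datetime import datetime

try:
    outputSchema  # assumed to be supplied by Pig when the script is registered
except NameError:
    def outputSchema(schema):
        """Local stand-in: ignore the schema and return the function unchanged."""
        return lambda func: func


@outputSchema("num:int")
def getMonth(strDate):
    try:
        # Drop any fractional seconds after the "." before parsing
        dt, _, _ = strDate.partition(".")
        return datetime.strptime(dt, "%Y-%m-%d %H:%M:%S").month
    except (AttributeError, ValueError):
        # Not a string (e.g. None), or not in the expected format
        return 0
```

With this in place, calling getMonth("2012-03-15 10:20:30") in a regular Python shell returns 3, while None or a malformed string returns 0.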
Using the UDF is as simple as declaring the Python file where the UDF is defined. Assuming our UDF is in a file called utils.py, it would be declared as follows:
Register utils.py using jython as utils;
And then using that UDF would go something like:
A = LOAD 'data' using PigStorage('|') as (dateString:chararray);
B = FOREACH A GENERATE utils.getMonth(dateString) as month;
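The same mapping applies to complex types. Because a bag of tuples reaches Python as a list of tuples, a UDF that aggregates over a bag is ordinary list processing. The function below is a hypothetical example, not part of the original utils.py: it takes a bag of one-field tuples holding date strings and returns the most frequent month. In a Pig-registered script it would carry an annotation such as @outputSchema("month:int"), which is left out here so the sketch runs as plain Python.

```python
from datetime import datetime


def busiestMonth(dates):
    """Return the most frequent month in a bag of (dateString,) tuples.

    `dates` is assumed to arrive as a list of one-field tuples. Rows that
    cannot be parsed are skipped, and 0 is returned if nothing parses.
    """
    counts = {}
    for row in dates:
        try:
            month = datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S").month
        except (TypeError, ValueError):
            continue  # null or malformed date string
        counts[month] = counts.get(month, 0) + 1
    if not counts:
        return 0
    # Highest count wins; on a tie, prefer the earlier month
    return max(counts, key=lambda m: (counts[m], -m))
```

On the Pig side, this kind of function would typically be applied to the bag produced by a GROUP, for example after grouping A with GROUP A ALL.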
Again, as in the Perl case, there are a few pitfalls here. For one, the Python script and the Pig script need to be in the same directory (relative paths only work in local mode). The more annoying pitfall hit me when I wanted to import some Python libraries (e.g. datetime in the example, which is imported using "from datetime import datetime"). I could not find a way to make this work out of the box. The solution I eventually came up with was to take a Jython standalone .jar (a jar with the common Python libraries included) and replace Pig's Jython jar (in the Pig lib directory) with the standalone one. There's probably a nicer way to do this (and I'd be happy to hear about it), but this worked for me. It only has to be done on the machine where you run the Pig script, as the Python code gets compiled and shipped to the cluster as part of the jar file Pig generates anyway.