Is there a future for Map/Reduce?


Posted on June 3rd, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. 3 comments

8w9jj

Google’s Jeffrey Dean and Sanjay Ghemawat filed the patent request and published the map/reduce paper  10 year ago (2004). According to WikiPedia Doug Cutting and Mike Cafarella created Hadoop, with its own implementation of Map/Reduce,  one year later at Yahoo – both these implementations were done for the same purpose – batch indexing of the web.

Back than, the web began its “web 2.0″ transition, pages became more dynamic , people began to create more content – so an efficient way to reprocess and build the web index was needed and map/reduce was it. Web Indexing was a great fit for map/reduce since the initial processing of each source (web page) is completely independent from any other – i.e.  a very convenient map phase and you need  to combine the results to build the reverse index. That said, even the core google algorithm –  the famous pagerank is iterative (so less appropriate for map/reduce), not to mention that  as the internet got bigger and the updates became more and more frequent map/reduce wasn’t enough. Again Google (who seem to be consistently few years ahead of the industry) began coming up with alternatives like Google Percolator  or  Google Dremel (both papers were published in 2010, Percolator was introduced at that year, and dremel has been used in Google since 2006).

So now, it is 2014, and it is time for the rest of us to catch up with Google and get over Map/Reduce and  for multiple reasons:

  • end-users’ expectations (who hear “big data” but interpret that as  “fast data”)
  • iterative problems like graph algorithms which are inefficient as you need to load and reload the data each iteration
  • continuous ingestion of data (increments coming on as small batches or streams of events) – where joining to existing data can be expensive
  • real-time problems – both queries and processing

In my opinion, Map/Reduce is an idea whose time has come and gone – it won’t die in a day or a year, there is still a lot of working systems that use it and the alternatives are still maturing. I do think, however, that if you need to write or implement something new that would build on map/reduce – you should use other option or at the very least carefully consider them.

So how is this change going to happen ?  Luckily, Hadoop has recently adopted YARN (you can see my presentation on it here), which opens up the possibilities to go beyond map/reduce without changing everything … even though in effect,  a lot  will change. Note that some of the new options do have migration paths and also we still retain the  access to all that “big data” we have in Hadoopm as well as the extended reuse of some of the ecosystem.

The first type of effort to replace map/reduce is to actually subsume it by offering more  flexible batch. After all saying Map/reduce is not relevant, deosn’t mean that batch processing is not relevant. It does mean that there’s a need to more complex processes. There are two main candidates here  Tez and Spark where Tez offers a nice migration path as it is replacing map/reduce as the execution engine for both Pig and Hive and Spark has a compelling offer by combining Batch and Stream processing (more on this later) in a single engine.

The second type of effort or processing capability that will help kill map/reduce is MPP databases on Hadoop. Like the “flexible batch” approach mentioned above, this is replacing a functionality that map/reduce was used for – unleashing the data already processed and stored in Hadoop.  The idea here is twofold

  • To provide fast query capabilities* – by using specialized columnar data format and database engines deployed as daemons on the cluster
  • To provide rich query capabilities – by supporting more and more of the SQL standard and enriching it with analytics capabilities (e.g. via MADlib)

Efforts in this arena include Impala from Cloudera, Hawq from Pivotal (which is essentially greenplum over HDFS), startups like Hadapt or even Actian trying to leverage their ParAccel acquisition with the recently announced Actian Vector . Hive is somewhere in the middle relying on Tez on one hand and using vectorization and columnar format (Orc)  on the other

The Third type of processing that will help dethrone Map/Reduce is Stream processing. Unlike the two previous types of effort this is covering a ground the map/reduce can’t cover, even inefficiently. Stream processing is about  handling continuous flow of new data (e.g. events) and processing them  (enriching, aggregating, etc.)  them in seconds or less.  The two major contenders in the Hadoop arena seem to be Spark Streaming and Storm though, of course, there are several other commercial and open source platforms that handle this type of processing as well.

In summary – Map/Reduce is great. It has served us (as an industry) for a decade but it is now time to move on and bring the richer processing capabilities we have elsewhere to solve our big data problems as well.

Last note  - I focused on Hadoop in this post even thought there are several other platforms and tools around. I think that regardless if Hadoop is the best platform it is the one becoming the de-facto standard for big data (remember betamax vs VHS?)

One really, really last note – if you read up to here, and you are a developer living in Israel, and you happen to be looking for a job –  I am looking for another developer to join my Technology Research team @ Amdocs. If you’re interested drop me a note: arnon.rotemgaloz at amdocs dot com or via my twitter/linkedin profiles


*esp. in regard to analytical queries – operational SQL on hadoop with efforts like  Phoenix ,IBM’s BigSQL or Splice Machine are also happening but that’s another story

illustration idea found in  James Mickens’s talk in Monitorama 2014 –  (which is, by the way, a really funny presentation – go watch it) -ohh yeah… and pulp fiction :)

Tweet about this on Twitter14Share on LinkedIn27Share on Facebook1Share on Google+3Buffer this pageShare on Reddit0Share on StumbleUpon0Email this to someone




  • asafm

    Great article! I’d like to add: Shark which is basically Hive using Spark as its execution engine.
    Apache Drill – looks like an amazing project by reading the source code.
    Facebook Presto
    Stinger Initiative – upgraded Hive to use Tez as its execution engine.
    Regarding YARN – it’s Job design scheme prevent it from using it as as hoc resource management thus I’m not sure it will truly rock.

  • http://arnon.me Arnon Rotem-Gal-Oz

    Thanks :)

    Regarding the technologies I just gave a few examples. You are right that there are many options and Presto, for instance is a good one. Regarding Shark – I think that it is on its way out now that Spark SQL is part of the platform. Drill seems to have a long way to go , same goes for Tajo (which you didn’t mention)

    I did mention Hive on Tez and that’s definitely a good direction for Hive.

    As for Yarn – I does have shortcomings – I’ve mention them in my yarn presentation http://www.slideshare.net/arnonrgo/hadoop-yarn-35157325 – note that recently Apache slider was introduces and it aims to close most of these gaps http://hortonworks.com/blog/apache-slider-technical-preview-now-available/

  • asafm

    Cool!
    Regarding Shark – it has two major points Spark will have to add: supports Hive tables – any format. That’s a very strong feature. And Sticky RRD which if I remember correctly Spark doesn’t have – such that if a node fails, this RRD portion can be reloaded. Haven’t used any of those tools in production though so don’t know if the last feature is really useful practically.