Byzantine Reality

Searching for Byzantine failures in the world around us

History of MapReduce, Part 3

Now that we’ve covered the most popular (thus far) part of MapReduce’s life, let’s move on to its present and uncertain future. This time around we’ll cover extremely new material starting this year and will try to avoid rampant speculation about the future since there tends to be a high probability of it being flat out wrong. This article is primarily concerned with two topics: a new paper out that will be discussed at this year’s SIGMOD conference (A Comparison of Approaches to Large-Scale Data Analysis), and Hadoop Streaming, a fun new way to play with MapReduce that’s being used in some serious new cloud projects.

Let’s deal with these topics in the reverse order. Last time around, we showed that MapReduce could be useful for solving a wider class of problems then it is generally given credit for. One downside of this is that the vanilla MapReduce implementation forces the user to write Mappers and Reducers in Java. Don’t get me wrong, I like Java (despite the badmouthing I’ve done about it in the past), but I’d like to use other languages as well.

Enter Hadoop Streaming. This very interesting component that’s bundled in with Hadoop lets the user specify an arbitrary Map program and an arbitrary Reduce function. Now we can write MapReduce in any language and have it supported by Hadoop in an extremely simple fashion. The example given on the Streaming page itself uses bash:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
   
-input myInputDirs \
   
-output myOutputDir \
   
-mapper /bin/cat \
   
-reducer /bin/wc

This is pretty cool, suffice to say. The page also shows using Python, and this really lowers the barrier to using MapReduce to the point where the  “average programmer” can pick it up and start using it. And it seems Amazon has figured this out as well, albeit quite some time ago, and now have Amazon’s Elastic MapReduce. This looks like a pretty frontend on Hadoop Streaming that now lets you write code in one of seven languages currently (adding more would likely require installing those compilers on the default VM itself, which is actually not too difficult). But let’s substantiate that claim real fast. Why do I think Elastic MapReduce sits on top of Hadoop Streaming? Well, Amazon doesn’t make it a secret that it’s using Hadoop (see the previous link), but to know that it’s using Streaming, just look in the help manual:

elasticmapreduce

Using Hadoop is pretty much a no-brainer for Amazon. Hadoop added in support for S3 URLs quite some time ago so running it in EC2 itself wasn’t that far off. We’re in the process of doing a similar feat for AppScale, and while researching how to do MapReduce with arbitrary languages drew me to Streaming. After I saw the potential of that, I tripped onto the page containing the above picture and saw that this idea was apparently sound enough since Amazon is already doing this. The only real question that was left to me was how to get the data from our databases (HBase or HyperTable at present, Cassandra and MySQL soon enough).

This appears to be a non-trivial problem. For HBase it’s easy enough: they provide abstract classes you can implement that will take care of input and output, although it appears this would only work with vanilla MapReduce and not Streaming. The easiest way around the problem is to just require that the input be on the file system, which is exactly how vanilla MapReduce and Streaming do it (specifically it must be in either HDFS or S3).

Unfortunately, about a year ago, a flame war erupted from part of the database community, who are of the impression that MapReduce is a major step backwards. This apparently generated such controversy in the interwebs that a followup to the original post was made to address the comments of the first. The basic idea sums up to this:

MapReduce is fun and nice and all but was essentially already done 30 years ago. It’s nothing new and by the way SQL can do anything MapReduce can do but way better.

The first sentence is completely true. The renewed interest around MapReduce is all around the pretty packaging of it and accessibility to the “average programmer”. But the second sentence is where the flame war got started and where it rages on to this day. Furthermore, the authors of these posts noticed how much attention this was getting and published a paper in this year’s SIGMOD conference named A Comparison of Approaches to Large-Scale Data Analysis, which specifically compared Hadoop’s MapReduce to relational databases. Since this paper really sums up their argument against MapReduce and shows insights to their viewpoints, let’s focus on it except when the blog posts answer questions it doesn’t.

If you aren’t already familiar with the war between the MR-folks and whatever fraction of the database community it is that doesn’t care for MR very much, it really boils down to this. 

should have compared HBase and MapReduce to Vertica and MySQL with stored procedures.

Although we were not surprised by the relative performance advantages provided by the two parallel database systems, we were impressed by how easy Hadoop was to set up and use in comparison to the databases. The Vertica installation process was also straightforward but temperamental to certain system parameters. DBMS-X, on the other hand, was difficult to configure property and required repeated assistence from the vendor to obtain a configuration that performed well. For a mature product such as DBMS-X, the entire experience was indeed disappointing. Given the upfront cost advantage that Hadoop has, we now understand why it has quickly attracted such a large user community.

(DBMS-X gets a bit of a beating like this in the paper as well, so maybe this is why they didn’t reveal which database it was.) With all that being said and done, the paper is still an interesting read. If you buy the argument that since users are leaving relational databases for MapReduce and that this makes them equal comparisons, I think you’ll get more out of it than if you don’t. Since I really haven’t tried ditching a relational database and doing the exact same work in MapReduce, it’s really difficult to speculate too much about it. In the meanwhile, I’ll keep my fingers crossed hoping to see an HBase v. MySQL showdown although it seems like it’s hoping for a bit much since that’s not what it originally was and there’s no “Future Work” section in the paper.

Since this “history lesson” covers the present age of MapReduce, it would be also remiss of me to leave out at least a mention about Pig Latin and friends. It will certainly be interesting to see how they fare and what specifically about it lets them succeed or stunts their acceptance in the MR community, and if we end up seeing comparisons between it and SQL down the line :) .