Last time we looked at the Google File System paper. Next on the list chronologically is the famous MapReduce paper. This is the paper that drew my attention more than any other, as I heard about it years ago but never really got a chance to review it thoroughly. In many ways this work is the polar opposite of GFS: where GFS is very low-level, being a file system and all, MapReduce is a high-level programming abstraction. Like GFS, it provides excellent results on an extremely constrained set of problems, but it is aimed at programmers rather than files. And like GFS, the MapReduce paper has been out for a few years, so let's see what's been done with it and where its future could be going.
So let’s break down MapReduce as literally as possible. Map and reduce are two common functions found in functional programming languages (Lisp, ML, etc.). Map takes a list (or tuple), applies a given function to it, and spits out a new list. For example, a common map function would be to double each item in a list, in which case you would see this happen:
Map x2 to (1 2 3 4 5) --> (2 4 6 8 10)
So in this case your mapping function would be “double”. Reduce similarly takes a list (or tuple), and applies a given function to it to produce a single value. Common reduce functions would thus be “average”, “min”, or “max”. Using “min” as an example, you would see this happen:
Reduce "min" to (2 4 6 8 10) --> 2
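In a language with map and reduce built in, the two examples above translate almost directly. Here's a small sketch in Python (using the standard-library `functools.reduce`):

```python
from functools import reduce

# Map: apply the "double" function to each element of the list.
doubled = list(map(lambda x: x * 2, [1, 2, 3, 4, 5]))
print(doubled)  # [2, 4, 6, 8, 10]

# Reduce: fold the list down to a single value with "min".
smallest = reduce(lambda a, b: min(a, b), doubled)
print(smallest)  # 2
```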
From the way we’ve structured the last two examples you may be able to see where we’re going. The clever guys and gals at Google caught on to the notion of doing a Map and feeding its results into a Reduce. This of course was nothing new, but where they took it surely was. They have two notable contributions here:
- They brought Map and Reduce out of functional-programming-land into the rest of the programming world by providing MapReduce libraries in C++, Java, and Python.
- They brought MapReduce into the world of distributed computing, instead of it being run solely on a single computer.
Let’s look at each of these points a little closer. Map and reduce had been around for some fifty years already with Lisp, but there’s a key reason they never really showed up in other languages: they just couldn’t, at least not comfortably. Map and reduce both require a function as input to apply to the given data, and most non-functional languages really make it a huge pain in the ass to pass functions around. With C, you have to use function pointers, and with Java, you have to use Reflection (a bit expensive) or hide the functions inside of classes and pass those around. With functional languages, functions are first-class values, and you can pass and receive them extremely easily. With Google’s MapReduce library, you conceptually don’t need to worry about any of this; the library and runtime handle it for you.
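To make the first-class-functions point concrete, here's a tiny Python sketch (the `apply_to_all` helper is purely illustrative) where a function is handed around as an ordinary value, with none of the function-pointer or Reflection ceremony:

```python
def apply_to_all(fn, items):
    # fn is just a value here; no function pointers or
    # reflection machinery needed to accept it as an argument.
    return [fn(x) for x in items]

def double(x):
    return x * 2

print(apply_to_all(double, [1, 2, 3]))  # [2, 4, 6]
```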
The second point is the more interesting one, however. With Google, everything must be done BIG, so now we take this MapReduce “job” and farm it out to many, many computers. The simplest explanation for how this is done is the following picture, shamelessly taken from this interesting programming blog:
You start off with some huge amount of raw data and some task you want to perform on it, coded to adhere to MapReduce standards. As far as the paper goes, you give this data and code to a master node to run it as a MapReduce job. The master then sends out the data and your Map function to some number of worker nodes, who phone home to the master once they’re done. Once the Map workers all finish, the lists they emit are sorted and distributed to Reduce workers, who then do their work and write their output to a file. Of course there’s more going on here (what happens if workers die, how can this be optimized, etc.), but that’s all explained in the paper and not the point here.
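To make that dataflow concrete, here's a toy single-process sketch of the pipeline in Python: map each input to key/value pairs, group by key (the sort-and-distribute step), then reduce each group. It leaves out everything interesting about the real system (distribution, fault tolerance, the on-disk intermediate files), and the `run_mapreduce` name is my own:

```python
from collections import defaultdict

def run_mapreduce(mapper, reducer, inputs):
    """Toy, single-process sketch of the MapReduce dataflow."""
    # Map phase: every input record produces zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # "Shuffle" + Reduce phase: each key's values are folded to a final value.
    return {key: reducer(key, values)
            for key, values in sorted(intermediate.items())}

# Example: maximum reading per year.
records = [("2000", 10), ("2000", 25), ("2001", 17)]
result = run_mapreduce(
    mapper=lambda record: [record],           # each record is already a pair
    reducer=lambda key, values: max(values),  # fold a key's values with max
    inputs=records,
)
print(result)  # {'2000': 25, '2001': 17}
```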
The paper describes all of this in a good amount of detail and shows a number of sample applications as well. It shows how they perform and what bottlenecks can come up (as well as how they can be resolved). It’s important to note, however, two caveats:
- The Map and Reduce functions should be deterministic (and, in practice, stateless): given the same input, they always produce the same output. If they are non-deterministic, jobs become exceedingly difficult to debug or reason about, especially since the system may re-execute tasks after failures.
- The problem you want to solve must be expressible within the constraints of Map and Reduce (multiple applications of Map or Reduce are acceptable). Otherwise, MapReduce obviously can’t help you! It isn’t the be-all, end-all, but for many embarrassingly parallel applications it will work wonders. Examples given in the paper include sorting huge amounts of data, counting the number of occurrences of each word in many documents, and searching for a given pattern in a large amount of text.
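The word-count example fits these constraints perfectly, and it's easy to spell out each phase by hand. A self-contained Python sketch:

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit a (word, 1) pair for every word in every document.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by word.
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce: sum each word's counts into a final tally.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts["the"])  # 3
```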
Like last time, there’s a key statement in the paper that really sums it up:
[MapReduce] allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
That’s really what it comes down to. Concurrent and distributed systems have been treacherous ground for quite some time now, and although we do have many best practices, they’re still easy to get wrong. MapReduce greatly simplifies the programming of an entire class of applications (a welcome tool in our toolbelt). So now we ask the same question we did last time: Where can I get my hands on this? And like last time, you can’t. Not the official version, anyways. But unlike last time, there are three interesting implementations of MapReduce out there that warrant discussion: Hadoop, http-mr, and Ruby’s MapReduce (two of which are open-source, ripe for the picking).
Hadoop is an open-source implementation of MapReduce maintained by the same team behind the open-source counterpart of the Google File System, the appropriately named Hadoop Distributed File System (HDFS). Hadoop lets you easily write MapReduce code in Java, which it will then farm out to the nodes running HDFS, and it appears to work in the same fashion as Google’s MapReduce. I’ve used it a little bit and was pleasantly surprised. The documentation looks great, and with a free afternoon you could get pretty familiar with it. It’s actively developed (being part of Apache) and is the most stable of the three MR implementations we’re discussing. The only downside is that you have to put your data into HDFS before Hadoop will operate on it, and you have to pull it out when you’re done, so a few extra steps are always required. If you can get those parts scripted quickly, however, then this complaint is moot (or just use hdfs-fuse).
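One aside worth knowing: Hadoop also ships a "Streaming" interface that lets you write the Map and Reduce steps as ordinary executables that read lines on stdin and write tab-separated key/value lines on stdout, so you aren't locked into Java. Here's a hedged sketch of word count in that style (the function names are my own; Hadoop only cares about the stdin/stdout contract, and it delivers the reducer's input sorted by key):

```python
def mapper(lines):
    # Emit one "word<TAB>1" line per word; in Hadoop Streaming,
    # everything before the first tab is treated as the key.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Input arrives sorted by key, so counts for the same word
    # are consecutive and can be summed with a running total.
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# In a real streaming job, each stage would read sys.stdin and
# print its output; here we just chain them in-process.
mapped = sorted(mapper(["a b a"]))      # the framework does this sort
print(list(reducer(mapped)))            # ['a\t2', 'b\t1']
```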
http-mr works a little bit differently. It’s not open-source (UPDATE 02/09/09: Apparently it is open-source and I just missed the link to where you can fetch it from svn), so there’s not a ton of information out on it, but it is a Google App Engine app that implements MapReduce. Since communication with App Engine is restricted to ports 80 and 443 and you can’t write to the file system, http-mr does all its communication over port 80 (hence the project’s name). This one is more interesting as a proof-of-concept, but since you can’t play with it, there’s really not much to say.
Finally, while snagging the previous picture for this post, I saw that the blogger behind it, Josh Carter, had written a MapReduce implementation in Ruby to get a feel for it. It is open-source and provides a very simple way to get MapReduce in Ruby using standard libraries. The author also gives a good number of examples showing how to program the Map and Reduce functions, and the whole thing looks pretty cool.
If you’re better at Java than Ruby, go check out Hadoop, and if it’s the other way around, go check out Josh Carter’s MR implementation. They both give a great feel for what MapReduce is and how it differs from your everyday programming, and should make for an interesting afternoon project. As usual, if you do end up doing something fun with it, drop me a line with what it was!