Byzantine Reality

Searching for Byzantine failures in the world around us

History of MapReduce, Part 2

Recently for one of my classes I got the chance to research a parallel programming technique and its applications. Being a bona-fide MapReduce whore-in-training, I naturally had an easy choice. Since this is “Part 2”, we cover the era I’ll dub the “Return of MapReduce”, in which Google re-introduces and rebrands MapReduce as the technique we love to hate and love to love (choose whichever suits you best), and new advancements are made with it. This covers roughly the 2004-2008 time period, in contrast to Part 1 (the beginning of existence-2004), where map-reduce exists as part of Lisp, MPI, and other environments but is not particularly accessible to the average programmer, and Part 3 (2009-present), which I will leave as a surprise for those of you diligently following along at home.

With that said, here’s part 2, beginning with the paper that changed MapReduce and revived mainstream interest in the technique. Enjoy!

Although the map-reduce technique is well known in the worlds of distributed and parallel computing, Google’s MapReduce paper [1] has brought new passion to this programming paradigm. For those not familiar with the technique, the user provides the system with a “map” function (which takes an input key/value pair and emits a set of intermediate key/value pairs) and a “reduce” function (which takes an intermediate key along with all of the values associated with that key and merges them, typically into a single output value, although it can emit several if needed).

This simple idea has been shown to be an effective method of programming embarrassingly parallel problems, as both the map and reduce functions are stateless and thus easily parallelizable. Further excitement came when Apache’s Hadoop project was released, providing users with an open-source implementation of MapReduce. It is programmed almost entirely in Java, with only a few native C libraries (interestingly enough, these are all compression libraries).
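To make that interface concrete, here is a minimal word-count sketch in Python (the function and variable names are my own, not taken from Hadoop or any other framework): because the map function is stateless, a runtime can run many map tasks in parallel, and the shuffle step groups intermediate values by key before the reduce calls run.

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values for one key into a total.
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key; a real runtime does this
    # across machines, after running the map tasks in parallel.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in sorted(groups.items())
                for kv in reduce_fn(k, vs))

counts = run_mapreduce(enumerate(["the quick fox", "the fox"]),
                       map_fn, reduce_fn)
# counts == {"fox": 2, "quick": 1, "the": 2}
```

Everything outside `map_fn` and `reduce_fn` is the part a framework like Hadoop provides for you; the user only ever writes those two functions.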

Their implementation also proved to be highly scalable, and last year took first place in the Terabyte Sort Benchmark. To do so, they used 910 nodes, each with 2 cores (a total of 1820 cores), and were able to keep the data entirely in memory across the nodes. With this many machines and their open-source MapReduce implementation, they sorted one terabyte of data in 209 seconds, in contrast to the previous record of 297 seconds. Not one to be left behind, Google showed that they were still quite ahead, claiming they could do the same with their closed-source implementation in only 68 seconds. The only detail they give is that they used 1000 nodes, but since the result is not verifiable or mentioned on the official Sort Benchmark Home Page, it may not be worthwhile to discuss it further.

In the years since MapReduce’s “reintroduction” by Google, researchers have been actively seeking problems to apply MapReduce to and investigate. Another implementation of MapReduce, named Phoenix [2], has been created to make optimal use of multi-core and multiprocessor systems. Written in C++, Phoenix has been used to perform dense integer matrix multiplication as well as linear regression and principal component analysis on a matrix (amongst other applications). The following graph shows the speedup achieved across varying numbers of cores:

[Figure: Phoenix speedup versus number of cores]

Researchers have also taken MapReduce in other directions, exporting it for use on FPGAs and GPUs [3]. Users program their map and reduce functions in ANSI C, and the paper presents results for Monte Carlo simulations approximating the value of pi and for the n-body problem under gravitational force (again, amongst other problems). The numbers below come from a GPU with 16 multiprocessors, each with 8 cores, yielding 128 processing units. We begin by showing the numbers for the Monte Carlo approximation of pi, followed by the speedup for all benchmarks implemented on the GPU and FPGA:

[Figures: runtime of the Monte Carlo pi computation; speedup of all benchmarks on the GPU and FPGA]
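The Monte Carlo pi benchmark maps onto the paradigm very naturally, which is why it shows up in papers like this one. Here is a sketch in Python rather than ANSI C (to match the earlier example; task sizes and seeds are illustrative, not from the paper): each map task throws its own batch of random darts at the unit square and emits a hit count, and a single reduce sums the hits.

```python
import random

def map_fn(seed, n_samples):
    # Map: one independent dart-throwing task; emits its hit count.
    rng = random.Random(seed)  # per-task RNG keeps the task stateless
    hits = sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    yield "hits", hits

def reduce_fn(key, values):
    # Reduce: total hits across all map tasks.
    yield key, sum(values)

# Drive 8 independent map tasks (parallelizable) and one reduce.
tasks = [(seed, 100_000) for seed in range(8)]
groups = {}
for seed, n in tasks:
    for k, v in map_fn(seed, n):
        groups.setdefault(k, []).append(v)
(_, total), = [kv for k, vs in groups.items() for kv in reduce_fn(k, vs)]

# Area ratio of quarter-circle to unit square is pi/4.
pi_estimate = 4.0 * total / (8 * 100_000)
```

Since every map task is independent, the 128 processing units on the GPU can each run their own batch; the only communication is the final sum.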

The final evolution of MapReduce we shall examine here comes from the same people we started off looking at, the Hadoop project. They assert that the MapReduce paradigm can be too low-level, and offer a programming language (Pig Latin [4]) and a runtime (Pig) built on top of MapReduce. This dataflow language is extremely restrictive but also extremely parallelizable. It lets users write Java code that interfaces with the system as well as use Pig code to process large datasets. If allowed, a project investigating the types of applications that can be written with this system and integrating it into our open-source cloud platform would be both novel and useful for the open-source community.
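To give a feel for what “higher level than MapReduce” means, here is a sketch of how a tiny Pig-style group-and-count query could be lowered to a map/reduce pair. The query in the comment and the table and field names are hypothetical, and the Python below is my hand-written equivalent of what a compiler like Pig would generate, not Pig’s actual output:

```python
from collections import defaultdict

# A hypothetical Pig Latin query over a table of page visits:
#   grouped = GROUP visits BY url;
#   counts  = FOREACH grouped GENERATE group, COUNT(visits);
# A dataflow compiler can lower this to one map/reduce pair:

def map_fn(_, visit):
    # GROUP BY url: key each record by its url field.
    yield visit["url"], 1

def reduce_fn(url, ones):
    # COUNT: a sum over each group's records.
    yield url, sum(ones)

visits = [{"url": "/a"}, {"url": "/b"}, {"url": "/a"}]
groups = defaultdict(list)
for i, v in enumerate(visits):
    for k, val in map_fn(i, v):
        groups[k].append(val)
counts = dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))
# counts == {"/a": 2, "/b": 1}
```

The user writes the two declarative lines at the top; the restrictiveness of the language is exactly what lets the runtime pick the parallel plan.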

By this point we have hopefully conveyed the following:

  • Numerous scientific computing problems can be solved through the MapReduce programming model, although many others cannot.
  • MapReduce has been implemented in various programming languages (C, C++, Java) and for various environments (CPUs, GPUs, FPGAs).
  • MapReduce scales well on embarrassingly parallel problems, but has not yet been explored in the cloud computing community despite demand for it.

We must also make an important point: MapReduce does not work on all problems. It requires that the problem be expressible as a combination of a map function and a reduce function, which is not always possible. However, I wish to investigate whether the domain of problems that can be solved with MapReduce is larger than originally suspected (more than just embarrassingly parallel problems), and why or why not. Furthermore, I wish to see if we can integrate MapReduce’s power into our open-source cloud software so that it can be harnessed by the general community, much as Hadoop has done with their work over the last half-decade.

References:

[1] MapReduce: Simplified Data Processing on Large Clusters

[2] Evaluating MapReduce for Multi-core and Multiprocessor Systems

[3] Map-reduce as a Programming Model for Custom Computing Machines

[4] Pig Latin: A Not-So-Foreign Language for Data Processing