Byzantine Reality

Searching for Byzantine failures in the world around us

Large Scale Data Analysis Talk

As part of the seminar I’m in on Large Scale Data Analysis, I gave a talk on the continuing battle in the MapReduce world between DeWitt and Stonebraker on the side of parallel databases versus Dean and Ghemawat on the side of MapReduce. For those of you not interested in reading these long articles, it basically boils down to this: DeWitt and Stonebraker originally claimed that MapReduce allowed for fast data movement but was slow for actual computation, so you should use parallel databases instead (they suggested Vertica and “DBMS-X”, a mystery database).

They now say that for “quick and dirty” one-off jobs you should use MapReduce because of the fast data movement, but in all other cases you should use Vertica. Dean and Ghemawat responded by saying that all the faults DeWitt and Stonebraker accused MapReduce of having are really faults of Hadoop’s MapReduce implementation and not of MapReduce as an algorithm (that is, Google MapReduce doesn’t suffer from these problems). Specifically, DeWitt and Stonebraker’s MapReduce numbers turned out to be really slow because they stored the data as strings, and parsing the data back out was extremely expensive (often more expensive than the actual computation involved). To remedy this problem the data can simply be stored as Protocol Buffers, which DeWitt and Stonebraker were unable to do since Hadoop MapReduce doesn’t support them (although Google MapReduce does). There is an open ticket for this feature in Hadoop MapReduce, but it appears to have been orphaned long ago. If the feature does land, it will make the comparison extremely interesting.
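To make the parsing point a bit more concrete, here is a minimal sketch of the same record stored as delimited text and as a fixed-width binary layout. Everything in it is an assumption for illustration: the record fields (`user_id`, `page_id`, `duration_ms`) are made up, and Python's standard `struct` module stands in for a binary encoding like Protocol Buffers (it is not the Protocol Buffers API, and this is not the schema used in the benchmark).

```python
import struct

# Hypothetical record layout, for illustration only:
# (user_id, page_id, duration_ms) as two unsigned 32-bit ints and one signed 64-bit int.
RECORD = struct.Struct("<IIq")


def encode_text(record):
    """Store the record as a pipe-delimited line of decimal strings."""
    return "|".join(str(field) for field in record) + "\n"


def decode_text(line):
    """Split the line and convert every field from characters back to an integer."""
    user_id, page_id, duration_ms = line.rstrip("\n").split("|")
    return int(user_id), int(page_id), int(duration_ms)


def encode_binary(record):
    """Store the record as 16 fixed-width bytes."""
    return RECORD.pack(*record)


def decode_binary(buf):
    """Read the fields directly from their fixed byte offsets; no string parsing."""
    return RECORD.unpack(buf)


record = (42, 7, 1234567890123)
assert decode_text(encode_text(record)) == record
assert decode_binary(encode_binary(record)) == record
```

Both decoders return the same tuple, but the text path has to split the line and convert each field from characters on every read, which is the kind of per-record work the argument above is about. How large the gap is in practice depends on the data and the implementation, so treat this as a picture of the idea rather than a measurement.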

The second half of my talk covered two recent papers on virtual machine migration. Migration is really handy if you need to reboot the physical machine for upgrades and maintenance, or move virtual machines around for load balancing or power management, but as far as my research goes, none of those really help me out. Sysadmins will love these features, but the rest of us are more concerned with reacting to VM failures than with proactively migrating VMs.

Either way, I uploaded the slides as usual for your enjoyment. Hope you find them useful!