One of the most viewed posts on this blog is our look at MapReduce from a little while ago, so when picking out the next paper to cover, something related to it seemed like a natural choice. With that in mind, Yahoo! Research has a relatively new paper, published in SIGMOD ’08, titled Pig Latin: A Not-So-Foreign Language for Data Processing. Pig Latin bills itself as a natural progression of MapReduce in several ways, and indeed looks pretty interesting.
Before we can really talk about what the next step is for MapReduce, we need to see what the current step doesn’t do. The authors make a solid case while keeping it simple: MapReduce only solves a certain subset of problems, is unfamiliar to programmers, and thus results in difficult-to-maintain code. However, it’s important not to throw out the proverbial baby with the bathwater here. They recognize that MapReduce is genuinely useful; their argument is that we should not burden programmers with its low-level inner workings.
Enter Pig Latin (the language) and Pig (the runtime). Pig Latin is a high-level dataflow language that lets users write SQL-like code that translates into MapReduce jobs. Calling it SQL-like is a bit slippery: both are query languages, but SQL is declarative while Pig Latin is a step-by-step dataflow. The Pig runtime can perform optimizations and intelligent scheduling of MapReduce jobs, which gives it a feel similar to SQL’s query optimization. The authors make a specific point of noting the differences:
Automatic query optimization has its limits, especially with uncataloged data, prevalent user-defined functions, and parallel execution, which are all features of our target environment.
So if nothing else we already have an improvement over traditional MapReduce. Intelligent scheduling of parallel jobs is surely not a new problem, but this at least takes it further than the Google paper did.
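To give a flavor of what that dataflow style looks like, here’s a sketch modeled on the paper’s running example (the relation and field names are theirs; treat this as illustrative, not a tested script). Each statement names an intermediate result, and the Pig runtime stitches the steps into MapReduce jobs:

```
-- keep only pages with a reasonably high pagerank
good_urls = FILTER urls BY pagerank > 0.2;
-- group the survivors by category
groups = GROUP good_urls BY category;
-- keep only the very large groups
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
-- emit (category, average pagerank) for each big group
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
```

An SQL version would cram all of this into a single SELECT … GROUP BY … HAVING statement; the dataflow version makes each intermediate step nameable, which is part of what gives the optimizer room to work.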
Pig also goes out of its way to note that if you’re not satisfied with the functions it offers, you can provide your own User Defined Functions (UDFs). The authors call UDFs a “first-class” feature of the language, but oddly enough, these UDFs must be written in Java. It’s not clear whether Pig can still parallelize the work involved, since a UDF is no longer in the highly restricted but easily parallelizable form of Pig Latin, and the problem would presumably only get harder as they add support for more languages (they cite C/C++, Perl, and Python as future candidates). It’s hard to say what I would prefer given that I haven’t actually used the system, but I would venture that if I’m going to write my code in this language, I should be able to define my own functions in Pig Latin so the system can still automatically parallelize them.
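For what it’s worth, calling a Java UDF from Pig Latin looks just like calling a built-in function. Sketching from the paper’s query-log example (expandQuery is a user-supplied Java function that maps one query string to a bag of expanded queries):

```
-- expandQuery is a Java UDF; FLATTEN unnests the bag it returns
expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
```

So at the call site the UDF is seamless; the opacity the blog worries about is on the inside of expandQuery, which the system can’t inspect or restructure the way it can native operators.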
Let’s move on for now. Pig Latin provides the LOAD keyword to read input from a given file, and if the file is not already in a preferred format, a UDF can be supplied to deserialize the data. Similarly, the STORE keyword writes output to a given file, and a UDF can be supplied to serialize to whatever format is preferred. It’s a little easier to see why UDFs are written in Java once we see this, since we can now leverage Java’s extensive library support instead of only Pig Latin’s. Since Pig runs over Hadoop, the authors also claim they can parallelize file reads and writes over the Hadoop Distributed File System, pushing parallelism even further.
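Concretely, the paper’s LOAD/STORE usage looks roughly like this (myLoad and myStore stand in for user-supplied deserialization and serialization UDFs, as in the paper’s examples):

```
-- parse each line of the log with a custom deserializer UDF
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
-- ... processing steps go here ...
-- write results back out with a custom serializer UDF
STORE expanded_queries INTO 'myoutput' USING myStore();
```

The AS clause declares a schema for the loaded tuples, which is what lets later statements refer to fields like queryString by name.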
As we’ve hinted at earlier, the language is very amenable to parallelism because it’s quite restricted. There are relatively few operations available (although UDFs can break this), and most of them are aimed at manipulating tuples and atoms of data. It gives you just enough power to work with, although there is the mandatory learning curve involved. They do provide a JOIN operator analogous to SQL’s, but recommend a similar COGROUP operator over it. It’s pretty hard to describe the difference concretely without being exceedingly confusing, and since they use a picture to do so, I suspect they feel the same way. So here’s that picture, from page six of their paper:
In the end, a JOIN is just a COGROUP plus a FLATTEN operation, so for performance reasons COGROUP is recommended. They do allow certain commands to be nested as well, but at this time it’s only a subset of all the commands, presumably also for performance reasons.
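In Pig Latin terms, that equivalence looks roughly like this (sketched from the paper’s query-log example, where results and revenue are two previously loaded relations keyed by queryString):

```
-- COGROUP keeps the two input bags separate, one pair of bags per key...
grouped_data = COGROUP results BY queryString, revenue BY queryString;
-- ...and FLATTENing both bags takes their cross-product, i.e. the join
join_result = FOREACH grouped_data GENERATE FLATTEN(results), FLATTEN(revenue);
-- which should be equivalent to the one-liner:
join_result = JOIN results BY queryString, revenue BY queryString;
```

The point of preferring COGROUP is that you only pay for the cross-product when you actually ask for it with FLATTEN; if a UDF can work on the grouped bags directly, that materialization is skipped entirely.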
Finally, the Pig team also offers an interesting debugging tool named Pig Pen for use with Pig. It’s an Eclipse plugin that takes the relatively large datasets you may be working with, generates smaller example datasets from them, and lets you run your code against those for debugging purposes. It has a very REPL-like feel, and lets you step through each query to make sure you get back what you think you should. Again, without actually using it, it’s hard to evaluate, but it certainly looks like a tool that would make the system much easier to use.
Since we always try to contrast closed-source with open-source technologies when we look at these papers, let’s continue the trend and wrap things up. Pig itself sits on the open-source side, and its closest closed-source competitor would appear to be Google’s Sawzall. Sawzall is also a level of abstraction above MapReduce, but still appears to be a bit more restrictive than Pig Latin. Again, this is probably for performance reasons, but since we haven’t looked at the Sawzall paper yet, let’s not speculate further.
So if you have a spare Hadoop cluster lying around and could use some data processing on it, check out Pig! It certainly looks interesting and could very well be a great layer of abstraction over MapReduce, and if you’ve actually used it, drop us a line with your experience!