My First Run with the X10 Programming Language
Not so long ago I discussed my experiment to learn the X10 programming language as part of my “learn one programming language per year” project. Having constructed my first non-trivial application in X10 recently, I am now ready to make a preliminary opinion about its usability and the pros/cons. With that said, let’s take a look into one possible future for concurrent programming.
X10 follows the Partitioned Global Address Space (PGAS) model that languages such as Unified Parallel C(UPC) before it have taken. Thus like UPC, it has global arrays distributed over processors, but unlike UPC, the semantics of these arrays are quite a bit different. This is because computation is the first-class citizen of X10, while data is a second-class citizen. Don’t get me wrong, data operations are done very nicely in X10 (with great support for multi-dimensional arrays), but fundamentally this appears to be in order to make computation that uses this look nicer. With that, computation is really nice – if you want to run some code at a particular thread/process (known in X10 speak as a ‘Place’), just say
at (place) { do something }and you’re good to go! But since data is a second-class citizen, making sure that the data you want is available at the right place or how to move it there is the big paradigm shift that X10 requires you to learn.
In my particular case, I was converting some MPI code that I had which was doing a distributed power method computation to X10. Like when I converted this code to UPC in order to learn it, a similar (but certainly not the same) inversion of thought is necessary in order for me to do this. And inevitably, another inversion of thought is needed to change that code into something that runs quickly. But that’s not a bad thing – it’s just the learning process associated with a new programming language.
A while ago I was stuck in a vi/emacs or GTFO mindset, and being in this mindset made X10 development very difficult. I had initially run into X10 at a developer day IBM put together back in March and thought that it was different from what I had seen before but not too different, such that it might actually be a language I could get real work done in. However, X10 falls into a similar level of verbosity to C#, which is just enough where vi is a bit counter-productive. So having got over that mentality since my initial encounter with X10, I embraced Eclipse (but just for X10 development so far) and was able to really get into it. Thus, throughout this project, I was inherently at a disadvantage of having to learn both Eclipse and X10 – the interaction of which is a big part of this review.
With that said, I was able to complete my Power Method code and run it over a cluster of machines (as one of the backends compiles to MPI).
What I liked:
- In many cases, X10 is more succinct than Java, and the type inference for immutable variables (vals) does a good job. If you’re using mutable variables (vars), however, you still have to specify the type by hand, which is about the same amount of code as Java.
- Back in X10 2.0.6, the tutorials were the best I have every seen – in Eclipse there was about two hours worth of tutorials explaining all the new concepts and what differs from Java. They made the language easy to pick up since there were many things that could be molded into the code I needed and oftentimes showed me what I was doing wrong.
- The concurrency abstractions are pretty cool and nice and high-level – need to fire off a lot of threads and block until they’re done? Just use ‘async’, give it a block of code, and slap that in a ‘finish’ block! Most programming languages offer some version of this – Ruby’s is pretty good for example but in contrast is not thread-safe by default – but this is certainly the cleanest I’ve seen thus far.
- X10 easily supports mixing its code with Java or C++ libraries / code – haven’t needed to use this myself, but a cool selling point. The syntax doesn’t look too bad for this, but since the point of this experiment was to get away from Java and C++, I don’t see anything immediately compelling me to use this. Similarly, they have alpha support for compiling to CUDA code for running on GPUs – again, I don’t have a GPU sitting around to try this on, but looks very interesting.
- The Eclipse support (called X10DT) is done very nicely – there’s just a plugin that natively interfaces with Eclipse that has installed for me without any problems over the last two releases I’ve played around with.
What snagged me or could have been improved:
- BadPlaceException – get used to seeing this fellow all the time once you start doing non-trivial programs. As the name suggests, a thread is trying to access data it doesn’t own, which causes it to violently explode. Debugging this is a pain since the line number in your X10 code doesn’t show up in the stack trace (the line number in whatever code it generates does show up, but that’s not helpful to me). Eclipse may have some debugging support integrated for this, but as I’m new to Eclipse, I have no idea what it is and the X10 documentation doesn’t appear to mention it. This was a double pain for me since once I was done getting rid of these on my local machine, I tried running it over a cluster of machines only to run into the same problem again – which without a line number for or debugging caused me to revert to the well documented “just put print line statements everywhere” strategy.
- Remember how I said the tutorials were really good in X10 2.0.6? Well I upgraded to the current version (X10 2.1) and all that cool documentation vanished, leaving me with the language reference. Now I have to specifically know the classes I want info on and use this and the X10 Standard Library documentation – essentially JavaDocs. It’s not that bad for me since I already got my first grasps on X10 with the cool tutorials, but I think it hurts new users not to have it. Interestingly, a new updated tutorial on X10 2.1 was just put out today that has a lot of great info – check that outhere.
- Syntax highlighting takes forever, and oftentimes error messages don’t go away until I save my code and let it do a pass over my code. Presumably future releases will fix this, but for now I just don’t trust the messages unless I save first and wait a few seconds for it to tell me what the real errors are.
- Lack of auto-completion support – I figured that since I was using Eclipse I could do the usual hot-key and do method completion or it would show up whenever I hit the “.” on my keyboard, but alas, the current version doesn’t have it. Again, this is likely due to the newness of the project or relative unimportance compared to more pressing issues, but since the rest of the “real languages with IDE support” have this, I’d expect X10 to have this in the next release or two.
- Putting my computer into sleep mode occasionally causes code to become not editable, which as you would suspect is a big pain in the ass. Interestingly, the code is always saved, so I just close Eclipse and re-launch it, but then I have to re-open the Eclipse X10 Help Menu to get back to the language reference (which takes 10-20 seconds to load), and get back to where I was before, and so on. Thankfully it doesn’t happen too often, and hopefully a future release will address this.
- The Java backend sets the number of Places to 4 by default, and when I ran it over MPI with the C++ backend, it ran with a single Place. Since my code wasn’t tested on that, it naturally didn’t work (BadPlaceException) and I couldn’t figure out how to run it over the Java backend (which runs really fast) with a single Place. I struggled for quite a while looking at environment variables to set and after trying what I thought were all of them when compiling or running my code, I found my solution: just run it with MPI on the C++ backend with
-n 1 -np <number of places>
And that will do it! Debugging this still tends to be a pain since I’m now back to vi and 10-20 second compile times, but now I have some great code that looks a bit nicer than my MPI code. Since setting the number of Places is quite an important setting, I would also expect the documentation on how to set this to be improved by the next release or two.
Let’s wrap things up with an analogy: X10′s concurrency support is to MPI as Java’s garbage collection is to C++. MPI can give you super-fast code, but you’ll need to be a master coder to do so, and then it will likely only be optimized on whatever cluster your code was constructed on, just like how your C++ code will be memory-leak free only if you’re a pro at it – and in both cases, even the coding masters have had horribly slow code in MPI or leaky programs in C++. In the same way, Java and X10 offer much better programmer productivity since most of us aren’t memory management experts or gurus of concurrent programming (fun times with race conditions, shared program state, and so on).
X10 mixes things up just enough to be interesting, so it’s a project I will certainly be keeping an eye on and using in projects that permit it. If you have code you need to run concurrently or haven’t learned a new language recently, I definitely recommend checking it out!
Thoughts on Google (meta)
Now that our coverage on the Big 5 Google papers has concluded, I thought it may be handy to aggregate the links to them all in one handy spot. That being said, here’s the links to our series of reviews (sorted by post date / chronological publishing of paper):
- Thoughts on the Google File System: How one constructs a fault-tolerant file system with which to build everything else on that performs optimally on huge data sets.
- Thoughts on MapReduce: How to write programs to process huge amounts of data as quickly as possible given the huge amount of resources available.
- Thoughts on BigTable: How to store data in a more structured format yet keep replication and high speed processing.
- Thoughts on Chubby: How to locate specific nodes serving a specific purpose though the use of a highly available naming service.
- Thoughts on Paxos: How to implement a consensus algorithm to ensure agreement amongst nodes in the presence of failures and keep high performance in the meanwhile.
So check them out if you haven’t, and be sure to come back next time for our review of Amazon’s Dynamo!
Thoughts on Paxos
After a month of Google papers, we finally come to the last paper in our Google reading list: Paxos Made Live. Like the Chubby paper from last week, this paper doesn’t present any new research, but instead tells us how to build a system from the current research out there such that it can withstand incredible load. It also aims to show how we could possibly be confident in a non-trivial amount of code as well as optimizations that can be made to improve the performance and reliability of the system. So how did they do and how did it turn out? Read on!
A good starting point in the paper is its concise definition of the Paxos algorithm. As it is about a page long, I’ll omit it here but leave you with a more general description of Paxos, simply that it is an algorithm that assures agreement on a value between a majority of nodes in a distributed system. Of course if you’re interested in the algorithm this paper is an excellent starting point, and the writeup on Wikipedia is a good re-write of it (Basic Paxos).
An inherent downside of this is that since it is a consensus algorithm, the FLP Impossibility Result tells us that liveness can be impossible with a faulty processor (although not guaranteed and the probability of such an event is unknown). Now we have a problem. Liveness is the ability of the distributed system to make “progress”, that is, for new values to be agreed on over time, and now there’s an important theoretical result telling us that we can’t guarantee it if a single node fails. Interestingly, this paper does not state how they resolve the problem, but the Chubby paper does:
Paxos maintains safety without timing assumptions, but clocks must be introduced to ensure liveness; this overcomes the impossibility result of Fischer et al.
So how do clocks solve the problem? The original paper by Fischer et al. tells us the answer: just have the proposed value time-out and fail on all nodes. Since the nodes all have synchronized clocks, this is entirely possible and solves the problem altogether.
Now comes the obvious question: why use Paxos in the first place? Why do they even need it? This comes about because Chubby was using a commercial three-times replicated database but found it had bugs in its replication code, and since Google needs Chubby for extremely high availability, this warranted a rewrite of all the involved code. With Paxos, we can ensure that a distributed system can reach agreement on a value. In this case, the value is a change to the database, and once all the nodes agree on a change (technically, a majority, but other nodes can catch up on it later if they miss it), they can make the change to the database and log it so that snapshots of the database can be taken (for backup and recovery purposes). Now we’ve got a distributed database whose code we can debug (of course, “we” refers to Google and not us) should worse come to worse.
If you did happen to read the Paxos algorithm or the Wikipedia article on Paxos (or even just look at the diagram), you can see that this algorithm runs in a number of phases, and that there’s lots of potential for optimization. A common optimization is that the first step of the algorithm, electing a leader (master) for this run of Paxos, can be omitted if the node proposing the change is already the leader. And just like that we’ve knocked off O(2n) messages (n messages sent out requesting to be the master and potentially n messages received with the response). In fact, citing the paper (footnote, Page 7):
In a loaded system, under one percent of the Paxos instances run the full Paxos algorithm.
This ends up being pretty good for the final performance of the system, but before we can get that far, we have to at least mention the various problems that arise. With Chubby, masters can fail and new masters need to take their place. Paxos allows for this, but a problem encountered was that the intended behavior was to have events issued to old masters fail, while in practice nothing was preventing these events from succeeding (since nothing identified a master’s lifetime). So to enforce the requested behavior, exactly that had to be added in: epoch numbers, which represented a master’s lifetime (analogous to noting that our current president is the 44th, to distinguish him from all others).
Since Chubby is designed to be highly available, proper coding measures were taken to ensure that rolling it out would be as successful as possible. The paper details the numerous problems they ran into (problems upgrading, problems rolling back to an old version, and so on). But more interesting is how they were confident of its functionality. Extensive test suites were constructed to verify that even if faults were injected into the system, that it would remain safe (that is, one node couldn’t get an update unless a majority of them did) and live (over time, nodes will get updates).
The more interesting development note is how they coded Paxos in the first place. They note that they coded core parts of the algorithm as state machines in a specification language and then used a compiler to translate it to C++. This makes it much easier to verify its correctness, as it’s much easier to verify a page of pseudocode-like specifications than many thousands of lines of C++ code. Upon reading this, I thought of one thing: a controversial blog post by Joel Spolsky about Wasabi. Essentially it’s a similar idea: they needed certain features from their language but it just ended up being so mundane to do so that it was unproductive. So they wrote a compiler named Wasabi to give them all the language features they needed and compiled that to the language they needed. Since it was a lot of addons to existing languages, compiling it to the existing language ends up being much easier than writing a whole new language. Just like the Google guys, they’ve both used the right tool for the right job, and got a better product that was easier to program (given some initial development time on the compiler itself).
They follow up that interesting idea with another one: they had a bit of trouble finding where problems were in the system due to the fault-tolerant nature of the system. When a node would die, the fact of it would be hidden from the developers and it became harder to notice without extra logging procedures. Of course, the problem was resolved, but it makes it inherently difficult to address different types of problems as they arise.
That being said, let’s look at how it performs. From page 14 of the paper:

Like last time, these numbers really speak for themselves. It outperforms their original database in every metric they measure it against and gives them the peace of mind they need should something come up down the line (since they also own the code base and are familiar with it).
The paper ends on a bit of an ominous note: they note that the current state of fault-tolerant systems is lacking, to say the least. They make it clear that the community has not thought enough about how testing fault-tolerant applications is done, and that taking these algorithms and modifying them to make them work in practice destroys any hope of correctness and proving useful properties about them in a reasonable amount of time.
This rounds up our tour of the “Big 5″ Google papers. As we’ve seen before, these applications were all designed for a very specialized environment (Google) with very specific needs. Processing huge amounts of data quickly requires a fault-tolerant filesystem (GFS), a database that can be run on it (BigTable), a programming model to use it with (MapReduce), a naming service to tie everything together with (Chubby), and an algorithm to ensure agreement between a set of nodes (Paxos). They’re all at varying degrees of abstraction and made for a fun ride. Of course, it doesn’t work for every situation, and most likely won’t work for yours, but it’s always interesting to see what everyone else is up to and how they pull it all off. We’ll continue next time with another paper from the reading list we looked at a while ago, so stay tuned!
Thoughts on Chubby
The last few times around we’ve been talking about performance. Extreme performance. We’ve talked about how fast the Google File System is, how fast MapReduce is, and how fast BigTable is. Yet all three rely on a common service we hinted upon last time, a lock service (amongst other things) named Chubby, the topic of today’s discussion. Chubby doesn’t bill itself as high performance like the other services we’ve mentioned, but is designed with high availability in mind. Let’s dive in further and see what exactly makes up Chubby and why it’s so special in the Google world.
Chubby provides a number of services, but at its heart, it is a lock service. It normally consists of five replicas (as opposed to the usual three from the Google File System) and is considered to be up if at least three are functional (a majority). This is essential because leader election is done via the Paxos algorithm, which requires a majority of the nodes to be up for progress to be made, but as it is the topic of our next discussion, we can brush it under the rug for now.
A main point in the paper is that although Chubby is a lock service, it is also used heavily as a naming service. It names locks in a file-system-like manner, and thus a server can place it’s location in a file named “/ls/myserver” (ls for lock-server). Since the server holds the lock on that file and renews the lock as long as it is up and providing a service, it is assured that the data is never stale. Clients can then find this server by querying Chubby for everything in the “/ls/” directory and voila! The naming problem is pretty much solved. The paper refers to these files as being “ephemeral files”, namely temporary files used to indicate that clients/servers are alive and which locks are valid. They also hint that it can be used to indicate whether other resources are valid asides from locks, such as caches, but that it is currently an unexplored area.
What is explored thoroughly is what Chubby does for other Google services. It is only stated that it is a “rendezvous mechanism” for MapReduce, which I would expand upon and speculate refers to naming which machines are running Map jobs and Reduce jobs so that the scheduler can map the output from one to the input to the other in a coherent manner. Both the Google File System and BigTable need to elect master servers (primaries), and Chubby is used in what appears to be the exact fashion we described in the last paragraph (note this also provides naming for where the primaries are). Finally, they also use it as a repository for files requiring high availability. They cite is as being used for access control lists, and it is a fair presumption to say that it is likely used on ACLs for metadata on files of particular importance (e.g., are about the filesystem itself or about particularly important data).
An interesting design choice that was made early on was to provide Chubby as a lock service over providing it as a library that developers could use in their programs. Both ways have their pros and cons, but the big downside of using it as a library is that the programmer would have to form their program into a state machine (required for Paxos), which requires a non-trivial amont of forethought and much greater re-designwork after the fact. They go on to state that they preferred the lock service idea since it’s relatively familiar to programmers, despite their knowledge that programmers don’t know how to use locks correctly, and that it’s even more difficult a problem in the case of distributed systems.
Like BigTable, metadata stored in Chubby contains a version number per file, but unlike BigTable, old versions cannot be recovered. It appears this version number is primarily used to determine whether or not a user has the most recent version of a file before they update it (e.g., they grab version 4 and want to write changes as version 5 but in the meanwhile someone else comes in and writes two sets of changes, pushing the current version number to 6).
Furthermore, while Chubby is a lock service, it only provides advisory locks (not mandatory). Nothing prevents a client from writing or reading a particular file they do not have the lock for, provided they have the correct ACLs to do so. The motivation behind this is more of convenience: it is perceived as a failure to force a user to reboot their machine to force it to release a lock and that having them release their lock is out of date is far preferable. Which brings us to a not-so-minor point we lightly touched on earlier: locks must be renewed constantly, as the leases on them tend to be small (although they can be variable length as set by Chubby). We shall later see that the traffic incurred by these renewals is the greatest source of traffic in the system by an order of magnitude.
In an attempt to make using Chubby more user-friendly, calls to it can be either synchronous or asynchronous. The programmer can specify a callback function as well to be called upon certain events happening. The primary events cited in the paper that callbacks are used is when a master fails over, when locks become invalid, or file contents are modified by a different user (all cases we’ve talked about earlier). Two methods are also given to close a user’s session: a Close method, which destroys the object the user is manipulating (this is all C++ so we have to do garbage collection ourselves), and a Poison method, which does not free the object but renders it unusable (all method calls on it fail) so that other threads can free it.
That being said, let’s look at exactly how Chubby performs. These numbers are lifted straight from the paper and claim to be typical of Chubby instances at Google (note that RPC rates are over a period of ten minutes):

This is a good summary of everything we’ve talked about: extreme availability (one 14 second downtime every 18 days) in the face of a huge number of clients (54000). And like we mentioned earlier, the lock renewal messages (KeepAlive RPCs) make up the vast majority of RPCs (93% versus 2% for getting a file’s metadata, the next closest contendor). Since there are so many messages going to the server, a proxy can be used (see the proxied clients row) to reduce the load going to the Chubby server, and its importance is not understated in the paper:
We have found that the key to scaling Chubby is not server performance; reducing communication to the server can have far greater impact.
So now you know a decent amount about Chubby and how it interoperates with the other Google services. If I’ve piqued your interest at all you should certainly read the paper itself, as it goes more into detail about everything we’ve talked about and more (design decisions, human factors, and so on). It also talks a little bit more about the various Chubby instances, such as how there typically is one five-replica instance for each data center and one five-replica instance spread out across the world (the “global cell”), ensuring high availability and relatively quick access speeds no matter where you are on the planet.
Something to end on today is to note the recurring theme of leader election in distributed systems. It came up originally in the GFS paper, showed up again in BigTable, and once more here in Chubby. If you aren’t familiar with how this is done, then stay tuned for next time, when we shall look at how Google solved this problem using a well known solution (the Paxos algorithm) and the problems they ran into.
Thoughts on BigTable
Now that we’ve looked at the Google File System and MapReduce, we come to the next paper on our reading list: BigTable. It comes to us two years after the MapReduce paper and presents an interesting new step in databases. Like MapReduce, it is another notable paper and takes a very familiar programming construct and revamps it to new heights. Let’s see if I can whet your appetite with a spoiler:
Maybe a better name for it is Big Hash Table.
Now that I have your interest, let’s take a step back and have a quick history lesson. To this point, pretty much all databases I’ve run into have fallen into one of three categories:
- Relational: If you’ve ever used a database, you likely have used a relational database (MySQL and friends). No need to expand much here.
- Hierarchical: Certainly not as popular or well-known as relational databases, but hierarchical databases like LDAP provide a great way to fetch and store your data if you can organize them in a tree-like structure.
- Object-oriented: Can’t say I know a whole lot about these. I hear they were quite the rage back in the ’90s, since an object-oriented database was supposed to be a perfect fit for an object-oriented language. But for some reason or another it just didn’t work out, and two new technologies supplanted it that are the new hot thing. Rails has made Object-Relational Mapping (ORM) popular for their users, while C# has Language Integrated Query (LINQ).
So from above we can begin to see a trend: each database type seems to be modeled after some type of data structure. In order, we look like we’ve got a two-dimensional array (more or less), a tree, and an object. The notable exception from the data structures you would learn about in a basic data structure class is thus the hashtable, so BigTable’s entry in the field is not as surprising as people seem to give it credit for. Now, let’s see what makes BigTable stand out.
As we’ve hinted before, BigTable is a relatively large hash table. It’s simple to read from and write to, as you are restricted to standard hashtable operations (get, put, delete). You still have rows and columns, and data items also have a timestamp associated with it, so that versioning is automatic. BigTable then garbage collects older versions as specified by the user (e.g., garbage collect everything but the most recent three versions of all items).
One big difference between BigTable and relational databases is how transactions are handled. Row-level transactions are possible in BigTable, but not any other type of transactions. The paper states that this feature may make it confusing for programmers new to BigTable to use, but that apparently this is a minor obstacle to writing much more scalable code.
The unsung hero of the day that really makes BigTable shine, however, is Chubby. Chubby is a lock-providing service that is highly available and allows many difficult design decisions to be abstracted away from BigTable. When a BigTable server (master or slave) comes up, it reserves a lock through Chubby with its name and path, and renews the lock periodically. This makes it trivial for the master to determine which nodes are up and down, as it can just query Chubby to see who has locks (thus these nodes are up). Since the BigTable master also knows the locations of the slaves, it can trivially determine if Chubby is down (it would be unable to see who has locks in Chubby but still be able to contact them). But since Chubby is the next paper on our reading list, let’s not say too much more about it right now.
Another reason BigTable shines so brightly for Google is the infrastructure it’s been built on, namely the last two papers we’ve looked at. By placing BigTable on the Google File System, the database automagically gains three-copy replication and fault tolerance. Furthermore, since the file system is already geared towards processing huge amounts of data, the database now doesn’t suffer from doing so (in fact it performs even better on it). BigTable adds onto this by allowing certain columns in a table to be split up from other columns, speeding up access time on queries for only those columns. The example given in the paper is that if the table contains web pages, maybe we can put meta-data in one column and the pages contents in another column. Then, clients that only need the page’s meta-data are able to save a lot of time versus querying all the columns.
BigTable also gains substantially from being able to use MapReduce. Now complex operations can be performed against this huge repository of data with many many machines and can be done so reliably (that is, it recovers from failures). This seems to be the primary reason why BigTable doesn’t mention any ability to perform joins or any other standard relational database features (as MapReduce makes it simple enough to do so on their own).
So as always, we end up with a common question: how can I get my hands on this to mess around with? And as usual, you can’t get it, but there are some open-source versions you can fiddle around with. Just asHadoop puts out the Hadoop Distributed File System (open source Google File System) and Hadoop Map Reduce (open source Google Map Reduce), they also put out HBase, an open source implementation of BigTable. It also looks to be the most stable and easy to get going of all the open source BigTable implementations. There’s also HyperTable, but we’ve had orders of magnitude more difficulty getting it up and running than HBase, so I’d recommend HBase over it at this point in their development. And as usual, if you’re up to any kind of work in this area, drop a line and let all of us know!
Thoughts on MapReduce
Last time we looked at the Google File System paper. Next on the list chronologically brings us to the famous MapReduce paper. This is the paper that drew my attention more than any other, as I heard about it years ago but never really got a chance to thoroughly review it. This work in a lot of ways is in fact the polar opposite of GFS. Whereas GFS is very low-level, being a file system and all, MapReduce is the exact opposite. It provides excellent results on an extremely constrained set of problems, much like GFS, but instead of being aimed at files, is intended for programmers. And like GFS, the MapReduce paper has been out for a few years, so let’s see what’s been done with it and where it’s future could be going.
So let’s break down MapReduce as literally as possible. Map and reduce are two common functions found in functional programming languages (Lisp, ML, etc.). Map takes a list (or tuple), applies a given function to it, and spits out a new list. For example, a common map function would be to double each item in a list, in which case you would see this happen:Map x2 to (1 2 3 4 5) --> (2 4 6 8 10)
So in this case your mapping function would be “double”. Reduce similarly takes a list (or tuple), and applies a given function to it to produce a single value. Common reduce functions would thus be “average”, “min”, or “max”. Using “min” as an example, you would see this happen:Reduce "min" to (2 4 6 8 10) --> (2)
From the way we’ve structured the last two examples you may be able to see where we’re going. The clever guys and gals at Google caught on to the notion of doing a Map and feeding its results into a Reduce. This of course was nothing new, but where they took it surely was. They have two notable contributions here:
- They brought Map and Reduce out of functional-programming-land into the rest of the programming world by providing MapReduce libraries in C++, Java, and Python.
- They brought MapReduce into the world of distributed computing and out of being ran solely on a single computer.
Let’s look at each of these points a little closer. Map and reduce had been around for fifty years already with Lisp, but there’s a key reason it never really showed up in other languages: it just couldn’t. Map and reduce both require a function as input to apply to the given data, and most non-functional languages really make it a huge pain in the ass to pass around functions to each other. With C, you have to use function pointers, and with Java, you have to use Reflection (a bit expensive) or hide the functions inside of classes and pass those around. With functional languages, functions are first-class and you can pass them around and receive them extremely easily. With Google’s MapReduce library, you conceptually don’t need to worry about this kind of stuff, the library and runtime handles it for you.
The second point is the more interesting one, however. With Google, everything must be done BIG, so now we take this MapReduce “job” and farm it out to many, many computers. The simplest explanation for how this is done is the following picture, shamelessly taken from this interesting programming blog:
You start off with some huge amount of raw data and some task you want to perform on it, coded to adhere to MapReduce standards. As far as the paper goes, you give this data and code to a master node to run it as a MapReduce job. The master then sends out the data and your Map function to some number of worker nodes, who phone home to the master once they’re done. Once the Map workers all finish, the lists they emit are sorted and distributed to Reduce workers, who then do their work and write their output to a file. Of course there’s more going on here (what happens if workers die, how can this be optimized, etc.), but that’s all explained in the paper and not the point here.
The paper describes all of this in a good amount of detail and shows a number of sample applications as well. It shows how they perform and what bottlenecks can come up (as well as how they can be resolved). It’s important to note, however, two caveats:
- The Map and Reduce functions should be deterministic (stateless). If they are non-deterministic they can become exceedingly difficult to debug or reason about.
- The problem you want to solve must be expressible within the constraints of Map and Reduce (multiple applications of Map or Reduce are acceptable). Otherwise, MapReduce obviously can’t help you! MapReduce isn’t the be-all, end-all, but for many embarrassingly parallel applications, MapReduce will work wonders for you. Examples given in the paper include sorting huge amounts of data, counting the number of occurrences of each word in many documents, and searching for a given pattern in a large amount of text.
Like last time, there’s a key statement in the paper that really sums it up:
[MapReduce] allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
That’s really what it comes down to. Concurrent and distributed systems have been treacherous ground for quite some time now, and although we do have many best practices, it’s still easily problematic. This greatly simplifies the programming of an entire class of applications (a welcome tool in our toolbelt). So now we ask the same question we did last time: Where can I get my hands on this? And like last time, you can’t. Not the official version, anyways. But unlike last time, there are three interesting implementations of MapReduce out there that warrant discussion: Hadoop, http-mr, and Ruby’s MapReduce (two of which are open-source, ripe for the pickings).
Hadoop is an open-source implementation of MapReduce put out by the same team that put out the open-source version of the Google File System, the appropriately named Hadoop Distributed File System (HDFS). Hadoop lets you easily write MapReduce code in Java which it will then farm out to the nodes running HDFS, and appears to work in the same fashion as Google’s MapReduce. I’ve used it a little bit and was pleasantly surprised. The documentation looks great and if you had a free afternoon, you could get pretty familiar with it. It’s actively developed (being part of Apache) and is the most stable of the three MR implementations we’re discussing. The only downside of this is that you have to put your data in HDFS before Hadoop will operate on it, and you have to pull it out when you’re done, so a few extra steps are always required. If you can get those parts scripted quickly, however, then this complaint is moot (or just use hdfs-fuse).
http-mr works a little bit differently. It’s not open-source (UPDATE 02/09/09: Apparently it is open-source and I just missed the link to where you can fetch it from svn), so there’s not a ton of data out on it, but it is aGoogle App Engine app that implements MapReduce. Since communication with App Engine is restricted to ports 80 and 443 and you can’t write to the file system, http-mr does all its communication over port 80 (hence the project’s name). This one is more interesting as a proof-of-concept, but since you can’t play with it, there’s really not much to say.
Finally, while snagging the previous picture for this post, I saw that the same blogger behind that picture had written MapReduce in Ruby to get a feel for it (Josh Carter). It is open-source and provides a very simple way to get MapReduce in Ruby using standard libraries. The author also gives a good amount of examples showing how to program the Map and Reduce functions and looks pretty cool too.
If you’re better at Java than Ruby, go check out Hadoop, and if it’s the other way around, go check out Josh Carter’s MR implementation. They both give a great feel for what MapReduce is and how it differs from your everyday programming, and should make for an interesting afternoon project. As usual, if you do end up doing something fun with it, drop me a line with what it was!
Thoughts on the Google File System
The first paper on the extremely interesting list of distributed systems readings is the Google File System. It is not a particularly new paper, published way back in 2003 at the Symposium on Operating Systems Principles (SOSP), but is still a great read. I’ve chosen to start with this paper over the other, more recent papers, since this paper is both the earliest chronologically and the lowest layer of abstraction. If you have never heard of it before this point and have the vaguest interest in file systems, then read on!
A caveat: if you’re really into file systems, then you should read the paper yourself instead of taking my word on it. My review of the paper isn’t as concerned with the specifics of it as you may like.
The basic idea is like this: Google has way different needs than you and I, and they are different enough where they need to squeeze out as much performance as possible. Furthermore, failure is a common scenario and fault tolerance is a must. Byzantine (or malicious) failures don’t need to be worried about; it’s safe to assume that no one inside of Google would maliciously attack their own file system and their networking controls prevent the outside world from seeing any direct interface to the file system itself. They specifically say that they have made several modifications to the Linux kernel, but after a while it just became a better idea to make this file system and architect it to their specs.
So now that it’s decided that a new file system is needed, what makes it different from your run-of-the-mill file system? With Google, extremely large data sets are commonplace, so their chunk size (the smallest block of storage) is 64MB. Contrast this with ext3, which allows for block sizes between 1KB and 8KB, a factor of 64000 smaller! Of course, we’ve already talked about the ‘why’. We work on much smaller datasets! 64MB is completely ridiculous for you and I but completely necessary for Google.
That being said, that’s a common theme of the paper. You see a lot of “Google needs this and you won’t, so that’s how we got away with these decisions.” The paper is also surprisingly honest: you get a good feel for what the file system is good at and what it’s not. It’s not great if a ton of people are running the same executable file and it’s not replicated in enough places. By default, everything is replicated three times, but if you have a lot of people requesting the exact same thing, you quickly run into issues with the Google File System. This is expected to be an atypical case, however, and if you were to have such a file, you could tag it as such and have many extra replications out there to avoid his problem.
The best sentiment the paper expresses, I would say, is this:
One should be careful not to overly generalize from our workload. Since Google completely controls both GFS and its applications, the applications tend to be tuned for GFS, and conversely GFS is designed for these applications. Such mutual influence may also exist between general applications and file systems, but the effect is likely more pronounced in our case.
This pretty much summarizes the cool things you can do when you have applications that exhibit very specific behaviors and enough talent behind it. You can make a totally new file system just for it and run a chunk of a huge company on it. A downside though, is that we will really never get to play with it on our own.
Enter Hadoop. It provides both a MapReduce implementation (to be discussed at a future time) that runs on their open-source implementation of the Google File System, appropriately named the Hadoop Distributed File System (HDFS). Its specifications show it to be done very similarly to GFS, so for a while, this is the closest we’re going to get. One downside of HDFS versus GFS is that from what I’ve seen, you can’t have computers dynamically joining or leaving the filesystem. You have to set it all up ahead of time, which always seems to end up being imperfect. On the plus side, like GFS, HDFS automatically replicates your data for you, handles faults, and it’s all abstracted away from you, the user.
The other downside is that in order to run standard Unix-like commands, you have to go through the hadoop executable file. So it breaks the immersive flow a bit to have to always do “hadoop ls”, “hadoop cd /foo”, and so forth. A natural solution to this problem is to use FUSE to give you something that resembles a regular file system, thus giving us hdfs-fuse. It’s not clear how well it works, but it’ll be a fun side-project to investigate one of these days. If you’ve got any experience with it, drop me a line!
Exploiting Software
Now that finals are over I can FINALLY return to writing like I promised you all a billion times. With that in mind, let’s look at a book I picked up for my security class, Exploiting Software: How to Break Code. But does the book hold up to the badass name behind it? Yes. Yes it does. To sum it up, this book is a very broad overview of a number of common security vulnerabilities and how to exploit them. They summarize each technique they use into an “Attack Pattern”, named after the classic Design Patterns book. Since they’re a lot shorter than Design Patterns, they really end up being more like the examples in the also-classic Refactoring book: if you see this code, exploit it like this. It’s short and sweet and gets straight to the point. It’s done really well and consistently throughout the entire book so if you get bored for a section then just read the patterns until you get drawn back in again. With a high level technical book there’s another important question we should ask: is it better thanWikipedia? Definitely! I’ve been interested in knowing how format string attacks work but found Wikipedia strangely vague on the topic. In contrast, this book pulls through and gives a great explanation on it and how it’s pulled off. This is just one example of many where the book shines where Wikipedia has yet to (although in time I’m sure it will as well). The first half of the book is pretty introductory; it covers the motivation for the book as well as the main tools used. A lot of the book’s examples use the Interactive Disassembler (IDA-Pro), which is certainly the best tool on the market today at what it does (although it’s sadly pretty expensive). There’s tons of attack patterns and many real-world examples of why they’re relevant. The authors are pretty OS-agnostic, and everybody gets their fair share of security flaws. The second half of the book is where it starts to get interesting. Now that you’ve got the basics underway, they talk about local exploits and work they’ve done in the past. The local exploits are well defined, but I think that The Shellcoder’s Handbook does a far superior job of describing them since they get to devote a chapter for each type of exploit (e.g., one chapter for stack overflows, one for heap overflows, one for shellcode, etc.). Whereas the Shellcoding book gives you examples to do at home, this book shows you how they’re done and some examples and moves on, which is fine, but doesn’t give you the same end result. I’d say the end of the book is the best though. There’s an extensive discussion about how one of the authors engineered an exploit to overwrite two bytes of memory in a program and how it resulted in disabling all of Windows NT’s security features. Pretty awesome. They show you the vulnerable code, how they pull it off, and it’s a fun read. They also show you how to bypass common anti-black-hat tricks in a simple fashion and drive home a good point: Automated protection won’t save a poorly written program from being exploited. They also end up tempting you with their other books with the last chapter: Rootkits. I’ve known about them for a while but never really how they worked, and the book really shines here. There’s tons of code and they go into a good amount of detail on what rootkits can do and how they do them. They presumably save the good stuff for their book of the same name (Rootkits: Subverting the Windows Kernel), and this chapter really whets your appetite for it (at least it did for me). So it’s a great high level overview of the security world. If you need to end up doing stack overflows or other exploits, The Shellcoder’s Handbook may be a better choice, but I think this is still a great read. They’re very complementary books and they do a great job of covering the gaps each other misses, and while The Shellcoder’s Handbook really only covers local exploits, Exploiting Software covers remote exploits as well. Check it out if you’ve got some downtime and let me know if you enjoyed it as much as I did! Also, since classes are up, I’ll take a little break from the million book reviews and ramble on about other stuff for a short while like we used to do. Enjoy!
C as the Fat Man
It’s pretty rare that I find a good programming language-video game metaphor, but after a bit of thinking, I think I’ve got something. For me, it’s a natural match between the C programming language (and C++) and Fallout 3′s Fat Man, a “tactical nuclear catapult”. They’re both super-powerful and exceedingly dangerous to your health if not used correctly. With the Fat Man, you need to know how to use it safely as well as the common ways that using it incorrectly will get you killed. It’s exactly the same for C (substitute your player’s death for your program’s death). And with both, you only learn from experience. It’s debatable whether or not either can be the first you use or whether it’s really practical for everyday use. If you’ve got enough ammo (which you really don’t), then you technically could use the Fat Man for whatever you need to, although there’s some close quarters action where you really can’t use it without killing yourself in the midst of it. And the same goes for C: you could use C for whatever you felt like, but there’s some tasks where it’s just not suitable. And the same goes the other way: there’s some tasks where the Fat Man and C really excel. The last thirty years have shown us that if you’re writing an operating system, it’s gotta be C. Many projects have tried to wrestle the throne away: SPIN (Modula-3), JavaOS (Java), and Singularity (C#). Of course this list is far from exhaustive, but it just goes to show, it just hasn’t happened yet. In the same way, for taking out the big enemies in Fallout 3 you’ve just gotta use the Fat Man: Of course it’s common knowledge all over again: no programming language is perfect for everything just as no weapon does every task perfectly well, otherwise, why would we need more than one of either? And I’d go further to say that the preeminence of garbage collected languages have shown us that as programmers, we just aren’t that great at managing memory ourselves. It’s hard enough to use malloc() and free() correctly on non-trivial systems, even with years of experience. But there’s a different angle I want to look at this from: Like the Fat Man, C gives you more than enough power to shoot your leg off. But with programs, you now have untrusted users running your programs as well as hackers nitpicking through your code looking for vulnerabilities. So now you’ve got the Fat Man being operated by people with no idea what they’re doing and by people who want to cripple themselves or each other. Here’s where it got dangerous: the C language and it’s runtime has so many different types of vulnerabilities that letting new programmers use them is straightup dangerous. Of course, you have to use it to learn it, but for new programmers, here’s a word of advice to you, in traditional CS style: Rule 1: If you have to program in C, don’t. Rule 2: If you really have to program in C, have a copy of The Shellcoder’s Handbook to your left and a copy of Code Complete to your right. If you’re programming in C and you don’t know how to use it correctly and how to use it safely, you’re no better than pointing the Fat Man straight at the ground and clicking the trigger. You’re begging for your program to memory leak or segfault, and that’s if you’re lucky. In all actuality, you’re opening up security holes waiting to be exploited (equivalent to the radiation poisoning you’ll get from the Fat Man if it doesn’t immediately kill you upon improper use). It’s a valuable lesson that language designers picked up for newer languages: they saw how C was being exploited and patched a lot of those issues with Java and its runtime system (of course not just Java, but that’s the notable example). And of course, I’ve used C and have enjoyed it and will continue to use it via the rules shown earlier. Always know the pros and the cons of the technology you’re using, and keep them nearby so they don’t slip by you.
Beware the Dreaded 'Reflector'
Sorry about the incredible lack of posting lately. Have had too much time drained by a particularly nasty homework assignment. Now that it’s out of the way, I will hopefully return to the ‘regular’ posting schedule until the next vicious assignment comes up. For you masochists at home who have 20 hours to spare, here’s the specs for it from my teacher’s site: Develop a tool, called reflector, which reflects against an attacker’s host the attacker’s traffic. In order to do so, the tool is able to simulate two non-existent hosts, say victim and relayer, at both the Ethernet and IP levels. Whenever an attacker sends a packet to victim, the packet is intercepted by the reflector application and re-sent as a packet from relayer to the attacker’s host. The reply that is sent by the attacker’s host to relayer is then sent back as a packet fromvictim (in reply to the original packet) to the attacker’s host. This is an example of how this works. Suppose that the reflector application is running on host 192.168.1.10 and it is simulating: If host 128.111.48.69 (the attacker) sends a TCP SYN packet to host 192.168.1.11 thereflector application will capture that packet and transform it in a TCP SYN packet from 192.168.1.9 to 128.111.48.69. If the attacker host responds with a SYN/ACK packet (or a RST packet) to 192.168.1.9, then the reflector application will transform that packet in a packet from 192.168.1.11 to 128.111.48.69. This process is repeated until the application is stopped. You will have to sniff packets using libpcap. Also, you will have to spoof all the right ARP/IP packets that are needed to make it work reliably, using libnet. The application must be invoked with the following syntax: A non-default interface can be specified using the --interface command-line option. For example, in the example above, the invocation will be: What a doozy! The algorithm isn’t really even that complicated; just playing with the newest versions of libnet and libpcap is a real pain. But it’s all behind us now…
# reflector --victim-ip [IP Addr] --victim-ethernet [Ethernet Addr] \
--relayer-ip [IP Addr] --relayer-ethernet [Ethernet Addr]
# reflector --victim-ip 192.168.1.11 --victim-ethernet 00:0A:0B:0C:11:37 \
--relayer-ip 192.168.1.9 --relayer-ethernet 00:0A:06:1B:AB:B0

