
Thoughts on the Google File System

The first paper on the extremely interesting list of distributed systems readings is the Google File System. It is not a particularly new paper, published way back in 2003 at the Symposium on Operating Systems Principles (SOSP), but it is still a great read. I’ve chosen to start with this paper over the other, more recent papers, since it is both the earliest chronologically and sits at the lowest layer of abstraction. If you have never heard of it before this point and have even the vaguest interest in file systems, read on!

A caveat: if you’re really into file systems, you should read the paper yourself instead of taking my word for it. My review isn’t as concerned with the specifics as you might like.

The basic idea is this: Google has very different needs than you and I do, and those needs are different enough that Google has to squeeze out as much performance as possible. Furthermore, failure is a common scenario, so fault tolerance is a must. Byzantine (that is, malicious) failures don’t need to be worried about: it’s safe to assume that no one inside Google would maliciously attack their own file system, and their networking controls prevent the outside world from seeing any direct interface to the file system itself. They specifically say that they had made several modifications to the Linux kernel, but after a while it simply became a better idea to build their own file system and architect it to their specs.

So now that it’s decided that a new file system is needed, what makes it different from your run-of-the-mill file system? At Google, extremely large data sets are commonplace, so their chunk size (the smallest unit of storage) is 64 MB. Contrast this with ext3, which allows block sizes between 1 KB and 8 KB, roughly 8,000 to 64,000 times smaller! Of course, we’ve already talked about the ‘why’: we work on much smaller datasets. 64 MB is completely ridiculous for you and me but completely necessary for Google.
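
To make that difference concrete, here is a back-of-the-envelope sketch in Java (the 1 TB file size is my own hypothetical, not a number from the paper) counting how many pieces one large file turns into at each granularity; fewer pieces means far less metadata for a master to track.

    // Rough illustration: pieces per file at GFS chunk size vs. an ext3-style block size.
    // The 1 TB file size is an assumption for the sake of the example.
    public class ChunkMath {
        public static void main(String[] args) {
            long fileSize = 1L << 40;      // 1 TB file (hypothetical)
            long gfsChunk = 64L << 20;     // 64 MB GFS chunk
            long ext3Block = 4L << 10;     // 4 KB ext3 block

            System.out.println("GFS chunks:  " + fileSize / gfsChunk);   // 16,384
            System.out.println("ext3 blocks: " + fileSize / ext3Block);  // 268,435,456
        }
    }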

That, in fact, is a common theme of the paper. You see a lot of “Google needs this and you won’t, so that’s how we got away with these decisions.” The paper is also surprisingly honest: you get a good feel for what the file system is good at and what it’s not. It’s not great if a ton of people are running the same executable file and it isn’t replicated in enough places. By default, everything is replicated three times, but if you have a lot of clients requesting the exact same thing, you quickly run into issues with the Google File System. This is expected to be an atypical case, however, and if you did have such a file, you could tag it as such and have many extra replicas out there to avoid this problem.
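
To see why a handful of replicas can become a bottleneck, here is a toy calculation (the client counts are made up, not taken from the paper): with N simultaneous readers of one chunk spread across R replicas, each replica serves roughly N/R of them, so bumping R for a known-hot file spreads the load.

    // Toy model of the hotspot problem: readers per replica for a popular chunk.
    public class HotspotMath {
        static long clientsPerReplica(long clients, int replicas) {
            return (clients + replicas - 1) / replicas; // ceiling division
        }

        public static void main(String[] args) {
            long clients = 1000; // hypothetical burst of readers for one chunk
            System.out.println(clientsPerReplica(clients, 3));  // ~334 each with the default
            System.out.println(clientsPerReplica(clients, 30)); // ~34 each with extra replicas
        }
    }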

The best sentiment the paper expresses, I would say, is this:

One should be careful not to overly generalize from our workload. Since Google completely controls both GFS and its applications, the applications tend to be tuned for GFS, and conversely GFS is designed for these applications. Such mutual influence may also exist between general applications and file systems, but the effect is likely more pronounced in our case.

This pretty much summarizes the cool things you can do when you have applications that exhibit very specific behaviors and enough talent behind them. You can make a totally new file system just for them and run a chunk of a huge company on it. The downside, though, is that we will never really get to play with it on our own.

Enter Hadoop. It provides both a MapReduce implementation (to be discussed at a future time) and an open-source implementation of the Google File System for it to run on, appropriately named the Hadoop Distributed File System (HDFS). Its specifications show it to be designed very similarly to GFS, so for a while, this is the closest we’re going to get. One downside of HDFS versus GFS is that, from what I’ve seen, you can’t have computers dynamically joining or leaving the file system; you have to set it all up ahead of time, which always seems to end up being imperfect. On the plus side, like GFS, HDFS automatically replicates your data for you, handles faults, and abstracts it all away from you, the user.
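
As a taste of that abstraction, here is a minimal sketch against Hadoop’s FileSystem API. The NameNode address and file path are placeholders of my own, and the setReplication call is the HDFS analogue of tagging a hot file for extra copies; block placement, replication, and fault handling all happen behind this interface.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000"); // placeholder NameNode address

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/hello.txt");        // hypothetical path

            // Write: the client never chooses which DataNodes hold the blocks.
            FSDataOutputStream out = fs.create(path);
            out.writeUTF("hello, distributed world");
            out.close();

            // Ask for extra replicas, roughly what GFS does for hot executables.
            fs.setReplication(path, (short) 10);

            // Read it back; replica selection is equally invisible to us.
            FSDataInputStream in = fs.open(path);
            System.out.println(in.readUTF());
            in.close();
        }
    }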

The other downside is that in order to run standard Unix-like commands, you have to go through the hadoop executable. It breaks the immersive flow a bit to always have to type “hadoop dfs -ls /foo”, “hadoop dfs -cat /foo/bar”, and so forth. A natural solution to this problem is to use FUSE to get something that resembles a regular file system, which gives us hdfs-fuse. It’s not clear how well it works, but it’ll be a fun side project to investigate one of these days. If you’ve got any experience with it, drop me a line!