Byzantine Reality

Avatar

Searching for Byzantine failures in the world around us

Articles tagged with 'computer science'

Behind Deep Blue

While at the Computer History Museum a while ago for the Big Data Workshop, I spent slightly too long in the gift shop nerding out and on impulse, picked up Behind Deep Blue: Building the Computer that Defeated the World Chess Champion. My reasons were two-fold and pretty straightforward: I like computers and I like chess, so why not actually learn about the computer that changed the world more than a decade ago? But my expectations going in were surprisingly far off, as it’s not so much a book about computers and chess as it is a book about people who happen to use computers and play chess (and use computers to play chess of course). Yet it is the first book that I read cover-to-cover in a single sitting and simply could not put down.

This book is not for everyone. If you saw the title of the book and weren’t instantly intrigued, then the book probably isn’t for you. It follows the exploits of the author, Feng-Hsiung Hsu, as he eventually gains interest in building a chess computer that can defeat the World Chess Champion. For the uninitiated, this was seen as a monumental task and many people simply assumed that a computer could never play chess better than the masters. There’s a bit of hubris involved in the reasoning behind this, and it tends to boil down to “it could play chess ok, but lacks the ‘soul’ or ‘humanity’ to really play chess at the highest levels”. With the joy of hindsight, we can now see that this was similar to the Titanic, with people saying “this boat cannot sink” before the maiden voyage.

So quite a bit of time had been spent by groups worldwide trying to build a chess computer with this goal ultimately in mind, but a lot of them could play chess OK (and no better). The book covers the three big projects on chess computer development at Carnegie Mellon University and IBM that Hsu was involved in: ChipTest, Deep Thought, and Deep Blue. If you’re not an electrical engineer or computer engineer, some of the finer points on the hardware will be lost on you (which is why the book is not for all), but most will be able to appreciate to some extent the drive that formed in Hsu over the 12 years of his struggle to be the best.

(Deep Blue, photo taken at the Computer History Museum)

The book actually has a very sad ending, which I never saw coming. As the Titanic reference pointed out, you likely know going into this story that by the end, Deep Blue will defeat Garry Kasparov, the World Chess Champion. But the sad ending comes on, strangely enough, right when Deep Blue wins. This win comes in the 1997 rematch, and in the 1996 match, Kasparov wins without too much trouble. In the 1996 match, everyone is happy for Kasparov and cheers him on as something akin to the “protector of humanity from the machines”, and in the book it is explicitly said that he felt quite a pressure to be this protector. The victory is seen as a triumph of man, so when the 1997 rematch occurs and Deep Blue wins, nobody is really happy for the Deep Blue team. His three-man team has put in 30 combined years of their lives into this project and it is seen as a failure for humanity, while all along Hsu and Company have been trying to tell people that this really is not a failure. In their eyes, this is the “triumph of man as a toolmaker over man as a performer” (the quote might not be exactly correct but is pretty close). Furthermore, Kasparov’s completely unsportsmanlike behavior during the rematch and the drama he caused afterwards makes the ending even more sad. He accuses the team of cheating, of making moves that the computer wouldn’t have made, and demands to see Deep Blue’s logs during the match (which is equivalent to reading its mind since he would then know all the moves it was going to make). I’m sure if accusations like those were leveled at him he would have simply stormed off instead of being surprisingly mature as the Deep Blue team was.

(Kasparov v. Deep Blue, photo taken at the Computer History Museum)

This brings us to another interesting point about the book. In some sections, adversarial characters sometimes act exceedingly irrationally, and Hsu spends a lot of time slowing down these moments to try to help us understand why (for example) Kasparov would do some of these outlandish things. He says it’s still not acceptable, but repeatedly tries to at least give a well-reasoned explanation of his adversaries’ disrespectful actions. This allows the book to stay personal and told through his viewpoint but removes a lot of the bias that you could otherwise see in it.

It ends on another sad note after that, to make things even more depressing. With their goal completed, the team disbands and go their separate ways, to do things completely unrelated to chess (not the sad part). However, Deep Blue is taken out of commission, as the resulting media firestorm by Kasparov convinces IBM that spending more money on chess computers is no longer necessary (two reasons here: one, they already beat him, and two, Kasparov’s demands were too extravagant for IBM to be able to afford both him and the chess computer project, causing them to abandon both). A much slower version of Deep Blue is left online for IBM employees to play with (not sure if it’s still up anymore) but is a shadow of the true Deep Blue.

Kasparov goes on to say that surely he could not lose again, and Hsu leaves IBM and takes the rights to Deep Blue with him, seemingly with a rematch in mind. Kasparov backs down, drama ensues, and Hsu eventually becomes disinterested with chess computers. Kasparov then says that he definitely won’t lose to a computer now, but was unable to defeat Deep Junior and X3D Fritz in 2003 (they tied in both occasions). Hsu claims that Deep Junior and X3D Fritz were far inferior to Deep Blue, as Deep Blue had a number of critical advances over Deep Junior and X3D Fritz despite being much older (if you want the technical details of why, go buy the book). I buy the arguments made, and it’s a shame that the best chess computer we even had doesn’t exist anymore. I can see why, however, and why there’s not much interest in resurrecting the project – again, read the book for details I’ve skipped on here – and it’s one of those “sad but true” kind of things.

As I said in the beginning, I couldn’t put it down even though I knew the ending going in, which for me, makes this book a keeper. If you’re into computers or chess, put down whatever you’re doing and go getBehind Deep Blue. It’s that good.

Large Scale Data Analysis Talk

As part of the seminar I’m in on Large Scale Data Analysis, I gave a talk on the continuing battle in the MapReduce world between DeWitt and Stonebraker on the side of parallel databases versus Dean and Ghemawat on the side of MapReduce. For those of you not interested in reading these long articles, it basically boils down to this: DeWitt and Stonebraker originally claimed that MapReduce allowed for fast data movement but was slow for actual computation, so you should use parallel databases instead (they suggested Vertica and “DBMS-X”, a mystery database).

They now say that for “quick and dirty” one-off jobs you should use MapReduce due to the fast data movement but in all other cases you should use Vertica. Dean and Ghemawat responded by saying that all the faults DeWitt and Stonebraker accused MapReduce of having are really faults of Hadoop’s MapReduce implementation and not MapReduce as an algorithm (that is, Google MapReduce doesn’t suffer from these problems). Specifically, DeWitt and Stonebraker’s MapReduce numbers turned out to be really slow because they stored the data as strings and parsing the data out was extremely expensive (often more expensive than the actual computation involved). To remedy this problem the data can simply be stored as Protocol Buffers, which DeWitt and Stonebraker were unable to do since Hadoop MapReduce doesn’t support it (although Google MapReduce was). There is an open ticket for this feature in Hadoop MapReduce but it appears to be orphaned long ago. If we actually get this feature in it will make the comparison extremely interesting.

The second half of my talk covered two recent papers on virtual machine migration, which is really handy if you need to reboot the physical machine for upgrades and maintenance or migrate the virtual machines for load balancing or power management, but as far as my research goes, none of those really help me out. Sysadmins will love these features, but the rest of us are really more concerned with reacting to VM failures and not so much to proactive VM failures.

Either way, I uploaded the slides as usual for your enjoyment. Hope you find them useful!

Major Area Exam

Just in case you were interested, I gave a talk last week at UCSB covering everything having to do with cloud computing. Unfortunately, I was only able to tape the first ten minutes of it, so I pretty much just cover virtualization and some basic introduction stuff as well. Here’s the slides if you want to follow along at home or see what I talked about after the video ended. Enjoy!


History of MapReduce, Part 3

Now that we’ve covered the most popular (thus far) part of MapReduce’s life, let’s move on to its present and uncertain future. This time around we’ll cover extremely new material starting this year and will try to avoid rampant speculation about the future since there tends to be a high probability of it being flat out wrong. This article is primarily concerned with two topics: a new paper out that will be discussed at this year’s SIGMOD conference (A Comparison of Approaches to Large-Scale Data Analysis), and Hadoop Streaming, a fun new way to play with MapReduce that’s being used in some serious new cloud projects.

Let’s deal with these topics in the reverse order. Last time around, we showed that MapReduce could be useful for solving a wider class of problems then it is generally given credit for. One downside of this is that the vanilla MapReduce implementation forces the user to write Mappers and Reducers in Java. Don’t get me wrong, I like Java (despite the badmouthing I’ve done about it in the past), but I’d like to use other languages as well.

Enter Hadoop Streaming. This very interesting component that’s bundled in with Hadoop lets the user specify an arbitrary Map program and an arbitrary Reduce function. Now we can write MapReduce in any language and have it supported by Hadoop in an extremely simple fashion. The example given on the Streaming page itself uses bash:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
   
-input myInputDirs \
   
-output myOutputDir \
   
-mapper /bin/cat \
   
-reducer /bin/wc

This is pretty cool, suffice to say. The page also shows using Python, and this really lowers the barrier to using MapReduce to the point where the  “average programmer” can pick it up and start using it. And it seems Amazon has figured this out as well, albeit quite some time ago, and now have Amazon’s Elastic MapReduce. This looks like a pretty frontend on Hadoop Streaming that now lets you write code in one of seven languages currently (adding more would likely require installing those compilers on the default VM itself, which is actually not too difficult). But let’s substantiate that claim real fast. Why do I think Elastic MapReduce sits on top of Hadoop Streaming? Well, Amazon doesn’t make it a secret that it’s using Hadoop (see the previous link), but to know that it’s using Streaming, just look in the help manual:

elasticmapreduce

Using Hadoop is pretty much a no-brainer for Amazon. Hadoop added in support for S3 URLs quite some time ago so running it in EC2 itself wasn’t that far off. We’re in the process of doing a similar feat for AppScale, and while researching how to do MapReduce with arbitrary languages drew me to Streaming. After I saw the potential of that, I tripped onto the page containing the above picture and saw that this idea was apparently sound enough since Amazon is already doing this. The only real question that was left to me was how to get the data from our databases (HBase or HyperTable at present, Cassandra and MySQL soon enough).

This appears to be a non-trivial problem. For HBase it’s easy enough: they provide abstract classes you can implement that will take care of input and output, although it appears this would only work with vanilla MapReduce and not Streaming. The easiest way around the problem is to just require that the input be on the file system, which is exactly how vanilla MapReduce and Streaming do it (specifically it must be in either HDFS or S3).

Unfortunately, about a year ago, a flame war erupted from part of the database community, who are of the impression that MapReduce is a major step backwards. This apparently generated such controversy in the interwebs that a followup to the original post was made to address the comments of the first. The basic idea sums up to this:

MapReduce is fun and nice and all but was essentially already done 30 years ago. It’s nothing new and by the way SQL can do anything MapReduce can do but way better.

The first sentence is completely true. The renewed interest around MapReduce is all around the pretty packaging of it and accessibility to the “average programmer”. But the second sentence is where the flame war got started and where it rages on to this day. Furthermore, the authors of these posts noticed how much attention this was getting and published a paper in this year’s SIGMOD conference named A Comparison of Approaches to Large-Scale Data Analysis, which specifically compared Hadoop’s MapReduce to relational databases. Since this paper really sums up their argument against MapReduce and shows insights to their viewpoints, let’s focus on it except when the blog posts answer questions it doesn’t.

If you aren’t already familiar with the war between the MR-folks and whatever fraction of the database community it is that doesn’t care for MR very much, it really boils down to this. 

should have compared HBase and MapReduce to Vertica and MySQL with stored procedures.

Although we were not surprised by the relative performance advantages provided by the two parallel database systems, we were impressed by how easy Hadoop was to set up and use in comparison to the databases. The Vertica installation process was also straightforward but temperamental to certain system parameters. DBMS-X, on the other hand, was difficult to configure property and required repeated assistence from the vendor to obtain a configuration that performed well. For a mature product such as DBMS-X, the entire experience was indeed disappointing. Given the upfront cost advantage that Hadoop has, we now understand why it has quickly attracted such a large user community.

(DBMS-X gets a bit of a beating like this in the paper as well, so maybe this is why they didn’t reveal which database it was.) With all that being said and done, the paper is still an interesting read. If you buy the argument that since users are leaving relational databases for MapReduce and that this makes them equal comparisons, I think you’ll get more out of it than if you don’t. Since I really haven’t tried ditching a relational database and doing the exact same work in MapReduce, it’s really difficult to speculate too much about it. In the meanwhile, I’ll keep my fingers crossed hoping to see an HBase v. MySQL showdown although it seems like it’s hoping for a bit much since that’s not what it originally was and there’s no “Future Work” section in the paper.

Since this “history lesson” covers the present age of MapReduce, it would be also remiss of me to leave out at least a mention about Pig Latin and friends. It will certainly be interesting to see how they fare and what specifically about it lets them succeed or stunts their acceptance in the MR community, and if we end up seeing comparisons between it and SQL down the line :) .

History of MapReduce, Part 2

Recently for one of my classes I got the chance to research a parallel programming technique and it’s applications. Being a bona-fide MapReduce whore-in-training, I naturally had an easy choice. Since this is “Part 2″, we cover the era I’ll dub the “Return of MapReduce”, in which Google re-introduces and rebrands MapReduce as the technique we love to hate and love to love (choose whichever suits you best) and new advancements are made with it. This covers about the 2004-2008 time period, in contrast to Part 1 (the beginning of existence-2004), where map-reduce exists as part of Lisp, MPI, and other environments but is not particularly accessible to the average programmer, and Part 3 (2009-present), which I will leave as a surprise for those of you diligently following along at home.

With that said, here’s part 2, beginning with the paper that changed MapReduce and revived mainstream interest in the technique. Enjoy!

Although the map-reduce technique is well known in the worlds of distributed and parallel computing, Google’s MapReduce paper[1] has brought new passion to this programming paradigm. For those not familiar with the technique, the user provides the system with a “map” function (a function taking a tuple and outputting a tuple) and a “reduce” function (a function taking a tuple and outputting a single value, although it can also output a tuple if needed).

This simple idea has been shown to be an effective method of programming embarassingly parallel problems, as both the map and reduce functions are stateless and thus easily parallelizable. Further excitement was added to this when Apache’s Hadoop project was released, providing users with an open-source implementation of MapReduce. It is programmed almost entirely in Java, with only a few native C libraries (interestingly enough, these are all compression libraries).

Their implementation provided also to be highly scalable, and last year placed first place in the Terabyte Sort Benchmark. In order to do so, they used 910 nodes, each with 2 cores (a total of 1820 cores), and were able to keep the data entirely in memory across the nodes. With this many machines and their open-source MapReduce implementation, they were able to sort one terabyte of data in 209 seconds, in contrast to the previous record of 297 seconds. Not one to be left behind, Google showed that they were still quite ahead, claiming they could do the same with their closed-source implementation in only 68 seconds. The only details they note are that they use 1000 nodes, but since it’s not verifiable or mentioned on the official Sort Benchmark Home Page, it may not be worthwhile to discuss it further.

In the years since MapReduce’s “reintroduction” by Google, researchers have been actively seeking problems to use MapReduce on and investigate. Another implementation of MapReduce, named Phoenix [2], has been created to facilitate optimal use of multi-core and multiprocessor systems. Written in C++, they sought to use the MapReduce programming paradigm to perform dense integer matrix multiplication as well as performing linear regression and principal components analysis on a matrix (amongst other applications). The following graph shows the speedup they achieve over varying number of cores:

phoenix1

Researchers have taken MapReduce in other directions. MapReduce has also been exported for use on FPGAs and GPUs [3]. Users program map and reduce functions in ANSI C, and the paper presents results for Monte Carlo simulations approximating the value of pi and the n-body problem under gravitational force (again, amongst other problems). The numbers below harness a GPU with 16 multiprocessors, each with 8 cores, yielding 128 processing units. We begin by showing the numbers for the Monte Carlo approximation of pi, followed by the speedup for all benchmarks implemented on the GPU and FPGA:

pi-computationfpga-gpu

The final evolution of MapReduce we shall examine here is done by the same people we started off looking at, the Hadoop project. They assert that the MapReduce paradigm can be too low level, giving a programming language (Pig Latin [4]) and runtime (Pig) that builds on top of MapReduce. This dataflow language is extremely restrictive but also extremely parallelizable. It allows users to write Java code that interfaces with the system as well as use their Pig code to process large datasets. If allowed, a project investigating the types of applications that can be written with this system and integrating it into our open-source cloud platform would be both novel and useful for the open-source community.

By this point we have hopefully conveyed the following:

  • Numerous scientific computing problems can be solved through the MapReduce programming model, although many others cannot.
  • MapReduce has been implemented in various programming languages (C, C++, Java) and for various environments (CPUs, GPUs, FPGAs).
  • MapReduce is scalable on embarrassingly parallel problems, but has not been explored in the cloud computing community yet despite demand for it.

We must also make an important point: MapReduce does not work on all problems. It relies on the problem being able to be specified as a combination of a map function and a reduce function, which is not always possible. However, I wish to investigate whether the domain of problems that can be solved with MapReduce is larger than originally suspected (just embarrassingly parallel problems) or not, and why or why not. Furthermore, I wish to see if we can integrate MapReduce’s power with our open-source cloud software in a way that can be harnessed by the general community in a way similar to how Hadoop has done over the last half-decade with their work.

References:

[1] MapReduce: Simpified Data Processing on Large Clusters

[2] Evaluating MapReduce for Multi-core and Multiprocessor Systems

[3] Map-reduce as a Programming Model for Custom Computing Machines

[4] Pig Latin: A Not-So-Foreign Language for Data Processing

Amazon's Dynamo

Now that we’ve spent more than enough time looking over Google’s highly scalable infrastructure, let’s turn our attention to an even newer paper from Amazon. Their datastore, dubbed Dynamo, is an interesting contrast to Google’s work that brings up many interesting questions and points to note from it. Here’s the executive summary:

Dynamo is built to be highly available, and sacrifices the traditional notion of consistency in order to do so.

Yet the paper itself perhaps gives a better one-liner:

Dynamo can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly.

The paper says in the title that Dynamo is a key-value datastore (basically a hashtable again like BigTable), but from reading the paper it’s not clear that it’s at that low of a level of abstraction. I originally was under the impression that it fully encapsulated the database, but upon reading it all the way through it’s clear that’s not particularly true. Let me explain.

Dynamo is really more of a piece of middleware software that sits on top of other databases and provides the abstraction to the user of a key-value datastore. The paper states that Dynamo can hook into “aBerkeley Database (BDB) Transactional Data Store, BDB Java Edition, MySQL, and an in-memory buffer with persistent backing store.” Different databases are chosen based on the size of the data being stored in it (BDB for data > 100 KB, MySQL otherwise), and this is exposed as a configuration option for the developer’s software, since they would be the best ones to know how big all the data they’re storing in Dynamo is. This bring up another interesting thing about Dynamo to contrast with the Google papers we’ve seen:

Dynamo gives more choices to the developer to harness the system better, but at an increased level of complexity.

Now I have to consciously know as a programmer while writing my app to think of how big all my objects will be and which database stores them best. And what happens if some of my objects are really small and some are really large? I love giving more power to the programmer, but like a lot of powerful abstractions, I don’t know if this is begging for disaster. The paper does end up stating that applications should have data less than 1 MB, but it still doesn’t address the question of having lots of data < 100 KB in size and lots of data > 100 KB but < 1 MB.

But back on topic, since Dynamo dumps data to a database, I’d say it really qualifies more as a piece of middleware rather than it’s own database. It seems entirely possible that one could trivially also dump data to Amazon’s Simple Storage Service (S3), since it also uses a get/put model like Dynamo requires.

The recurring theme in this paper is how to achieve scalable availability. Dynamo achieves this through a notion of “eventual consistency”. Since the system is highly distributed and failures are likely, it is exceedingly costly to perform updates. Using the Paxos algorithm requires a majority of the nodes to be up, and the amount of communication required for all the rounds of Paxos is apparently far too high for Amazon’s constraints (not explicitly stated but seems to be the case). Therefore, if a certain number of nodes get an update and promise to eventually propogate the data, the system can then survive many more failures and ensure consistency, given enough time.

The big problem with this should be clear: what does “eventual consistency” mean to the programmers using the system? On this, the paper is clear. A program communicates with a “coordinator” who coordinates reads and writes with the various nodes. If the user attempts to read a value and the coordinator gets back different values from the various nodes, it simply packages them all up and gives them all to the user. The user then decides what to do with it. This is just like before: the user would ideally know what to do with all this data better than the system does, but at a cost of increased complexity in the system. It is possible to give simple rules to the system and have it “automatically resolve” conflicts, like having the newest write be the single value to return, but the choice on which to use is left to the programmer.

However, it is not clear how a coordinator comes to get this role in the system. The paper stresses that all nodes need to be symmetric and play the same role, in order to survive failures and keep the system simple. Therefore, how does one node become a coordinator? It is stated in the paper that the user’s program has a preference list of nodes to read from and write to, so it seems to be that the user’s program could play the role of coordinator. This would keep in line with the rest of the paper (add more complexity on the user’s side to keep things simple elsewhere), but as it’s not clear from the paper, let’s not speculate further on it. We can go a different route though and add in another quote from the paper that compounds the confusion (here the context is about how to ensure that if our program writes something it reads back whatever it wrote):

In particular, since each write usually follows a read operation, the coordinator for a write is chosen to be the node that replied fastest to the previous read operation which is stored in the context information of the request. This optimization enables us to pick the node that has the data that was read by the preceding read operation thereby increasing the chances of getting “read-your-writes” consistency.

So now it seems clear that in fact the coordinator is indeed one of the nodes, but it is not clear how it is elected to this status, except perhaps artificially by the user’s program. Maybe just the user’s program tells a node it is the coordinator for a write, gives it the write, and lets it do whatever it needs to do with it. This is the most probable, but it’s not clear what happens if that node dies while it is the coordinator since it now has state. Furthermore, the user’s program now has more state since it now is keeping track of who responded fastest to each read operation in order to know who to contact about writes, which also keeps in line with what we’ve seen here.

Now that we’ve agreed that Dynamo really acts more as a piece of middleware providing a database abstraction, a good question to ask is “What does it do to give this abstraction?”. On that, the paper is clear:

In addition to the actual data persistence component, the system needs to have scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management.

It also enforces single key updates only (transactions thus are similar to BigTable and Cassandra) and uses a gossip-based failure detection and membership algorithm as well as an error correcting scheme implemented as a Merkle tree. It does all this to provide the abstraction of a distributed hash table just as we hinted at in the introduction quite some time ago, and a decent chunk of the paper is dedicated to how this hash table is partitioned and all that fun stuff that I’ll skip over here.

What I think is a bit more interesting is how many nodes need to participate in a read or write for it to go through successfully. Like many other things in Dynamo, you get to choose. Once more there’s much more power in your hands to make Dynamo more flexible, but now there’s more of a requirement that you really get to know Dynamo and mess it up a few times to learn how to choose these numbers. Some quick notation in the paper is useful to explain before going on. They use R to refer to the number of nodes required to participate in a successful read operation and W to represent the number of nodes required to participate in a successful write operation. N is a bit less clear. Sometimes in the paper it refers to the total number of nodes in the system, and in other places it refers to how many replicas you want for your file, and in other places it refers to both these at the same time. Take for example what I’d consider to be the most informative line of the paper (possibly after the bit in the intro):

The common (N,R,W) configuration used by several instances of Dynamo is (3,2,2).N replicas to perform a “durable write”.

Thus Dynamo gives you three-copy replication (same as BigTable) but also is hinting that a Dynamo cluster typically also only has three nodes in it. A majority of the nodes are required to participate in a simple read or write operation (here 2 out of 3). This number is customizable, so they do mention that a highly writable system can be produced by setting W to 1 (only one node needs to participate for a write to be successful) and the same for a highly readable system with R to 1. If one sets R or W to be less than a majority, then consistency becomes a bit weaker in the system and these issues we talked about earlier can show up.

So there’s Dynamo in a nutshell! It’s an interesting alternative to BigTable that shows promise, and it would be very interesting to know how much of this configurability in the system really costs the average programmer at Amazon. And since we always try to mention some open source versions of what you’ve read about here, it would be remiss of me not to mention Project Voldemort and Dynomite. I don’t really have any experience with them, but we may look into them in a little while (no promises however). Sorry about the long delay since our post last time, but I will be posting soon about what took up my time then, so stay tuned!

Thoughts on Google (meta)

Now that our coverage on the Big 5 Google papers has concluded, I thought it may be handy to aggregate the links to them all in one handy spot. That being said, here’s the links to our series of reviews (sorted by post date / chronological publishing of paper):

  1. Thoughts on the Google File System: How one constructs a fault-tolerant file system with which to build everything else on that performs optimally on huge data sets.
  2. Thoughts on MapReduce: How to write programs to process huge amounts of data as quickly as possible given the huge amount of resources available.
  3. Thoughts on BigTable: How to store data in a more structured format yet keep replication and high speed processing.
  4. Thoughts on Chubby: How to locate specific nodes serving a specific purpose though the use of a highly available naming service.
  5. Thoughts on Paxos: How to implement a consensus algorithm to ensure agreement amongst nodes in the presence of failures and keep high performance in the meanwhile.

So check them out if you haven’t, and be sure to come back next time for our review of Amazon’s Dynamo!

Thoughts on Paxos

After a month of Google papers, we finally come to the last paper in our Google reading list: Paxos Made Live. Like the Chubby paper from last week, this paper doesn’t present any new research, but instead tells us how to build a system from the current research out there such that it can withstand incredible load. It also aims to show how we could possibly be confident in a non-trivial amount of code as well as optimizations that can be made to improve the performance and reliability of the system. So how did they do and how did it turn out? Read on!

A good starting point in the paper is its concise definition of the Paxos algorithm. As it is about a page long, I’ll omit it here but leave you with a more general description of Paxos, simply that it is an algorithm that assures agreement on a value between a majority of nodes in a distributed system. Of course if you’re interested in the algorithm this paper is an excellent starting point, and the writeup on Wikipedia is a good re-write of it (Basic Paxos).

An inherent downside of this is that since it is a consensus algorithm, the FLP Impossibility Result tells us that liveness can be impossible with a faulty processor (although not guaranteed and the probability of such an event is unknown). Now we have a problem. Liveness is the ability of the distributed system to make “progress”, that is, for new values to be agreed on over time, and now there’s an important theoretical result telling us that we can’t guarantee it if a single node fails. Interestingly, this paper does not state how they resolve the problem, but the Chubby paper does:

Paxos maintains safety without timing assumptions, but clocks must be introduced to ensure liveness; this overcomes the impossibility result of Fischer et al.

So how do clocks solve the problem? The original paper by Fischer et al. tells us the answer: just have the proposed value time-out and fail on all nodes. Since the nodes all have synchronized clocks, this is entirely possible and solves the problem altogether.

Now comes the obvious question: why use Paxos in the first place? Why do they even need it? This comes about because Chubby was using a commercial three-times replicated database but found it had bugs in its replication code, and since Google needs Chubby for extremely high availability, this warranted a rewrite of all the involved code. With Paxos, we can ensure that a distributed system can reach agreement on a value. In this case, the value is a change to the database, and once all the nodes agree on a change (technically, a majority, but other nodes can catch up on it later if they miss it), they can make the change to the database and log it so that snapshots of the database can be taken (for backup and recovery purposes). Now we’ve got a distributed database whose code we can debug (of course, “we” refers to Google and not us) should worse come to worse.

If you did happen to read the Paxos algorithm or the Wikipedia article on Paxos (or even just look at the diagram), you can see that this algorithm runs in a number of phases, and that there’s lots of potential for optimization. A common optimization is that the first step of the algorithm, electing a leader (master) for this run of Paxos, can be omitted if the node proposing the change is already the leader. And just like that we’ve knocked off O(2n) messages (n messages sent out requesting to be the master and potentially n messages received with the response). In fact, citing the paper (footnote, Page 7):

In a loaded system, under one percent of the Paxos instances run the full Paxos algorithm.

This ends up being pretty good for the final performance of the system, but before we can get that far, we have to at least mention the various problems that arise. With Chubby, masters can fail and new masters need to take their place. Paxos allows for this, but a problem encountered was that the intended behavior was to have events issued to old masters fail, while in practice nothing was preventing these events from succeeding (since nothing identified a master’s lifetime). So to enforce the requested behavior, exactly that had to be added in: epoch numbers, which represented a master’s lifetime (analogous to noting that our current president is the 44th, to distinguish him from all others).

Since Chubby is designed to be highly available, proper coding measures were taken to ensure that rolling it out would be as successful as possible. The paper details the numerous problems they ran into (problems upgrading, problems rolling back to an old version, and so on). But more interesting is how they were confident of its functionality. Extensive test suites were constructed to verify that even if faults were injected into the system, that it would remain safe (that is, one node couldn’t get an update unless a majority of them did) and live (over time, nodes will get updates).

The more interesting development note is how they coded Paxos in the first place. They note that they coded core parts of the algorithm as state machines in a specification language and then used a compiler to translate it to C++. This makes it much easier to verify its correctness, as it’s much easier to verify a page of pseudocode-like specifications than many thousands of lines of C++ code. Upon reading this, I thought of one thing: a controversial blog post by Joel Spolsky about Wasabi. Essentially it’s a similar idea: they needed certain features from their language but it just ended up being so mundane to do so that it was unproductive. So they wrote a compiler named Wasabi to give them all the language features they needed and compiled that to the language they needed. Since it was a lot of addons to existing languages, compiling it to the existing language ends up being much easier than writing a whole new language. Just like the Google guys, they’ve both used the right tool for the right job, and got a better product that was easier to program (given some initial development time on the compiler itself).

They follow up that interesting idea with another one: they had a bit of trouble finding where problems were in the system due to the fault-tolerant nature of the system. When a node would die, the fact of it would be hidden from the developers and it became harder to notice without extra logging procedures. Of course, the problem was resolved, but it makes it inherently difficult to address different types of problems as they arise.

That being said, let’s look at how it performs. From page 14 of the paper:

table1

Like last time, these numbers really speak for themselves. It outperforms their original database in every metric they measure it against and gives them the peace of mind they need should something come up down the line (since they also own the code base and are familiar with it).

The paper ends on a bit of an ominous note: they note that the current state of fault-tolerant systems is lacking, to say the least. They make it clear that the community has not thought enough about how testing fault-tolerant applications is done, and that taking these algorithms and modifying them to make them work in practice destroys any hope of correctness and proving useful properties about them in a reasonable amount of time.

This rounds up our tour of the “Big 5″ Google papers. As we’ve seen before, these applications were all designed for a very specialized environment (Google) with very specific needs. Processing huge amounts of data quickly requires a fault-tolerant filesystem (GFS), a database that can be run on it (BigTable), a programming model to use it with (MapReduce), a naming service to tie everything together with (Chubby), and an algorithm to ensure agreement between a set of nodes (Paxos). They’re all at varying degrees of abstraction and made for a fun ride. Of course, it doesn’t work for every situation, and most likely won’t work for yours, but it’s always interesting to see what everyone else is up to and how they pull it all off. We’ll continue next time with another paper from the reading list we looked at a while ago, so stay tuned!

Thoughts on Chubby

The last few times around we’ve been talking about performance. Extreme performance. We’ve talked about how fast the Google File System is, how fast MapReduce is, and how fast BigTable is. Yet all three rely on a common service we hinted upon last time, a lock service (amongst other things) named Chubby, the topic of today’s discussion. Chubby doesn’t bill itself as high performance like the other services we’ve mentioned, but is designed with high availability in mind. Let’s dive in further and see what exactly makes up Chubby and why it’s so special in the Google world.

Chubby provides a number of services, but at its heart, it is a lock service. It normally consists of five replicas (as opposed to the usual three from the Google File System) and is considered to be up if at least three are functional (a majority). This is essential because leader election is done via the Paxos algorithm, which requires a majority of the nodes to be up for progress to be made, but as it is the topic of our next discussion, we can brush it under the rug for now.

A main point in the paper is that although Chubby is a lock service, it is also used heavily as a naming service. It names locks in a file-system-like manner, and thus a server can place it’s location in a file named “/ls/myserver” (ls for lock-server). Since the server holds the lock on that file and renews the lock as long as it is up and providing a service, it is assured that the data is never stale. Clients can then find this server by querying Chubby for everything in the “/ls/” directory and voila! The naming problem is pretty much solved. The paper refers to these files as being “ephemeral files”, namely temporary files used to indicate that clients/servers are alive and which locks are valid. They also hint that it can be used to indicate whether other resources are valid asides from locks, such as caches, but that it is currently an unexplored area.

What is explored thoroughly is what Chubby does for other Google services. It is only stated that it is a “rendezvous mechanism” for MapReduce, which I would expand upon and speculate refers to naming which machines are running Map jobs and Reduce jobs so that the scheduler can map the output from one to the input to the other in a coherent manner. Both the Google File System and BigTable need to elect master servers (primaries), and Chubby is used in what appears to be the exact fashion we described in the last paragraph (note this also provides naming for where the primaries are). Finally, they also use it as a repository for files requiring high availability. They cite is as being used for access control lists, and it is a fair presumption to say that it is likely used on ACLs for metadata on files of particular importance (e.g., are about the filesystem itself or about particularly important data).

An interesting design choice that was made early on was to provide Chubby as a lock service over providing it as a library that developers could use in their programs. Both ways have their pros and cons, but the big downside of using it as a library is that the programmer would have to form their program into a state machine (required for Paxos), which requires a non-trivial amont of forethought and much greater re-designwork after the fact. They go on to state that they preferred the lock service idea since it’s relatively familiar to programmers, despite their knowledge that programmers don’t know how to use locks correctly, and that it’s even more difficult a problem in the case of distributed systems.

Like BigTable, metadata stored in Chubby contains a version number per file, but unlike BigTable, old versions cannot be recovered. It appears this version number is primarily used to determine whether or not a user has the most recent version of a file before they update it (e.g., they grab version 4 and want to write changes as version 5 but in the meanwhile someone else comes in and writes two sets of changes, pushing the current version number to 6).

Furthermore, while Chubby is a lock service, it only provides advisory locks (not mandatory). Nothing prevents a client from writing or reading a particular file they do not have the lock for, provided they have the correct ACLs to do so. The motivation behind this is more of convenience: it is perceived as a failure to force a user to reboot their machine to force it to release a lock and that having them release their lock is out of date is far preferable. Which brings us to a not-so-minor point we lightly touched on earlier: locks must be renewed constantly, as the leases on them tend to be small (although they can be variable length as set by Chubby). We shall later see that the traffic incurred by these renewals is the greatest source of traffic in the system by an order of magnitude.

In an attempt to make using Chubby more user-friendly, calls to it can be either synchronous or asynchronous. The programmer can specify a callback function as well to be called upon certain events happening. The primary events cited in the paper that callbacks are used is when a master fails over, when locks become invalid, or file contents are modified by a different user (all cases we’ve talked about earlier). Two methods are also given to close a user’s session: a Close method, which destroys the object the user is manipulating (this is all C++ so we have to do garbage collection ourselves), and a Poison method, which does not free the object but renders it unusable (all method calls on it fail) so that other threads can free it.

That being said, let’s look at exactly how Chubby performs. These numbers are lifted straight from the paper and claim to be typical of Chubby instances at Google (note that RPC rates are over a period of ten minutes):

chubby

This is a good summary of everything we’ve talked about: extreme availability (one 14 second downtime every 18 days) in the face of a huge number of clients (54000). And like we mentioned earlier, the lock renewal messages (KeepAlive RPCs) make up the vast majority of RPCs (93% versus 2% for getting a file’s metadata, the next closest contendor). Since there are so many messages going to the server, a proxy can be used (see the proxied clients row) to reduce the load going to the Chubby server, and its importance is not understated in the paper:

We have found that the key to scaling Chubby is not server performance; reducing communication to the server can have far greater impact.

So now you know a decent amount about Chubby and how it interoperates with the other Google services. If I’ve piqued your interest at all you should certainly read the paper itself, as it goes more into detail about everything we’ve talked about and more (design decisions, human factors, and so on). It also talks a little bit more about the various Chubby instances, such as how there typically is one five-replica instance for each data center and one five-replica instance spread out across the world (the “global cell”), ensuring high availability and relatively quick access speeds no matter where you are on the planet.

Something to end on today is to note the recurring theme of leader election in distributed systems. It came up originally in the GFS paper, showed up again in BigTable, and once more here in Chubby. If you aren’t familiar with how this is done, then stay tuned for next time, when we shall look at how Google solved this problem using a well known solution (the Paxos algorithm) and the problems they ran into.

Thoughts on BigTable

Now that we’ve looked at the Google File System and MapReduce, we come to the next paper on our reading list: BigTable. It comes to us two years after the MapReduce paper and presents an interesting new step in databases. Like MapReduce, it is another notable paper and takes a very familiar programming construct and revamps it to new heights. Let’s see if I can whet your appetite with a spoiler:

Maybe a better name for it is Big Hash Table.

Now that I have your interest, let’s take a step back and have a quick history lesson. To this point, pretty much all databases I’ve run into have fallen into one of three categories:

  1. Relational: If you’ve ever used a database, you likely have used a relational database (MySQL and friends). No need to expand much here.
  2. Hierarchical: Certainly not as popular or well-known as relational databases, but hierarchical databases like LDAP provide a great way to fetch and store your data if you can organize them in a tree-like structure.
  3. Object-oriented: Can’t say I know a whole lot about these. I hear they were quite the rage back in the ’90s, since an object-oriented database was supposed to be a perfect fit for an object-oriented language. But for some reason or another it just didn’t work out, and two new technologies supplanted it that are the new hot thing. Rails has made Object-Relational Mapping (ORM) popular for their users, while C# has Language Integrated Query (LINQ).

So from above we can begin to see a trend: each database type seems to be modeled after some type of data structure. In order, we look like we’ve got a two-dimensional array (more or less), a tree, and an object. The notable exception from the data structures you would learn about in a basic data structure class is thus the hashtable, so BigTable’s entry in the field is not as surprising as people seem to give it credit for. Now, let’s see what makes BigTable stand out.

As we’ve hinted before, BigTable is a relatively large hash table. It’s simple to read from and write to, as you are restricted to standard hashtable operations (get, put, delete). You still have rows and columns, and data items also have a timestamp associated with it, so that versioning is automatic. BigTable then garbage collects older versions as specified by the user (e.g., garbage collect everything but the most recent three versions of all items).

One big difference between BigTable and relational databases is how transactions are handled. Row-level transactions are possible in BigTable, but not any other type of transactions. The paper states that this feature may make it confusing for programmers new to BigTable to use, but that apparently this is a minor obstacle to writing much more scalable code.

The unsung hero of the day that really makes BigTable shine, however, is Chubby. Chubby is a lock-providing service that is highly available and allows many difficult design decisions to be abstracted away from BigTable. When a BigTable server (master or slave) comes up, it reserves a lock through Chubby with its name and path, and renews the lock periodically. This makes it trivial for the master to determine which nodes are up and down, as it can just query Chubby to see who has locks (thus these nodes are up). Since the BigTable master also knows the locations of the slaves, it can trivially determine if Chubby is down (it would be unable to see who has locks in Chubby but still be able to contact them). But since Chubby is the next paper on our reading list, let’s not say too much more about it right now.

Another reason BigTable shines so brightly for Google is the infrastructure it’s been built on, namely the last two papers we’ve looked at. By placing BigTable on the Google File System, the database automagically gains three-copy replication and fault tolerance. Furthermore, since the file system is already geared towards processing huge amounts of data, the database now doesn’t suffer from doing so (in fact it performs even better on it). BigTable adds onto this by allowing certain columns in a table to be split up from other columns, speeding up access time on queries for only those columns. The example given in the paper is that if the table contains web pages, maybe we can put meta-data in one column and the pages contents in another column. Then, clients that only need the page’s meta-data are able to save a lot of time versus querying all the columns.

BigTable also gains substantially from being able to use MapReduce. Now complex operations can be performed against this huge repository of data with many many machines and can be done so reliably (that is, it recovers from failures). This seems to be the primary reason why BigTable doesn’t mention any ability to perform joins or any other standard relational database features (as MapReduce makes it simple enough to do so on their own).

So as always, we end up with a common question: how can I get my hands on this to mess around with? And as usual, you can’t get it, but there are some open-source versions you can fiddle around with. Just asHadoop puts out the Hadoop Distributed File System (open source Google File System) and Hadoop Map Reduce (open source Google Map Reduce), they also put out HBase, an open source implementation of BigTable. It also looks to be the most stable and easy to get going of all the open source BigTable implementations. There’s also HyperTable, but we’ve had orders of magnitude more difficulty getting it up and running than HBase, so I’d recommend HBase over it at this point in their development. And as usual, if you’re up to any kind of work in this area, drop a line and let all of us know!

profile for Chris Bunch at Stack Overflow, Q&A for professional and enthusiast programmers