Byzantine Reality

Searching for Byzantine failures in the world around us

Thoughts on Paxos

After a month of Google papers, we finally come to the last paper in our Google reading list: Paxos Made Live. Like the Chubby paper from last week, this paper doesn’t present any new research, but instead tells us how to build a system from the current research out there such that it can withstand incredible load. It also aims to show how we could possibly be confident in a non-trivial amount of code as well as optimizations that can be made to improve the performance and reliability of the system. So how did they do and how did it turn out? Read on!

A good starting point in the paper is its concise definition of the Paxos algorithm. As it is about a page long, I’ll omit it here but leave you with a more general description of Paxos, simply that it is an algorithm that assures agreement on a value between a majority of nodes in a distributed system. Of course if you’re interested in the algorithm this paper is an excellent starting point, and the writeup on Wikipedia is a good re-write of it (Basic Paxos).

An inherent downside of this is that since it is a consensus algorithm, the FLP Impossibility Result tells us that liveness can be impossible with a faulty processor (although not guaranteed and the probability of such an event is unknown). Now we have a problem. Liveness is the ability of the distributed system to make “progress”, that is, for new values to be agreed on over time, and now there’s an important theoretical result telling us that we can’t guarantee it if a single node fails. Interestingly, this paper does not state how they resolve the problem, but the Chubby paper does:

Paxos maintains safety without timing assumptions, but clocks must be introduced to ensure liveness; this overcomes the impossibility result of Fischer et al.

So how do clocks solve the problem? The original paper by Fischer et al. tells us the answer: just have the proposed value time-out and fail on all nodes. Since the nodes all have synchronized clocks, this is entirely possible and solves the problem altogether.

Now comes the obvious question: why use Paxos in the first place? Why do they even need it? This comes about because Chubby was using a commercial three-times replicated database but found it had bugs in its replication code, and since Google needs Chubby for extremely high availability, this warranted a rewrite of all the involved code. With Paxos, we can ensure that a distributed system can reach agreement on a value. In this case, the value is a change to the database, and once all the nodes agree on a change (technically, a majority, but other nodes can catch up on it later if they miss it), they can make the change to the database and log it so that snapshots of the database can be taken (for backup and recovery purposes). Now we’ve got a distributed database whose code we can debug (of course, “we” refers to Google and not us) should worse come to worse.

If you did happen to read the Paxos algorithm or the Wikipedia article on Paxos (or even just look at the diagram), you can see that this algorithm runs in a number of phases, and that there’s lots of potential for optimization. A common optimization is that the first step of the algorithm, electing a leader (master) for this run of Paxos, can be omitted if the node proposing the change is already the leader. And just like that we’ve knocked off O(2n) messages (n messages sent out requesting to be the master and potentially n messages received with the response). In fact, citing the paper (footnote, Page 7):

In a loaded system, under one percent of the Paxos instances run the full Paxos algorithm.

This ends up being pretty good for the final performance of the system, but before we can get that far, we have to at least mention the various problems that arise. With Chubby, masters can fail and new masters need to take their place. Paxos allows for this, but a problem encountered was that the intended behavior was to have events issued to old masters fail, while in practice nothing was preventing these events from succeeding (since nothing identified a master’s lifetime). So to enforce the requested behavior, exactly that had to be added in: epoch numbers, which represented a master’s lifetime (analogous to noting that our current president is the 44th, to distinguish him from all others).

Since Chubby is designed to be highly available, proper coding measures were taken to ensure that rolling it out would be as successful as possible. The paper details the numerous problems they ran into (problems upgrading, problems rolling back to an old version, and so on). But more interesting is how they were confident of its functionality. Extensive test suites were constructed to verify that even if faults were injected into the system, that it would remain safe (that is, one node couldn’t get an update unless a majority of them did) and live (over time, nodes will get updates).

The more interesting development note is how they coded Paxos in the first place. They note that they coded core parts of the algorithm as state machines in a specification language and then used a compiler to translate it to C++. This makes it much easier to verify its correctness, as it’s much easier to verify a page of pseudocode-like specifications than many thousands of lines of C++ code. Upon reading this, I thought of one thing: a controversial blog post by Joel Spolsky about Wasabi. Essentially it’s a similar idea: they needed certain features from their language but it just ended up being so mundane to do so that it was unproductive. So they wrote a compiler named Wasabi to give them all the language features they needed and compiled that to the language they needed. Since it was a lot of addons to existing languages, compiling it to the existing language ends up being much easier than writing a whole new language. Just like the Google guys, they’ve both used the right tool for the right job, and got a better product that was easier to program (given some initial development time on the compiler itself).

They follow up that interesting idea with another one: they had a bit of trouble finding where problems were in the system due to the fault-tolerant nature of the system. When a node would die, the fact of it would be hidden from the developers and it became harder to notice without extra logging procedures. Of course, the problem was resolved, but it makes it inherently difficult to address different types of problems as they arise.

That being said, let’s look at how it performs. From page 14 of the paper:

table1

Like last time, these numbers really speak for themselves. It outperforms their original database in every metric they measure it against and gives them the peace of mind they need should something come up down the line (since they also own the code base and are familiar with it).

The paper ends on a bit of an ominous note: they note that the current state of fault-tolerant systems is lacking, to say the least. They make it clear that the community has not thought enough about how testing fault-tolerant applications is done, and that taking these algorithms and modifying them to make them work in practice destroys any hope of correctness and proving useful properties about them in a reasonable amount of time.

This rounds up our tour of the “Big 5″ Google papers. As we’ve seen before, these applications were all designed for a very specialized environment (Google) with very specific needs. Processing huge amounts of data quickly requires a fault-tolerant filesystem (GFS), a database that can be run on it (BigTable), a programming model to use it with (MapReduce), a naming service to tie everything together with (Chubby), and an algorithm to ensure agreement between a set of nodes (Paxos). They’re all at varying degrees of abstraction and made for a fun ride. Of course, it doesn’t work for every situation, and most likely won’t work for yours, but it’s always interesting to see what everyone else is up to and how they pull it all off. We’ll continue next time with another paper from the reading list we looked at a while ago, so stay tuned!