Amazon's Dynamo

Now that we’ve spent more than enough time looking over Google’s highly scalable infrastructure, let’s turn our attention to an even newer paper, this one from Amazon. Their datastore, dubbed Dynamo, makes for an interesting contrast with Google’s work and raises plenty of questions worth noting. Here’s the executive summary:

Dynamo is built to be highly available, and sacrifices the traditional notion of consistency in order to do so.

Yet the paper itself perhaps gives a better one-liner:

Dynamo can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly.
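
To make that one-liner concrete, here’s a minimal sketch of a consistent-hash ring in which every node keeps the full routing table locally, so any request can be answered in a single hop. The node names, class, and MD5 ring are my own choices for illustration, not details from the paper:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string onto the hash ring (a 128-bit space here)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ZeroHopRing:
    """Toy sketch: every node stores the whole ring, so any node can
    send a request straight to the owner of a key -- no multi-hop lookup."""

    def __init__(self, nodes):
        # Sorted (position, node) pairs make ownership a binary search.
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def route(self, key: str) -> str:
        """The owner is the first node at or after the key's position,
        wrapping around the ring."""
        i = bisect.bisect_right(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ZeroHopRing(["node-a", "node-b", "node-c"])
print(ring.route("cart:alice"))  # routed directly, in a single hop
```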

The paper’s title says that Dynamo is a key-value datastore (basically a hashtable again, like BigTable), but from reading the paper it’s not clear that it sits at that low a level of abstraction. I was originally under the impression that it fully encapsulated the database, but reading it all the way through makes clear that’s not quite true. Let me explain.

Dynamo is really more a piece of middleware that sits on top of other databases and presents the abstraction of a key-value datastore to the user. The paper states that Dynamo can hook into a “Berkeley Database (BDB) Transactional Data Store, BDB Java Edition, MySQL, and an in-memory buffer with persistent backing store.” Different databases are chosen based on the size of the data being stored (roughly, BDB for objects up to around 100 KB, MySQL for anything larger), and this is exposed as a configuration option to the developer, since they are in the best position to know how big the data they’re storing in Dynamo will be. This brings up another interesting thing about Dynamo to contrast with the Google papers we’ve seen:

Dynamo gives more choices to the developer to harness the system better, but at an increased level of complexity.

Now, as a programmer writing my app, I have to consciously think about how big all my objects will be and which database stores them best. And what happens if some of my objects are really small and some are really large? I love giving more power to the programmer, but like a lot of powerful abstractions, I wonder if this is begging for disaster. The paper does eventually state that applications should keep objects under 1 MB, but that still doesn’t address having lots of data under 100 KB alongside lots of data between 100 KB and 1 MB.
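
As a hypothetical sketch of the burden this puts on the programmer, suppose an app with very mixed object sizes had to run two Dynamo instances (one per engine) and route between them itself. All the names and the 100 KB threshold below are my own assumptions, not details from the paper:

```python
# Hypothetical sketch: the persistence engine is a per-instance config
# knob, so mixed object sizes may mean running two instances and doing
# the routing yourself. Names and threshold are invented.

SMALL_STORE = "dynamo-bdb"     # instance configured with BDB
LARGE_STORE = "dynamo-mysql"   # instance configured with MySQL
SIZE_THRESHOLD_BYTES = 100 * 1024

def pick_instance(value: bytes) -> str:
    """Route each object to the instance whose engine suits its size --
    exactly the kind of bookkeeping Dynamo pushes onto the programmer."""
    return SMALL_STORE if len(value) <= SIZE_THRESHOLD_BYTES else LARGE_STORE

print(pick_instance(b"x" * 1024))        # -> dynamo-bdb
print(pick_instance(b"x" * 512 * 1024))  # -> dynamo-mysql
```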

But back on topic: since Dynamo hands data off to a database underneath, I’d say it really qualifies more as a piece of middleware than as its own database. It seems entirely possible that one could just as easily dump data into Amazon’s Simple Storage Service (S3), since it exposes the same get/put model that Dynamo requires.

The recurring theme in this paper is how to achieve scalable availability. Dynamo achieves this through a notion of “eventual consistency”. Since the system is highly distributed and failures are likely, strongly consistent updates are exceedingly costly. The Paxos algorithm requires a majority of the nodes to be up, and the communication needed for all its rounds is apparently far too expensive for Amazon’s constraints (not explicitly stated, but that seems to be the case). Instead, if a certain number of nodes accept an update and promise to eventually propagate it, the system can survive many more failures and still converge to a consistent state, given enough time.

The big problem with this should be clear: what does “eventual consistency” mean to the programmers using the system? On this, the paper is clear. A program communicates with a “coordinator” that coordinates reads and writes with the various nodes. If the user attempts to read a value and the coordinator gets back different values from the various nodes, it simply packages them all up and hands them to the user, who then decides what to do with them. This is just like before: the user ideally knows what to do with this data better than the system does, but at the cost of increased complexity. It is possible to give the system simple rules and have it “automatically resolve” conflicts, such as returning only the newest write, but the choice of policy is left to the programmer.
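
To make that concrete, here’s a minimal sketch of application-level reconciliation in the spirit of the paper’s shopping-cart example. The data shapes and function name are my own; real Dynamo also returns an opaque version “context” alongside each read:

```python
# Minimal sketch of client-side reconciliation (shopping-cart style).

def merge_carts(divergent_versions):
    """Application-level merge: union the items across every version
    the coordinator returned. Nothing a customer added is lost, though
    an item deleted on one branch can resurface after the merge."""
    merged = set()
    for cart in divergent_versions:
        merged |= cart
    return merged

# Two replicas diverged while a node was unreachable:
v1 = {"book", "dvd"}
v2 = {"book", "socks"}
print(merge_carts([v1, v2]))  # -> {'book', 'dvd', 'socks'}
```

The “automatic” policy mentioned above, by contrast, would just keep the version with the newest timestamp and silently drop the other branch.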

However, it is not clear how a coordinator comes to hold this role in the system. The paper stresses that all nodes should be symmetric and play the same role, in order to survive failures and keep the system simple. So how does one node become a coordinator? The paper states that the user’s program has a preference list of nodes to read from and write to, so it seems the user’s program could itself play the role of coordinator. This would keep in line with the rest of the paper (add complexity on the user’s side to keep things simple elsewhere), but since the paper isn’t explicit, let’s not speculate further. We can, however, add another quote from the paper that compounds the confusion (the context here is how to ensure that if our program writes something, it reads back whatever it wrote):

In particular, since each write usually follows a read operation, the coordinator for a write is chosen to be the node that replied fastest to the previous read operation which is stored in the context information of the request. This optimization enables us to pick the node that has the data that was read by the preceding read operation thereby increasing the chances of getting “read-your-writes” consistency.

So now it seems clear that the coordinator is indeed one of the nodes, but not how it is elected to this status, except perhaps artificially by the user’s program. Maybe the user’s program simply tells a node that it is the coordinator for a write, hands it the write, and lets it do whatever it needs to do. This seems most probable, but it’s not clear what happens if that node dies while it is the coordinator, since it now holds state. Furthermore, the user’s program now holds more state too: it keeps track of who responded fastest to each read operation in order to know whom to contact about writes, which again fits the pattern we’ve seen here.
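
Here’s a toy sketch of what that client-side bookkeeping might look like, assuming hypothetical send_read/send_write transports. None of this is real Dynamo client code; real clients pass an opaque request context instead:

```python
import time

class DynamoClientSketch:
    """Toy model of the bookkeeping described above; names are invented."""

    def __init__(self, preference_list, send_read, send_write):
        self.preference_list = preference_list
        self.send_read = send_read      # assumed: send_read(node, key) -> value
        self.send_write = send_write    # assumed: send_write(node, key, value)
        self.fastest_reader = {}        # key -> node that answered quickest

    def read(self, key):
        """Time each replica's answer and remember the quickest node;
        it becomes the preferred coordinator for the next write."""
        best_node, best_latency, value = None, float("inf"), None
        for node in self.preference_list:
            start = time.monotonic()
            candidate = self.send_read(node, key)
            elapsed = time.monotonic() - start
            if elapsed < best_latency:
                best_node, best_latency, value = node, elapsed, candidate
        self.fastest_reader[key] = best_node
        return value

    def write(self, key, value):
        """Send the write to the node that served the last read: it
        already holds the version we saw, improving the odds of
        read-your-writes consistency."""
        coordinator = self.fastest_reader.get(key, self.preference_list[0])
        self.send_write(coordinator, key, value)
```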

Now that we’ve agreed that Dynamo really acts more as a piece of middleware providing a database abstraction, a good question to ask is “What does it do to give this abstraction?”. On that, the paper is clear:

In addition to the actual data persistence component, the system needs to have scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management.

It also enforces single-key updates only (so its update granularity is similar to BigTable’s and Cassandra’s), uses a gossip-based failure detection and membership algorithm, and keeps replicas synchronized with an anti-entropy scheme based on Merkle trees. It does all this to provide the abstraction of a distributed hash table, just as the quote in the introduction hinted quite some time ago, and a decent chunk of the paper is dedicated to how this hash table is partitioned and all that fun stuff that I’ll skip over here.
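
To show why Merkle trees make replica synchronization cheap, here’s a minimal sketch (mine, not the paper’s scheme): two replicas compare a single root hash, and only when it differs do they walk down the tree to find the divergent key range, instead of exchanging every key:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Hash the leaves (key ranges a replica holds), then pair-and-hash
    upward until one root remains."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node if odd
            level.append(level[-1])
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Matching roots mean the replicas agree; a mismatch triggers a descent
# into the tree to locate the stale range, keeping sync traffic small.
replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"]
print(merkle_root(replica_a) == merkle_root(replica_b))  # False -> resync
```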

What I think is a bit more interesting is how many nodes need to participate in a read or write for it to succeed. Like many other things in Dynamo, you get to choose. Once more there’s much more power in your hands to make Dynamo flexible, but also more of a requirement that you really get to know Dynamo (and mess it up a few times) in order to learn how to choose these numbers. Some notation from the paper is worth explaining before going on: R is the number of nodes required to participate in a successful read operation, and W is the number of nodes required to participate in a successful write operation. N is a bit less clear. Sometimes it refers to the total number of nodes in the system, in other places to how many replicas you want for your data, and in still other places seemingly to both at once. Take, for example, what I’d consider to be the most informative line of the paper (possibly after the bit in the intro):

The common (N,R,W) configuration used by several instances of Dynamo is (3,2,2).

Thus Dynamo gives you three-copy replication (same as BigTable); if you instead read N as total cluster size, it would imply a Dynamo cluster typically has only three nodes in it, which seems far less plausible. A majority of the replicas must participate in a simple read or write operation (here, 2 out of 3). These numbers are customizable, so the paper mentions that a highly writable system can be produced by setting W to 1 (only one node needs to participate for a write to succeed), and likewise a highly readable system by setting R to 1. If R and W are set such that R + W ≤ N, the read and write sets are no longer guaranteed to overlap, consistency becomes weaker, and the issues we talked about earlier can show up.
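
As a quick sanity check on those knobs, here’s a small sketch of the quorum condition (my arithmetic, not code from the paper): a read set and a write set are guaranteed to share at least one replica exactly when R + W > N:

```python
# N is the replication factor; R and W are the read/write minimums.

def overlaps(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees that any read set and any write set share
    at least one replica, so a read sees the latest completed write."""
    return r + w > n

print(overlaps(3, 2, 2))  # True:  Amazon's common (3,2,2) configuration
print(overlaps(3, 1, 1))  # False: fast, but only eventually consistent
print(overlaps(3, 1, 3))  # True:  highly readable; writes touch all replicas
```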

So there’s Dynamo in a nutshell! It’s an interesting alternative to BigTable that shows promise, and it would be very interesting to know how much all this configurability really costs the average programmer at Amazon. And since we always try to mention some open source versions of what you’ve read about here, it would be remiss of me not to mention Project Voldemort and Dynomite. I don’t have any real experience with them, but we may look into them in a little while (no promises, however). Sorry about the long delay since the last post; I’ll be posting soon about what took up my time, so stay tuned!