Rob Colantuoni

November 14, 2013

Tags: Research and Distributed Computing


Looking into Paxos

The paper “Paxos Made Moderately Complex” by Robbert Van Renesse aims to provide the reader with a non-trivial, detail-oriented description of a Paxos real-world implementation. The paper included both pseudo-code and java implementations of the ideas discussed in the paper. Later sections deal with optimizations and liveness of the implementation. Although I’ve found other papers on this topic easier to read and comprehend, this paper had a feeling of completeness that was lacking in other research papers on the topic.

Most of the research papers regarding Paxos up to this point have attempted to explain the algorithm by using conceptual tools such as analogies or reducing the complexity by hiding details on how to accomplish certain tasks. I’ve found that often by hiding these details, the authors leave the reader with more questions regarding the algorithm and the feeling that the algorithm is lacking or is a toy. In this paper, the author does eliminate some optimization to better explain the algorithm, however he dedicates a section to addressing those topics later in the paper after the reader has time to digest the core concepts of Paxos.

A point that stood out regarding the paper was the authors assertion that in most real-world applications multi-Paxos is used. The author points to Zookeeper as a demonstration of that point, however the developers of Zookeeper assert that their version of Paxos is not multi-paxos (src). They point to the fact that they don’t have a method for guaranteeing a single leader. They rely on the leader election process to deal with the fact that there might be more than one member that believes itself to be the leader.

One of the main benefits of multi-Paxos is that given the assignment of a single leader, we can optimize the Paxos protocol by eliminating multiple calls in prepare phase. This then brings up the question of whether Zookeeper can maintain this optimization given that they don’t guarantee single leader. It turns out that Zookeeper makes some assumptions about the communication channel (that it is FIFO) and implements an auxiliary method of maintaining a quorum of members that recognize a single member as the leader (src). The combination of these two additional protocol changes allows Zookeeper to also skip phase 1 and maintain the same optimizations that are present in multi-Paxos.

This paper brought to light some misconceptions I had regarding Paxos – especially regarding synchrony and failure situations.

One of my take-aways from the paper was that the paxos protocol doesn’t guarantee that members have similar state at at the same time in a synchronous sense, but rather that they will follow the same state changes in a guaranteed order. Given enough delay on the wire, it’s possible that a member may be days behind other members, but that it’s state given a past slot will be identical to that of other members. This led me to question whether the Paxos protocol by it’s nature had built-in snapshot capabilities. It seems a given that you can query across all Paxos members using any slot number s where s < current s. The slot number acts as a logical clock for the entire Paxos system, so partitioning across the system using the slot number is consistent per the requirements. I started looking at how Zookeeper deals with snapshots and they refer to these as “fuzzy snapshots” (src).

Given this in-protocol ability to create consistent snapshots, the section on garbage collection did not make sense to me. It would seem that rather than having all members maintain the state for past slots, recovering nodes would immediately start caching slot decisions and then initiate their state using a snapshot from another member where the snapshot slot was one less than the last received update. This would require that only the recovering node have a small cache of decisions initially while it completes the state replication and then it would apply the cached decisions to that restored state. This would make garbage collection easier, as once the decision has been applied the member can discard it.

The final note I had regarding the paper was the mention of the AIMD (Additive Increase, Multiplicative Decrease) timeouts in failure detection. As mentioned briefly in the paper, TCP uses this method to increase window size with each successive packet send. This increases throughput while maintaining a reasonable recovery time in the event of failure and re-transmission.

What I found interesting about this approach is that the system will most likely be using TCP/IP for it’s transport method. In my previous research about Zookeeper, it became obvious that they were relying on the TCP guarantees of in-order delivery, re-transmission, and failure detection while implementing their system. My ultimate question is where do we draw the line when making assumptions about failure detection and the communication channels? If the authors goal is to describe Paxos in a real-world application, should we not leverage the communication channel session failure detection as a drop-in replacement for the scheme described in the paper? [Other papers in this area] ( have acknowledged that TCP can be used to provide scheduling, flow control and re-transmission of messages but I have not been able to work through the problem in it’s entirety yet to say if this guarantees liveness or is a pragmatic short-cut in light of real-world implementation issues.

In summary, this paper was very helpful and led me to further research that helped clarify the state of Paxos implementations in modern computer science. The amazing variety of methods used to optimize Paxos for industrial applications provides fertile ground for alternate research and improvements to existing solutions.