Seeing the raft for the trees
The Raft distributed consensus protocol is normally seen as having an array of log entries. There is however a generalisation of Raft which replaces the array of log entries with a tree of nodes, and then (standard) Raft can be seen as a very simple tree that contains no branches. The resultant protocol can be called Tree Raft, and is very similar in spirit to Chained Raft. This post describes Tree Raft and contrasts it to standard Raft.
In standard Raft, the log is an array, indexed by uint64_t index
, with each array element being a log entry:
struct LogEntry {
uint64_t term;
string contents;
};
In Tree Raft, the log is a tree. The tree can be stored in a map/dictionary whose key type is NodeRef
and value type is Node
:
struct NodeRef {
uint64_t index;
uint64_t term;
};
struct Node {
string contents;
NodeRef parent;
};
There are two restrictions on the parent
field: if Node n
has key (i, t)
then n.parent.index == i - 1
and n.parent.term <= t
. This first restriction allows an optimisation: parent.index
does not need to be stored; only parent.term
needs to be stored.
There are two obvious orderings that can be defined on NodeRef
: by ascending index then ascending term (lexicographic), and by ascending term then ascending index (transposed lexicographic). Both orderings turn out to be useful in different places. Note that a node comes after its parent in both orderings.
As an optimisation, a Node
and (some number of) its transitive parents can be represented as an array of LogEntry
plus the parent term of the oldest Node
. This optimised representation is the only representation in standard Raft, and it appears twice: the per-server log takes this form (with the parent term of the oldest Node
being lastTerm
in the InstallSnapshot RPC), and the AppendEntries RPC takes this form (with the parent term of the oldest Node
being prevLogTerm
). A Tree Raft implementation could use this optimised form for committed nodes, and only use the more expensive map/dictionary representation for uncommitted nodes.
Note that a NodeRef
identifies a single Node
, but because every node identifies its parent, a NodeRef
implicitly identifies an entire chain of nodes. This is similar to a commit hash in git: a hash identifies a single commit, but a commit references its parent commit(s), and thus a single hash identifies an entire graph of commits leading to that point. Standard Raft describes this as the Log Matching property: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.
A Tree Raft server maintains two cursors into the tree: NodeRef commit
and NodeRef head
. The NodeRef commit
cursor replaces the uint64_t commitIndex
from standard Raft. The NodeRef head
cursor replaces "last log entry" everywhere that it comes up: the RequestVote RPC contains the candidate's NodeRef head
, and other servers are willing to vote for candidates whose NodeRef head
is greater than or equal to theirs (using transposed lexicographic order). The leader is able to create a new node by using the key (head.index + 1, currentTerm)
and then setting its head
to this new node. For every follower, there will be a path in the tree from the follower's NodeRef head
to the leader's NodeRef head
, and one objective of a follower is to move its head
cursor along this path toward the leader's head
cursor (note that the path might involve going up the tree before going back down). Similarly, for every follower, there will be a path in the tree from the follower's NodeRef commit
to the leader's NodeRef commit
, and another objective of a follower is to move its commit
cursor along this path toward the leader's commit
cursor (crucially, this path is always downwards and never backtracks). The leader is responsible for moving its commit
cursor toward its head
cursor when it is safe to do so, and the safety constraint here is very similar to standard Raft: a strict majority of servers must have their head.term
equal to the leader's currentTerm
, and within this majority, the minimum head.index
gives the safe commit point. On every server, the commit
cursor will either be equal to the head
cursor, or be an ancestor of the head
cursor. This permits an optimisation: only the index
of the commit
cursor needs to be stored, as its term
can be reconstructed by finding the ancestor of head
with the given index
. Standard Raft contains this optimisation, as it only stores uint64_t commitIndex
. It can be seen that standard Raft is just Tree Raft where a server stores the chain of nodes between the root and its head
(a very simple non-branching tree) as its log, and does not store any other nodes.
Tree Raft replaces the AppendEntries RPC with an AddNodes RPC. The prevLogIndex
, prevLogTerm
, entries[]
, and leaderCommit
arguments are replaced by the leader's NodeRef head
, pair<NodeRef, Node> nodes[]
, and the leader's NodeRef commit
. Note that these arguments can be optimised somewhat (only including commit.index
and not commit.term
, using the array of LogEntry
representation for nodes
, not including head
if the last nodes
entry is the head, not including the term of nodes when equal to the leader's currentTerm
, and so forth). After processing the RPC, followers respond with their currentTerm
and their (possibly moved) NodeRef head
. The receiver implementation is:
- If
term < currentTerm
, finish processing. - Add all the
nodes
to your map. - If there is a path from your
NodeRef head
to the leader'sNodeRef head
, set yourhead
to the leader's. - If there is a path from your
NodeRef commit
to the leader'sNodeRef commit
, and the leader'scommit
either equals yourhead
or is an ancestor of yourhead
, set yourcommit
to the leader's (this can be deferred until after the response has been sent, and the movement can be done one step at a time).
Note that if a follower misses an AddNodes RPC (due to packet loss, or leader failure, or late startup), then the nodes added in step 2 might not have a path back to the root; that is, there'll be a node in the map whose parent is not in the map. In this case, steps 3 and 4 will not be able to find a path to the new nodes. In a departure from standard Raft, in Tree Raft, followers are responsible for identifying these missing parent nodes and issuing Replay RPCs (against a randomly chosen server) in an attempt to obtain them. Once obtained, steps 3 and 4 can be retried. This has a number of nice properties: leaders no longer need to track nextIndex
for each follower, the replay workload is shared between followers rather than being yet another task that the leader is responsible for, and the AddNodes RPC can be a UDP multicast to all followers. Furthermore, as nodes are not removed from the map as part of processing an AddNodes RPC, there are slightly nicer forward progress guarantees during leader changeover. As yet another nice property, if implementing the PreVote extension (which is necessary for liveness), then PreVote rejects can carry the (server's latest view of) the leader's currentTerm
and NodeRef head
and NodeRef commit
, and then the receiver of the reject can treat this like an AddNodes
and then replay the missing nodes (note that currentTerm
is monotonic, and within a term, the leader's head
and commit
are also monotonic, hence stale views of this triplet can be arbitrated). Consequently, followers can keep up with the leader even when direct communication between the leader and the follower is not possible, avoiding the need for a long recovery process once direct communication is restored.
As nodes are not removed from the map as part of the AddNodes RPC, it is worth considering when nodes do get removed. A server cannot remove its head
node, nor any ancestor of its head
node (except when moving these nodes into a snapshot), but it is safe to remove any other node at any time. Standard Raft takes a very aggressive policy: when head
moves backward (as part of moving it toward the common parent of the follower's head
and the leader's head
, i.e. moving it along the path in the tree from follower's head
to leader's head
), immediately remove everything that can be removed. A very lax policy is also possible: never remove nodes. A good middle ground is to remove during commitment: when committing a node, remove any other nodes which have the same index
but a different term
, and when removing a node, also remove its children (recursively). This fits very nicely with using the optimised array representation for committed nodes. If taking this middle ground, then it becomes attractive to store the uncommitted nodes in a sorted map, with nodes in lexicographic order, as then nodes with the same index
can be found with a range scan, and the children of a node can also be found with a range scan (find all nodes with index + 1
using range scan, then filter based on their parent.term
).
To efficiently answer "is there a path between node X and node Y" queries, it can be useful for each server to store an additional NodeRef furthest_ancestor
field on every Node
. Initially, furthest_ancestor
is set to parent
, but when it is discovered that n.furthest_ancestor
exists in the map, n.furthest_ancestor
is replaced by lookup(n.furthest_ancestor).furthest_ancestor
(giving a union-find-like behaviour). If two nodes have the same furthest_ancestor
then a path exists between them. As an optimisation, furthest_ancestor
does not need to be stored for committed nodes, as it'll always refer to the root for them. Furthermore, furthest_ancestor
precisely identifies the node which needs to be the target of a Replay RPC.