Introduced as a more understandable alternative to Paxos, Raft achieves consensus by dividing the problem into sub-problems like leader election, log replication, and safety. It is based on the following key components:
- Leader: Handles all client interactions and log appends.
- Followers: Replicate the leader’s log entries.
- Candidates: Servers that can become leaders during elections. It is used by etcd and cockroachdb
Operations
- Leader Election: A candidate requests votes from other servers to become the leader.
- Log Replication: The leader appends entries to its log and replicates them to followers.
- Commitment: Once a majority of followers replicate an entry, it’s considered committed.
Critical parts in Python
Leader Election
Leader election involves nodes voting for a candidate. A node becomes a leader if it gains a majority of votes. Raft resolves conflicts using term numbers and the log state. A term number is incremented whenever a node starts a new election and votes are granted to candidates with the highest term. A node votes once per term, and if two candidates request votes, a node will grant to the candidate with highest term (if terms equal, longer log)
import random
from threading import Lock
class RaftNode:
def __init__(self, node_id, peers):
self.node_id = node_id
self.peers = peers
self.current_term = 0
self.voted_for = None
self.state = "follower" # follower, candidate, leader
self.lock = Lock()
def request_vote(self, term, candidate_id, last_log_index, last_log_term):
with self.lock:
if term < self.current_term:
return False # Ignore outdated terms
# Grant vote if the candidate's log is at least as up-to-date
if (self.voted_for is None or self.voted_for == candidate_id) and self.is_log_up_to_date(last_log_index, last_log_term):
self.voted_for = candidate_id
self.current_term = term
return True
return False
def is_log_up_to_date(self, last_log_index, last_log_term):
# Placeholder for actual log comparison logic
return True
def start_election(self):
with self.lock:
self.state = "candidate"
self.current_term += 1
self.voted_for = self.node_id
votes = 1
# Request votes from peers
for peer in self.peers:
if peer.request_vote(self.current_term, self.node_id, 0, 0): # Simplified log state
votes += 1
if votes > len(self.peers) // 2:
self.become_leader()
def become_leader(self):
with self.lock:
self.state = "leader"
print(f"Node {self.node_id} became the leader for term {self.current_term}")Log Replication
When the leader receives a write request it appends the entry to its write-ahead log (WAL), sends the entry to followers and commits once a quorum of followers acknowledged it
class LogEntry:
def __init__(self, term, command):
self.term = term
self.command = command
class LeaderRaftNode(RaftNode):
def __init__(self, node_id, peers):
super().__init__(node_id, peers)
self.log = []
self.commit_index = -1
self.next_index = {peer.node_id: 0 for peer in peers}
self.match_index = {peer.node_id: -1 for peer in peers}
def append_entry(self, command):
entry = LogEntry(self.current_term, command)
self.log.append(entry)
# Send the entry to all followers
acks = 1 # Leader implicitly acknowledges itself
for peer in self.peers:
if peer.append_entries(self.current_term, self.node_id, len(self.log) - 1, entry):
acks += 1
# Commit if majority acknowledges
if acks > len(self.peers) // 2:
self.commit_entry(len(self.log) - 1)
def commit_entry(self, index):
self.commit_index = index
print(f"Leader committed entry at index {index}: {self.log[index].command}")
def append_entries(self, term, leader_id, prev_log_index, entry):
with self.lock:
if term < self.current_term:
return False # Ignore outdated terms
# Simplified: Always accept the entry (real Raft checks for matching logs)
if len(self.log) > prev_log_index and self.log[prev_log_index].term != term:
return False # Log inconsistency
self.log.append(entry)
return TrueInconsistencies detection and resolution
When the leader sends an AppendEntries RPC, it includes the previous log index (prevLogIndex) and term (prevLogTerm).
- A follower checks:
- Does it have an entry at
prevLogIndex? - Does the term at
prevLogIndexmatchprevLogTerm?
- Does it have an entry at
- If not, the follower rejects the AppendEntries RPC.
If the AppendEntries is rejected, the follower logs is adjusted:
- The leader backtracks (
nextIndex--) until it finds the last matching entry. - It truncates the follower’s log from the point of divergence and appends its own entries.
Once the leader and follower logs match up to prevLogIndex, the leader sends the new entries, which the follower appends to its WAL.
Example: Log Overwriting in Raft
- Leader Sends AppendEntries RPC:
- Leader sends
prevLogIndex=2,prevLogTerm=3. - Follower checks its log:
Log[2].termis2, butprevLogTermis3(mismatch).
- Leader sends
- Follower Rejects: Follower responds with its last known matching index (
1). - Leader Adjusts:
- Leader decrements
nextIndexto1. - Leader sends entries
[E2, E3 (term 3), E4].
- Leader decrements
- Follower Truncates and Appends:
- Follower removes all entries after index
1. - Follower appends
[E2, E3 (term 3), E4].
- Follower removes all entries after index
WAL Truncation in Practice
While the WAL is append-only, truncation is implemented as follows:
-
Log Metadata:
- A metadata file tracks the highest committed index and the last applied index.
- Truncation updates this metadata, marking entries beyond the truncation point as invalid.
-
Physical Deletion (Optional):
- Periodically, invalidated entries may be physically removed for space efficiency, though this is optional.