Introduced as a more understandable alternative to Paxos, Raft achieves consensus by dividing the problem into sub-problems like leader election, log replication, and safety. It is based on the following key components:

  • Leader: Handles all client interactions and log appends.
  • Followers: Replicate the leader’s log entries.
  • Candidates: Servers that can become leaders during elections. It is used by etcd and cockroachdb

Operations

  1. Leader Election: A candidate requests votes from other servers to become the leader.
  2. Log Replication: The leader appends entries to its log and replicates them to followers.
  3. Commitment: Once a majority of followers replicate an entry, it’s considered committed.

Critical parts in Python

Leader Election

Leader election involves nodes voting for a candidate. A node becomes a leader if it gains a majority of votes. Raft resolves conflicts using term numbers and the log state. A term number is incremented whenever a node starts a new election and votes are granted to candidates with the highest term. A node votes once per term, and if two candidates request votes, a node will grant to the candidate with highest term (if terms equal, longer log)

import random
from threading import Lock
 
class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers
        self.current_term = 0
        self.voted_for = None
        self.state = "follower"  # follower, candidate, leader
        self.lock = Lock()
 
    def request_vote(self, term, candidate_id, last_log_index, last_log_term):
        with self.lock:
            if term < self.current_term:
                return False  # Ignore outdated terms
            
            # Grant vote if the candidate's log is at least as up-to-date
            if (self.voted_for is None or self.voted_for == candidate_id) and self.is_log_up_to_date(last_log_index, last_log_term):
                self.voted_for = candidate_id
                self.current_term = term
                return True
            return False
 
    def is_log_up_to_date(self, last_log_index, last_log_term):
        # Placeholder for actual log comparison logic
        return True
 
    def start_election(self):
        with self.lock:
            self.state = "candidate"
            self.current_term += 1
            self.voted_for = self.node_id
            votes = 1
 
        # Request votes from peers
        for peer in self.peers:
            if peer.request_vote(self.current_term, self.node_id, 0, 0):  # Simplified log state
                votes += 1
 
        if votes > len(self.peers) // 2:
            self.become_leader()
 
    def become_leader(self):
        with self.lock:
            self.state = "leader"
            print(f"Node {self.node_id} became the leader for term {self.current_term}")

Log Replication

When the leader receives a write request it appends the entry to its write-ahead log (WAL), sends the entry to followers and commits once a quorum of followers acknowledged it

class LogEntry:
    def __init__(self, term, command):
        self.term = term
        self.command = command
 
class LeaderRaftNode(RaftNode):
    def __init__(self, node_id, peers):
        super().__init__(node_id, peers)
        self.log = []
        self.commit_index = -1
        self.next_index = {peer.node_id: 0 for peer in peers}
        self.match_index = {peer.node_id: -1 for peer in peers}
 
    def append_entry(self, command):
        entry = LogEntry(self.current_term, command)
        self.log.append(entry)
 
        # Send the entry to all followers
        acks = 1  # Leader implicitly acknowledges itself
        for peer in self.peers:
            if peer.append_entries(self.current_term, self.node_id, len(self.log) - 1, entry):
                acks += 1
 
        # Commit if majority acknowledges
        if acks > len(self.peers) // 2:
            self.commit_entry(len(self.log) - 1)
 
    def commit_entry(self, index):
        self.commit_index = index
        print(f"Leader committed entry at index {index}: {self.log[index].command}")
 
    def append_entries(self, term, leader_id, prev_log_index, entry):
        with self.lock:
            if term < self.current_term:
                return False  # Ignore outdated terms
 
            # Simplified: Always accept the entry (real Raft checks for matching logs)
            if len(self.log) > prev_log_index and self.log[prev_log_index].term != term:
                return False  # Log inconsistency
 
            self.log.append(entry)
            return True

Inconsistencies detection and resolution

When the leader sends an AppendEntries RPC, it includes the previous log index (prevLogIndex) and term (prevLogTerm).

  • A follower checks:
    • Does it have an entry at prevLogIndex?
    • Does the term at prevLogIndex match prevLogTerm?
  • If not, the follower rejects the AppendEntries RPC.

If the AppendEntries is rejected, the follower logs is adjusted:

  • The leader backtracks (nextIndex--) until it finds the last matching entry.
  • It truncates the follower’s log from the point of divergence and appends its own entries.

Once the leader and follower logs match up to prevLogIndex, the leader sends the new entries, which the follower appends to its WAL.

Example: Log Overwriting in Raft

  1. Leader Sends AppendEntries RPC:
    • Leader sends prevLogIndex=2, prevLogTerm=3.
    • Follower checks its log:
      • Log[2].term is 2, but prevLogTerm is 3 (mismatch).
  2. Follower Rejects: Follower responds with its last known matching index (1).
  3. Leader Adjusts:
    • Leader decrements nextIndex to 1.
    • Leader sends entries [E2, E3 (term 3), E4].
  4. Follower Truncates and Appends:
    • Follower removes all entries after index 1.
    • Follower appends [E2, E3 (term 3), E4].

WAL Truncation in Practice

While the WAL is append-only, truncation is implemented as follows:

  1. Log Metadata:

    • A metadata file tracks the highest committed index and the last applied index.
    • Truncation updates this metadata, marking entries beyond the truncation point as invalid.
  2. Physical Deletion (Optional):

    • Periodically, invalidated entries may be physically removed for space efficiency, though this is optional.