Introduction
When building large-scale distributed systems, data consistency is one of the challenges we must face. As business grows and system scale expands, ensuring that replicated data remains consistent across multiple nodes has become a critical issue. The Zab consensus protocol was specifically designed to solve this problem, and it provides data consistency guarantees for the distributed coordination service ZooKeeper.
Zab Consensus Protocol
The ZAB protocol is a consensus algorithm based on the Leader and Follower pattern. In the ZAB protocol, servers in the entire cluster are divided into three roles: Leader, Follower, and Observer. The Leader is responsible for handling client requests and synchronizing data to Followers and Observers. Followers receive data synchronization from the Leader and participate in new Leader elections when the Leader fails. Observers do not participate in voting and only receive data synchronization.
ZooKeeper guarantees the consistency of distributed transactions through the Zab protocol. Zab (ZooKeeper Atomic Broadcast, the ZooKeeper atomic broadcast protocol) supports crash recovery. Based on this protocol, ZooKeeper implements a primary-backup system architecture to keep data consistent across replicas in the cluster.
In a ZooKeeper cluster, all client requests are written to the Leader process, which then synchronizes the data to other nodes called Followers. During cluster data synchronization, if a Follower node crashes or the Leader process crashes, the Zab protocol is used to ensure data consistency.
Algorithm Principles
The ZAB protocol is a variant of the atomic broadcast protocol that ensures data consistency in distributed systems. It does so through a series of processes, which fall into the following three stages:
- Election Stage: In the ZAB protocol, nodes in the cluster are divided into Leaders and Followers. When the cluster starts or the Leader fails, an election process is triggered to elect a new Leader.
- Message Broadcast Stage: The Leader node accepts transaction commits, broadcasts new Proposal requests to Follower nodes, collects feedback from each node to decide whether to commit, and uses the Quorum mechanism during this process.
- Crash Recovery Stage: If the Leader node crashes during synchronization, the system enters the crash recovery stage and re-elects a new Leader. This stage also includes data synchronization, which brings the cluster's replicas up to the latest data and maintains consistency.
The consistency guarantee of the ZooKeeper cluster switches between two modes: when the Leader is healthy, the cluster is in normal message broadcast mode; when the Leader becomes unavailable, it enters crash recovery mode. Once the crash recovery stage completes data synchronization, the system re-enters the message broadcast stage.
Zxid
A Zxid is the transaction ID used by the Zab protocol. It is a 64-bit number: the lower 32 bits are a simple monotonically increasing counter that the Leader increments by 1 for each client transaction request, and the upper 32 bits hold the Leader epoch number, which increases every time a new Leader is elected.
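The bit layout above can be sketched in a few lines of Java. The class and method names here are illustrative, not ZooKeeper's own code (ZooKeeper has similar helpers in its internals):

```java
// Illustrative sketch of how a 64-bit zxid packs the leader epoch and
// the per-epoch counter. Names are hypothetical, not ZooKeeper's API.
public final class Zxid {
    // Upper 32 bits: leader epoch; lower 32 bits: transaction counter.
    public static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    public static long epoch(long zxid) {
        return zxid >>> 32;
    }

    public static long counter(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }
}
```

A useful consequence of this layout is that any zxid from a newer epoch compares greater than every zxid from an older epoch, so a plain numeric comparison orders transactions across leader changes.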
Zab Process Analysis
The specific workflow of Zab can be divided into three processes: message broadcast, crash recovery, and data synchronization. Let's analyze each one separately below.
Message Broadcast
In ZooKeeper, all transaction requests are handled by the Leader node, with the other servers acting as Followers. The Leader converts each client transaction request into a transaction Proposal and distributes it to all Followers in the cluster. After broadcasting, the Leader waits for feedback from the Followers; once more than half of the servers have acknowledged, the Leader broadcasts a Commit message to the Followers, confirming that the earlier Proposal should be committed. Writing through the Leader is therefore a two-step operation: first broadcast the transaction Proposal, then broadcast the commit. "More than half" means the number of acknowledging servers is >= N/2 + 1, where N is the total number of voting servers in the cluster (the Leader counts its own acknowledgment).
- After a client's write request arrives, the Leader wraps it into a Proposal transaction and assigns it an incrementing transaction ID, the Zxid. Zxids are monotonically increasing, which guarantees the order of messages;
- The Leader broadcasts this Proposal transaction. The Leader and Follower nodes are decoupled: all communication goes through first-in-first-out (FIFO) message queues. The Leader assigns a separate FIFO queue to each Follower server and places the Proposal into each queue;
- After receiving the Proposal, each Follower node persists it to disk. Once the write is complete, it sends an ACK back to the Leader;
- After the Leader receives ACKs from more than half of the machines, it commits the transaction locally and broadcasts the Commit. After receiving the Commit, the Followers complete their own transaction commits.
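The leader-side half of these steps can be sketched as a small ACK tracker: a minimal illustration of the two-phase broadcast, assuming a hypothetical class layout rather than ZooKeeper's real implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal leader-side sketch of the two-phase broadcast: count ACKs for one
// proposal and commit once a majority of the ensemble has acknowledged it.
// All names here are illustrative, not ZooKeeper's actual classes.
public class BroadcastTracker {
    private final int ensembleSize;            // total voting servers, leader included
    private final Set<Integer> acks = new HashSet<>();
    private boolean committed = false;

    public BroadcastTracker(int ensembleSize) {
        this.ensembleSize = ensembleSize;
        // The leader logs the proposal itself first, so it contributes
        // its own ACK (server id 0 stands for the leader here).
        acks.add(0);
    }

    // Called when a follower's ACK arrives; returns true exactly once,
    // at the moment the proposal reaches quorum and should be committed.
    public boolean onAck(int serverId) {
        acks.add(serverId);
        if (!committed && acks.size() >= ensembleSize / 2 + 1) {
            committed = true;   // at this point the leader broadcasts COMMIT
            return true;
        }
        return false;
    }

    public boolean isCommitted() {
        return committed;
    }
}
```

For a 5-server ensemble the quorum is 3, so the leader's own ACK plus two follower ACKs is enough to trigger the commit.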
Crash Recovery
Message broadcast uses the Quorum mechanism to tolerate Follower node crashes, but what if the Leader node crashes during the broadcast process? This is where the crash recovery supported by the Zab protocol comes in: it ensures that a new Leader is elected when the Leader process crashes, while guaranteeing data integrity. Crash recovery follows the same election process as cluster startup, so all of the following situations enter the crash recovery stage:
- Initializing the cluster, just after startup
- The Leader crashes due to a failure
- The Leader loses support from more than half of the machines and is disconnected from more than half of the nodes in the cluster
The crash recovery mode will start a new round of elections. The newly elected Leader will synchronize with more than half of the Followers to achieve data consistency. Once synchronization with more than half of the machines is complete, the recovery mode is exited and the system re-enters the message broadcast mode. Nodes in Zab have three states, and the node state changes along with the different stages of Zab:
- following: The current node is a Follower, obeying the commands of the Leader node
- leading: The current node is the Leader, responsible for coordinating transactions
- looking: The node is in an election state
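The three states above map directly onto a simple enum. Note that ZooKeeper's own server code also tracks an observing state for Observer nodes, which is omitted here since Observers do not vote:

```java
// The three Zab node states described above, as a plain enum.
// ZooKeeper's own code additionally has an OBSERVING state for Observers.
public enum ServerState {
    LOOKING,    // the node is in an election, looking for a leader
    FOLLOWING,  // follower: accepts proposals and commits from the leader
    LEADING     // leader: coordinates transaction broadcast
}
```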
Let's use a simulated example to understand the crash recovery stage, which is essentially the election process. Suppose the running cluster has five servers, numbered Server1 through Server5, and the current Leader is Server2. If the Leader crashes at some point, Leader election begins.
The election process is as follows:
- Each node changes its state to Looking
In addition to Leaders and Followers, ZooKeeper also has Observer nodes. Observers do not participate in elections. After the Leader crashes, the remaining Follower nodes will change their state to Looking and then begin the Leader election process.
- Each Server node will cast a vote to participate in the election
In the first round of voting, all Servers will vote for themselves, then send their votes to all machines in the cluster. During operation, the Zxid on each server is likely to be different.
- The cluster receives votes from each server and starts processing the votes and election
The process of handling votes is the process of comparing Zxids. Assume that Server3 has the largest Zxid. Server1 judges that Server3 can become the Leader, so Server1 will vote for Server3. The judgment criteria are as follows:
- First, elect the one with the largest epoch
- If the epochs are equal, select the one with the largest zxid
- If both the epoch and zxid are equal, select the one with the largest server ID, i.e. the myid (configured in the myid file in each server's data directory, matching the server.N entries in zoo.cfg)
During the election process, if a node receives more than half of the votes, it becomes the Leader node; otherwise, a new round of voting begins.
- Election successful: the servers change their states
The winning node switches to the leading state and the remaining nodes switch to following, after which the cluster returns to normal operation.
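The three-step comparison rule above can be written out directly. This is a sketch with hypothetical names, not ZooKeeper's actual `FastLeaderElection` code:

```java
// Sketch of the vote-comparison rule: a candidate beats the currently held
// vote if it has a larger epoch, then a larger zxid, then a larger myid.
// Field and method names are illustrative.
public class Vote {
    final long epoch;
    final long zxid;
    final long myid;

    public Vote(long epoch, long zxid, long myid) {
        this.epoch = epoch;
        this.zxid = zxid;
        this.myid = myid;
    }

    // Returns true if `candidate` should replace the vote this server
    // currently holds, per the three-step comparison above.
    public static boolean candidateWins(Vote candidate, Vote current) {
        if (candidate.epoch != current.epoch) {
            return candidate.epoch > current.epoch;   // 1. largest epoch
        }
        if (candidate.zxid != current.zxid) {
            return candidate.zxid > current.zxid;     // 2. largest zxid
        }
        return candidate.myid > current.myid;         // 3. largest myid
    }
}
```

In the example above, Server1 would run exactly this comparison against Server3's vote, see that Server3's zxid is larger, and switch its vote to Server3.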
Data Synchronization
After the election is completed during crash recovery, the next task is data synchronization. During the election process, it has been confirmed through voting that the Leader server is the node with the largest Zxid. The synchronization stage uses the latest Proposal history obtained by the Leader in the previous stage to synchronize all replicas in the cluster.
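One way to picture this stage: the new Leader compares each Follower's last zxid against the range of proposals it still holds, and picks a sync strategy accordingly. The sketch below is a simplified illustration under assumed names; ZooKeeper's real recovery logic (which uses DIFF, TRUNC, and SNAP sync types) is considerably more involved.

```java
// Hedged sketch of how a leader might choose a sync strategy per follower,
// based on the follower's last zxid versus the leader's committed-log range.
// Names are illustrative; ZooKeeper's actual recovery code differs.
public class SyncDecision {
    public enum Mode { DIFF, TRUNC, SNAP }

    // minLogZxid..maxCommittedZxid is the window of proposals the leader
    // still holds in its in-memory committed log.
    public static Mode choose(long followerZxid, long minLogZxid, long maxCommittedZxid) {
        if (followerZxid > maxCommittedZxid) {
            return Mode.TRUNC;  // follower has uncommitted extras: roll them back
        } else if (followerZxid >= minLogZxid) {
            return Mode.DIFF;   // replay only the proposals the follower is missing
        } else {
            return Mode.SNAP;   // too far behind: ship a full snapshot instead
        }
    }
}
```

Once every Follower in the quorum has been brought up to the Leader's committed state, recovery mode ends and the cluster resumes message broadcast.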
Leader election example code:
import java.util.ArrayList;
import java.util.List;

interface Proposal {
    int getZxid();
    void apply();
}

public class ZabLeaderElection {
    private volatile boolean isLeader = false;
    private int leaderId;
    private final List<Proposal> proposals = new ArrayList<>();

    public synchronized void startElection() {
        // Election logic, simplified to directly assume the current node is the Leader
        leaderId = this.hashCode();
        isLeader = true;
        System.out.println("Leader elected: " + leaderId);
    }

    public synchronized boolean isLeader() {
        return isLeader;
    }

    public synchronized void onProposalReceived(Proposal proposal) {
        if (isLeader()) {
            proposals.add(proposal);
            applyProposal(proposal);
        } else {
            // If the current node is not the Leader, forward the proposal to the Leader
            forwardProposalToLeader(proposal);
        }
    }

    private void applyProposal(Proposal proposal) {
        proposal.apply();                   // Apply the proposal to the state machine
        persistProposal(proposal);          // Persist the proposal
        sendProposalToFollowers(proposal);  // Send the proposal to other Follower nodes
    }

    private void forwardProposalToLeader(Proposal proposal) {
        // Logic for forwarding proposals to the Leader
    }

    private void persistProposal(Proposal proposal) {
        // Logic for persisting proposals
    }

    private void sendProposalToFollowers(Proposal proposal) {
        // Logic for sending proposals to Followers
    }
}
Conclusion
The ZAB protocol is the core algorithm for implementing data consistency in ZooKeeper. Through processes such as election, atomic broadcast, and crash recovery, the ZAB protocol ensures data consistency in distributed systems. The ZAB protocol is widely used in scenarios such as distributed locks, service discovery, and configuration management, providing a reliable data consistency guarantee mechanism for distributed systems.
This is a discussion topic separated from the original topic at https://juejin.cn/post/7368294440547647526



