Tikv source code analysis series — how to use raft

Time: 2021-11-23


Summary

This document is mainly aimed at TiKV community developers. It introduces the system architecture, source code structure, and major processes of TiKV, so that after reading it, developers can gain a preliminary understanding of the project and participate in its development more easily.

It should be noted that TiKV is written in Rust, so readers need a general understanding of the Rust language. In addition, this document will not cover TiKV's central scheduling service, Placement Driver (PD), in detail, but it will explain some important processes through which TiKV interacts with PD.

TiKV is a distributed KV system. It uses the Raft protocol to ensure strong data consistency, and it uses MVCC + 2PC to support distributed transactions.

Architecture

The overall architecture of TiKV is relatively simple, as follows:

[Figure: overall architecture of TiKV]

Placement Driver: the Placement Driver (PD) is responsible for the management and scheduling of the whole cluster.

Node: a Node can be considered an actual physical machine; each Node is responsible for one or more Stores.

Store: a Store uses RocksDB for actual data storage. Usually one Store corresponds to one hard disk.

Region: a Region is the smallest unit of data movement, corresponding to an actual data range within a Store. Each Region has multiple replicas, each replica resides in a different Store, and these replicas together form a Raft group.

Raft

TiKV uses the Raft algorithm to achieve strong data consistency in a distributed environment. For Raft, please refer to the paper "In Search of an Understandable Consensus Algorithm" and the official website; it is not explained in detail here. Simply put, Raft is a replicated log + state machine model. We can only write through the leader, which replicates commands to the followers in the form of a log. When a majority of the nodes in the cluster have received a log entry, we consider it committed, and it can be applied to the state machine.

TiKV's Raft implementation is ported from etcd Raft and supports all of its features, including:

  • Leader election

  • Log replication

  • Log compaction

  • Membership changes

  • Leader transfer

  • Linearizable / Lease read

It should be noted that the handling of membership changes in TiKV and etcd differs slightly from the Raft paper: a membership change takes effect only when the log is applied. The main purpose is to simplify the implementation, but it carries a risk. If we have only two nodes and need to remove one of them, and a follower has not yet received the ConfChange log entry when the leader goes down, the cluster cannot recover and stops working. Therefore, we usually recommend that users deploy an odd number of nodes, three or more.
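The two-node risk above comes down to quorum arithmetic: a Raft cluster of n voting members needs a majority of n/2 + 1 to make progress. A minimal sketch (illustration only, not TiKV code):

```rust
// Majority quorum for a Raft cluster of `n` voting members.
fn quorum(n: u64) -> u64 {
    n / 2 + 1
}

fn main() {
    // With 2 nodes the quorum is 2: losing either node halts the cluster,
    // so a ConfChange that one follower never received cannot be recovered.
    assert_eq!(quorum(2), 2);
    // With 3 nodes the quorum is still 2: the cluster tolerates one failure,
    // which is why deploying 3 or more (odd) nodes is recommended.
    assert_eq!(quorum(3), 2);
    assert_eq!(quorum(5), 3);
}
```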

The Raft library is an independent library; users can easily embed it directly into their own applications and only need to handle storage and message sending themselves. Here is a brief introduction to how to use Raft. The code is under src/raft in the TiKV source directory.

Storage

First, we need to define our own Storage, which is mainly used to store Raft-related data. The trait is defined as follows:

pub trait Storage {
    /// Returns the initial RaftState, read when the node starts.
    fn initial_state(&self) -> Result<RaftState>;
    /// Entries in the half-open range [low, high); total size capped by max_size.
    fn entries(&self, low: u64, high: u64, max_size: u64) -> Result<Vec<Entry>>;
    /// The term of the entry at the given index.
    fn term(&self, idx: u64) -> Result<u64>;
    /// Index of the first available log entry.
    fn first_index(&self) -> Result<u64>;
    /// Index of the last log entry.
    fn last_index(&self) -> Result<u64>;
    /// A snapshot of the current state.
    fn snapshot(&self) -> Result<Snapshot>;
}

We need to implement this Storage trait ourselves. The meaning of each interface is explained in detail below:

initial_state: called when the Raft Storage is initialized; it returns a RaftState. RaftState is defined as follows:

pub struct RaftState {
    pub hard_state: HardState,
    pub conf_state: ConfState,
}

HardState and ConfState are protobuf messages, defined as:

message HardState {
    optional uint64 term   = 1; 
    optional uint64 vote   = 2; 
    optional uint64 commit = 3; 
}

message ConfState {
    repeated uint64 nodes = 1;
}

HardState holds the Raft node's last saved term, the node it voted for last, and the committed log index. ConfState stores the IDs of all nodes in the Raft cluster.

When calling Raft-related logic from outside, users need to handle the persistence of RaftState themselves.

entries: gets the Raft log entries in the [low, high) range; max_size limits the total size of the returned entries.

term, first_index, and last_index: return the term of the entry at a given index, and the first and last log index, respectively.

snapshot: gets a snapshot of the current Storage. Sometimes the data volume is large and generating a snapshot takes a long time, so we may have to generate it asynchronously in another thread to avoid blocking the current Raft thread. In that case, a SnapshotTemporarilyUnavailable error can be returned; Raft then knows the snapshot is being prepared and will retry after a while.

Note that the Storage interface above covers only what the Raft library requires. In practice we will also use this Storage to store Raft logs and other data, so we need to provide other interfaces separately. In raft/storage.rs we provide a MemStorage for testing; you can refer to MemStorage to implement your own Storage.
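To make the trait concrete, here is a minimal in-memory sketch in the spirit of MemStorage. The Entry and error types below are simplified stand-ins for the real raft-rs types; they only illustrate the index bookkeeping and the max_size cap described above:

```rust
// Simplified stand-ins for the real raft-rs types, for illustration only.
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    term: u64,
    index: u64,
    data: Vec<u8>,
}

#[derive(Debug, PartialEq)]
enum StorageError {
    // The requested index was compacted away or does not exist yet.
    Unavailable,
}

// A minimal in-memory log store in the spirit of MemStorage.
struct MemStorage {
    entries: Vec<Entry>,
}

impl MemStorage {
    fn new() -> Self {
        MemStorage { entries: Vec::new() }
    }

    fn append(&mut self, ents: &[Entry]) {
        self.entries.extend_from_slice(ents);
    }

    fn first_index(&self) -> u64 {
        self.entries.first().map_or(1, |e| e.index)
    }

    fn last_index(&self) -> u64 {
        self.entries.last().map_or(0, |e| e.index)
    }

    // Entries in the half-open range [low, high); the total byte size of
    // the returned entries is capped by max_size, but at least one entry
    // is always returned.
    fn entries(&self, low: u64, high: u64, max_size: u64) -> Result<Vec<Entry>, StorageError> {
        if low < self.first_index() || high > self.last_index() + 1 {
            return Err(StorageError::Unavailable);
        }
        let offset = self.first_index();
        let mut total = 0u64;
        let mut out = Vec::new();
        for e in &self.entries[(low - offset) as usize..(high - offset) as usize] {
            total += e.data.len() as u64;
            if !out.is_empty() && total > max_size {
                break;
            }
            out.push(e.clone());
        }
        Ok(out)
    }

    // Term of the entry at `idx`.
    fn term(&self, idx: u64) -> Result<u64, StorageError> {
        let offset = self.first_index();
        self.entries
            .get(idx.wrapping_sub(offset) as usize)
            .map(|e| e.term)
            .ok_or(StorageError::Unavailable)
    }
}

fn main() {
    let mut s = MemStorage::new();
    let e = |term, index| Entry { term, index, data: vec![0; 8] };
    s.append(&[e(1, 1), e(1, 2), e(2, 3)]);
    assert_eq!(s.first_index(), 1);
    assert_eq!(s.last_index(), 3);
    // [1, 3) returns entries 1 and 2.
    assert_eq!(s.entries(1, 3, 1024).unwrap().len(), 2);
    // With max_size 8, only the first 8-byte entry fits.
    assert_eq!(s.entries(1, 4, 8).unwrap().len(), 1);
    assert_eq!(s.term(3).unwrap(), 2);
}
```

A real implementation would also persist the HardState and handle snapshots; this sketch shows only the log portion.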

Config

Before using Raft, we need to know some of its configuration items, which are defined in Config. Only the items that need attention are listed here:

pub struct Config {
    pub id: u64,
    pub election_tick: usize,
    pub heartbeat_tick: usize,
    pub applied: u64,
    pub max_size_per_msg: u64,
    pub max_inflight_msgs: usize,
}

id: the unique ID of the Raft node. IDs must not be duplicated within a Raft cluster. In TiKV, the ID is made globally unique through PD.

election_tick: if a follower does not receive a message from the leader within election_tick ticks, it starts a new election. TiKV uses 50 by default.

heartbeat_tick: the leader sends a heartbeat message to the followers every heartbeat_tick ticks. The default is 10.

applied: the last log index that has already been applied.

max_size_per_msg: limits the maximum size of each message sent. The default is 1MB.

max_inflight_msgs: limits the maximum number of in-flight messages during replication. The default is 256.
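Putting the values above together, a Config might be built as follows. The struct is the one shown earlier in this section (the real raft-rs Config has more fields), and the values are the defaults quoted above:

```rust
// The Config struct as shown above (simplified; the real raft-rs Config
// has additional fields).
pub struct Config {
    pub id: u64,
    pub election_tick: usize,
    pub heartbeat_tick: usize,
    pub applied: u64,
    pub max_size_per_msg: u64,
    pub max_inflight_msgs: usize,
}

fn main() {
    let cfg = Config {
        id: 1, // must be unique in the cluster; TiKV allocates it via PD
        election_tick: 50,             // TiKV default
        heartbeat_tick: 10,            // TiKV default
        applied: 0,                    // fresh node: nothing applied yet
        max_size_per_msg: 1024 * 1024, // 1MB
        max_inflight_msgs: 256,
    };
    // The heartbeat interval must be well below the election timeout,
    // otherwise followers would keep starting elections.
    assert!(cfg.heartbeat_tick < cfg.election_tick);
}
```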

The meaning of tick deserves a detailed explanation here. TiKV's Raft is timer-driven: suppose we call the Raft tick every 100ms; then after heartbeat_tick ticks have elapsed, the leader sends a heartbeat to the followers.
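The tick model can be illustrated with a toy counter (illustration only, not the real raft-rs internals): the outer loop drives the node once per tick, and once heartbeat_tick ticks have elapsed, a heartbeat goes out:

```rust
// Toy model of tick-driven heartbeats, for illustration only.
struct Leader {
    heartbeat_tick: u32,
    elapsed: u32,
    heartbeats_sent: u32,
}

impl Leader {
    fn new(heartbeat_tick: u32) -> Self {
        Leader { heartbeat_tick, elapsed: 0, heartbeats_sent: 0 }
    }

    // Called by the outer loop, e.g. every 100ms in TiKV.
    fn tick(&mut self) {
        self.elapsed += 1;
        if self.elapsed >= self.heartbeat_tick {
            self.elapsed = 0;
            // In real Raft this would broadcast a heartbeat message.
            self.heartbeats_sent += 1;
        }
    }
}

fn main() {
    // heartbeat_tick = 10 with a 100ms tick interval means a heartbeat
    // roughly every second.
    let mut leader = Leader::new(10);
    for _ in 0..25 {
        leader.tick();
    }
    // Heartbeats fired on the 10th and 20th tick.
    assert_eq!(leader.heartbeats_sent, 2);
}
```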

RawNode

We use Raft through RawNode. The constructor of RawNode is as follows:

pub fn new(config: &Config, store: T, peers: &[Peer]) -> Result<RawNode<T>> 

We need to define the Raft Config and pass in an implemented Storage. The peers parameter is only used for testing and should actually be empty. After creating the RawNode object, we can use Raft. We focus on the following functions:

tick: we drive Raft periodically with the tick function. In TiKV, tick is called every 100ms.

propose: the leader writes a command sent by the client into the Raft log through propose and replicates it to other nodes.

propose_conf_change: similar to propose, but used specifically to handle ConfChange commands.

step: when a node receives a message from another node, it actively calls step to drive Raft.

has_ready: used to check whether the node has a pending Ready to process.

ready: gets the current Ready of the node. We first use has_ready to check whether the RawNode has a Ready to process.

apply_conf_change: when a ConfChange log has been applied successfully, apply_conf_change must be actively called to drive Raft.

advance: tells Raft that processing of the Ready is finished, so it can begin the next iteration.

For RawNode, we focus on the concept of Ready, which is defined as follows:

pub struct Ready {
    pub ss: Option<SoftState>,
    pub hs: Option<HardState>,
    pub entries: Vec<Entry>,
    pub snapshot: Snapshot,
    pub committed_entries: Vec<Entry>,
    pub messages: Vec<Message>,
}

ss: if the SoftState changes, for example the node's role or the leader changes, ss will not be empty.

hs: if the HardState changes, for example a new vote or a term increase, hs will not be empty.

entries: these need to be saved to Storage before the messages are sent.

snapshot: if the snapshot is not empty, it needs to be stored in storage.

committed_entries: Raft logs that have been committed and can be applied to the state machine.

messages: messages to be sent to other nodes. Usually they can only be sent after the entries have been saved successfully, but for the leader, messages can be sent first and entries saved afterward. This is an optimization mentioned in the Raft paper, and TiKV adopts it too.

When we find from outside that a RawNode is ready, we get its Ready and process it as follows:

  1. Persist ss and hs if they are non-empty.

  2. If the node is the leader, send the messages first.

  3. If the snapshot is not empty, save it to Storage and apply the data in the snapshot to the state machine asynchronously (synchronous application is also possible, but the snapshot is usually large, and applying it synchronously would block the thread).

  4. Save the entries to Storage.

  5. If the node is a follower, send the messages.

  6. Apply the committed_entries to the state machine.

  7. Call advance to tell Raft that the Ready has been processed.

(to be continued…)