TiKV source code analysis series: Placement Driver


This series of articles is mainly intended for TiKV community developers, focusing on TiKV's system architecture, source code structure, and process analysis. The goal is to give developers a preliminary understanding of the TiKV project after reading, so they can participate in TiKV development more easily.

TiKV is a distributed key-value store. It uses the Raft protocol to guarantee strong data consistency, and MVCC plus two-phase commit (2PC) to support distributed transactions.

This is the third article in the series.


Placement Driver (hereinafter PD) is the global central master node of TiDB. It is responsible for scheduling the whole cluster, generating global IDs, and generating the global timestamp, the TSO. PD also stores the meta information of the whole TiKV cluster and provides routing to clients.

As the central control node, PD is built on etcd, so it automatically gets failover support and we do not need to worry about a single point of failure. Likewise, PD guarantees strong data consistency through etcd's Raft, so data loss is not a concern.

Architecturally, PD obtains all of its data through active reporting from TiKV. PD's scheduling operations on the TiKV cluster are only returned as commands in the responses to the heartbeats TiKV sends, and TiKV then carries them out on its own; PD never sends commands to TiKV on its own initiative. This keeps the design very simple: we can regard PD as a stateless service (of course, PD still persists some information to etcd), and all operations are triggered passively. Even if PD crashes, the newly elected PD leader can serve external requests immediately, without having to consider any previous intermediate state.


Initialization

PD is embedded with etcd, so we usually need to start at least three instances to guarantee data safety. At present, PD supports two cluster bootstrap modes: the static way with `initial-cluster`, and the dynamic way with `join`.

Before continuing, we need to know about etcd's ports. By default, etcd listens on ports 2379 and 2380: 2379 is mainly used by etcd to handle external client requests, while 2380 is used for communication between etcd peers.

Suppose we have three PDs, pd1, pd2 and pd3, running on host1, host2 and host3 respectively.

For static initialization, we set `initial-cluster` to `pd1=http://host1:2380,pd2=http://host2:2380,pd3=http://host3:2380`.

For dynamic initialization, we first start pd1, then start pd2 and join pd1's cluster with `join` set to `http://host1:2379`. Then we start pd3 and join the cluster formed by pd1 and pd2, again with `join` set to `http://host1:2379`.

As you can see, static and dynamic initialization go through completely different ports, and the two modes are mutually exclusive: we can only use one of them to initialize a cluster. Etcd itself only supports `initial-cluster`, but for convenience PD also provides `join`.

`join` mainly uses the member API provided by etcd (member add, member list, etc.), so it uses port 2379, because we need to send commands to etcd for execution. `initial-cluster`, on the other hand, is etcd's own initialization mechanism, so it uses port 2380.
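
To make this concrete, here is a minimal, runnable Go sketch of what `join` does through the member API. The `memberAdder` interface and `fakeCluster` type are illustrative stand-ins for etcd's `clientv3` client (whose `MemberAdd` call `join` relies on), so the sketch runs without a live etcd:

```go
package main

import (
	"context"
	"fmt"
)

// memberAdder abstracts the one etcd clientv3 call that `join` relies on
// (clientv3.Client.MemberAdd). In PD this is a real client connected to an
// existing member through the 2379 client port.
type memberAdder interface {
	MemberAdd(ctx context.Context, peerURLs []string) error
}

// joinCluster sketches what `join` does: it asks an existing member to
// register the new PD's peer URL; the new PD then starts its embedded etcd
// with the updated member list.
func joinCluster(cli memberAdder, peerURL string) error {
	if err := cli.MemberAdd(context.Background(), []string{peerURL}); err != nil {
		return fmt.Errorf("add member %s: %v", peerURL, err)
	}
	return nil
}

// fakeCluster records added members, standing in for a running etcd.
type fakeCluster struct{ peers []string }

func (c *fakeCluster) MemberAdd(_ context.Context, urls []string) error {
	c.peers = append(c.peers, urls...)
	return nil
}

func main() {
	c := &fakeCluster{peers: []string{"http://host1:2380"}}
	fmt.Println(joinCluster(c, "http://host2:2380"), c.peers)
}
```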

Compared with `initial-cluster`, `join` has many more corner cases to consider (see the `prepareJoinCluster` function in `server/join.go`), but it is very natural to use. We may consider removing the `initial-cluster` initialization scheme later.


Leader election

After PD starts, we need to elect a leader to provide external services. Although etcd has its own Raft leader, PD still elects its own; that is, PD's leader is different from the leader of the embedded etcd.

When PD starts, leader election proceeds as follows:

  1. Check whether there is already a leader in the current cluster. If there is one, watch it; when the leader is gone, start again from step 1.

  2. If there is no leader, start campaigning: create a lessor and write the leader information through etcd's transaction mechanism, as follows:

    // Create a lessor.
    ctx, cancel := context.WithTimeout(s.client.Ctx(), requestTimeout)
    leaseResp, err := lessor.Grant(ctx, s.cfg.LeaderLease)
    cancel()

    // The leader key must not exist, so its CreateRevision must be 0.
    resp, err := s.txn().
        If(clientv3.Compare(clientv3.CreateRevision(leaderKey), "=", 0)).
        Then(clientv3.OpPut(leaderKey, s.leaderValue, clientv3.WithLease(clientv3.LeaseID(leaseResp.ID)))).
        Commit()

    If the CreateRevision of the leader key is 0, no other PD has written it yet, so this PD can write its own leader information, attached to the lease. If the transaction fails, another PD has already become leader, and we go back to step 1.

  3. After becoming leader, we periodically keep the lease alive:

    // Keep the leader lease alive.
    ch, err := lessor.KeepAlive(s.client.Ctx(), clientv3.LeaseID(leaseResp.ID))
    if err != nil {
        return errors.Trace(err)
    }

    If the PD crashes, the leader key it wrote is automatically deleted when the lease expires, so the other PDs observe this through their watch and start a new election.

  4. Initialize the Raft cluster. This mainly reloads the cluster's meta information from etcd and synchronizes the latest TSO information:

    // Try to create the raft cluster.
    err = s.createRaftCluster()
    if err != nil {
        return errors.Trace(err)
    }

    log.Debug("sync timestamp for tso")
    if err = s.syncTimestamp(); err != nil {
        return errors.Trace(err)
    }
  5. Once all of this is done, keep updating the TSO periodically, monitor whether the lease has expired, and watch for a voluntary exit:

    for {
        select {
        case _, ok := <-ch:
            if !ok {
                log.Info("keep alive channel is closed")
                return nil
            }
        case <-tsTicker.C:
            if err = s.updateTimestamp(); err != nil {
                return errors.Trace(err)
            }
        case <-s.client.Ctx().Done():
            return errors.New("server closed")
        }
    }


TSO

We mentioned the TSO earlier. The TSO is a global timestamp, and it is the cornerstone of TiDB's distributed transactions. PD must therefore be able to allocate TSOs for transactions quickly and in large volume, while also guaranteeing that allocated TSOs are strictly monotonically increasing and can never fall back.

The TSO is an int64 value composed of a physical part and a logical part. The physical part is the current Unix time in milliseconds, while the logical part is a counter with a maximum of `1 << 18`. In other words, PD can allocate up to 262144 TSOs per millisecond, which is enough for the vast majority of cases.
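
A minimal sketch of this packing, assuming the common layout of the physical part shifted left by 18 bits plus the logical counter (the helper names are illustrative):

```go
package main

import "fmt"

const logicalBits = 18
const maxLogical = 1 << logicalBits // 262144 TSOs per millisecond

// composeTS packs the physical part (Unix milliseconds) and the logical
// counter into a single int64.
func composeTS(physical, logical int64) int64 {
	return physical<<logicalBits + logical
}

func extractPhysical(ts int64) int64 { return ts >> logicalBits }
func extractLogical(ts int64) int64  { return ts & (maxLogical - 1) }

func main() {
	ts := composeTS(1_700_000_000_000, 42)
	fmt.Println(extractPhysical(ts), extractLogical(ts)) // 1700000000000 42
	fmt.Println(maxLogical)                              // 262144
}
```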

For TSO allocation and persistence, PD does the following:

  1. When a PD becomes leader, it fetches the last saved timestamp from etcd. If that persisted time is larger than the local time, the PD waits until the current time passes it:

    last, err := s.loadTimestamp()
    if err != nil {
        return errors.Trace(err)
    }

    var now time.Time
    for {
        now = time.Now()
        if wait := last.Sub(now) + updateTimestampGuard; wait > 0 {
            log.Warnf("wait %v to guarantee valid generated timestamp", wait)
            time.Sleep(wait)
            continue
        }
        break
    }
  2. Before a PD can allocate TSOs, it first claims a maximum time from etcd. For example, if the current time is t1 and the time window claimed each time is 3s, PD saves the value t1 + 3s to etcd. PD can then hand out TSOs within this window purely in memory. When the current time t2 exceeds t1 + 3s, PD saves t2 + 3s to etcd and continues:

    if now.Sub(s.lastSavedTime) >= 0 {
        last := s.lastSavedTime
        save := now.Add(s.cfg.TsoSaveInterval.Duration)
        if err := s.saveTimestamp(save); err != nil {
            return errors.Trace(err)
        }
    }

    The advantage of this approach is that even if PD crashes, the newly started PD begins allocating TSOs only after the last saved maximum time, which is exactly the waiting handled in step 1.

  3. Because PD keeps an allocatable time window in memory, when a TSO is requested from outside, PD can compute the TSO entirely in memory and return it directly:

    resp := pdpb.Timestamp{}
    for i := 0; i < maxRetryCount; i++ {
        current, ok := s.ts.Load().(*atomicObject)
        if !ok {
            log.Errorf("we haven't synced timestamp ok, wait and retry, retry count %d", i)
            time.Sleep(200 * time.Millisecond)
            continue
        }
        resp.Physical = current.physical.UnixNano() / int64(time.Millisecond)
        resp.Logical = atomic.AddInt64(&current.logical, int64(count))
        if resp.Logical >= maxLogical {
            log.Errorf("logical part outside of max logical interval %v, please check ntp time, retry count %d", resp, i)
            continue
        }
        return resp, nil
    }

    Because everything is computed in memory, performance is very high: in our internal tests, PD can allocate millions of TSOs per second.

  4. If the client requested a TSO from PD for every transaction, each RPC would be costly, so the client obtains TSOs from PD in batches. It first collects a batch of transactions waiting for TSOs, say n of them, then sends a single command to PD with the parameter n. On receiving the command, PD generates n TSOs and returns them all to the client.
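
The in-memory computation in step 3 is also what makes the batching in step 4 cheap: one atomic add hands out a whole range of logical values. A simplified, runnable sketch (the names are illustrative, not PD's actual types):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const maxLogical = 1 << 18

// tsWindow is the in-memory state the PD leader serves TSOs from.
type tsWindow struct {
	physical int64 // Unix milliseconds
	logical  int64
}

// getTS serves a batched request for `count` timestamps with a single atomic
// add, mirroring the allocation loop quoted above. The caller owns the
// logical range (last-count, last].
func (w *tsWindow) getTS(count int64) (physical, last int64, ok bool) {
	last = atomic.AddInt64(&w.logical, count)
	if last >= maxLogical {
		// Out of logical values for this millisecond; the real code retries
		// after the background updater advances the physical part.
		return 0, 0, false
	}
	return w.physical, last, true
}

func main() {
	w := &tsWindow{physical: 1_700_000_000_000}
	phys, last, ok := w.getTS(10) // a client batching 10 transactions
	fmt.Println(phys, last, ok)   // 1700000000000 10 true
}
```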


Heartbeat

As we said at the beginning, all of PD's knowledge of the cluster comes from TiKV's active heartbeat reports, and PD's scheduling of TiKV is also delivered during these heartbeats. PD handles two kinds of heartbeat: one sent by each TiKV store itself, and one reported by the leader peer of each region in a store.

For store heartbeats, PD's `handleStoreHeartbeat` function mainly caches the store's current state carried in the heartbeat, such as how many regions the store holds and how many leader peers are on it. This information is used later for scheduling.

Region heartbeats are handled in PD's `handleRegionHeartbeat`. Note that only the leader peer reports its region's information; follower peers do not. On receiving a region heartbeat, PD first puts it into its cache, and if it finds that the region's epoch has changed, it also saves the region's information to etcd. PD then schedules the region as needed, for example adding a peer if there are not enough, or removing a broken one. We will discuss the detailed scheduling implementation later.

Now let's talk about the region epoch. A region's epoch contains `conf_ver` and `version`, which indicate the region's different version states. If a region has a membership change, that is, a peer is added or removed, `conf_ver` is incremented by 1; if the region splits or merges, `version` is incremented by 1.

Both PD and TiKV use the epoch to judge whether a region has changed, and thus to refuse dangerous operations. For example, after a region has split, its `version` becomes 2. If a write request arrives carrying `version` 1, we consider the request stale and reject it directly, because a `version` change means the region's range has changed: the key the stale request wants to operate on may fall in the old range but not in the new one.
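
The staleness check can be sketched as follows; the struct and function names are illustrative, not TiKV's actual definitions:

```go
package main

import "fmt"

// regionEpoch carries the two version counters described above.
type regionEpoch struct {
	confVer int64 // bumped on membership change (add/remove peer)
	version int64 // bumped on split or merge
}

// isEpochStale reports whether a request carrying `req` must be rejected
// because the region has since changed (e.g. split), so the requested key
// may no longer lie in this region's range.
func isEpochStale(req, current regionEpoch) bool {
	return req.version < current.version || req.confVer < current.confVer
}

func main() {
	current := regionEpoch{confVer: 1, version: 2} // the region has split once
	req := regionEpoch{confVer: 1, version: 1}     // issued before the split
	fmt.Println(isEpochStale(req, current))        // true: reject as stale
}
```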

Split / Merge

As we said earlier, PD schedules a region in response to its heartbeat by including the relevant scheduling commands directly in the heartbeat's return value, leaving TiKV to carry them out itself. Through the next heartbeat report, PD learns whether the scheduling succeeded.

Membership changes are comparatively easy because we have a configured maximum number of replicas. Suppose it is three: if a region has only two peers, add a peer; if it has four, remove one. Split and merge are a little more complicated, though still fairly simple. Note that at this stage we only support split; merge is under development and has not been released, so we take split as the example:

  1. In TiKV, the leader peer periodically checks whether the space a region occupies exceeds a threshold. Suppose the target region size is 64MB; if a region grows beyond 96MB, it needs to split.

  2. The leader peer first sends a split request to PD, which handles it in `handleAskSplit`. We split one region into two: one of the new regions inherits all the meta information of the original region, while the other needs new identifiers, such as a region ID and new peer IDs, which PD generates and returns to the leader.

  3. The leader peer proposes a split Raft log, which is executed at apply time; the region is then split into two.

  4. After the split succeeds, TiKV reports it to PD. PD handles the report in `handleReportSplit`, updates the related cache information, and persists it to etcd.
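
Putting the membership rule and the split threshold together, the decisions made from heartbeat data can be sketched like this. It is a simplification: in reality PD decides the peer changes, while TiKV's leader peer checks the split threshold itself; the names and return strings are illustrative:

```go
package main

import "fmt"

const (
	maxReplicas     = 3
	regionSplitSize = 96 << 20 // split once a region exceeds 96MB
)

// schedule sketches the decision made from heartbeat data: fix the replica
// count first, then split oversized regions.
func schedule(peerCount int, regionSize int64) string {
	switch {
	case peerCount < maxReplicas:
		return "add peer"
	case peerCount > maxReplicas:
		return "remove peer"
	case regionSize > regionSplitSize:
		return "split region"
	default:
		return "no-op"
	}
}

func main() {
	fmt.Println(schedule(2, 10<<20))  // add peer
	fmt.Println(schedule(4, 10<<20))  // remove peer
	fmt.Println(schedule(3, 100<<20)) // split region
}
```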


Routing

Because PD holds all of the TiKV cluster's information, it naturally provides routing for clients. Suppose a client wants to write a value to `key`:

  1. The client first asks PD which region `key` belongs to, and PD returns the region's meta information.

  2. The client caches this meta information so it does not have to fetch it from PD every time, then sends the command directly to the region's leader peer.

  3. The region's leader may have moved to another peer, in which case TiKV returns a `NotLeader` error carrying the address of the new leader. The client updates its cache and resends the request to the new leader.

  4. The region's version may also have changed, for example through a split, so that `key` now falls in a new region. The client then receives a `StaleCommand` error, fetches the routing information from PD again, and goes back to step 1.
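
The retry loop above can be sketched in Go. The `pd` and `send` function parameters stand in for the PD routing lookup and the TiKV RPC, and the error values are illustrative:

```go
package main

import "fmt"

// Illustrative errors a TiKV node may return for a misrouted request.
var (
	errNotLeader    = fmt.Errorf("NotLeader")
	errStaleCommand = fmt.Errorf("StaleCommand")
)

type regionInfo struct {
	id     uint64
	leader string // address of the leader peer
}

// put routes a write for `key`, handling NotLeader and StaleCommand exactly
// as steps 1-4 describe. `pd` stands in for the PD routing lookup and `send`
// for an RPC to a TiKV peer, which returns the new leader on NotLeader.
func put(pd func(key string) regionInfo,
	send func(addr, key string) (newLeader string, err error),
	cache map[string]regionInfo, key string) error {
	for i := 0; i < 3; i++ {
		region, ok := cache[key]
		if !ok {
			region = pd(key) // 1. ask PD for the region's meta information
			cache[key] = region
		}
		newLeader, err := send(region.leader, key)
		switch err {
		case nil:
			return nil
		case errNotLeader:
			// 3. the leader moved: update the cache and retry.
			region.leader = newLeader
			cache[key] = region
		case errStaleCommand:
			// 4. the region changed (e.g. split): drop the cache, re-route.
			delete(cache, key)
		default:
			return err
		}
	}
	return fmt.Errorf("retries exhausted")
}

func main() {
	cache := map[string]regionInfo{}
	pd := func(key string) regionInfo { return regionInfo{id: 1, leader: "host1"} }
	leader := "host2" // the actual leader has moved since PD answered
	send := func(addr, key string) (string, error) {
		if addr != leader {
			return leader, errNotLeader
		}
		return "", nil
	}
	fmt.Println(put(pd, send, cache, "k")) // <nil>: retried against the new leader
}
```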


Summary

PD is the central scheduling module of a TiDB cluster. In its design, we try our best to keep it stateless and easy to scale. This article mainly introduced how PD cooperates with TiKV and TiDB. Later, we will describe the core scheduling function in detail, that is, how PD controls the whole cluster.
