PD scheduling policy best practices


Author: huang menglong

PD is the core of a TiDB cluster: it stores the cluster's meta information and performs load-balancing scheduling for the TiKV cluster. This article explains the principles of the PD scheduling system in detail and, by analyzing several typical scenarios and how to handle them, shares best practices and tuning methods for scheduling policies to help you locate problems quickly. This article is based on version 3.0. Earlier versions (2.x) lack some of the features described here, but the basic principles are similar, so this article can still be used as a reference.

PD scheduling principle


First, let’s introduce the concepts involved in scheduling systems. Understanding these concepts and how they relate to each other can help you quickly locate problems in practice and adjust them through configuration.

  • Store

    The Store in PD refers to the storage node in the cluster, which is the tikv-server instance. Note that Store and TiKV instances are strictly one-to-one, and even if multiple TiKV instances are deployed on the same host or even on the same disk, these instances will correspond to different stores.

  • Region / Peer / Raft Group

    Each Region maintains a contiguous piece of the cluster's data (about 96 MiB on average by default). Each Region stores multiple copies in different Stores (three copies by default), and each copy is called a Peer. The Peers of the same Region synchronize data through the Raft protocol, so Peer also refers to a member of a Raft instance. TiKV uses the multi-Raft mode to manage data: each Region corresponds to an independently running Raft instance, which is also called a Raft Group.

  • Leader / Follower / Learner

    These are the three roles of a Peer. The Leader responds to client read and write requests. Followers passively synchronize data from the Leader, and a new Leader is elected when the Leader fails. A Learner is a special role that only synchronizes the Raft log without voting; in the current implementation it exists only briefly, as an intermediate step of adding a copy.

  • Region Split

    Regions in a TiKV cluster are not divided in advance; they are generated gradually as data is written. The process of splitting is called Region Split.

    When the cluster is initialized, one initial Region is built to cover the whole key space. After that, whenever the data in a Region reaches a certain amount during operation, a new Region is generated through a split.

  • Pending / Down

    Pending and Down are two special states a Peer may be in. Pending means that the Raft log of a Follower or Learner lags far behind that of the Leader; a Follower in the Pending state cannot be elected Leader. Down means that the Leader has not received any message from the Peer for a long time, which usually means that the corresponding node is down or isolated from the network.

  • Scheduler

    The Scheduler is the component in PD that generates scheduling. Each scheduler in PD runs independently and serves a different scheduling purpose. Commonly used schedulers and their purposes are:

    • balance-leader-scheduler: keeps Leaders balanced across nodes.
    • balance-region-scheduler: keeps Peers balanced across nodes.
    • hot-region-scheduler: keeps Regions with read or write hot spots balanced across nodes.
    • evict-leader-{store-id}: evicts all Leaders from a node (often used for rolling upgrades).
  • Operator

    Operator is a set of operations that apply to a Region and serve a scheduling purpose. For example, “migrate the Leader of Region 2 to Store 5”, “migrate the copy of Region 2 to Store 1, 4, 5”, etc.

    Operators can be generated by a Scheduler through computation or created through external APIs.

  • Operator Step

    Operator Step is a Step in the Operator execution process, and one Operator often contains more than one Operator Step.

    Currently, steps generated by PD include:

    • TransferLeader: migrates the Region Leader to the specified Peer
    • AddPeer: adds a Follower on the specified Store
    • RemovePeer: removes a Peer of the Region
    • AddLearner: adds a Learner of the Region on the specified Store
    • PromoteLearner: promotes the specified Learner to Follower
    • SplitRegion: splits the specified Region in two
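
To make the relationship between an Operator and its Operator Steps more concrete, here is a minimal Python sketch that models "move the copy of Region 2 from Store 3 to Store 5" as a sequence of steps. The class names, fields, and the helper move_peer are hypothetical and are not PD's actual types; the step order is an assumption based on the description of Learner above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    kind: str      # one of the step names listed above, e.g. "AddLearner"
    store_id: int  # Store the step applies to


@dataclass
class Operator:
    desc: str
    region_id: int
    steps: List[Step] = field(default_factory=list)


def move_peer(region_id: int, from_store: int, to_store: int) -> Operator:
    """Hypothetical helper: replace the peer on from_store with one on to_store."""
    return Operator(
        desc="move-peer",
        region_id=region_id,
        steps=[
            Step("AddLearner", to_store),      # the new copy joins as a Learner first
            Step("PromoteLearner", to_store),  # then it becomes a Follower
            Step("RemovePeer", from_store),    # finally the old copy is removed
        ],
    )


op = move_peer(region_id=2, from_store=3, to_store=5)
print([step.kind for step in op.steps])
```

One Operator therefore carries several steps that must run in order, which is why step-level execution time matters when diagnosing slow scheduling.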

Scheduling process

From a macro perspective, the scheduling process can be roughly divided into three parts:

1. Information collection

TiKV nodes periodically report two kinds of heartbeat messages to PD: StoreHeartbeat and RegionHeartbeat. StoreHeartbeat contains basic Store information, capacity, remaining space, read and write traffic, and so on. RegionHeartbeat contains the Region's range, copy distribution, copy status, data volume, read and write traffic, and other data. PD organizes this information and passes it to the schedulers for decision making.

2. Generate scheduling

Each scheduler generates the Operators to be executed based on its own logic and requirements, after considering various limitations and constraints. These include but are not limited to:

  • Do not add copies to Stores that are disconnected, offline, busy, short of space, or in other abnormal states such as sending or receiving a large number of snapshots
  • Do not select Regions in an abnormal state for balancing
  • Do not transfer the Leader to a Pending Peer
  • Do not remove the Leader directly
  • Do not break the physical isolation of Region copies
  • Do not break constraints such as Label property

3. Execute scheduling

A generated Operator does not start executing immediately. It first enters a wait queue managed by the OperatorController. According to its configuration, the OperatorController takes Operators out of the wait queue with a certain concurrency and executes them, which means sending each Operator Step in sequence to the Leader of the corresponding Region.

Finally, the Operator is marked as finished or timed out and is removed from the execution list.
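
The wait queue and concurrency limit described above can be pictured with the following toy model in Python. It is only a sketch: the class reuses the name OperatorController for readability, but its methods, the limit argument, and the way steps are reported as done are all assumptions made for illustration, not PD's implementation. Timeouts are left out.

```python
from collections import deque


class OperatorController:
    """Toy model: a wait queue plus a cap on concurrently running operators."""

    def __init__(self, limit):
        self.limit = limit      # hypothetical stand-in for a *-schedule-limit value
        self.waiting = deque()  # operators not yet started
        self.running = {}       # region_id -> list of remaining steps

    def add(self, region_id, steps):
        self.waiting.append((region_id, list(steps)))

    def dispatch(self):
        """Start waiting operators while there is spare concurrency."""
        while self.waiting and len(self.running) < self.limit:
            region_id, steps = self.waiting.popleft()
            self.running[region_id] = steps

    def step_done(self, region_id):
        """Assumed callback: the Region Leader finished the current step."""
        steps = self.running[region_id]
        steps.pop(0)
        if not steps:           # no steps left: the operator is finished
            del self.running[region_id]


ctl = OperatorController(limit=2)
for rid in (1, 2, 3):
    ctl.add(rid, ["AddLearner", "PromoteLearner"])
ctl.dispatch()
print(sorted(ctl.running), len(ctl.waiting))
```

With a limit of 2, the third Operator stays in the wait queue until one of the running Operators completes all of its steps, which is the effect that raising a schedule limit relaxes.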


Region load balancing mainly relies on two schedulers, balance-leader and balance-region, whose goal is to spread Regions evenly across all Stores in the cluster. Their focuses differ: balance-leader deals with Region Leaders, and can be thought of as spreading the pressure of handling client requests; balance-region deals with every Peer of a Region, to spread storage pressure and avoid situations such as a disk filling up.

balance-leader and balance-region have similar scheduling processes. First, each Store is given a score according to the amount of the corresponding resource it holds. Then Leaders or Peers are continuously moved from Stores with high scores to Stores with low scores.

The two differ in how the score is calculated. balance-leader is relatively simple: it uses the sum of the Region sizes of all Leaders on the Store as the score. balance-region has to consider that different nodes may have different storage capacities, so it distinguishes three cases. When space is plentiful, the score is calculated from the amount of data (so that the data volume on different nodes is roughly balanced). When space is insufficient, the score is calculated from the remaining space (so that the remaining space on different nodes is roughly balanced). In the intermediate state, both factors are considered and their weighted sum is used as the score.
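
The three cases can be sketched in Python as follows. This is an illustration of the idea only: the usage ratios plenty and scarce, the constant MAX_SCORE, and the linear blend in the intermediate state are all invented for this example and are not PD's real formula.

```python
MAX_SCORE = 10_000  # hypothetical constant, only used to invert "remaining space"


def leader_score(leader_region_sizes_mib):
    """balance-leader: sum of the sizes of Regions whose Leader is on the Store."""
    return sum(leader_region_sizes_mib)


def region_score(used_gib, capacity_gib, plenty=0.6, scarce=0.8):
    """balance-region: choose the scoring rule from how full the Store is.

    `plenty` and `scarce` are made-up usage ratios separating the three cases.
    """
    usage = used_gib / capacity_gib
    by_data = used_gib                                # more data, higher score
    by_space = MAX_SCORE - (capacity_gib - used_gib)  # less free space, higher score
    if usage < plenty:
        return by_data
    if usage >= scarce:
        return by_space
    w = (usage - plenty) / (scarce - plenty)          # 0 at `plenty`, 1 at `scarce`
    return (1 - w) * by_data + w * by_space


print(leader_score([96, 96, 48]))
print(region_score(100, 1000), region_score(700, 1000), region_score(900, 1000))
```

In all three cases a higher score marks the Store as a source to move Peers away from; only the quantity being balanced changes as the disk fills up.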

In addition, because nodes may differ in performance and other respects, balance weights can be set for Stores. leader-weight and region-weight control the weight of Leaders and Regions respectively, and both default to 1. If a Store's leader-weight is set to 2, then once scheduling is stable the number of Leaders on that node is about twice that of ordinary nodes. If a Store's region-weight is set to 0.5, then once scheduling is stable the number of Regions on that node is about half that of other nodes.
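
As a worked example of the effect of weights, the Python sketch below splits a total count across Stores in proportion to their weights, which is the stable state the paragraph above describes. The function expected_share, the Store names, and the totals are hypothetical; real clusters only approximate these numbers.

```python
def expected_share(total, weights):
    """Split `total` across stores in proportion to their weights."""
    weight_sum = sum(weights.values())
    return {store: total * w / weight_sum for store, w in weights.items()}


# leader-weight = 2 on store-1, default weight 1 elsewhere
leaders = expected_share(4000, {"store-1": 2, "store-2": 1, "store-3": 1})
print(leaders)

# region-weight = 0.5 on store-1, default weight 1 elsewhere
regions = expected_share(5000, {"store-1": 0.5, "store-2": 1, "store-3": 1})
print(regions)
```

With these inputs store-1 ends up with twice the Leaders of the other Stores in the first case and half the Regions in the second.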

Hot spot scheduling

The scheduler for hot spot scheduling is hot-region-scheduler. In version 3.0, hot Regions are identified in a fairly simple way: based on the information reported by Stores, PD finds the Regions whose read or write traffic exceeds a certain threshold for a period of time, and then scatters these Regions in a way similar to Balance.

For write hot spots, hot spot scheduling tries to scatter both the Peers and the Leaders of hot Regions. For read hot spots, since only the Leader bears the read pressure, hot spot scheduling tries to scatter the Leaders of hot Regions.
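
The rule "traffic exceeds a threshold for a period of time" can be sketched in Python as a counter of consecutive hot reports. The class HotRegionTracker, its two thresholds, and the reset-on-cold behavior are assumptions chosen to illustrate the idea; they do not describe PD's actual statistics.

```python
from collections import defaultdict


class HotRegionTracker:
    def __init__(self, flow_threshold, hits_threshold):
        self.flow_threshold = flow_threshold  # traffic per report (assumed unit: bytes)
        self.hits_threshold = hits_threshold  # consecutive hot reports required
        self.hits = defaultdict(int)

    def report(self, region_id, flow_bytes):
        if flow_bytes >= self.flow_threshold:
            self.hits[region_id] += 1
        else:
            self.hits.pop(region_id, None)    # streak broken: start over

    def hot_regions(self):
        return sorted(r for r, n in self.hits.items() if n >= self.hits_threshold)


tracker = HotRegionTracker(flow_threshold=1000, hits_threshold=3)
for flow in (5000, 5000, 5000):
    tracker.report(7, flow)   # Region 7 stays above the threshold
for flow in (5000, 10, 5000):
    tracker.report(8, flow)   # Region 8 drops below it once
print(tracker.hot_regions())
```

Requiring several consecutive hits keeps short traffic spikes from triggering scheduling, at the cost of reacting more slowly to real changes in traffic.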

Cluster topology awareness

The purpose of making PD aware of the cluster topology is to spread the copies of each Region as far apart as possible through scheduling, to ensure high availability and disaster tolerance. For example, if a cluster has three data centers, the safest placement is to put the three Peers of each Region in different data centers, so that service can continue if any one data center fails.

PD continuously scans all Regions in the background. When it finds that the distribution of a Region is not optimal, it generates scheduling to replace Peers and bring the Region to the best state.

The component responsible for this check is called replicaChecker (similar to a Scheduler, but it cannot be disabled), and it relies on the location-labels configuration for scheduling. For example, the configuration [zone, rack, host] defines a three-tier topology: the cluster is divided into multiple zones (availability zones), each zone contains multiple racks, and each rack contains multiple hosts. When scheduling, PD first tries to place the Peers of a Region in different zones. If that cannot be satisfied (for example, 3 copies are configured but there are only 2 zones in total), it falls back to guaranteeing that Peers are placed in different racks. If there are not enough racks to guarantee isolation, it tries host-level isolation, and so on.
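
The fallback from zone to rack to host can be illustrated with a small Python sketch. The functions isolation_level and pick_store, and the representation of a Store as a dict of labels, are hypothetical; the sketch only shows how a label hierarchy can rank candidate Stores and is not how replicaChecker is implemented.

```python
LOCATION_LABELS = ["zone", "rack", "host"]


def isolation_level(a, b, labels=LOCATION_LABELS):
    """Return the first label on which two stores differ, or None if all match."""
    for label in labels:
        if a.get(label) != b.get(label):
            return label
    return None


def pick_store(existing, candidates, labels=LOCATION_LABELS):
    """Pick the candidate whose weakest isolation from existing peers is best."""
    def rank(store):
        # Position of the isolating label per existing peer; lower is better.
        positions = (
            labels.index(lvl) if lvl is not None else len(labels)
            for lvl in (isolation_level(store, peer, labels) for peer in existing)
        )
        return max(positions, default=0)
    return min(candidates, key=rank)


existing = [
    {"zone": "z1", "rack": "r1", "host": "h1"},
    {"zone": "z2", "rack": "r1", "host": "h2"},
]
candidates = [
    {"zone": "z1", "rack": "r2", "host": "h3"},  # shares zone z1 with a peer
    {"zone": "z3", "rack": "r1", "host": "h4"},  # new zone
]
print(pick_store(existing, candidates))
```

When no candidate lives in a new zone, the same ranking naturally prefers a Store in a different rack, and then a different host.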

Scale-in and fault recovery

Scale-in means preparing to take a Store offline: a command marks the Store as Offline, and PD then migrates the Regions on that node to other nodes through scheduling. Fault recovery applies when a Store fails and cannot be recovered: Regions with Peers on that Store lack copies, and PD needs to make up copies of these Regions on other nodes.

The process is basically the same in both cases: replicaChecker detects Regions whose Peers are in an abnormal state, and then generates scheduling to create a new copy on a healthy Store to replace the abnormal one.

Region merge

Region merge is the process of merging adjacent Regions through scheduling, so that a large number of small or even empty Regions left after data is deleted do not consume system resources. Region merge is handled by mergeChecker, whose process is similar to that of replicaChecker: it traverses Regions in the background and initiates scheduling when it finds contiguous small Regions.
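
A minimal Python sketch of this traversal is shown below. The Region class, the function merge_candidates, and the rule of merging a small Region into its right-hand neighbor are assumptions for illustration; the two limits are named after the max-merge-region-size and max-merge-region-keys parameters discussed later in this article, and the values used are examples only.

```python
from dataclasses import dataclass


@dataclass
class Region:
    id: int
    size_mib: int
    keys: int


def merge_candidates(regions, max_size_mib, max_keys):
    """Return (source, target) id pairs for small regions and their right neighbour.

    `regions` must be ordered by start key so that list neighbours are adjacent.
    """
    pairs, i = [], 0
    while i + 1 < len(regions):
        cur, nxt = regions[i], regions[i + 1]
        if cur.size_mib <= max_size_mib and cur.keys <= max_keys:
            pairs.append((cur.id, nxt.id))
            i += 2  # both regions are now busy with this merge
        else:
            i += 1
    return pairs


regions = [
    Region(1, 0, 0),        # empty region
    Region(2, 96, 900_000),
    Region(3, 5, 100),      # small region
    Region(4, 1, 10),
    Region(5, 96, 800_000),
]
print(merge_candidates(regions, max_size_mib=20, max_keys=200_000))
```

A Region qualifies only when it is below both limits, which is why an overestimated key count can block merging even for a Region that is nearly empty.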

Query scheduling status

The main ways to check the status of the scheduling system are Metrics, pd-ctl, and logs. This article briefly introduces Metrics and pd-ctl. For more detail, refer to the sections on PD monitoring and PD Control in the official documentation.

State of the Operator

The Grafana PD/Operator page shows Operator related statistics. Among the more important ones are:

  • Schedule Operator Create: shows the creation of the Operator, from which scheduler the Operator was created and why.
  • Operator finish duration: shows the Operator execution time
  • Operator Step duration: shows the execution time of different Operator steps

The pd-ctl commands for querying Operators are:

  • operator show: query all operators generated by the current schedule
  • operator show [admin | leader | region]: query Operator by type

State of the Balance

The Grafana PD/Statistics – Balance page shows the Statistics related to load balancing, among which the important ones are:

  • Store Leader/Region score: displays scores for each Store
  • Store Leader/Region count: display the number of Leader/Region of each Store
  • Store available: shows the remaining space for each Store

The pd-ctl store command can be used to query each Store's score, Leader/Region count, remaining space, weight, and other information.

Hot spot scheduling state

The Grafana PD/Statistics – hotspot page shows the Statistics of hot regions, among which the important ones are:

  • Hot write Region’s leader/peer distribution: shows the Leader/Peer distribution of hot Region
  • Hot read Region’s leader distribution: shows the distribution of leaders in reading hot Region

pd-ctl can also be used to query the above information. The available commands are:

  • hot read: queries read hot Region information
  • hot write: queries write hot Region information
  • hot store: shows hot spot distribution by Store
  • region topread [limit]: queries the Regions with the most read traffic at present
  • region topwrite [limit]: queries the Regions with the most write traffic at present

Region health

The Grafana PD/Cluster/Region health panel shows the number of Regions in each abnormal state, including Pending Peer, Down Peer, Offline Peer, and Regions with too many or too few copies.

The region check command of pd-ctl lists the Regions in a specific abnormal state:

  • region check miss-peer: Regions with missing copies
  • region check extra-peer: Regions with extra copies
  • region check down-peer: Regions with a copy in the Down state
  • region check pending-peer: Regions with a copy in the Pending state

Scheduling policy control

pd-ctl is the main tool for adjusting the scheduling strategy online. PD's scheduling behavior can be controlled in the following three ways. This article gives a brief introduction; more detail can be found in the PD Control section of the official documentation.

Starting and stopping schedulers

pd-ctl supports creating and deleting Schedulers dynamically, and PD's scheduling behavior can be controlled through these operations, for example:

  • scheduler show: displays the Schedulers currently running in the system
  • scheduler remove balance-leader-scheduler: deletes (disables) the balance-leader scheduler
  • scheduler add evict-leader-scheduler 1: adds a scheduler that removes all Leaders from Store 1

Manually add Operator

PD also supports bypassing the scheduler and creating or deleting operators directly through pd-ctl, as follows:

  • operator add add-peer 2 5: adds Peer for Region 2 on Store 5
  • operator add transfer-leader 2 5: migrate the Region 2 Leader to Store 5
  • operator add split-region 2: split Region 2 into two regions of equal size
  • operator remove 2: cancels the Region 2 Operator currently to be executed

Scheduling parameter adjustment

Running config show in pd-ctl displays all scheduling parameters, and config set {key} {value} changes the value of a parameter. Common parameters are listed here. For more detail, refer to the PD scheduling parameter guide:

  • leader-schedule-limit: controls the concurrency of Transfer Leader scheduling
  • region-schedule-limit: controls the concurrency of Peer scheduling
  • disable-replace-offline-replica: stops the scheduling that handles nodes going offline
  • disable-location-replacement: stops the scheduling that adjusts the isolation level of Regions
  • max-snapshot-count: the maximum number of concurrent Snapshots allowed per Store

Analysis and handling of typical scenarios

1. Uneven Leader/Region distribution

Note that, because of PD's scoring mechanism, different Leader Counts and Region Counts on different Stores do not in general mean that the load is unbalanced. Whether there is an imbalance has to be judged from the actual load of TiKV or from storage space usage.

Once an uneven Leader/Region distribution is confirmed, first look at the scores of the different Stores.

If the scores of different Stores are close, PD considers the cluster balanced. Possible reasons are:

  • Hot spots cause load imbalance. Analyze further using hot spot scheduling information; refer to the hot spot scheduling section below.
  • There are a large number of empty or small Regions, so the number of Leaders differs greatly between Stores and raftstore is overloaded. Region Merge needs to be enabled and sped up as much as possible. Refer to the Region Merge section below.
  • Hardware and software environments differ from Store to Store. Adjust leader-weight and region-weight as appropriate to control the distribution of Leaders/Regions.
  • Other unknown reasons. As a fallback, leader-weight and region-weight can still be adjusted until the distribution is what the user considers reasonable.

If the scores of different Stores differ greatly, check the Operator-related Metrics further, paying special attention to Operator generation and execution. There are roughly two cases.

In the first case, scheduling is generated normally but executes slowly. Possible reasons include:

  • Scheduling speed is limited by the limit configuration. PD's default limits are conservative; leader-schedule-limit or region-schedule-limit can be increased moderately as long as normal business is not significantly affected. In addition, the max-pending-peer-count and max-snapshot-count restrictions can also be relaxed.
  • Other scheduling tasks are running in the system at the same time, so balance cannot reach full speed. In this case, if balance has a higher priority, you can stop the other scheduling or limit its speed. For example, if nodes are taken offline while Regions are not balanced, offline scheduling and Region Balance compete for the region-schedule-limit quota. You can then lower replica-schedule-limit to limit the speed of offline scheduling, or simply set disable-replace-offline-replica = true to temporarily disable the offline process.
  • Scheduling executes too slowly. Check the Operator Step duration to judge this. Typically, steps that do not involve a Snapshot (such as TransferLeader, RemovePeer, and PromoteLearner) should complete in milliseconds, and steps that involve a Snapshot (such as AddLearner and AddPeer) should complete in tens of seconds. If the duration is clearly too long, the cause may be excessive pressure on TiKV or a network bottleneck, which requires specific analysis.

In the other case, the corresponding balance scheduling is not generated. Possible reasons include:

  • The scheduler is not enabled. For example, the corresponding Scheduler has been removed, or its limit is set to 0.
  • Scheduling is blocked by other constraints. For example, if there is an evict-leader-scheduler in the system, Leaders cannot be migrated to the corresponding Store. Likewise, if Label property is set, some Stores will not accept Leaders.
  • The cluster topology prevents balance. For example, in a cluster with 3 copies and 3 data centers, copy isolation requires the 3 copies of each Region to be placed in different data centers. If these 3 data centers have different numbers of Stores, scheduling can only converge to a state that is balanced within each data center but unbalanced globally.
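
The topology case can be checked with simple arithmetic. In the Python sketch below, each data center holds exactly one copy of every Region (3 copies, 3 data centers), so the per-Store count depends only on how many Stores the data center has. The function name, the data center names, and the numbers are made up for the example.

```python
def regions_per_store(total_regions, stores_per_dc):
    """Each data center holds one copy of every region, spread over its stores."""
    return {dc: total_regions / n for dc, n in stores_per_dc.items()}


counts = regions_per_store(30000, {"dc-1": 2, "dc-2": 3, "dc-3": 5})
print(counts)
```

With 30000 Regions, Stores in the 2-Store data center each hold 15000 Peers while Stores in the 5-Store data center each hold 6000, and no amount of balance scheduling can even this out without breaking isolation.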

2. Nodes go offline slowly

This scenario also starts from the Operator-related Metrics, analyzing how Operators are generated and executed.

If scheduling is generated normally but is very slow, possible reasons include:

  • Scheduling speed is limited by the limit configuration. The limit parameter for going offline is replica-schedule-limit, which can be increased moderately. As with Balance, the max-pending-peer-count and max-snapshot-count restrictions can also be relaxed.
  • Other scheduling tasks are running in the system and competing for resources, or scheduling executes too slowly. The handling is described in the previous section and is not repeated here.
  • When a single node is taken offline, a large share of the Regions to be processed (about 1/3 with 3 copies) have their Leader on that node, so the offline speed is limited by how fast this single node can generate Snapshots. You can manually add an evict-leader scheduler for this node to move the Leaders away and speed things up.

If no corresponding Operator is generated, possible reasons are:

  • Offline scheduling is disabled, or replica-schedule-limit is set to 0.
  • No node can be found to receive the Regions. For example, if the used capacity of every replacement node with the same Label exceeds 80%, PD stops scheduling to avoid the risk of filling up the disk. In this case, add more nodes or delete some data to free up space.

3. Nodes come online slowly

Currently, PD does not treat a node coming online specially. Bringing a node online simply relies on the balance region mechanism, so refer to the troubleshooting steps above for uneven Region distribution.

4. Uneven distribution of hot spots

Hot spot scheduling problems fall broadly into the following cases.

In the first case, PD metrics show a lot of hot Regions, but scheduling cannot keep up, so the hot Regions are not scattered in time.

The solution is to increase hot-region-schedule-limit and reduce the limit quotas of other schedulers to speed up hot spot scheduling. In addition, lowering hot-region-cache-hits-threshold makes PD respond more quickly to traffic changes.

In the second case, a single Region forms the hot spot, for example a small table that is scanned frequently by a large number of requests. This can be identified from the business side or from the hot spot statistics in metrics. Because a single-Region hot spot cannot currently be eliminated by scattering, once the hot Region is confirmed you need to manually add a split-region Operator to split it.

In the third case, PD statistics show no hot spot, but the TiKV metrics show that some nodes carry a significantly higher load than others and have become the bottleneck of the whole system.

This is because PD currently identifies hot Regions along a single dimension, traffic, so hot spots cannot be located accurately in some scenarios. For example, some Regions receive a large number of point-query requests, which are not significant in terms of traffic, but whose high QPS makes key modules a bottleneck. The current approach is to first identify the hot table at the business level and then add a scatter-range-scheduler so that all Regions of the table are evenly distributed. TiDB also provides an interface in its HTTP API to simplify this operation; refer to the TiDB HTTP API documentation for details.

5. Region Merge is slow

As with the other slow-scheduling problems discussed above, slow Region Merge is most likely caused by the limit configuration (Region Merge is restricted by both merge-schedule-limit and region-schedule-limit) or by competition with other schedulers. The handling is the same and is not repeated here.

If the statistics show a large number of empty Regions in the system, you can set max-merge-region-size and max-merge-region-keys to smaller values to speed up the Merge. This is because the Merge process involves copy migration, so the smaller the Regions being merged, the faster the Merge. If Merge Operators are already being generated at several hundred opm and you want to go faster still, you can also set patrol-region-interval to "10ms", which speeds up Region patrol but consumes more CPU.

There is one special case: a large number of tables were created and then emptied (a truncate operation also counts as creating a table). If the split table feature is enabled, these empty Regions cannot be merged. In this case, adjust the following parameters to turn the feature off:

  • TiKV: set split-region-on-table to false
  • PD: set namespace-classifier to "" (an empty string)

In addition, in versions before 3.0.4 and 2.1.16, the approximate_keys statistic of a Region can be inaccurate in certain cases (mostly after a drop table), so the reported number of keys is too large to satisfy the max-merge-region-keys constraint. To work around this, relax the condition by setting max-merge-region-keys to a large value.

6. TiKV node fault handling strategy

Without manual intervention, PD's default behavior when a TiKV node fails is to wait for half an hour (adjustable through max-store-down-time), then set the node to the Down state and start replenishing copies of the Regions involved.

In practice, if the node's failure is known to be unrecoverable, take it offline immediately so that PD makes up the copies as soon as possible and the risk of data loss is reduced. Conversely, if the node is known to be recoverable but may not come back within half an hour, you can temporarily set max-store-down-time to a larger value to avoid the waste of resources caused by unnecessarily replenishing copies after the timeout.
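
The timeout rule can be pictured as a single comparison, sketched here in Python. The function store_is_down and its arguments are hypothetical; only the 30-minute default is taken from the text above.

```python
from datetime import datetime, timedelta


def store_is_down(last_heartbeat, now, max_store_down_time=timedelta(minutes=30)):
    """Assumed rule: a store counts as Down once it has been silent too long."""
    return now - last_heartbeat > max_store_down_time


now = datetime(2020, 1, 1, 12, 0)
silent_40_min = datetime(2020, 1, 1, 11, 20)

print(store_is_down(silent_40_min, now))
print(store_is_down(silent_40_min, now, max_store_down_time=timedelta(hours=2)))
```

Raising the limit to two hours keeps a node that has been silent for 40 minutes out of the Down state, so no replacement copies are created while it is being repaired.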


This article introduced the concepts and principles of PD scheduling and how to handle common problems. We hope that, with an understanding of the scheduling system, readers can use this article to solve the scheduling problems they encounter in production. PD's scheduling strategy is still evolving and improving, and we look forward to your suggestions.

Read the original: https://pingcap.com/blog-cn/best-practice-pd/
