Author: Yao Wei
This paper presents a business best practice to avoid business falling into the “anti-pattern” used by TiDB when developing by explaining the common problems in TiDB in a typical scenario of high concurrent batch writing data to TiDB.
This paper is mainly for readers who have a certain understanding of TiDB. Before reading this article, readers recommend reading three articles (about storage, computing, scheduling) and TiDB Best Practice which explain the principle of TiDB.
High concurrent batch insertion scenarios usually exist in batch tasks in business systems, such as liquidation and settlement. It has the following remarkable characteristics:
- Large amount of data
- History data need to be stored in a short time
- It takes a short time to read a lot of data.
This poses some challenges to TiDB:
- Whether the write/read capability can be extended linearly and horizontally
- Is the performance stable and not attenuated while data is being written concurrently and continuously
For distributed database, in addition to its basic performance, the most important thing is to make full use of all node capabilities, to avoid the emergence of a single node as a bottleneck.
TiDB Data Distribution Principle
If we want to solve the above challenges, we need to start with the principle of TiDB data segmentation and scheduling. Here is just a brief description, please refer to the details: say scheduling.
TiDB has a size limit (default 96M) for data segmentation in terms of region. Region’s segmentation method is range segmentation. Each region has multiple copies, and each group of copies is called a Raft-Group. Leader is responsible for reading and writing this data (of course TiDB will support Follower-Read soon). Leader is automatically evenly scheduled by PD components on different physical nodes to equalize read and write pressures.
<center>Figure 1 TiDB Data Overview </center>
As long as business writing does not have AUTO_INCREMENT’s primary key or monotonically increasing index (that is, no business writing hotspot, see TiDB for more details). In principle, TiDB relies on this architecture to linearly expand its reading and writing capabilities and make full use of distributed resources. On this point, TiDB is particularly suitable for businesses with high concurrent batch writing scenarios.
But in the software world, there is no silver bullet. Specific matters need to be analyzed. Next, we discuss how TiDB needs to be used correctly in this scenario through some simple loads to achieve the best theoretical performance of this scenario.
A simple example
There is a simple table:
CREATE TABLE IF NOT EXISTS TEST_HOTSPOT( id BIGINT PRIMARY KEY, age INT, user_name VARCHAR(32), email VARCHAR(128) )
This table structure is very simple, except that ID is the primary key, there is no additional secondary index. The statements written are as follows: ID is generated by random number discretization:
INSERT INTO TEST_HOTSPOT(id, age, user_name, email) values(%v, %v, '%v', '%v');
The load is to execute the above written statements intensively in a short time.
So far, it seems to be in line with the TiDB best practices mentioned above. There are no hot spots in the business. As long as we have enough machines, we can make full use of TiDB’s distributed capabilities. To verify this, we can try it in the experimental environment (the experimental environment deployment topology is 2 TiDB nodes, 3 PD nodes, 6 TiKV nodes, please ignore QPS, the test here is just to explain the principle, not benchmark):
Control screenshot </center> Figure 2
The client initiated “intensive” writing in a short time, and TiDB received a request of 3K QPS. If there is no accident, the pressure should be spread to all six TiKV nodes. However, in terms of the CPU usage of TiKV nodes, there exists a significant write skew (tikv – 3 nodes are the write hotspots):
< center > Figure 3 Monitoring Screen </center >
<center>Figure 4 Monitoring screenshot </center>
The Raft store CPU represents the CPU utilization of raftstore threads and usually represents the write load. In this scenario, tikv-3 is the leader of raft, tikv-0 and tikv-1 are followers of raft, and the load of other tikv nodes is almost empty.
This can also be confirmed from the monitoring of PD:
< center > Fig. 5 Monitoring screenshot </center >
Reasons for counterintuition
The above phenomenon is somewhat counter-intuitive. The reason for this phenomenon is that when a table was created, it would only correspond to a region in TiKV. The scope of this phenomenon is as follows:
[CommonPrefix + TableID, CommonPrefix + TableID + 1)
For a large number of writes in a short time, it will continue to write to the same region.
<center>Fig.6 TiKV Region Splitting Process </center>
This process is briefly described in the figure above, and continues to be written, and TiKV will split Regions. But since the original Leader’s Store was the first to initiate the election, it is likely that the old Store will become the new Slicer for the two Regions. For the newly segmented Region 2,3. It also repeats what happened on Region 1. That is, the pressure will be concentrated intensively in TiKV-Node 1.
In the process of continuous writing, the PD can find that Node 1 has generated a hot spot, and it will evenly distribute the Leader to other Nodes. If the number of TiKV nodes can be more than the number of replicas, region migration will also occur, migrating to the idle Node as far as possible. These two operations are in the insertion process, which can also be verified in the PD monitoring:
Control screenshot </center> Figure 7
After writing for a period of time, the whole cluster will be automatically scheduled by PD into a state of uniform pressure, and then the ability of the whole cluster will be really utilized. For the most part, this is no problem. This stage belongs to the preheating stage of the table region.
However, this should be avoided for high concurrency batch intensive writing scenarios.
So can we skip this preheating process and divide the region directly into the expected number and schedule it in advance to each node of the cluster?
TiDB supports a new feature called Split Region in v3.0.x and later versions of v2.1.13. This feature provides a new grammar:
SPLIT TABLE table_name [INDEX index_name] BETWEEN (lower_value) AND (upper_value) REGIONS region_num SPLIT TABLE table_name [INDEX index_name] BY (value_list) [, (value_list)]
Readers may wonder why TiDB doesn’t automatically complete this slicing ahead of time. Let’s take a look at the picture first.
<center>Figure 8 Table Region Range</center>
As you can see from Figure 8, in the encoding of Table row data key, the only variable row data is row ID. In TiDB, rowID is an Int64 shaping. So can we divide the Int64 shaping range evenly into the number of parts we want, and then evenly distribute it in different nodes to solve the problem?
The answer is not necessarily, depending on the situation, if the line ID writing is completely discrete, then the above approach is feasible. But if the row ID or index has a fixed range or prefix. For example, I only insert discretely in the range of [2000w, 5000w]. This writing is still not hot in the business, but if it is segmented in the way above, it is possible to write only to a region at the beginning.
As a general-purpose database, TiDB does not assume the distribution of data, so it starts with a region to express a table. After the real data is inserted, TiDB automatically divides the data according to its distribution. This method is more general.
So TiDB provides Split Region grammar to optimize for short-term batch writing scenarios. Let’s try to use the following statement in the example above to split the Region ahead of time and see the load.
Since the writing of the test is completely discrete in the positive range, we can use the following statement to split the table into 128 regions in Int64 space in advance:
SPLIT TABLE TEST_HOTSPOT BETWEEN (0) AND (9223372036854775807) REGIONS 128;
After the segmentation is completed, it can be passed through
SHOW TABLE test_hotspot REGIONS;Statement to see the scattered situation, if the SCATTERING column values are all 0, the scheduling is successful.
You can also view the distribution of Regions through the table-regions.py script, which is fairly uniform:
[[email protected] scripts]# python table-regions.py --host 172.16.4.3 --port 31453 test test_hotspot [RECORD - test.test_hotspot] - Leaders Distribution: total leader count: 127 store: 1, num_leaders: 21, percentage: 16.54% store: 4, num_leaders: 20, percentage: 15.75% store: 6, num_leaders: 21, percentage: 16.54% store: 46, num_leaders: 21, percentage: 16.54% store: 82, num_leaders: 23, percentage: 18.11% store: 62, num_leaders: 21, percentage: 16.54%
We rerun the insert load:
<center> Figure 9 Monitoring Screen </center>
< center > Figure 10 Monitoring Screen </center >
< center > Figure 11 Monitoring Screen </center >
You can see that the obvious hot issues have been eliminated.
Of course, here’s just a simple table with some hot issues of indexing. How to pre-shred the relevant regions of the index?
This problem can be left to the reader, and more information can be obtained through the Split Region document.
More complex situations
If the table does not have a primary key or the primary key is not an int type, and the user does not want to generate a randomly distributed primary key ID, there will be an implicit _tidb_rowid column inside the TiDB as the row ID. In the absence of SHARD_ROW_ID_BITS, the value of the _tidb_rowid column increases monotonously, and there will also be write hotspots. (See what SHARD_ROW_ID_BITS is)
To avoid writing hot issues caused by _tidb_rowid, you can use SHARD_ROW_ID_BITS and PRE_SPLIT_REGIONS to build tables (see what PRE_SPLIT_REGIONS are).
SHARD_ROW_ID_BITS is used to randomly break up the row ID generated by the _tidb_rowid column, and pre_split_regions are used to pre-split the region after the table is built. Note: Pre_split_regions must be less than or equal to shard_row_id_bits.
create table t (a int, b int) shard_row_id_bits = 4 pre_split_regions=·3;
- SHARD_ROW_ID_BITS = 4 indicates that the value of tidb_rowid is randomly distributed into 16 (16 = 2 ^ 4) ranges.
- Pre_split_regions = 3 means that 8 (2 ^ 3) regions are split ahead of time after the table is built.
After the table t is written, the data is written into eight regions which are well splitted in advance, thus avoiding the hot issue of writing because there is only one region after the table is first built.
Turn off TiDB’s Atch mechanism
In version 2.1 of TiDB, latch mechanism is introduced in the SQL layer to detect transaction conflicts in the scenarios where writing conflicts are frequent, and to reduce retries caused by write conflicts between TiDB and TiKV when committing transactions. For run-batch scenarios, there is usually stock data, so there is no write conflict for transactions. TiDB latch can be turned off to reduce the allocation of small memory objects:
[txn-local-latches] enabled = false