This article is from the OPPO Internet technology team and is the third in a series of articles on “analyzing Spark data partitioning”. In this article, we analyze data partitioning in Spark Streaming and TiSpark.
Series 1: analyzing Hadoop partitioning
Series 2: analyzing Spark RDD partitioning
Series 3: analyzing Spark Streaming & TiSpark partitioning
1. Kafka + Spark Streaming
Spark Streaming receives data from Kafka and converts it into its own data structure, the DStream, i.e. a discretized stream.
There are two ways to receive data:
- The old receiver-based approach;
- The new direct (receiver-less) approach, introduced in Spark 1.3.
1.1 Receiver mode
Note that current versions of Spark no longer support this mode.
The parallelism of receiver mode is determined by spark.streaming.blockInterval, which defaults to 200 ms.
After receiving data, receiver mode cuts it into blocks and packages the blocks of each batch into an RDD; each block corresponds to one partition of that RDD.
When the batchInterval is fixed:
- Reducing the value of the spark.streaming.blockInterval parameter increases the number of partitions in the DStream's RDDs.
- It is suggested that spark.streaming.blockInterval not be set below 50 ms.
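The relationship between blockInterval and partition count is simple arithmetic, shown in the Python sketch below (this is plain arithmetic, not Spark API code; the function name and the 2 s batch used in the comment are illustrative):

```python
def receiver_partitions(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Number of blocks (and hence RDD partitions) produced per batch in
    receiver mode: one block is cut every blockInterval milliseconds."""
    return batch_interval_ms // block_interval_ms

# Example: a 2 s batch with the default 200 ms blockInterval yields
# 10 blocks per batch, i.e. an RDD with 10 partitions; halving the
# blockInterval to 100 ms doubles the partition count to 20.
```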
1.2 Direct mode
In direct mode, Spark creates exactly as many RDD partitions as there are Kafka partitions and reads data from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions.
The type of RDD that DirectKafkaInputDStream periodically generates is KafkaRDD.
Let’s first look at how KafkaRDD partitions its data:
It generates a KafkaRDDPartition for each OffsetRange received at initialization. Each partition corresponds to a slice of the data in one Kafka topic partition, and the OffsetRange records where that slice of data is located.
Let’s analyze the compute method of DirectKafkaInputDStream in detail:
According to the source code, the computation creates an OffsetRange for each partition of the topic and then generates a single KafkaRDD from all the OffsetRanges.
Let’s analyze the getPartitions method of KafkaRDD:
Each OffsetRange generates one partition.
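The mapping from OffsetRanges to RDD partitions can be sketched in Python (the class and function names below are illustrative stand-ins for the Scala originals, not actual Spark code):

```python
from dataclasses import dataclass

@dataclass
class OffsetRange:
    """Stand-in for Spark's OffsetRange: a slice of one topic partition."""
    topic: str
    partition: int
    from_offset: int
    until_offset: int

@dataclass
class KafkaRDDPartition:
    index: int
    offset_range: OffsetRange

def get_partitions(offset_ranges):
    # Mirrors the behavior described for KafkaRDD.getPartitions:
    # one RDD partition per OffsetRange, i.e. one per Kafka partition.
    return [KafkaRDDPartition(i, r) for i, r in enumerate(offset_ranges)]
```

With three topic partitions, `get_partitions` yields three RDD partitions, reflecting the one-to-one mapping described above.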
How do we change the number of RDD partitions and the amount of data each partition processes?
From the source code analysis: to reduce the number of RDD partitions, reduce the number of partitions of the Kafka topic; to increase RDD parallelism, increase the number of partitions of the Kafka topic.
2. TiSpark
2.1 TiDB architecture
The TiDB cluster is mainly divided into three components: the TiDB server, the PD (Placement Driver) server, and the TiKV server.
The TiDB server is responsible for receiving SQL requests and processing SQL-related logic; it finds, through PD, the TiKV addresses needed for storage and computation, interacts with TiKV to obtain data, and finally returns the results.
The TiDB server does not store data and is only responsible for computation, so it can be scaled out without limit and can provide a unified external access address through load-balancing components such as LVS, HAProxy, or F5.
2.2 TiKV Server
TiKV is responsible for data storage. Externally, it is a distributed key-value storage engine that provides transactions.
The basic unit of data storage is the region. Each region is responsible for storing the data of one key range (a left-closed, right-open interval from startKey to endKey). Each TiKV node is responsible for multiple regions, and PD schedules the load balancing of data across TiKV nodes at region granularity.
TiDB distributes data by region. A region contains a contiguous range of data, usually 96 MB in size. A region’s meta information contains two attributes: startKey and endKey.
When startKey <= key < endKey holds, we know which region the key is in; by then looking up the address of the TiKV node serving that region, we can read the key’s data.
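A minimal Python sketch of this half-open-interval lookup (the tuple layout `(start_key, end_key, region_id)` is an assumption for illustration, with an empty end key meaning "unbounded"):

```python
def locate_region(regions, key):
    """Find the region whose [start_key, end_key) range contains `key`.

    `regions` is a list of (start_key, end_key, region_id) tuples;
    an empty end_key (b"") means the range is unbounded on the right.
    """
    for start, end, region_id in regions:
        if start <= key and (end == b"" or key < end):
            return region_id
    return None
```

Note that a key equal to a region's endKey falls into the *next* region, because the interval is right-open.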
The region containing a key is obtained by sending a request to PD:
GetRegion(ctx context.Context, key []byte) (*metapb.Region, *metapb.Peer, error)
By calling this interface, we can locate the region containing the key.
If we need all the regions covering a key range, we start from the range’s startKey and call GetRegion multiple times; the endKey of each returned region becomes the startKey of the next request, until the endKey of a returned region reaches or exceeds the endKey of the requested range.
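That multi-region scan can be sketched as the following loop in Python, with `get_region` standing in for the PD GetRegion RPC (the tuple layout and names are illustrative assumptions):

```python
def scan_regions(get_region, start_key, end_key):
    """Collect all regions covering [start_key, end_key).

    Calls get_region repeatedly, using each returned region's end key as
    the next request's start key, until a region's end key reaches or
    exceeds the requested range's end key. `get_region` stands in for
    the PD GetRegion RPC and returns (start_key, end_key, region_id);
    an empty end key (b"") means unbounded.
    """
    result = []
    key = start_key
    while True:
        region = get_region(key)
        result.append(region)
        region_end = region[1]
        if region_end == b"" or region_end >= end_key:
            break
        key = region_end  # continue from where this region stops
    return result
```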
This execution process has an obvious problem: every read must first visit PD, which puts enormous pressure on PD and hurts request performance.
To solve this problem, the TiKV client implements a component called RegionCache to cache region information.
When we need to locate the region for a key, a RegionCache hit means we do not need to access PD at all.
RegionCache uses two data structures to save region information:
- a map, to quickly find a region by region ID;
- a B-tree, to find the region containing a given key.
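A Python sketch of these two lookup structures, using a dict for the ID map and the standard-library `bisect` module over sorted start keys in place of a real B-tree (the class shape is illustrative, not the actual TiKV client code):

```python
import bisect

class RegionCache:
    """Sketch of RegionCache's two structures: a region-ID map and an
    ordered index over start keys for containing-key lookups."""

    def __init__(self):
        self._by_id = {}      # region_id -> (start_key, end_key, region_id)
        self._starts = []     # sorted list of region start keys
        self._by_start = {}   # start_key -> region tuple

    def insert(self, region):
        start, _end, region_id = region
        self._by_id[region_id] = region
        if start not in self._by_start:
            bisect.insort(self._starts, start)
        self._by_start[start] = region

    def get_by_id(self, region_id):
        return self._by_id.get(region_id)

    def get_by_key(self, key):
        # Rightmost region whose start_key <= key, then verify the
        # half-open range actually contains the key.
        i = bisect.bisect_right(self._starts, key) - 1
        if i < 0:
            return None
        region = self._by_start[self._starts[i]]
        _start, end, _rid = region
        return region if end == b"" or key < end else None
```

The ordered index is what makes "find the region containing this key" efficient; a plain hash map alone cannot answer range-containment queries.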
Strictly speaking, the region information saved on PD is also just a layer of cache; the truly up-to-date region information is stored on the TiKV servers, and each TiKV server decides on its own when to split a region.
When a region changes, TiKV reports the new information to PD, and PD uses the reported region information to serve the queries of TiDB servers.
When we take region information from the cache and send a request, the TiKV server verifies that information to ensure the requested region is correct.
If the region has split, or a region migration has changed the region information, the cached information becomes stale and the TiKV server returns a region error.
In case of a region error, we clean up the stale RegionCache entry, retrieve the latest region information, and resend the request.
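The recovery path just described can be sketched as a retry loop (all names here are illustrative; this is not the actual TiKV client code):

```python
class RegionError(Exception):
    """Raised (in this sketch) when TiKV rejects stale region info."""

def send_with_retry(cache_lookup, refresh_from_pd, send, key, max_retries=2):
    """On a region error: drop the stale cache entry, re-fetch the
    region from PD, and resend -- mirroring the recovery path above.

    cache_lookup / refresh_from_pd / send stand in for the RegionCache
    lookup, the PD GetRegion call, and the TiKV request, respectively.
    """
    region = cache_lookup(key) or refresh_from_pd(key)
    for _ in range(max_retries + 1):
        try:
            return send(region, key)
        except RegionError:
            # Cached info was stale: invalidate it and ask PD again.
            region = refresh_from_pd(key)
    raise RuntimeError("request failed after retries")
```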
2.4 TiSpark architecture
TiSpark integrates deeply with the Spark Catalyst engine, enabling Spark to efficiently read the data stored in TiKV for distributed computation.
The getPartitions method of TiRDD works as follows:
From the source code analysis: it first obtains keyWithRegionTasks through splitRangeByRegion, then creates one TiPartition for each RegionTask.
As you can see, the number of partitions in TiSpark is the same as the number of regions in TiKV. If you want to improve the parallelism of TiSpark tasks, you can modify the following two parameter values:
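As a sketch (names illustrative, not TiSpark source code), the getPartitions behavior described above boils down to one partition per region task:

```python
def ti_partitions(key_with_region_tasks):
    """One TiPartition per RegionTask, as described for TiRDD: the
    partition count therefore equals the number of TiKV regions the
    scanned range spans."""
    return [{"index": i, "task": t} for i, t in enumerate(key_with_region_tasks)]
```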
Through the analysis of all these scenarios: as long as we correctly understand the relationship between partitions and tasks in each one, we can tune the parameters that affect partitioning, make jobs that process large amounts of data run fast, and at the same time answer data analysts’ questions clearly.