Spark shuffle Principle & Tuning

Date: 2021-12-03

spark-shuffle

Shuffle reorganizes data. Because of the characteristics and requirements of distributed computing, its implementation details are fairly involved and complex.
In the MapReduce framework, shuffle is the bridge between map and reduce: in the map stage, data is processed and written out through shuffle for the corresponding reducers, while the reduce stage is responsible for pulling data from the map side and performing the computation. The whole shuffle process is usually accompanied by a large amount of disk and network I/O, so shuffle performance directly determines the performance of the entire program. Spark has its own shuffle implementation as well.

Generally speaking, Spark's shuffle is not fundamentally different from MapReduce's: both involve a map side (the data write stage) and a reduce side (the data read stage).

Spark shuffle execution process

This article analyzes the execution process of Spark shuffle and the tuning of the related parameters by walking through the source code.

By analyzing Spark's job-submission source code, we can see that what is ultimately called is the runTask method of org.apache.spark.scheduler.Task. Task has two subclasses: ShuffleMapTask (the write side; there may be a read before the write, and the last step of the stage is the write, equivalent to the map stage in MR) and ResultTask (the stage starts with a read, equivalent to the reduce stage in MR).

At a high level, shuffle write goes through three stages: write buffer, sort and spill, and merge file. The details differ between the writer implementations. The following walks through the shuffle write process by analyzing the source code.

Shuffle write phase

From the runTask method of ShuffleMapTask we reach SortShuffleManager.registerShuffle():

  • If dep.mapSideCombine == false && numPartitions <= 200 (spark.shuffle.sort.bypassMergeThreshold), return BypassMergeSortShuffleHandle

  • If the serializer supports relocation, i.e. serializer.supportsRelocationOfSerializedObjects && dep.mapSideCombine == false && numPartitions <= 1 << 24 (16,777,216; the source analysis below explains where the 24-bit limit comes from), return SerializedShuffleHandle

  • Otherwise, return BaseShuffleHandle
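
As a minimal, self-contained illustration of the selection logic above, here is a sketch modeled on SortShuffleManager.registerShuffle; the Dep case class and constant names are stand-ins, not Spark's real API:

```scala
object HandleSelectionSketch {
  // Simplified stand-in for Spark's ShuffleDependency
  case class Dep(mapSideCombine: Boolean, serializerSupportsRelocation: Boolean, numPartitions: Int)

  sealed trait Handle
  case object BypassMergeSortShuffleHandle extends Handle
  case object SerializedShuffleHandle      extends Handle
  case object BaseShuffleHandle            extends Handle

  val bypassMergeThreshold = 200               // spark.shuffle.sort.bypassMergeThreshold

  def selectHandle(dep: Dep): Handle =
    if (!dep.mapSideCombine && dep.numPartitions <= bypassMergeThreshold)
      BypassMergeSortShuffleHandle             // no map-side combine, few reduce partitions
    else if (dep.serializerSupportsRelocation && !dep.mapSideCombine &&
             dep.numPartitions <= (1 << 24))   // 16,777,216: partition id must fit in 24 bits
      SerializedShuffleHandle
    else
      BaseShuffleHandle
}
```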

Then, in the getWriter method, the writer is chosen based on the handle returned by registerShuffle:

  • If it is BypassMergeSortShuffleHandle, return BypassMergeSortShuffleWriter
  • If it is SerializedShuffleHandle, return UnsafeShuffleWriter
  • Otherwise (BaseShuffleHandle), return SortShuffleWriter

Finally, the writer's write method is called. The concrete shuffle write process therefore splits into three paths: BypassMergeSortShuffleWriter, UnsafeShuffleWriter, and SortShuffleWriter.
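
Continuing the sketch above, the handle-to-writer mapping done in getWriter can be summarized as follows (writer names are plain strings here, not the real Spark classes):

```scala
import HandleSelectionSketch._

// Each handle type maps to exactly one writer implementation
def writerFor(handle: Handle): String = handle match {
  case BypassMergeSortShuffleHandle => "BypassMergeSortShuffleWriter"
  case SerializedShuffleHandle      => "UnsafeShuffleWriter"
  case BaseShuffleHandle            => "SortShuffleWriter"
}
```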

BypassMergeSortShuffleWriter process

The BypassMergeSortShuffleWriter process is simple:

  • Create one DiskBlockObjectWriter per reduce partition (at most spark.shuffle.sort.bypassMergeThreshold, default 200); each writer creates a file whose name follows the pattern temp_shuffle_UUID
  • Compute the target DiskBlockObjectWriter from each record's key, serialize the record, and write it to that file
  • In the end there are numPartitions disk files, which are then merged into one data file (file name "shuffle_" + shuffleId + "_" + mapId + "_0.data") plus one index file (file name "shuffle_" + shuffleId + "_" + mapId + "_0.index"), as sketched below
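
The sketch below illustrates the idea; it is a simplified, self-contained model, not Spark's implementation: records are routed with a hypothetical nonNegativeMod helper, files hold text rather than serialized bytes, and the index file stores plain-text offsets instead of binary longs.

```scala
import java.io._

object BypassWriteSketch {
  // Map a key's hash to a reduce partition (always non-negative)
  def nonNegativeMod(hash: Int, numPartitions: Int): Int = {
    val raw = hash % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def write(records: Iterator[(String, String)], numPartitions: Int, dir: File): Unit = {
    // one temp_shuffle_<uuid>-style file (indexed here for simplicity) per reduce partition
    val partFiles = Array.tabulate(numPartitions)(i => new File(dir, s"temp_shuffle_$i"))
    val writers   = partFiles.map(f => new BufferedWriter(new FileWriter(f)))

    for ((k, v) <- records) {
      val pid = nonNegativeMod(k.hashCode, numPartitions)  // pick the target writer by key
      writers(pid).write(s"$k,$v\n")
    }
    writers.foreach(_.close())

    // merge the per-partition files into one data file and record cumulative offsets (index)
    val data  = new FileOutputStream(new File(dir, "shuffle_0_0_0.data"))
    val index = new PrintWriter(new File(dir, "shuffle_0_0_0.index"))
    var offset = 0L
    index.println(offset)
    for (f <- partFiles) {
      val bytes = java.nio.file.Files.readAllBytes(f.toPath)
      data.write(bytes)
      offset += bytes.length
      index.println(offset)                                // offset after each partition
    }
    data.close(); index.close()
  }
}
```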

The advantage of this writer is that it does not need sorting, and its behavior can be adjusted with the following parameters:

  • spark.shuffle.sort.bypassMergeThreshold (default 200)
  • spark.shuffle.file.buffer (default 32k): the write-file buffer size
  • spark.serializer (default org.apache.spark.serializer.JavaSerializer): the serializer

The above parameters can be adjusted appropriately according to the actual situation
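
For example, assuming Spark is on the classpath, the parameters above could be set through SparkConf like this; the values are illustrative starting points, not recommendations for every workload:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")  // bypass path only when reducers <= this
  .set("spark.shuffle.file.buffer", "64k")                // per-writer file buffer (default 32k)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // instead of JavaSerializer
```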

UnsafeShuffleWriter process

  • Data is first written through the serializer into an in-memory buffer, serBuffer.

  • Unsafe is then used to copy the serBuffer data into a MemoryBlock (a page); the first 16 bytes are skipped during the copy and hold no data. The exact reason is not entirely clear; it is simply how Unsafe reads byte[] data. Unsafe operations are needed because the data has to be copied into a page (off-heap or on-heap memory).

  • When a record is written into a page, the first 4 (or 8) bytes store the length of the data, followed by the data itself.

  • The location of each record is tracked in the in-memory sorter (ShuffleInMemorySorter), which records one 64-bit value per record, similar to an index entry: the first 24 bits hold the partition id, the next 13 bits the page number, and the last 27 bits the offset of the data within the page. Because only 24 bits are available for the partition id, the earlier writer-selection check requires numPartitions <= 1 << 24 (16,777,216). The data can later be located through the page number and the offset (see the bit-packing sketch after this list).

  • Spark uses a bitmap to track page usage, with PAGE_TABLE_SIZE = 1 << 13 (8192) entries: 1 marks a page as in use, 0 marks it as free.

  • Whether pages use off-heap memory is determined by spark.memory.offHeap.enabled; to enable off-heap memory you must also configure its size via spark.memory.offHeap.size. Advantages of using off-heap memory: https://cloud.tencent.com/developer/article/1513203

  • If memory is insufficient, the data is sorted, spilled to disk, and the pages are released. Because the in-memory sorter records each record's partition id, only the 64-bit "index" entries need to be sorted; the actual data is then read through the page number and offset and written to the file.

  • The follow-up logic is almost the same as SortShuffleWriter: write the ordered data to files, then merge one or more spill files together with the in-memory data into one large data file plus an index file. The details are covered together with SortShuffleWriter in the next section.
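
The 64-bit "index" entry described above can be modeled with plain bit operations. The sketch below mirrors the 24 / 13 / 27-bit layout (inspired by Spark's PackedRecordPointer; the object and method names here are illustrative):

```scala
object PackedPointerSketch {
  val PartitionBits = 24   // why numPartitions must be <= 1 << 24
  val PageBits      = 13   // up to 8192 pages
  val OffsetBits    = 27   // offset within a page

  // Pack partition id, page number and in-page offset into one long
  def pack(partitionId: Int, pageNumber: Int, offsetInPage: Long): Long =
    (partitionId.toLong << (PageBits + OffsetBits)) |
    (pageNumber.toLong  << OffsetBits) |
    (offsetInPage & ((1L << OffsetBits) - 1))

  def partitionId(packed: Long): Int   = (packed >>> (PageBits + OffsetBits)).toInt
  def pageNumber(packed: Long): Int    = ((packed >>> OffsetBits) & ((1L << PageBits) - 1)).toInt
  def offsetInPage(packed: Long): Long = packed & ((1L << OffsetBits) - 1)
}
```

Sorting only these packed longs by their high 24 bits orders the records by partition id without touching the serialized data in the pages, which is why the spill step above only has to sort the "index".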

SortShuffleWriter process

The two writers described above have restrictions; operators that may use them include groupByKey, sortByKey, and coalesce (no map-side combine). The final one, SortShuffleWriter, is the catch-all and can be used without restrictions.

First, shuffle write is divided into three stages: write buffer, sort and spill, and merge file.

  • write buffer: data is written into an in-memory buffer. When the fill factor exceeds 0.7 the buffer tries to grow; if the additional memory can be acquired, the buffer is expanded and writing continues

  • sort and spill: when memory is insufficient, the buffered data is sorted and spilled to disk

  • merge file: stages 1 and 2 leave several spill files plus the data still in the buffer. After sorting, the in-memory and on-disk data are merged into one large, partitioned and sorted data file, plus an index file that describes it. Reducers pull the data they need from the data file according to the index file.
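
To make the role of the index file concrete, here is a small sketch of how a reducer could locate its slice of the data file. It is self-contained and simplified: plain-text offsets are used to match the bypass sketch earlier, whereas the real index file stores binary longs, and the function name readPartition is illustrative.

```scala
import java.io.{File, RandomAccessFile}
import scala.io.Source

// Entry i and i+1 of the index give the byte range of partition i in the data file
def readPartition(dataFile: File, indexFile: File, partitionId: Int): Array[Byte] = {
  val offsets = Source.fromFile(indexFile).getLines().map(_.toLong).toArray
  val start   = offsets(partitionId)          // where this partition begins
  val end     = offsets(partitionId + 1)      // where the next partition begins
  val raf     = new RandomAccessFile(dataFile, "r")
  try {
    raf.seek(start)
    val bytes = new Array[Byte]((end - start).toInt)
    raf.readFully(bytes)
    bytes
  } finally raf.close()
}
```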

Write buffer stage

With map-side aggregation: PartitionedAppendOnlyMap

  • Instead of using the JDK's built-in map, Spark implements the map on top of a flat array

  • Compute the position pos from the key's hash. If slot 2*pos is empty, put the key at 2*pos and the value at 2*pos + 1. If 2*pos already holds a key, check whether the keys are equal: if they are, update the value with the passed-in function (for example, reduceByKey(_ + _) in word count computes the new value and stores it); if they are not, recompute the position as (pos + delta) & mask, where delta starts at 1 and increases by 1 on each further collision

  • After each insertion the approximate size is checked, with the occupied memory estimated (re-estimated every 32 inserts). If it exceeds the memory currently granted, more is requested from TaskMemoryManager.acquireExecutionMemory; if the request succeeds, writing continues, otherwise the data is spilled to disk. This is the first optimization point: in theory, the more memory an executor has, the more data can be held in memory, the fewer the spills, and the faster the job. Spilling calls collection.destructiveSortedWritablePartitionedIterator(comparator): the data is first compacted forward to fill the gaps in the array, then sorted in memory (the sorting algorithm is TimSort), and finally written to the file partitioned and sorted. See the sketch of the flat-array map below.
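
A minimal sketch of the flat-array map described in this list (modeled on the idea behind Spark's AppendOnlyMap; growth, size estimation, and spilling are omitted, and the class name is illustrative):

```scala
class ArrayMapSketch[K, V](capacity: Int = 64) {
  require((capacity & (capacity - 1)) == 0, "capacity must be a power of two")
  private val mask = capacity - 1
  private val data = new Array[AnyRef](2 * capacity)    // layout: [k0, v0, k1, v1, ...]

  /** Insert or update: mergeValue combines old and new values, e.g. reduceByKey(_ + _). */
  def changeValue(key: K, value: V)(mergeValue: (V, V) => V): Unit = {
    var pos   = key.hashCode & mask
    var delta = 1
    while (true) {
      val curKey = data(2 * pos)
      if (curKey == null) {                              // empty slot: insert key and value
        data(2 * pos) = key.asInstanceOf[AnyRef]
        data(2 * pos + 1) = value.asInstanceOf[AnyRef]
        return
      } else if (curKey == key) {                        // same key: merge with existing value
        val old = data(2 * pos + 1).asInstanceOf[V]
        data(2 * pos + 1) = mergeValue(old, value).asInstanceOf[AnyRef]
        return
      } else {                                           // collision: probe (pos + delta) & mask
        pos = (pos + delta) & mask
        delta += 1
      }
    }
  }
}
```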

Without map-side aggregation: PartitionedPairBuffer

  • Instead of using the JDK's built-in list, Spark implements the buffer on top of a flat array
  • The logic is simpler: no map-side combine is needed, so key-value pairs are just appended to the end of the array. Spilling works the same way as in PartitionedAppendOnlyMap: the data is sorted (by partition, and optionally by key; there is no need to compact the array first because the data is already contiguous) and written to a disk file, as sketched below.
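
A correspondingly minimal sketch of the append-only buffer (modeled on the idea behind PartitionedPairBuffer; growth and spilling are again omitted, and the class name is illustrative):

```scala
import scala.collection.mutable.ArrayBuffer

class PairBufferSketch[K, V] {
  private val data = ArrayBuffer.empty[((Int, K), V)]

  // No lookup or merge: records are simply appended as ((partitionId, key), value)
  def insert(partition: Int, key: K, value: V): Unit =
    data += ((partition, key) -> value)

  /** Data is already compact, so spilling just sorts by partition id and writes out. */
  def partitionSortedIterator: Iterator[((Int, K), V)] =
    data.sortBy { case ((p, _), _) => p }.iterator
}
```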

Shuffle read phase

No matter which writer is used, the result is always one partition-sorted large file plus an index file (describing the partition layout of the large file).
Looking at ResultTask.runTask in the source, the final call turns out to be rdd.iterator(partition, context). Spark evaluates RDDs through recursive iteration (a bit like object construction, which recursively makes sure the parent is created first; in Spark, a child RDD checks whether its parent RDD has been computed or cached). Following the RDD chain backwards through getOrCompute, the entry point of the read is found in the compute method of ShuffledRDD, which obtains the ShuffleReader.
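
The recursive evaluation can be pictured with a toy model (purely illustrative: ToyRDD and getOrCompute here are stand-ins for Spark's RDD.iterator / getOrCompute, and a real ShuffledRDD would call the shuffle reader inside compute):

```scala
abstract class ToyRDD[T] {
  private var cached: Option[Seq[T]] = None
  protected def compute(): Seq[T]              // ShuffledRDD would call the shuffle reader here
  def iterator(): Seq[T] = getOrCompute()
  private def getOrCompute(): Seq[T] = cached.getOrElse {
    val result = compute(); cached = Some(result); result
  }
}

class ParentRDD(data: Seq[Int]) extends ToyRDD[Int] {
  protected def compute(): Seq[Int] = data
}

class MappedRDD(parent: ToyRDD[Int], f: Int => Int) extends ToyRDD[Int] {
  protected def compute(): Seq[Int] = parent.iterator().map(f)   // recursion into the parent
}

// usage: new MappedRDD(new ParentRDD(Seq(1, 2, 3)), _ * 2).iterator()  returns Seq(2, 4, 6)
```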

Looking at the read method, the focus is on the stage that reads the data (remote + local) and obtains the wrapped streams; in other words, the key part to analyze is how data is fetched through ShuffleBlockFetcherIterator.

ShuffleBlockFetcherIterator

In essence this is the data-reading stage: think of it as a client sending requests to servers to fetch data. Each reducer reads the index file and then reads its own partition's slice of the large file (the file merged and partition-sorted during shuffle write). Tuning on the client side focuses on the read process itself: timeouts, concurrency, retries on failure, and so on; on the server side it focuses on adjusting the degree of concurrency.

ShuffleBlockFetcherIterator.initialize

  • First, targetRequestSize = math.max(maxBytesInFlight / 5, 1L), which is 48m / 5 = 9.6m by default. Blocks are accumulated until either the pulled size exceeds 9.6m or the number of blocks pulled from one address exceeds maxBlocksInFlightPerAddress (Int.MaxValue by default), so with default settings only the 9.6m limit applies; the accumulated blocks are then packaged into a new FetchRequest(address, curBlocks) (see the sketch after this list). If each block is very small, a single request may have to establish connections to many machines to pull its data; too many connections put heavy pressure on the read side, and spark.reducer.maxReqsInFlight can be tuned to limit the number of simultaneous connections per read and improve the success rate

  • Everything ends up packaged into N FetchRequests. When fetching starts, isRemoteBlockFetchable decides whether a request may be sent: in short, the amount of data in flight must not exceed spark.reducer.maxSizeInFlight (48m by default) and the number of in-flight requests must not exceed maxReqsInFlight (Int.MaxValue); otherwise the request is queued in deferredFetchRequests. So, to increase the concurrency of shuffle read requests, maxSizeInFlight (48m by default) can be increased. In addition, the number of blocks being pulled from a single address cannot exceed maxBlocksInFlightPerAddress (Int.MaxValue by default); lowering it reduces how many blocks are pulled from one server at the same time, preventing the server's I/O from climbing so high that it stops responding and causes I/O timeouts and similar errors

  • Finally, sendRequest is executed to pull the data. The tuning parameters here are spark.shuffle.io.maxRetries (3 by default), the number of retries when a pull fails, and spark.shuffle.io.retryWait (5 seconds by default), the wait before retrying after a failure. Both can be adjusted to the actual situation to improve fault tolerance
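
The request-grouping logic described in the first bullet can be sketched as follows (a self-contained simplification: BlockInfo and FetchRequest are stand-in types, and the real iterator also tracks per-address limits and in-flight state):

```scala
case class BlockInfo(blockId: String, size: Long)
case class FetchRequest(address: String, blocks: Seq[BlockInfo])

def splitIntoRequests(address: String,
                      blocks: Seq[BlockInfo],
                      maxBytesInFlight: Long = 48L * 1024 * 1024,   // spark.reducer.maxSizeInFlight
                      maxBlocksPerAddress: Int = Int.MaxValue): Seq[FetchRequest] = {
  val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)        // 48m / 5 = 9.6m by default
  val requests  = scala.collection.mutable.ArrayBuffer.empty[FetchRequest]
  var curBlocks = scala.collection.mutable.ArrayBuffer.empty[BlockInfo]
  var curSize   = 0L

  for (b <- blocks) {
    curBlocks += b
    curSize   += b.size
    if (curSize >= targetRequestSize || curBlocks.size >= maxBlocksPerAddress) {
      requests += FetchRequest(address, curBlocks.toSeq)            // cut a request here
      curBlocks = scala.collection.mutable.ArrayBuffer.empty[BlockInfo]
      curSize = 0L
    }
  }
  if (curBlocks.nonEmpty) requests += FetchRequest(address, curBlocks.toSeq)
  requests.toSeq
}
```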

ExternalShuffleService

Another shuffle tuning point is enabling the external shuffle service.
A Spark executor is responsible not only for computation but also for managing data. When a shuffle occurs, the executor must both produce the shuffle data and serve read requests for it. If an executor dies, it can no longer serve shuffle read requests, and the data it produced earlier becomes useless.

To decouple computation from data serving, Spark supports handling read requests with a separate service. This service, the ExternalShuffleService, runs on each host and manages the shuffle data produced by all executors on that host. Some readers may worry about performance, since previously multiple executors served read requests while now a host has only one ExternalShuffleService. In practice there is no need to worry: the work mainly consumes disk and network, and reads are asynchronous, so there is no performance impact.

After decoupling, an executor that dies unexpectedly during computation no longer affects the reading of its shuffle data. It also allows Spark to implement dynamic allocation, meaning idle executors can be released promptly.

The ExternalShuffleService is essentially a Netty-based service, so tuning it means tuning Netty parameters, mainly the following; concrete values need to be chosen for the actual situation (an example follows the list).
spark.shuffle.io.serverThreads
spark.shuffle.io.receiveBuffer
spark.shuffle.io.backLog
spark.shuffle.io.sendBuffer
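
Assuming the service itself has been deployed on each node (for example as the YARN auxiliary service or by the standalone worker), enabling it and setting the Netty parameters above might look like this; the numeric values are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")        // executors hand shuffle serving to the service
  .set("spark.dynamicAllocation.enabled", "true")      // idle executors can now be released safely
  .set("spark.shuffle.io.serverThreads", "16")
  .set("spark.shuffle.io.backLog", "1024")
  .set("spark.shuffle.io.sendBuffer", "1048576")       // bytes
  .set("spark.shuffle.io.receiveBuffer", "1048576")    // bytes
```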

Summary of tuning parameters

Off-heap memory related

  • spark.shuffle.io.preferDirectBufs: whether to prefer off-heap (direct) buffers
  • spark.memory.offHeap.enabled: whether off-heap memory is enabled
  • spark.memory.offHeap.size: the off-heap memory size

Parameter tuning in shuffle write phase

  • spark.executor.memory: the write-path analysis shows that the more memory a single task can use, the more it can hold before spilling, so the fewer the spills and the faster the job; this parameter can therefore be increased appropriately
  • spark.sql.shuffle.partitions: raising parallelism reduces the amount of data each task handles, reducing spills and the risk of OOM. However, it also increases the number of tasks, much like increasing the number of threads: beyond a certain threshold, more tasks only add context-switching pressure. Some testing per job is needed to find a suitable value
  • spark.shuffle.file.buffer: 32k by default; the buffer size used when writing spill files
  • spark.shuffle.spill.batchSize: 10000 by default; the number of records written per batch when spilling

Parameter tuning in shuffle read phase

  • spark.reducer.maxSizeInFlight: 48m by default; a single request pulls at most 48 / 5 = 9.6m of block data, so ideally five requests pull data in parallel. A very large block may exceed 48m, though, leaving only one request in flight with no parallelism, so this parameter can be increased appropriately
  • spark.reducer.maxReqsInFlight: how many requests may pull data simultaneously during shuffle read; Integer.MAX_VALUE by default and generally left unchanged
  • spark.reducer.maxBlocksInFlightPerAddress: how many blocks may be pulled from one server at the same time. A request is about 9.6m by default, but each block may be only a few KB, in which case one request can end up asking a server for thousands of blocks at once, easily leading to timeouts and other errors; reduce this parameter appropriately in that case
  • spark.reducer.maxReqSizeShuffleToMem: the maximum amount of fetched data held in memory during the read; beyond this, the pulled data is written to disk
  • spark.shuffle.io.maxRetries: 3 by default; the number of retries when a fetch fails. Increasing it may lengthen task run time but improves the task success rate
  • spark.shuffle.io.retryWait: 5 seconds by default; the wait between retries when a fetch fails. Increasing it may lengthen task run time but improves the task success rate
  • spark.shuffle.io.clientThreads: the number of client threads used to fetch data; can be increased appropriately. A combined example follows
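
One possible way to apply several of the parameters summarized above when building a session, assuming spark-sql is on the classpath; the concrete values are starting points to be validated per workload, not universal recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-tuning-example")
  .config("spark.executor.memory", "8g")                       // more room before spilling
  .config("spark.sql.shuffle.partitions", "800")               // raise parallelism for large shuffles
  .config("spark.shuffle.file.buffer", "64k")                  // spill write buffer (default 32k)
  .config("spark.reducer.maxSizeInFlight", "96m")              // more concurrent fetch volume (default 48m)
  .config("spark.reducer.maxBlocksInFlightPerAddress", "256")  // cap blocks fetched per server at once
  .config("spark.shuffle.io.maxRetries", "6")                  // more tolerance for transient fetch failures
  .config("spark.shuffle.io.retryWait", "10s")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .getOrCreate()
```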

Tuning effect

  • Bill table, 2.8 billion rows: 4,600 seconds before optimization, 1,200 seconds after, a 73.9% reduction in run time
  • Order table, 470 million rows: 1,600 seconds before optimization, 720 seconds after, a 55% reduction in run time

T: memory used, 1T = 1024G
P: the spark.sql.shuffle.partitions setting, 1P = 1000
C: number of CPU cores

Reference link
https://blog.csdn.net/zhanglh046/article/details/78360762
https://github.com/JerryLead/SparkInternals
https://www.cnblogs.com/itboys/p/9201750.html
https://www.dazhuanlan.com/2019/12/19/5dfb2a10d780d/
https://blog.csdn.net/pre_tender/article/details/101517789
https://www.bilibili.com/video/BV1sW41147vD?from=search&seid=12279554496967751348
https://www.jianshu.com/p/cda24891f9e4
https://cloud.tencent.com/developer/article/1513203