Spark performance tuning: shuffle tuning and troubleshooting

Time: 2021-05-04

Shuffle tuning in Spark performance tuning

This section first explains the core concepts of shuffle; it then covers optimization of HashShuffle and SortShuffle; next, tuning of the map side and reduce side; after that, analysis and mitigation of the data skew problem; and finally, troubleshooting of problems that arise while Spark jobs run.


This article was first published on the official account [Learn big data in five minutes], which focuses on big data technology and shares high-quality original big data articles.

1、 The core concept of shuffle

1. ShuffleMapStage and ResultStage


When stages are divided, the last stage is called the final stage; it is essentially a ResultStage object. All the stages before it are called ShuffleMapStages.

The end of a ShuffleMapStage is accompanied by writing shuffle files to disk.

A ResultStage basically corresponds to an action operator in the code: a function is applied to the data set of each partition of the RDD, which marks the end of a job.

2. Number of tasks in shuffle

We know that a Spark shuffle is divided into a map phase and a reduce phase, or in other words a shuffle write phase and a shuffle read phase. For one shuffle, both the map process and the reduce process are executed by several tasks, so how are the numbers of map tasks and reduce tasks determined?

Suppose the Spark job reads data from HDFS. The initial number of RDD partitions is then determined by the number of splits of the file, that is, one split corresponds to one partition of the generated RDD. Assume the initial number of partitions is N.

After a series of operators is applied to the initial RDD (assuming no repartitioning is done with repartition or coalesce, the number of partitions stays N; if repartition or coalesce is executed, it becomes some M), and assuming the number of partitions has not changed, then when a shuffle operation is executed the number of map-side tasks equals the number of partitions, i.e. the number of map tasks is N.

The stage on the reduce side uses the value of the configuration item spark.default.parallelism as its number of partitions by default; if it is not configured, the number of partitions of the last RDD on the map side is used instead. That number of partitions then determines the number of tasks on the reduce side.
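
For example, the reduce-side parallelism for RDD shuffles can be set explicitly through this configuration item (a minimal sketch; the value 200 is only an illustrative assumption):

val conf = new SparkConf()
  .set("spark.default.parallelism", "200") // reduce-side task count for RDD shuffles when no explicit parallelism is passed to the operator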

3. How the reduce side reads its data

According to the division of stages, we know that the map tasks and the reduce tasks are not in the same stage: the map tasks are in the ShuffleMapStage and the reduce tasks are in the ResultStage. The map tasks are executed first, so how do the subsequent reduce tasks know where to pull the data that the map tasks have written to disk?

The data-fetching process on the reduce side is as follows:

  1. After a map task finishes, its execution status and the locations of the small disk files it produced are encapsulated in a MapStatus object, which the MapOutputTrackerWorker in the executor process then sends to the MapOutputTrackerMaster in the driver process;
  2. Before a reduce task runs, the MapOutputTrackerWorker in its process sends a request to the MapOutputTrackerMaster in the driver process asking for the locations of the small disk files;
  3. After all map tasks have finished, the MapOutputTrackerMaster in the driver process knows the locations of all the small disk files; at this point it tells the MapOutputTrackerWorker the locations of the small disk files;
  4. After the previous steps, the BlockTransferService pulls the data from the node where the corresponding executor is located. By default five sub-threads are started, and the amount of data pulled each time cannot exceed 48 MB (a reduce task pulls at most 48 MB at a time and stores the pulled data in the 20% of executor memory reserved for it).

2、 HashShuffle analysis

The following discussion assumes that each executor has one CPU core.

1. Unoptimized HashShuffleManager

In the shuffle write stage, the data processed by each task is "divided" by key so that the next stage can run shuffle operators (such as reduceByKey) after the current stage finishes. "Dividing" means hashing the key, so that records with the same key are written to the same disk file, and each disk file belongs to exactly one task of the downstream stage. Before data is written to disk it is first written to a memory buffer; when the memory buffer fills up, it is spilled to the disk file.

Each task in the current stage creates as many disk files as there are tasks in the next stage. For example, if the next stage has 100 tasks, every task in the current stage creates 100 disk files. If the current stage has 50 tasks spread over 10 executors with 5 tasks per executor, then 500 disk files are created on each executor and 5,000 disk files across all executors. As you can see, the number of disk files produced by an unoptimized shuffle write is staggering.

Shuffle read is usually the first thing a stage does. Each task needs to pull all the records with its keys from the results of the previous stage, across the network from every node, and then aggregate or join them. Since in shuffle write each map task creates one disk file per reduce task of the downstream stage, during shuffle read each reduce task only needs to pull the disk files that belong to it from the nodes where the upstream map tasks ran.

The pull process of shuffle read fetches and aggregates at the same time. Each shuffle read task has its own buffer and can only pull data up to the buffer's size at a time; it then aggregates the batch through an in-memory map. After one batch is aggregated, the next batch is pulled into the buffer and aggregated, and so on until all the data has been pulled and the final result is obtained.

The working principle of the unoptimized HashShuffleManager is shown in the figure below:

[Figure: working principle of the unoptimized HashShuffleManager]

2. Optimized HashShuffleManager

To optimize the HashShuffleManager we can set the parameter spark.shuffle.consolidateFiles. Its default value is false; setting it to true enables the consolidation mechanism. If the HashShuffleManager is used, it is recommended to turn this option on.
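
As a minimal sketch, in Spark versions that still ship the HashShuffleManager (the 1.x line), the option can be enabled like this:

val conf = new SparkConf()
  .set("spark.shuffle.consolidateFiles", "true") // enable the consolidate mechanism of the HashShuffleManager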

After the consolidate mechanism is turned on, a task no longer creates one disk file for every task of the downstream stage during shuffle write. Instead the concept of a shuffleFileGroup appears: each shuffleFileGroup corresponds to a batch of disk files, and the number of files in the group equals the number of tasks in the downstream stage. The number of CPU cores on an executor determines how many tasks it can run in parallel. Each task in the first batch that runs in parallel creates a shuffleFileGroup and writes its data into the corresponding disk files.

When the executor's CPU cores finish one batch of tasks and start the next batch, the next batch reuses the existing shuffleFileGroups: tasks write their data into the existing disk files instead of creating new ones. The consolidate mechanism therefore lets different tasks reuse the same batch of disk files, which effectively merges the disk files of multiple tasks, greatly reduces the number of disk files, and improves shuffle write performance.

Suppose the second stage has 100 tasks and the first stage has 50 tasks, again with 10 executors (each with 1 CPU core) running 5 tasks each. With the unoptimized HashShuffleManager each executor would produce 500 disk files and all executors 5,000. After the optimization, the number of disk files created per executor is: number of CPU cores × number of tasks in the next stage. In other words, each executor creates only 100 disk files, and all executors create only 1,000.

The working principle of the optimized HashShuffleManager is shown in the figure below:

[Figure: working principle of the optimized HashShuffleManager]

3、 Analysis of SortShuffle

The SortShuffleManager has two operating modes: the common (sort-based) mechanism and the bypass mechanism. The bypass mechanism is enabled when the number of shuffle read tasks is less than or equal to the value of the spark.shuffle.sort.bypassMergeThreshold parameter (200 by default).

1. Common operation mechanism

In this mode, data is first written into an in-memory data structure; which structure is chosen depends on the shuffle operator. For aggregation-style shuffle operators such as reduceByKey, a Map is used, and records are aggregated through the map while being written into memory; for ordinary shuffle operators such as join, an Array is used and records are written into memory directly. Each time a record has been written into the in-memory structure, Spark checks whether a threshold has been reached; if so, it tries to spill the in-memory data to disk and then clears the structure.

Before writing to a disk file, the data in the in-memory structure is sorted by key. The sorted data is then written to the disk file in batches; the default batch size is 10,000 records, i.e. the sorted data is written out 10,000 records at a time. Writing to disk goes through a Java BufferedOutputStream, a buffered output stream that first accumulates data in a memory buffer and only writes it to the disk file when the buffer is full; this reduces the number of disk I/O operations and improves performance.

While a task is writing all of its data into the in-memory structure, many spills to disk can occur, producing many temporary files. At the end, all the previous temporary disk files are merged; this is the merge process, in which the data from all the temporary files is read and written into one final disk file. Since a task corresponds to only one disk file, which holds all the data this task prepared for the tasks of the downstream stage, a separate index file is also written that records the start and end offset of each downstream task's data within that file.

Because of this disk-file merging process, the SortShuffleManager greatly reduces the number of files. For example, if the first stage has 50 tasks spread over 10 executors with 5 tasks each and the second stage has 100 tasks, then, since each task produces only one disk file, there are only 5 disk files per executor and only 50 disk files across all executors.

The working principle of the SortShuffleManager under the common operating mechanism is shown in the figure below:

[Figure: SortShuffleManager under the common (sort-based) mechanism]

2. Bypass operation mechanism

The bypass mechanism is triggered when both of the following conditions hold:

  • The number of shuffle reduce tasks is less than or equal to the value of the spark.shuffle.sort.bypassMergeThreshold parameter (200 by default).
  • The operator is not an aggregation-style shuffle operator.

In this case, each task creates one temporary disk file per downstream task, hashes each record by key, and writes the record into the disk file corresponding to its key's hash value. As usual, records go through a memory buffer first and are spilled to the disk file only when the buffer is full. Finally, all the temporary disk files are merged into a single disk file and a separate index file is created.

The disk-writing process here is exactly the same as in the unoptimized HashShuffleManager, in that it also creates an astonishing number of temporary disk files; it is only because they are merged into a small number of final disk files that this mechanism performs better, during shuffle read, than the unoptimized HashShuffleManager.

The differences between this mechanism and the ordinary SortShuffleManager are: first, the disk-writing mechanism is different; second, there is no sorting. In other words, the biggest advantage of the bypass mechanism is that no sorting is done during shuffle write, which saves that part of the overhead.

The working principle of the SortShuffleManager under the bypass mechanism is shown in the figure below:

[Figure: SortShuffleManager under the bypass mechanism]

4、 Buffer size of map and reduce

While a Spark job runs, if the map side of a shuffle processes a large amount of data but the map-side buffer has a fixed size, the buffer may spill to disk frequently, making performance very poor. By increasing the map-side buffer size, frequent disk I/O can be avoided and the overall performance of the job improved.

The map-side buffer defaults to 32 KB. If each task processes 640 KB of data, 640 / 32 = 20 spills occur; if each task processes 64,000 KB, 64000 / 32 = 2,000 spills occur, which has a very serious impact on performance.

The configuration method of map side buffer is as follows:

val conf = new SparkConf()
  .set("spark.shuffle.file.buffer", "64")

In the Spark shuffle process, the buffer size of the shuffle reduce task determines how much data the reduce task can buffer, i.e. how much data can be pulled at a time. If memory is sufficient, appropriately increasing the size of this pull buffer reduces the number of pulls and the number of network transfers, thereby improving performance.

The size of the reduce-side pull buffer is controlled by spark.reducer.maxSizeInFlight; the default value is 48 MB. The parameter is set as follows:

Data pull buffer configuration on reduce side:

val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "96")

5、 The number of retries and waiting time interval of reduce side

During the Spark shuffle, when a reduce task fails to pull its data because of a network problem or similar, it automatically retries. For jobs that contain time-consuming shuffle operations it is recommended to increase the maximum number of retries (for example to 60) to avoid pull failures caused by JVM full GC or network instability. In practice, for shuffles over huge amounts of data (billions to tens of billions of records), adjusting this parameter can greatly improve stability.

The number of retries for fetching data on the reduce side is controlled by the spark.shuffle.io.maxRetries parameter, which is the maximum number of retries. If the pull still fails after that many attempts, the job may fail. The default is 3, and it is set as follows:

Configuration of the retry count for pulling data on the reduce side:

val conf = new SparkConf()
  .set("spark.shuffle.io.maxRetries", "6")

Also during the shuffle, when a reduce task fails to pull its data because of a network problem, it waits for a certain interval before retrying. Increasing this interval (for example to 60s) can improve the stability of the shuffle.

The wait interval between retries on the reduce side is set via spark.shuffle.io.retryWait; the default is 5s. The parameter is set as follows:

Configuration of waiting interval for data fetching on reduce side:

val conf = new SparkConf()
  .set("spark.shuffle.io.retryWait", "60s")

6、 Threshold of bypass mechanism

For the SortShuffleManager, if the number of shuffle reduce tasks is less than a certain threshold, the shuffle write does not sort; instead it writes data in the same way as the unoptimized HashShuffleManager, except that at the end all the temporary disk files produced by each task are merged into one file and a separate index file is created.

If you are using the SortShuffleManager and sorting is not needed, it is advisable to raise this parameter above the number of shuffle read tasks; the map side then skips sorting, which removes that overhead. However, a large number of temporary disk files will still be generated, so shuffle write performance remains a concern.

The SortShuffleManager's sorting threshold is set via spark.shuffle.sort.bypassMergeThreshold; the default value is 200. The parameter is set as follows:

Configuration of the bypass threshold of the SortShuffleManager:

val conf = new SparkConf()
  .set("spark.shuffle.sort.bypassMergeThreshold", "400")

Data skew

Data skew means that the amount of data assigned to different partitions is very uneven. One option is to write a custom partitioner and divide the data the way you want.

The data skew problem in Spark mainly refers to skew during the shuffle: because different keys correspond to very different amounts of data, different tasks end up processing very different amounts of data.

For example, the reduce side has to process 1,000,000 records in total. The first and second tasks each get 10,000 records and finish within five minutes, while the third task gets 980,000 records and may take ten hours, so the whole Spark job takes ten hours to complete. That is the consequence of data skew.

Note the difference between data skew and data overload. Data skew means a few tasks are assigned the vast majority of the data, so those few tasks run slowly; data overload means all tasks are assigned a very large (and roughly similar) amount of data, so all tasks run slowly.

Symptoms of data skew:

  1. Most tasks of the Spark job finish quickly while only a few run very slowly. In this case data skew is likely; the job can run, but very slowly;
  2. Most tasks of the Spark job finish quickly, but some task suddenly reports an OOM during execution, and after several reruns a certain task still hits an OOM error. In this case data skew is likely and the job cannot run normally.

Locating the data skew problem:

  1. Look for shuffle operators in the code, such as reduceByKey, countByKey, groupByKey and join, and judge from the code logic whether data skew can occur there;
  2. Check the Spark job's log files. The logs record errors down to a specific line of code; from the location of the exception you can determine in which stage the error occurred and which shuffle operator it corresponds to.

1. Pre-aggregate the raw data

1. Avoid shuffle process

In most cases the data source of a Spark job is a Hive table, which typically holds yesterday's data after ETL.
To avoid data skew we can consider avoiding the shuffle process altogether; if the shuffle is avoided, the possibility of data skew is eliminated at the root.

If the Spark job's data comes from a Hive table, the data can be aggregated in Hive first: for example, group the data by key and concatenate all the values of the same key into a single string in some special format, so that each key has only one row. Afterwards, processing all the values of a key requires only map operations and no shuffle at all, so no data skew can occur.
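
A hedged sketch of this kind of pre-aggregation, run through Spark SQL against Hive; the table and column names (user_visits, user_id, visit_info) and the SparkSession named spark are assumptions for illustration:

// Pre-aggregate in the Hive layer so that each user_id keeps only one row.
spark.sql(
  """
    |CREATE TABLE user_visits_agg AS
    |SELECT user_id,
    |       concat_ws('|', collect_list(visit_info)) AS visit_infos  -- all values of one key spliced into one string
    |FROM user_visits
    |GROUP BY user_id
  """.stripMargin)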

When operating on the Hive table, it is not necessary to splice the values into a string; each key's values can also simply be accumulated directly.

2. Increase the granularity of the key (reduces the chance of data skew; increases the amount of data per task)

If there is no way to pre-aggregate the data down to one row per key, in specific scenarios we can consider coarsening the aggregation granularity of the key.

For example, suppose there are currently 100,000 user records and the key granularity is (province, city, district, date). We can consider coarsening the key to (province, city, date). The number of distinct keys then drops, and the differences in data volume between keys may also shrink, which reduces the data skew. (This method only works for particular kinds of data; when the scenario does not fit, it can make the skew worse.)
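
As a hedged illustration, assuming a pair RDD named records keyed by (province, city, district, date) with a numeric value, the coarsening could look like this:

// Drop the district component to coarsen the key, then aggregate on the coarser key.
val coarsened = records.map { case ((province, city, district, date), value) =>
  ((province, city, date), value)
}
val aggregated = coarsened.reduceByKey(_ + _)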

2. Preprocess the keys that cause the skew

1. Filtering

If the Spark job is allowed to discard some data, you can consider filtering out the keys that may cause data skew, i.e. dropping the data belonging to those keys; then no data skew can occur in the job.
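
A minimal sketch, assuming the problematic keys are already known; the key set below and the pair RDD pairRdd are purely illustrative:

val skewedKeys = Set("", "null", "unknown")   // assumed keys known to dominate the data
val filtered = pairRdd.filter { case (key, _) => !skewedKeys.contains(key) }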

2. Use random key

When using operators such as groupByKey and reduceByKey, you can consider using random keys to implement two-stage (double) aggregation, as shown in the figure below:

[Figure: two-stage aggregation with random key prefixes]

First, a map operator adds a random-number prefix to the key of every record, breaking one original key into several different keys, and a first aggregation is performed so that data originally handled by one task is spread over multiple tasks for local aggregation; then the prefix is stripped from each key and a second, global aggregation is performed.
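
A minimal sketch of this salting technique, assuming a pair RDD pairRdd with numeric values and 10 random prefixes (both assumptions for illustration):

import scala.util.Random

// 1) Salt: spread each hot key over 10 different keys.
val prefixed = pairRdd.map { case (key, value) =>
  (s"${Random.nextInt(10)}_$key", value)
}
// 2) First (local) aggregation on the salted keys.
val partialAgg = prefixed.reduceByKey(_ + _)
// 3) Strip the random prefix again.
val unsalted = partialAgg.map { case (saltedKey, value) =>
  (saltedKey.split("_", 2)(1), value)
}
// 4) Second (global) aggregation on the original keys.
val finalAgg = unsalted.reduceByKey(_ + _)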

This approach works well for data skew caused by aggregation operators such as groupByKey and reduceByKey, but it only applies to aggregation-style shuffles, so its scope is fairly narrow. For join-style shuffles, other solutions are needed.

This method is also something to try when the previous approaches have not produced good results.

3. Use sampling to join the skewed key separately

In Spark, if an RDD contains only a single key, the data for that key is by default spread out during the shuffle and processed by different reduce tasks.

Therefore, when data skew is caused by a single key, that skewed key can be extracted into its own RDD, and that RDD can then be joined with the other RDD separately. According to Spark's execution mechanism, the data in that RDD will be spread over multiple tasks in the shuffle stage for the join.

The process of joining the skewed key separately is shown in the figure below:

[Figure: joining the skewed key separately]

Applicable scenario analysis:

For the data in an RDD you can convert it into an intermediate table, or simply call countByKey() to see how much data corresponds to each key. If you find that one key in the whole RDD holds a particularly large amount of data, consider this method.

When the data volume is very large, you can use sample() to take, say, 10% of the data, analyze which key in that sample is likely to cause skew, and then extract the data for that key separately.
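
A hedged sketch of that workflow, assuming two pair RDDs rddA and rddB and that a single key dominates rddA (all names are illustrative):

// Sample 10% of rddA and find the key with the largest count in the sample.
val sampled = rddA.sample(withReplacement = false, 0.1)
val skewedKey = sampled.countByKey().maxBy(_._2)._1

// Split rddA into the skewed key and the rest, join each part separately, then union.
val skewedA = rddA.filter(_._1 == skewedKey)
val normalA = rddA.filter(_._1 != skewedKey)
val skewedJoined = skewedA.join(rddB.filter(_._1 == skewedKey))
val normalJoined = normalA.join(rddB)
val joined = skewedJoined.union(normalJoined)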

Inapplicable scenario analysis:

If an RDD contains many keys that cause data skew, this scheme does not apply.

3. Increase the reduce-side parallelism

When schemes 1 and 2 do not help with the skew, consider increasing the reduce-side parallelism of the shuffle. More reduce-side tasks mean each task is assigned less data, which alleviates the data skew problem.

1. Setting the reduce-side parallelism

Most shuffle operators accept a parallelism parameter, e.g. reduceByKey(500); that parameter determines the reduce-side parallelism of the shuffle, i.e. how many reduce tasks are created. For shuffle statements in Spark SQL, such as group by and join, the parameter spark.sql.shuffle.partitions sets the parallelism of the shuffle read tasks; its default of 200 is a little too small for many scenarios.
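
A minimal sketch of both knobs; the value 500 and the pair RDD wordCounts are illustrative assumptions:

// RDD API: pass the parallelism directly to the shuffle operator.
val counts = wordCounts.reduceByKey(_ + _, 500)   // 500 reduce-side tasks for this shuffle

// Spark SQL: raise the shuffle partition count used by group by / join.
val conf = new SparkConf()
  .set("spark.sql.shuffle.partitions", "500")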

By increasing the number of shuffle read tasks, keys that were originally all assigned to one task can be spread over multiple tasks, so each task processes less data than before.

For example, if there were originally five keys, each with 10 records, and all five keys were assigned to one task, that task had to process 50 records. After increasing the number of shuffle read tasks, each task may be assigned one key, i.e. 10 records, so each task naturally finishes sooner.

2. Limitations of raising the reduce-side parallelism

Raising the reduce-side parallelism does not fundamentally change the nature of the data skew (unlike schemes 1 and 2, which avoid it at the source); it only relieves the data pressure on the shuffle reduce tasks as much as possible. It is suitable when there are many keys, each corresponding to a relatively large amount of data.

This solution usually cannot completely solve data skew: in extreme cases, e.g. a single key with 1,000,000 records, that key will still land on one task no matter how many tasks there are, so skew is bound to happen. It should be seen as the first, simplest attempt when skew is found, or be used in combination with other schemes.

In the ideal case, raising the reduce-side parallelism alleviates, or even essentially removes, the skew; in other cases it only slightly speeds up the slow, skewed tasks or avoids OOM in some tasks while they still run slowly. In that case, give up on scheme 3 promptly and try the later schemes.

4. Use map join

Under normal circumstances a join executes the shuffle process and performs a reduce join: all identical keys and their values are first gathered into one reduce task and then joined. The process of an ordinary join is shown in the figure below:

[Figure: an ordinary (reduce-side) join]

An ordinary join goes through the shuffle, which pulls all the data with the same key into one shuffle read task before joining; this is a reduce join. However, if one RDD is relatively small, you can broadcast the full data of the small RDD and use map-style operators to achieve the same effect as a join, i.e. a map join. In that case no shuffle happens and therefore no data skew.

Note: an RDD cannot be broadcast directly; you must first pull its data to the driver with collect and then broadcast that.

1. Core idea:

Instead of using the join operator, the join is implemented with a broadcast variable plus map-style operators, which avoids the shuffle entirely and hence the data skew. The data of the smaller RDD is pulled into the driver's memory with collect and turned into a broadcast variable; then a map operator is run over the other RDD, and inside the operator function the full data of the smaller RDD is obtained from the broadcast variable. Each record of the current RDD is compared by its join key, and records with equal keys are connected in whatever way you need.
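
A minimal sketch of such a map join, assuming pair RDDs smallRdd and bigRdd and a SparkContext sc (names are illustrative):

// Pull the small RDD to the driver and broadcast it to every executor.
val smallMap = smallRdd.collectAsMap()
val smallBc = sc.broadcast(smallMap)

// Inner join implemented with a map-style operator: look each key up in the broadcast map.
val joined = bigRdd.flatMap { case (key, value) =>
  smallBc.value.get(key).map(other => (key, (value, other)))
}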

With this approach no shuffle occurs at all, so the data skew caused by the join operation is eliminated at the root.

When a join suffers from data skew and one of the RDDs is small, this method should be considered first; it works very well.

The process of map join is shown in the following figure

[Figure: a map join using a broadcast variable]

2. Inapplicable scenario analysis:

Since a Spark broadcast variable keeps a copy in every executor, if both RDDs are large, turning the larger one into a broadcast variable is likely to cause a memory overflow.

Troubleshooting

1. Avoid reduce-side OOM (out of memory)

During the shuffle, the reduce-side tasks do not wait until the map side has written all of its data to disk before pulling; instead, as soon as the map side has written a little data, the reduce tasks pull a small portion of it and immediately proceed with aggregation and the use of operator functions.

How much data a reduce-side task can pull at a time is decided by its pull buffer: pulled data is placed in the buffer first and processed afterwards. The default buffer size is 48 MB.

A reduce-side task pulls and computes at the same time; it may not fill the full 48 MB every time, and in most cases it pulls part of that and processes it.

Although increasing the reduce-side buffer reduces the number of pulls and improves shuffle performance, sometimes the map side produces a very large amount of data very quickly; then every reduce-side task may fill its buffer to the 48 MB maximum, and together with the objects created by the aggregation code running on the reduce side this may produce a large number of objects and lead to a memory overflow, i.e. an OOM.

If a memory overflow occurs on the reduce side, consider shrinking the reduce-side pull buffer, for example to 12 MB.
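
A minimal sketch of that adjustment (the 12 MB value is the one suggested above):

val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "12")   // shrink the reduce-side fetch buffer from the 48 MB default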

This kind of problem has appeared in real production environments and is a typical case of trading performance for execution. Shrinking the reduce-side pull buffer makes OOM less likely, but correspondingly increases the number of pulls and therefore the network transfer overhead, so performance drops.

Note: first make sure the task can run at all, then consider performance optimization.

2. Avoid shuffle file pull failure caused by GC

During Spark jobs the error shuffle file not found sometimes occurs; it is quite common. Often, after this error appears, simply rerunning the job makes it go away.

A possible cause: during the shuffle, a task of the next stage tries to pull data from the executor where a task of the previous stage ran, but that executor happens to be running a GC. A GC stops all work inside the executor, such as the BlockManager and the Netty-based network communication, so the downstream task cannot pull its data; if the data cannot be pulled for a long time, shuffle file not found is reported. The error does not appear when the job is run a second time.

This can be mitigated by adjusting the retry count and retry interval of reduce-side data fetching: increasing these parameters increases the number of fetch retries and lengthens the wait after each failure.

Adjusting the reduce-side fetch retry count and wait interval for shuffle-file pull failures caused by JVM GC:

val conf = new SparkConf()
  .set("spark.shuffle.io.maxRetries", "6")
  .set("spark.shuffle.io.retryWait", "60s")

3. Surge of local network card traffic in yarn-client mode

In yarn-client mode, the driver starts on the local (submitting) machine and is responsible for all task scheduling, so it needs to communicate frequently with many executors on the YARN cluster.

Suppose there are 100 executors and 1,000 tasks, so each executor is assigned 10 tasks. The driver then has to communicate frequently with the 1,000 tasks running on the executors; the communication volume is very large and very frequent. As a result, the network card traffic of the local machine may surge during the job because of this frequent, heavy network communication.

Note that yarn-client mode should only be used in test environments; the reason for using it is that detailed and complete log output can be seen locally. By reading the logs you can find problems in the program and avoid failures in production.

In production you must use yarn-cluster mode. In yarn-cluster mode the local machine's network card traffic will not surge; if network communication problems appear in yarn-cluster mode, they need to be solved together with the operations team.

4. JVM memory (PermGen) overflow in yarn-cluster mode prevents the job from running

When a Spark job contains Spark SQL, it may run fine in yarn-client mode but fail to run when submitted in yarn-cluster mode (an OOM error is reported).

In yarn-client mode the driver runs on the local machine, and the PermGen configuration of the JVM that Spark uses comes from the local spark-class file, where the permanent generation size is 128 MB; that is fine. In yarn-cluster mode, however, the driver runs on a node of the YARN cluster and uses default, unconfigured settings, where the PermGen size is only 82 MB.

Spark SQL internally performs very complex SQL semantic parsing, syntax tree transformation, and so on. If the SQL statement itself is very complex, this can easily cause performance loss and heavy memory usage, especially of PermGen.

So if PermGen usage exceeds 82 MB but stays below 128 MB, the job can run in yarn-client mode but not in yarn-cluster mode.

The solution is to enlarge PermGen by setting the related parameters in the spark-submit script as follows:

--conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M"

This sets the driver's permanent generation to 128 MB initially and at most 256 MB, which avoids the problem described above.

5. Avoid JVM stack memory overflow caused by Spark SQL

When a Spark SQL statement contains hundreds of or keywords, a JVM stack memory overflow may occur on the driver side.

A JVM stack overflow is basically caused by calling too many levels of methods, producing a very deep recursion that exceeds the JVM stack depth limit. (We guess that when a Spark SQL statement contains a large number of or clauses, the SQL parsing, e.g. conversion to a syntax tree or generation of the execution plan, handles or recursively, so many or clauses cause very deep recursion.)

In this case it is recommended to split the single SQL statement into several statements, each with fewer than 100 or clauses. Testing in real production environments shows that keeping the or keywords of a single SQL statement under 100 usually avoids the JVM stack memory overflow.
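
A hedged sketch of such a split, assuming a hypothetical logs table filtered by a long list of ids; the table name, column name, the helper loadIds and the SparkSession spark are all illustrative assumptions:

// Split a filter with hundreds of OR'ed ids into batches of at most 100,
// run one query per batch, and union the results.
val ids: Seq[String] = loadIds()                      // assumed helper returning the id list
val results = ids.grouped(100).toSeq.map { batch =>
  val inList = batch.map(id => s"'$id'").mkString(", ")
  spark.sql(s"SELECT * FROM logs WHERE id IN ($inList)")
}
val combined = results.reduce(_ union _)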


For more material and articles, follow the official account [Learn big data in five minutes].

–end–

Article recommendation
Spark performance tuning – RDD operator tuning