Big data training: Spark performance tuning and parameter configuration

Time: 2022-11-24

Spark Performance Tuning – Basics

As we all know, correct parameter configuration goes a long way toward improving Spark's efficiency, and it helps data developers and analysts use Spark more effectively for offline batch processing and SQL report analysis.

The recommended parameter configuration template is as follows:

1. spark-submit submission script
/xxx/spark23/xxx/spark-submit --master yarn-cluster \
--name ${mainClassName} \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.yarn.maxAppAttempts=2 \
--conf spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC \
--driver-memory 2g \
--conf spark.sql.shuffle.partitions=1000 \
--conf hive.metastore.schema.verification=false \
--conf spark.sql.catalogImplementation=hive \
--conf spark.sql.warehouse.dir=${warehouse} \
--conf spark.sql.hive.manageFilesourcePartitions=false \
--conf hive.metastore.try.direct.sql=true \
--conf spark.executor.memoryOverhead=512M \
--conf spark.yarn.executor.memoryOverhead=512 \
--executor-cores 2 \
--executor-memory 4g \
--num-executors 50 \
--class <startup class> \
${jarPath} \
-M ${mainClassName}

2. spark-sql submission script

option=/xxx/spark23/xxx/spark-sql

export SPARK_MAJOR_VERSION=2

${option} --master yarn-client \
--driver-memory 1G \
--executor-memory 4G \
--executor-cores 2 \
--num-executors 50 \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
--conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER \
--conf spark.sql.auto.repartition=true \
--conf spark.sql.autoBroadcastJoinThreshold=104857600 \
--conf "spark.sql.hive.metastore.try.direct.sql=true" \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=200 \
--conf spark.dynamicAllocation.executorIdleTimeout=10m \
--conf spark.port.maxRetries=300 \
--conf spark.executor.memoryOverhead=512M \
--conf spark.yarn.executor.memoryOverhead=512 \
--conf spark.sql.shuffle.partitions=10000 \
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.adaptive.shuffle.targetPostShuffleInputSize=134217728 \
--conf spark.sql.parquet.compression.codec=gzip \
--conf spark.sql.orc.compression.codec=zlib \
--conf spark.ui.showConsoleProgress=true \
-f pro.sql

pro.sql is the business logic script

Spark Performance Tuning – Advanced

For readers who want to understand Spark's underlying principles, this article walks through the interaction diagrams of three common task submission modes, standalone, Yarn-client, and Yarn-cluster, to help users grasp Spark's core mechanics more intuitively and lay a solid foundation for the advanced content that follows.

Standalone

[Figure: Standalone mode task submission interaction diagram]

1) spark-submit constructs a Driver (DriverActor) process via reflection;

2) The Driver process executes the user application, constructing the SparkConf and then the SparkContext;

3) During initialization, the SparkContext constructs the DAGScheduler and TaskScheduler and starts the web UI via Jetty;

4) The TaskScheduler's SparkDeploySchedulerBackend communicates with the Master and requests registration of the Application;

5) After receiving the request, the Master registers the Application, applies its resource scheduling algorithm, and notifies the Workers to start Executors;

6) The Workers start Executors for the application; once started, the Executors register back with the TaskScheduler;

7) After all Executors have registered back with the TaskScheduler, the Driver finishes initializing the SparkContext;

8) The Driver continues executing the user application; every time an action is executed, a job is created;

9) The job is submitted to the DAGScheduler, which divides it into multiple stages (stage division algorithm) and creates a TaskSet for each stage;

10) The TaskScheduler submits each task in the TaskSet to an Executor for execution (task allocation algorithm);

11) Every time an Executor receives a task, it wraps the task in a TaskRunner and takes a thread from the Executor's thread pool to run it (TaskRunner: copy the user code/operators/functions, deserialize them, and execute the task).

Yarn-client

[Figure: Yarn-client mode task submission interaction diagram]

1) The client sends a request to the ResourceManager (RM) to start an ApplicationMaster (AM);

2) The RM allocates a container on a NodeManager (NM) and starts the AM, which in this mode is actually an ExecutorLauncher;

3) The AM applies to the RM for containers;

4) The RM assigns containers to the AM;

5) The AM asks the NMs to start the corresponding Executors;

6) Once started, the Executors register back with the Driver process running on the client;

7) The subsequent stage division and TaskSet submission are similar to standalone mode.

Yarn-cluster

[Figure: Yarn-cluster mode task submission interaction diagram]

1) The client sends a request to the ResourceManager (RM) to start an ApplicationMaster (AM);

2) The RM allocates a container on a NodeManager (NM) and starts the AM (in this mode the Driver runs inside the AM);

3) The AM applies to the RM for containers;

4) The RM assigns containers to the AM;

5) The AM asks the NMs to start the corresponding Executors;

6) Once started, the Executors register back with the AM;

7) The subsequent stage division and TaskSet submission are similar to standalone mode.

After understanding the low-level interaction of the three common task submission modes above, the rest of the article expands on three aspects, storage format, data skew, and parameter configuration, and shares some advanced techniques for Spark performance tuning.

Storage format (file format, compression algorithm)

As we all know, different SQL engines optimize differently for different storage formats; for example, Hive leans toward ORC while Spark leans toward Parquet. At the same time, big data workloads frequently involve point queries, wide-table queries, and large-table joins, which call for file formats that are columnar and splittable. We therefore recommend Parquet and ORC as columnar storage file formats and gzip, snappy, and zlib as compression algorithms; in combination, we recommend parquet+gzip and orc+zlib. These combinations provide both columnar storage and splittability, and compared with txt+gz, which is row-based and not splittable, they are better suited to the big data scenarios above.

Taking about 500 GB of production data as an example, we ran performance tests on different combinations of storage file formats and compression algorithms under different cluster environments and SQL engines. The test data shows that, under the same resource conditions, the parquet+gz storage format is at least 60% faster than the text+gz storage format for multi-value queries and multi-table joins.

Based on the test results, we compiled the recommended storage formats for different cluster environments and SQL engines, as shown in the following table:

[Table: recommended storage formats for different cluster environments and SQL engines]

We also tested the storage consumption of parquet+gz and orc+zlib. Taking the data of a single historical partition of one table as an example, parquet+gz and orc+zlib save 26% and 49% of storage space respectively compared with txt+gz.

The complete test results are as follows:

[Table: complete storage test results]

It can be seen that parquet+gz and orc+zlib are indeed effective at reducing cost and improving efficiency. So how do you use these two storage formats? Proceed as follows:

➤ Enable the compression codec for the specified file format in Hive and Spark

spark:

set spark.sql.parquet.compression.codec=gzip;

set spark.sql.orc.compression.codec=zlib;

hive:

set hive.exec.compress.output=true;

set mapreduce.output.fileoutputformat.compress=true;

set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

➤ Specify the file format when creating the table

Parquet file format (serialization, input and output classes)

CREATE EXTERNAL TABLE test(rand_num double)
PARTITIONED BY (day int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
;

ORC file format (serialization, input and output classes)

ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
;

➤ Online adjustment

ALTER TABLE db1.table1_std SET TBLPROPERTIES ('parquet.compression'='gzip');

ALTER TABLE db2.table2_std SET TBLPROPERTIES ('orc.compression'='ZLIB');

➤ CTAS table creation

create table tablename stored as parquet as select ……;

create table tablename stored as orc TBLPROPERTIES ('orc.compress'='ZLIB') as select ……;

Data skew

Data skew can be divided into map-side skew and reduce-side skew. This article focuses on reduce-side skew: operations that are common in SQL, such as group by and join, are the hardest-hit areas. When data skew occurs, it generally manifests as some tasks being significantly slower than others in the same batch, a task's data volume being significantly larger than that of other tasks, some tasks hitting OOM, and lost Spark shuffle files. As shown in the figure below, from the Duration column and the Shuffle Read Size/Records column we can clearly see that the amount of data processed by some tasks increased significantly and their runtime grew accordingly, indicating data skew:

[Figure: Spark UI task list with skewed Duration and Shuffle Read Size/Records columns]

How to solve data skew?

We have summarized 7 data skew solutions that can help you solve common data skew problems:

Solution 1: Use Hive ETL to preprocess data

That is, push the skew problem upstream in the data lineage, so that downstream users no longer need to consider data skew.

⁕ This solution is suitable for downstream businesses with strong interactivity requirements, such as second-level or minute-level data retrieval queries.
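
As a reference, here is a minimal SQL sketch of this idea, assuming hypothetical upstream and downstream tables ods_log and dws_log_agg (all table and column names are illustrative, not from the original):

-- Upstream Hive ETL: pre-aggregate the skewed key once per partition (illustrative names)
INSERT OVERWRITE TABLE dws_log_agg PARTITION (day = '${day}')
SELECT id, COUNT(1) AS pv
FROM ods_log
WHERE day = '${day}'
GROUP BY id;
-- Downstream queries read the pre-aggregated table and no longer run the skewed group by themselves
SELECT id, pv FROM dws_log_agg WHERE day = '${day}';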

Solution 2: Filter a few keys that cause skew

That is, remove the large keys that cause the skew. This scheme is generally used together with percentiles; for example, if 99.99% of the ids have no more than 100 records each, the ids with more than 100 records can be considered for removal.

⁕ This solution is more practical in statistical (aggregation) scenarios; in detail-level scenarios, you need to check whether the filtered keys are ones the business cares about.
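
A minimal SQL sketch of this approach, assuming a hypothetical fact table fact_log with a skewed key column id (the names are illustrative):

-- Step 1: inspect the key distribution to locate the large keys
SELECT id, COUNT(1) AS cnt
FROM fact_log
GROUP BY id
ORDER BY cnt DESC
LIMIT 100;
-- Step 2: exclude the identified large keys (here id = 1) from the skewed aggregation or join
SELECT id, COUNT(1) AS cnt
FROM fact_log
WHERE id <> 1
GROUP BY id;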

Solution 3: Increase the parallelism of the shuffle operation

That is, dynamically adjust the spark.sql.shuffle.partitions parameter to increase the number of partitions written by shuffle write tasks, so that keys are distributed more evenly. In Spark SQL 2.3 the default value is 200. Developers can add the following parameters to the startup script to adjust it dynamically:

--conf spark.sql.shuffle.partitions=10000
--conf spark.sql.adaptive.enabled=true
--conf spark.sql.adaptive.shuffle.targetPostShuffleInputSize=134217728

⁕ This scheme is very simple, and it works well when keys are evenly distributed. For example, suppose there are originally 10 keys with 50 records each, all in one partition, so the downstream task has to process 500 records. By increasing the number of partitions, each task can process 50 records, 10 tasks run in parallel, and the runtime is only 1/10 of the original. However, this solution is hard to apply to large keys: if a single large key has millions of records, it will still be allocated to one task.

Solution 4: Convert reduce join to map join

That is, join on the map side without going through the shuffle process. Taking Spark as an example, the data of the small RDD can be delivered to each Worker node (an NM in Yarn mode) as a broadcast variable, and the join is performed on each Worker node.

⁕ This solution is suitable for scenarios where a small table joins a large table (more than 100 GB of data). The default threshold for a small table here is 10 MB; tables below this threshold can be broadcast to the Worker nodes. The upper limit can be raised, but it must stay below the memory allocated to the container.
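
For reference, a minimal sketch of two common ways to trigger a map join in Spark SQL: raising the broadcast threshold (the value below mirrors the 100 MB setting in the template above) or hinting the small table explicitly; the table names big_fact and small_dim are illustrative:

-- Raise the broadcast threshold so tables under ~100 MB are broadcast automatically
set spark.sql.autoBroadcastJoinThreshold=104857600;
-- Or hint the small table explicitly so it is broadcast to every Executor
SELECT /*+ BROADCAST(s) */ b.id, s.name
FROM big_fact b
JOIN small_dim s ON b.id = s.id;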

Solution 5: Sampling the skewed key and splitting the join operation

The following figure shows an example: table A joins table B; table A has a big key while table B does not; the big key's id is 1 and it has 3 records.

[Figure: splitting the join for the skewed key id = 1]

How to split the join operation?

First, split the records with id 1 out of table A and table B; the remaining A' and B', which no longer contain the big key, are joined first at normal (non-skewed) speed;
Then add a random prefix to the big key's records in table A, expand the matching records in table B N times, and join them separately, removing the random prefix after the join;
Finally, union the two result sets.
⁕ The essence of this solution is to reduce the risk of a single task processing too much data because of skew; it is suitable when there are only a few large keys.
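
A minimal SQL sketch of the split, reusing tables A and B and the skewed id = 1 from the example above; the value and name columns, the salt column, and N = 3 are illustrative, and the salt is kept in a separate column instead of being concatenated onto the key, so no prefix needs to be stripped afterwards:

-- Part 1: join A' and B' (with the big key removed) at normal, non-skewed speed
SELECT a.id, a.value, b.name
FROM (SELECT * FROM A WHERE id <> 1) a
JOIN (SELECT * FROM B WHERE id <> 1) b ON a.id = b.id
UNION ALL
-- Part 2: salt the big key's rows in A with a random value in [0, 3) and expand B's matching rows 3 times
SELECT a.id, a.value, b.name
FROM (SELECT id, value, CAST(rand() * 3 AS INT) AS salt FROM A WHERE id = 1) a
JOIN (SELECT id, name, t.salt FROM B LATERAL VIEW explode(array(0, 1, 2)) t AS salt WHERE id = 1) b
ON a.id = b.id AND a.salt = b.salt;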

Solution 6: Use random prefixes and expand the RDD for the join

Again taking table A joining table B as an example, where table A has large keys and table B does not:

Add a random prefix in [1, n] to every record in table A and expand table B N times, then join the two;
After the join is completed, remove the random prefix.
⁕ This solution is suitable when there are many large keys, but it also increases resource consumption.
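
The same salting pattern, sketched for the whole table rather than a single skewed key (N = 5 here; the table and column names remain illustrative):

-- Salt every row of A with a random value in [0, 5) and expand B 5 times before joining
SELECT a.id, a.value, b.name
FROM (SELECT id, value, CAST(rand() * 5 AS INT) AS salt FROM A) a
JOIN (SELECT id, name, t.salt FROM B LATERAL VIEW explode(array(0, 1, 2, 3, 4)) t AS salt) b
ON a.id = b.id AND a.salt = b.salt;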

Solution 7: Use a combiner on the map side

That is, perform combiner operations on the map side to reduce the amount of data pulled during the shuffle.

⁕ This scheme is suitable for aggregation scenarios such as cumulative sums.

In practice, developers should analyze the specific situation; for complex problems, the above methods can also be used in combination.

Spark parameter configuration

For cases without data skew, we compiled a reference table of parameter configurations to help you optimize Spark performance. These settings suit the analysis and processing of roughly 2 TB of data and cover the tuning needs of most scenarios.

[Table: recommended Spark parameter configuration reference]

Summary

Spark has now evolved into the 3.x line, with the latest release being Spark 3.1.2 (June 1, 2021). Many new features in Spark 3.x, such as dynamic partition pruning, major improvements to the pandas API, and enhanced nested column pruning and pushdown, provide good ideas for further cost reduction and efficiency gains. Going forward, Getui will continue to follow Spark's evolution, and keep practicing and sharing.

This article comes from Data Warehouse and Python Big Data.
