[DataMagic] How to use Spark on trillion-level data volumes

Date: 2020-07-05

Welcome to the Tencent Cloud+ Community for more of Tencent's hands-on technology practice articles.

This article was first published in the Cloud+ Community and may not be reproduced without permission.
Author: Zhang Guopeng, Operations and Development Engineer at Tencent

1. Preface

As a big data computing engine, Spark has quickly taken hold in the big data computing field thanks to its speed, stability, and simplicity. This article is based on the author's experience building and using a Spark computing platform, and aims to give readers some ideas for learning Spark. It covers the role Spark plays in the DataMagic platform, how to get up to speed with Spark quickly, and how to use Spark well within DataMagic.

2. Spark's Role in the DataMagic Platform

Figure 2-1: Overall architecture of the DataMagic platform
The main functions of the overall architecture are log ingestion, query (real-time and offline), and computation. The offline computing platform is mainly responsible for the computation part, and the system's storage is COS (the company's internal storage) rather than HDFS.
Next we focus on the Spark on YARN architecture, shown in Figure 2-2, which illustrates how a Spark application runs on YARN.
Figure 2-2: Spark on YARN architecture

3. How to Master Spark Quickly

In my experience, this can be broken down into the following four steps.
1. Understand Spark's terminology
For beginners, the quickest way to understand Spark is through its architecture diagram and its key terms. Mastering the key terms gives you a basic understanding of Spark. The architectural terms include shuffle, partition, MapReduce, driver, application master, container, resource manager, and node manager; the key API terms are RDD and DataFrame. The architectural terms help you understand how Spark runs, while the API terms are what you use when writing code. Once you have mastered these terms and the knowledge behind them, you understand both Spark's operating principles and its programming model.
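To make the API terms concrete, here is a minimal PySpark sketch (purely illustrative, not DataMagic code) that touches both an RDD and a DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("terms-demo").getOrCreate()

# RDD: a low-level distributed collection processed with functional operators
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
squares = rdd.map(lambda x: x * x).collect()          # [1, 4, 9, 16]

# DataFrame: a structured, schema-aware abstraction built on top of RDDs
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.groupBy("label").count().show()

The driver runs this code, the application master negotiates containers from the resource manager, and the executors inside those containers run the RDD and DataFrame tasks.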

2. Master the key configuration
When Spark runs, much of its behavior is read from configuration, usually in spark-defaults.conf. To use Spark well you need to master some key settings, such as the memory-related spark.yarn.executor.memoryOverhead and spark.executor.memory, the timeout-related spark.network.timeout, and so on. Many aspects of Spark's behavior can be changed through configuration, so you need a solid grasp of it. You also have to match configuration to the scenario. Take spark.speculation as an example. Its purpose is speculative execution: when worker 1 is slow to finish a task, Spark starts a worker 2 to run the same task, and whichever finishes first has its result used, which speeds up the computation nicely. However, if a task that writes out to MySQL runs on two identical workers, the data in MySQL will be duplicated. So before using a configuration option, make sure you understand it clearly; searching for "spark conf" will directly list many of the available settings.
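As an illustration only (the values are made up, not DataMagic's settings), such options can also be set when building the session; here speculative execution is disabled precisely because this hypothetical job writes out to MySQL:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("export-to-mysql")
    .config("spark.executor.memory", "8g")                  # executor heap size
    .config("spark.yarn.executor.memoryOverhead", "2048")   # off-heap overhead, in MB
    .config("spark.network.timeout", "300s")                # network/RPC timeout
    .config("spark.speculation", "false")                   # avoid duplicate writes to MySQL
    .getOrCreate()
)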

3. Make good use of Spark's parallelism
The reason we use Spark for computation is that it is fast, and it is fast because of its parallelism. Mastering how Spark provides parallelism is the key to raising it further.

For RDD jobs, improving parallelism involves several settings: 1. the number of executors (num-executors); 2. the cores per executor (executor-cores); 3. spark.default.parallelism. A common rule of thumb for their relationship is spark.default.parallelism = num-executors * executor-cores * 2 to 3. For Spark SQL, tune spark.sql.shuffle.partitions together with num-executors and executor-cores.
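A hedged sketch of these parallelism settings (the numbers are illustrative, not a recommendation for any specific workload): 50 executors with 4 cores each give 200 concurrent task slots, so the parallelism values are set to roughly two to three times that.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parallelism-tuning")
    .config("spark.executor.instances", "50")        # num-executors
    .config("spark.executor.cores", "4")             # executor-cores
    .config("spark.default.parallelism", "400")      # used by RDD operations
    .config("spark.sql.shuffle.partitions", "400")   # used by Spark SQL shuffles
    .getOrCreate()
)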

4. Learn how to modify Spark's code
Novices often feel lost, especially when they need to optimize or modify Spark itself. In fact, you can focus on one local part first; Spark is genuinely modular, so there is no need to assume it is too complex to understand. Here is how I approach modifying Spark's code.
First, Spark's directory structure is shown in Figure 3-1; from the folder names you can quickly locate the code for SQL, GraphX, and so on. Spark's runtime environment, meanwhile, is mainly made up of jar packages, a subset of which is shown in Figure 3-2 (in practice there are many more). All of these jars can be built from Spark's source code. When a function needs to be modified, you only need to find the relevant jar, modify its code, recompile that jar, and replace it.
Figure 3-1: Spark source code directory structure
Figure 3-2: Spark runtime jar packages (excerpt)
As for compiling the source code, it is actually quite simple: after installing Maven, Scala, and the other required dependencies, you can download the source and build it. Being able to modify the source code is an important skill for making good use of any open source project.

4. Spark in the DataMagic Platform

Using Spark in DataMagic has been a process of exploring while using it. The more important practices from that process are listed below.

1. Rapid deployment
In computing, the number of jobs and the volume of data change every day, so the Spark platform needs to support rapid deployment. For physical machines there is a one-click deployment script: running a single script brings a machine with 128 GB of memory and 48 cores online immediately. However, physical machines usually have to be requested through an approval process, so Docker containers are also used to supply additional computing resources.

2. Use configuration skillfully to optimize computation
Most of Spark's behavior is driven by configuration, so its runtime behavior can be changed dynamically through configuration. Here is an example: the number of executors can be adjusted automatically through the configuration below.

2.1 Add the following configuration to yarn-site.xml on the NodeManager

   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
   </property>

2.2 Copy spark-2.2.0-yarn-shuffle.jar into the Hadoop YARN lib directory (that is, YARN's library directory)
2.3 Add the following configuration to Spark's spark-defaults.conf

spark.dynamicAllocation.enabled       true   # enable dynamic executor allocation
spark.shuffle.service.enabled         true   # external shuffle service registered in step 2.1
spark.dynamicAllocation.minExecutors  1      # minimum number of executors
spark.dynamicAllocation.maxExecutors  100    # maximum number of executors

With this configuration in place, the number of executors is adjusted automatically.
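The same dynamic-allocation settings can also be supplied per application instead of cluster-wide; a minimal sketch (values illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")        # relies on the YARN shuffle service from 2.1/2.2
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "100")
    .getOrCreate()
)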

3. Allocate resources reasonably
As a platform, its computing tasks are certainly not fixed: some involve large amounts of data, others small amounts, so resources have to be allocated sensibly. For data in the tens of millions to hundreds of millions of rows, allocating around 20 cores of computing resources is enough; but when the data reaches the tens of billions, far more computing resources need to be allocated, following point 3 of Section 3.
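Purely as an illustration of the idea (the thresholds and sizes below are assumptions, not DataMagic's actual policy), the platform-side logic can be as simple as picking resources from the input size:

def suggest_resources(row_count):
    """Map a rough input size to executor resources (illustrative heuristic only)."""
    if row_count <= 100_000_000:
        # tens of millions to hundreds of millions of rows: ~20 cores is enough
        return {"num_executors": 5, "executor_cores": 4}
    # tens of billions of rows and above: scale out much further
    return {"num_executors": 50, "executor_cores": 4}

print(suggest_resources(50_000_000))   # {'num_executors': 5, 'executor_cores': 4}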

4. Meet business needs
In fact, the purpose of computation is to serve the business, and business needs should drive the platform. When the business has reasonable requirements, the platform side should try to meet them. For example, to support high-concurrency, high-real-time query requirements, Spark's data delivery supports a CMongo output mode, as in the code below.

from pyspark import SparkContext
from pyspark.sql import SQLContext

# conf, dbparameter, file_name, tempTable, sparksql and pg_table_name come from the job's parameters
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Parse the connection parameters, e.g. "user=... password=... host=... port=... dbname=..."
database = dict(l.split('=') for l in dbparameter.split())

# Read the Parquet input, register it as a temp table and run the SQL
parquetFile = sqlContext.read.parquet(file_name)
parquetFile.registerTempTable(tempTable)
result = sqlContext.sql(sparksql)

# Write the result to CMongo through the MongoDB Spark connector
url = "mongodb://" + database['user'] + ":" + database['password'] + "@" + database['host'] + ":" + database['port']
result.write.format("com.mongodb.spark.sql") \
    .mode('overwrite') \
    .options(uri=url, database=database['dbname'], collection=pg_table_name) \
    .save()

5. Applicable scenarios
As a general-purpose computing platform, Spark does not need to be modified for common application scenarios, but on the DataMagic platform we sometimes need to make targeted changes. Here is a simple scenario: in log analysis the log volume reaches the level of 100 billion records per day, and when some fields of the raw logs cannot be decoded as UTF-8, the Spark task throws an exception during computation and then fails. Filtering out the malformed data before it lands in storage could hurt ingestion efficiency, so we decided to solve the problem inside the Spark computation instead: the data-conversion code was wrapped with exception handling so that undecodable records no longer break the job.
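A minimal sketch of that idea (not the platform's actual code; raw_rdd and the field layout are assumptions): wrap the decode step in an exception handler so undecodable records are dropped instead of failing the whole task.

def safe_decode(raw_bytes):
    """Return the UTF-8 text of a raw log field, or None if it cannot be decoded."""
    try:
        return raw_bytes.decode('utf-8')
    except (UnicodeDecodeError, AttributeError):
        return None

# Keep only the records whose fields decoded successfully
decoded = raw_rdd.map(safe_decode).filter(lambda line: line is not None)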

6. Locating job problems
When a Spark job fails, you need to locate the cause. After a failure, you can merge the task logs with yarn logs -applicationId <application id>, open the merged log, and search for the traceback; the cause of the failure can usually be found there. Broadly speaking, failures fall into several categories.

a. Code problems: the SQL that was written has a syntax error, or the Spark application code itself has a bug.

b. Problems in Spark itself, for example the way older Spark versions handled null values.

c. If the task runs for a very long time, it is likely a data-skew problem.

d. The task exceeds its memory limit.

7. Cluster management
In daily use the Spark cluster also needs operation and maintenance, so that problems can be found and the cluster continuously optimized. The following practices help keep the cluster robust and stable and keep jobs running smoothly.

a. Regularly check for lost or unhealthy nodes; a scheduled script can raise an alarm, and any such node needs to be investigated (see the sketch after this list).

b. Regularly scan the HDFS run logs to see whether the disk is filling up, and delete expired logs on a schedule.

c. Regularly check whether the cluster's resources still meet the needs of the computing tasks, so that extra resources can be deployed in advance.
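For point a, a hedged sketch of such a check (the ResourceManager host and the alerting hook are assumptions) using YARN's ResourceManager REST API:

import requests

RM_NODES_URL = "http://resourcemanager-host:8088/ws/v1/cluster/nodes"

def check_nodes():
    """Alert on NodeManagers that YARN reports as LOST or UNHEALTHY."""
    nodes = requests.get(RM_NODES_URL, timeout=10).json()["nodes"]["node"]
    bad = [n["id"] for n in nodes if n["state"] in ("LOST", "UNHEALTHY")]
    if bad:
        print("ALERT: problem nodes:", bad)   # replace with your own alerting system

if __name__ == "__main__":
    check_nodes()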

5. Summary

This article has described, based on the author's experience building and using the computing platform, how Spark is used in DataMagic today. The platform is now used for offline analysis, and the volume of data computed and analyzed every day has reached the level of hundreds of billions to a trillion records.

This article was authorized by the author for release on the Tencent Cloud+ Community: https://cloud.tencent.com/dev…