As big data architectures evolve, separating storage from compute better meets users' demands for lower data storage costs and on-demand scheduling of compute resources, and is becoming the choice of more and more teams. Compared with HDFS, storing data on object storage reduces storage cost, but the write performance of object storage for massive numbers of files is much worse.
Tencent Cloud Elastic MapReduce (EMR) is Tencent Cloud's hosted, elastic, open-source Hadoop-ecosystem service, supporting big data frameworks such as Spark, HBase, Presto, Flink and Druid.
Recently, while supporting an EMR customer, we encountered a typical storage-compute separation scenario. The customer uses the Spark component in EMR as the compute engine, with data stored on object storage. During the optimization work we found that Spark's write performance in massive-file scenarios was poor, dragging down the overall performance of the architecture.
After in-depth analysis and optimization, we greatly improved write performance; in particular, writes to object storage became more than 10 times faster, accelerating the customer's business processing and earning their praise.
This article describes how the Spark compute engine in Tencent Cloud EMR improves write performance in massive-file scenarios under a storage-compute separated architecture. I hope to exchange ideas with you. Author: Zhong Degeng, Tencent backend development engineer.
1、 Background of the problem
Apache Spark is a fast, general-purpose compute engine designed for large-scale data processing, suitable for building large, low-latency data analysis applications. Spark is a general parallel framework similar to Hadoop MapReduce, open-sourced by UC Berkeley's AMPLab (the AMP Laboratory at the University of California, Berkeley), and it retains the advantages of Hadoop MapReduce.
Unlike Hadoop, Spark is tightly integrated with Scala, in which distributed datasets can be manipulated as easily as local collection objects. Although Spark was created to support iterative jobs on distributed datasets, it is in fact a complement to Hadoop and can run in parallel on the Hadoop file system or on cloud storage.
In this optimization effort, the compute engine we studied is the Spark component of the EMR product. Thanks to its excellent performance and other advantages, more and more customers choose it as their big data compute engine.
For storage, the customer chose object storage. In terms of data storage, object storage offers reliability, scalability and lower cost. Compared with the Hadoop file system HDFS, object storage is a better low-cost storage option, and massive amounts of warm and cold data are especially suited to it.
In the Hadoop ecosystem, native HDFS storage remains an essential choice in many scenarios, so we also include a write-performance comparison with HDFS below.
Returning to the problem at hand, let's first look at a set of test data. Based on the Spark 2.x engine, we used SparkSQL to write 5,000 files to HDFS and to object storage respectively, and measured the execution time.
From the test results, writing to object storage took 29 times as long as writing to HDFS; write performance on object storage is far worse. While observing the write process, we found that network IO was not the bottleneck, so we needed to analyze in depth the specific data output process of the compute engine.
2、 Analysis of Spark's data output process
1. Spark data flow
First, the following figure shows the main data flow during the execution of a Spark job:
First, each task writes its result data to a temporary directory on the underlying file system, _temporary/task_[id]; the resulting directory layout is shown below:
At this point, the task's work on the executor is finished. Next, the driver moves these result files to the location directory of the Hive table. This involves three steps:
The first step calls the commitJob method of the OutputCommitter to transfer and merge the temporary files.
As the diagram above shows, commitJob merges all data files under the task_[id] subdirectories into the upper-level directory ext-10000.
Next, if the write mode is overwrite, the existing data in the table or partition is moved to the trash recycle bin.
After these operations, the data files merged in the first step are moved to the location of the Hive table, and all data operations are complete.
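The three driver-side steps above can be illustrated with a simplified local-filesystem simulation. This is only a sketch: the directory names (`ext-10000`, a `.Trash` folder) follow the article's example, while the real paths and APIs inside Spark and Hive differ.

```java
import java.io.IOException;
import java.nio.file.*;

// Simplified simulation of the driver-side output flow:
// commitJob merge -> trash of old data -> final move to the table location.
public class DriverCommitSketch {
    static long runDemo() throws IOException {
        Path base = Files.createTempDirectory("spark-output");
        Path staging = base.resolve("_temporary");
        Path tableLocation = Files.createDirectories(base.resolve("warehouse/t1"));
        Files.writeString(tableLocation.resolve("old-part-0"), "old"); // pre-existing data

        // Each task wrote its result under _temporary/task_[id]
        for (int task = 0; task < 3; task++) {
            Path taskDir = Files.createDirectories(staging.resolve("task_" + task));
            Files.writeString(taskDir.resolve("part-" + task), "data-" + task);
        }

        // Step 1: commitJob — merge every task_[id] subdirectory into one directory
        Path merged = Files.createDirectories(base.resolve("ext-10000"));
        try (DirectoryStream<Path> tasks = Files.newDirectoryStream(staging)) {
            for (Path taskDir : tasks) {
                try (DirectoryStream<Path> files = Files.newDirectoryStream(taskDir)) {
                    for (Path f : files) Files.move(f, merged.resolve(f.getFileName()));
                }
            }
        }

        // Step 2: trashFiles — overwrite mode moves existing table data to a trash dir
        Path trash = Files.createDirectories(base.resolve(".Trash"));
        try (DirectoryStream<Path> old = Files.newDirectoryStream(tableLocation)) {
            for (Path f : old) Files.move(f, trash.resolve(f.getFileName()));
        }

        // Step 3: moveFiles — move the merged files into the table's location
        try (DirectoryStream<Path> files = Files.newDirectoryStream(merged)) {
            for (Path f : files) Files.move(f, tableLocation.resolve(f.getFileName()));
        }

        try (var s = Files.list(tableLocation)) { return s.count(); }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("files in table location: " + runDemo());
    }
}
```

Note that every file involved passes through each of the three steps, which is why the driver side matters so much for massive-file jobs.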
2. Locating the root cause
With the above analysis of the Spark data flow, we need to determine whether the performance bottleneck is on the driver side or the executor side. Let's look at the time spent on the executors:
We found that the execution times of the jobs on the executor side differ little, yet the total elapsed time differs greatly, which indicates that the time is mainly spent on the driver side.
On the driver side there are three stages: commitJob, trashFiles and moveFiles. Which of these stages takes the most time?
By inspecting thread dumps through the Spark UI (refresh the Spark UI manually, or log in to the driver node and view the thread stacks), we found that all three stages were slow. Next, we analyze the source code of these three parts.
3. Source code analysis
(1) commitJob stage
Spark uses Hadoop's FileOutputCommitter to handle the file merge. Hadoop 2.x defaults to mapreduce.fileoutputcommitter.algorithm.version=1, which uses a single-threaded for loop to traverse all the task subdirectories and then performs the merge-path operation for each one. Clearly, in massive-file scenarios this step can be very time-consuming.
Especially on object storage, a rename is not just a metadata update: the data must also be copied to a new file.
(2) trashFiles stage
The trashFiles operation is a single-threaded for loop that moves files to the recycle bin. The more data there is to overwrite, the slower this step becomes.
(3) moveFiles stage
Similar to the previous stages, the moveFiles stage also uses a single-threaded for loop to move files.
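All three stages share the same shape: a single-threaded loop issuing one blocking move per file, so total time grows linearly with the file count. A minimal sketch of this pattern (the method name is illustrative, not Spark's actual code):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;

// The shared bottleneck pattern: one blocking move per loop iteration.
// With N files and a per-move latency L, total time is roughly N * L —
// especially painful when a "move" is an object-store copy plus delete.
public class SequentialMove {
    static void moveAll(List<Path> files, Path dest) throws IOException {
        for (Path f : files) {
            Files.move(f, dest.resolve(f.getFileName()));
        }
    }

    static long demo() throws IOException {
        Path src = Files.createTempDirectory("src");
        Path dst = Files.createTempDirectory("dst");
        List<Path> files = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            files.add(Files.writeString(src.resolve("part-" + i), "x"));
        }
        moveAll(files, dst);
        try (var s = Files.list(dst)) { return s.count(); }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("moved: " + demo());
    }
}
```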
4. Summary of problems
- The performance bottleneck of the Spark engine when writing massive files lies on the driver side;
- Most of the time is spent in the driver's commitJob, trashFiles and moveFiles stages;
- All three stages are slow because a single-threaded loop processes files one by one;
- The rename operation on object storage must copy data, which further increases the time to write massive files.
3、 Optimization results
We can see that community-version big data compute engines still have performance problems when accessing object storage. The main reason is that most data platforms are built on HDFS, where renaming a file only requires a metadata change on the NameNode; that operation is very fast and rarely becomes a bottleneck.
At present, separating storage from compute in the cloud is an important way for enterprises to reduce costs. We therefore modified the commitJob, trashFiles and moveFiles code to process files in parallel with multiple threads, improving the performance of file write operations.
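The idea behind the change, replacing each per-file loop with a fixed thread pool, can be sketched as follows. The thread-pool size and method names here are illustrative assumptions; the actual EMR patch modifies the corresponding code paths inside Spark.

```java
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Sketch of the optimization: submit each move to a thread pool so that
// many renames are in flight at once instead of one at a time.
public class ParallelMove {
    static void moveAllParallel(List<Path> files, Path dest, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Path f : files) {
                futures.add(pool.submit(() -> {
                    Files.move(f, dest.resolve(f.getFileName()));
                    return null;
                }));
            }
            for (Future<?> fu : futures) fu.get(); // propagate any IO failure
        } finally {
            pool.shutdown();
        }
    }

    static long demo() throws Exception {
        Path src = Files.createTempDirectory("src");
        Path dst = Files.createTempDirectory("dst");
        List<Path> files = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            files.add(Files.writeString(src.resolve("part-" + i), "data"));
        }
        moveAllParallel(files, dst, 8);
        try (var s = Files.list(dst)) { return s.count(); }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("moved: " + demo());
    }
}
```

On a local filesystem the gain is modest, but when each move is a high-latency object-store request, overlapping the requests is what delivers the order-of-magnitude speedup described below.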
Using the same benchmark, writing 5,000 files with SparkSQL to HDFS and to object storage respectively, we obtained the optimized results shown in the following figure:
In the end, write performance to HDFS improved by 41%, and write performance to object storage improved by 1100%!
From the analysis above, the key to the problem was the single-thread limitation in parts of the Spark engine. A single-thread bottleneck is, in fact, a common pattern; although we suspected this possibility from the beginning, we still had to pinpoint exactly where the limitation was, which required reading the source code and debugging many times.
We also ran into a limitation of object storage itself: its rename operation must copy data, which makes writing massive files slow. We are continuing to improve this as well.
Further optimizing storage-compute separation scenarios, improving performance, and better meeting customers' needs for cost reduction and efficiency is an important goal of the Tencent Cloud Elastic MapReduce (EMR) product R&D team. We welcome discussion of related issues.