Spark optimization: do small files need to be merged?

Time: 2020-09-30

As we know, most Spark computation happens in memory, so Spark's bottleneck generally comes from a shortage of cluster resources (standalone, YARN, Mesos, K8s): CPU, network bandwidth, and memory. To make Spark fast, you have to make full use of system resources, especially memory and CPU. Sometimes we also need to make optimization adjustments that reduce resource consumption, such as merging small files.

1、 Problem symptoms

We have a table of 150,000 rows with a total data volume of 133 MB; the full-table query SELECT * FROM bi.dwd_tbl_conf_info takes 3 minutes. Another table, bi.ods_tbl_conf_detail, holds 5 million rows with a total data volume of 6.3 GB, yet its full-table query takes only 23 seconds. Both tables are columnar storage tables.
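The two queries, reconstructed from the description above (timings as reported):

-- Full-table scan of the small table (133 MB, ~150,000 rows): about 3 minutes
SELECT * FROM bi.dwd_tbl_conf_info;

-- Full-table scan of the large table (6.3 GB, ~5 million rows): about 23 seconds
SELECT * FROM bi.ods_tbl_conf_detail;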

The large table queries quickly while the small table queries slowly. Why would such a strange thing happen?

2、 Investigating the problem

A table with 6.3 GB of data takes 23 seconds, but the small table with 133 MB takes 3 minutes. This is very strange. We collected the CREATE TABLE statements of the two tables and found no significant difference between them: most columns are strings, and the column counts are similar.

CREATE TABLE IF NOT EXISTS  `bi`.`dwd_tbl_conf_info`  (
  `corp_id` STRING COMMENT '',
  `dept_uuid` STRING COMMENT '',
  `user_id` STRING COMMENT '',
  `user_name` STRING COMMENT '',
  `uuid` STRING COMMENT '',
  `dtime` DATE COMMENT '',
  `slice_number` INT COMMENT '',
  `attendee_count` INT COMMENT '',
  `mr_id` STRING COMMENT '',
  `mr_pkg_id` STRING COMMENT '',
  `mr_parties` INT COMMENT '',
  `is_mr` TINYINT COMMENT 'R',
  `is_live_conf` TINYINT COMMENT ''
);

CREATE TABLE IF NOT EXISTS `bi`.`ods_tbl_conf_detail` (
    `id` string,
    `conf_uuid` string,
    `conf_id` string,
    `name` string,
    `number` string,
    `device_type` string,
    `j_time` bigint,
    `l_time` bigint,
    `media_type` string,
    `dept_name` string,
    `UPDATETIME` bigint,
    `CREATETIME` bigint,
    `user_id` string,
    `USERAGENT` string,
    `corp_id` string,
    `account` string
  );

Since both are very simple SELECT queries, with no complex aggregations, joins, or UDF-related operations, we narrowed the suspicion down to the table read itself. By examining the DAG and task distribution of the two queries, we found the difference.

The fast table's query ran 68 tasks in total, each taking about 7-9 seconds. The slow table's query ran 1,160 tasks, also averaging about 9 seconds each. As shown in the figures below:

[Figure: task distribution of the fast-table query (68 tasks)]

[Figure: task distribution of the slow-table query (1,160 tasks)]

At this point we had basically found the root cause. The large table is 6.3 GB but consists of only 68 files, so it finishes quickly. The small table is only 133 MB, but its file count is huge, producing a large number of tasks. Since each individual task is short, most of the time goes to task scheduling, which makes the overall query slow.
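One quick way to confirm a file-count problem like this (a hedged sketch; the warehouse path below is hypothetical) is to look up the table's storage location and count the files behind it:

-- The "Location" row of the output shows the table's storage path
DESCRIBE FORMATTED bi.dwd_tbl_conf_info;

-- Then count the files under that path from a shell, for example:
-- hdfs dfs -count /user/hive/warehouse/bi.db/dwd_tbl_conf_info

If each of the 1,160 tasks corresponds to roughly one file, then 133 MB spread over that many files averages barely more than 100 KB per file, far below a typical HDFS block size.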

So how can we solve the problem of slow queries on the small table?

3、 Business tuning

This left us with two questions:

1. Why did the small table produce so many small files?

2. How do we merge the small files that have already been generated?

With these two questions we talked to the business developers. It turned out that the small table was populated by the business developers, who queried and cleaned data from the original data tables by time slice and inserted the results. Because the time slices were relatively small, these inserts were particularly frequent, producing a large number of small files, as sketched below.
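A minimal sketch of that write pattern (the source table and time-slice filter are illustrative, not from the original text):

-- Each small, frequent insert adds a few new files to the target table;
-- repeated once per time slice, the file count grows into the thousands.
INSERT INTO bi.dwd_tbl_conf_info
SELECT *                       -- the actual cleaning/projection logic is elided
FROM bi.some_ods_source        -- hypothetical source table
WHERE dtime = '2020-09-30';    -- one narrow time slice per insert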

We therefore needed to solve two problems: how to merge the existing historical small files, and how to ensure that the subsequent business process stops generating them. We guided the business developers through the following optimizations:

1) Use INSERT OVERWRITE TABLE bi.dwd_tbl_conf_info SELECT * FROM bi.dwd_tbl_conf_info to merge the historical data: the overwrite reads all the existing small files and replaces them with a much smaller number of newly written, larger files. After the merge, the full-table query time dropped from 3 minutes to 9 seconds.

2) Change the original table into a partitioned table, writing data into different partitions by time at insert time. Queries then read only the partitions covering the required time range, further reducing the amount of data scanned. Both steps are sketched in SQL after this list.
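A hedged SQL sketch of both steps. The COALESCE hint (available in Spark 2.4+), the choice of dtime as the partition column, and the _p table name are assumptions, not from the original text:

-- (1) Merge historical small files by rewriting the table over itself.
-- The optional COALESCE hint caps the number of output files explicitly.
INSERT OVERWRITE TABLE bi.dwd_tbl_conf_info
SELECT /*+ COALESCE(8) */ * FROM bi.dwd_tbl_conf_info;

-- (2) Recreate the table as a partitioned table: dtime moves out of the
-- column list into PARTITIONED BY.
CREATE TABLE IF NOT EXISTS `bi`.`dwd_tbl_conf_info_p` (
  `corp_id` STRING,
  `dept_uuid` STRING,
  `user_id` STRING,
  `user_name` STRING,
  `uuid` STRING,
  `slice_number` INT,
  `attendee_count` INT,
  `mr_id` STRING,
  `mr_pkg_id` STRING,
  `mr_parties` INT,
  `is_mr` TINYINT,
  `is_live_conf` TINYINT
)
PARTITIONED BY (`dtime` DATE);

-- Writes target one partition per time slice:
INSERT OVERWRITE TABLE bi.dwd_tbl_conf_info_p PARTITION (dtime = '2020-09-30')
SELECT corp_id, dept_uuid, user_id, user_name, uuid, slice_number,
       attendee_count, mr_id, mr_pkg_id, mr_parties, is_mr, is_live_conf
FROM bi.dwd_tbl_conf_info
WHERE dtime = '2020-09-30';

-- Reads prune to only the partitions needed:
SELECT * FROM bi.dwd_tbl_conf_info_p
WHERE dtime BETWEEN '2020-09-01' AND '2020-09-30';

With partitioning, a query for one month touches only that month's files instead of the whole table.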
