The execution of a Spark SQL call to a Hive insert statement is extremely slow
Li Yang asked 1 year ago

Problem description
When using spark SQL to execute

    sparkSession.sql("insert overwrite table xxxx partition(date_time) select * from zzzz")

it takes about 3 hours to write 1.4 million rows. Saving the same data to local files takes about 2 minutes, and querying the data directly in Hive returns in seconds.
Environment and what has been tried
Version information:

  • Spark 2.1.0
  • Hive 1.2.0
  • Scala 2.11

Related codes

val ss = SparkSession.builder().appName("test spark sql").config(conf).enableHiveSupport().getOrCreate()
// Insert the data of the zzzz table into the xxxx table
ss.sql("insert overwrite table xxxx partition(date_time) select * from zzzz")

What result are you expecting? What is the actual behavior?
Expected: the data is written to xxxx successfully within 5 minutes.
Actual: the write does succeed, but it takes 3 hours.

2 Answers
zhangnew answered 1 year ago

You can save the data in Parquet or ORC format and load it as an external Hive table, which is very fast.
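A minimal sketch of this approach, assuming the table and column names from the question; the output path `/warehouse/xxxx_parquet`, the table name `xxxx_ext`, and the columns `id` and `name` are hypothetical placeholders. The idea is to bypass the slow Hive insert path: write the query result to HDFS as Parquet, then register an external table over those files.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet external table sketch")
  .enableHiveSupport()
  .getOrCreate()

// Write the query result straight to HDFS as Parquet, partitioned by date_time.
// The path is hypothetical; use whatever location your warehouse convention dictates.
spark.sql("select * from zzzz")
  .write
  .mode("overwrite")
  .partitionBy("date_time")
  .parquet("/warehouse/xxxx_parquet")

// Register an external table over the files. Hive stores only metadata here,
// so this statement is cheap; the schema below is illustrative only.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS xxxx_ext (id BIGINT, name STRING)
  PARTITIONED BY (date_time STRING)
  STORED AS PARQUET
  LOCATION '/warehouse/xxxx_parquet'
""")

// Let the metastore discover the partition directories that were just written.
spark.sql("MSCK REPAIR TABLE xxxx_ext")
```

Because the data files are written once by Spark's native Parquet writer and never moved again, this avoids the per-partition file renames that make `insert overwrite` slow on some setups.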

Dada answered 1 year ago

Look, your Hive table has a date_time dynamic partition. Check how many distinct partition values the insert generates; if the number of dynamic partitions is very large, it will severely hurt write performance.
Also, learn to read the logs and understand what your job is doing at each stage. Finding the step that takes the longest will get you much closer to solving the problem.
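For reference, these are the Hive settings commonly involved when a Spark job writes many dynamic partitions; this is a sketch, and the limit values below are illustrative defaults to raise, not tuned numbers for this job (`ss` is the SparkSession from the question).

```scala
// Required for a fully dynamic insert like "partition(date_time)" with no static value
ss.sql("set hive.exec.dynamic.partition=true")
ss.sql("set hive.exec.dynamic.partition.mode=nonstrict")

// Raise the dynamic-partition limits if the insert produces many date_time values;
// the numbers here are placeholders to adjust to your actual partition count
ss.sql("set hive.exec.max.dynamic.partitions=10000")
ss.sql("set hive.exec.max.dynamic.partitions.pernode=1000")
```

If the job fails or crawls once the partition count grows, checking how many distinct `date_time` values the source query produces (`select count(distinct date_time) from zzzz`) is a quick first diagnostic.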
Spark SQL users should also watch out for the small-files problem: Spark can write far too many small files to HDFS. (See: "There are too many small files output from Spark to HDFS. How to solve the problem of Spark small files?")
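One common mitigation for the small-files problem, sketched here as an assumption about how the job could be restructured rather than the asker's actual code: repartition by the partition column before the insert, so each dynamic partition is written by only a few tasks instead of every task writing a file into every partition (`ss` is the SparkSession from the question).

```scala
import org.apache.spark.sql.functions.col

// Shuffling by date_time groups each partition's rows into the same tasks,
// which both reduces the number of output files per partition and tends to
// speed up the dynamic-partition insert itself.
ss.sql("select * from zzzz")
  .repartition(col("date_time"))   // or repartition(n, col("date_time")) to cap file count
  .createOrReplaceTempView("zzzz_repartitioned")

ss.sql("insert overwrite table xxxx partition(date_time) select * from zzzz_repartitioned")
```

The trade-off is one extra shuffle; for 1.4 million rows that shuffle should cost seconds, far less than the hours lost to thousands of tiny files.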