When using Spark SQL to execute
sparkSession.sql("insert overwrite table xxxx partition(date_time) select * from zzzz")
it takes about 3 hours to write roughly 1.4 million rows. Saving the same data locally takes only about 2 minutes, and querying the Hive table directly returns in seconds.
Environment of the problem and what methods you have tried
- Spark 2.1.0
- Hive 1.2.0
- scala 2.11
// Please paste the code text below (do not replace the code with pictures)
val ss = SparkSession.builder()
  .appName("test spark sql")
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()

// Insert the data of the zzzz table into the xxxx table
ss.sql("insert overwrite table xxxx partition(date_time) select * from zzzz")
What result are you expecting? What is the actual error message?

Expected: the job completes within 5 minutes and the data is written to xxxx successfully.
Actual: the data is written successfully, but it takes 3 hours.
You can save the data in Parquet or ORC format and load it as a Hive external table, which is very fast.
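A minimal sketch of this approach, assuming a SparkSession `ss` as in the question; the HDFS path, the external table name `xxxx_ext`, and the column list are placeholders you would replace with your own schema:

```scala
// Write the source data as Parquet, partitioned by date_time (untested sketch).
val df = ss.sql("select * from zzzz")
df.write
  .mode("overwrite")
  .partitionBy("date_time")
  .parquet("/warehouse/external/xxxx")

// Expose the Parquet files to Hive as an external table.
// Replace "(...)" with the actual non-partition columns of zzzz.
ss.sql(
  """create external table if not exists xxxx_ext (...)
    |partitioned by (date_time string)
    |stored as parquet
    |location '/warehouse/external/xxxx'""".stripMargin)

// Register the partition directories with the Hive metastore.
ss.sql("msck repair table xxxx_ext")
```

This skips the slow Hive `insert overwrite` path entirely: Spark writes the files directly and Hive only needs to register them.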
Look, your Hive table has a date_time dynamic partition; check how many dynamic partitions the insert actually creates. If that number is very large, it will hurt performance badly.
What's more, you should learn to read the logs and understand what your program is doing. Finding which step takes the long time will make the problem much easier to solve.
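A quick way to check the partition count before blaming anything else, again assuming the `ss` session from the question:

```scala
// Count how many distinct date_time values (i.e. dynamic partitions)
// the insert will have to create.
val n = ss.sql("select count(distinct date_time) from zzzz")
  .first()
  .getLong(0)
println(s"dynamic partitions to be written: $n")
```

If the count is in the thousands, each partition becomes a separate directory and metastore entry, and the insert slows down accordingly.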
Spark SQL also needs to watch out for the small-files problem: Spark can output far too many small files to HDFS. How do you solve the Spark small-files problem?
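One common mitigation, sketched here under the same assumptions as the question's code: repartition by the partition column before the insert, so each dynamic partition is written by few tasks and therefore as few, larger files. The temp view name is a placeholder:

```scala
import org.apache.spark.sql.functions.col

// Shuffle rows so that each date_time value lands in one partition of
// the shuffle, producing roughly one output file per Hive partition.
ss.sql("select * from zzzz")
  .repartition(col("date_time"))
  .createOrReplaceTempView("zzzz_repartitioned")

ss.sql("insert overwrite table xxxx partition(date_time) " +
  "select * from zzzz_repartitioned")
```

Note the trade-off: this adds a shuffle, but it avoids thousands of tiny files that would slow down both the write and later reads.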