Hive tuning parameters

Time: 2022-05-31

The Hive tuning parameters commonly used in day-to-day work are summarized below.
Principles:
• minimum data
• minimum fields
• minimum number of jobs
• minimum number of reads
• avoid data skew
• global optimization rather than local optimization
• reasonable JVM memory settings

Reasonable file split sizes

Split sizes need to be set sensibly in light of the cluster's resources: larger splits mean fewer, heavier tasks, while smaller splits give more parallelism at the cost of more scheduling overhead.

#Maximum input split size (512 MB here)
set mapreduce.input.fileinputformat.split.maxsize=536870912;
#Minimum split size per node
set mapreduce.input.fileinputformat.split.minsize.per.node=536870912;
#Minimum split size per rack
set mapreduce.input.fileinputformat.split.minsize.per.rack=536870912;
#Amount of input data handled per reducer
set hive.exec.reducers.bytes.per.reducer=536870912;
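As a rough worked example (the 10 GB input is purely illustrative): with the 512 MB (536870912-byte) maximum split size above, a 10 GB input is read by roughly 10240 / 512 ≈ 20 map tasks, and hive.exec.reducers.bytes.per.reducer likewise gives each reducer about 512 MB of input. Raising these values produces fewer, heavier tasks; lowering them increases parallelism. For example:

#Hypothetical adjustment: doubling the split size roughly halves the number of mappers
set mapreduce.input.fileinputformat.split.maxsize=1073741824;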

#Combine small input files into larger splits
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

#Merge small files at the end of a map-only job
set hive.merge.mapfiles=true;
#Merge small files at the end of a MapReduce job (note: must be set to false if the file compression formats are inconsistent)
set hive.merge.mapredfiles=true;
#Target size of the merged files
set hive.merge.size.per.task=104857600;
#When the average size of the output files is below this value, start an extra MapReduce job to merge them
set hive.merge.smallfiles.avgsize=104857600;
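A minimal sketch of how the merge settings above are typically used; the ods_log/dwd_log tables, their columns, and the dt partition are hypothetical. With hive.merge.* enabled, an extra merge stage runs after the INSERT and combines the small output files of the partition:

INSERT OVERWRITE TABLE dwd_log PARTITION (dt = '2022-05-31')
SELECT user_id, url, ts
FROM ods_log
WHERE dt = '2022-05-31';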

Minimum data
The minimum-data principle applies at every stage: map, shuffle, and reduce.

  1. Network overhead: compressing the map output as it is spilled to disk is an effective way to cut network traffic during the shuffle.
  2. Dataset size: filter the data and prune unneeded columns before they are used, so that as little data as possible enters the shuffle (see the sketch after the settings below).

#Shuffle stage
#For GROUP BY queries, aggregate on the map side first
set hive.map.aggr=true;

#Compress the spill and merge files produced on the map side
set mapreduce.map.output.compress=true;
#Compression codec class for the map output
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
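A short sketch of the filter-before-use idea with hypothetical orders and users tables: pushing the partition filter and column pruning into the subquery keeps data out of the shuffle, while hive.map.aggr pre-aggregates on the map side.

set hive.map.aggr=true;

SELECT u.city, COUNT(*) AS order_cnt
FROM (SELECT user_id FROM orders WHERE dt = '2022-05-31') o  -- filter the partition and prune columns first
JOIN users u ON o.user_id = u.user_id
GROUP BY u.city;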

Data skew

Enable skew join optimization

set hive.optimize.skewjoin=true;
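A hedged sketch of related skew settings; hive.skewjoin.key and hive.groupby.skewindata are standard Hive parameters, and the values here are only illustrative.

set hive.optimize.skewjoin=true;
#Join keys with more rows than this threshold are treated as skewed and handled by a follow-up map join
set hive.skewjoin.key=100000;
#For skewed GROUP BY keys, spread the aggregation over two MR jobs
set hive.groupby.skewindata=true;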

Enable parallel execution
#Allow independent job stages to run in parallel
set hive.exec.parallel=true;
#Maximum number of stages allowed to run in parallel
set hive.exec.parallel.thread.number=16;
#Once the fraction of completed map tasks in a job exceeds this value (0.05 by default),
#the ApplicationMaster starts scheduling reduce tasks so that the shuffle copy phase overlaps with the remaining maps
set mapreduce.job.reduce.slowstart.completedmaps=0.05;
#Number of parallel copier threads a reduce task uses to fetch output from the nodes of completed map tasks.
#The fetched data is buffered in memory first and spilled to disk once buffer usage passes a threshold.
set mapred.reduce.parallel.copies=5;
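hive.exec.parallel only helps when a query contains independent stages. A minimal sketch with a hypothetical source table src_events and two hypothetical target tables, whose two aggregation branches can then run concurrently:

set hive.exec.parallel=true;

FROM src_events
INSERT OVERWRITE TABLE stats_by_city   SELECT city,   COUNT(*) GROUP BY city
INSERT OVERWRITE TABLE stats_by_device SELECT device, COUNT(*) GROUP BY device;
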
Memory optimization

1. The JVM runs inside the YARN container. mapreduce.map.java.opts sets the maximum JVM heap via -Xmx and is usually set to about 0.75 × mapreduce.map.memory.mb, because some space must be reserved for non-heap JVM memory, native code, and so on; the reduce-side settings follow the same rule.
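For example, with the values below a 4096 MB map container gets a 4096 × 0.75 = 3072 MB heap (-Xmx3072M), leaving roughly 1 GB of headroom for non-heap and native memory.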

#Size of the in-memory ring buffer (MapOutputBuffer) where the map's key/value pairs are staged
#before being spilled to disk
set mapreduce.task.io.sort.mb=1024;
#Buffer usage ratio at which the spill to disk starts
set mapreduce.map.sort.spill.percent=0.8;

#Container memory and JVM heap size for map tasks
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3072M;

#Container memory and JVM heap size for reduce tasks
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3072M;

#Fraction of the reduce task's heap used to buffer fetched map output during the shuffle
set mapred.job.shuffle.input.buffer.percent=0.7;
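As a rough worked example with the values above: 0.7 of the 3072 MB reduce heap, i.e. about 2150 MB, can be used to hold fetched map output in memory before it spills to disk during the shuffle.
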
Disk optimization

Frequent disk I/O is also expensive, so the following parameters can be tuned to reduce it.

#Maximum number of spill files merged in a single pass (default 10).
#With, say, 100 spill files, the merge cannot finish in one pass;
#increasing this value reduces the number of merge passes and therefore the disk traffic.
set mapreduce.task.io.sort.factor=10;

#When a combiner is defined, the map output is combined with it during the spill/merge.
#The combiner runs in the same JVM as the map task; this parameter (default 3) is the minimum
#number of spill files required before the combiner is run during the merge, which reduces the
#amount of data written to disk.
set min.num.spills.for.combine=3;
#Compressing the spill and merge files also reduces disk I/O and network I/O.
#Intermediate map output can be very large, so compression helps a lot when I/O is the bottleneck:
#with mapreduce.map.output.compress (default false) set to true, the data is compressed before it is
#written to disk and decompressed when read back. In practice the bottleneck of Hive jobs on Hadoop
#is usually I/O rather than CPU, and compression can often reduce the I/O volume by roughly 10x.
#Available codecs include gzip, LZO, bzip2 and LZMA; LZO is a reasonably balanced choice. The codec is
#selected with mapreduce.map.output.compress.codec (default org.apache.hadoop.io.compress.DefaultCodec).
#Compression does cost CPU, so it pays off when I/O, not CPU, is the limiting factor.
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
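Besides the map output, Hive can also compress the intermediate data passed between the MapReduce jobs of a multi-stage query and the final output; hive.exec.compress.intermediate and hive.exec.compress.output are standard Hive settings, shown here as a sketch:

#Compress data handed between the MR jobs of a multi-stage query
set hive.exec.compress.intermediate=true;
#Compress the final output written by the query
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
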
Resource parameters
#Resource queue to submit the job to (e.g. root.urgent or root.default)
set mapred.job.queue.name=root.default;
#Minimum memory (MB) a container can request
set yarn.scheduler.minimum-allocation-mb=1024;
#Maximum memory (MB) a container can request
set yarn.scheduler.maximum-allocation-mb=32768;
#Minimum vcores a container can request
set yarn.scheduler.minimum-allocation-vcores=1;
#Maximum vcores a container can request
set yarn.scheduler.maximum-allocation-vcores=16;
#JVM heap size of the MapReduce ApplicationMaster
set yarn.app.mapreduce.am.command-opts=-Xmx2048M;
#Container memory (MB) for the ApplicationMaster
set yarn.app.mapreduce.am.resource.mb=4096;
#Memory (MB) a NodeManager makes available for containers
set yarn.nodemanager.resource.memory-mb=57344;
#Vcores a NodeManager makes available for containers
set yarn.nodemanager.resource.cpu-vcores=16;
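As a rough capacity check with the values above: each NodeManager offers 57344 MB and 16 vcores, so with 4096 MB map/reduce containers a node can host at most 57344 / 4096 = 14 containers by memory (and 16 by vcores); container sizes are best chosen so that neither dimension is left badly under-used. Note that the yarn.scheduler.* and yarn.nodemanager.* values are cluster-side configuration, listed here for reference rather than as per-session settings.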