A frequently asked data warehouse interview topic


This article was first published on the official account Learn Big Data in Five Minutes.

Causes of small files

Small files in Hive are created when data is loaded into a Hive table, so let's first look at the ways of loading data into Hive.

  1. Insert data directly into a table
insert into table A values (1,'zhangsan',88),(2,'lisi',61);

Each insert of this kind produces one file, so inserting small amounts of data many times leaves many small files behind. However, this method is almost never used in production.

  2. Load data with the load command
load data local inpath '/export/score.csv' overwrite into table A; -- import a file

load data local inpath '/export/score' overwrite into table A; -- import a folder

Importing a single file adds one file to the Hive table. Importing a folder adds as many files as the folder contains.

  3. Load data with a query
insert overwrite table A  select s_id,c_name,s_score from B;

This is the method most commonly used in production, and also the one most likely to generate small files.

An insert launches a MapReduce (MR) job, and the job writes one output file per reduce task.

Therefore, number of files = number of reduce tasks * number of partitions

Many simple jobs have no reduce phase, only a map phase, in which case:

Number of files = number of map tasks * number of partitions

Every insert produces at least one file in Hive, because an insert runs at least one map task.
For example, some businesses need to synchronize data to Hive every 10 minutes, which generates a large number of files.
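
To make the arithmetic concrete, here is a minimal Python sketch of the two formulas above applied to the 10-minute synchronization case. The function names and the job shape (10 reduce tasks, 3 partitions) are hypothetical and chosen only for illustration.

```python
def output_files(tasks: int, partitions: int) -> int:
    """Files written by one insert, per the formula: tasks * partitions."""
    return tasks * partitions


def inserts_per_day(interval_minutes: int) -> int:
    """Number of inserts that run in 24 hours at a fixed interval."""
    return (24 * 60) // interval_minutes


# Hypothetical job: 10 reduce tasks writing into 3 partitions.
files_per_insert = output_files(tasks=10, partitions=3)

# Synchronizing every 10 minutes, as in the example above.
daily_inserts = inserts_per_day(10)

print(files_per_insert)                  # 30
print(daily_inserts)                     # 144
print(files_per_insert * daily_inserts)  # 4320
```

Even a map-only job with a single task and a single partition would still add 144 files per day at this interval.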

The impact of too many small files

  1. For the underlying storage, HDFS is not suited to storing large numbers of small files. Too many small files make the NameNode metadata very large and consume too much memory, which seriously degrades HDFS performance.
  2. For Hive, each small file is treated as a block at query time, and a map task is started to process it. Starting and initializing a map task can take far longer than the processing itself, which wastes a great deal of resources. In addition, the number of map tasks that can run concurrently is limited.
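
The NameNode cost in the first point can be estimated roughly. The sketch below assumes a commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block). That figure is an assumption, not something stated in this article, and real usage varies by Hadoop version.

```python
# Assumed rule of thumb (not from this article): about 150 bytes of
# NameNode heap per namespace object (file or block).
BYTES_PER_OBJECT = 150


def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Rough NameNode heap for num_files files; directories are ignored."""
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


many_small = namenode_bytes(10_000_000)  # 10 million single-block files
fewer_large = namenode_bytes(100_000)    # 100 thousand single-block files

print(many_small // 1_000_000)   # 3000 (MB)
print(fewer_large // 1_000_000)  # 30 (MB)
```

Under this assumption, merging files at a ratio of 100 to 1 shrinks the metadata footprint by the same factor.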

How to solve too many small files

1. Use the concatenate command of hive to merge small files automatically

Usage:

#For non-partitioned tables
alter table A concatenate;

#For partitioned tables
alter table B partition(day=20201224) concatenate;

Example:

#Insert data into table A
hive (default)> insert into table A values (1,'aa',67),(2,'bb',87);
hive (default)> insert into table A values (3,'cc',67),(4,'dd',87);
hive (default)> insert into table A values (5,'ee',67),(6,'ff',87);

#After the three statements above have run, table A contains three small files. Run the following on the Hive command line
#View the number of files in table A
hive (default)> dfs -ls /user/hive/warehouse/A;
Found 3 items
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:46 /user/hive/warehouse/A/000000_0
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:47 /user/hive/warehouse/A/000000_0_copy_1
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:48 /user/hive/warehouse/A/000000_0_copy_2

#There are three small files; now merge them with concatenate
hive (default)> alter table A concatenate;

#Check the number of files in table A again
hive (default)> dfs -ls /user/hive/warehouse/A;
Found 1 items
-rwxr-xr-x   3 root supergroup        778 2020-12-24 14:59 /user/hive/warehouse/A/000000_0

#The files have been merged into one

Note:
1. The concatenate command only supports the RCFile and ORC file formats.
2. When merging small files with concatenate, you cannot specify the number of files after merging, but you can run the command multiple times.
3. If the number of files stops changing after concatenate has been run several times, this is related to the setting of mapreduce.input.fileinputformat.split.minsize=256mb, which sets the minimum size of each file.
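
Because concatenate works on one partition at a time for partitioned tables, merging several days means issuing one statement per partition. The following Python sketch only generates those statements as text. The helper name is hypothetical, and the table B and the day column follow the usage example above.

```python
from datetime import date, timedelta


def concatenate_statements(table: str, start: date, days: int, column: str = "day") -> list:
    """Build one ALTER TABLE ... CONCATENATE statement per daily partition."""
    statements = []
    for offset in range(days):
        value = (start + timedelta(days=offset)).strftime("%Y%m%d")
        statements.append(f"alter table {table} partition({column}={value}) concatenate;")
    return statements


for stmt in concatenate_statements("B", date(2020, 12, 24), days=3):
    print(stmt)
```

The printed statements can then be pasted into the Hive command line or saved to a script file.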

2. Adjust parameters to reduce the number of maps

  • Set the parameters for merging small files on map input
#Merge small files before the map phase runs
#CombineHiveInputFormat is built on Hadoop's CombineFileInputFormat,
#which combines multiple files into one split as mapper input
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; -- default

#Maximum input size of each map (this value determines the number of files after merging)
set mapred.max.split.size=256000000;   -- 256M

#Minimum size of a split on a node (this value determines whether files on multiple datanodes need to be merged)
set mapred.min.split.size.per.node=100000000;  -- 100M

#Minimum size of a split on a rack (this value determines whether files on multiple racks need to be merged)
set mapred.min.split.size.per.rack=100000000;  -- 100M
  • Set the parameters for merging map output and reduce output:
#Merge small files at the end of a map-only job (default: true)
set hive.merge.mapfiles = true;

#Merge small files at the end of a map-reduce job (default: false)
set hive.merge.mapredfiles = true;

#Size of the merged files
set hive.merge.size.per.task=256000000;   -- 256M

#When the average output file size is below this value, start a separate MapReduce job to merge the files
set hive.merge.smallfiles.avgsize=16000000;   -- 16M
  • Enable compression
#Whether to compress the query result output of hive
set hive.exec.compress.output=true;

#Whether MapReduce job output is compressed
set mapreduce.output.fileoutputformat.compress=true;
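
As a rough illustration of how the size settings above interact, the Python sketch below models two decisions: how many map splits a set of small files collapses into under a maximum split size, and whether the average output size would trigger the extra merge job. It is a simplification that ignores the per-node and per-rack minimums. The function names are hypothetical and this is not Hive's actual implementation.

```python
from typing import Sequence


def estimated_splits(file_sizes: Sequence[int], max_split_size: int) -> int:
    """Approximate split count when small files are combined up to max_split_size."""
    total = sum(file_sizes)
    return max(1, -(-total // max_split_size))  # integer ceiling division


def needs_merge(file_sizes: Sequence[int], avg_size_threshold: int) -> bool:
    """True when the average output file size falls below the threshold."""
    if not file_sizes:
        return False
    return sum(file_sizes) / len(file_sizes) < avg_size_threshold


MAX_SPLIT = 256_000_000  # mapred.max.split.size from the settings above
AVG_SIZE = 16_000_000    # hive.merge.smallfiles.avgsize from the settings above

small_files = [1_000_000] * 1000  # 1000 files of 1 MB each (made-up input)

print(len(small_files))                          # 1000 map tasks without combining
print(estimated_splits(small_files, MAX_SPLIT))  # 4
print(needs_merge([378, 378, 378], AVG_SIZE))    # True
print(needs_merge([200_000_000, 180_000_000], AVG_SIZE))  # False
```

The three 378-byte files are the sizes from the concatenate example earlier in this article.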

3. Reduce the number of reducers

#The number of reducers determines the number of output files, so adjusting it controls the number of files in the Hive table.
#Hive's distribute by clause controls how rows are partitioned across reducers in MR,
#so setting the number of reducers and combining it with distribute by lets the data flow evenly into each reducer.

#There are two ways to set the number of reducers. The first is to set it directly
set mapreduce.job.reduces=10;

#The second is to set the amount of data per reducer; Hive then estimates the number of reducers from the total data size
set hive.exec.reducers.bytes.per.reducer=5120000000; -- default is 1G (256M since Hive 0.14); set to about 5G here

#Run the following statements to distribute the data evenly across the reducers
set mapreduce.job.reduces=10;
insert overwrite table A partition(dt)
select * from B
distribute by rand();

Explanation: with the number of reducers set to 10, rand() generates a random number x for each row, and the row is sent to reducer x % 10.
This sends rows to the reducers at random, which prevents some output files from being much larger or smaller than others.
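
Both ideas in this section can be simulated outside Hive. The Python sketch below first estimates a reducer count from the data size, then assigns rows to reducers uniformly at random to show how even the result is. The function names, the 50 GB input, and the cap of 1009 reducers are assumptions made for illustration. The code mimics the effect of distribute by rand(), not Hive's internal hashing.

```python
import random
from collections import Counter


def estimate_reducers(total_bytes: int, bytes_per_reducer: int, max_reducers: int) -> int:
    """Size-based estimate: ceil(total / per-reducer), clamped to [1, max_reducers]."""
    wanted = -(-total_bytes // bytes_per_reducer)  # integer ceiling division
    return max(1, min(max_reducers, wanted))


def distribute_by_rand(num_rows: int, num_reducers: int, seed: int = 0) -> Counter:
    """Assign each row to a uniformly random reducer; return rows per reducer."""
    rng = random.Random(seed)
    return Counter(rng.randrange(num_reducers) for _ in range(num_rows))


# 50 GB of input at 5,120,000,000 bytes per reducer; the cap of 1009 is assumed.
print(estimate_reducers(50 * 10**9, 5_120_000_000, max_reducers=1009))  # 10

counts = distribute_by_rand(num_rows=100_000, num_reducers=10)
for reducer, rows in sorted(counts.items()):
    print(reducer, rows)
```

Each reducer ends up with close to 10,000 rows, so the 10 output files per partition come out at similar sizes.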

4. Use Hadoop archive to archive small files

Hadoop Archive, or HAR for short, is a file archiving tool that packs small files into HDFS blocks efficiently. It bundles many small files into a single HAR file, which reduces NameNode memory usage while still allowing transparent access to the files.

#Enable archiving
set hive.archive.enabled=true;
#Tell Hive whether the parent directory can be set when the archive is created
set hive.archive.har.parentdir.settable=true;
#Control the size of the files that make up the archive
set har.partfile.size=1099511627776;

#Use the following command to archive
ALTER TABLE A ARCHIVE PARTITION(dt='2020-12-24', hr='12');

#Restore the archived partition to the original file
ALTER TABLE A UNARCHIVE PARTITION(dt='2020-12-24', hr='12');

Note:
An archived partition can be queried, but it cannot be written with insert overwrite. You must unarchive it first.


For a new cluster with no legacy issues, it is recommended that Hive use the ORC file format with LZO compression enabled.
If too many small files still build up, they can be merged quickly with Hive's built-in concatenate command.
