Monitoring of hive optimization (3)

Time:2021-8-13

Hive is one of the commonly used components in the big data field, mainly the operation of big data offline data warehouse. Hive performance tuning is often involved in daily work and interview, so mastering some hive tuning is an essential skill. The main factors affecting hive efficiency are data skew, data redundancy, job IO, different underlying engine configurations, hive parameters and hivesql execution. This paper mainly controls the overall performance of hive from the perspective of monitoring operation and maintenance. By monitoring hive metadata, we can find the irrationality and optimization points of hive table in advance, and transform passive operation and maintenance into active operation and maintenance.

1 hive metadata introduction

Hive metadata is generally stored in relational databases. MySQL is the most common choice. Here, hive metadata is stored in myslq. This time, we will introduce several main metadata tables, DBS, tbls, SDS and partitions

1.1 hive database related metadata table (DBS)

Monitoring of hive optimization (3)

1.2 hive table and view related metadata table (tbls)

Monitoring of hive optimization (3)

1.3 hive file storage information related metadata table (SDS)

Monitoring of hive optimization (3)

1.4 hive database related metadata tables (partitions)

Monitoring of hive optimization (3)

2 collect hive metadata

When using hive metadata for monitoring, ensure that the metadata information of the corresponding table or partition has been collected. Metadata is collected as follows

2.1 metadata of collection table

Analyze table name   compute statistic;

2.2 collect metadata of table fields

Analyze table name   compute statistic for columns;

2.3 collect metadata of all partitions

Analyze table name   Partition (partition column) compute statistic;

2.4 specify a specific partition to collect metadata

Analyze table name   Partition (partition column = partition value) compute statistic;

2.5 collect metadata of columns in all partitions

Analyze table name   Partition (partition column) compute statistical for columns;

3 hive metadata monitoring case

3.1 monitor the average size of files stored in common tables

For large file blocks, data skew may occur when reading data, affecting the operation efficiency of cluster tasks. The following SQL is for tables larger than twice the HDFS file block size:

​
--Overall logic: find the table tbls under the corresponding library through DBS, and then find the table attribute corresponding to each table through tbls to obtain totalsize and NumFiles. The former represents the file size and the latter represents the number of files

**SELECT**

 TBL_NAME,round(avgfilesize,1) **as** 'fileSize(Mb)'

**FROM** (

 **SELECT**

 tp.totalSize/(1024*1024)/numFiles avgfilesize,TBL_NAME

 **FROM** metastore.dbs d

 **INNER** **join** metastore.tbls t **on** d.DB_ID = t.DB_ID

 **left** **join** (

 **SELECT** TBL_ID,

 **MAX**(**case** PARAM_KEY **when** 'numFiles' **then** PARAM_VALUE **ELSE** 0 **END**) numFiles,

 **MAX**(**case** PARAM_KEY **when** 'totalSize' **then** PARAM_VALUE **ELSE** 0 **END** ) totalSize

 **from** metastore.table_params

 **GROUP** **by** TBL_ID

 ) tp **on** t.TBL_ID = tp.TBL_ID

 **Where * * d.name = 'library to monitor'

 **and** tp.numFiles **is** **not** **NULL**

 **and** tp.numFiles > 0

 )A * * where * * avgfilesize > file block size of HDFS * 2

 **ORDER** **BY** avgfilesize **desc**;
​

Monitoring of hive optimization (3)

3.2 monitor the average size of files stored in partitions. For partitions larger than twice the size of HDFS file blocks,

​
--Overall logic: first associate the tbls table with DBS, and the tbls table with partitions, and the partition table with partition_ PARAMS

**SELECT**

TBL_NAME,part_name,round(avgfilesize,1) **as** 'fileSize(Mb)'

**FROM** (

 **SELECT**

 pp.totalSize/(1024*1024)/numFiles avgfilesize,TBL_NAME,part.PART_NAME

 **FROM** metastore.dbs d

 **INNER** **join** metastore.TBLS t **on** d.DB_ID = t.DB_ID

 **INNER** **join** metastore.PARTITIONS part **on** t.TBL_ID = part.TBL_ID

 **left** **join** (

 **SELECT** PART_ID,

 --Number of files stored in each table

 **MAX**(**case** PARAM_KEY **when** 'numFiles' **then** PARAM_VALUE **ELSE** 0 **END**) numFiles,

 --Size of the file store

 **MAX**(**case** PARAM_KEY **when** 'totalSize' **then** PARAM_VALUE **ELSE** 0 **END** ) totalSize

 **from** metastore.PARTITION_PARAMS

 **GROUP** **by** PART_ID

 ) pp **on** part.PART_ID = pp.PART_ID

 **Where * * d.name = 'library to monitor'

 **and** pp.numFiles **is** **not** **NULL**

 **and** pp.numFiles > 0

)A * * where * * avgfilesize > file block size of HDFS * 2

**ORDER** **BY** avgfilesize **desc**;

Monitoring of hive optimization (3)

3.3 monitoring large tables without partition

For large data tables, if they are not partitioned, it means that the program needs to traverse more file blocks when reading the same data, and the performance will be greatly reduced.

​

**select** t.TBL_NAME ,round(totalSize/1024/1024,1) **as** 'fileSize(Mb)'

**FROM** metastore.DBS d

**inner** **join** metastore.TBLS t **on** d.`DB_ID` = t.`DB_ID`

**inner** **join** (

 **select** `TBL_ID`,**max**(**case** `PARAM_KEY` **when** 'totalSize' **then** `PARAM_VALUE` **else** 0 **end**) totalSize

 **from** `TABLE_PARAMS`

 **group** **by** `TBL_ID`

) tp **on** t.`TBL_ID` = tp.`TBL_ID`

**left** **join**

(

 **select** **distinct** `TBL_ID` **from** metastore.PARTITIONS p

) part **on** t.`TBL_ID` = part.`TBL_ID`

**Where * * D. ` name ` = 'library to monitor'

**and** part.`TBL_ID` **is** **null**

**and** totalSize/1024/1024/1024 > 30

**ORDER** **BY** totalSize/1024 **desc**;

Monitoring of hive optimization (3)

3.4 number of monitoring table partitions

Understand the number of partitions of a table. When doing a full table join, if the number of partitions of a table is small and there are many partitions, you can consider optimization methods such as partition merging

​

**SELECT**

t.TBL_ Name 'table name', D. ` name ` 'database name', * * count * * (part. Part)_ Name) 'number of partitions'

**FROM**

DBS d

**INNER** **JOIN** TBLS t **on** d.DB_ID = t.DB_ID

**INNER** **join** `PARTITIONS` part **on** part.TBL_ID = t.TBL_ID

**Where * * D. ` name ` = 'library to monitor'

**GROUP** **by** t.TBL_NAME,d.`NAME`

**ORDER** **BY** **COUNT**(part.PART_NAME) **desc**;

Monitoring of hive optimization (3)

Conclusion:

The main purpose of hive metadata monitoring is to control the overall situation of tables in hive. Here we mainly introduce the monitoring of big data blocks, non partitioned tables and table partitions. Of course, there are many other indicators, such as hive’s small files, table data storage format, etc. the long-term monitoring of these information can be displayed in combination with grafana, This is very important for the stable operation of the whole data warehouse. We will also publish articles related to hive SQL tuning later. Please look forward to it.

For more business information, please check the official website of yunchehttps://www.dtstack.com/dtsmart/

Recommended Today

Swift advanced (XV) extension

The extension in swift is somewhat similar to the category in OC Extension can beenumeration、structural morphology、class、agreementAdd new features□ you can add methods, calculation attributes, subscripts, (convenient) initializers, nested types, protocols, etc What extensions can’t do:□ original functions cannot be overwritten□ you cannot add storage attributes or add attribute observers to existing attributes□ cannot add parent […]