1. The background of Hive
- MapReduce programming is inconvenient
- Files on HDFS lack a schema (table names, column names, column types, and other metadata that turns a collection of files into database objects)
2. What is Hive
What are the typical scenarios for Hive?
Doing data cleaning (ETL), reporting, and data analysis on top of Hadoop.
Hive maps structured data files to database tables and provides SQL-like query capabilities.
Hive is a SQL parsing engine: it translates SQL statements into MapReduce jobs and then executes them on Hadoop.
- Open-sourced by Facebook, initially built to solve the problem of computing statistics over massive structured log data
- A data warehouse built on Hadoop
- Hive defines a SQL-like query language: HQL
- Usually used for offline data processing (MapReduce in the early days)
- The underlying layer supports several different execution engines (Hive can now run directly on Spark)
The underlying execution engines of Hive include MapReduce, Tez, and Spark.
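As a sketch of the idea above, a file already sitting on HDFS can be mapped to a table and then queried with HQL; Hive translates the query into a MapReduce (or Tez/Spark) job. The table name, columns, and HDFS path below are illustrative assumptions, not from the original text:

```sql
-- Hypothetical example: map an existing tab-delimited HDFS file to a table.
CREATE EXTERNAL TABLE access_log (
  ip  STRING,
  url STRING,
  ts  BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/access_log';

-- SQL-like query; Hive compiles it into a distributed job.
SELECT url, COUNT(*) AS pv
FROM access_log
GROUP BY url;
```

Because the table is EXTERNAL, dropping it removes only the metadata; the file on HDFS is left untouched.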
3. Hive features
- Hive's biggest strength is that it analyzes big data through SQL-like queries, avoiding hand-written MapReduce programs and making data analysis much easier
- Hive maps data into databases and tables; the metadata of those databases and tables is usually kept in a relational database (such as MySQL)
- Hive does not provide storage itself; data is generally stored on HDFS (Hive is not strict about data integrity or format)
- Hive scales its storage and compute easily, a property inherited from Hadoop (suitable for large-scale parallel computing)
- Hive is designed for OLAP (online analytical processing) and does not support transactions
4. Hive architecture
Hive follows a client/server (C/S) model.
The client side includes JDBC/ODBC and Thrift clients, which can access Hive remotely.
Hive can be accessed through shell scripts or the Thrift protocol, and data operations can be performed the same way ordinary JDBC code is written.
Server side: CLI, Thrift Server, HWI (Hive Web Interface), Driver, Metastore
- The CLI, Thrift Server, and HWI are independently deployable Hive services exposed to clients
- The Driver and Metastore are internal Hive components; the Metastore can also be used by third-party SQL-on-Hadoop frameworks
- Beeline (introduced in Hive 0.11), a Hive JDBC client that talks to HiveServer2, solves the CLI's concurrent-access problem
The input SQL string is parsed into an abstract syntax tree, which is converted into a logical plan; the logical plan is then optimized, and finally a physical plan (covering serialization, deserialization, and UDFs) is generated and handed to the execution engine, which submits it to MapReduce for execution (input and output can be local or on HDFS/HBase). See the Hive architecture in the figure below.
Metastore metadata management: Derby (built-in) or MySQL. Derby accepts only one Hive session at a time; with MySQL, Hive runs on Hadoop while MySQL is deployed in an active/standby pair (with periodic synchronization).
As can be seen from the figure above, Hadoop and MapReduce are the foundation of hive architecture.
Metastore: stores and manages Hive's metadata, using a relational database to persist the metadata.
Parser and compiler: generate a syntax tree from the SQL statement, then produce a DAG job chain that becomes the logical plan.
Optimizer: provides only rule-based optimization
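For illustration, when MySQL backs the metastore, the metadata can be inspected with ordinary SQL. `DBS` and `TBLS` are real tables in the standard Hive metastore schema; the metastore database name `hive` is an assumption (it is configurable):

```sql
-- Run against the MySQL instance backing the metastore.
-- Assumes the metastore database is named "hive".
USE hive;

-- List every Hive table together with the database it belongs to.
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID;
```

This is also why a third-party SQL-on-Hadoop engine can reuse the Metastore: the table-to-file mapping lives entirely in these relational tables.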
- Column pruning: remove columns the query does not need
- Row filtering: evaluate WHERE conditions at the TableScan stage, and use partition information to read only the qualifying partitions
- Predicate pushdown: reduce the amount of data passed to downstream operators
- Join strategies
. Map-side join: adjust the join order so that the large table drives the join and the small table is loaded into every mapper's memory
. Shuffle join: hash-partition the rows of both tables on the join key so that matching rows meet at the same reducer
. For GROUP BY over skewed data, to avoid concentrating data on a few reducers, the work is split into two MapReduce stages: the first shuffles on the distinct column and partially aggregates on the reduce side to shrink the data; the second aggregates by the GROUP BY column
. Sort-merge join: sort the data, split it by key range, and send each range to the same node (the two sorted tables are either created in the background before running or specified at table-creation time)
. Use hash-based partial aggregation on the map side to reduce the amount of data the reduce side must process
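The rules above can be exercised from ordinary HQL. Below is a sketch using a partitioned fact table `orders` (partitioned by `dt`) and a small dimension table `dim_user`; both table names and columns are hypothetical. The `MAPJOIN` hint and the two `SET` properties are real Hive features:

```sql
-- Partition pruning: only the dt='2021-01-01' partition is scanned,
-- and the predicate is pushed down to the TableScan stage.
-- The MAPJOIN hint asks Hive to load the small table (u) into
-- every mapper's memory instead of doing a shuffle join.
SELECT /*+ MAPJOIN(u) */ o.order_id, u.user_name
FROM orders o
JOIN dim_user u ON o.user_id = u.user_id
WHERE o.dt = '2021-01-01';

-- Two-stage aggregation for skewed GROUP BY keys, plus
-- hash-based partial aggregation on the map side.
SET hive.groupby.skewindata=true;
SET hive.map.aggr=true;

SELECT user_id, COUNT(*) AS cnt
FROM orders
GROUP BY user_id;
```

With `hive.groupby.skewindata=true`, Hive generates the two-stage plan described above: a first job that spreads and partially aggregates the skewed keys, and a second job that produces the final GROUP BY result.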
Executor: converts the DAG into MapReduce tasks and executes the jobs in sequence; jobs without dependencies on each other are executed concurrently.
5. Implementation process of hive SQL based on Hadoop
After SQL is written it is still just a string, so it needs a series of parsing steps before it becomes a job executing on the cluster:
1. Parser: parses the SQL into an AST (abstract syntax tree) and performs syntax checking; at this stage the AST is still essentially a string form of the query
2. Analyzer: performs semantic analysis and generates a QB (query block)
3. Logical plan: generates the logical execution plan, a tree of Operators
4. Logical optimizer: optimizes the logical execution plan, producing an optimized Operator tree
5. Physical plan: generates the physical execution plan, a TaskTree
6. Physical optimizer: optimizes the physical execution plan, producing the optimized TaskTree, which is the job executed on the cluster
- Conclusion: after these six steps, an ordinary SQL string is parsed and mapped to an execution task on the cluster. The two most important steps are the logical-plan optimization and the physical-plan optimization (circled in red in the figure)
View the execution plan:

```sql
EXPLAIN SELECT id, name FROM emp WHERE deptid = 1001 ORDER BY id;
```
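Hive also supports a more verbose variant of this statement: `EXPLAIN EXTENDED` prints additional detail about the physical plan, which is useful for checking whether partition pruning or a map-side join actually kicked in:

```sql
-- Same query as above, with extended plan output.
EXPLAIN EXTENDED
SELECT id, name FROM emp WHERE deptid = 1001 ORDER BY id;
```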