Kuaishou Big Data Architecture Engineer Zhong Liang
This article is based on "Practice and Optimization of SQL on Hadoop on the Kuaishou Big Data Platform", a talk given by Zhong Liang, big data architecture engineer at Kuaishou, at the A2M Artificial Intelligence and Machine Learning Innovation Summit on May 18-19.
Content overview: this article covers the architecture of SQL on Hadoop from four angles: an introduction to SQL on Hadoop, an overview of the Kuaishou SQL on Hadoop platform, usage experience and improvement analysis of SQL on Hadoop at Kuaishou, and future plans for SQL on Hadoop at Kuaishou.
01 Introduction to SQL on Hadoop
SQL on Hadoop, as the name implies, refers to SQL engine architectures built on the Hadoop ecosystem. Hive, Spark SQL, Presto, and Impala are the architectures we hear about most often. Below, I briefly describe the commonly used ones.
Introduction to SQL on Hadoop – HIVE
Hive is a data warehouse system. It maps a table schema onto the stored data, and reads, writes, and manages large-scale distributed data through SQL.
Based on the defined data schema and the output storage, it compiles and optimizes the input SQL, generates tasks for the corresponding engine, and then schedules the execution of the generated tasks.
Hive currently supports three engine types: MR, Spark, and Tez.
Hive also provides additional services on top of its own architecture, such as Hive Server 2 and MetaStore Server, both of which are Thrift services.
Hive Server 2 lets remote clients submit SQL tasks, while MetaStore Server lets remote clients manipulate metadata.
Introduction to SQL on Hadoop – SPARK
Spark is a fast, easy-to-use unified analytics engine for large-scale data processing, with a DAG as its execution model. Its main modules are the SQL engine, stream processing, machine learning, and graph processing.
Introduction to SQL on Hadoop – SPARK SQL
Spark SQL is built on the Spark compute engine. It provides uniform data access, integrates with Hive, and supports standard JDBC connections. Spark SQL is often used in interactive analysis scenarios.
The main execution flow of Spark SQL is: parse the SQL into a syntax tree; produce a logical plan through semantic analysis, interacting with the metadata; optimize the logical plan; and finally translate the logical plan into a physical plan, i.e. an RDD lineage, and execute the tasks.
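The phases above can be sketched with a toy example. None of the functions below are Spark's real classes; they only illustrate the pipeline of SQL text → syntax tree → logical plan → optimized plan → physical plan (the RDD-lineage analogue):

```python
# Toy illustration of the Spark SQL phases (not Spark's actual API).

def parse(sql):
    # Hypothetical, trivially simple "parser": SELECT <col> FROM <table> [WHERE <pred>]
    tokens = sql.split()
    return {"select": tokens[1], "from": tokens[3],
            "where": " ".join(tokens[5:]) if len(tokens) > 5 else None}

def to_logical_plan(ast, catalog):
    # Semantic analysis: resolve the table against metadata (the "catalog").
    if ast["from"] not in catalog:
        raise ValueError("unknown table: " + ast["from"])
    plan = [("scan", ast["from"])]
    if ast["where"]:
        plan.append(("filter", ast["where"]))
    plan.append(("project", ast["select"]))
    return plan

def optimize(plan):
    # One classic rule: drop filters that are trivially true.
    return [op for op in plan if op != ("filter", "1 = 1")]

def to_physical(plan):
    # Each logical operator maps to a physical stage in the "lineage".
    return ["%s_exec(%s)" % (op, arg) for op, arg in plan]

catalog = {"logs"}
ast = parse("SELECT user FROM logs WHERE 1 = 1")
physical = to_physical(optimize(to_logical_plan(ast, catalog)))
print(physical)  # the trivial filter has been optimized away
```

A real optimizer (Catalyst) applies many such rewrite rules over a tree, but the staged shape of the pipeline is the same.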
Introduction to SQL on Hadoop – PRESTO
Presto is an open-source distributed SQL query engine for interactive analytic queries.
Because it computes in memory, Presto performs better than the MR and Spark engines on queries with heavy IO. It scales out flexibly and supports pluggable connectors.
There are many production users in the industry, including Facebook, Airbnb, and Meituan.
Introduction to SQL on Hadoop – Other Industry Solutions
Having looked at all these SQL on Hadoop architectures, it is clear that they are practical and mature. With a SQL on Hadoop architecture, we can meet the demands of massive-scale data processing.
02 Overview of the Kuaishou SQL on Hadoop Platform
Overview of the Kuaishou SQL on Hadoop Platform – Platform Size
The query platform handles about 700,000 SQL statements in total, about 180,000 of which are DQL. The AdHoc cluster is mainly used for interactive analysis and programmatic queries; its DQL averages 300 seconds. Because AdHoc uses Local tasks and acceleration engines internally, query latency is relatively low.
The ETL cluster is mainly used for ETL processing and report generation. Its DQL averages 1000 s, with a P50 of 100 s and a P90 of 4000 s. Besides these two clusters, other small clusters mainly provide dedicated services.
Overview of the Kuaishou SQL on Hadoop Platform – Service Layers
The service layer sits at the top, with four modules: the synchronization service, the ETL platform, the AdHoc platform, and user programs. Above the scheduling layer there are likewise four kinds of data: server-side logs, which after processing are written directly to HDFS and cleaned later; business data and database records, which are synchronized into the corresponding data sources; metadata, which is kept in the back-end metadata system; and data crawled from web pages, which is stored in HBase and subsequently cleaned and processed.
Overview of the Kuaishou SQL on Hadoop Platform – Platform Components
HUE and NoteBook mainly provide the interactive query system. The report system and the BI system mainly handle ETL processing and routine report generation. A metadata system additionally serves external consumers. The engines Kuaishou currently supports are MR, Presto, and Spark.
The management system administers our current clusters. The Hive Server 2 cluster routing system is mainly responsible for engine selection. The monitoring system and the operations system mainly handle the operation and maintenance of Hive Server 2 and the engines.
We have run into many problems while using Hive Server 2. Below, I will describe in detail how Kuaishou optimized around them in practice.
03 SQL on Hadoop User Experience and Improvement Analysis at Kuaishou
Hive Server 2 Multi-Cluster Architecture
There are currently several Hive Server 2 clusters: AdHoc, ETL, and a few small ones. Each cluster registers with its own ZK namespace, and clients connect to a Hive Server 2 cluster through ZK.
To guarantee the stability of core jobs, the ETL clusters are divided into a core cluster and a general cluster. When a client connects to HS2, we determine the task's priority: high-priority tasks are routed to the core cluster, low-priority tasks to the general cluster.
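The priority routing can be sketched as follows. The ZK namespace paths and the priority threshold are invented for illustration, not Kuaishou's real values; a real client would list the HS2 instances registered under the chosen ZK namespace.

```python
# Sketch of priority-based HS2 cluster selection (all names hypothetical).
import random

CORE_NS = "/hiveserver2-etl-core"        # hypothetical ZK paths
GENERAL_NS = "/hiveserver2-etl-general"

def route_namespace(task_priority, high_watermark=7):
    """High-priority tasks go to the core cluster, the rest to general."""
    return CORE_NS if task_priority >= high_watermark else GENERAL_NS

def pick_instance(zk_children, task_priority):
    """zk_children: dict mapping namespace -> registered HS2 host:port list."""
    return random.choice(zk_children[route_namespace(task_priority)])

zk = {CORE_NS: ["hs2-core-1:10000"],
      GENERAL_NS: ["hs2-gen-1:10000", "hs2-gen-2:10000"]}
print(pick_instance(zk, task_priority=9))   # lands on a core instance
```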
Hive Server 2 Service Internal Flow Chart
Beacon Server is the back-end Hook Server service. Through the Hooks in HS2, it implements functionality beyond the HS2 service itself. The modules currently supported include routing, auditing, SQL rewriting, task control, error analysis, and optimization suggestions.
Stateless: the Beacon Server service scales horizontally, so capacity can be adjusted flexibly with request volume.
Dynamic configuration loading: every Beacon Server module has a switch, and configuration can be loaded dynamically to bring modules online or offline. For example, the routing module can adjust, or even cut off, the routing ratio according to the cluster resources of the back-end acceleration engine.
Seamless upgrades: the back-end modules of Beacon Server can be taken offline and upgraded independently, without affecting the Hooks in the HS2 service.
Pain Points in the Use of SQL on Hadoop Platform
Problems with using new engines for acceleration
Hive supports the Spark and Tez engines, but they were not suitable for our production environment.
Every SQL-on-Hadoop engine has its own strengths and weaknesses, and the learning curve for users is steep.
Different SQL engines differ in syntax and feature support, which demands a great deal of testing and compatibility work; full compatibility is costly.
Running separate services for different SQL engines complicates lineage management, permission control, operations, and resource utilization across the warehouse.
The Intelligent Engine Solution
Implement a custom execution engine inside Hive.
Automatic routing: users do not need to choose an engine; the appropriate acceleration engine is selected automatically.
Rule-based matching of the SQL pushes only compatible SQL down to the acceleration engine.
Reuse HiveServer2 cluster architecture.
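The rule-matching step can be sketched as a keyword/regex filter: SQL goes to the acceleration engine only if it matches no "incompatible" pattern, and otherwise stays on MR. The patterns and engine names here are illustrative, not Kuaishou's real rule list:

```python
# Sketch of rule-based engine selection (illustrative rules only).
import re

INCOMPATIBLE_RULES = [
    re.compile(r"\btransform\b", re.IGNORECASE),           # script transforms: Hive/MR only
    re.compile(r"\bINSERT\s+OVERWRITE\b", re.IGNORECASE),  # writes stay on the Hive engine
]

def choose_engine(sql, accel_engine="presto", fallback="mr"):
    """Push SQL to the acceleration engine only when no rule flags it."""
    if any(rule.search(sql) for rule in INCOMPATIBLE_RULES):
        return fallback
    return accel_engine

print(choose_engine("SELECT count(*) FROM logs"))          # routed to presto
print(choose_engine("INSERT OVERWRITE TABLE t SELECT 1"))  # stays on mr
```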
Intelligent Engine: Comparison of Mainstream Engine Schemes
Intelligent Engine: Module Design of Hive Server 2 Custom Execution Engine
There are two implementations based on Hive Server 2. The JDBC approach sends SQL through a JDBC interface to a cluster started by the back-end acceleration engine. The PROXY approach pushes SQL down to a locally started acceleration-engine client.
The back-end clusters used by the JDBC mode all run on YARN, which enables time-shared reuse of resources. For example, the AdHoc cluster's resources are automatically reclaimed at night and reused by the report system.
Intelligent Engine: Design Architecture of SQL Routing Scheme
The routing scheme is based on the HS2 Hook architecture: a Hook implemented in HS2 performs the engine switch, while the routing service in the back-end Beacon Server matches SQL against the routing rules. Different clusters can be configured with different routing rules.
To keep the routing service stable, the team also designed a Rewrite Hook that rewrites SQL in the AdHoc cluster, automatically adding an upper LIMIT to prevent scans over huge data volumes.
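The Rewrite Hook idea can be sketched at the string level (a real implementation would rewrite the parsed AST, and the cap value here is an assumption):

```python
# Sketch of the LIMIT-rewrite idea: cap ad-hoc SELECTs that have no LIMIT.
import re

MAX_ROWS = 100000  # hypothetical AdHoc cap

def add_limit(sql, cap=MAX_ROWS):
    stripped = sql.rstrip().rstrip(";")
    # Leave the query alone if the user already set a trailing LIMIT.
    if re.search(r"\blimit\s+\d+\s*$", stripped, re.IGNORECASE):
        return stripped
    return "%s LIMIT %d" % (stripped, cap)

print(add_limit("SELECT * FROM logs"))            # gets the cap appended
print(add_limit("SELECT * FROM logs LIMIT 10;"))  # user limit preserved
```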
Intelligent Engine: A List of SQL Routing Rules
Intelligent Engine: Solution Advantage
Easy integration: mainstream SQL engines can readily implement the JDBC or PROXY mode, and new query engines such as Impala or Drill can be integrated through configuration alone.
Automated engine selection lowers the user's cost of choosing an engine and makes migration easier. When the acceleration engine is overloaded, the routing ratio can be adjusted dynamically to keep the overload from degrading acceleration performance.
Automatic downgrade guarantees reliability: SQL routing includes a failback module that, depending on configuration, reroutes a failed query back to the MR engine.
Module reuse: a new engine can reuse Hive Server 2's existing lineage collection, permission authentication, concurrency-lock control, and other facilities, greatly reducing the cost of adoption.
Resource reuse: the resources occupied by ad-hoc queries can be adjusted dynamically by time slot, effectively ensuring cluster resource utilization.
Application Effect of the Intelligent Engine on DQL
Performance issues in Hive Server 2
FetchTask Acceleration: Presort and Logical Optimization
After a query completes, HS2 polls the result files locally until the LIMIT count is reached, then returns. When there are many small files with large files behind them, this becomes a bad case: HS2 interacts with HDFS over and over, fetching file metadata and file data, which greatly lengthens the run time.
Pre-sorting the result files by size before the fetch can improve performance by a factor of hundreds.
Example: there are 200 result files in total: 199 small files each holding a single record "a", and 1 large file holding a mix of "a" and "test" records, whose name sorts after the small files.
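A simulation of this example shows why the presort helps: polling in name order opens all 199 small files before reaching the big one, while size-descending order satisfies the LIMIT from the big file alone. File contents and sizes below are simulated, not real HDFS calls:

```python
# Sketch of fetch polling with and without the size presort.

def fetch_limit(files, limit):
    """files: list of (name, records). Returns (rows, files_opened)."""
    rows, opened = [], 0
    for name, records in files:
        opened += 1                 # each open is one HDFS round trip
        rows.extend(records)
        if len(rows) >= limit:
            break
    return rows[:limit], opened

# 199 small files with one record each, plus one big file named last.
files = [("part-%03d" % i, ["a"]) for i in range(199)]
files.append(("part-199", ["a", "test"] * 100))   # 200 records

_, opened_by_name = fetch_limit(sorted(files), limit=200)
presorted = sorted(files, key=lambda f: len(f[1]), reverse=True)
_, opened_presorted = fetch_limit(presorted, limit=200)

print(opened_by_name, opened_presorted)  # 200 opens vs 1 open
```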
FetchTask Acceleration: Presort and Logical Optimization
Hive's SimpleFetchOptimizer generates a FetchTask directly, saving resource-request and scheduling time. But this optimization hits a bottleneck: when the data volume is small but the file count is large, the number of rows to return is large, or a filter condition discards most of the result data, the input files are read serially, producing long query latency with no acceleration benefit at all.
We added a file-count criterion to SimpleFetchOptimizer: above a threshold, the task is submitted to the cluster instead, where higher concurrency accelerates it.
Example: reading a partition with 500 files, with the optimized file-count threshold set to 100.
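The added decision rule amounts to a simple threshold check; the threshold value comes from the example above, and the function name is an illustration, not Hive's actual code:

```python
# Sketch of the file-count criterion added to SimpleFetchOptimizer.
FILE_THRESHOLD = 100  # from the example above

def plan_fetch(num_files, threshold=FILE_THRESHOLD):
    """Few files: fetch serially in HS2. Many files: run a concurrent cluster job."""
    return "local_fetch" if num_files <= threshold else "cluster_job"

print(plan_fetch(50))    # small partition: direct fetch
print(plan_fetch(500))   # the 500-file partition: submitted to the cluster
```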
Large Table DESC Optimization
For a table with a huge number of partitions, the DESC flow used to fetch all partitions from the metastore, even though only table-level information is returned.
This metastore interaction delays the whole DESC query, and when the metastore is under heavy pressure the result may never come back.
For DESC TABLE we removed the partition fetch from the metastore entirely; the time saved is proportional to the number of partitions.
Example: DESC on a large table with 100,000 partitions.
Reuse the split computation data and skip the reduce estimator's repeated scan of the input statistics. For jobs with large inputs, this raised the scheduling rate by 50%.
Accelerate Parquet SerDe init by skipping repeated column-pruning work on the same table, preventing map task OP-init timeouts.
Add a LazyOutputFormat that only creates a file when a record is written, avoiding empty output files that cost downstream jobs a great deal of read time.
StatsTask supports multi-threaded aggregation of statistics, preventing slow aggregation over too many intermediate files from lengthening the run.
AdHoc needs parallel compilation enabled, preventing the overall latency increase caused by serial compilation of SQL.
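For reference, in recent open-source Hive releases the parallel-compilation behavior mentioned in the last point is controlled by a hive-site.xml property along these lines (verify the name against your Hive version):

```xml
<!-- Allow HS2 to compile queries from different sessions in parallel,
     instead of serializing all compilation behind one lock. -->
<property>
  <name>hive.driver.parallel.compilation</name>
  <value>true</value>
</property>
```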
Pain Points in the Use of SQL on Hadoop Platform
SQL on Hadoop at Kuaishou: Common Usability Issues
Hive Server 2 Service Startup Optimization
When HS2 starts, initializing the materialized-view feature polls the entire metastore, making HS2 startup extremely slow; the window from going offline to finishing a restart was too long, and availability suffered.
We changed materialized-view initialization to lazy loading on a separate thread, so it no longer blocks HS2 service start. While loading, the materialized-view feature serves cached information, keeping it available.
HS2 startup time dropped from 5 min+ to under 5 s.
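The lazy-loading pattern can be sketched as follows: the heavyweight initialization moves to a background thread, the constructor returns immediately, and callers see cached (possibly empty) data until the load finishes. Class and method names are illustrative, not Hive's real registry code:

```python
# Sketch of lazy background loading so service start is not blocked.
import threading, time

class MaterializedViewRegistry:
    def __init__(self):
        self._views = {}                 # cached info served while loading
        self._loaded = threading.Event()
        threading.Thread(target=self._load_all, daemon=True).start()

    def _load_all(self):
        # Stand-in for polling the whole metastore (the slow part).
        time.sleep(0.1)
        self._views = {"daily_summary": "SELECT ..."}
        self._loaded.set()

    def get(self, name):
        # Serve whatever is cached; never block service threads.
        return self._views.get(name)

reg = MaterializedViewRegistry()   # returns immediately: start is not blocked
print(reg.get("daily_summary"))    # usually None: background load still running
reg._loaded.wait()
print(reg.get("daily_summary"))    # available once the load completes
```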
Hive Server 2 Configuration Hot Loading
Taking HS2 itself offline and back online is expensive, since every task running on the instance must finish first. Configuration changes, however, are a high-frequency operation and call for hot loading.
We added an interface at HS2's Thrift Server layer. Once wired to the operations system, it is invoked automatically when configuration is pushed down, so configuration updates take effect through hot loading.
Scratchdir optimization of Hive Server 2
Hive Server 2's scratchdir is mainly used for temporary files during execution. HS2 creates a scratchdir whenever a session is created. When HDFS is under heavy pressure, large numbers of sessions block on scratchdir creation, connections pile up to the limit, and eventually HS2 can accept no new connections, hurting service availability.
To solve this, we first separated the scratch directories of ordinary queries from those of CREATE TEMPORARY TABLE queries, and made scratch creation for CREATE TEMPORARY TABLE lazy. Now, even when CREATE TEMPORARY TABLE queries create masses of temporary files and raise HDFS NameNode latency, the scratchdir of ordinary queries still responds normally.
In addition, HS2 supports configuring multiple scratchdirs; each can be given a load ratio, achieving HDFS load balancing.
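Weighted scratchdir selection can be sketched as weighted random choice; the paths and the 3:1 ratio below are made up for illustration:

```python
# Sketch of load-ratio scratchdir selection across HDFS directories.
import random

SCRATCH_DIRS = [("/tmp/hive-scratch-a", 3), ("/tmp/hive-scratch-b", 1)]

def pick_scratchdir(dirs=SCRATCH_DIRS, rand=random.random):
    total = sum(w for _, w in dirs)
    r = rand() * total
    for path, w in dirs:
        if r < w:
            return path
        r -= w
    return dirs[-1][0]

# With ratio 3:1, roughly 75% of sessions land on scratch-a.
counts = {"/tmp/hive-scratch-a": 0, "/tmp/hive-scratch-b": 0}
for _ in range(10000):
    counts[pick_scratchdir()] += 1
print(counts)
```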
Hive Stage Concurrent Scheduling Exception Repair
There are two problems in Hive scheduling.
First, when a child Task is in a non-executed state and multiple parent Tasks contain it, the child Task gets re-added to the scheduling queue. For this case, the non-executed state must be reset to the initialized state.
Second, while judging whether a child Task is runnable, an abnormal state check could keep a Task that needed scheduling from being enqueued, losing a Stage from the query. Our fix adds a round of Stage-status checks after execution finishes: if any downstream Stage is found incomplete, an error is thrown immediately, completing the status check on the query result.
HS2 implements interface termination query SQL. With this function, abnormal SQL can be terminated in time.
Metastore JDOQuery optimization: skip queries containing abnormal keywords, preventing long metastore stalls or abnormal queries from affecting the metastore.
Add a switch to force-overwrite external-table directories, fixing the file-rename error on INSERT OVERWRITE to an external table.
Hive Parquet pushdown gains a disable switch, avoiding the wrong results caused by abnormal pushdown of OR conditions.
The executeForArray function could OOM while joining an extremely large string; a limit was added.
Added reading partition data according to the table schema, avoiding read errors caused by non-cascading changes to the partition schema.
Pain Points in the Use of SQL on Hadoop Platform
Why Develop an SQL Expert System
Some users have no development experience and cannot handle the errors returned by the engine.
Some error messages are unclear, and users cannot work out the cause.
Troubleshooting a failed job is expensive and requires familiarity with the Hadoop stack.
Users' erroneous SQL, and SQL in need of optimization, share many common patterns; maintaining this knowledge by hand is costly, while systematic analysis is cheap.
SQL Expert System
Based on the HS2 Hook architecture, the SQL expert system implements three main modules in the Beacon Server back end: SQL rule control, SQL error analysis, and SQL optimization suggestion. The knowledge base of the SQL expert system, covering keywords, cause explanations, handling plans, and other key information, is stored in a back-end database and accumulated continuously.
Through the SQL expert system, the back end can rein in abnormal SQL queries, keeping abnormal SQL from wasting resources or harming cluster stability. When users hit a problem, they get the solution directly, which reduces the cost of use.
Example: Empty partition query control.
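The error-analysis lookup can be sketched as keyword matching against the knowledge base. The entries below are invented examples for illustration, not Kuaishou's real knowledge base:

```python
# Sketch of knowledge-base lookup: error text -> cause -> suggested fix.

KNOWLEDGE_BASE = [
    ("OutOfMemoryError", "Task ran out of memory",
     "Increase container memory or reduce data processed per task"),
    ("FileNotFoundException", "Input path missing",
     "Check that the partition exists and was not dropped mid-query"),
]

def diagnose(error_log):
    for keyword, cause, fix in KNOWLEDGE_BASE:
        if keyword in error_log:
            return {"cause": cause, "suggestion": fix}
    return {"cause": "unknown", "suggestion": "escalate to the platform team"}

print(diagnose("java.lang.OutOfMemoryError: Java heap space"))
```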
Job Diagnosis System
The SQL expert system covers part of the fault-diagnosis needs of HS2 task execution, but questions such as job health or abnormal task execution need a dedicated system. For this we designed the job diagnosis system.
At the YARN level, the job diagnosis system collects counters and configuration per execution engine; at the execution level, it produces the relevant optimization suggestions.
Its data can also be fed to the SQL expert system through an API, supplementing the cause analysis.
The job diagnosis system provides query pages for running tasks. Below is the query flow for a task that hit the rule for too many map inputs:
On the job page, you can also view further diagnostic information and job modification suggestions.
Pain Points in the Use of SQL on Hadoop Platform
SQL on Hadoop at Kuaishou: Common Operations Problems
Audit Analysis-Architecture Diagram
The audit function is another module of the Beacon Server service.
The Hook configured in HS2 sends the SQL, IP, user, and other needed information to the back end, where the databases, tables, columns, and operation type are extracted; after analysis, the records are stored in Druid. Users can query part of this open data through a visualization platform.
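The extraction step can be sketched with a regex over the SQL text. A real system would use the compiler's lineage information rather than a regex, and the record fields below are an assumption about the shape of the audit data:

```python
# Sketch of building an audit record (tables + operation) from submitted SQL.
import re

TABLE_RE = re.compile(r"\b(?:from|join|into|table)\s+([\w.]+)", re.IGNORECASE)

def audit_record(user, ip, sql):
    return {
        "user": user,
        "ip": ip,
        "tables": sorted(set(TABLE_RE.findall(sql))),
        "operation": sql.strip().split()[0].upper(),
    }

rec = audit_record("alice", "10.0.0.1",
                   "SELECT a.id FROM db.orders a JOIN db.users b ON a.uid = b.id")
print(rec["tables"], rec["operation"])
```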
Audit Analysis – Hotspot Information Query
Hotspot information query shows the hot data and hot operations over a period of time: which databases and tables users accessed, and with which types of operations.
Audit Analysis – Lineage Information Query
As the figure below shows, lineage information presents the upstream dependencies involved in creating a table; it is generally used to judge the scope of a table's impact.
Audit Analysis – Historical Operation Query
Historical operation query traces the operations performed on a table over a period of time, showing the user, client, platform, and time of each operation. It is generally used to track changes to a table.
Hive Server 2 Cluster A/B Switch Scheme
Because taking the Hive Server 2 service offline and back online is expensive, an upgrade used to take a long time and affect availability. The A/B switch scheme for Hive Server 2 clusters keeps cluster A online with cluster B on standby, and achieves seamless upgrades by switching which cluster's machines are registered on ZK.
Hive Server 2 Cluster Dynamic Online/Offline
The Hive Server 2 clusters deploy Metrics monitoring, which tracks the usage of the cluster services in real time. We also modified the HS2 service to add interfaces for taking HS2 offline on ZK and for cancelling queries.
When external monitoring detects persistently high memory, it first triggers an FGC on the HS2 service process. If memory stays high, the service is taken offline through ZK, and queries are cancelled in order of submission time until memory recovers, ensuring the remaining tasks on the instance run normally.
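The escalation ladder (FGC → ZK offline → cancel oldest queries first) can be sketched as below. All hooks are stubs, and the thresholds and the "memory reclaimed" factors are invented so the flow is observable:

```python
# Sketch of the memory-guard escalation; thresholds and reclaim factors invented.

def memory_guard(heap_used, heap_limit, running_queries, actions,
                 high=0.9, recovered=0.7):
    """running_queries: list of (submit_time, query_id); actions: log list."""
    if heap_used / heap_limit < high:
        return
    actions.append("full_gc")              # step 1: force an FGC
    heap_used *= 0.95                      # pretend FGC reclaimed a little
    if heap_used / heap_limit < high:
        return
    actions.append("zk_offline")           # step 2: stop taking new connections
    for _, qid in sorted(running_queries): # step 3: cancel in submission order
        actions.append("cancel:" + qid)
        heap_used *= 0.8                   # pretend each cancel frees memory
        if heap_used / heap_limit < recovered:
            break                          # remaining tasks keep running

log = []
memory_guard(9.9, 10.0, [(2, "q2"), (1, "q1"), (3, "q3")], log)
print(log)  # FGC alone is not enough here, so two oldest queries are cancelled
```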
Hive Server 2 Cluster Management Platform
With multiple clusters, we need to know the status of every cluster and every HS2 instance. Through the management platform, we can view the version, start time, resource usage, and online/offline status of each service.
Integrated with the operations platform, one-click canary releases and upgrades become much more convenient.
Summary of Kuaishou Query Platform Improvements
04 Future Plans for SQL on Hadoop at Kuaishou
Upgrade the Expert System to Automatic Parameter Tuning and SQL Optimization
Cache Acceleration of AdHoc Queries
Research and Application of New Engine
The content above comes from Zhong Liang's talk. Want to hear more talks from Kuaishou engineers? Attend the GIAC Global Internet Architecture Conference in Shenzhen on June 21-23, where we have invited Yan Ren, testing director of Kuaishou's Application R&D Department, to speak on quality monitoring on the Kuaishou mobile client.
The organizing committee has also invited 105 speakers from first-tier Internet companies such as Google, Microsoft, Oracle, eBay, Baidu, Alibaba, Tencent, SenseTime, TuSimple, ByteDance, Sina, and Meituan-Dianping, covering AI, big data, cloud native, IoT, chaos engineering, fintech, data and business intelligence, engineering culture and management, classic architecture, and other topics, to share their practical experience, the problems they encountered, and their solutions.