At the recent GDG developer conference in Guangzhou, Dong Lin, Senior Technical Director at Getui, gave a talk titled "A One-Stop SQL Solution for Heterogeneous Data," sharing Getui's years of hands-on experience with SQL. This article covers unified SQL from three angles:
1. Why unify SQL
2. How to unify SQL
3. Getui's unified SQL practice
(The following is organized from the content of the talk.)
Why unify SQL
Data operations: a battlefield where many arms must cooperate
Completing data work requires data analysts, data R&D engineers, operations engineers, and even product operators to collaborate, so the efficiency of communication and cooperation becomes a key factor.
A dazzling array of data storage engines
As the big data industry has developed, data storage engines have entered a period of full bloom, with new technologies emerging constantly: relational databases, including MySQL, Oracle, SQL Server, and OceanBase from Ant Group; NoSQL solutions, including Redis, Aerospike, and MongoDB; Hadoop-based solutions, including Hive and HBase; and the NewSQL direction, including Greenplum, TiDB, Doris, and ClickHouse.
This variety of storage engines leaves many people involved in data operations at a loss, unsure which path to choose. Developers have to spend significant effort trying various solutions to meet business needs: they must know exactly where the data lives, what the field formats are, how the data tables relate to each other, and how the data is operated on. Technical and business staff alike must also spend a great deal of time mastering the performance and characteristics of each storage engine.
Various data storage solutions
Many engine choices, high learning costs
Beyond storage engines, the choice of computing engine also causes trouble. Spark and Flink each have their strengths; both are developing rapidly and borrowing from each other. There are also many machine learning engine options, such as TensorFlow, PyTorch, and the algorithm libraries built into the computing engines themselves. The learning cost of these options is relatively high, which often leaves developers torn and unable to choose.
The working languages are not unified
With so many storage and computing engine options, collaborative work faces significant language barriers. Analysts use SQL heavily in their daily work, but sometimes fall back on Python, shell scripts, and other tools to finish data processing. Data modelers rely mainly on Python, while data developers mainly use Java and Scala to build real-time tasks. Product operators may even use Excel for simple analysis tasks; most of the time, though, they describe their needs to the data analysts, who complete the work for them.
These language barriers limit collaboration efficiency and leave little flexibility in allocating people. For example, real-time tasks built on Spark or Flink can currently only be completed by the data R&D engineers, which easily creates a backlog of work.
"Unified SQL": the right time, the right place, the right people
If we think about it carefully, data processing is essentially data warehouse processing, and all kinds of data operations can be abstracted as the warehouse's ETL process: extract, transform, and load. SQL is currently the best DSL (domain-specific language) for describing data processing, and it is the de facto standard.
In the current environment, unifying on SQL is the general trend, and the basic conditions are all in place: the right time, the right place, and the right people.
· The right time: as data volumes grow, the costs of computation, staffing, and communication keep rising, to the point where enterprises can no longer bear them;
· The right place: mainstream relational databases, MPP databases, computing engines, Elasticsearch, and even NoSQL solutions already support, or plan to support, SQL syntax;
· The right people: SQL is easy to learn, with only nine verbs covering its core functionality; analysts, modelers, data developers, and even product operators can all pick it up quickly.
We believe that by unifying SQL we can flip the 80/20 rule: instead of spending 80% of our time on 20% of routine data operations, we can use 20% of our time to complete 80% of them. That leaves more time for the genuinely complex work, and for thinking about the value of the data itself.
How to unify SQL
To unify SQL, we believe four core requirements must be met: metadata access, cross-data-source queries, support for both offline and real-time computing, and support for machine learning.
Taking a typical offline computing scenario as an example, we want a simple SQL statement to achieve the goal: fuse Hive data with MySQL data, then write the result back to HBase.
update hbase.biz.user_balances as a
set a.balance = ret.balance
from (
  select b.uid, c.balance
  from hive.warehouse.users as b
  inner join mysql.biz.balances as c on b.uid = c.uid
) as ret
where a.uid = ret.uid
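To make the semantics concrete, here is a minimal Python sketch of what such a statement has to do, using a single in-memory SQLite database as a stand-in for all three engines (in a real federated engine, each catalog prefix would map to a separate connector; the sample data is invented):

```python
import sqlite3

# One in-memory database stands in for Hive, MySQL, and HBase.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hive_users (uid INTEGER);                     -- hive.warehouse.users
CREATE TABLE mysql_balances (uid INTEGER, balance REAL);   -- mysql.biz.balances
CREATE TABLE hbase_user_balances (uid INTEGER PRIMARY KEY, balance REAL);  -- hbase.biz.user_balances
""")
conn.executemany("INSERT INTO hive_users VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO mysql_balances VALUES (?, ?)",
                 [(1, 10.0), (2, 25.5), (3, 99.0)])
conn.executemany("INSERT INTO hbase_user_balances VALUES (?, ?)",
                 [(1, 0.0), (2, 0.0)])

# Step 1: the cross-source inner join (the "ret" subquery).
rows = conn.execute("""
SELECT b.uid, c.balance
FROM hive_users AS b INNER JOIN mysql_balances AS c ON b.uid = c.uid
""").fetchall()

# Step 2: write the joined result back to the sink, keyed on uid.
for uid, balance in rows:
    conn.execute("UPDATE hbase_user_balances SET balance = ? WHERE uid = ?",
                 (balance, uid))

result = dict(conn.execute("SELECT uid, balance FROM hbase_user_balances"))
print(result)  # {1: 10.0, 2: 25.5}
```

Note that uid 3, which exists only in the MySQL side, is dropped by the inner join and never reaches the sink.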
Similarly, for a typical real-time computing scenario, we want a simple SQL statement to fuse a Kafka data stream with MySQL data, then write the result back to Redis:
update redis.dashboard.brands as a
set a.cnt = ret.cnt
from (
  select c.brand, count(distinct b.uid) as cnt
  from kafka.app.visit_logs as b
  inner join mysql.warehouse.users as c on b.uid = c.uid
  group by c.brand
) as ret
where a.brand = ret.brand
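The streaming version can be sketched the same way; here the state that a streaming count(distinct ...) must keep is made explicit, with plain dicts standing in for the MySQL dimension table and the Redis sink (all data is invented):

```python
from collections import defaultdict

# Dimension table mysql.warehouse.users (uid -> brand), loaded once here;
# a real engine would do a lookup join against MySQL.
users = {1: "A", 2: "A", 3: "B"}

# Stand-in for redis.dashboard.brands.
redis_brands = {}

# Per-brand sets of distinct uids: the state a streaming
# COUNT(DISTINCT uid) aggregation has to maintain.
seen = defaultdict(set)

def on_event(event):
    """Process one kafka.app.visit_logs record: join, aggregate, write back."""
    brand = users.get(event["uid"])
    if brand is None:       # uid missing from the dimension table: inner join drops it
        return
    seen[brand].add(event["uid"])
    redis_brands[brand] = len(seen[brand])   # upsert the new count

for e in [{"uid": 1}, {"uid": 2}, {"uid": 1}, {"uid": 3}, {"uid": 9}]:
    on_event(e)

print(redis_brands)  # {'A': 2, 'B': 1}
```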
To find a suitable solution, we first investigated standalone open source projects, and looked closely at Qihoo 360's open source QuickSQL. This solution parses SQL into an AST (abstract syntax tree) and then pushes the execution plan down to the specific execution engine. QuickSQL supports many data sources, including Hive, MySQL, Elasticsearch, and MongoDB. Unfortunately, it does not support real-time tasks; it is a solution for offline scenarios only.
Apache Calcite, at the core of this solution, is of great significance. Apache Calcite is a SQL-based dynamic data management framework. It first parses SQL into an AST and validates the syntax, then optimizes the query using RBO (rule-based optimization) and CBO (cost-based optimization), and finally connects to external data sources through JDBC and other adapters. Calcite also supports pluggable metadata: through the framework it provides, developers can supply table names, row counts, distribution, sort order, and other information needed for optimization.
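The kind of rule-based rewrite Calcite performs can be illustrated with a toy logical plan. The tuple encoding and the single pushdown rule below are simplifications for illustration, not Calcite's actual API:

```python
# Toy logical plan: each node is (operator, args, child).

def push_filter_below_project(plan):
    """One RBO rule: rewrite filter(project(scan)) -> project(filter(scan)),
    so the filter runs closer to the data source."""
    op, args, child = plan
    if op == "FILTER" and child is not None and child[0] == "PROJECT":
        _, proj_cols, scan = child
        return ("PROJECT", proj_cols, ("FILTER", args, scan))
    return plan

plan = ("FILTER", "balance > 0",
        ("PROJECT", ["uid", "balance"],
         ("SCAN", "mysql.biz.balances", None)))

optimized = push_filter_below_project(plan)
print(optimized)
# ('PROJECT', ['uid', 'balance'], ('FILTER', 'balance > 0', ('SCAN', 'mysql.biz.balances', None)))
```

After the rewrite, the filter sits directly above the scan, where a connector can translate it into a WHERE clause pushed to the source database.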
Next, we investigated the mainstream real-time computing engines to see whether they provide reasonably complete SQL support. The first we evaluated was Spark SQL. It parses SQL into an AST with an ANTLR-based parser, then optimizes the query, and finally compiles the work down into RDD tasks.
This solution supports both offline and real-time computing, but using it inside the company raises several problems. First, extending it to support more data sources means modifying Spark SQL's connector layer, which in turn means modifying the Spark environment. For a large Spark cluster, updating and upgrading the underlying environment carries considerable risk, and for privately deployed scenarios such updates are hard to accept.
Second, external data sources must be declared through non-standard DDL syntax such as USING, which undermines SQL's standardization to some extent. Overall, though, Spark SQL gave us a very good reference solution and line of thinking.
Then we evaluated Flink SQL, a very valuable component contributed to Flink in large part by Alibaba. Its overall process closely resembles Spark's: it parses SQL through Calcite, performs query optimization, and then pushes the execution plan down; it likewise supports many data sources. Relatively speaking, Flink SQL's support is more complete than Spark SQL's, but it has similar problems, including metadata integration, the migration cost of the underlying engine, and the use of a custom WITH clause to declare data source access.
Because these DDL extensions are not standard SQL syntax, they add learning cost on the one hand and create compatibility problems on the other: the same SQL statement cannot run on both the Spark and Flink platforms.
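To illustrate the incompatibility, here is a small Python sketch with an invented helper that distinguishes the two connector-declaration styles; the DDL strings follow Spark SQL's USING clause and Flink SQL's WITH clause:

```python
import re

def ddl_dialect(ddl: str) -> str:
    """Hypothetical helper: guess which engine a CREATE TABLE statement
    targets from its connector clause. Flink SQL uses WITH ('key' = 'value'),
    Spark SQL uses USING <format>; plain ANSI DDL has neither."""
    if re.search(r"\bWITH\s*\(", ddl, re.IGNORECASE):
        return "flink"
    if re.search(r"\bUSING\s+\w+", ddl, re.IGNORECASE):
        return "spark"
    return "ansi"

spark_ddl = "CREATE TABLE users (uid INT) USING parquet"
flink_ddl = "CREATE TABLE users (uid INT) WITH ('connector' = 'kafka')"

print(ddl_dialect(spark_ddl), ddl_dialect(flink_ddl))  # spark flink
```

The point is not the detection itself but that a statement written for one engine is rejected by the other, which is exactly the portability problem described above.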
Finally, we investigated SQL support for machine learning. We found an open source solution called MLSQL, which can complete model training and prediction quickly through keywords like train and predict, and also supports cross-data-source queries. However, MLSQL focuses more on end-to-end pipeline integration, and mature application cases at large data volumes are still lacking; we would need substantial custom development for our scenarios. On the whole, though, its syntax extensions were very instructive for us.
load …`/tmp/train` as trainData;

train trainData as RandomForest.`/tmp/rf` where
  keepVersion="true"
  and fitParam.0.featuresCol="content"
  and fitParam.0.labelCol="label"
  and fitParam.0.maxDepth="4"
  and fitParam.0.checkpointInterval="100"
  and fitParam.0.numTrees="4";

predict testData as RandomForest.`/tmp/rf`;
*Source: MLSQL official documentation
Summarizing the research above, we arrived at the following goals for the technical architecture of unified SQL:
1. Compatible with both computing engines (Spark/Flink), without modifying either engine;
2. Flexibly extensible data sources;
3. An independently controlled SQL optimization process, logically decoupled from the engines but physically bound to them;
4. Extended syntax to support machine learning.
Getui's unified SQL practice
To make it easy for people in different roles and positions to participate in data governance, we developed a data middle-platform system, known inside the company as the Daily Data Governance Platform. It consists of five modules: the data warehouse, the model platform, the data asset management platform, the data integration platform, and the data development platform. It helps enterprises address data governance pain points, lowers the barrier to using data, and raises the value extracted from it.
Through the data asset management platform, we register data assets, manage data lineage, and monitor data quality. Through the data integration platform, we connect to data sources and complete data ingestion. Through the data development platform, we build and maintain the data warehouse. Underpinning the data development platform is GQL, Getui's unified SQL engine.
GQL's architecture closely resembles the Spark SQL and Flink SQL architectures described above: the parsing layer decomposes SQL into a syntax tree, query optimization is then performed using the data's metadata, and finally the execution tasks are dispatched to the appropriate computing or storage engine.
On the metadata side, we connect the data governance platform's information to the SQL parsing layer, eliminating DDL differences: there is no longer any need to declare the data being accessed as temporary tables via keywords such as WITH or USING, which improves SQL standardization.
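That metadata lookup can be sketched simply: the parser sees a three-part name such as mysql.biz.balances and asks the catalog for connection details instead of requiring a per-query WITH/USING clause. The catalog contents below are illustrative, not Getui's real schema:

```python
# Invented catalog: qualified table name -> connector configuration.
CATALOG = {
    ("mysql", "biz", "balances"): {
        "connector": "jdbc", "url": "jdbc:mysql://db:3306/biz", "table": "balances",
    },
    ("kafka", "app", "visit_logs"): {
        "connector": "kafka", "topic": "visit_logs", "format": "json",
    },
}

def resolve(qualified_name: str) -> dict:
    """Look up a three-part table name in the catalog."""
    parts = tuple(qualified_name.split("."))
    if parts not in CATALOG:
        raise KeyError(f"table not registered in catalog: {qualified_name}")
    return CATALOG[parts]

src = resolve("kafka.app.visit_logs")
print(src["connector"])  # kafka
```

Because registration happens once, in the governance platform, every query that names the table inherits the connection details automatically.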
On the development side, we extended the Database Navigator plugin to support GQL, so developers can write and debug GQL in IntelliJ IDEA and view query results directly.
On the data source side, we currently support five categories: JDBC/ODBC interfaces, file systems, the Hadoop ecosystem, KV stores, and Kafka. In the future we hope API data interfaces can also be accessed through GQL.
For machine learning, GQL supports two groups of functions. In feature engineering, the mechanical work of handling outliers and missing values, feature combination, and automatic feature selection can be completed through GQL. On the model side, GQL supports supervised models (such as logistic regression, random forest, and XGBoost) and unsupervised clustering models.
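The mechanical feature-engineering steps mentioned above can be sketched in plain Python; clean_feature is an invented helper for illustration, not a GQL function:

```python
from statistics import median

def clean_feature(values, low, high):
    """Impute missing values with the median, then clip outliers to [low, high]."""
    present = [v for v in values if v is not None]
    fill = median(present)                               # imputation value
    filled = [fill if v is None else v for v in values]  # fill the gaps
    return [min(max(v, low), high) for v in filled]      # clip outliers

raw = [1.0, None, 3.0, 1000.0, 2.0]
print(clean_feature(raw, low=0.0, high=10.0))  # [1.0, 2.5, 3.0, 10.0, 2.0]
```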
As a DSL, SQL has a simple syntax but is not general-purpose. Some tasks, such as iteration or encryption and decryption algorithms, are not well suited to SQL; they call for UDF (user-defined function) extensions to make SQL code more expressive. GQL provides extended UDF support, and UDFs can be added in Java, Python, and other languages.
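Python's built-in sqlite3 module demonstrates the same UDF idea on a small scale, registering a Python function so it is callable from SQL; mask_phone here is an invented example UDF, not part of GQL:

```python
import sqlite3

def mask_phone(phone: str) -> str:
    """Keep the first 3 and last 4 digits, mask the middle."""
    return phone[:3] + "****" + phone[-4:]

conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL function.
conn.create_function("mask_phone", 1, mask_phone)

row = conn.execute("SELECT mask_phone('13812345678')").fetchone()
print(row[0])  # 138****5678
```

This is exactly the kind of task (string masking for privacy) that is awkward in pure SQL but trivial as a UDF.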
We believe SQL is currently the language best suited to working with data. Metadata access, cross-data-source queries, offline and real-time support, and machine learning support are the four core capabilities of unified SQL. Flink SQL can basically meet these needs: if the underlying computing engine is Flink, it will no doubt solve most problems. But to retain greater flexibility and scalability, it is worth building SQL support as an independent technology stack. Drawing on our own practice, GQL serves as the unified SQL engine of the Daily Data Governance Platform; we share these technical solutions and capabilities in the hope that they inspire you.