The Foundation of the OPPO Data Platform: Building a Real-time Data Warehouse Based on Flink SQL

Time: 2019-09-11

Author | Zhang Jun

This article is based on the Flink Meetup held in Shenzhen on April 13, 2019, and was shared by Zhang Jun, who is currently responsible for the development of the OPPO big data platform and is an Apache Flink contributor. The main contents are as follows:

  • The evolution of the OPPO real-time data warehouse;
  • Extensions based on Flink SQL;
  • Application cases of building the real-time data warehouse;
  • Thoughts and outlook on future work.

1. Evolution of the OPPO Real-time Data Warehouse

1.1. OPPO Business and Data Scale

Everyone knows OPPO as a smartphone maker, but not everyone knows what OPPO has to do with the Internet and big data. The following figure outlines OPPO's business and data situation:

As a mobile phone manufacturer, OPPO customizes its own ColorOS system based on Android, which today has over 200 million active users. Around ColorOS, OPPO has built many Internet applications, such as the app store, the browser, the information feed, etc. While operating these Internet applications, OPPO has accumulated a large amount of data. The right side of the figure above shows the evolution of the overall data scale: since 2012 the volume has grown 2-3x every year; the total data volume now exceeds 100 PB, and the daily increment exceeds 200 TB.
To support such a large amount of data, OPPO developed a set of data systems and services, and gradually formed its own data platform system.

1.2. The OPPO Data Mid-Platform

This year everyone is talking about the data mid-platform. How does OPPO understand the data mid-platform? We divide it into four layers:

  • The lowest layer is the unified tool system, which covers the full data link of “ingestion – governance – development – consumption”.
  • On top of the tool system, the data warehouse is built, divided into “raw layer – detail layer – summary layer – application layer”, which is the classic warehouse architecture.
  • Above that is the global data system. What is a global data system? It connects all of the company's business data to form unified data assets, such as ID-Mapping, user tags, etc.
  • Ultimately, scenario-driven data products and services are needed so that the data can actually be used by the business.

The above is the overall system of the OPPO data platform, in which the data warehouse occupies a foundational and core position.

1.3. Building the OPPO Offline Data Warehouse

Over the past two or three years, our focus has been on building the offline data warehouse. The data sources are mobile phones, log files and DB databases. We built a highly available, high-throughput ingestion system based on Apache NiFi, which lands the data in HDFS and forms the raw layer. Then hourly Hive ETL tasks and daily Hive summary tasks are responsible for computing and generating the detail layer and the summary layer respectively. Finally, the application layer consists of data products developed internally at OPPO, mainly report analysis, user profiles and interface services. In addition, the detail layer in the middle also supports Presto-based ad-hoc queries and self-service data extraction.
As the offline data warehouse gradually matured, the demand for a real-time data warehouse became more and more pressing.

1.4. The Demand for a Real-time Data Warehouse

Demand for a real-time data warehouse is usually viewed from the business perspective, but in fact real-time can also bring practical benefits from the platform perspective. Firstly, on the business side, there are real-time scenarios for reports, labels, interfaces, etc.; see the cases on the left of the figure above. Secondly, on the platform side, there are three cases: first, in OPPO a large number of batch tasks start at midnight and process data in T+1 mode, which causes a burst of computing load and puts great pressure on the cluster; second, label import is also a T+1 batch task, and each full import takes a long time; third, data quality monitoring is likewise T+1, so some problems cannot be discovered in time.

Since both the business side and the platform side have real-time demands, how should OPPO build its own real-time data warehouse?

1.5. Smooth migration from offline to real-time

Whether it is a platform or a system, it consists of an upper and a lower layer: the upper layer is the API, the user-facing programming abstraction and interface; the lower layer is the Runtime, the kernel-facing execution engine. We want the migration from offline to real-time to be smooth. What does that mean? At the API level, the abstraction of the data warehouse is the Table and the programming interface is SQL + UDF; users became accustomed to this API in the offline era, so it is best to keep it consistent after migrating to the real-time warehouse. At the Runtime level, the compute engine evolves from Hive to Flink, and the storage engine from HDFS to Kafka.

Based on the above ideas, the real-time pipeline can be obtained by transforming the offline pipeline mentioned earlier.

1.6. Building the OPPO Real-time Data Warehouse

As can be seen from the figure above, the whole pipeline is basically similar to the offline warehouse, except that Hive is replaced by Flink and HDFS is replaced by Kafka. From the point of view of the overall process, the basic model is unchanged: it is still composed of cascading computation over the raw layer, detail layer, summary layer and application layer.

So the core problem here is how to build this pipeline based on Flink. Here is what we did based on Flink SQL.

2. Extensions Based on Flink SQL

2.1. Why Flink SQL

Firstly, why use Flink SQL? The following figure shows the basic structure of the Flink framework. At the bottom is the Runtime, which we think has four core advantages: first, low latency and high throughput; second, end-to-end Exactly-once; third, fault-tolerant state management; fourth, support for Window & Event Time. On top of the Runtime, three levels of API are abstracted, with SQL at the top.

What are the advantages of the Flink SQL API? We also look at it from four aspects: first, it supports the ANSI SQL standard; second, it supports rich data types and built-in functions, including common arithmetic operations and statistical aggregations; third, Source/Sink can be customized, so the upstream and downstream can be flexibly extended; fourth, batch-stream unification: the same SQL can run either offline or in real time.

So, how do you program against the Flink SQL API? Here is a simple demonstration:

First comes the definition and registration of the input/output tables: two Kafka tables are created, specifying the Kafka version and the corresponding topic. Then comes the registration of UDFs, whose code is omitted here for brevity. Finally the actual SQL is executed. As you can see, quite a lot of coding is needed just to run one SQL statement, and this is not the interface we want to expose to users.
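To make the three steps concrete, here is a minimal sketch in the spirit of that demonstration, assuming a Flink 1.9-style Table API with the legacy connect/descriptor interface; the topic, field and table names are illustrative, and the exact factory methods vary across Flink versions.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.descriptors.Json;
import org.apache.flink.table.descriptors.Kafka;
import org.apache.flink.table.descriptors.Schema;

public class FlinkSqlDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // 1. Define and register a Kafka input table: connector + format + schema.
        tEnv.connect(new Kafka()
                .version("0.10")                          // which Kafka version
                .topic("sdk_log")                         // which topic it corresponds to
                .property("bootstrap.servers", "kafka:9092")
                .property("group.id", "demo"))
            .withFormat(new Json().deriveSchema())
            .withSchema(new Schema()
                .field("app_id", "VARCHAR")
                .field("user_id", "VARCHAR"))
            .inAppendMode()
            .registerTableSource("sdk_log");

        // 2. Register the output table and any UDFs in the same way (omitted for brevity).

        // 3. Execute the real SQL.
        tEnv.sqlUpdate("INSERT INTO store_events "
                     + "SELECT user_id FROM sdk_log WHERE app_id = 'app_store'");
        env.execute("flink-sql-demo");
    }
}
```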

2.2. Web-based Development IDE

As mentioned earlier, the abstraction of the data warehouse is the Table and the programming interface is SQL + UDF. For users, the programming interface the platform provides should look like the one shown above; anyone who has used HUE for interactive queries will find it familiar: the menu on the left is the Table list, and on the right is the SQL editor, where you can write SQL and submit it for execution directly. To achieve such an interactive experience, Flink SQL alone is not enough by default; there is a gap in between. To sum up, there are two points: first, metadata management: how to create database tables and how to upload UDFs, so that they can later be referenced directly in SQL; second, SQL job management: how to compile the SQL and how to submit the job.
In the course of our technical research, we found AthenaX, a framework open-sourced by Uber in 2017.

2.3. AthenaX: REST-based SQL Manager

AthenaX can be seen as a REST-based SQL manager. How does it implement SQL job and metadata management?

For SQL job submission, AthenaX provides a Job abstraction that encapsulates information such as the SQL to be executed and the job resources. All Jobs are hosted by a JobStore, which regularly matches them against the running applications in YARN; if a Job has no corresponding application, it is submitted to YARN.

For metadata management, the core issue is how to inject externally created library tables into Flink so that they can be recognized in SQL. In fact, Flink itself reserves the ability to connect to external metadata, providing two abstractions: ExternalCatalog and ExternalCatalogTable. On this basis, AthenaX encapsulates a TableCatalog and extends it at the interface level. At the submission stage of a SQL job, AthenaX automatically registers the TableCatalog with Flink, then calls the Flink SQL interface to compile the SQL into Flink's executable unit, the JobGraph, and finally submits it to YARN to create a new application.

Although AthenaX defines the TableCatalog interface, it does not provide a directly usable implementation. So how should we implement it in order to connect to our existing metadata system?

2.4. The Flink SQL Table Registration Process

First, we need to figure out how Flink SQL registers tables internally. The whole process involves three basic abstractions: TableDescriptor, TableFactory and TableEnvironment.

TableDescriptor, as its name implies, is a description of a table. It consists of three sub-descriptors: Connector, which describes the source of the data, such as Kafka, ES, etc.; Format, which describes the format of the data, such as csv, json, avro, etc.; and Schema, which describes the name and type of each field. TableDescriptor has two basic implementations: ConnectTableDescriptor is used to describe internal tables, that is, tables created programmatically, while ExternalCatalogTable is used to describe external tables.

With the TableDescriptor in hand, a TableFactory is needed to instantiate the Table based on the description. Different descriptions need different TableFactory implementations to process them, so how does Flink find the matching one? To keep the framework extensible, Flink uses the Java SPI mechanism to load all declared TableFactory implementations and traverses them to find the one that matches the TableDescriptor. The TableDescriptor is converted into a map before being passed to the TableFactory, with all description information expressed as key-value pairs. TableFactory defines two methods for the matching: requiredContext(), which checks whether the value of a particular key matches, for example whether connector.type is kafka; and supportedProperties(), which checks whether every key can be recognized; if an unrecognized key appears, the factory does not match.
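As a concrete illustration of the two matching methods, here is a hedged sketch of a custom TableFactory; the "demo" connector type and the listed property keys are illustrative assumptions rather than OPPO's actual factory, and a real factory would additionally implement an interface such as StreamTableSourceFactory to actually create the source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.flink.table.factories.TableFactory;

public class DemoTableFactory implements TableFactory {

    // Flink keeps only the factories whose requiredContext() values all match
    // the key-value map converted from the TableDescriptor.
    @Override
    public Map<String, String> requiredContext() {
        Map<String, String> context = new HashMap<>();
        context.put("connector.type", "demo");   // e.g. "kafka" for the Kafka factory
        context.put("connector.version", "1");
        return context;
    }

    // Every remaining key in the descriptor map must appear in this list; an
    // unrecognized key means this factory does not match.
    @Override
    public List<String> supportedProperties() {
        List<String> properties = new ArrayList<>();
        properties.add("connector.topic");
        properties.add("format.type");
        properties.add("schema.#.name");          // '#' is the wildcard for indexed keys
        properties.add("schema.#.type");
        return properties;
    }
}
```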

Once the correct TableFactory has been matched, the next step is to create the real Table and register it through the TableEnvironment. Only after a table is registered successfully can it be referenced in SQL.

2.5. Connecting Flink SQL to External Data Sources

Understanding how Flink SQL registers tables gives us an idea: if tables created in an external metadata system can also be converted into the map that a TableFactory recognizes, they can be seamlessly registered with the TableEnvironment. Based on this idea, we implemented the integration of Flink SQL with our existing metadata center. The general process is shown in the following figure:

When a table is created in the metadata center, its metadata is stored in MySQL. We use one table to record the basic information of the Table, and three more tables to record the descriptions of Connector, Format and Schema after they have been converted into key-value form. The split into three tables is so that each of the three descriptions can be updated independently. Next comes a customized ExternalCatalog, which reads these four MySQL tables and converts them into the map structure.
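A minimal sketch of this conversion step follows, assuming a hypothetical MySQL layout with three key-value tables (meta_connector, meta_format, meta_schema) keyed by table name; the table and column names are illustrative, and the resulting flat map is assumed to use descriptor-style keys such as connector.type, format.type and schema.0.name that the TableFactory matching expects.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

public class MetaCenterCatalogReader {

    private final String jdbcUrl;
    private final String user;
    private final String password;

    public MetaCenterCatalogReader(String jdbcUrl, String user, String password) {
        this.jdbcUrl = jdbcUrl;
        this.user = user;
        this.password = password;
    }

    /** Reads the connector/format/schema key-value rows of one table into a single flat map. */
    public Map<String, String> loadTableProperties(String tableName) throws Exception {
        Map<String, String> properties = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
            // Hypothetical split into three key-value tables, so that each part
            // (connector, format, schema) can be updated independently.
            for (String kvTable : new String[]{"meta_connector", "meta_format", "meta_schema"}) {
                String sql = "SELECT prop_key, prop_value FROM " + kvTable + " WHERE table_name = ?";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setString(1, tableName);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            properties.put(rs.getString("prop_key"), rs.getString("prop_value"));
                        }
                    }
                }
            }
        }
        return properties;
    }
}
```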

2.6. Real-time Table and Dimension Table Association

So far, our platform has metadata management and SQL job management capabilities, but some basic features are still missing before it can truly be opened to users. The star schema is unavoidable when building a data warehouse. Here is a relatively simple case: the fact table in the middle records the ad click stream, and the surrounding dimension tables describe users, ads, products and channels.

Suppose an analytical SQL needs to join the click-stream table with the user dimension table. How should this be implemented in Flink SQL today? We have two implementations: one based on a UDF, the other based on SQL transformation. Let us look at them in turn.

2.7. UDF-based Dimension Table Association

The UDF-based implementation requires the user to rewrite the original SQL into SQL that calls the UDF, here UserDimFunc, whose code implementation is shown on the right of the figure above. UserDimFunc extends TableFunction, one of the UDF types abstracted by Flink SQL, which can convert any row of data into one or more rows. To implement the dimension table association, when the UDF is initialized it loads the dimension table data from MySQL and caches it in memory; then, for each row processed, the TableFunction's eval() method is called, which looks up the cache by user_id, thereby realizing the association. Of course, this assumes the dimension table is relatively small; if the data volume is large, full loading and caching is not suitable, but we will not expand on that here.
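Here is a hedged sketch of what such a UserDimFunc might look like, assuming a small user_dim table in MySQL with user_id, age and gender columns; the JDBC URL, table and field names are illustrative, not the actual OPPO code from the figure.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.TableFunction;
import org.apache.flink.types.Row;

public class UserDimFunc extends TableFunction<Row> {

    private final String jdbcUrl;
    private transient Map<String, Row> cache;   // user_id -> dimension attributes

    public UserDimFunc(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    @Override
    public void open(FunctionContext context) throws Exception {
        // Full load of the dimension table; only suitable when the table is small.
        cache = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT user_id, age, gender FROM user_dim")) {
            while (rs.next()) {
                cache.put(rs.getString("user_id"),
                          Row.of(rs.getInt("age"), rs.getString("gender")));
            }
        }
    }

    // Called for every input row; emits the matching dimension attributes, if any.
    public void eval(String userId) {
        Row dim = cache.get(userId);
        if (dim != null) {
            collect(dim);
        }
    }
}
```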

The UDF-based implementation is not very friendly to either users or the platform: users have to write strange SQL statements, such as the LATERAL TABLE in the figure, and the platform has to customize a specific UDF for each association scenario, which is costly to maintain. Is there a better way? Let us look at the implementation based on SQL transformation.

2.8. Dimension Table Association Based on SQL Transformation

We want to solve the problems of the UDF-based implementation: users should not need to rewrite the original SQL, and the platform should not need to develop many UDFs. One idea is to add a layer of SQL parsing and rewriting before the SQL is submitted to Flink for compilation, so that the dimension table is associated automatically. After some technical research and a POC, we found this feasible, and we call it the SQL-transformation-based implementation. The idea is explained next.

First, the added SQL parsing step identifies whether there are pre-defined dimension tables in the SQL, such as user_dim in the figure above. Once a dimension table is identified, the SQL rewriting process is triggered, and the join statement marked with the red box is rewritten into a new Table. How is this Table obtained? In recent years the concept of "stream-table duality" has developed in the stream computing field, and Flink is a practitioner of this concept, which means that Stream and Table can be converted into each other in Flink. We convert the Table corresponding to ad_clicks into a Stream, call flatMap on it to form another Stream, and finally convert that back into a Table, obtaining ad_clicks_user. The last question is: how does flatMap implement the dimension table association?

The flatMap operation on a Stream in Flink actually executes a RichFlatMapFunction, whose flatMap() method is invoked to transform every row of data. So we can customize a RichFlatMapFunction that loads, caches, looks up and associates the dimension table data, similar to the UDF-based TableFunction implementation.

Since the implementation logic of the RichFlatMapFunction is similar to that of the TableFunction, why is this approach more general than the UDF-based one? The core point is the additional layer of SQL parsing: it extracts the dimension table information (such as the dimension table name, the association fields, the select fields, etc.), encapsulates it as a JoinContext and passes it to the RichFlatMapFunction, so that a single implementation covers all association scenarios.
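A hedged sketch of such a generic function is shown below; the JoinContext here is a hypothetical holder for what the SQL-parsing layer extracts, and loading of the dimension cache is elided, so this shows only the shape of the idea rather than OPPO's actual implementation. In the rewritten plan it would sit between the Table-to-Stream and Stream-to-Table conversions described above.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.types.Row;
import org.apache.flink.util.Collector;

public class DimJoinFlatMapFunction extends RichFlatMapFunction<Row, Row> {

    /** Hypothetical context produced by the SQL-parsing layer. */
    public static class JoinContext implements Serializable {
        public String dimTableName;        // e.g. "user_dim"
        public int joinKeyIndex;           // position of the join key in the fact row
        public List<String> selectFields;  // dimension fields to append
    }

    private final JoinContext joinContext;
    private transient Map<String, Row> dimCache;   // join key -> dimension row

    public DimJoinFlatMapFunction(JoinContext joinContext) {
        this.joinContext = joinContext;
    }

    @Override
    public void open(Configuration parameters) {
        // Load and cache the dimension table named in joinContext
        // (e.g. from MySQL, as in the TableFunction sketch); omitted here.
        dimCache = new HashMap<>();
    }

    @Override
    public void flatMap(Row factRow, Collector<Row> out) {
        String key = String.valueOf(factRow.getField(joinContext.joinKeyIndex));
        Row dimRow = dimCache.get(key);
        if (dimRow == null) {
            return;                        // or emit factRow padded with nulls for a left join
        }
        // Append the selected dimension fields to the fact row.
        Row joined = new Row(factRow.getArity() + dimRow.getArity());
        for (int i = 0; i < factRow.getArity(); i++) {
            joined.setField(i, factRow.getField(i));
        }
        for (int i = 0; i < dimRow.getArity(); i++) {
            joined.setField(factRow.getArity() + i, dimRow.getField(i));
        }
        out.collect(joined);
    }
}
```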

3. Application Cases of Building the Real-time Data Warehouse

Here are some typical application cases, which are implemented on our platform using Flink SQL.

3.1. Real-time ETL splitting

Here is a typical real-time ETL splitting link, which splits the small tables corresponding to each business out of the big table.

OPPO's largest data source is event tracking on mobile phones. Data from mobile APPs has one characteristic: it is all reported through a small number of unified channels, because it is impractical to upgrade the client and add a new channel for every new tracking event a business needs. For example, we have an sdk_log channel through which all APP tracking events report data, so the raw-layer table corresponding to this channel is huge, dozens of TB per day. In reality, each business only cares about its own part of the data, which requires us to do ETL splitting at the raw layer.

The SQL logic here is relatively simple: it filters on business fields and inserts the rows into different business tables. Its feature is that multiple SQL statements are eventually merged into one and submitted to Flink for execution. A natural worry is: if it contains four SQL statements, will the same data be read four times? In fact, Flink applies some optimizations when compiling the SQL; since all of them point to the same Kafka topic, the data is read only once.
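A hedged sketch of this splitting pattern follows, assuming the sdk_log table and the per-business tables have already been registered by the platform (e.g. via the metadata center); the business and table names are illustrative.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;

public class SdkLogSplitJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // sdk_log and the per-business output tables are assumed to be registered already.
        tEnv.sqlUpdate("INSERT INTO store_events   SELECT * FROM sdk_log WHERE app_id = 'app_store'");
        tEnv.sqlUpdate("INSERT INTO browser_events SELECT * FROM sdk_log WHERE app_id = 'browser'");
        tEnv.sqlUpdate("INSERT INTO feeds_events   SELECT * FROM sdk_log WHERE app_id = 'feeds'");

        // All statements are submitted as one job; per the article, Flink's SQL
        // compilation lets the shared sdk_log Kafka topic be read only once.
        env.execute("sdk_log-splitting");
    }
}
```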

In addition, the same Flink SQL is used for both offline and real-time ETL splitting, writing into HDFS and Kafka respectively; Flink itself supports sinks that write to HDFS, such as the rolling file sink.

3.2. Real-time Metric Statistics

Here is a typical case: computing the CTR of the information feed. The number of exposures and the number of clicks within a time period are counted separately, the click-through rate obtained by dividing them is written into MySQL, and the result is then visualized through our internal report system. The feature of this SQL is that it uses a tumbling window and sub-queries.
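For illustration, here is a hedged, simplified variant of such a statistic as Flink SQL inside a Java job, using a single query with CASE WHEN instead of sub-queries; the table names, fields and the five-minute window are assumptions, and event_time is assumed to be a registered event-time attribute.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;

public class FeedsCtrJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // feeds_events (Kafka source) and feeds_ctr (MySQL sink) are assumed to be registered already.
        tEnv.sqlUpdate(
            "INSERT INTO feeds_ctr " +
            "SELECT " +
            "  TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start, " +
            "  SUM(CASE WHEN event_type = 'click'  THEN 1 ELSE 0 END) AS clicks, " +
            "  SUM(CASE WHEN event_type = 'expose' THEN 1 ELSE 0 END) AS exposures, " +
            "  CAST(SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS DOUBLE) " +
            "    / SUM(CASE WHEN event_type = 'expose' THEN 1 ELSE 0 END) AS ctr " +
            "FROM feeds_events " +
            "GROUP BY TUMBLE(event_time, INTERVAL '5' MINUTE)");

        env.execute("feeds-ctr");
    }
}
```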

3.3. Real-time label import

Here is a real-time label import case. The mobile terminal reports the current user's latitude and longitude in real time; it is converted into a POI, imported into ES, and finally used for user targeting in the label system.

The feature of this SQL is the AggregateFunction. Within a five-minute window we only care about the latest latitude and longitude reported by each user. AggregateFunction is a UDF type usually used for aggregate statistics, such as computing a sum or an average. In this example, since we only care about the latest latitude and longitude, newly arriving data can simply replace the old.
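As a hedged sketch of this "keep only the latest report" aggregation, here is an AggregateFunction that retains the location with the largest event timestamp; the class name, field names and the encoded "lat,lng" string are illustrative assumptions.

```java
import org.apache.flink.table.functions.AggregateFunction;

public class LatestLocation extends AggregateFunction<String, LatestLocation.Acc> {

    /** Accumulator: the newest timestamp seen so far and its location. */
    public static class Acc {
        public long latestTs = Long.MIN_VALUE;
        public String location = null;        // e.g. "lat,lng"
    }

    @Override
    public Acc createAccumulator() {
        return new Acc();
    }

    // Called once per row in the window; newer data simply replaces the old value.
    public void accumulate(Acc acc, String location, long eventTs) {
        if (eventTs >= acc.latestTs) {
            acc.latestTs = eventTs;
            acc.location = location;
        }
    }

    @Override
    public String getValue(Acc acc) {
        return acc.location;
    }
}
```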

4. Thoughts and Outlook on Future Work

Finally, I would like to share some thoughts and plans about our future work; they are not yet fully mature and are offered for discussion.

4.1. End-to-end Real-time Stream Processing

What is end-to-end? One end is the collected raw data, the other end is the presentation and application of the data as reports, labels and interfaces; the two ends are connected by the real-time stream in the middle. At present we process the real-time stream with SQL: the source tables are Kafka and the target tables are also Kafka, and the data is imported into Druid/ES/HBase only after passing through a unified Kafka layer. The purpose of this design is to improve the stability and availability of the whole pipeline: first, Kafka acts as a buffer for the downstream systems, so that an abnormality in a downstream system does not affect the computation of the real-time stream (the probability of one system being stable is higher than that of multiple systems being stable at the same time); second, the Exactly-once semantics of Kafka-to-Kafka real-time streams are relatively mature, which guarantees consistency.

The end-to-end flow is then actually accomplished in three separate steps, each of which may be handled by a different role: data processing needs data developers, data import needs engine developers, and turning the data into products needs product developers.

Could our platform automate the end-to-end processing, import and productization with a single SQL submission? In that case, data developers would no longer see a Kafka Table, but a scenario-oriented report table, label table or interface table. For example, when a report table is created, the platform would automatically import the real-time stream results from Kafka into Druid, then automatically register the Druid data source in the report system, and could even automatically generate a report template from the specified dimensions, metrics and other fields.

4.2. Real-time Lineage Analysis

Regarding lineage analysis, anyone who has built an offline data warehouse knows its importance: it plays an indispensable role in data governance. The same is true for the real-time data warehouse. We hope to build end-to-end lineage, from the ingestion channels of the collection system, through the real-time tables and real-time jobs that the data flows through in the middle, to the products that consume the data, all of it clearly displayed. Based on lineage analysis we can evaluate the value of the data and account for its cost.

4.3. Integration of the Offline and Real-time Data Warehouses

Finally, one direction is the integration of the offline and real-time data warehouses. We believe that in the short term the real-time warehouse cannot replace the offline warehouse, and the coexistence of the two will be the new normal. The tool system we accumulated in the offline warehouse era must be adapted to the real-time warehouse, and integrated management of the offline and real-time warehouses must be realized. In theory their data sources are identical and their upper-layer abstractions are both Table and SQL, but in essence they also differ, for example in time granularity and computing mode. What modifications the data tools and products need in order to achieve complete integration is also something we are exploring and thinking about.


