Official account of dbaplus community
Author: Wang Kang, senior development engineer, Vipshop data platform
Since 2017, Vipshop has been building a high-performance, stable, reliable and easy-to-use real-time computing platform on Kubernetes, to keep internal business running smoothly both on ordinary days and during major sales promotions. The platform currently supports mainstream frameworks such as Flink, Spark and Storm.
This article shares Vipshop's practical experience with Flink and its productization in five parts:
- Development overview
- Flink containerization practice
- Construction of Flink SQL platform
- Application cases
- Future planning
1、 Development overview
1. Cluster size
In terms of cluster scale, we have 2,000+ physical machines, mainly running Kubernetes clusters in an active-active setup across two geographically separate data centers. We use Kubernetes namespaces, labels and taints to achieve business isolation and a preliminary isolation of computing load.
There are more than 1,000 online real-time applications, including Flink tasks, Flink SQL tasks, Storm tasks and Spark tasks. The platform currently focuses on Flink SQL: SQL-based development is the trend, so the platform needs to support bringing SQL tasks online.
2. Platform architecture
We walk through the overall architecture of the real-time computing platform from the bottom up:
- Resource scheduling layer (bottom layer)
Tasks actually run on Kubernetes as Deployments. Although the platform also supports YARN scheduling, YARN shares resources with batch tasks, so mainstream tasks still run on Kubernetes. The YARN scheduling layer is mainly a separate set of YARN clusters deployed for offline workloads. In 2017 we developed our own Flink on Kubernetes solution. Because the underlying scheduling is split into these two layers, the real-time and offline sides can borrow resources from each other when resources are tight.
- Storage layer
The message bus consists of VMS (the company's internal Kafka-based real-time data system), binlog-based VDP data, and native Kafka. State is stored on HDFS, and data is mainly stored in Redis, MySQL, HBase, Kudu, HDFS, ClickHouse, etc.
- Computing engine layer
It mainly includes Flink, Storm and Spark, with Flink being the engine we currently promote. Each framework supports images of several versions to meet different business needs.
- Real time platform layer
It mainly provides job configuration, scheduling, version management, container monitoring, job monitoring, alerting, logging and other functions. It also provides multi-tenant resource management (quota and label management) and Kafka monitoring. Resource allocation distinguishes between major promotion days and ordinary days: the resources available differ, and so does the access control on those resources. Before Flink 1.11, the platform's self-built metadata management system managed the schemas for Flink SQL; since 1.11, the platform has been integrated with the company's metadata management system through Hive metadata.
- Application layer
It mainly supports scenarios such as real-time dashboards, recommendation, the experiment platform, real-time monitoring and real-time data cleaning.
2、 Flink containerization practice
1. Containerization scheme
The architecture diagram of Flink containerization on the real-time platform is shown above. The containerized Flink deployment is based on standalone mode.
Our deployment has three roles: client, JobManager and TaskManager. Each role is controlled by a Deployment.
Users upload the task jar package and configuration through the platform, and these are stored on HDFS. The configuration and dependencies maintained by the platform are also stored on HDFS. When a pod starts, it performs initialization steps such as pulling these files.
The main process in the client is an agent written in Go. When the client starts, it first checks the status of the cluster. Once the cluster is ready, it pulls the jar package from HDFS and submits the task to the cluster. The client is mainly responsible for fault tolerance: it monitors the task status, triggers savepoints, and performs similar operations.
A smart-agent deployed on each physical machine collects container metrics and writes them to M3, while Flink metrics are written to Prometheus through the interface that Flink exposes; both are displayed with Grafana. Similarly, a vfilebeat deployed on each physical machine collects the mounted logs and writes them to ES, so that logs can be searched in Dragonfly.
1) Flink platformization
In practice, platform work has to be designed around concrete scenarios and ease of use.
2) Flink stability
Exceptions are inevitable while applications are being deployed and run. The platform therefore needs strategies that keep tasks stable after an exception occurs.
- The health and availability of Pod:
Pod health is checked with livenessProbe and readinessProbe, and a restart policy is specified for the pod, so Kubernetes itself can bring a failed pod back up.
- When a Flink task throws an exception:
Flink's own restart strategy and failover mechanism form the first layer of protection.
The client periodically monitors the Flink job status, updates the latest checkpoint address in its own cache and reports it to the platform, which persists it to MySQL. When Flink can no longer restart the job, the client resubmits the task from the latest successful checkpoint. This is the second layer of protection.
Because this layer persists checkpoints to MySQL, Flink's HA mechanism is no longer used, which removes the dependency on the ZooKeeper component.
When the first two layers fail to restart the task, or the cluster itself is abnormal, the platform automatically brings up a new cluster and submits the task from the latest checkpoint persisted in MySQL. This is the third layer of protection.
- Data center disaster recovery:
Users' jar packages and checkpoints are stored on two HDFS clusters in different locations.
There are two clusters in two data centers in different locations.
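The three layers of protection described above can be read as an ordered fallback chain. The following is a minimal sketch of that decision logic only, not the platform's actual agent (which is written in Go). The callables `flink_restart`, `client_resubmit` and `platform_rebuild` are hypothetical stand-ins for the real operations, and the checkpoint path is made up.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RecoveryResult:
    layer: str                  # which layer handled the failure
    checkpoint: Optional[str]   # checkpoint used for resubmission, if any


def recover(flink_restart: Callable[[], bool],
            client_resubmit: Callable[[str], bool],
            platform_rebuild: Callable[[str], bool],
            latest_checkpoint: Optional[str]) -> RecoveryResult:
    """Try each protection layer in order and report which one succeeded."""
    # Layer 1: Flink's own restart strategy and failover mechanism.
    if flink_restart():
        return RecoveryResult("flink", None)
    # Layers 2 and 3 both need a checkpoint address persisted earlier.
    if latest_checkpoint is None:
        return RecoveryResult("failed", None)
    # Layer 2: the client resubmits the job from the latest checkpoint.
    if client_resubmit(latest_checkpoint):
        return RecoveryResult("client", latest_checkpoint)
    # Layer 3: the platform brings up a new cluster and submits the job.
    if platform_rebuild(latest_checkpoint):
        return RecoveryResult("platform", latest_checkpoint)
    return RecoveryResult("failed", latest_checkpoint)


# Example: Flink and the client both fail, the platform layer succeeds.
result = recover(lambda: False, lambda cp: False, lambda cp: True,
                 "hdfs:///checkpoints/job-1/chk-42")
print(result.layer)
```

Keeping the checkpoint address outside the Flink cluster is what lets layers 2 and 3 work even when the cluster itself is gone.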
2. Kafka monitoring scheme
Kafka monitoring is a very important part of task monitoring. The overall process is as follows:
The platform provides monitoring of Kafka backlog. Users configure their own Kafka monitoring in the interface, specifying which cluster is used, the consumer information, and other settings. The platform reads these user configurations from MySQL and then monitors Kafka through JMX. The collected information is written to a downstream Kafka topic, and another Flink task raises real-time alerts from it. The same data is also written to ClickHouse (CK) so that it can be fed back to users (Prometheus could also be used here, but CK is a better fit), and is finally shown to users with Grafana.
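As a rough illustration of the backlog check at the end of this pipeline, the sketch below computes consumer lag per partition as the log end offset minus the committed offset and compares the total against a threshold. It assumes the offsets have already been collected (the article collects them through JMX); the function names and the threshold are hypothetical.

```python
from typing import Dict, Tuple


def partition_lag(end_offsets: Dict[int, int],
                  committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log end offset - committed offset (never negative)."""
    return {p: max(0, end - committed.get(p, 0))
            for p, end in end_offsets.items()}


def check_backlog(end_offsets: Dict[int, int],
                  committed: Dict[int, int],
                  threshold: int) -> Tuple[int, bool]:
    """Return the total lag and whether it exceeds the alert threshold."""
    total = sum(partition_lag(end_offsets, committed).values())
    return total, total > threshold


# Two partitions: lag is 10 on partition 0 and 150 on partition 1.
total, alert = check_backlog({0: 100, 1: 250}, {0: 90, 1: 100}, threshold=100)
print(total, alert)
```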
3、 Construction of Flink SQL platform
With the Flink containerization scheme in place, we started building the Flink SQL platform. Developing against a streaming API has a real cost. Flink is certainly faster than Storm, and relatively more stable and easier to use, but for some users, especially some Java developers, there is still a learning threshold.
Containerizing Flink on Kubernetes makes it easy to release applications written against the Flink API, but it is still not convenient enough for Flink SQL tasks. The platform therefore provides a more convenient development environment with online editing and publishing, SQL management and related features.
1. Flink SQL solution
The Flink SQL solution of the platform is shown in the figure above. The task publishing system and the metadata management system are completely decoupled.
1) Flink SQL task publishing platform
In practice, platform work has to take ease of use into account. The main operation interface is shown in the figure below and provides:
- Version management, syntax validation and topology management for Flink SQL;
- Management of general-purpose and task-level UDFs, with support for user-defined UDFs;
- A parameterized configuration interface that makes it easy for users to bring tasks online.
The figure below is an example of user interface configuration
The following figure is an example of cluster configuration
2) Metadata management
Before 1.11, the platform built its own metadata management system, UDM, which stored the schemas of Kafka, Redis and other sources in MySQL and connected Flink to UDM through a custom catalog to achieve metadata management.
After 1.11, as Flink's Hive integration gradually matured, the platform refactored its Flink SQL framework. A SQL gateway service was deployed, which calls a SQL client jar package that we maintain ourselves. This connects Flink SQL with the offline metadata, unifies real-time and offline metadata, and lays a good foundation for stream-batch integration.
The operation interface for creating a Flink table in the metadata management system is shown in the following figure: the metadata of the Flink table is created and persisted to Hive, and when a Flink SQL task starts, it reads the schema of the corresponding table from Hive.
2. Practice of Flink SQL
The platform integrates officially supported connectors and develops connectors that are not officially supported. It also decouples the image from connector, format and other related dependencies, so that these can be updated and iterated quickly.
1) Practice of Flink SQL
Flink SQL is mainly divided into the following three layers:
- Connector layer
Support the VDP connector for reading source data;
Support Redis as a sink and for dimension table joins, with string, hash and other data types;
Support the Kudu connector, catalog and dimension table joins;
Support the protobuf format for parsing real-time cleaned data;
Support the VMS connector for reading source data;
Support the ClickHouse connector as a sink, writing to distributed tables and local tables;
The Hive connector supports a watermark-based partition commit policy, as well as array, decimal and other complex data types.
- Runtime layer
Mainly supports modifying the execution plan of the topology;
Keyby-optimized cache for dimension table joins to improve query performance;
Delayed join for dimension table joins.
- Platform layer
Support JSON- and HLL-related processing functions;
Support setting Flink runtime parameters, such as minibatch and aggregation optimization parameters;
Upgrade Flink to Hadoop 3.
2) Topology execution plan modification
The parallelism of the stream graph generated from SQL cannot normally be modified. To address this, the platform provides an editable topology preview in which the relevant parameters can be changed. The platform gives users the parsed Flink SQL execution plan as JSON, uses uid to keep each operator unique, and lets users modify the parallelism and chaining strategy of each operator, which also gives them a way to resolve backpressure. For example, the ClickHouse sink works best with low concurrency and large batches, so we support changing the parallelism of the ClickHouse sink alone: with source parallelism = 72 and sink parallelism = 24, the TPS of the ClickHouse sink improves.
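A minimal sketch of this kind of plan editing is shown below. The JSON layout (a `nodes` list with `uid` and `parallelism` fields) is a hypothetical simplification for illustration, not Flink's actual execution plan format, and `apply_parallelism` is an invented helper name.

```python
import json
from typing import Dict


def apply_parallelism(plan_json: str, overrides: Dict[str, int]) -> str:
    """Return a copy of the plan with per-operator parallelism overridden by uid."""
    plan = json.loads(plan_json)
    known = {node["uid"] for node in plan["nodes"]}
    unknown = set(overrides) - known
    if unknown:
        # Reject typos instead of silently ignoring them.
        raise KeyError(f"unknown operator uid(s): {sorted(unknown)}")
    for node in plan["nodes"]:
        if node["uid"] in overrides:
            node["parallelism"] = overrides[node["uid"]]
    return json.dumps(plan)


# Hypothetical plan: source and sink both start at parallelism 72.
plan = json.dumps({"nodes": [
    {"uid": "kafka-source", "parallelism": 72},
    {"uid": "clickhouse-sink", "parallelism": 72},
]})
edited = json.loads(apply_parallelism(plan, {"clickhouse-sink": 24}))
print([(n["uid"], n["parallelism"]) for n in edited["nodes"]])
```

Addressing operators by a stable uid rather than by position is what keeps the overrides valid when the plan is regenerated.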
3) Keyby-optimized cache for dimension table joins
In order to reduce the number of IO requests, reduce the reading pressure of dimension table database, reduce the delay and improve the throughput, there are three measures
The following is a graph of the dimension table associated with the keyby optimized cache:
Before optimization, the dimension table LookupJoin operator is chained together with the normal operators. After optimization, the LookupJoin operator is no longer chained with the normal operators, and the join key is used as the key of the hash strategy.
With this optimization, take a dimension table of 30 million rows and 10 TM nodes as an example. Before, each node had to cache all 30 million rows, for a total of 300 million cached rows. After the keyby optimization, each TM node only needs to cache 30 million / 10 = 3 million rows, and the total amount of cached data is only 30 million rows, which greatly reduces the cache size.
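The arithmetic above can be checked with a small simulation, scaled down to 3,000 keys and 10 nodes. This is only a sketch: without keyby, records are assumed to be spread round-robin across nodes, and with keyby, `key % num_nodes` stands in for hashing the join key.

```python
from typing import List, Set


def cached_rows_per_node(stream: List[int], num_nodes: int, keyed: bool) -> List[int]:
    """Count the distinct dimension keys each node ends up caching."""
    caches: List[Set[int]] = [set() for _ in range(num_nodes)]
    for position, key in enumerate(stream):
        # keyed: the same join key always goes to the same node;
        # not keyed: records are spread round-robin in arrival order.
        node = key % num_nodes if keyed else position % num_nodes
        caches[node].add(key)
    return [len(cache) for cache in caches]


NUM_KEYS, NUM_NODES = 3000, 10
# Every key shows up NUM_NODES times in the main stream.
stream = [key for key in range(NUM_KEYS) for _ in range(NUM_NODES)]

before = cached_rows_per_node(stream, NUM_NODES, keyed=False)
after = cached_rows_per_node(stream, NUM_NODES, keyed=True)
print(sum(before), sum(after))
```

In this scaled-down run every node caches all 3,000 keys without keyby, and only 300 keys with it, the same 10x reduction as in the 30 million row example.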
4) Delayed join for dimension table joins
In many business scenarios involving dimension table joins, the main stream record arrives and is joined before the corresponding new row has been added to the dimension table, so the join finds no match. To keep the data correct, we cache the records that could not be joined and join them later.
The simplest approach is to set a retry count and a retry interval in the dimension table join function. This increases the latency of the whole stream, but it solves the problem when the QPS of the main stream is not high.
The other approach is to cache a record when its dimension table lookup finds no match, and then perform the delayed join according to the retry count and retry interval.
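The cache-and-retry behaviour can be sketched as follows. This is an illustration under stated assumptions, not the platform's operator: `DelayedJoiner`, `on_record` and `on_timer` are hypothetical names, time is passed in explicitly, and emitting the record unjoined once retries are exhausted is an assumption, since the article does not say what happens in that case.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple


class DelayedJoiner:
    def __init__(self, lookup: Callable[[str], Optional[dict]],
                 max_retries: int, retry_interval: float) -> None:
        self.lookup = lookup
        self.max_retries = max_retries
        self.retry_interval = retry_interval
        # Pending records as (due_time, attempts_so_far, record), oldest first.
        self.pending: Deque[Tuple[float, int, dict]] = deque()

    def on_record(self, record: dict, now: float) -> List[dict]:
        """Join immediately if possible, otherwise cache the record for later."""
        dim = self.lookup(record["key"])
        if dim is not None:
            return [{**record, **dim}]
        self.pending.append((now + self.retry_interval, 1, record))
        return []

    def on_timer(self, now: float) -> List[dict]:
        """Retry every cached record whose retry time has arrived."""
        out: List[dict] = []
        while self.pending and self.pending[0][0] <= now:
            _, attempts, record = self.pending.popleft()
            dim = self.lookup(record["key"])
            if dim is not None:
                out.append({**record, **dim})
            elif attempts < self.max_retries:
                self.pending.append((now + self.retry_interval, attempts + 1, record))
            else:
                out.append(record)  # assumption: give up and emit unjoined
        return out


dim_table = {}
joiner = DelayedJoiner(dim_table.get, max_retries=3, retry_interval=5.0)
joiner.on_record({"key": "u1", "amount": 10}, now=0.0)  # no match yet, cached
dim_table["u1"] = {"city": "GZ"}                         # dimension row arrives late
print(joiner.on_timer(now=5.0))
```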
4、 Application cases
1. Real time data warehouse
1) Real time data warehousing
Real-time data warehousing mainly consists of three flows:
- First-level Kafka data is cleaned in real time and written to second-level Kafka, mainly in protobuf format, and is then written to Hive 5-minute tables through Flink SQL. This supports subsequent near-real-time ETL and shortens the preparation time of ODS-layer data sources.
- Data from MySQL business databases is parsed by VDP into a binlog CDC message stream and then written to Hive 5-minute tables through Flink SQL. At the same time, partitions are committed through a custom partition commit policy and the partition status is reported to a service interface, after which offline scheduling takes place.
- Business systems generate business Kafka message streams through the VMS API. These are parsed by Flink SQL and written to Hive 5-minute tables. String, JSON, CSV and other message formats are supported.
Using Flink SQL for streaming data warehousing is very convenient, and version 1.12 already supports automatic merging of small files, which solves a very common pain point in big data.
We customize the partition commit policy. When a partition is ready, it calls the partition commit API of the real-time platform. When an offline job is scheduled, it checks through this API whether the partition is ready.
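The handshake between the streaming writer and the offline scheduler can be sketched as below. `PartitionStatusService` is a hypothetical in-memory stand-in for the platform's partition commit API, and the rule that a 5-minute partition is ready once the watermark passes its end time is an assumption made for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Set, Tuple


class PartitionStatusService:
    """Hypothetical in-memory stand-in for the platform's partition commit API."""

    def __init__(self) -> None:
        self._ready: Set[Tuple[str, datetime]] = set()

    def commit(self, table: str, partition_start: datetime) -> None:
        self._ready.add((table, partition_start))

    def is_ready(self, table: str, partition_start: datetime) -> bool:
        return (table, partition_start) in self._ready


def commit_ready_partitions(service: PartitionStatusService, table: str,
                            open_partitions: Iterable[datetime],
                            watermark: datetime,
                            size: timedelta = timedelta(minutes=5)) -> List[datetime]:
    """Commit every open partition whose time range the watermark has passed."""
    committed = []
    for start in sorted(open_partitions):
        if watermark >= start + size:
            service.commit(table, start)
            committed.append(start)
    return committed


service = PartitionStatusService()
p1, p2 = datetime(2021, 1, 1, 10, 0), datetime(2021, 1, 1, 10, 5)
commit_ready_partitions(service, "ods_orders_5min", [p1, p2],
                        watermark=datetime(2021, 1, 1, 10, 6))
# The offline scheduler polls readiness before starting its job.
print(service.is_ready("ods_orders_5min", p1), service.is_ready("ods_orders_5min", p2))
```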
After adopting the Flink SQL unified warehousing scheme, we can achieve the following results:
First, we solved the instability of the earlier Flume-based solution and enabled self-service warehousing, which greatly reduces the maintenance cost of warehousing tasks while ensuring stability.
Second, we improved the timeliness of the offline data warehouse, going from hourly granularity down to 5-minute granularity.
2) Real time index calculation
- The real-time application consumes the cleaned Kafka data, enriches it through Redis dimension tables, APIs and other means, then incrementally calculates UV through Flink windows and persists the results to HBase.
- The real-time application consumes the VDP message stream, enriches it through Redis dimension tables, APIs and other means, then calculates sales and other related metrics through Flink SQL and incrementally upserts them to Kudu, which makes batch queries by range partition convenient. The results are finally served to real-time dashboards through the data service.
In the past, metric calculation was usually done with Storm, which required custom development against its API. After adopting the Flink scheme, we obtained the following results:
Moving the calculation logic to Flink SQL eases problems such as fast-changing metric definitions and a slow cycle for bringing modifications online;
With Flink SQL, changes can be made and released quickly, which reduces maintenance cost.
3) Real time offline integrated ETL data integration
The specific process is shown in the figure below:
In recent versions of Flink SQL, the dimension table join capability has been continuously strengthened. It can join not only against dimension table data in databases in real time, but also against dimension table data in Hive and Kafka, which flexibly meets the requirements of different workloads and timeliness levels.
Based on Flink's powerful streaming ETL capability, we can do data ingestion and data transformation in the real-time layer, and then flow the detail-layer data back to the offline data warehouse.
We ported the HyperLogLog (hereinafter HLL) implementation used in Presto into a Spark UDAF, so that HLL objects are interoperable between Spark SQL and the Presto engine. For example, HLL objects generated in Spark SQL through the prepare function can be merged and queried not only in Spark SQL but also in Presto.
The specific process is as follows:
Example of UV approximate calculation:
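As a rough illustration of why HLL objects built by one job can be merged and queried by another, the sketch below implements a simplified HyperLogLog in Python. It is not the Presto or Spark implementation: the register count, the SHA-1 based hash and the names `sketch`, `merge` and `estimate` are all assumptions chosen for brevity.

```python
import hashlib
import math
from typing import Iterable, List

P = 12        # number of index bits
M = 1 << P    # number of registers (4096)


def _hash64(value: str) -> int:
    """Deterministic 64-bit hash taken from the first 8 bytes of SHA-1."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


def sketch(values: Iterable[str]) -> List[int]:
    """Build HLL registers: each keeps the largest rank seen for its bucket."""
    registers = [0] * M
    for value in values:
        h = _hash64(value)
        index = h >> (64 - P)                      # top P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)           # remaining 52 bits
        rank = (64 - P) - rest.bit_length() + 1    # leading zeros in rest, plus one
        registers[index] = max(registers[index], rank)
    return registers


def merge(a: List[int], b: List[int]) -> List[int]:
    """Merging two sketches is an element-wise maximum of their registers."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(registers: List[int]) -> float:
    """Estimate the distinct count, with the small-range linear counting correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


# Two partial UV sketches with overlapping users, built independently.
morning = sketch(f"user-{i}" for i in range(0, 6000))
evening = sketch(f"user-{i}" for i in range(4000, 10000))
combined = merge(morning, evening)
print(round(estimate(combined)))
```

Because the merge is an element-wise maximum, the combined sketch is identical to one built over all users at once, which is what makes partial UV results safe to combine across jobs and engines that share the same register layout.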
2. Experimental platform (Flink real time data into OLAP)
The Vipshop experiment platform is an integrated platform for analyzing the effect of A/B tests on massive data, with configurable multi-dimensional analysis and drill-down analysis. An experiment consists of a stream of traffic (such as user requests) and a modification applied to that traffic relative to a control group. For massive data queries, the experiment platform requires low latency, fast response and very large data scale (on the order of 10 billion records).
The overall data structure is as follows:
- Offline data is imported into ClickHouse through Waterdrop;
- Real-time data in Kafka is cleaned, parsed and flattened through Flink SQL, joined with product attributes through a Redis dimension table, and written to ClickHouse in real time through a distributed table. It is then queried ad hoc through the data service, which also exposes the external interface.
The business data flow is as follows:
Our experiment platform has a very important ES scenario. After launching an application scenario, we want to see its effect, including the exposures, clicks, add-to-cart actions and favorites generated by the launch. To do this, we need to write the details of every record, such as some of the streaming data, into CK, partitioned by scenario.
Through the Flink SQL Redis connector, we support Redis as a sink and as a source for dimension table joins, so we can easily read and write Redis and perform dimension table joins. A cache can be configured for the dimension table join to greatly improve throughput. The real-time data pipeline is implemented in Flink SQL. Finally, the large wide table is written to CK, sharded by murmurHash3_64 of a chosen field, so that data of the same user is stored in the same shard node group. The join between large CK tables then becomes a join between local tables, which reduces data shuffling and improves join query efficiency.
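The effect of sharding both tables by the same hashed field can be sketched as follows. To stay within the standard library, the example uses an MD5-based hash as a stand-in for murmurHash3_64; the table contents, the shard count and the name `shard_for` are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def shard_for(user_id: str, num_shards: int) -> int:
    """Pick a shard from a stable hash of the sharding key (MD5 as a stand-in)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(rows: List[dict], num_shards: int) -> Dict[int, List[dict]]:
    """Route every row to the shard chosen by its user_id."""
    shards: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        shards[shard_for(row["user_id"], num_shards)].append(row)
    return shards


NUM_SHARDS = 4
clicks = [{"user_id": f"u{i}", "clicks": i} for i in range(100)]
orders = [{"user_id": f"u{i}", "amount": i * 10} for i in range(0, 100, 3)]

click_shards = distribute(clicks, NUM_SHARDS)
order_shards = distribute(orders, NUM_SHARDS)

# Join each shard only against its local counterpart: no cross-shard lookups.
local_matches = sum(
    1
    for shard, shard_orders in order_shards.items()
    for order in shard_orders
    for click in click_shards.get(shard, [])
    if click["user_id"] == order["user_id"]
)
print(local_matches == len(orders))
```

Every order finds its matching click row on the same shard, which is the property that turns a distributed join into a set of local joins.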
5、 Future planning
1. Improve the usability of Flink SQL
Flink SQL is somewhat different for Hive users, since both Hive and Spark SQL are batch-processing scenarios.
As a result, debugging Flink SQL is still inconvenient in many ways, and there is a learning threshold for offline Hive users, for example manually configuring Kafka monitoring and tuning tasks through stress tests. How to lower this threshold as far as possible, so that users only need to understand SQL or the business, with Flink SQL concepts hidden from them and the workflow simplified, is a big challenge.
In the future, we are considering intelligent monitoring that tells users about problems in their current tasks. Users should not have to learn too much; we will automate as much as possible and give users optimization suggestions.
2. Implementation of data Lake CDC analysis scheme
On the one hand, we are adopting a data lake mainly to address our binlog real-time update scenario. At present, our VDP binlog message stream is written to the Hive ODS layer through Flink SQL to shorten the preparation time of ODS-layer data sources, but this produces a large number of duplicate messages that have to be deduplicated and merged. We will consider a Flink + data lake CDC warehousing scheme for incremental warehousing.
On the other hand, we hope to replace Kudu with the data lake. Some of our important businesses use Kudu, although it is not widely used. Operating and maintaining Kudu is much more complex than for a general database and it is less popular, while the Kafka message streams after order widening and the aggregation results require very strong real-time upsert capability. We have therefore started to investigate a CDC + data lake solution and to use its incremental upsert capability to replace the Kudu incremental upsert scenario.
Q1: Does the VDP connector read from the MySQL binlog? Is Canal the tool used?
A1: VDP is the company's binlog synchronization component. It parses the binlog and sends it to Kafka, and it is a secondary development based on Canal. We defined a CDC format, similar to the Canal CDC format, that can connect to the company's VDP Kafka data source. It is not open source at present; it is a binlog synchronization solution used within our company.
Q2: UV data is output to HBase and sales data is output to Kudu. What is the main reasoning behind writing to different data stores?
A2: Kudu is not as widely used as HBase. The TPS of real-time UV writes is relatively high, and HBase is better suited to single-record query scenarios: writes to HBase have high throughput and low latency, and small range queries also have low latency. Kudu has some OLAP features. It can store order details, speed up listing, and support OLAP analysis in combination with Spark and Presto.
Q3: How do you solve the data update problem in ClickHouse, for example updates to metric data?
A3: Updates in CK are applied through asynchronous merges, which only happen within the same shard, the same node and the same partition, so consistency is weak. CK is not recommended for metric update scenarios. If there is a strong need for updates in CK, you can try the AggregatingMergeTree approach, replacing updates with inserts and merging at the field level.
Q4: How do you ensure deduplication and consistency when writing binlog data?
A4: We have not written binlog data into CK yet. That approach does not look mature and is not recommended. The CDC + data lake solution can be used instead.
Q5: How do you monitor and resolve unbalanced writes across nodes? How do you check for data skew?
A5: You can use CK's system.parts local table to monitor the amount and size of data written to each partition of each table on each machine. Looking at the data by partition lets you pinpoint a specific partition of a specific table on a specific machine.
Q6: How do you do task monitoring or health checks on the real-time platform? How do tasks recover automatically after an error? Is YARN application mode used now? Does one YARN application correspond to multiple Flink jobs?
A6: On Flink 1.12+, the Prometheus reporter is supported and exposes Flink metrics such as operator watermarks and key checkpoint-related metrics (size, duration, number of failures and so on). These are collected and stored for task monitoring and alerting.
Flink's own restart strategy and failover mechanism serve as the first layer of protection.
The client periodically monitors the Flink job status, updates the latest checkpoint address in its own cache and reports it to the platform, which persists it to MySQL. When Flink can no longer restart the job, the client resubmits the task from the latest successful checkpoint. This is the second layer of protection. Because this layer persists checkpoints to MySQL, Flink's HA mechanism is no longer used, which removes the dependency on the ZooKeeper component.
When the first two layers fail to restart the task, or the cluster itself is abnormal, the platform automatically brings up a new cluster and submits the task from the latest checkpoint persisted in MySQL. This is the third layer of protection.
We support YARN per-job mode, but we mainly deploy standalone clusters in Flink on Kubernetes mode.
Q7: are all components on your big data platform containerized or mixed?
A7: At present, our real-time computing frameworks such as Flink, Spark, Storm and Presto are containerized. For details, please refer to section 1.2, Platform architecture, above.
Q8: Doesn't Kudu run on Kubernetes?
A8: Kudu does not run on Kubernetes; there is no mature solution for that. Kudu is also operated and maintained through Cloudera Manager, so there is no need to move it to Kubernetes.
Q9: is it OK to store the Flink real-time data warehouse dimension table in CK and then query CK?
A9: Yes, it is worth trying. Both fact table and dimension table data can be stored in CK, and the data can be hashed by a certain field (such as the user ID) to achieve the effect of a local join.
This article is the original content of Alibaba cloud and cannot be reproduced without permission.