Flink's practice at Vipshop

Time: 2021-06-18

Introduction: Flink's containerization practice and productization experience at Vipshop.

Since 2017, Vipshop has built a high-performance, stable, reliable and easy-to-use real-time computing platform on Kubernetes (k8s), supporting the smooth operation of Vipshop's internal business both in normal times and during major sales promotions. The platform currently supports mainstream frameworks such as Flink, Spark and Storm. This article mainly shares Flink's containerization and productization experience. The contents include:

  1. Development overview
  2. Flink containerization practice
  3. Construction of Flink SQL platform
  4. Application cases
  5. Future planning

GitHub address: https://github.com/apache/flink (everyone is welcome to give Flink a star).

1、 Development overview

The platform supports real-time computing applications from all departments within the company. The main businesses include real-time dashboards (large screens), recommendation, the experimentation platform, real-time monitoring and real-time data cleansing.

1.1 cluster scale

[Figure: cluster scale]

The platform runs two clusters in two geographically separate data centers, with more than 2,000 physical machine nodes. It uses Kubernetes namespaces, labels and taints to achieve business isolation and preliminary isolation of computing load. At present, there are about 1,000 real-time applications online. Recently, the platform has mainly been supporting the rollout of Flink SQL tasks.

1.2 platform architecture

[Figure: overall architecture of the real-time computing platform]

  • The figure above shows the overall architecture of Vipshop's real-time computing platform.
  • The bottom layer is the resource scheduling layer for compute task nodes; tasks actually run on Kubernetes as Deployments. The platform also supports YARN scheduling, but YARN shares resources with batch jobs, so most tasks still run on Kubernetes.
  • The storage layer supports the company's internal Kafka-based real-time data bus VMS, binlog-based VDP data, and native Kafka as the message bus. State is stored on HDFS, and result data is mainly stored in Redis, MySQL, HBase, Kudu, ClickHouse, etc.
  • In the compute engine layer, the platform supports containerized versions of mainstream frameworks such as Flink, Spark and Storm, and provides encapsulation and components for some of them. Each framework supports several image versions to meet different business requirements.
  • The platform layer provides job configuration, scheduling, version management, container monitoring, job monitoring, alerting, logging and other functions, as well as multi-tenant resource management (quotas, label management) and Kafka monitoring. Before Flink 1.11, the platform managed Flink SQL schemas with its self-built metadata management system; since 1.11, it has integrated with the company's metadata management system through the Hive Metastore.

The top layer is the application layer for each business.

2、 Flink containerization practice

2.1 containerization practice

[Figure: containerized Flink architecture]

The figure above shows the containerized Flink architecture of the real-time platform. Flink containerization is deployed based on the standalone mode.

  • The deployment has three roles: client, JobManager and TaskManager, each controlled by its own Deployment.
  • Users upload task jar packages, configurations, etc. through the platform, and they are stored on HDFS. The configurations and dependencies maintained by the platform are also stored on HDFS. When a pod starts, it performs initialization operations such as pulling these files.
  • The main process in the client is an agent developed in Go. When the client starts, it first checks the cluster status; once the cluster is ready, it pulls the jar package from HDFS and submits the task to the Flink cluster. The client's main ongoing job is to monitor the task status and to perform operations such as taking savepoints.
  • A smart-agent deployed on each physical machine collects container metrics and writes them to M3; metrics are also written to Prometheus through the interfaces exposed by Flink and visualized with Grafana. Similarly, vfilebeat deployed on each physical machine collects the mounted logs and writes them to ES, so that log retrieval can be done in Dragonfly.

■ Flink platform

In practice, combining concrete scenarios with ease-of-use considerations, the following platformization work was done.

  • Task configuration on the platform is decoupled from the image, the Flink configuration and custom components. The platform currently supports Flink 1.7, 1.9, 1.11 and 1.12.
  • The platform supports building the jar through a pipeline or uploading it directly, as well as job configuration, alert configuration, lifecycle management and so on, reducing users' development cost.
  • The platform provides container-level, page-based tuning and diagnosis functions such as flame graphs, as well as the ability to log in to the container, to support users' job diagnosis.

■ Flink stability

Exceptions are inevitable during application deployment and operation. The following are the platform's strategies for keeping tasks stable after abnormal conditions occur.

  • Pod health and availability are detected with livenessProbe and readinessProbe, and a pod restart policy is specified.
  • When a Flink task fails:

    1. Flink's native restart strategy and failover mechanism serve as the first layer of protection.
    2. The client periodically monitors the Flink task status, updates the latest checkpoint address into its own cache, reports it to the platform and persists it to MySQL. When Flink can no longer restart on its own, the client re-submits the task from the latest successful checkpoint, which is the second layer of protection. Once checkpoints are persisted to MySQL at this layer, Flink's HA mechanism is no longer used, reducing the dependency on the ZooKeeper component.
    3. When the first two layers fail to restart the task or the cluster itself is abnormal, the platform automatically pulls up a new cluster and submits the task from the latest checkpoint persisted in MySQL, as the third layer of protection.

  • Data center disaster recovery:

    • Users' jar packages and checkpoints are stored on dual HDFS in remote data centers;
    • Two clusters run in two geographically separate data centers.

2.2 Kafka monitoring scheme

Kafka monitoring is a relatively important part of our task monitoring. The overall monitoring process is as follows.

[Figure: Kafka monitoring flow]

The platform lets users configure monitoring of Kafka backlog, consumed messages and other information. After the user's Kafka monitoring configuration is extracted from MySQL, Kafka is monitored through JMX and the metrics are written to a downstream Kafka topic; another Flink task then monitors them in real time and at the same time writes the data to ClickHouse for display to users. A minimal sketch of that downstream job is shown below.
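
The sketch below illustrates the downstream monitoring job in Flink SQL, assuming the JMX collector emits per-partition lag records as JSON into a Kafka topic. The topic, field names and the ClickHouse sink (which stands in for the platform's internal connector) are hypothetical, not the actual production definitions.

```sql
-- Hypothetical source: lag metrics emitted by the JMX collector into Kafka.
CREATE TABLE kafka_lag_metrics (
  topic          STRING,
  consumer_group STRING,
  lag            BIGINT,
  ts             TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '10' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'monitor_kafka_lag',                   -- hypothetical topic
  'properties.bootstrap.servers' = 'broker:9092',  -- hypothetical address
  'properties.group.id' = 'lag_monitor',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

-- Hypothetical sink shown to users; 'clickhouse' stands for the platform's
-- internal ClickHouse connector, not an official Flink connector.
CREATE TABLE ck_lag_report (
  window_start   TIMESTAMP(3),
  topic          STRING,
  consumer_group STRING,
  max_lag        BIGINT
) WITH (
  'connector' = 'clickhouse',               -- internal connector, assumed
  'url' = 'clickhouse://ck-host:8123',      -- hypothetical
  'table-name' = 'rt_monitor.kafka_lag'     -- hypothetical
);

-- Roll the per-partition lag up into 1-minute windows per topic and group.
INSERT INTO ck_lag_report
SELECT TUMBLE_START(ts, INTERVAL '1' MINUTE), topic, consumer_group, MAX(lag)
FROM kafka_lag_metrics
GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE), topic, consumer_group;
```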

3、 Construction of Flink SQL platform

The Kubernetes-based containerization of Flink makes it convenient to release Flink API applications, but it is still not convenient enough for Flink SQL tasks. So the platform provides a development platform that is more convenient to use, with online editing and publishing, SQL management and other capabilities.

3.1 Flink SQL solution

[Figure: Flink SQL platform solution]

The Flink SQL solution of the platform is shown in the figure above, and the task publishing system is completely decoupled from the metadata management system.

■ Flink SQL task publishing platform

In practice, with ease of use in mind, the following platformization work was done. The main operation interfaces are shown below:

  • Flink SQL version management, syntax validation, topology management, etc.;
  • Management of both shared and task-level UDFs, with support for user-defined UDFs (see the registration sketch after the screenshots below);
  • A parameterized configuration interface that makes it easy for users to bring jobs online.

[Figures: Flink SQL job management and configuration interfaces]
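
As a sketch of what task-level UDF registration looks like in Flink SQL: the statement below registers a user-defined scalar function from a jar class. The class, function and table names are illustrative; on the platform this step is driven by the UDF management UI rather than typed by hand.

```sql
-- Register a task-level user-defined scalar function (class name is illustrative).
CREATE TEMPORARY FUNCTION parse_device AS 'com.vip.rt.udf.ParseDevice';

-- It can then be used like any built-in function, e.g.
-- SELECT mid, parse_device(user_agent) AS device_type FROM some_source_table;
```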

■ Metadata management

Before 1.11, the platform built its own metadata management system, UDM, with MySQL storing the schemas of Kafka, Redis and other sources, and connected Flink to UDM through a custom catalog to realize metadata management. After 1.11, Flink's Hive integration gradually matured, and the platform refactored its Flink SQL framework: by deploying a SQL gateway service and calling a self-maintained SQL client jar, the platform connected to the offline metadata, unified real-time and offline metadata, and laid the groundwork for stream-batch integration. The workflow for creating a Flink table in the metadata management system is as follows: the Flink table metadata is created and persisted to Hive, and when a Flink SQL job starts, it reads the table schema of the corresponding table from Hive.

[Figure: creating a Flink table in the metadata management system]
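
The flow described above can be sketched in plain Flink SQL as follows: register a Hive catalog so that table definitions persist in the Hive Metastore, then create a Kafka-backed table under it. Catalog, database, topic and field names are illustrative; on the platform this is done through the metadata management UI and the SQL gateway rather than hand-written DDL.

```sql
-- Register a Hive catalog; tables created below are persisted to the Hive Metastore.
CREATE CATALOG hive_catalog WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/etc/hive/conf'   -- path is illustrative
);
USE CATALOG hive_catalog;
CREATE DATABASE IF NOT EXISTS rt_ods;

-- A Kafka-backed Flink table; its schema is read back from Hive when a Flink SQL job starts.
CREATE TABLE rt_ods.user_click (
  mid        STRING,
  goods_id   BIGINT,
  page_id    STRING,
  user_agent STRING,
  event_time TIMESTAMP(3),
  proc_time  AS PROCTIME(),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'user_click',                          -- illustrative topic
  'properties.bootstrap.servers' = 'broker:9092',  -- illustrative address
  'properties.group.id' = 'rt_demo',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);
```

The later SQL sketches in this article reuse this illustrative rt_ods.user_click table as their source.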

3.2 Flink SQL related practices

The platform integrates and develops connectors whether or not they are officially supported, decouples the image from connector, format and other related dependencies, and can update and iterate quickly.

■ Flink SQL practice

[Figure: Flink SQL practice overview]

  • At present, the platform supports the officially provided connectors and has built internal connectors for Redis, Kudu, ClickHouse, VMS, VDP and other systems. It has also built an internal PB format to read protobuf-encoded real-time cleansed data, and internal catalogs such as Kudu and VDP that allow reading the related schemas directly without writing DDL.
  • The platform layer mainly covers UDFs, tuning of common job parameters, and the upgrade to Hadoop 3.
  • The runtime layer mainly supports modifying the execution plan of the topology graph and the keyBy cache optimization for dimension table joins.

■ Topology execution plan modification

To address the problem that the parallelism of the stream graph generated from SQL cannot be modified, the platform provides a modifiable topology preview for adjusting the relevant parameters. It gives users the parsed Flink SQL execution-plan JSON, uses uid to guarantee the uniqueness of operators, and lets users modify the parallelism and chaining strategy of each operator; it also gives users a way to relieve backpressure. For example, ClickHouse sinks work best with small parallelism and large batches, so we support setting the ClickHouse sink parallelism separately, e.g. source parallelism = 72 and sink parallelism = 24, to improve the ClickHouse sink TPS.

[Figure: modifiable topology execution plan preview]

■ Dimension table join keyBy cache optimization

To reduce the number of I/O requests, lower the read pressure on the dimension table database, reduce latency and improve throughput, the following measures are adopted:

  • When the dimension table is small, the full dimension table is cached locally and a TTL controls cache refresh; this greatly reduces the number of I/O requests but requires more memory.
  • When the dimension table is large, async lookups together with an LRU cache, with TTL and size controlling the expiration time and cache size, improve throughput and reduce the read pressure on the database (see the DDL sketch after this list).
  • When the dimension table is large and the main stream's QPS is very high, the key of the dimension table join can be used as the hash condition to partition the data, i.e. the partition strategy on the calc node is set to hash, so that each subtask of the downstream operator caches an independent slice of the dimension table data; this both improves the cache hit rate and reduces memory usage.
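
A minimal sketch of a cached dimension table lookup join, using the open-source JDBC connector's cache options as a stand-in for the platform's internal connectors; table names and connection details are illustrative.

```sql
-- Dimension table in MySQL with an LRU lookup cache controlled by TTL and max size.
CREATE TABLE dim_goods (
  goods_id BIGINT,
  brand_id BIGINT,
  PRIMARY KEY (goods_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://mysql-host:3306/dim',   -- illustrative
  'table-name' = 'goods',
  'username' = 'rt_user',
  'password' = '******',
  'lookup.cache.max-rows' = '100000',   -- cache size
  'lookup.cache.ttl' = '10min',         -- cache expiration
  'lookup.max-retries' = '3'
);

-- Lookup join driven by the main stream; with the keyBy optimization the join key
-- (goods_id) is also used as the hash key so each subtask caches a disjoint slice.
SELECT o.mid, o.goods_id, d.brand_id
FROM rt_ods.user_click AS o
JOIN dim_goods FOR SYSTEM_TIME AS OF o.proc_time AS d
  ON o.goods_id = d.goods_id;
```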

[Figure]

Before optimization, the dimension table lookup join operator and the normal operators are chained together.

[Figure]

After optimization, the dimension table lookup join operator is separated from the normal operators, and the join key is used as the key of the hash strategy. For example, for a dimension table with 30 million rows and 10 TM nodes, each node previously had to cache all 30 million rows, i.e. 30 million × 10 = 300 million rows in total. After the keyBy optimization, each TM node only needs to cache 30 million / 10 = 3 million rows, and the total amount of cached data is only 30 million rows, which greatly reduces the cache volume.

■ Dimension table association delayed join

In dimension table joins there are many business scenarios where the main-stream data is joined before the corresponding new data has been added to the dimension table, which causes the join to miss. Therefore, to guarantee data correctness, we cache the records that failed to join and retry the join later (a delayed join).

The simplest way is to set a retry count and retry interval in the dimension table lookup function. This method increases the latency of the whole stream, but it solves the problem when the main stream's QPS is not high.

When a record fails to join the dimension table, it is cached first, and the delayed join is performed according to the configured retry count and retry interval, roughly as sketched below.
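
The retry knobs can be thought of as connector options on the dimension table, as in the sketch below. Note that the connector and option names here are hypothetical, used only to illustrate the retry count / retry interval idea for the platform's internal connector; they are not options of any official Flink connector.

```sql
-- Hypothetical internal Redis dimension table with delayed-join retry options.
CREATE TABLE dim_orders (
  order_sn   STRING,
  order_attr STRING,
  PRIMARY KEY (order_sn) NOT ENFORCED
) WITH (
  'connector' = 'vip-redis',        -- internal connector, assumed
  'lookup.retry.times' = '3',       -- hypothetical: retries for a missed key
  'lookup.retry.interval' = '5s',   -- hypothetical: wait between retries
  'lookup.cache.ttl' = '10min'      -- hypothetical: cache expiration
);
```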

4、 Application cases

4.1 real time data warehouse

Real-time data warehousing

[Figure: real-time data warehousing pipeline]

  • After real-time cleansing, the first-level Kafka traffic data is written into a second-level cleansed Kafka topic, mostly in protobuf format, and then written into a 5-minute Hive table through Flink SQL, so as to speed up the preparation of the ODS-layer data source for subsequent quasi-real-time ETL.
  • The data in the MySQL business databases is parsed by VDP into a binlog CDC message stream and then written into the 5-minute Hive table through Flink SQL.
  • Business systems produce business Kafka message streams through the VMS API, which are parsed by Flink SQL and written into the 5-minute Hive table; string, JSON, CSV and other message formats are supported.
  • Flink SQL makes streaming data warehousing very convenient, and version 1.12 already supports automatic merging of small files, solving that pain point.
  • We customize the partition commit policy: when a partition is ready, the real-time platform's partition commit API is called, and offline scheduling checks through the same API whether the partition is ready (a hedged DDL sketch follows this list).
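
A minimal sketch of the 5-minute Hive table and the streaming insert, using the Hive dialect and the Hive sink's partition-commit options. Database, table and class names are illustrative, and the custom commit policy class merely stands in for the platform's own implementation.

```sql
-- Hive-dialect DDL for the 5-minute partitioned ODS table
-- (assumes the 'ods' database exists in the Hive catalog registered earlier).
SET table.sql-dialect=hive;
CREATE TABLE ods.traffic_5min (
  mid     STRING,
  page_id STRING,
  ts      TIMESTAMP
) PARTITIONED BY (dt STRING, hm STRING) STORED AS ORC TBLPROPERTIES (
  'partition.time-extractor.timestamp-pattern' = '$dt $hm:00',
  'sink.partition-commit.trigger' = 'partition-time',
  'sink.partition-commit.delay' = '5 min',
  -- commit to the metastore, write a _SUCCESS file, and run a custom policy that
  -- calls the platform's partition-commit API (class name is illustrative)
  'sink.partition-commit.policy.kind' = 'metastore,success-file,custom',
  'sink.partition-commit.policy.class' = 'com.vip.rt.hive.PlatformCommitPolicy'
);

-- Streaming insert from the cleansed Kafka table (in production, hm would be
-- floored to the 5-minute boundary rather than taken per minute).
SET table.sql-dialect=default;
INSERT INTO ods.traffic_5min
SELECT mid, page_id, event_time,
       DATE_FORMAT(event_time, 'yyyy-MM-dd') AS dt,
       DATE_FORMAT(event_time, 'HH:mm')      AS hm
FROM rt_ods.user_click;
```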

After adopting the unified Flink SQL warehousing scheme, we get the following benefits: it solves the instability of the previous Flume-based scheme, and users can set up warehousing themselves, which greatly reduces the maintenance cost of warehousing tasks. It also improves the timeliness of the offline data warehouse, reducing the granularity from the hour level to the 5-minute level.

Real-time metric computation

[Figure: real-time metric computation pipeline]

  • The real-time application consumes the cleansed Kafka data, joins it through Redis dimension tables, APIs and other methods, then incrementally computes UV with Flink windows (see the SQL sketch after this list) and persists the results to HBase.
  • The real-time application consumes the VDP message stream, joins it through Redis dimension tables, APIs and other methods, then computes sales and other related metrics with Flink SQL and incrementally upserts them into Kudu, which is convenient for batch queries by range partition; a data service finally serves the results to the real-time dashboards.
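
A minimal sketch of the windowed UV computation; the table and field names are illustrative rather than the production definitions, and the open-source 'hbase-2.2' connector stands in for the actual sink.

```sql
-- Illustrative HBase result table for UV per page and minute.
CREATE TABLE hbase_uv (
  rowkey STRING,
  f ROW<uv BIGINT>,
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-2.2',
  'table-name' = 'rt:page_uv',           -- illustrative
  'zookeeper.quorum' = 'zk-host:2181'    -- illustrative
);

-- UV per 1-minute tumbling window, keyed by page, persisted to HBase.
INSERT INTO hbase_uv
SELECT
  CONCAT(page_id, '_',
         DATE_FORMAT(TUMBLE_START(event_time, INTERVAL '1' MINUTE), 'yyyyMMddHHmm')) AS rowkey,
  ROW(COUNT(DISTINCT mid)) AS f
FROM rt_ods.user_click
GROUP BY page_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
```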

In the past, metric computation was usually done with Storm and required custom API development. After adopting the Flink scheme, we get the following benefits: moving the computation logic into Flink SQL mitigates the problems of frequently changing metric definitions and slow release cycles for modifications; with Flink SQL, changes can be made and brought online quickly, and maintenance cost is reduced.

Real-time/offline integrated ETL data integration

[Figure: real-time/offline integrated ETL architecture]

In recent versions of Flink SQL, the dimension table join capability has been continuously strengthened: it can join not only dimension table data in databases in real time, but also dimension table data in Hive and Kafka, flexibly meeting different workload and timeliness requirements.

Based on Flink's powerful streaming ETL capability, we can do data ingestion and data transformation in the real-time layer, and then write the detail-layer data back to the offline data warehouse.

We ported the HyperLogLog (hereinafter HLL) implementation used in Presto into Spark UDAF functions, making HLL objects interoperable between the Spark SQL and Presto engines. For example, HLL objects generated by Spark SQL through the prepare function can be merge-queried not only in Spark SQL but also in Presto. The specific process is as follows:

[Figure: HLL object flow between Spark SQL and Presto]

Example of approximate UV calculation:

Step 1: generate HLL objects with Spark SQL

insert overwrite table dws_goods_uv partition (dt='${dt}', hm='${hm}') select goods_id, estimate_prepare(mid) as pre_hll from dwd_table_goods where dt='${dt}' and hm='${hm}' group by goods_id

Step 2: Spark SQL merges the HLL objects of the goods_id dimension into the brand dimension

insert overwrite table dws_brand_uv partition (dt='${dt}', hm='${hm}') select b.brand_id, estimate_merge(a.pre_hll) as merge_hll from dws_goods_uv a left join dim_table_brand_goods b on a.goods_id = b.goods_id where a.dt='${dt}' and a.hm='${hm}' group by b.brand_id

Step 3: query the brand-dimension UV in Spark SQL

select brand_id, estimate_compute(merge_hll) as uv from dws_brand_uv where dt='${dt}'

Step 4: Presto merge-queries the HLL objects generated by Spark

select brand_id, cardinality(merge(cast(merge_hll as HyperLogLog))) as uv from dws_brand_uv group by brand_id

Therefore, based on this real-time/offline integrated ETL data integration architecture, we obtain the following benefits:

  • It unifies the basic shared data sources;
  • It improves the timeliness of the offline data warehouse;
  • It reduces the maintenance cost of components and links.

4.2 experimental platform (Flink real-time data into OLAP)

The Vipshop experimentation platform is an integrated platform that analyzes A/B test effects on massive data through configurable multi-dimensional analysis and drill-down analysis. An experiment consists of a stream of traffic (such as user requests) and the modifications applied to that traffic relative to a control group. The experimentation platform requires low latency and fast responses when querying massive data at very large scale (tens of billions of rows). The overall data architecture is as follows:

[Figure: experimentation platform data architecture]

Data in Kafka is cleansed, parsed and widened with Flink SQL, joined with Redis dimension tables to attach commodity attributes, written into ClickHouse through distributed tables, and then queried ad hoc through a data service. The business data flow is as follows:

[Figure: experimentation platform business data flow]

Through the Flink SQL Redis connector, we support Redis sink, source and dimension table join operations, making it very convenient to read and write Redis and to do dimension table joins; caching can be configured for the dimension table join, which greatly improves performance. The real-time data pipeline is implemented in Flink SQL, and the resulting wide table is finally sunk into ClickHouse, sharded by murmurHash3_64 on a chosen field so that all data of the same user lands in the same shard node group. Joins between large ClickHouse tables thus become joins between local tables, reducing data shuffle and improving join query efficiency. A hedged ClickHouse DDL sketch is shown below.
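
The ClickHouse side of this design can be sketched as follows; cluster, database, table and column names are illustrative. The Distributed table's sharding key is murmurHash3_64 over the user field, so all rows of one user land on the same shard node group and joins between large tables run locally.

```sql
-- Local table on every shard (illustrative schema; assumes the 'exp' database exists).
CREATE TABLE exp.wide_events_local ON CLUSTER ck_cluster (
  mid        String,
  goods_id   UInt64,
  event_date Date,
  event_time DateTime
) ENGINE = MergeTree()
PARTITION BY event_date
ORDER BY (mid, event_time);

-- Distributed table sharded by murmurHash3_64(mid): rows of the same user always
-- land on the same shard, so joins between large tables can run as local joins.
CREATE TABLE exp.wide_events ON CLUSTER ck_cluster AS exp.wide_events_local
ENGINE = Distributed(ck_cluster, exp, wide_events_local, murmurHash3_64(mid));
```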

5、 Future planning

5.1 improve the usability of Flink SQL

At present, debugging Flink SQL is still inconvenient in many ways, and for users coming from offline Hive there is a certain barrier to entry, e.g. Kafka monitoring has to be configured manually and tasks have to be load-tested and tuned by hand. Minimizing this barrier for users is a big challenge. In the future, we will consider adding intelligent monitoring that tells users about problems in their current tasks, automating as much as possible and giving users optimization suggestions.

5.2 data lake CDC analysis scheme

At present, our VDP binlog message stream is written to the Hive ODS layer through Flink SQL to speed up the preparation of the ODS-layer data source, but it produces a large number of duplicate messages that need to be deduplicated and merged. We will consider a Flink + data lake CDC ingestion scheme for incremental warehousing. In addition, the Kafka message streams and aggregation results after order widening require very strong real-time upsert capability; at present we mainly use Kudu, but the Kudu cluster is relatively independent and little used, so we will investigate the incremental upsert capability of data lakes to replace the Kudu incremental upsert scenario.
