Real-time MQ-to-Hive data integration based on Flink

Time: 2021-03-12

In data platform construction, a typical data integration scenario is importing MQ (message queue, e.g., Kafka, RocketMQ) data into Hive for downstream data warehouse construction and metric statistics. Since MQ-to-Hive ingestion is the first layer of the data warehouse, it has high requirements on data accuracy and freshness.

This article focuses on the MQ-to-Hive scenario: it analyzes the pain points of the existing solution at ByteDance, proposes a Flink-based real-time solution, and describes the current status of the new solution inside ByteDance.

Existing solutions and pain points

The existing solution at ByteDance is shown in the figure below; it consists of two steps:

  1. Write MQ data to HDFS files through a dump service
  2. Then import the HDFS data into Hive through batch ETL and add the Hive partition

[Figure: existing solution with dump service and batch ETL]

Pain points

  1. The pipeline is long: raw data must go through multiple transformations before it reaches Hive
  2. Real-time performance is poor: the latency of the dump service and of the batch ETL delays the final data availability
  3. Storage and compute costs are high: MQ data is stored and computed repeatedly
  4. The service is built on native Java, so as data traffic keeps growing it suffers from single points of failure and unbalanced machine load
  5. Operation and maintenance costs are high, and the solution cannot reuse the company's existing infrastructure such as Hadoop / Flink / YARN
  6. Remote disaster recovery is not supported

Flink-based real-time solution

Advantages

To address the pain points of the company's traditional solution, we propose a Flink-based real-time solution that writes MQ data into Hive in real time and supports event time and exactly-once semantics. Compared with the old scheme, the new scheme has the following advantages:

  1. Built on the streaming engine Flink, supporting exactly-once semantics
  2. Better real-time performance: MQ data goes directly into Hive with no intermediate computing step
  3. Less intermediate storage: data lands only once along the whole pipeline
  4. Supports the YARN deployment mode, making user migration easy
  5. Flexible resource management, making scaling and operations easy
  6. Supports dual-datacenter disaster recovery

Overall architecture

The overall architecture is shown in the figure below. It consists of three modules: DTS (Data Transmission Service) Source, DTS Core, and DTS Sink. Their functions are as follows (see the sketch after this list):

  1. DTS Source connects to different MQ data sources, supporting Kafka, RocketMQ, etc.
  2. DTS Sink writes data to the target data source, supporting HDFS, Hive, etc.
  3. DTS Core drives the whole data synchronization process: it reads data through the Source, processes it through the DTS framework, and finally writes it to the target through the Sink.
  4. The DTS framework integrates core features such as the type system, file rolling, exactly-once semantics, task metrics collection, event time, and dirty data collection.
  5. YARN deployment is supported, making resource scheduling and management flexible.
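
To make the module boundaries concrete, here is a minimal, hypothetical Java sketch of the Source / Core / Sink split described above. The interface and class names (DtsSource, DtsSink, DtsRecord, DtsCore) are illustrative and not the actual DTS API.

```java
import java.util.List;

// Hypothetical record type flowing through the pipeline (not the real DTS API).
class DtsRecord {
    final Object value;
    final long eventTime;
    DtsRecord(Object value, long eventTime) { this.value = value; this.eventTime = eventTime; }
}

// DTS Source: connects to an MQ (Kafka, RocketMQ, ...) and emits records.
interface DtsSource {
    List<DtsRecord> poll();          // pull a batch from the upstream MQ
    void commitOffsets();            // commit consumed offsets after a successful checkpoint
}

// DTS Sink: writes records to the target system (HDFS, Hive, ...).
interface DtsSink {
    void write(DtsRecord record);    // write to a temporary location
    void commit(long checkpointId);  // publish data on checkpoint completion
}

// DTS Core: drives the synchronization loop between source and sink.
class DtsCore {
    private final DtsSource source;
    private final DtsSink sink;
    DtsCore(DtsSource source, DtsSink sink) { this.source = source; this.sink = sink; }

    void runOnce() {
        for (DtsRecord r : source.poll()) {
            sink.write(r);           // type conversion, rolling, dirty-data handling omitted
        }
    }
}
```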

[Figure: DTS overall architecture]

Exactly Once

The Flink framework provides exactly-once or at-least-once semantics through its checkpoint mechanism. To support exactly-once semantics on the full MQ-to-Hive link, both the MQ source and the Hive sink need to support exactly-once semantics. This is implemented with checkpoints plus a two-phase commit (2PC) protocol; the specific process is as follows (see the sink-side sketch after this list):

  1. During normal writing, the source pulls data from the upstream MQ and sends it to the sink; the sink writes the data to a temporary directory
  2. In the checkpoint snapshot phase, the source saves the MQ offsets to state; the sink closes its file handles and saves the current checkpoint ID to state
  3. In the checkpoint complete phase, the source commits the MQ offsets; the sink moves the data from the temporary directory to the official directory
  4. In the checkpoint recovery phase, the latest successful checkpoint is loaded and the state is restored: the source starts from the MQ offsets saved in state, and the sink restores the latest successful checkpoint ID and moves the data of its temporary directory to the official directory
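
A minimal, self-contained Java sketch of the sink-side behaviour described above. The hook names (onSnapshot, onCheckpointComplete, onRestore) mirror Flink's checkpoint callbacks, but the class itself is illustrative, not the actual implementation.

```java
import java.io.IOException;
import java.nio.file.*;

// Illustrative sketch: write to a temp directory, publish it when the checkpoint completes.
class TwoPhaseHdfsSinkSketch {
    private final Path tempDir;
    private final Path officialDir;
    private long lastCheckpointId = -1;       // saved to / restored from checkpoint state

    TwoPhaseHdfsSinkSketch(Path tempDir, Path officialDir) {
        this.tempDir = tempDir;
        this.officialDir = officialDir;
    }

    // Normal writing: data only lands in the temporary directory.
    void write(String fileName, byte[] data) throws IOException {
        Files.createDirectories(tempDir);
        Files.write(tempDir.resolve(fileName), data);
    }

    // Checkpoint snapshot phase: close file handles and remember the checkpoint id.
    void onSnapshot(long checkpointId) {
        // real code would flush and close the open writers here
        lastCheckpointId = checkpointId;
    }

    // Checkpoint complete phase: move temp data to the official directory.
    void onCheckpointComplete(long checkpointId) throws IOException {
        if (!Files.exists(tempDir)) return;
        Files.createDirectories(officialDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tempDir)) {
            for (Path f : files) {
                Files.move(f, officialDir.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    // Checkpoint recovery phase: re-publish data of the last successful checkpoint.
    void onRestore(long restoredCheckpointId) throws IOException {
        lastCheckpointId = restoredCheckpointId;
        onCheckpointComplete(restoredCheckpointId);
    }
}
```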

Implementation optimization

In real usage scenarios, especially at high parallelism, HDFS write latency is prone to spikes, which can cause an individual task's snapshot to time out or fail and thus fail the entire checkpoint. Improving the system's tolerance of checkpoint failures is therefore important. Here we exploit the fact that the checkpoint ID is strictly monotonically increasing: every new checkpoint's ID is larger than all previous ones, so in the checkpoint complete phase we can commit all temporary data whose checkpoint ID is less than or equal to the current one. The optimization strategy is as follows (see the sketch after this list):

  1. The sink's temporary directory is {dump_path}/{next_cp_id}, where next_cp_id is defined as the latest cp_id + 1
  2. In the checkpoint snapshot phase, the sink saves the latest cp_id to state and updates next_cp_id to cp_id + 1
  3. In the checkpoint complete phase, the sink moves the data of all temporary directories whose ID is less than or equal to the current cp_id to the official directory
  4. In the checkpoint recovery phase, the sink restores the latest successful cp_id and moves the data of all temporary directories whose ID is less than or equal to that cp_id to the official directory
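
A minimal sketch of the catch-up commit described above, assuming temporary directories are named {dump_path}/{cp_id} as in the text; the class and helper names are illustrative.

```java
import java.io.IOException;
import java.nio.file.*;

// Illustrative sketch: temp data is grouped by checkpoint id, and every checkpoint-complete
// (or recovery) commits all temp directories whose id is <= the given checkpoint id.
class CheckpointIdCommitterSketch {
    private final Path dumpPath;      // root of the temporary directories
    private final Path officialDir;

    CheckpointIdCommitterSketch(Path dumpPath, Path officialDir) {
        this.dumpPath = dumpPath;
        this.officialDir = officialDir;
    }

    // New data for the next checkpoint goes to {dump_path}/{cp_id + 1}.
    Path tempDirFor(long lastCheckpointId) {
        return dumpPath.resolve(String.valueOf(lastCheckpointId + 1));
    }

    // Checkpoint complete or recovery: move everything with id <= checkpointId.
    void commitUpTo(long checkpointId) throws IOException {
        if (!Files.exists(dumpPath)) return;
        Files.createDirectories(officialDir);
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(dumpPath)) {
            for (Path dir : dirs) {
                long id;
                try {
                    id = Long.parseLong(dir.getFileName().toString());
                } catch (NumberFormatException e) {
                    continue;                     // not a checkpoint directory
                }
                if (id <= checkpointId) {
                    try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
                        for (Path f : files) {
                            Files.move(f, officialDir.resolve(f.getFileName()),
                                       StandardCopyOption.REPLACE_EXISTING);
                        }
                    }
                    Files.delete(dir);            // directory is empty after the moves
                }
            }
        }
    }
}
```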

Type system

Because different data sources support different data types, we introduce the DTS type system to handle data synchronization between heterogeneous data sources and the compatibility of type conversion. DTS types are divided into basic types and composite types, and composite types can be nested. The conversion process is as follows (see the sketch after this list):

  1. On the source side, the source data types are converted into the internal DTS types
  2. On the sink side, the internal DTS types are converted into the target data source types
  3. The DTS type system supports conversion between different types, for example between string and date
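
A minimal sketch of a two-step conversion through an intermediate type system, as described above. The DtsType enum, the type mappings, and the converter methods are hypothetical and only illustrate the "source type → DTS type → target type" flow.

```java
import java.time.LocalDate;

// Illustrative intermediate type system; composite types and the full type set are omitted.
enum DtsType { STRING, LONG, DOUBLE, DATE }

class TypeSystemSketch {

    // Source side: map a source-specific type name onto the internal DTS type.
    static DtsType fromSourceType(String sourceType) {
        switch (sourceType.toLowerCase()) {
            case "varchar":
            case "text":    return DtsType.STRING;
            case "bigint":  return DtsType.LONG;
            case "double":  return DtsType.DOUBLE;
            case "date":    return DtsType.DATE;
            default: throw new IllegalArgumentException("unsupported source type: " + sourceType);
        }
    }

    // Sink side: map the internal DTS type onto a Hive type name.
    static String toHiveType(DtsType type) {
        switch (type) {
            case STRING: return "string";
            case LONG:   return "bigint";
            case DOUBLE: return "double";
            case DATE:   return "date";
            default: throw new IllegalStateException();
        }
    }

    // Value conversion between DTS types, e.g. string -> date.
    static Object convert(Object value, DtsType from, DtsType to) {
        if (from == to) return value;
        if (from == DtsType.STRING && to == DtsType.DATE) {
            return LocalDate.parse((String) value);   // expects ISO format, e.g. 2021-03-12
        }
        throw new IllegalArgumentException("unsupported conversion: " + from + " -> " + to);
    }
}
```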


Rolling Policy

The sink writes concurrently and each task's traffic differs. To avoid producing too many small files or overly large files, we support custom file rolling policies that control the size of a single file. Three rolling strategies are currently supported: file size, maximum idle (not-updated) time, and checkpoint.
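
A minimal sketch of the three rolling conditions described above (file size, maximum idle time, checkpoint); the thresholds and method names are illustrative.

```java
// Illustrative rolling decision: roll the current part file when any condition is met.
class RollingPolicySketch {
    private final long maxFileSizeBytes;     // e.g. 128 MB
    private final long maxIdleMillis;        // e.g. 10 minutes without new writes

    RollingPolicySketch(long maxFileSizeBytes, long maxIdleMillis) {
        this.maxFileSizeBytes = maxFileSizeBytes;
        this.maxIdleMillis = maxIdleMillis;
    }

    // Called on every write.
    boolean shouldRollOnWrite(long currentFileSizeBytes, long lastWriteTimeMillis, long nowMillis) {
        if (currentFileSizeBytes >= maxFileSizeBytes) return true;   // file size policy
        return nowMillis - lastWriteTimeMillis >= maxIdleMillis;     // idle time policy
    }

    // Called when a checkpoint is triggered: always roll so every committed file is complete.
    boolean shouldRollOnCheckpoint() {
        return true;                                                 // checkpoint policy
    }
}
```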

Optimization strategy

Hive supports Parquet, ORC, text, and other storage formats. Different storage formats have different write paths, which fall into two categories:

  1. RowFormat: written row by row; supports HDFS truncate to a given offset, e.g., the text format
  2. BulkFormat: written in blocks; does not support HDFS truncate, e.g., the Parquet and ORC formats

To guarantee exactly-once semantics while supporting Parquet, ORC, text, and other formats, files are forcibly rolled at every checkpoint, so that every committed file is complete and no truncate operation is needed during checkpoint recovery.

Fault tolerance

Ideally, a streaming task would run continuously without restarting, but in practice the following scenarios are unavoidable:

  1. Upgrading the Flink compute engine requires restarting the task
  2. As upstream data grows, the task parallelism needs to be adjusted
  3. Task failover

■ Parallelism adjustment

Flink natively supports state rescaling. In the implementation, the MQ offsets are saved to ListState when the task takes a checkpoint snapshot; after the job restarts, the JobMaster distributes the ListState evenly to the tasks according to the operator's parallelism.
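
A minimal sketch of redistributing per-partition MQ offsets across a new parallelism, in the spirit of Flink's even list-state redistribution on rescale. This is a simplified stand-in, not Flink's actual rescale code, and the type names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: offsets saved as a flat list at checkpoint time are split
// round-robin across the tasks of the restarted job.
class OffsetRescaleSketch {

    // One (topic, partition, offset) entry of the saved list state.
    record PartitionOffset(String topic, int partition, long offset) {}

    static List<List<PartitionOffset>> redistribute(List<PartitionOffset> saved, int newParallelism) {
        List<List<PartitionOffset>> perTask = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) perTask.add(new ArrayList<>());
        for (int i = 0; i < saved.size(); i++) {
            perTask.get(i % newParallelism).add(saved.get(i));   // round-robin assignment
        }
        return perTask;
    }
}
```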

■ Task Failover

Task write failures are unavoidable due to external factors such as network jitter and write timeouts, so failing over quickly and correctly is important. Flink natively supports multiple failover strategies; here the region failover strategy is used, which restarts only the tasks in the region containing the failed task.

Remote disaster recovery

Background

In the big data era, data accuracy and timeliness are especially important. This article presents a multi-datacenter deployment and remote disaster recovery solution: when the primary datacenter temporarily cannot serve traffic because of a network outage, power failure, earthquake, fire, and so on, the service can be switched quickly to the backup datacenter while exactly-once semantics are preserved.

Disaster recovery components

The overall solution requires the cooperation of multiple components. The disaster recovery components are shown in the figure below, mainly MQ, YARN, and HDFS:

  1. MQ must support multi-datacenter deployment; when the primary datacenter fails, the leader can be switched to the backup datacenter for downstream consumption
  2. YARN clusters are deployed in both the primary and the backup datacenter so that Flink jobs can be migrated
  3. The downstream HDFS must support multi-datacenter deployment; when the primary datacenter fails, the master can be switched to the backup datacenter
  4. Flink jobs run on YARN and their task state backend is stored in HDFS; HDFS's multi-datacenter support ensures that the state backend is also available in multiple datacenters

[Figure: disaster recovery components]

Disaster recovery process

The overall disaster recovery process is as follows:

  1. Under normal circumstances, the MQ leader and the HDFS master are deployed in the primary datacenter and data is replicated to the backup datacenter; the Flink job runs in the primary datacenter and writes its task state to HDFS. Note that the state is therefore also stored across both datacenters
  2. In case of disaster, the MQ leader and the HDFS master are migrated from the primary datacenter to the backup datacenter; the Flink job is migrated as well and recovers the pre-disaster offsets from state, preserving exactly-once semantics


Event time archiving

Background

In data warehouse construction, processing time and event time are handled differently. With processing time, data is written to the time partition corresponding to the current system time; with event time, data is written to the time partition corresponding to the time at which it was produced, which this article refers to as archiving. In practice, upstream or downstream failures that take some time to recover from are unavoidable. With the processing-time (non-archiving) strategy, data produced during the incident is written to the time partitions after recovery, eventually causing partition holes or data drift; with the archiving strategy, data is written according to event time, so this problem does not arise. However, because upstream event times can be out of order and a Hive partition should not keep receiving writes after it has been generated, unbounded archiving is impossible in practice; archiving can only be done within a certain time range. The difficulty of archiving lies in determining the global minimum archive time and tolerating a certain degree of disorder.

■ Global minimum archive time

The source reads concurrently, and one task may consume data from multiple MQ partitions at the same time. For each MQ partition, the current partition archive time is tracked; the minimum across the partitions read by a task is taken as that task's minimum archive time, and the minimum across all tasks is taken as the global minimum archive time.
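
A minimal sketch of the two-level minimum described above: a per-task minimum over its MQ partitions, then a global minimum over all tasks. The class and method names are illustrative.

```java
import java.util.Collection;
import java.util.Map;

// Illustrative sketch of computing the global minimum archive time.
class ArchiveTimeSketch {

    // Per task: the minimum of the archive times of the MQ partitions it reads.
    static long taskMinArchiveTime(Map<Integer, Long> partitionArchiveTime) {
        long min = Long.MAX_VALUE;
        for (long t : partitionArchiveTime.values()) min = Math.min(min, t);
        return min;
    }

    // Globally: the minimum of all task-level minimums.
    static long globalMinArchiveTime(Collection<Long> taskMinArchiveTimes) {
        long min = Long.MAX_VALUE;
        for (long t : taskMinArchiveTimes) min = Math.min(min, t);
        return min;
    }
}
```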


Out-of-order handling

To support out-of-order data, an archive interval can be configured. Here, the global min watermark is the global minimum archive time, the partition watermark is the current archive time of a partition, and the partition min watermark is the minimum archive time of the partition. An event is archived only when its event time satisfies both of the following conditions (see the sketch after this list):

  1. The event time is greater than the global minimum archive time
  2. The event time is greater than the partition's minimum archive time
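
A minimal sketch of the archive decision under an archive interval. How the partition minimum watermark is derived from the interval is an assumption made here purely for illustration (partition watermark minus the interval); the source text only states that the interval exists.

```java
// Illustrative archive check: an event is archived only if its event time is greater
// than both the global minimum archive time and the partition's minimum archive time.
class ArchiveDecisionSketch {
    private final long archiveIntervalMillis;

    ArchiveDecisionSketch(long archiveIntervalMillis) {
        this.archiveIntervalMillis = archiveIntervalMillis;
    }

    boolean shouldArchive(long eventTime, long globalMinWatermark, long partitionWatermark) {
        // assumed derivation: tolerate disorder up to archiveIntervalMillis behind the partition watermark
        long partitionMinWatermark = partitionWatermark - archiveIntervalMillis;
        return eventTime > globalMinWatermark && eventTime > partitionMinWatermark;
    }
}
```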


Hive partition generation

Principle

The difficulty of Hive partition generation lies in determining when a partition's data is ready and how to add the partition. Since the sink writes concurrently, multiple tasks write to the same partition at the same time, so a partition's data can be considered ready only after all tasks have finished writing it. The process is as follows (see the sketch after this list):

  1. On the sink side, each task tracks its current minimum processing time, which must be monotonically increasing
  2. When a checkpoint completes, each task reports its minimum processing time to the JobManager (JM)
  3. After the JM has collected the minimum processing times of all tasks, it computes the global minimum processing time and uses it as the minimum ready time of the Hive partition
  4. Whenever the minimum ready time advances, the JM decides whether to add the Hive partition
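
A minimal sketch of the JM-side bookkeeping described above: each task reports a monotonically increasing minimum processing time on checkpoint completion, and the minimum across all tasks is used as the partition ready time. The class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of deriving the Hive partition ready time on the JM side.
class PartitionReadyTrackerSketch {
    private final Map<Integer, Long> minProcessTimeByTask = new HashMap<>();
    private final int numTasks;

    PartitionReadyTrackerSketch(int numTasks) { this.numTasks = numTasks; }

    // Called when a task reports its minimum processing time after checkpoint complete.
    void report(int taskId, long minProcessTime) {
        // per-task times are monotonically increasing, so max() keeps the latest value
        minProcessTimeByTask.merge(taskId, minProcessTime, Math::max);
    }

    // The ready time is defined only after every task has reported at least once.
    Long partitionReadyTime() {
        if (minProcessTimeByTask.size() < numTasks) return null;
        long min = Long.MAX_VALUE;
        for (long t : minProcessTimeByTask.values()) min = Math.min(min, t);
        return min;
    }

    // A partition (e.g. the hour ending at partitionEndTime) can be added once the ready time passes its end.
    boolean canAddPartition(long partitionEndTime) {
        Long ready = partitionReadyTime();
        return ready != null && ready >= partitionEndTime;
    }
}
```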


Dynamic partitioning

Dynamic partitioning determines the partition directory a record is written to from the values in the upstream data, rather than writing to a fixed partition directory, for example date={date}/hour={hour}/app={app}: the final partition directory is determined by the partition time and the value of the app field, so that within each hour the records of the same app land in the same partition. In the static partitioning scenario, each task writes to only one partition file at a time; in the dynamic partitioning scenario, each task may write to multiple partition files simultaneously. For Parquet writes, data is first written to a local cache and then flushed to Hive in batches, so a task that holds too many file handles at once is prone to OOM. To prevent a single task from going OOM, file handles are checked periodically and handles that have not been written to for a long time are released.
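
A minimal sketch of the idle-handle eviction described above: each open partition writer remembers its last write time, and a periodic sweep closes writers that have been idle too long. The generic writer type, the cache class, and the threshold are illustrative.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Illustrative sketch: cap memory usage in dynamic partitioning by closing idle writers.
class PartitionWriterCacheSketch<W extends Closeable> {

    private static class Entry<W> {
        final W writer;
        long lastWriteMillis;
        Entry(W writer, long now) { this.writer = writer; this.lastWriteMillis = now; }
    }

    private final Map<String, Entry<W>> writersByPartition = new HashMap<>();
    private final long maxIdleMillis;

    PartitionWriterCacheSketch(long maxIdleMillis) { this.maxIdleMillis = maxIdleMillis; }

    // Record that a partition (e.g. "date=20210312/hour=10/app=foo") was just written to.
    void touch(String partitionPath, W writer, long nowMillis) {
        writersByPartition.computeIfAbsent(partitionPath, p -> new Entry<>(writer, nowMillis))
                          .lastWriteMillis = nowMillis;
    }

    // Periodic sweep: close and drop writers that have not been written to for a long time.
    void releaseIdleHandles(long nowMillis) throws IOException {
        Iterator<Map.Entry<String, Entry<W>>> it = writersByPartition.entrySet().iterator();
        while (it.hasNext()) {
            Entry<W> e = it.next().getValue();
            if (nowMillis - e.lastWriteMillis >= maxIdleMillis) {
                e.writer.close();   // flush buffered data and free the file handle
                it.remove();
            }
        }
    }
}
```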

Messenger

The Messenger module collects job runtime information, which is used to assess job health and to build dashboard metrics.

■ Meta information collection

The principle of meta information collection is as follows: on the sink side, the core task metrics, such as traffic, QPS, dirty data, write latency, event time, and write effectiveness, are collected through the Messenger Collector. Dirty data is written to external storage, and the task runtime metrics are exported to Grafana for dashboard display.

■ Dirty data collection

In data integration scenarios, dirty data, such as misconfigured types, field overflow, and incompatible type conversions, is unavoidable. Because a streaming task runs continuously, it must count dirty-data traffic in real time, save the dirty data to external storage for troubleshooting, and print sampled records in the running log.
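
A minimal sketch of the dirty-data handling described above: count in real time, persist to an external store for troubleshooting, and sample a fraction into the log. The DirtyDataStore interface and the sampling rate are illustrative.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of real-time dirty data accounting with sampled logging.
class DirtyDataCollectorSketch {

    // Hypothetical external storage for dirty records (e.g. a side table or message queue).
    interface DirtyDataStore {
        void save(String rawRecord, String reason);
    }

    private final AtomicLong dirtyCount = new AtomicLong();
    private final DirtyDataStore store;
    private final double logSampleRate;      // e.g. 0.001 = log roughly 1 in 1000 dirty records

    DirtyDataCollectorSketch(DirtyDataStore store, double logSampleRate) {
        this.store = store;
        this.logSampleRate = logSampleRate;
    }

    void onDirtyRecord(String rawRecord, String reason) {
        dirtyCount.incrementAndGet();                     // feeds the real-time dirty-data metric
        store.save(rawRecord, reason);                    // keep the record for troubleshooting
        if (ThreadLocalRandom.current().nextDouble() < logSampleRate) {
            System.out.println("dirty record (" + reason + "): " + rawRecord);  // sampled log output
        }
    }

    long dirtyCount() { return dirtyCount.get(); }
}
```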

Dashboard monitoring

The metrics cover both global and per-job indicators, including write success traffic and QPS, write latency, write failure traffic and QPS, and archiving statistics, as shown in the following figure:
[Figure: monitoring dashboards]

Future planning

The Flink-based real-time solution has been launched and is being rolled out across the company. Future work will focus on the following aspects:

  1. Enhanced data integration: support more data sources and user-defined data transformation logic
  2. Data lake: support real-time import of CDC data
  3. Unified stream-batch architecture: support data integration for both full and incremental scenarios
  4. Architecture upgrade: support more deployment environments, such as Kubernetes (k8s)
  5. Better self-service, reducing the cost of user onboarding

Summary

As ByteDance's business products diversify and grow rapidly, the company's internal one-stop big data development platform has become increasingly feature-rich, providing data integration solutions for offline, real-time, full, and incremental scenarios. The workload has grown from a few hundred tasks initially to tens of thousands, and daily data volume has reached the PB level. The Flink-based real-time solution described here has been widely promoted within the company and is gradually replacing the old MQ-to-Hive pipeline.
