Each article in this series is fairly short and is updated from time to time. Starting from practical cases, the series aims to help readers level up their skills. This article covers tips for designing Flink sink schema fields. Reading time is about 2 minutes, so let's get straight to the point.
## Add a version field to the sink schema
As the title says, let's go straight to the practical case and usage.
### Practical case and usage
- In the normal (non-fault) scenario, every record produced has a version field value of 1
- In a fault scenario, records with version > 1 can be written to the same sink; these represent the repair data provided for downstream consumption
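The producer side of this convention can be sketched in a few lines of plain Python. This is only an illustration and does not use the Flink API: the `SinkRecord` type, its `key` and `value` fields, and the two helper functions are hypothetical names, and only the `version` field and its meaning come from the text above.

```python
from dataclasses import dataclass

NORMAL_VERSION = 1  # value written in the non-fault scenario


@dataclass(frozen=True)
class SinkRecord:
    """Hypothetical sink row: business fields plus the version field."""
    key: str    # hypothetical business key, e.g. a page id
    value: int  # hypothetical business metric
    version: int = NORMAL_VERSION


def normal_record(key: str, value: int) -> SinkRecord:
    """Record produced by the job when nothing has gone wrong."""
    return SinkRecord(key, value)


def repair_record(key: str, value: int, repair_version: int) -> SinkRecord:
    """Repair record written to the same sink after a fault."""
    if repair_version <= NORMAL_VERSION:
        raise ValueError("repair data must use version > 1")
    return SinkRecord(key, value, repair_version)


dirty = normal_record("page_a", 10)     # wrong value produced during the fault
fixed = repair_record("page_a", 12, 2)  # corrected value, same sink, version 2
print(dirty)
print(fixed)
```

Rejecting a repair version of 1 or lower keeps repair data from being mistaken for normal output.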
### Failure scenarios this can handle
A failure in upstream Flink task A causes dirty data to be written to Kafka X. Downstream consumers fall into two categories:
- Downstream is a Flink task: Flink task B consumes the dirty data in Kafka X, then computes and produces incorrect results
- Downstream is an OLAP engine and BI dashboards: the dashboards display abnormal data
First, here is the overall approach to avoiding and handling the problems above:
- 1. Optimize the logic to keep the upstream task stable: first of all, apply optimizations so that upstream Flink task A does not fail
- 2. Configure job monitoring and alerting: set up monitoring and alerts for the whole pipeline so that problems are detected and located promptly
- 3. Define a fault handling and repair plan: a plan must be in place so that, once a fault occurs, there is a way to deal with it
- 4. Downstream adapts its consumption and processing to the characteristics of the data source: ensure that business logic is not affected even if dirty data is consumed
The following briefly expands on the repair plan (point 3 above). For the scenarios above, there are currently three options for repairing the data:
- Option 1 (offline repair): repair data is produced offline and overwrites the dirty data. The drawbacks are high repair latency, the need to switch between the offline and real-time data sources, and high manual operation cost
- Option 2 (real-time repair): rerun the repair logic and write the repair data to Kafka X-fix; downstream Flink task B restarts consumption from a specified offset in Kafka X-fix, then computes and produces correct data. The drawbacks are that the code of downstream Flink task B must be changed, two switches are needed (from the original topic to the repair topic and back), and the repair logic is more complex
- Option 3 (real-time repair, the version field approach of this section): to avoid the costly data source switching downstream, the repair data is written to the original Kafka topic, and normal output and repair data are distinguished by the version field. Compared with options 1 and 2, there is no data source switching logic; the downstream only has to control which version value it consumes in order to read the corresponding repair data. This significantly reduces manual operation cost, and the repair logic is relatively simple
Note: option 3 requires reserving some capacity headroom on Kafka X. Otherwise, while repair data is being written, the write or read QPS on Kafka X may become too high and affect the jobs producing normal data.
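On the consuming side, option 3 comes down to choosing which version values to read from the single topic. The sketch below is plain Python rather than Flink or Kafka client code. The `Row` type and both helpers are hypothetical, and the "highest version wins per key" rule in `latest_per_key` is an assumption about how a dashboard-style consumer might merge the data, not something the approach prescribes.

```python
from typing import Dict, Iterable, List, NamedTuple


class Row(NamedTuple):
    """Hypothetical row read from the single Kafka topic."""
    key: str
    value: int
    version: int


def consume(rows: Iterable[Row], wanted_version: int = 1) -> List[Row]:
    """Keep only rows carrying the version the downstream job asked for."""
    return [r for r in rows if r.version == wanted_version]


def latest_per_key(rows: Iterable[Row]) -> Dict[str, Row]:
    """Assumed merge rule: for each key, the highest version wins."""
    best: Dict[str, Row] = {}
    for r in rows:
        current = best.get(r.key)
        if current is None or r.version >= current.version:
            best[r.key] = r
    return best


# Normal output (version 1) and repair data (version 2) share one topic.
topic = [Row("a", 10, 1), Row("b", 5, 1), Row("a", 12, 2)]

print(consume(topic, wanted_version=2))  # only the repair row for key "a"
print(latest_per_key(topic))             # "a" repaired, "b" unchanged
```

Because the filter is just a parameter, switching a downstream job from normal data to repair data is a configuration change rather than a change of data source.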
## Add timestamp fields to the sink schema
### Practical case and usage
In the window scenario, the following fields can be added to the sink schema:
- flink_process_start_time (long): the timestamp at which the Flink window's processing logic starts
- flink_process_end_time (long): the timestamp at which the Flink window's processing logic ends
- window_start (long): the start timestamp of the Flink window
- window_end (long): the end timestamp of the Flink window
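To make the four fields concrete, here is a minimal plain-Python sketch of a tumbling-window sum that attaches them to each output row. It does not use the Flink API. The one-minute window size, the `total` field, and the helper names are hypothetical, and treating window_end as window_start plus the window size (an exclusive end) is an assumption.

```python
import time
from collections import defaultdict

WINDOW_SIZE_MS = 60_000  # hypothetical 1-minute tumbling window


def now_ms() -> int:
    """Current wall-clock time in milliseconds."""
    return int(time.time() * 1000)


def window_start_of(event_time_ms: int, size_ms: int = WINDOW_SIZE_MS) -> int:
    """Start of the tumbling window that contains the event."""
    return event_time_ms - event_time_ms % size_ms


def process_windows(events, size_ms: int = WINDOW_SIZE_MS):
    """events: iterable of (event_time_ms, amount); returns one sink row per window."""
    buckets = defaultdict(list)
    for event_time_ms, amount in events:
        buckets[window_start_of(event_time_ms, size_ms)].append(amount)

    rows = []
    for start in sorted(buckets):
        process_start = now_ms()     # window logic begins
        total = sum(buckets[start])  # stand-in for the real window logic
        process_end = now_ms()       # window logic ends
        rows.append({
            "total": total,
            "flink_process_start_time": process_start,
            "flink_process_end_time": process_end,
            "window_start": start,
            "window_end": start + size_ms,  # assumed exclusive end
        })
    return rows


for row in process_windows([(1_000, 2), (59_999, 3), (60_000, 5)]):
    print(row)
```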
### Production practice cases
- flink_process_start_time and flink_process_end_time help users locate the cause of data deviations during the development, testing, and verification stages
- window_start and window_end help users check whether any data was lost in each window's processing and identify exactly which data each window processed
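As a rough illustration of how these fields might be used for troubleshooting, the plain-Python sketch below scans sink rows for gaps in the window sequence and computes how long each window took to process. The function names and sample rows are hypothetical. Note that a missing window is only a hint: it can also mean that no input arrived during that interval.

```python
from typing import Dict, List


def missing_windows(rows: List[Dict[str, int]], size_ms: int) -> List[int]:
    """Start timestamps of windows absent between the first and last row."""
    starts = sorted(r["window_start"] for r in rows)
    if not starts:
        return []
    seen = set(starts)
    expected = range(starts[0], starts[-1] + size_ms, size_ms)
    return [s for s in expected if s not in seen]


def processing_duration_ms(row: Dict[str, int]) -> int:
    """Time spent in the window's processing logic."""
    return row["flink_process_end_time"] - row["flink_process_start_time"]


# Hypothetical sink rows: the window starting at 60000 never produced output.
sample = [
    {"window_start": 0, "window_end": 60_000,
     "flink_process_start_time": 60_010, "flink_process_end_time": 60_050},
    {"window_start": 120_000, "window_end": 180_000,
     "flink_process_start_time": 180_005, "flink_process_end_time": 180_020},
]

print(missing_windows(sample, 60_000))
print([processing_duration_ms(r) for r in sample])
```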
This article introduced techniques for adding version and timestamp extension fields to the sink schema, helping users improve the efficiency and availability of real-time data fault recovery in production environments.