Optimization and Practice of Flink SQL at ByteDance


Compiled by: aven (Flink community volunteer)

Abstract: This article is based on a talk by Li Benchao, Apache Flink Committer and architecture R&D engineer at ByteDance. It introduces the application of Flink SQL at ByteDance in four parts. The contents are as follows:

  • General introduction
  • Practice optimization
  • Stream-batch unification
  • Future planning

1. General introduction


Blink was open-sourced in December 2018; about eight months later, on August 22, 2019, Flink 1.9 was released. Before the Flink 1.9 release, ByteDance's internal SQL platform was built on the Blink master branch. After 2-3 months of work, we released a streaming SQL platform built on the Flink 1.9 Blink planner in October 2019 and promoted it internally. In this process, we found some interesting demand scenarios as well as some strange bugs.

Flink SQL extensions based on 1.9

Although the latest version of Flink supports SQL DDL, Flink 1.9 does not. ByteDance extended DDL internally on top of Flink 1.9, supporting the following syntax:

  • CREATE TABLE
  • CREATE VIEW
  • CREATE FUNCTION
  • ADD RESOURCE

The DDL extension also adds the watermark definition, which Flink 1.9 does not support.
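For illustration, a watermark definition in DDL might look like the following sketch. ByteDance's internal 1.9-era syntax is not public; this uses the WATERMARK clause that the community later standardized, and the table name and connector options are made up:

```sql
-- Hypothetical table: the WATERMARK clause declares ts as the event-time
-- column with a 5-second out-of-orderness bound.
CREATE TABLE user_actions (
  user_id BIGINT,
  action  STRING,
  ts      TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic'     = 'user_actions',
  'format'    = 'json'
);
```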

When we encouraged users to express their jobs in SQL as much as possible, we received a lot of feedback that "SQL can't express complex business logic." Over time, we found that much of the so-called complex business logic boiled down to making external RPC calls. For this scenario, an RPC dimension table and sink were built inside ByteDance, so that users can read and write RPC services. This greatly expands the use cases of SQL, and similar support was added for FaaS, which resembles RPC. In total, ByteDance added internal dimension-table support for Redis / Abase / ByteTable / ByteSQL / RPC / FaaS.
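As a sketch of how such a dimension table is used: a lookup join in Flink SQL follows the FOR SYSTEM_TIME AS OF pattern. The 'rpc' connector name and its option below are hypothetical placeholders for the internal implementation:

```sql
-- Hypothetical RPC-backed dimension table (connector name assumed).
CREATE TABLE user_profile (
  user_id BIGINT,
  tags    STRING
) WITH (
  'connector' = 'rpc'
);

-- Enrich each event by calling the RPC service at processing time.
SELECT e.user_id, e.action, p.tags
FROM events AS e
JOIN user_profile FOR SYSTEM_TIME AS OF e.proctime AS p
  ON e.user_id = p.user_id;
```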

At the same time, several internal connectors were implemented:

  1. source: RocketMQ
  2. sink:

A set of matching formats for these connectors was also developed: PB / Binlog / Bytes.

SQL platform


In addition to extending Flink's own functionality, ByteDance also launched a SQL platform that supports the following features:

  • SQL editing
  • SQL parsing
  • SQL debugging
  • Custom UDFs and connectors
  • Version control
  • Task management

2. Practice and optimization

In addition to extending functionality, we also made some optimizations to address the shortcomings of Flink 1.9 SQL.

Window performance optimization

1. Mini-batch support for windows

Mini-batch is a distinctive feature of the Blink planner. Its main idea is to accumulate a batch of data and then access state only once per batch, reducing the number of state accesses and the cost of serialization and deserialization. This optimization mainly helps in the RocksDB state backend scenario; with heap state, mini-batch brings little benefit. In some typical business scenarios, feedback shows it can reduce CPU overhead by 20-30%.
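In the community Blink planner, mini-batch for unbounded aggregations is switched on via configuration; ByteDance's extension applies the same mechanism to window aggregations. A sketch using the community configuration keys (e.g. as SQL client SET statements):

```sql
-- Buffer input for up to 5 s or 5000 rows before touching state.
SET table.exec.mini-batch.enabled = true;
SET table.exec.mini-batch.allow-latency = 5s;
SET table.exec.mini-batch.size = 5000;
```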

2. Extended window types

At present, SQL has three kinds of built-in windows: tumbling windows, sliding windows, and session windows. The semantics of these three windows cannot satisfy some user scenarios. For example, in the live-streaming scenario, analysts want to compute per-hour metrics such as UV (unique visitors) and GMV (gross merchandise volume) for a streamer after the stream starts. The natural hour boundaries of the tumbling window do not meet this need, so custom windows were implemented inside ByteDance to cover these common requirements.

-- my_window is a custom window that implements the required partitioning
FROM MySource
my_window(ts, INTERVAL '1' HOUR)

3. Window offset

This is a general feature that the DataStream API supports but SQL does not. Here is an interesting case: a user wanted a one-week window, but got an unnatural week starting on Thursday, because January 1, 1970 happened to be a Thursday. After adding offset support, we can produce a correct natural-week window.

-- illustrative syntax: an offset of 4 days shifts the epoch-aligned
-- week (which starts on Thursday) to a natural week starting on Monday
FROM MySource
my_window(ts, INTERVAL '7' DAY, INTERVAL '4' DAY)

Dimension table optimization

1. Delayed join

In dimension-table join scenarios, the dimension table changes constantly; in particular, a new dimension may arrive after the join has already been attempted, which frequently causes join misses.

Users therefore hope that when a join misses, the data can be cached temporarily and retried later, with the number of retries controllable and the delayed-join rules customizable. This need is not unique to ByteDance; many community users have similar requirements.

Based on this scenario, the delayed-join feature was implemented by adding an operator that supports delayed joins against dimension tables. When a join misses, the local cache does not cache the empty result; instead, the data is kept temporarily in state and retried via a timer, according to the configured retry count.


2. Hashing by the dimension-table join key


Looking at the topology, we find that the Calc operator and the LookupJoin operator are chained together, because the lookup join does not have key (hash) semantics.

When job parallelism is large, every subtask of the dimension-table join accesses the entire cache space, which puts great pressure on the cache.

However, looking at the join SQL, the equi-join naturally has a hash property. Once the configuration is enabled, the join key of the lookup table is used directly as the hash condition to partition the data. This guarantees that the access spaces of the downstream subtasks are disjoint, which can greatly improve the initial cache hit rate.

In addition to the above, two more dimension-table optimizations are currently under development.

1. Broadcast dimension table: in some scenarios the dimension table is small and updated infrequently, but the job's QPS is extremely high. If every lookup still goes to the external system, the pressure is very high; and when the job fails over, the entire local cache is lost, causing a large access spike on the external system. The improved scheme is to periodically scan the full dimension table and distribute it downstream by join-key hash, updating each subtask's cache.
2. Mini-batch: mainly for external systems where a single I/O request is expensive but batch requests are supported, such as RPC, HBase, Redis, etc. Previously, requests were issued one by one, and async I/O only hides I/O latency; it does not reduce the number of accesses. A mini-batch version of the dimension-table operator greatly reduces the number of external-storage accesses made by the dimension-table join.

Join optimization

Currently, Flink supports three join methods: interval join, regular join, and temporal table function join.
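For reference, an interval join bounds the pairing of the two streams with a time condition; the table and column names below are made up:

```sql
-- Join each order with shipments that occur within 4 hours after it.
SELECT o.id, s.ship_time
FROM Orders o
JOIN Shipments s
  ON o.id = s.order_id
 AND s.ship_time BETWEEN o.order_time AND o.order_time + INTERVAL '4' HOUR;
```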

The first two have stream-to-stream join semantics; the temporal table function joins a stream with a table. The stream on the right is materialized into a table keyed by primary key, and each record on the left stream probes that table. Each left record participates in at most one join and produces at most one result; it is not a many-to-many join.
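A temporal table function join in Flink 1.9 uses LATERAL TABLE. Below is a sketch following the classic community currency-conversion example; Orders and Rates are made-up names, and Rates must first be registered as a temporal table function over the rates stream:

```sql
-- For each order, look up the exchange rate valid at the order's rowtime.
SELECT o.amount * r.rate AS converted_amount
FROM Orders AS o,
     LATERAL TABLE (Rates(o.rowtime)) AS r
WHERE o.currency = r.currency;
```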

There are several differences between them.


We can see that the three join methods all have their own defects.

  1. The defect of the interval join is that its outer-join output and watermarks can be out of order.
  2. For the regular join, the biggest drawback is retract amplification (explained in detail later).
  3. The temporal table function has more problems than the others; there are three:
  • DDL is not supported
  • Outer join semantics are not supported (the limitation of FLINK-7865)
  • When data stops arriving on the right side, the watermark does not advance and downstream operators cannot compute correctly (FLINK-18934)

ByteDance has made corresponding internal modifications for the above deficiencies.

Enhanced checkpoint recovery

For SQL jobs, it is difficult to recover from a checkpoint once anything changes.

It is true that SQL jobs' ability to recover from checkpoints is relatively weak, because sometimes even changes that do not appear to affect the checkpoint make it unrecoverable. There are two main causes:

  • First, operator IDs are auto-generated, and for some reason the generated IDs change.
  • Second, the operator's computation logic has changed, i.e., the definition of the operator's internal state has changed.

Example 1: modified parallelism makes the job unrecoverable.


Source is one of the most common stateful operators. If the operator chain between the source and the following operators changes, recovery is completely impossible.

The top left of the figure below shows the logic generated by the vanilla community version: the source and the operators with the same parallelism are chained together, and the user cannot change this. However, operator parallelism is often changed; for example, the source is changed from 100 to 50 while the Calc operator's parallelism stays at 100. At this point, the chaining changes.


For this situation, ByteDance made an internal modification that lets the user configure the source so that, even if its parallelism equals that of the rest of the job, it is not chained with the following operators.

Example 2: a DAG change makes the job unrecoverable.


This is a special case. Given the SQL above, the source has not changed and the three aggregations are independent of each other, yet the state cannot be recovered.

The reason the job cannot recover lies in the operator ID generation rule. Currently, the operator ID in SQL is derived from the number of upstream operators that can be chained together, the operator's own configuration, and the chainable downstream operators. Adding a new metric adds a downstream Calc node, which changes the operator ID.

To handle this, a special configuration mode is supported that lets users ignore the number of downstream chained operators when generating operator IDs.

Example 3: newly added aggregation metrics cannot be recovered.

This is users' most demanded and most complicated case: after adding some aggregation metrics, they expect the original metrics to be recovered from the checkpoint.


The left part of the figure is the operator logic generated from the SQL. The count, sum, sum, and count distinct metrics are stored together in a ValueState as a BaseRow structure; distinct is special and is stored separately in a MapState.

As a result, if a metric is added or removed, the original state cannot be restored from the ValueState, because the state "schema" stored in the ValueState no longer matches the new "schema" after the metric change, so it cannot be deserialized.


Before discussing the solution, let's review the normal recovery flow. First, the state serializer is restored from the checkpoint, and then the state is restored with that serializer. Next, the operator registers its new state definition, which is checked for compatibility against the original one. If compatible, the state is restored successfully; if incompatible, an exception is thrown and the task fails.

Another way to handle incompatibility is to allow returning a migration (which implements state conversion between the two mismatched types); then recovery can also succeed.

We made the following modifications to this process:

  1. First, make the new and old serializers aware of each other's information: add an interface and modify the state backend's compatibility-resolution process so that the old information is passed to the new serializer and both participate in the whole migration process.
  2. Second, judge whether old and new are compatible, and if not, whether a migration is needed; then let the old serializer restore the state and write it back out with the new serializer.
  3. Handle the code generation of the aggregation: when a newly added metric is found to be null in the restored state, initialize it.

With the above modifications, the scheme described here basically works: new aggregation metrics can be added and the job can still recover from the checkpoint.

3. Exploration of stream-batch unification

Business status

Before unifying stream and batch for the business, the technical team did a lot of technical exploration. The overall judgment was that the SQL layer can achieve unified stream-batch semantics, but in practice many differences were found.

For example, session windows and processing-time-based windows in stream computing cannot be used in batch computing; meanwhile, batch SQL has some complex OVER windows with no corresponding implementation in stream computing.

However, these special scenarios may account for only 10% or even less, so implementing stream-batch unification with SQL is feasible.


Stream-batch unification

This diagram is quite common and resembles most companies' architectures. What are its drawbacks?

  1. Data is not from the same source: batch tasks usually have a pre-processing task, offline or real-time, which writes to Hive after one layer of processing, while the real-time task reads the raw data from Kafka, possibly in JSON or Avro format. As a direct result, SQL that runs in the batch task may produce no result, or a wrong result, in the stream task.
  2. Compute engines are not the same: batch tasks generally use the Hive + Spark architecture, while stream tasks are basically built on Flink. Different execution engines differ in implementation details, which leads to inconsistent results. They also define UDFs against different APIs, so UDFs cannot be shared; in most cases, two UDFs with the same function are maintained against different APIs.

To address the above problems, we proposed a unified stream-batch architecture based on Flink.

  1. For data not from the same source: the stream is first processed by Flink and then written to an MQ for downstream streaming Flink jobs to consume; for batch processing, Flink streaming-writes the data to Hive, and batch Flink jobs then process it.
  2. For engines not from the same source: since both streaming and batch jobs are developed on Flink, there is naturally no problem of divergent computation, and maintaining multiple UDFs with the same function is avoided.

The Flink-based unified stream-batch architecture:


Business benefits

  1. Unified SQL: one set of SQL expresses both stream and batch computing, reducing development and maintenance work.
  2. Reused UDFs: stream and batch computing share one set of UDFs, which is a clear win for the business.
  3. Unified engine: the business's learning cost and the architecture's maintenance cost are reduced considerably.
  4. Unified optimization: most optimizations apply to both stream and batch computing; for example, planner and operator optimizations are shared.

4. Future work and planning

Optimization of the retract amplification problem


What is retract amplification?

In the figure above there are four tables. The first table is deduplicated and then joined with the other three. The logic is relatively simple: table A inputs (A1), and (A1, B1, C1, D1) is output.

When table A inputs A2, the dedup operator must deduplicate the data, so it sends a retraction of A1, -(A1), and an addition of A2, +(A2), downstream. On receiving -(A1), the first join operator sends -(A1, B1) and +(null, B1) downstream (to preserve what it considers correct semantics). On receiving +(A2), it sends -(null, B1) and +(A2, B1) downstream, so the number of messages has doubled. Each downstream operator amplifies further, and the final sink output may be amplified by as much as 1000x.
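The dedup step in this example is typically expressed with the ROW_NUMBER pattern of the Blink planner, which emits a retraction for the old row whenever a newer row for the same key arrives; the table and column names are made up:

```sql
-- Keep only the latest row per key; each update retracts the previous one,
-- and that retraction then fans out through the downstream joins.
SELECT key_a, val
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY key_a ORDER BY proctime DESC) AS rn
  FROM A
)
WHERE rn = 1;
```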


How to solve it?

The original two retract messages are merged into a single changelog-format record that is passed between operators. After receiving a changelog record, an operator processes the change and then sends only one changelog record downstream.

Future planning


1. Functional optimization
  • Checkpoint recovery supporting changes to all types of aggregation metrics
  • Window local-global aggregation
  • Fast emit for event time
  • Broadcast dimension tables
  • Mini-batch support for more operators: dimension tables, TopN, join, etc.
  • Full compatibility with Hive SQL syntax
2. Business expansion
  • Further promote streaming SQL adoption, toward 80%
  • Explore the product form for landing stream-batch unification
  • Promote the standardization of the real-time data warehouse