On July 7, Flink 1.11.0 was officially released. As one of the release managers for this version, I would like to share my experience and an interpretation of some representative features. Before diving into the details, let's briefly go over the community's general release process, which should help you better understand and participate in the work of the Flink community.
- First, at the beginning of each release cycle, one or two volunteers are selected as release managers. For 1.11.0, I served as the release manager in China, while Piotr Nowojski from Ververica was the release manager in Germany. To some extent, this also shows how significant the proportion of Chinese developers and contributions has become in the community.
- Next comes the feature kickoff for the release. For some major directions, the community's planning cycle can be relatively long and is split into stages across multiple versions to ensure quality. The emphasis of each release differs: for example, the previous two versions focused on strengthening batch processing, while this one focuses on improving the usability of stream processing. The planned feature list is discussed on the mailing list to collect opinions and feedback from more users and developers.
- The development cycle is generally two to three months, with the approximate feature-freeze date clearly planned in advance, followed by release-candidate builds, testing, and bug fixing. After several rounds of iteration, a relatively stable candidate is formally voted through and the release is made from it.
From the feature planning in early March to the official release in early July, Flink 1.11.0 took almost four months to enhance and refine Flink's ecosystem, usability, production readiness, and stability. I will walk through these areas one by one.
Flink 1.11.0 went through four release candidates after the feature freeze before the final one was approved. In total, 236 contributors participated in the development, 1474 JIRA issues were resolved, more than 30 FLIPs were involved, and 2325 commits were merged.
Looking at the past five releases, we can see that Flink entered a stage of rapid development starting with 1.9.0, with every metric nearly doubling compared with before. It was in 1.9.0 that Alibaba's internal Blink project began to be merged into open-source Flink, and by 1.10.0 the merge across the two major versions was complete, greatly strengthening Flink's ecosystem, functionality, performance, and production stability.
The initial positioning of Flink 1.11.0 was to focus on usability and improve the production experience of users' business applications. Overall it was meant to avoid large architectural adjustments and new feature development, leaning toward a fast-iterating minor release. Yet judging from the statistics above, this so-called "minor release" is in no dimension behind the previous two major ones. The number of resolved issues and of contributors keeps growing, with 62% of contributors coming from China.
Now let's dig into what long-awaited features Flink 1.11.0 brings us. We will pick some representative features to interpret from different dimensions, from the API layer that users work with directly down to the execution engine layer. For the complete feature list, please refer to the release blog.
II. Ecosystem and usability improvements
These two dimensions complement each other to some extent and are hard to separate strictly: gaps in ecosystem compatibility often cause inconvenience in use, and improving usability is also a process of continuously rounding out the related ecosystem. In this regard, users will perceive the changes most clearly at the Table & SQL API level.
1. Table & SQL supports change data capture (CDC)
CDC is widely used in scenarios such as replicating data, updating caches, synchronizing data between microservices, and audit logging. Many companies use open-source CDC tools, such as MySQL CDC tools. Ingesting and parsing CDC data in Table & SQL through Flink has been a strong requirement raised in many earlier discussions: it lets users process changelog streams in real time and further broadens Flink's application scenarios, for example synchronizing data from MySQL to PostgreSQL or Elasticsearch, or performing low-latency temporal joins on a changelog.
Beyond these concrete requirements, the "dynamic table" concept defined in Flink has two stream models: append mode and update mode. Converting a stream into a dynamic table in append mode was already supported in previous versions, so 1.11.0 adds support for update mode, completing the dynamic table concept.
To support parsing and emitting changelogs, the first problem to solve is how to encode and decode these update operations between external systems and Flink. Since sources and sinks are the bridge to external systems, FLIP-95 addresses this while defining the new table source and table sink interfaces.
In public CDC research reports, Debezium and Canal are the most popular CDC tools among users; both are used to synchronize changelogs to other systems, such as message queues. FLIP-105 therefore supports the Debezium and Canal formats first, and the Kafka source can parse these formats and emit update events. Avro (Debezium) and Protobuf (Canal) will follow in subsequent versions.
```sql
CREATE TABLE my_table (
  ...
) WITH (
  'connector' = '...',  -- e.g. 'kafka'
  'format' = 'debezium-json',
  'debezium-json.schema-include' = 'true',  -- default: false (Debezium can be configured to include or exclude the message schema)
  'debezium-json.ignore-parse-errors' = 'true'  -- default: false
);
```
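For intuition, here is a minimal sketch (plain Python, not Flink's actual implementation) of how a Debezium envelope maps onto Flink's changelog row kinds: +I (insert), -U (update-before), +U (update-after), -D (delete). The function name is made up for illustration:

```python
import json

def debezium_to_changelog(message: str):
    """Decode one Debezium JSON envelope (without schema) into a list of
    (row_kind, row) pairs following Flink's changelog convention."""
    payload = json.loads(message)
    op, before, after = payload["op"], payload.get("before"), payload.get("after")
    if op in ("c", "r"):          # create / snapshot read
        return [("+I", after)]
    if op == "u":                 # an update emits a retraction plus the new image
        return [("-U", before), ("+U", after)]
    if op == "d":                 # delete
        return [("-D", before)]
    raise ValueError(f"unknown op: {op}")

update = json.dumps({
    "before": {"id": 1, "name": "old"},
    "after":  {"id": 1, "name": "new"},
    "op": "u",
})
print(debezium_to_changelog(update))
# [('-U', {'id': 1, 'name': 'old'}), ('+U', {'id': 1, 'name': 'new'})]
```

The key point is that a single update in the source database becomes two changelog rows on the Flink side, which is what lets downstream operators retract the old value.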
2. Table & SQL supports JDBC catalog
Before 1.11.0, users who relied on Flink's sources/sinks to read or write relational databases, or to read changelogs, had to create the corresponding schemas manually. Moreover, when a schema changed in the database, the Flink jobs had to be updated manually to keep schemas and types matching; any mismatch triggered a runtime error and failed the job. Users often complained about this seemingly redundant and cumbersome process and the poor experience it created.
In fact, similar problems can arise with any external system that Flink connects to. In 1.11.0 we focused on integration with relational databases: FLIP-93 provides the base JDBC catalog interface and a Postgres catalog implementation, making it easier to integrate with other kinds of relational databases later.
From 1.11.0 onward, Flink SQL users can obtain table schemas automatically without writing DDL. In addition, any schema mismatch is detected at compile time instead of causing the runtime failures seen before. This is a typical example of improving usability and user experience.
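As a sketch of the workflow (database name, credentials, and table names are placeholders; the option keys follow the 1.11 JDBC catalog documentation):

```sql
-- Register a Postgres catalog; Flink then reads table schemas from Postgres.
CREATE CATALOG mypg WITH (
    'type' = 'jdbc',
    'default-database' = 'mydb',
    'username' = 'postgres',
    'password' = 'secret',
    'base-url' = 'jdbc:postgresql://localhost:5432/'
);

USE CATALOG mypg;

-- No CREATE TABLE needed: the schema comes from the database itself.
SELECT * FROM my_schema.my_table;
```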
3. Hive real-time data warehouse
Since 1.9.0, Flink has been committed to integrating with Hive from an ecosystem perspective, aiming to build a unified batch-stream Hive data warehouse. After two versions of iteration, batch compatibility and production readiness were achieved, with performance on the 10 TB TPC-DS benchmark more than seven times that of Hive 3.0.
1.11.0 focuses on realizing a real-time data warehouse on the Hive ecosystem, improving the end-to-end streaming ETL experience and reaching the goal of a unified batch-stream Hive warehouse, while further strengthening compatibility, performance, and ease of use.
In the real-time data warehouse solution, Flink's streaming strengths enable reading from and writing to Hive in real time:
- Hive write: FLIP-115 completes and extends the basic capabilities and implementation of the file system connector. The Table/SQL-layer sink supports various formats (CSV, JSON, Avro, Parquet, ORC) as well as all Hive table formats.
- Partition support: importing data into Hive introduces a partition-commit mechanism to control visibility. The option sink.partition-commit.trigger controls when a partition is committed, and sink.partition-commit.policy.kind selects the commit policy, with success-file and metastore commits supported.
- Hive read: Hive can be read in a streaming fashion in real time, incrementally reading new partitions by monitoring partition creation, or incrementally reading new files by monitoring file creation within a folder.
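Putting the pieces together, a streaming Hive sink might be declared as below. This is a sketch based on the 1.11 option keys mentioned above; the table definition and timestamp pattern are illustrative:

```sql
-- Switch to the Hive dialect so Hive DDL can be used directly.
SET table.sql-dialect=hive;

CREATE TABLE hive_orders (
  user_id STRING,
  order_amount DOUBLE
) PARTITIONED BY (dt STRING, hr STRING) STORED AS parquet TBLPROPERTIES (
  -- Derive each partition's timestamp from its partition values.
  'partition.time-extractor.timestamp-pattern' = '$dt $hr:00:00',
  -- Commit a partition once the watermark passes partition time + delay.
  'sink.partition-commit.trigger' = 'partition-time',
  'sink.partition-commit.delay' = '1 h',
  -- Register the partition in the metastore and write a _SUCCESS file.
  'sink.partition-commit.policy.kind' = 'metastore,success-file'
);
```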
Improvements in Hive usability:
- FLIP-123 provides syntax compatibility through a Hive dialect, so users can run Hive scripts directly on Flink without switching between the Flink and Hive CLIs.
- Built-in support for Hive-related dependencies saves users from tracking down the required JARs: now you only need to download a single package and configure HADOOP_CLASSPATH to run.
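Setting up the classpath follows the usual Hadoop-integration pattern (this assumes a local Hadoop installation on the PATH):

```shell
# Expose the Hadoop dependencies to Flink, then start, e.g., the SQL client.
export HADOOP_CLASSPATH=$(hadoop classpath)
./bin/sql-client.sh embedded
```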
On the performance side, vectorized reading of ORC (Hive 2+) was already supported in 1.10.0; 1.11.0 adds vectorized reading for Parquet and supplements the ORC support, further improving performance.
4. New source API
As mentioned earlier, sources and sinks are the bridge between Flink and external systems, and they matter greatly for the ecosystem, usability, and the end-to-end user experience. The community planned a complete rework of the source side as early as a year ago; as its FLIP number suggests, FLIP-27 is an early proposal. Because it touches many complex internal mechanisms and must accommodate the implementations of all kinds of source connectors, the design had to be very thorough. The POC implementation started during 1.10.0 and finally made it into the 1.11.0 release.
First, let's briefly review the main problems with the previous source interface:
- For users, adapting an existing source or reimplementing a production-grade source connector in Flink is not easy. Concretely, there is no common code to reuse, and one must understand many Flink internals and implement event-time assignment, watermark generation, idleness detection, the threading model, and so on.
- Batch and streaming scenarios require different source implementations.
- The concepts of partitions/splits/shards are not explicitly expressed in the interface. For example, split discovery logic and data consumption are coupled inside the source function implementation, which adds complexity when implementing Kafka- or Kinesis-style sources.
- At the runtime execution layer, the source function's preemption of the checkpoint lock causes a series of problems and is hard for the framework to optimize.
FLIP-27's design takes all of these pain points into account:
- First, two separate components are introduced: a split enumerator on the JobManager and source readers on the TaskManagers, decoupling split discovery from split consumption while making it easy to mix and match strategies. For example, the existing Kafka connector couples together many different partition-discovery strategies and implementations; under the new architecture, a single source reader implementation can pair with different split enumerator implementations, each corresponding to a different partition-discovery strategy.
- Source connectors written against the new architecture are unified across batch and streaming. The only difference lies in the input: for bounded input in batch scenarios, the split enumerator produces a fixed set of splits, each of which is a finite data set; for unbounded input in streaming scenarios, the enumerator either produces an infinite number of splits or the splits themselves are infinite data sets.
- The complex timestamp assigner and watermark generator are built transparently into the source reader module and are invisible to users, so anyone implementing a new source connector generally no longer needs to re-implement this functionality.
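As a toy model of the enumerator/reader split described above (plain Python, not the actual Flink interfaces; all class names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Split:
    split_id: str
    records: list          # a bounded split carries a finite record set

class SplitEnumerator:
    """Discovers splits and hands them out to readers (the JobManager-side role)."""
    def __init__(self, splits):
        self.pending = list(splits)

    def assign(self):
        return self.pending.pop(0) if self.pending else None

class SourceReader:
    """Consumes assigned splits (the TaskManager-side role); no discovery logic here."""
    def __init__(self):
        self.emitted = []

    def poll(self, split: Split):
        self.emitted.extend(split.records)

# Bounded "batch" input: a fixed number of finite splits. An unbounded source
# would differ only in that assign() never runs dry (or splits never end).
enumerator = SplitEnumerator([Split("s0", [1, 2]), Split("s1", [3])])
reader = SourceReader()
while (split := enumerator.assign()) is not None:
    reader.poll(split)
print(reader.emitted)  # [1, 2, 3]
```

Because discovery lives only in the enumerator, swapping in a different discovery strategy never touches the reader, which mirrors the decoupling the FLIP aims for.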
The existing Flink source connectors will be reimplemented on the new architecture in subsequent releases, and legacy sources will continue to be maintained for several versions to preserve compatibility. Users can also try developing against the new Source API by following the instructions in the release documentation.
5. PyFlink ecosystem
As we all know, Python is widely used in machine learning and data analysis. Since 1.9.0, Flink has embraced the Python ecosystem: Python and Flink together produce PyFlink, which brings Flink's real-time distributed processing capabilities to Python users. The previous two versions of PyFlink added the Python Table API and UDFs; 1.11.0 extends support to the pandas ecosystem and to integration with SQL DDL and the SQL client, and it greatly improves Python UDF performance.
Specifically, an ordinary Python UDF could previously process only one row per call and required serialization/deserialization on both the Java and Python sides, which was costly. In 1.11.0, Flink supports defining and using vectorized Python UDFs in Table & SQL jobs: users only need to add the extra parameter udf_type="pandas" in the UDF declaration. The benefits are:
- Each call can process N rows of data.
- The data format is based on Apache Arrow, which greatly reduces serialization/deserialization overhead between the Java and Python processes.
- Python users can conveniently build high-performance Python UDFs on top of libraries commonly used in data analysis, such as NumPy and pandas.
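To make the calling-convention difference concrete, here is an illustration in plain Python (not the PyFlink API; function names are made up). An ordinary UDF crosses the Java-Python boundary once per row, while a vectorized UDF receives a whole batch at once, amortizing the per-call overhead over N rows:

```python
calls = {"scalar": 0, "vectorized": 0}

def scalar_add(a, b):
    """Ordinary UDF: invoked (and serialized/deserialized) once per row."""
    calls["scalar"] += 1
    return a + b

def vectorized_add(a_batch, b_batch):
    """Vectorized UDF: invoked once per batch of rows."""
    calls["vectorized"] += 1
    return [a + b for a, b in zip(a_batch, b_batch)]

rows_a, rows_b = [1, 2, 3, 4], [10, 20, 30, 40]
per_row = [scalar_add(a, b) for a, b in zip(rows_a, rows_b)]   # 4 calls
batched = vectorized_add(rows_a, rows_b)                       # 1 call

assert per_row == batched == [11, 22, 33, 44]
print(calls)  # {'scalar': 4, 'vectorized': 1}
```

In the real implementation the batch is a pandas.Series transferred via Arrow, so the savings compound: fewer boundary crossings and a cheaper encoding for each one.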
In addition, PyFlink in 1.11.0 also supports:
- Seamless conversion between a PyFlink Table and a pandas DataFrame (FLIP-120), enhancing usability and pandas-ecosystem compatibility.
- Defining and using Python UDTFs in Table & SQL (FLINK-14500); Java/Scala UDTFs are no longer required.
- Cython-optimized Python UDF performance (FLIP-121), up to 30 times faster than in 1.10.0.
- User-defined metrics in Python UDFs (FLIP-112), making it easier to monitor and debug UDF execution.
The interpretation above focuses on the API level, which users directly perceive in their development work. Next, let's look at the changes in the execution engine layer in 1.11.0.
III. Production readiness and stability improvements
1. Application mode and Kubernetes enhancements
Before 1.11.0, Flink mainly supported two deployment modes:
- Session mode: a cluster is started in advance and all jobs share its resources. The advantage is avoiding the overhead of starting a separate cluster per job; the disadvantage is weaker isolation. If one job brings down a TaskManager (TM) container, all jobs with tasks in that container are restarted. And although each job has its own JobManager (JM), all these JMs run in a single process, which easily becomes a load bottleneck.
- Per-job mode: to address the poor isolation of session mode, each job starts an independent cluster sized to its resource requirements, and each job's JM runs in its own process, so the per-process load is much smaller.
Both modes share a common problem: user code runs on the client, which compiles the JobGraph and submits it to the cluster. This requires downloading dependencies and uploading the relevant JARs to the cluster, so the client's load and network easily become a bottleneck, especially when one client is shared by many users.
1.11.0 introduces application mode (FLIP-85) to solve this. A cluster is started per application, and all jobs belonging to that application run in it. The key point is that JobGraph generation and job submission no longer happen on the client but are moved to the JM side, so the download and upload load is spread across the cluster and the single-point client bottleneck disappears.
Users can launch application mode via bin/flink run-application; both YARN and Kubernetes (K8s) currently support it. The YARN application ships all dependencies needed by the job to the JM via YARN local resources on the client side. The K8s application lets users build an image containing the user JAR and its dependencies; TMs are created automatically as jobs require, and the whole cluster is torn down when the application finishes. Compared with session mode, it provides better isolation. K8s no longer has a strict per-job mode; application mode is effectively per-job mode with submission performed inside the cluster.
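For illustration, the invocations look roughly like this (jar paths, the memory size, cluster id, and image name are placeholders):

```shell
# YARN application mode: dependencies are shipped to the JM as YARN local resources.
./bin/flink run-application -t yarn-application \
    -Djobmanager.memory.process.size=2048m \
    ./my-job.jar

# Kubernetes application mode: the user jar is baked into a custom image.
./bin/flink run-application -t kubernetes-application \
    -Dkubernetes.cluster-id=my-app \
    -Dkubernetes.container.image=registry.example.com/my-flink-app:latest \
    local:///opt/flink/usrlib/my-job.jar
```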
Besides application mode, Flink's native K8s integration also rounds out many basic features in 1.11.0 (FLINK-14460) to reach a production-ready standard, such as node selectors, labels, annotations, and tolerations. To make Hadoop integration easier, it can also mount Hadoop configurations automatically based on environment variables.
2. Checkpoint & savepoint optimization
The checkpoint and savepoint mechanisms have always been among Flink's core competitive strengths, and the community is very cautious about changes in this area, so recent major releases saw few functional or structural adjustments. Yet the user mailing list frequently carries feedback and complaints such as checkpoints failing for long stretches or savepoints being unusable after a job restart. 1.11.0 selectively addresses some common problems here to improve production readiness and stability.
Before 1.11.0, the metadata and state data of a savepoint were stored in two separate directories. Migrating the state directory made it hard to recognize this mapping, risked accidental deletion of a directory, and made cleanup troublesome. 1.11.0 consolidates both parts into a single directory, making it easy to move and reuse a savepoint as a whole. In addition, the metadata previously referenced state via absolute paths, so the state became unusable once the directory was moved; 1.11.0 switches these references to relative paths (FLINK-5763), making savepoint management, maintenance, and reuse much more flexible and convenient.
In production environments, users frequently hit checkpoint timeouts and prolonged failures. Once a job fails, a large amount of historical data is replayed, the job makes no progress for a long time, and end-to-end latency grows. 1.11.0 optimizes and speeds up checkpoints along several dimensions, with the goal of lightweight checkpoints completing within minutes or even seconds.
First, the checkpoint coordinator can now notify tasks that a checkpoint has been canceled (FLINK-8871), avoiding unnecessary pressure from tasks still executing a canceled checkpoint. A task that abandons a canceled checkpoint can join the next coordinator-triggered checkpoint more quickly, which to some extent also prevents the new checkpoint from timing out again. This optimization also makes it practical to enable local recovery by default, since the task side can clean up invalid checkpoint resources promptly.
Second, under backpressure, large numbers of buffers pile up along the data path, leaving checkpoint barriers queued behind the data buffers; tasks cannot process and align them in time, so checkpoints fail to complete for long periods. 1.11.0 tackles this from two directions.
1) Try to reduce the total number of buffers in the data path (FLINK-16428) so checkpoint barriers can be aligned as soon as possible:
- The upstream output caps the number of buffers stacked in a single sub-partition (the backlog), avoiding heavy accumulation on a single link under skewed load.
- The default upstream and downstream buffer configurations are adjusted sensibly without hurting network throughput.
- The basic upstream/downstream transport protocol is adjusted so that a single data link can be configured with zero or more exclusive buffers without deadlocking, decoupling the total buffer count from job parallelism. Buffer ratios can then be tuned to actual needs, balancing throughput against checkpoint speed.
Part of this optimization work was completed in 1.11.0; the rest will land in the next version.
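The relevant knobs live in flink-conf.yaml. A sketch, assuming the 1.11 network-stack option keys (defaults shown; lowering them shrinks the amount of in-flight data):

```yaml
# Exclusive network buffers per input channel (default: 2).
taskmanager.network.memory.buffers-per-channel: 2
# Floating buffers shared by all channels of an input gate (default: 8).
taskmanager.network.memory.floating-buffers-per-gate: 8
```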
2) A new unaligned checkpoint mechanism (FLIP-76) fundamentally solves barrier alignment under backpressure. The idea had been brewing since before 1.10.0, but because it touches many modules with large changes and a complex mechanism and threading model, we built two different POC prototypes, tested and compared their performance, and only then settled on the final design; the MVP therefore landed only in 1.11.0. It is the only heavyweight execution-engine feature of this release. Its basic idea can be summarized as follows:
- Checkpoint barriers overtake the in-flight data buffers instead of queueing behind them in the input and output queues, decoupling barrier propagation from the operator's processing capacity. Barrier transmission between nodes then costs only network latency, which is negligible.
- An operator no longer waits for barriers to align across all of its input channels before checkpointing; the first arriving barrier can trigger the checkpoint early, further speeding up checkpoints without letting the delay of individual channels drag down the whole job.
- To keep semantics consistent with the aligned checkpoint, all unprocessed input and output buffers are persisted as channel state during the checkpoint and restored together with the operator state on failure. In other words, the aligned mechanism guarantees that all data before the barrier has been processed and its effects are reflected in the operator state; the unaligned mechanism defers the effect of unprocessed data before the barrier until a failure restart, replaying it from the channel state, so from the perspective of state recovery the two are ultimately consistent. Note that although this introduces extra persistence of in-flight buffers, the work happens in the asynchronous phase of the checkpoint; the synchronous phase only takes lightweight buffer references, so it does not consume much operator compute time or hurt throughput.
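The behavior described above can be modeled with a toy simulation (plain Python, not Flink internals): under alignment, every buffer queued ahead of the barrier must be processed before the snapshot; unaligned, the barrier overtakes the queue and the overtaken buffers are persisted as channel state instead.

```python
from collections import deque

def aligned_checkpoint(queue):
    """Process all in-flight buffers before the barrier; channel state stays empty."""
    processed = []
    while queue and queue[0] != "BARRIER":
        processed.append(queue.popleft())
    queue.popleft()                      # consume the barrier itself
    return processed, []                 # (processed records, channel state)

def unaligned_checkpoint(queue):
    """Snapshot immediately: in-flight buffers become persisted channel state."""
    channel_state = []
    while queue and queue[0] != "BARRIER":
        channel_state.append(queue.popleft())
    queue.popleft()
    return [], channel_state

# Three buffers are stuck in front of the barrier, as under backpressure.
print(aligned_checkpoint(deque([1, 2, 3, "BARRIER"])))    # ([1, 2, 3], [])
print(unaligned_checkpoint(deque([1, 2, 3, "BARRIER"])))  # ([], [1, 2, 3])
```

Either way the same records are accounted for, which mirrors why the two mechanisms are consistent from the standpoint of state recovery: aligned folds them into operator state now, unaligned replays them from channel state after a restart.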
Unaligned checkpoints can dramatically shorten checkpoint completion times under heavy backpressure, because they no longer depend on the overall processing throughput but rather on the storage performance of the system, effectively decoupling compute from storage. They do have limitations, though: they increase the overall state size and add I/O overhead, so they are not suitable when I/O is already the bottleneck.
In 1.11.0, unaligned checkpointing is not the default; users must enable it manually, and it only takes effect in exactly-once mode. Savepoints are not yet supported, because a savepoint is involved in job rescaling and channel state does not yet support splitting; this will come in later versions. Savepoints therefore still use the aligned mode and may take a long time to complete under backpressure.
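Enabling it is a configuration flip; a sketch assuming the 1.11 checkpointing option keys:

```yaml
# flink-conf.yaml: unaligned checkpoints are opt-in and require exactly-once.
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.unaligned: true
```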
During the development of Flink 1.11.0, we saw more and more contributors from China participating in core feature development and witnessed the increasingly thriving Flink ecosystem in China. For example, contributors from Tencent worked on K8s and checkpointing features, while contributors from ByteDance worked on Table & SQL and the engine's network layer. I hope more companies will join the Flink open-source community, share their experience in different fields, keep Flink's open-source technology at the leading edge, and let it benefit a wider audience.
After the brief consolidation of the 1.11.0 "minor release", Flink is already brewing the features of the next major version, and I believe many heavyweight features will take the stage. Let's wait and see!