Pulsar Flink connector 2.5.0 officially released

Time: 2020-09-25

After continuous effort, the community has released Pulsar Flink Connector 2.5.0. The Pulsar Flink Connector integrates Apache Pulsar with Apache Flink (a data processing engine), allowing Apache Flink to read data from and write data to Apache Pulsar.

Project address: https://github.com/streamnati…

Next, we introduce the new features in Pulsar Flink Connector 2.5.0 in detail, hoping to help you better understand the Pulsar Flink Connector.

Background

Flink is a fast-growing distributed computing engine. Version 1.11 introduces the following new features:

  • The core engine introduces an unaligned checkpoint mechanism. This mechanism significantly improves Flink's fault tolerance and can speed up checkpointing for jobs under severe backpressure.
  • A new set of source interfaces is provided. By unifying the runtime mechanics of streaming and batch job sources, it supports common internal implementations such as event-time processing, watermark generation, and idleness detection. The new source interface greatly reduces the complexity of developing new sources.
  • Flink SQL supports change data capture (CDC). This makes it easy for Flink to interpret and consume database changelogs produced by tools like Debezium. The Table API and SQL also extend the file system connector to support more user scenarios and formats, enabling scenarios such as streaming data from Pulsar into Hive.
  • PyFlink optimizes performance in several areas, including support for vectorized Python user-defined functions (UDFs). These changes allow the Flink Python interface to interoperate with common Python libraries such as pandas and NumPy, making Flink better suited to data processing and machine learning scenarios.
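
As an illustration of the CDC support above, Flink 1.11 ships a built-in debezium-json changelog format that can be attached to a source table; the topic name and connection properties below are placeholders for illustration:

```sql
-- Sketch: reading a Debezium changelog with Flink 1.11 SQL.
-- Topic name and bootstrap servers are hypothetical.
CREATE TABLE orders (
  order_id BIGINT,
  price    DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders-changelog',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'debezium-json'  -- interprets INSERT/UPDATE/DELETE events
);
```

With this declaration, Flink translates each Debezium record into an insert, update, or delete on the dynamic table rather than treating it as an append-only stream.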

After the new Flink release, to let users adopt a Pulsar Flink Connector supporting Flink 1.11 as soon as possible, we upgraded the connector accordingly.

We found this upgrade to be very difficult. The new version of Flink added and removed public APIs (changes to the underlying FieldsDataType type, the StreamTableEnvironment package, and the execute method), moved the table schema check to job startup, and converted the connector runtime to the catalog, which made the old and new versions directly incompatible.

After much consideration, we finally decided to add a pulsar-flink-1.11 module to support Flink 1.11. Here, we would like to thank Chen Hang and Wu Zhanpeng of the BIGO team for their technical support to the community on this compatibility upgrade for Flink 1.11.

A Pulsar schema contains the type structure information of messages, which integrates well with Flink tables. In Flink 1.9, SQL types could be bound to physical types matching Pulsar's SchemaType.

However, after the changes to Flink 1.11 and the Table API, SQL types can only use their default physical types, and Pulsar's SchemaType does not support Flink's default physical types for dates and times. We added new native types to the Pulsar schema to integrate with the Flink SQL type system.

New features of the Pulsar Flink Connector

Here are some of the main features added in Pulsar Flink Connector 2.5.0.

pulsar-flink

Support for Flink 1.11 and Flink SQL DDL

Flink 1.11 was upgraded substantially, and some public APIs were added or removed. As a result, the Pulsar connectors for Flink 1.9 and Flink 1.11 cannot be made compatible. This change splits the project into two modules to support the different Flink versions. Chen Hang and Wu Zhanpeng from BIGO made great efforts on this feature.

  • Support for Flink 1.11
  • New Flink SQL DDL support
  • Updated topic partition strategy to make consumption more uniform
  • Flink 1.11 compatibility with the Pulsar schema

For more information on the implementation, see PR-115: https://github.com/streamnati…
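
To illustrate the new Flink SQL DDL support, a Pulsar-backed table might be declared roughly as follows; the property keys shown (service-url, admin-url, topic) are assumptions for illustration and may differ from the released connector's actual option names:

```sql
-- Sketch: a table backed by a Pulsar topic, declared via Flink SQL DDL.
-- All connection properties are illustrative placeholders.
CREATE TABLE pulsar_orders (
  order_id BIGINT,
  item     STRING,
  ts       TIMESTAMP(3)
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/orders',
  'service-url' = 'pulsar://localhost:6650',
  'admin-url' = 'http://localhost:8080',
  'format' = 'json'
);
```

Once declared, such a table can be queried or written with ordinary INSERT INTO / SELECT statements, with the connector handling Pulsar consumption and production underneath.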

Add the PulsarDeserializationSchema interface

The abstract PulsarDeserializationSchema interface enables users to customize decoding and obtain more information from the source. For more information on the implementation, see PR-95: https://github.com/streamnati…
Contributor: @wuzhangpeng

The Flink sink adds JSON support

The Flink sink implementation now supports the JSON Pulsar schema type.
For more information on the implementation, see PR-116: https://github.com/streamnati…
Contributor: @jianyun8023

PulsarCatalog is now based on GenericInMemoryCatalog

The implementation of PulsarCatalog now inherits from GenericInMemoryCatalog.
For more information on the implementation, see PR-91: https://github.com/streamnati…
Contributor: @Sijie

Pulsar Schema

Add Java 8 time and date types to the native Pulsar schema types

Add support for the common Java 8 types Instant, LocalDate, LocalTime, and LocalDateTime to the Pulsar schema.

For more information on the implementation, see PR-7874: https://github.com/apache/pul…
Contributor: @jianyun8023
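
The four types mentioned are the standard java.time classes. As a minimal, stdlib-only sketch of the values that can now map to native Pulsar schema types (the Pulsar client calls themselves are omitted here, since the exact schema constants depend on the client version):

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;

public class Java8TimeTypes {
    public static void main(String[] args) {
        // The Java 8 date/time types now mapped to native Pulsar schema types
        Instant instant = Instant.parse("2020-09-25T00:00:00Z");
        LocalDate date = LocalDate.of(2020, 9, 25);
        LocalTime time = LocalTime.of(12, 30);
        LocalDateTime dateTime = LocalDateTime.of(date, time);

        System.out.println(instant);   // 2020-09-25T00:00:00Z
        System.out.println(dateTime);  // 2020-09-25T12:30
    }
}
```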

Summary

The release of Pulsar Flink Connector 2.5.0 is a big milestone for this fast-growing project. We would like to thank Chen Hang, Wu Zhanpeng, Guo Sijie, and Zhao Jianyun for their contributions to this release.

If you have a good idea or want to become a project contributor, you are welcome to submit an issue or refer to our contribution guide: https://github.com/streamnati…

Related links

  • What’s new in Flink 1.11 (Flink China)
  • Pulsar Flink Connector: https://github.com/streamnati…
  • streamnative/pulsar-flink: https://github.com/streamnati…