Apache Pulsar is a next-generation distributed messaging system originally open sourced by Yahoo. In September 2018 it graduated to become a top-level project of the Apache Software Foundation. Pulsar's layered, segment-based architecture delivers the performance and throughput expected of a big-data message streaming system while also providing high availability, high scalability, and ease of maintenance.
The segment-based architecture reduces the storage granularity of stream data from the partition to the segment, and the matching tiered storage makes Pulsar a strong choice for storing unbounded streaming data. This lets Pulsar fit naturally with Flink's unified batch-and-stream computing model.
1. Introduction to Pulsar
As the open source community has grown, companies across industries have contributed features to Pulsar to meet their own needs. As a result, Pulsar is no longer just messaging middleware; it has gradually developed into an event streaming platform that can connect, store, and process data.
For connecting, Pulsar has its own pub/sub model, which covers the application scenarios of both Kafka and RocketMQ. In addition, Pulsar IO is effectively a connector framework that makes it easy to import data sources into Pulsar or export data out of it.
Pulsar 2.5.0 also added an important mechanism: the protocol handler. It allows additional protocol support to be plugged into the broker, so you can enjoy some of Pulsar's advanced features without changing your original code base. Through this mechanism Pulsar has been extended with KoP (Kafka on Pulsar), ActiveMQ, REST, and more.
Once Pulsar gives users a way to import data, storing it in Pulsar must be considered. Pulsar uses distributed storage, initially based on Apache BookKeeper. Tiered storage was added later, supporting backends such as JClouds-based cloud storage and HDFS; which tier to use depends largely on the storage capacity required.
Pulsar thus provides an abstraction of infinite stream storage, which makes it convenient for third-party platforms to perform unified batch-and-stream computation on the data. That is Pulsar's data processing capability; which processing option to choose depends on the complexity and timeliness requirements of your computation.
Pulsar currently supports the following integrated processing options:
- Pulsar Functions: Pulsar's native lightweight compute. Processing logic can be applied to data flowing through Pulsar, with results written back to Pulsar.
- Pulsar Flink connector and Pulsar Spark connector: Flink and Spark are unified batch-and-stream computing engines. If you already use them, good news: Pulsar supports both, so no extra work is needed on your side.
- Presto (Pulsar SQL): some users rely heavily on SQL, for example for interactive queries. Pulsar integrates well with Presto, so data in Pulsar can be processed directly with SQL.
1.2 Subscription model
In usage, Pulsar resembles a traditional messaging system based on the publish-subscribe model. Users play two roles, producer and consumer; for more specific needs, a user can also consume data in the reader role. A user can publish data to a specific topic as a producer, or subscribe to a specific topic as a consumer to receive its data. In this process Pulsar persists the data and distributes it; Pulsar also provides a schema feature to validate the data.
As shown in the figure below, there are several subscription modes in pulsar:
- Exclusive subscription
- Failover subscription
- Shared subscription
- Key-ordered shared subscription (key_shared)
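To make the difference between the last two modes concrete, here is a toy dispatcher, not Pulsar's implementation: in a shared subscription messages are spread across consumers with no per-key ordering, while in a key_shared subscription each key is consistently routed to one consumer. All names here are illustrative.

```java
// Toy model contrasting Shared and Key_Shared dispatch.
// Not a Pulsar API; for illustration only.
public class SubscriptionDispatch {
    // Shared: messages are spread round-robin across consumers,
    // so ordering per key is not preserved.
    public static int sharedPick(int messageIndex, int numConsumers) {
        return messageIndex % numConsumers;
    }

    // Key_Shared: the message key is hashed to a consumer, so all
    // messages with the same key keep their relative order.
    public static int keySharedPick(String key, int numConsumers) {
        return Math.floorMod(key.hashCode(), numConsumers);
    }
}
```

With three consumers, `sharedPick` sends consecutive messages to different consumers, while `keySharedPick("user-1", 3)` always returns the same consumer for the same key.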
There are two kinds of topics in Pulsar: partitioned topics and non-partitioned topics.
A partitioned topic is actually composed of multiple non-partitioned topics. Topic and partition are logical concepts: we can regard a topic as a large, unbounded event stream that is divided into several smaller unbounded event streams (its partitions).
Correspondingly, at the physical level, Pulsar adopts a layered structure. Each event stream is stored in segments, and each segment consists of many entries; an entry holds one or more of the messages sent by users.
A message is both the data stored in an entry and the unit that consumers receive from Pulsar. Besides the byte-stream payload, a message carries a key attribute, two time attributes, a message ID, and other metadata. The message ID is the unique identifier of the message and contains the ledger ID, entry ID, batch index, and partition index. As shown in the figure below, these record, respectively, the segment, entry, position within the entry, and partition where the message is stored in Pulsar, so the message content can also be located physically.
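The four coordinates above can be sketched as a simplified message ID class. The real client class is `org.apache.pulsar.client.api.MessageId`; this toy version only models how the coordinates order messages within one partition.

```java
// Simplified sketch of a Pulsar message ID's four coordinates.
// Illustrative only; not the real client class.
public class SimpleMessageId implements Comparable<SimpleMessageId> {
    final long ledgerId;      // ledger (segment) holding the entry
    final long entryId;       // entry inside the ledger
    final int batchIndex;     // position inside a batched entry
    final int partitionIndex; // partition of the topic

    public SimpleMessageId(long ledgerId, long entryId, int batchIndex, int partitionIndex) {
        this.ledgerId = ledgerId;
        this.entryId = entryId;
        this.batchIndex = batchIndex;
        this.partitionIndex = partitionIndex;
    }

    // Within a partition, (ledgerId, entryId, batchIndex) increases
    // monotonically, which is what makes seeking and replay possible.
    @Override
    public int compareTo(SimpleMessageId o) {
        if (ledgerId != o.ledgerId) return Long.compare(ledgerId, o.ledgerId);
        if (entryId != o.entryId) return Long.compare(entryId, o.entryId);
        return Integer.compare(batchIndex, o.batchIndex);
    }
}
```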
2. Pulsar architecture
A Pulsar cluster consists of a cluster of brokers and a cluster of bookies. Brokers are independent of one another and serve producers and consumers for particular topics. Bookies are likewise independent and store segment data; they are where messages are persisted. To manage configuration and proxy metadata, Pulsar also uses a ZooKeeper component; brokers and bookies register themselves with ZooKeeper. The following introduces Pulsar's structure along the concrete read and write paths of a message (see the figure below).
On the write path, a producer creates a message and sends it to a topic. The message may be routed to a specific partition by some algorithm (such as round-robin). Pulsar selects a broker to serve that partition, and the partition's messages are actually sent to that broker. When the broker receives a message, it writes it to the bookies with a write quorum (Qw). When the number of successful bookie writes reaches the configured value, the broker receives a completion notification and in turn returns a write-success acknowledgment to the producer.
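The write path above can be sketched as two small pieces of logic: round-robin routing of messages to partitions, and the broker's quorum check over bookie write results. This is a toy model under those assumptions, not Pulsar's internal code.

```java
// Toy model of the write path: round-robin partition routing plus
// a quorum check over bookie write results. Illustrative only.
public class WritePath {
    private final int numPartitions;
    private int nextPartition = 0;

    public WritePath(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Round-robin routing, one of the algorithms mentioned above.
    public int route() {
        int p = nextPartition;
        nextPartition = (nextPartition + 1) % numPartitions;
        return p;
    }

    // The broker confirms the write once the required number of the
    // quorum's bookie writes report success.
    public static boolean writeConfirmed(boolean[] bookieAcks, int required) {
        int ok = 0;
        for (boolean acked : bookieAcks) {
            if (acked) ok++;
        }
        return ok >= required;
    }
}
```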
On the read path, a consumer must first create a subscription before it can connect to the broker serving the topic. The broker requests data from the bookies and forwards it to the consumer. When the data is received successfully, the consumer can choose to send an acknowledgment to the broker, so the broker can update the consumer's position. For data that has just been written, Pulsar keeps it in a cache on the broker, so it can be read directly from the brokers' cache, which shortens the read path.
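The cache shortcut described above can be modeled in a few lines: recently written entries are served from the broker's cache, with a fall back to the bookies otherwise. A toy model with illustrative names, not Pulsar's internals.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the read path: a broker cache in front of bookie storage.
// Illustrative only; not Pulsar's internal classes.
public class ReadPath {
    private final Map<Long, String> brokerCache = new HashMap<>();
    private final Map<Long, String> bookieStore = new HashMap<>();

    public void write(long entryId, String payload) {
        bookieStore.put(entryId, payload); // durable copy on the bookies
        brokerCache.put(entryId, payload); // tail of the stream stays cached
    }

    // Simulates the cache filling up: older entries are evicted from the
    // broker but remain readable from the bookies.
    public void evictFromCache(long entryId) {
        brokerCache.remove(entryId);
    }

    // Recently written entries are served from the broker cache,
    // shortening the read path for tailing consumers.
    public String read(long entryId) {
        String cached = brokerCache.get(entryId);
        return cached != null ? cached : bookieStore.get(entryId);
    }
}
```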
Pulsar separates storage from serving, which gives it good scalability. At the platform level, the number of bookies can be adjusted to meet different needs. At the user level, you only communicate with brokers, and brokers are designed to be stateless: when a broker becomes unavailable due to failure, a new broker can be spun up dynamically to replace it.
3. Internal mechanism of the Pulsar connector
First of all, the Pulsar connector is relatively simple to use. It consists of a source and a sink. The source passes messages from one or more topics into a Flink source; the sink takes data from a Flink sink and writes it to certain topics. In usage it is very similar to the Kafka connector: you just need to set a few parameters.
```java
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("topic", "test-source-topic");

FlinkPulsarSource<String> source = new FlinkPulsarSource<>(
    serviceUrl,
    adminUrl,
    new SimpleStringSchema(),
    props);
DataStream<String> stream = see.addSource(source);

FlinkPulsarSink<Person> sink = new FlinkPulsarSink<>(
    serviceUrl,
    adminUrl,
    Optional.of(topic), // mandatory target topic
    props,
    TopicKeyExtractor.NULL, // replace this to extract key or topic for each record
    Person.class);
stream.addSink(sink);
```
Next, we introduce the implementation mechanism behind some features of the Pulsar connector.
3.1 Exactly-once
Because a Pulsar message ID is globally unique and ordered, and corresponds to the message's physical position in Pulsar, the Pulsar connector can implement exactly-once semantics by storing message IDs in checkpoints, with the help of Flink's checkpoint mechanism.
For the connector's source task, every time a checkpoint is triggered, the message ID currently being processed for each partition is saved into the state store. When the task restarts, each partition uses the reader seek interface provided by Pulsar to find the position corresponding to its saved message ID, and resumes reading message data from there.
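The snapshot-and-seek bookkeeping above can be sketched as follows. Class and method names are illustrative, not the connector's actual classes, and message IDs are simplified to longs.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of exactly-once bookkeeping: snapshot the last processed
// message ID per partition on each checkpoint, and compute the seek
// position on restore. Illustrative only.
public class CheckpointedOffsets {
    private final Map<String, Long> lastProcessed = new HashMap<>();

    // Called as each message is processed.
    public void record(String partition, long messageId) {
        lastProcessed.put(partition, messageId);
    }

    // On checkpoint: the map that goes into Flink's state backend.
    public Map<String, Long> snapshot() {
        return new HashMap<>(lastProcessed);
    }

    // On restore: each partition's reader seeks just past the saved ID,
    // mirroring the reader seek interface mentioned above.
    public static long seekPosition(Map<String, Long> snapshot, String partition) {
        // resume after the last processed message, or from 0 if unseen
        return snapshot.getOrDefault(partition, -1L) + 1;
    }
}
```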
Through the checkpoint mechanism, the connector can also notify the node that stores the data once the data is no longer needed, so that expired data can be deleted precisely and storage used efficiently.
3.2 Dynamic discovery
Considering that Flink jobs run for a long time, users may need to add topics or partitions while a job is running. The Pulsar connector provides an automatic discovery solution for this.
Pulsar's strategy is to start a separate thread that periodically checks whether the configured topics have changed and whether partitions have been added or deleted. If a new partition appears, a new reader task is created to deserialize the data from that topic; if a partition is deleted, the corresponding read task is removed accordingly.
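The periodic check boils down to a set difference between the partitions seen on the previous poll and the current ones. A minimal sketch with illustrative names, not the connector's internals:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the periodic discovery loop: diff the current partition set
// against the previously known one. Illustrative only.
public class PartitionDiscoverer {
    private Set<String> known = new HashSet<>();

    // Returns the partitions that need a new reader task.
    public Set<String> discoverNew(Set<String> current) {
        Set<String> added = new HashSet<>(current);
        added.removeAll(known);
        known = new HashSet<>(current);
        return added;
    }
}
```

In the real connector this diff would run on a timer thread; here it is a plain method call so the logic is easy to follow.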
3.3 Structured data
While reading data from a topic, we can convert it into structured records for processing. Pulsar supports converting data with Avro schemas, as well as Avro/JSON/Protobuf-formatted messages, into Flink's Row format. For the metadata users care about, Pulsar also provides corresponding metadata fields in the Row.
In addition, the connector was newly developed against Flink 1.9 and supports the Table API and catalogs. Pulsar uses a simple mapping, as shown in the figure below: a Pulsar tenant/namespace maps to a catalog database, and a topic maps to a concrete table in that database.
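The mapping above can be shown on a full topic name. The name format `persistent://tenant/namespace/topic` is standard Pulsar; the helper class and methods here are illustrative, not the connector's exact code.

```java
// Sketch of the catalog name mapping: tenant/namespace -> database,
// topic name -> table. Illustrative only.
public class CatalogMapping {
    // persistent://tenant/namespace/topic -> "tenant/namespace"
    public static String toDatabase(String fullTopicName) {
        String rest = fullTopicName.substring(fullTopicName.indexOf("://") + 3);
        return rest.substring(0, rest.lastIndexOf('/'));
    }

    // persistent://tenant/namespace/topic -> "topic"
    public static String toTable(String fullTopicName) {
        return fullTopicName.substring(fullTopicName.lastIndexOf('/') + 1);
    }
}
```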
4. Future planning
First, as mentioned, Pulsar stores data in BookKeeper and can also offload it to file systems such as HDFS or S3. For analytical applications, however, we often only care about a few attributes of each record, so columnar storage would improve IO and network performance. Pulsar is also exploring storing segment data in a columnar format.
Second, on the original read path, both readers and consumers fetch data through the brokers. With a new broker-bypass approach, a client could query the metadata to find directly which bookie stores each message, then read the data straight from the bookies, shortening the read path and improving efficiency.
Finally, unlike Kafka, Pulsar physically stores data in segments. During reads, increasing parallelism and using multiple threads to read several segments simultaneously can improve overall job completion time. This requires, however, that the job itself has no strict ordering requirement on how each topic partition is accessed, and that newly generated data not yet written into a segment can still be obtained, for example via the cache. Parallel reading will therefore be offered as an option, giving users more choices.