About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. It adopts an architecture that separates computing from storage, supports multi-tenancy, persistent storage, and cross-region data replication across multiple data centers, and provides strongly consistent, high-throughput, low-latency, highly scalable streaming data storage.
BIGO, founded in 2014, is a fast-growing technology company. Built on powerful audio and video processing technology, global real-time audio and video transmission technology, artificial intelligence, and CDN technology, BIGO has launched a series of audio and video social and content products, including Bigo Live and Likee. It has nearly 100 million users worldwide, and its products and services cover more than 150 countries and regions.
Initially, BIGO’s messaging platform was built mainly on open-source Kafka. As data volumes grew and products iterated, the scale of data carried by the platform multiplied. Downstream services such as online model training, online recommendation, real-time data analysis, and the real-time data warehouse placed ever higher demands on the real-time performance and stability of the messaging platform. Open-source Kafka clusters struggled to support these massive data processing scenarios, and we had to invest more and more manpower in maintaining multiple Kafka clusters, driving costs higher and higher, mainly in the following aspects:
- Data storage is coupled with the message queuing service. Cluster expansion and partition rebalancing require copying large amounts of data, degrading cluster performance.
- When a partition replica is not in the ISR (in-sync replicas) state, a single broker failure may cause data loss or leave the partition unable to serve reads and writes.
- Manual intervention is required when a Kafka broker’s disk fails or its disk usage grows too high.
- Cross-region synchronization with KMM (Kafka MirrorMaker) struggles to reach the expected performance and stability.
- In catch-up read scenarios, PageCache pollution occurs easily, degrading read and write performance.
- The number of topic partitions a Kafka broker can store is limited: the more partitions, the less sequential the disk I/O and the lower the read/write performance.
- As the Kafka cluster grows, operation and maintenance costs surge and demand substantial day-to-day manpower. At BIGO, adding a machine to a Kafka cluster and rebalancing partitions takes 0.5 person-days; removing a machine takes 1 person-day.
If we continued with Kafka, costs would keep rising: scaling machines in and out, and growing the operations headcount. Meanwhile, as the business grew, we had higher requirements for the messaging system: it should be more stable and reliable, easy to scale horizontally, and low-latency. To improve the real-time performance, stability, and reliability of the message queue and to reduce operating costs, we began to consider whether to do in-house secondary development on top of open-source Kafka, or whether the community already had a better solution to the problems we encountered maintaining Kafka clusters.
In November 2019, we began evaluating message queues, comparing the strengths and weaknesses of the mainstream messaging platforms against our needs. During the evaluation, we found that Apache Pulsar, a next-generation cloud-native distributed messaging and streaming platform, integrates messaging, storage, and lightweight functional computing. Pulsar delivers seamless scaling, low latency, and high throughput, and supports multi-tenancy and cross-region replication. Most importantly, Pulsar’s separation of storage and computing elegantly solves Kafka’s scaling problems: a Pulsar producer sends messages to a broker, which writes them through the bookie client to BookKeeper in the second (storage) layer.
Pulsar adopts a layered architecture that separates storage from computing, supports multi-tenancy, persistent storage, and cross-region replication across multiple data centers, and provides strongly consistent, high-throughput, low-latency, highly scalable streaming data storage.
- Horizontal scaling: seamlessly scales out to hundreds of nodes.
- High throughput: battle-tested in Yahoo!’s production environment, supporting publish-subscribe (pub-sub) at millions of messages per second.
- Low latency: maintains low latency (less than 5 ms) even at large message volumes.
- Persistence: Pulsar’s persistence mechanism is built on Apache BookKeeper, which implements read/write separation.
- Read/write separation: BookKeeper’s read/write-separated I/O model makes full use of sequential disk write performance, is relatively friendly to mechanical hard disks, and places no limit on the number of topics a single bookie node can support.
To deepen our understanding of Apache Pulsar and measure whether it could really meet the large-scale pub-sub needs of our production environment, we ran a series of stress tests starting in December 2019. Because we use mechanical hard disks rather than SSDs, we hit some performance problems during testing. With StreamNative’s assistance, we carried out a round of performance tuning on the broker and on BookKeeper, which improved Pulsar’s throughput and stability.
After three to four months of stress testing and tuning, we concluded that Pulsar could fully solve the problems we had encountered with Kafka, and we brought Pulsar into our test environment in April 2020.
Apache Pulsar at BIGO: the pub-sub consumption model
In May 2020, we officially put Pulsar clusters into production. Pulsar at BIGO mainly serves the classic pub-sub pattern: on the producer side are Baina (a data receiving service implemented in C++), Kafka’s MirrorMaker, and Flink, along with producers in other languages such as Java, Python, and C++; on the consumer side are Flink and Flink SQL, along with consumer clients in other languages.
Downstream, our business scenarios include the real-time data warehouse, real-time ETL (extract-transform-load: extracting, transforming, and loading data from a source to a destination), real-time data analysis, and real-time recommendation. Most scenarios use Flink to consume data from Pulsar topics and apply business logic; the client languages used in the other scenarios are mainly C++, Go, Python, and so on. After each service applies its own business logic, the data is finally written to third-party storage such as Hive, Pulsar topics, ClickHouse, HDFS, and Redis.
Pulsar + Flink real-time streaming platform
At BIGO, we built a real-time streaming platform with Flink and Pulsar. Before introducing the platform, let’s first look at how the Pulsar Flink connector works internally. With the Pulsar Flink source/sink API, a Pulsar topic sits upstream, a Flink job in the middle, and another Pulsar topic downstream. How do we consume the upstream topic, and how do we process the data and write it to the downstream Pulsar topic?
In the code example on the left of the figure above, we first initialize and configure a StreamExecutionEnvironment, for example by setting the property and topic values. We then create a FlinkPulsarSource object, filling in the serviceUrl (broker list), the adminUrl (admin address), and the deserialization method for the topic data, and finally pass in the properties so that data can be read from the Pulsar topic. Using the sink is equally simple: create a FlinkPulsarSink, specify the target topic and a TopicKeyExtractor as the key, and call addSink to write the data to the sink. This produce-consume model is very simple and very similar to Kafka’s.
How are Pulsar topic consumption and Flink linked together? As shown in the figure below, when a FlinkPulsarSource is created, a reader object is created for each partition of the topic. Note that the Pulsar Flink connector consumes through the reader API: a reader is created first, and this reader uses a Pulsar non-durable cursor. A reader commits immediately after reading each message, so you may see no backlog in the monitoring for the subscription corresponding to the reader.
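The reader behavior described above can be sketched as a small, illustrative model. This is our simplification for exposition, not the connector’s source code:

```python
# Illustrative model: one reader per topic partition; a non-durable
# cursor acks each message as soon as it is read, so the reader's
# subscription never accumulates a visible backlog in monitoring.
class PartitionReader:
    def __init__(self, partition):
        self.partition = partition
        self.backlog = []          # messages received but not yet acked

    def receive(self, msg):
        self.backlog.append(msg)
        self.ack()                 # non-durable cursor: commit immediately

    def ack(self):
        self.backlog.clear()

# A 3-partition topic gets three independent readers.
readers = {p: PartitionReader(p) for p in range(3)}
for p, r in readers.items():
    r.receive(f"msg-{p}")

print(all(len(r.backlog) == 0 for r in readers.values()))  # True: no backlog
```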
In Pulsar 2.4.2, when a topic subscribed with a non-durable cursor received data written by producers, the data was not kept in the broker’s cache, so a large number of read requests fell through to BookKeeper, reducing read efficiency. BIGO fixed this issue in Pulsar 2.5.1.
After the reader subscribes to the Pulsar topic and consumes its data, how does Flink guarantee exactly-once semantics? The Pulsar Flink connector uses a separate, independent subscription, which uses a durable cursor. When Flink triggers a checkpoint, the connector checkpoints the reader’s state (including the consumption position of each Pulsar topic partition) to files, memory, or RocksDB. When the checkpoint completes, Flink issues a notifyCheckpointComplete notification. On receiving it, the Pulsar Flink connector commits the current consumption offset (the message ID) of every reader to the Pulsar broker under the independent subscription name, and only then is the consumption offset truly recorded.
Once the offset commit completes, the Pulsar broker stores the offset information (represented in Pulsar as a cursor) in BookKeeper, the underlying distributed storage system. The benefit is that when the Flink job restarts, there are two layers of recovery guarantees. The first path is recovery from the checkpoint: the message ID of the last consumed message is obtained directly from the checkpoint, and consumption resumes from there. If the job does not recover from a checkpoint, it restarts, fetches from Pulsar the offset of the last commit under the subscription name, and resumes consumption from that position. This effectively prevents a corrupted checkpoint from blocking the whole Flink job from starting.
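The two-layer recovery logic can be sketched as follows. The function and variable names are ours, not the connector’s:

```python
# Illustrative sketch of two-layer recovery of the resume position
# when a Flink job restarts.
def resume_position(checkpoint_offset, broker_offsets, subscription):
    """Prefer the message ID stored in the Flink checkpoint; fall back
    to the offset committed to the Pulsar broker under the durable
    subscription when no usable checkpoint exists."""
    if checkpoint_offset is not None:        # layer 1: Flink checkpoint
        return checkpoint_offset
    # layer 2: cursor committed to the broker under the subscription name
    return broker_offsets.get(subscription, "earliest")

# Normal restart: the checkpoint is intact.
print(resume_position("msg-id-42", {"flink-sub": "msg-id-40"}, "flink-sub"))
# Checkpoint corrupted or missing: fall back to the committed cursor.
print(resume_position(None, {"flink-sub": "msg-id-40"}, "flink-sub"))
```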
The checkpoint process is shown in the figure below.
First, checkpoint N is performed and notifyCheckpointComplete is published. After a certain interval, checkpoint N+1 is performed, followed by its own notifyCheckpointComplete, at which point the durable cursor is committed to the Pulsar topic’s server. This guarantees exactly-once checkpoint semantics, and messages are retained according to the subscription you configure.
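The key ordering constraint, committing the durable cursor only after notifyCheckpointComplete arrives rather than at checkpoint-trigger time, can be modeled as a small sketch (class and method names are ours):

```python
# Illustrative state machine of the commit ordering: the durable cursor
# is committed only after Flink signals that the checkpoint has fully
# completed, never when the checkpoint is merely triggered.
class ConnectorSource:
    def __init__(self):
        self.read_position = None   # advances as messages are read
        self.pending = {}           # checkpoint id -> snapshotted position
        self.committed = None       # cursor committed to the broker

    def snapshot_state(self, checkpoint_id):
        # called when the checkpoint is triggered
        self.pending[checkpoint_id] = self.read_position

    def notify_checkpoint_complete(self, checkpoint_id):
        # called only after the whole checkpoint has completed
        self.committed = self.pending.pop(checkpoint_id)

src = ConnectorSource()
src.read_position = "msg-100"
src.snapshot_state(1)              # checkpoint N triggered
src.read_position = "msg-150"      # reading continues in the meantime
assert src.committed is None       # nothing committed yet
src.notify_checkpoint_complete(1)
print(src.committed)               # the position frozen at checkpoint time
```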
What problem does topic/partition discovery solve? When a Flink job is consuming a topic and partitions are added to it, the job needs to discover the new partitions automatically. How does the Pulsar Flink connector achieve this? Readers subscribing to topic partitions are independent of one another. Each task manager contains multiple reader threads, and the topic partitions held by a single task manager are mapped to it by a hash function. When a new partition is added to a topic, it is mapped to one of the task managers; that task manager discovers the new partition, creates a reader for it, and starts consuming the new data. Users can set the `partition.discovery.interval-millis` parameter to configure the discovery frequency.
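A minimal sketch of the hash-based mapping, assuming a CRC32 hash (the connector’s actual hash function may differ):

```python
import zlib

# Illustrative: map each topic partition to a task manager by hashing
# its name modulo the parallelism; the mapping is deterministic, so a
# newly discovered partition always lands on exactly one task manager.
def assign(partition_name, parallelism):
    return zlib.crc32(partition_name.encode()) % parallelism

parallelism = 4
partitions = [f"persistent://public/default/events-partition-{i}" for i in range(3)]
mapping = {p: assign(p, parallelism) for p in partitions}

# A newly added partition is hashed the same way; the owning task
# manager then creates a reader for it and consumes the new data.
new_partition = "persistent://public/default/events-partition-3"
mapping[new_partition] = assign(new_partition, parallelism)

print(all(0 <= tm < parallelism for tm in mapping.values()))  # True
```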
To lower the barrier for consuming Pulsar topics from Flink, and to let the Pulsar Flink connector support more of Flink’s newer features, the BIGO message queue team added Pulsar Flink SQL DDL (data definition language) and Flink 1.11 support to the Pulsar Flink connector. Previously, the officially provided Pulsar Flink SQL only supported the catalog, which made it inconvenient to consume and process Pulsar topic data through DDL. In the BIGO scenario, most topic data is stored in JSON format, but the JSON schema is not registered in advance, so the data can only be consumed after the topic’s DDL is specified in Flink SQL. For this scenario, BIGO did secondary development based on the Pulsar Flink connector and provided a code framework for consuming, parsing, and processing Pulsar topic data through Pulsar Flink SQL DDL (as shown in the figure below).
In the code on the left, the first step configures consumption of the Pulsar topic. First, specify the topic’s schema in DDL form, with fields such as rip, rtime, uid, and so on; what follows is the basic configuration for consuming the Pulsar topic, such as the topic name, service URL, and admin URL. After reading a message, the underlying reader decodes it according to the DDL and stores the data in the test_flink_sql table. The second step is conventional logic processing (such as field extraction and joins) to obtain statistics or other results, which are returned and written to HDFS or other systems. The third step extracts the relevant fields and inserts them into a Hive table. Because Flink 1.11 supports writing to Hive better than 1.9.1 does, BIGO also did API compatibility work and a version upgrade so the Pulsar Flink connector supports Flink 1.11. BIGO’s real-time streaming platform based on Pulsar and Flink is mainly used for real-time ETL processing scenarios and A/B test scenarios.
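A hypothetical DDL of the kind described above. The field names rip, rtime, and uid come from the text; the table options and their exact names are illustrative and may differ from BIGO’s fork of the connector:

```sql
-- Hypothetical Flink SQL DDL sketch; option names are illustrative.
CREATE TABLE test_flink_sql (
    rip   STRING,   -- e.g. reporting IP
    rtime BIGINT,   -- e.g. report time
    uid   BIGINT    -- user id
) WITH (
    'connector'   = 'pulsar',
    'topic'       = 'persistent://public/default/events',  -- illustrative
    'service-url' = 'pulsar://localhost:6650',
    'admin-url'   = 'http://localhost:8080',
    'format'      = 'json'
);
```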
Real-time ETL processing scenario
The real-time ETL processing scenario mainly uses the Pulsar Flink source and Pulsar Flink sink. In this scenario there are hundreds or even thousands of Pulsar topics, each with its own independent schema. We need to apply routine processing to these hundreds of topics, such as field conversion, fault-tolerant handling, and writing to HDFS. Each topic corresponds to a table on HDFS, so hundreds of topics become hundreds of HDFS tables, each with different fields. This is the real-time ETL scenario we face.
The difficulty of this scenario is the sheer number of topics. Maintaining one Flink job per topic would cost too much. We previously wanted to sink the Pulsar topic data straight to HDFS through the HDFS sink connector, but handling the logic inside was very cumbersome. In the end, we decided to use one or a few Flink jobs to consume the hundreds of topics: each topic has its own schema, so we subscribe to all topics directly with readers, parse each schema, run the post-processing, and write the processed data to HDFS.
Once the program was running, we found that this scheme had a problem of its own: load imbalance between operators. Some topics have high traffic and some low; if topics are mapped to task managers by a purely random hash, some task managers handle very high traffic while others handle very little, so some task machines become severely congested and slow the whole Flink pipeline down. We therefore introduced the concept of slot groups, grouping topics by traffic. Traffic is also reflected in the number of topic partitions: when creating partitions we size them by traffic, giving high-traffic topics more partitions and vice versa. When grouping, low-traffic topics are placed together in one group, while each high-traffic topic gets a group of its own. This isolates resources well and keeps the overall traffic of the task managers balanced.
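The slot-group idea can be sketched as follows. The threshold value and the group names are ours, not from the article:

```python
# Illustrative sketch of slot groups: each high-traffic topic gets a
# dedicated group, while low-traffic topics share one pooled group, so
# no single task-manager group is overloaded.
HIGH_TRAFFIC_MBPS = 100  # assumed threshold, not from the article

def build_slot_groups(topic_traffic):
    groups = {"low-traffic-pool": []}
    for topic, mbps in topic_traffic.items():
        if mbps >= HIGH_TRAFFIC_MBPS:
            groups[f"group-{topic}"] = [topic]   # dedicated group
        else:
            groups["low-traffic-pool"].append(topic)
    return groups

traffic = {"clicks": 400, "impressions": 250, "debug-logs": 5, "heartbeats": 2}
groups = build_slot_groups(traffic)
print(sorted(groups))  # ['group-clicks', 'group-impressions', 'low-traffic-pool']
```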
A/B test scenario
The real-time data warehouse needs to provide hourly or daily tables for data analysts and recommendation-algorithm engineers to query. In short, the app contains many tracking points (instrumented events), and events of various types are reported to the server. If the raw events were exposed to business users directly, different users would have to access different raw tables, extract data along different dimensions, and compute joins between tables; frequent extraction and join operations on the underlying base tables would seriously waste computing resources. We therefore extract the dimensions users care about from the base tables in advance, and combine multiple tracking points into one or more wide tables, covering 80%–90% of the recommendation and data analysis tasks mentioned above.
In the real-time warehouse scenario, we also need real-time intermediate tables. Our solution is to use Pulsar Flink SQL to parse the consumed data into the corresponding tables for topics A through K. The usual way to aggregate multiple tables into one is a join: for example, joining tables A through K on uid into one very wide table. But joining many wide tables in Flink SQL is inefficient, so BIGO uses union instead of join to build a wide view, materializes the view hourly, writes it to ClickHouse, and serves real-time queries to downstream business users. Replacing the join with a union to aggregate the wide table cut the output latency of the intermediate table from the hour level to the minute level.
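The union-instead-of-join idea can be illustrated with a small sketch. This is our simplification; the column names are hypothetical:

```python
# Illustrative sketch: instead of a multi-way join, each source table's
# rows are padded out to the full wide schema with None and then
# concatenated (union); the downstream store (ClickHouse in the
# article) can aggregate the rows per uid at query time.
WIDE_COLUMNS = ["uid", "click_cnt", "watch_time"]  # hypothetical schema

def widen(rows, present_columns):
    """Pad rows from one source table out to the wide schema."""
    out = []
    for row in rows:
        wide = {col: None for col in WIDE_COLUMNS}
        wide.update(dict(zip(present_columns, row)))
        out.append(wide)
    return out

table_a = [(1, 3)]        # (uid, click_cnt)
table_b = [(1, 120.5)]    # (uid, watch_time)
# Union: cheap concatenation instead of an expensive streaming join.
wide_view = widen(table_a, ["uid", "click_cnt"]) + widen(table_b, ["uid", "watch_time"])
print(len(wide_view))  # 2 padded rows, later aggregated by uid downstream
```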
The output daily table may also need to join tables stored in Hive or offline tables on other storage media, that is, a join between a stream table and an offline table. Joining directly would make the intermediate state held in checkpoints quite large, so we optimized along another dimension.
The left-hand part is similar to the hourly table: each topic is consumed with Pulsar Flink SQL and converted into the corresponding table, the tables are unioned, and the unioned result is written to HBase daily (HBase is introduced here to take the place of the join).
The right-hand side needs to join offline data: Spark aggregates the offline Hive tables (such as tables A1, A2, and A3), and the aggregated data is written to HBase under carefully designed row keys. The effect after aggregation is as follows: suppose the stream data on the left fills the first 80 columns of the wide table under some key; the data computed by the Spark job, under the same key, fills the last 20 columns, forming one large wide table in HBase. The final data is then extracted from HBase and written to ClickHouse for upper-layer users to query. This is the main architecture of the A/B test scenario.
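The row-key trick can be sketched as follows. The column counts (80 + 20) come from the article; the row-key format and column names are ours:

```python
# Illustrative sketch: the stream side and the Spark batch side write
# disjoint column ranges of the same wide row, keyed by the same row
# key, so HBase effectively performs the join for us.
wide_rows = {}  # row_key -> {column: value}, standing in for an HBase table

def put(row_key, columns):
    """Upsert columns into the wide row, as an HBase put would."""
    wide_rows.setdefault(row_key, {}).update(columns)

# The stream side fills the first 80 columns for uid 1001...
put("uid#1001", {f"stream_col_{i}": i for i in range(80)})
# ...and the Spark batch side fills the last 20 columns of the same row.
put("uid#1001", {f"batch_col_{i}": i for i in range(20)})

print(len(wide_rows["uid#1001"]))  # 100: one complete wide row, no join
```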
Since going live in May 2020, Pulsar has run stably, processing tens of billions of messages per day with inbound traffic of 2–3 GB/s. Apache Pulsar’s high throughput, low latency, and high reliability have greatly improved BIGO’s message processing capacity, reduced the operating cost of the message queues, and saved nearly 50% of the hardware cost. We currently deploy hundreds of Pulsar broker and bookie processes on dozens of physical hosts, co-locating a bookie and a broker on each node. We have migrated ETL from Kafka to Pulsar, and we are gradually migrating the production services that consume Kafka clusters (such as Flink, Flink SQL, and ClickHouse) to Pulsar. As more services migrate, the traffic on Pulsar will keep rising.
Our ETL job covers more than 10,000 topics, each with an average of 3 partitions, using a 3-replica storage policy. With Kafka, as the number of partitions grew, disk access gradually degraded from sequential to random reads and writes, and read/write performance degraded severely. Apache Pulsar’s layered storage design easily supports millions of topics, providing elegant support for our ETL scenario.
BIGO has done a great deal of work on Pulsar broker load balancing, broker cache hit-rate optimization, broker monitoring, BookKeeper read/write performance, BookKeeper disk I/O performance, and the integration of Pulsar with Flink and Flink SQL. This has improved Pulsar’s stability and throughput, lowered the barrier to combining Flink with Pulsar, and laid a solid foundation for promoting Pulsar adoption.
In the future, we will apply Pulsar to more scenarios at BIGO and help the community further optimize and improve Pulsar, as follows:
- Develop new features for Apache Pulsar, such as support for topic-policy-related features.
- Migrate more tasks to Pulsar. This involves two aspects: migrating tasks that previously used Kafka to Pulsar, and connecting new services directly to Pulsar.
- BIGO plans to use KoP (Kafka on Pulsar) to ensure a smooth transition for data migration. Because BIGO has many Flink jobs consuming Kafka clusters, we hope to add a KoP layer directly in Pulsar to simplify the migration process.
- Continuously optimize the performance of Pulsar and BookKeeper. Because production traffic is high, BIGO demands high reliability and stability from the system.
- Continuously optimize BookKeeper’s I/O stack. Pulsar’s underlying storage is an I/O-intensive system; only by guaranteeing high throughput in the underlying I/O layer can the upper layers achieve high and stable throughput.
About the author
Chen Hang, Apache Pulsar Committer, leads the BIGO big-data messaging platform team and is responsible for building and developing a centralized publish-subscribe messaging platform carrying large-scale services and applications. He introduced Apache Pulsar into the BIGO messaging platform and connected it with upstream and downstream systems such as Flink, ClickHouse, and other real-time recommendation and analysis systems. He currently focuses on Pulsar performance tuning, new feature development, and Pulsar ecosystem integration.
- Performance tuning practice of Apache Pulsar at BIGO (part 1)
- Performance tuning practice of Apache Pulsar at BIGO (part 2)
- Apache Pulsar in practice in the energy Internet field