Blog recommendation | Tencent experts analyze the five application scenarios of Apache Pulsar in depth

Time:2021-12-1

Editor’s recommendation:

The MQ team of the Tencent Data Platform Department has studied Pulsar in depth and made many performance and stability optimizations. Pulsar is now live in TDMQ, the Tencent Cloud message queue service. This article briefly reviews the traditional message queue scenarios that Pulsar supports, then looks at how Pulsar's newer features open up additional scenarios.

The following article comes from Tencent Cloud Middleware, by Zhang Chao.

This article is reposted from Tencent Cloud Middleware. The author, Zhang Chao, is a senior engineer on the MQ team of the Tencent Data Platform Department, an Apache TubeMQ (incubating) PMC member, a Kafka on Pulsar maintainer and an Apache Pulsar contributor.

About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage and lightweight function-based computing. Its architecture separates compute from storage, and it supports multi-tenancy, persistent storage and cross-region replication across multiple data centers. As a streaming data store it offers strong consistency, high throughput, low latency and high scalability.
GitHub address: http://github.com/apache/pulsar/

Message queuing overview

What is a message queue

A message queue (MQ for short) is the container or service that holds messages while they are in transit. It provides asynchronous communication between services and suits serverless and microservice architectures. It is an important building block for distributed systems that need high performance, high availability, scalability and similar properties.

Common mainstream message queues include ActiveMQ, RabbitMQ, ZeroMQ, Kafka, MetaMQ, RocketMQ and Pulsar. Within the company there are also TubeMQ, CKafka, TDMQ, CMQ, CDMQ, Hippo and others.

Message queue characteristics

Distributed

Message queues are distributed, which is what lets them offer asynchronous processing, decoupling and similar capabilities.

Reliability

Message-based communication is reliable, and messages are not lost. Most message queues can persist messages to disk.

Asynchronous

With a message queue, synchronous remote calls can be turned into asynchronous calls. Where the caller does not need the result of the remote call, performance improves significantly.

Loose coupling

Messages are stored and distributed by the middleware itself. A producer only needs to know how to send messages to the message broker, and a consumer only needs to know how to subscribe to them from the broker. Producers and consumers are fully decoupled and do not need to know of each other's existence.

Event driven

Complex application systems can be restructured as event-driven systems. Event sourcing records the sequence of states an object passes through from creation to destruction. If every state change is stored, the record yields not only the object's current state but also the full history of how it got there. Message queues support this design well: the events that trigger changes of object state are placed on the queue.
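
As a rough illustration of event sourcing over a queue, the sketch below replays state-change events to recover both the current state and the history. The event shape (order_id, new_state) and the use of an in-memory deque in place of a real message queue are assumptions made for this example only.

```python
from collections import deque


def replay(events):
    """Rebuild an object's state history by applying events in arrival order."""
    state = None
    history = []
    for event in events:
        state = event["new_state"]  # each event records exactly one state change
        history.append(state)
    return state, history


# A deque stands in for the message queue that carries state-change events.
queue = deque()
for new_state in ("created", "paid", "shipped", "delivered"):
    queue.append({"order_id": 42, "new_state": new_state})

current, history = replay(queue)
print(current)
print(history)
```

Because the events are kept rather than overwritten, the same replay can be run again later to audit how the object reached its current state.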

Message queue classification

In the JMS (Java Message Service) standard, there are two message models: point-to-point (P2P) and publish/subscribe (Pub/Sub).

P2P

In P2P, each message has exactly one consumer. The producer sends a message to a queue, one consumer consumes it, and the message is then deleted. Any consumer may be the one that receives a given message, but the same message is never consumed by two consumers.

Pub/Sub

In Pub/Sub, a message published to a topic is consumed by all of the topic's subscribers. The producer sends the message to the topic, every consumer subscribed to that topic can consume it, and the message can be deleted only after all subscribers have consumed it.

Producers and consumers are coupled in time: only consumers that subscribed to the topic beforehand can consume a message. If a message is sent first and the subscription is created afterwards, that subscriber will not receive the messages sent before it subscribed.

ActiveMQ, a traditional enterprise message queue, follows the JMS specification and implements both the point-to-point and publish/subscribe models, whereas other popular message queues such as RabbitMQ and Kafka do not follow the JMS specification.
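
The difference between the two JMS models, including the time dependency of Pub/Sub, can be sketched with two small in-memory classes. PointToPointQueue and PubSubTopic are hypothetical names invented for this illustration; they are not part of JMS or of any message queue client library.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to exactly one consumer, then removed."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Whichever consumer calls first gets the message; afterwards it is gone.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Each message is delivered to every subscriber registered at publish time."""

    def __init__(self):
        self._inboxes = {}

    def subscribe(self, name):
        self._inboxes[name] = []

    def publish(self, message):
        for inbox in self._inboxes.values():
            inbox.append(message)

    def inbox(self, name):
        return list(self._inboxes[name])


q = PointToPointQueue()
q.send("m1")
first, second = q.receive(), q.receive()  # the second receive finds nothing left

t = PubSubTopic()
t.subscribe("early")
t.publish("m1")
t.subscribe("late")  # subscribes after m1 was published, so it misses m1
t.publish("m2")
```

Here the "early" subscriber sees both messages, while "late" only sees the message published after it subscribed.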

In real-time streaming architectures, message delivery can be divided into two categories: queue and stream.

Queue model

The queue model consumes messages in an unordered, shared manner. Users can create multiple consumers that receive messages from a single pipeline; when a message is sent from the queue, only one of the consumers (any of them) receives and consumes it. Which consumer actually receives the message depends on the implementation of the messaging system.

The queue model is usually paired with stateless applications. Stateless applications do not care about ordering, but they do need to acknowledge (ACK) or delete individual messages and to scale consumption parallelism as far as possible. Typical queue-based messaging systems include RabbitMQ and RocketMQ.

Stream model

In contrast, the stream model requires messages to be consumed in strict order, or exclusively. For a given pipeline there is only ever one consumer, and it receives messages in exactly the order in which they were written to the pipeline.

The stream model is usually associated with stateful applications, which care about the order of messages and about their own state. The order in which messages are consumed determines the application's state and therefore affects the correctness of its processing logic. Typical stream-based messaging systems include Kafka and TubeMQ.

Application scenarios of traditional message queues

Asynchronous call

Suppose that in a call chain it takes 20 ms for system A to call B, 20 ms for B to call C, and 2 s for C to call D, so the whole call takes 2040 ms. A calling B and B calling C account for only 40 ms of that; adding system D makes the chain roughly 50 times slower. We can use a message queue to pull the call to system D out of the chain and make it asynchronous: the request completes once A has called B and B has called C, system C sends a message to the message queue, and system D takes the message from the queue and consumes it. The latency seen by the caller improves by roughly 50 times.
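
The arithmetic behind this example can be checked with a few lines of code. The function names are hypothetical, and the sketch assumes that putting a message on the queue costs a negligible amount of time compared with the calls themselves.

```python
def sync_latency_ms(hops):
    """Latency seen by the caller when every hop is called synchronously."""
    return sum(hops.values())


def async_latency_ms(hops, offloaded):
    """Latency seen by the caller when the offloaded hops go through a queue.

    Assumption: enqueueing a message takes negligible time.
    """
    return sum(ms for hop, ms in hops.items() if hop not in offloaded)


hops = {"A->B": 20, "B->C": 20, "C->D": 2000}
before = sync_latency_ms(hops)                      # 20 + 20 + 2000
after = async_latency_ms(hops, offloaded={"C->D"})  # 20 + 20
speedup = before / after
print(before, after, speedup)
```

With the numbers from the text, the synchronous chain takes 2040 ms, the asynchronous one 40 ms, and the ratio is 51, which is where the "roughly 50 times" figure comes from.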

System decoupling

Each business system only needs to handle its own business logic and send event messages to the message queue. Downstream business systems subscribe directly to the queue or topic to obtain the events. After a monolithic application is split into microservices, a message queue can carry the communication between them. The advantage of decoupling is that different systems no longer depend on one another's release cycles, which effectively shortens data links and improves the efficiency of data processing.

Peak shaving and valley filling

When large-scale events bring heavy traffic, inadequate protection can easily overload or even crash the system, while overly strict throttling makes many requests fail and hurts the user experience. A message queue service has the message-processing performance to absorb traffic spikes without being overwhelmed. It keeps the system available while still responding to requests quickly, which improves the user experience. Its capacity to accumulate a massive backlog of messages lets downstream services run smoothly within a safe load level, shielded from traffic peaks.
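
A small simulation shows how a queue turns a traffic spike into a steady downstream load. The per-tick arrival numbers and the fixed downstream capacity are made-up values for illustration, and simulate is a hypothetical helper rather than part of any queue product.

```python
from collections import deque


def simulate(arrivals_per_tick, capacity_per_tick):
    """Return (processed_per_tick, backlog_per_tick) for a queue-buffered service."""
    backlog = deque()
    processed, backlog_sizes = [], []
    for arrivals in arrivals_per_tick:
        backlog.extend(range(arrivals))              # the burst lands in the queue
        done = min(capacity_per_tick, len(backlog))  # downstream never exceeds its capacity
        for _ in range(done):
            backlog.popleft()
        processed.append(done)
        backlog_sizes.append(len(backlog))
    return processed, backlog_sizes


# A spike of 50 requests arrives in the second tick; downstream handles 15 per tick.
processed, backlog_sizes = simulate([10, 50, 5, 0, 0, 0], capacity_per_tick=15)
print(processed)
print(backlog_sizes)
```

The downstream service never processes more than 15 requests in a tick, the backlog absorbs the spike, and the queue is drained again during the quiet ticks that follow.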

Broadcast notification

When a system's state changes and several related systems need to know, the change can be pushed to each subscribing system through message subscription. For example, when a database value changes, every cache system must be told to update. A message describing the change can be sent to the message queue; each cache subscribes to the relevant topic and updates itself when the message arrives.

Distributed cache

In big data scenarios, log analysis often has to process volumes of logs too large to store on a single physical machine. A message queue cluster can store these massive volumes of messages and buffer them until a real-time analysis system processes the logs further. Kafka and TubeMQ often serve as a distributed cache in big data processing.

Message communication

Message queues generally have efficient communication mechanisms built in, so they can also be used for pure message communication, for example to implement point-to-point messaging or chat rooms.

Application scenarios of Pulsar

Pulsar is a new-generation message queue service with an architecture that separates storage from compute. It covers the traditional message queue scenarios described above, and some of its new features make further scenarios possible.

Convergence of queues and streams – maintaining one set of MQ services is enough

Apache Pulsar abstracts a unified producer-topic-subscription-consumer model that supports both the queue model and the stream model. In Pulsar's consumption model, a topic is the channel to which messages are sent, and each topic corresponds to a distributed log in Apache BookKeeper. Each message published by a producer is stored only once in the topic; when storing it, BookKeeper replicates the message across multiple storage nodes. Each message in a topic can be consumed multiple times, as consumers' subscriptions require, and each subscription corresponds to a consumer group. Although a message is stored only once on a topic, users can consume it through different kinds of subscription:

  • Consumers are grouped together to consume messages, and each consumer group is a subscription.
  • Each topic can have several different consumer groups.
  • Each group of consumers is one subscription to the topic.
  • Each group of consumers can have its own consumption mode: exclusive, failover or shared.

Through this model, Pulsar combines the queue model and the stream model behind a unified API. The model neither degrades the messaging system's performance nor adds overhead, and it gives users the flexibility to use the messaging system in whichever pattern best fits their application.
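
The idea that one stored log can serve both models through different subscriptions can be sketched as a toy in-memory model. This is not the Pulsar client API: Topic, Subscription and their methods are hypothetical names, and the round-robin dispatch used for the shared mode is a simplification assumed for the example.

```python
class Subscription:
    """One consumer group with its own read position on the topic log."""

    def __init__(self, mode):
        if mode not in ("exclusive", "failover", "shared"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.consumers = []
        self.cursor = 0  # index of the next message to dispatch

    def add_consumer(self, name):
        if self.mode == "exclusive" and self.consumers:
            raise RuntimeError("exclusive subscription already has a consumer")
        self.consumers.append(name)

    def dispatch(self, log):
        """Deliver unread messages and return {consumer: [messages]}."""
        delivered = {name: [] for name in self.consumers}
        while self.cursor < len(log) and self.consumers:
            if self.mode == "shared":
                # Queue-style: spread messages over the group (round-robin here).
                target = self.consumers[self.cursor % len(self.consumers)]
            else:
                # Stream-style: exclusive and failover feed one active consumer in order.
                target = self.consumers[0]
            delivered[target].append(log[self.cursor])
            self.cursor += 1
        return delivered


class Topic:
    def __init__(self):
        self.log = []  # each message is stored once, whatever the subscriptions
        self.subscriptions = {}

    def subscribe(self, sub_name, mode, consumer):
        sub = self.subscriptions.setdefault(sub_name, Subscription(mode))
        sub.add_consumer(consumer)
        return sub

    def publish(self, message):
        self.log.append(message)


topic = Topic()
stream = topic.subscribe("audit", "exclusive", "c1")
workers = topic.subscribe("workers", "shared", "w1")
topic.subscribe("workers", "shared", "w2")
for i in range(4):
    topic.publish(f"m{i}")

stream_out = stream.dispatch(topic.log)
workers_out = workers.dispatch(topic.log)
```

The "audit" subscription receives all four messages in order on its single consumer, while the "workers" subscription splits the same four messages between w1 and w2, and the topic log still holds each message only once.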

Compatible with multiple MQ protocols – easily migrate traditional MQ services

In the Pulsar architecture, a set of distributed-processing abstractions built on the managed ledger handles the messages stored on bookies and prevents message loss. The Pulsar protocol handler processes the TCP requests sent by producers and consumers and converts them into a form the broker can operate on. Since Pulsar 2.5, the protocol handler interface has been split out on its own. With this framework, handlers for other protocols such as Kafka and AMQP can be implemented separately, which helps existing MQ services migrate to Pulsar easily.

At present, the Kafka protocol handler is maintained as an independent project, Kafka on Pulsar (hereinafter KoP). Because the company has a large number of existing Kafka workloads, the Data Platform MQ team has done a lot of optimization work on KoP, and other Tencent teams are also deeply involved in the project. For details, see "Tencent joins: Kafka on Pulsar project welcomes two Tencent maintainers!"

Enterprise-class multi-tenancy – guaranteed data security

As the information hub of the enterprise, Apache Pulsar has supported multi-tenancy since its inception, because the project was originally built to meet Yahoo's strict requirements, and at the time no open source system on the market provided multi-tenancy. In Pulsar's design, tenants can be spread across clusters, and each tenant can have its own authentication and authorization scheme. Tenants are also the administrative unit for managing storage quotas, message TTL and isolation policies. Pulsar secures data in multi-tenant scenarios in the following ways:

  • Authentication, authorization and ACLs (access control lists) give each tenant the security it requires.
  • Storage quotas are enforced for each tenant.
  • All isolation mechanisms are defined as policies that can be changed at runtime, which reduces operation and maintenance costs and simplifies management.

Cross-region replication – built-in redundancy across data centers

Large-scale distributed systems often need to span multiple data centers. Where service quality and disaster recovery requirements are high, services are typically deployed in several geographically dispersed data centers. In such multi-data-center deployments, a cross-region replication mechanism usually provides extra redundancy so that a data center failure, natural disaster or other event does not stop the service from operating normally. Apache Pulsar was designed from the outset to meet a requirement for cross-region replication among more than ten data centers worldwide. Its cross-region, multi-data-center mutual backup is an important part of Pulsar's enterprise feature set: it keeps data stable and reliable and gives users convenient operation and management.

Consider three clusters, A, B and C. Whenever producers P1, P2 and P3 publish messages to topic T1 in cluster A, cluster B and cluster C respectively, those messages are replicated to the other clusters right away. Once replication is complete, consumers C1 and C2 can consume the messages from their own clusters.

Pulsar's cross-region replication is used not only for cross-data-center backup, but also as the communication service in the PowerFL federated learning platform.
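
The replication flow described here can be approximated with a toy model in which every cluster keeps its own copy of topic T1. Cluster and connect are hypothetical names for this sketch, and the copy is made synchronously for simplicity; a real deployment replicates over the network, so timing and cross-cluster ordering may differ.

```python
class Cluster:
    def __init__(self, name):
        self.name = name
        self.topics = {}  # topic name -> list of (origin_cluster, message)
        self.peers = []

    def publish(self, topic, message):
        """Store the message locally, then copy it to every peer cluster."""
        self._store(topic, self.name, message)
        for peer in self.peers:
            # Replicated copies are stored only; they are not forwarded again.
            peer._store(topic, self.name, message)

    def _store(self, topic, origin, message):
        self.topics.setdefault(topic, []).append((origin, message))

    def consume(self, topic):
        return [message for _, message in self.topics.get(topic, [])]


def connect(*clusters):
    """Make every cluster a replication peer of every other cluster."""
    for cluster in clusters:
        cluster.peers = [c for c in clusters if c is not cluster]


a, b, c = Cluster("A"), Cluster("B"), Cluster("C")
connect(a, b, c)
a.publish("T1", "from-P1")
b.publish("T1", "from-P2")
c.publish("T1", "from-P3")
```

After the three publishes, a consumer attached to any one of the clusters can read all three messages from its local copy of T1.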

Cloud-native support – helping services move to the cloud

Cloud native means that running in the cloud is considered from the very start of software design, so the design takes full advantage of the characteristics of cloud resources, typically distribution and elastic scalability. Pulsar is called a cloud-native messaging platform chiefly because its architecture can make full use of distributed, elastic cloud resources. Take Pulsar on Kubernetes as an example: a bookie is a stateful node, but bookies are peers of one another, so they can be deployed with a StatefulSet; a broker is a stateless node and can use a ReplicaSet directly; and each kind of pod can be scaled horizontally.

At present, the company already runs Pulsar on Kubernetes. If bookies use local storage volumes, there is no impact on Pulsar's performance.
