# About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight function-based computing. Its architecture separates compute from storage, and it supports multi-tenancy, persistent storage, and cross-region replication across multiple data centers, with strong consistency, high throughput, low latency, and high scalability for stream data storage.
Apache Pulsar has several distinctive advantages, such as tiered storage, stateless brokers, cross-region replication, and multi-tenancy. These are the features where Pulsar compares favorably with Kafka.
If you are still undecided between Pulsar and Kafka, I hope the ten advantages of Pulsar summarized in this article can help you decide.
Stateless broker (scalable)
With Kafka, you need to decide on the number of brokers up front. Because Kafka stores data on the brokers themselves, if that number turns out to be too small and you add brokers to scale out, topics must be repartitioned and their data rebalanced before the new brokers can be fully utilized.
Pulsar stores message data in a separate layer (Apache BookKeeper), so the brokers themselves hold no persistent data. Because the broker layer is decoupled from the storage layer, brokers can be added or removed without moving data. In other words, a new broker can be fully utilized without repartitioning existing data.
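The difference can be illustrated with a toy model. The sketch below is not how either system schedules partitions; the broker names, partition names, sizes, and the `add_broker` and `bytes_copied` helpers are all hypothetical. It only shows that the same ownership change costs a data copy when brokers hold the data, and costs nothing when data lives in a separate storage layer.

```python
def add_broker(assignment, new_broker):
    """Spread partitions evenly after adding a broker.

    assignment maps broker name -> list of (partition, size_gb).
    Returns (new_assignment, moved_partitions). Hypothetical helper.
    """
    new_assignment = {b: list(parts) for b, parts in assignment.items()}
    new_assignment[new_broker] = []
    total = sum(len(parts) for parts in new_assignment.values())
    target = total // len(new_assignment)
    moved = []
    for broker, parts in new_assignment.items():
        if broker == new_broker:
            continue
        # Hand over partitions until this broker is down to the target count.
        while len(parts) > target and len(new_assignment[new_broker]) < target:
            part = parts.pop()
            new_assignment[new_broker].append(part)
            moved.append(part)
    return new_assignment, moved


def bytes_copied(moved, broker_stores_data):
    """Data that must be copied for the ownership changes in `moved`."""
    if broker_stores_data:
        # Data lives on the broker: every moved partition is copied.
        return sum(size_gb for _, size_gb in moved)
    # Data lives in a separate storage layer: only ownership changes.
    return 0


# Made-up cluster: two brokers, six partitions of 40 GB each.
assignment = {
    "broker-1": [("t-0", 40), ("t-1", 40), ("t-2", 40)],
    "broker-2": [("t-3", 40), ("t-4", 40), ("t-5", 40)],
}
new_assignment, moved = add_broker(assignment, "broker-3")
copied_if_broker_owns_data = bytes_copied(moved, broker_stores_data=True)
copied_if_stateless = bytes_copied(moved, broker_stores_data=False)
```

In this toy setup the new broker takes over two partitions in both cases; what differs is whether taking them over requires copying their data first.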
Tiered storage (persistent storage of messages, lower storage costs)
Kafka's default retention period is 7 days, so data is deleted after one week. By default, Pulsar keeps all messages that have not been acknowledged and removes messages as soon as they have been acknowledged.
Both Kafka and Pulsar let you change the retention period through a custom retention policy. However, the amount of data that primary storage can hold is limited, and keeping more data raises storage costs. Tiered storage lets you choose cost-effective storage suited to each type of data. For example, if historical data is only used to bootstrap (backfill) applications, it can be kept on a different storage type.
Pulsar's storage layer uses a segment-based architecture, with segments distributed across the storage nodes. With Pulsar, data can be written to primary storage and later offloaded to other types of storage. Pulsar therefore supports tiered storage, which Kafka did not support at the time of writing. Tiered storage provides multiple storage layers, such as primary storage (SSD-based) and historical storage (S3), and the storage status of each layer is easy to inspect.
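To make the cost argument concrete, here is a minimal arithmetic sketch. The data volume and the per-GB prices are made-up placeholders, not quotes from any provider, and `monthly_cost_cents` is a hypothetical helper. Prices are in integer cents to keep the arithmetic exact.

```python
def monthly_cost_cents(total_gb, hot_capacity_gb, hot_cents_per_gb, cold_cents_per_gb):
    """Cost of keeping total_gb when only hot_capacity_gb stays on primary storage.

    Everything beyond the hot capacity is assumed to be offloaded to the
    cheaper historical tier. All figures are illustrative assumptions.
    """
    hot_gb = min(total_gb, hot_capacity_gb)
    cold_gb = total_gb - hot_gb
    return hot_gb * hot_cents_per_gb + cold_gb * cold_cents_per_gb


TOTAL_GB = 10_000          # assumed total retained data
HOT_CENTS_PER_GB = 10      # assumed SSD price
COLD_CENTS_PER_GB = 2      # assumed object-storage price

# Everything on primary storage.
all_hot = monthly_cost_cents(TOTAL_GB, TOTAL_GB, HOT_CENTS_PER_GB, COLD_CENTS_PER_GB)
# Only the most recent 1,000 GB on primary storage, the rest offloaded.
tiered = monthly_cost_cents(TOTAL_GB, 1_000, HOT_CENTS_PER_GB, COLD_CENTS_PER_GB)
```

With these assumed numbers the tiered layout costs well under a third of the all-primary layout. The real ratio depends entirely on actual prices and on how much data has to stay hot.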
Quorum-based replication (more consistent latency)
Pulsar replicates data with a quorum-based algorithm, while Kafka uses a leader-follower algorithm. Although Pulsar and Kafka offer the same guarantees, quorum-based replication gives more consistent latency. Consistent latency matters for many applications, for example those that must meet SLAs such as a bound on query response time.
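One way to see why a quorum helps with tail latency is a small simulation. This is a deliberately simplified model, not a faithful model of either Pulsar or Kafka: the latency distribution, the 5% chance of a slow replica, and the `simulate` and `percentile` helpers are all assumptions made for illustration. It compares waiting for the fastest 2 of 3 acknowledgements with waiting for all 3.

```python
import random


def simulate(n_replicas=3, ack_quorum=2, n_writes=10_000, seed=7):
    """Return (quorum_latencies, wait_for_all_latencies) in milliseconds."""
    rng = random.Random(seed)
    quorum_lat, all_lat = [], []
    for _ in range(n_writes):
        # Each replica usually acks in 1-5 ms; 5% of the time it is a slow outlier.
        acks = sorted(
            rng.uniform(50, 200) if rng.random() < 0.05 else rng.uniform(1, 5)
            for _ in range(n_replicas)
        )
        quorum_lat.append(acks[ack_quorum - 1])  # done at the k-th fastest ack
        all_lat.append(acks[-1])                 # done only at the slowest ack
    return quorum_lat, all_lat


def percentile(values, pct):
    """Nearest-rank style percentile on a copy of the data."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]


quorum_lat, all_lat = simulate()
p99_quorum = percentile(quorum_lat, 99)
p99_all = percentile(all_lat, 99)
```

Because the k-th fastest acknowledgement can never arrive later than the slowest one, the quorum latency is at most the wait-for-all latency on every write. A single slow replica therefore hurts the quorum path far less often.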
Cross-region replication (high availability)
Pulsar natively supports cross-region replication, so it can replicate data across data centers in different geographic locations. When a data center goes down or the network is partitioned, having copies of messages in multiple data centers is especially important for availability.
Multi-tenancy (simplified architecture and management)
Pulsar supports multi-tenancy: multiple user groups can share the same cluster, either through access control or in completely separate namespaces. Kafka does not currently support multi-tenancy, so to share a cluster you must build an abstraction layer on top of the messaging system, or give each user group its own cluster.
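Tenants and namespaces show up directly in Pulsar topic names, which take the form `persistent://tenant/namespace/topic`. The sketch below is a small stand-alone parser for that fully qualified form. The `TopicName` class, the `parse_topic` function, and the example tenant and namespace names are hypothetical and are not part of any Pulsar client library.

```python
from dataclasses import dataclass

DOMAINS = ("persistent", "non-persistent")


@dataclass(frozen=True)
class TopicName:
    domain: str
    tenant: str
    namespace: str
    topic: str

    def __str__(self):
        return f"{self.domain}://{self.tenant}/{self.namespace}/{self.topic}"


def parse_topic(name):
    """Parse a fully qualified topic name into its four components."""
    domain, sep, rest = name.partition("://")
    parts = rest.split("/")
    if not sep or domain not in DOMAINS or len(parts) != 3 or not all(parts):
        raise ValueError(f"not a fully qualified topic name: {name!r}")
    return TopicName(domain, *parts)


# Two user groups sharing one cluster, each under its own tenant.
orders = parse_topic("persistent://team-a/orders/created")
clicks = parse_topic("persistent://team-b/web/clicks")
```

Since the tenant is part of every topic name, policies such as access control can be attached at the tenant or namespace level rather than per cluster.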
Message encryption (improved security)
Pulsar provides end-to-end encryption from the client to the storage nodes. End-to-end encryption is a common data security requirement. Kafka does not currently support end-to-end encryption.
Multi-protocol support (easy to integrate with existing applications)
Pulsar not only supports a variety of protocols (such as RabbitMQ, AMQP, and Kafka), but also supports using Presto to read historical stream events in parallel.
Pulsar Functions (one-stop stream processing)
Pulsar Functions is a lightweight stream processing framework built on Pulsar, similar in concept to Kafka Streams. Pulsar Functions are deployed directly on broker nodes (or as containers in a Kubernetes cluster), while a Kafka Streams job is a separate application. With Pulsar Functions, Pulsar can handle many stream processing tasks directly, which simplifies operations.
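For a sense of how lightweight a function can be, here is a minimal sketch in the language-native Python style, where the module exposes a `process` function that receives one message and returns the value to publish. The transformation itself (appending an exclamation mark) is just a placeholder, and deploying the function to a cluster is not shown.

```python
def process(input):
    """Handle one incoming message and return the output message.

    The parameter is named `input` to follow the convention used for
    language-native Python functions; the logic here is a placeholder.
    """
    return f"{input}!"


# Local check, outside of any Pulsar runtime.
result = process("hello")
```

Because the function is plain Python with no Pulsar imports, it can be unit tested locally before it is deployed.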
Apache Flink integration (batch and stream processing)
The Pulsar community has held a series of public discussions about the limitations of Pulsar Functions, such as state management and DAG-style processing. If Pulsar Functions does not fit your scenario, consider another popular open-source tool: the Pulsar Flink connector.
Pulsar is battle-tested (used in large-scale production environments)
Pulsar has many advantages by design. It was originally developed at Yahoo for internal use. Yahoo open-sourced Pulsar in 2016 and subsequently donated it to the Apache Software Foundation. Since then, many mission-critical applications have adopted Pulsar, at companies such as Tencent, Splunk, and others.
Pulsar is not perfect
Pulsar depends on two systems, Apache BookKeeper and Apache ZooKeeper, while Kafka “only” needs ZooKeeper. Running multiple systems increases operational complexity, but it is also what makes Pulsar more flexible. Either way, both Kafka and Pulsar rely on external systems that must be set up and maintained.
Choosing between Pulsar and Kafka is not easy, and the decision has far-reaching consequences. In this article, I have summarized the main differences between Pulsar and Kafka, in the hope that this helps you and your team choose. To learn more about Apache Pulsar, visit pulsar.apache.org or subscribe to the email notifications.