Pulsar Functions is a lightweight, function-oriented compute framework launched by Apache Pulsar. With Pulsar Functions, complex processing logic can be applied to individual messages without deploying a separate system, which simplifies event streaming and brings serverless ideas in to reduce the operations burden. This article is compiled from the talk "Function Mesh: Serverless Innovation in Message and Stream Data Scenarios," delivered by Zhai Jia, co-founder of StreamNative and Tencent Cloud TVP, at the Techo TVP Developer Summit ServerlessDays China 2021.
1. What is Apache Pulsar?
Function Mesh is StreamNative's latest open source project, combining serverless and Kubernetes closely. Function Mesh shares the original intent of Pulsar Functions, which we built earlier to integrate Pulsar and serverless more tightly: it makes it easier to use cloud resources to manage functions.
Today's talk covers four topics: an introduction to Pulsar, Pulsar Functions, Function Mesh, and the Pulsar community.
Why was Pulsar born, and what was it meant to do? Pulsar began as a messaging system inside Yahoo. What problems did it set out to solve? Anyone who has built messaging infrastructure knows that, for architectural reasons, the requirements naturally split into two directions. One is peak shaving and valley filling: an MQ for interactive, online workloads. The other requires a big data engine: a data transmission pipeline. The usage scenarios, data consistency requirements, and technical architectures of the two are completely different. In 2012, the main problem Yahoo faced was that different departments each maintained their own systems — three or four sets internally. The operations bottleneck had become severe across the whole company, and data silos between departments had become just as severe. So Yahoo's original intent for Pulsar was twofold. For the platform side, it wanted a unified data platform with unified operations and management, reducing operational pressure and improving resource utilization. For business departments, 2012 was also the beginning of stream computing; the hope was that more data could be connected, so real-time computing could capture more data sources, produce more accurate results, and extract more value from the data. From these two aspects, Pulsar was born mainly to unify the two scenarios: to solve, on a single platform, what previously required two separate sets of messaging applications.
Given that demand, why is Pulsar able to do this? It comes down to two aspects of its design:
First, a cloud native architecture. Several points lie behind this. The service layer — the compute layer — and the storage layer are completely separated: the service layer stores no data, and hands all data to the underlying storage layer. What is exposed to users is the concept of a logical partition. Unlike systems that bind a partition directly to a file system folder on a single node, Pulsar splits each partition into a series of segments according to a user-specified size or time window. This segmentation ensures that a partition's data can be spread evenly across multiple storage nodes, giving each partition truly distributed storage. Scaling out or in therefore never requires data relocation, which is the advantage of a cloud native architecture.
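To make the scaling argument concrete, here is a minimal conceptual sketch (not Pulsar source code, and the round-robin placement policy is a simplification of BookKeeper's actual placement logic): each new segment is placed on whichever storage nodes exist at write time, so old segments never move when the cluster grows.

```python
# Conceptual sketch: why segment-based storage makes scaling cheap.
# A partition's data is split into segments; each segment is assigned
# to a small ensemble of storage nodes when it is created.

def place_segments(num_segments, bookies, ensemble_size=2):
    """Assign each segment to `ensemble_size` storage nodes, round-robin."""
    placement = []
    for seg in range(num_segments):
        start = (seg * ensemble_size) % len(bookies)
        ensemble = [bookies[(start + i) % len(bookies)]
                    for i in range(ensemble_size)]
        placement.append(ensemble)
    return placement

# Start with 3 storage nodes and write 4 segments.
before = place_segments(4, ["bookie-1", "bookie-2", "bookie-3"])

# Add a 4th node. Existing segments keep their placement; only segments
# written from now on can land on the new node -- expansion triggers no
# data relocation.
after = before + place_segments(2, ["bookie-1", "bookie-2",
                                    "bookie-3", "bookie-4"])
```

Contrast this with a design that binds a whole partition to one node's file system, where adding a node forces partition data to be copied over.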
The second architectural point is the peer-to-peer architecture. This is inseparable from Yahoo's demand for larger clusters and multi-tenancy: only when the state between nodes is simple enough, and state maintenance is simple enough, can a very large cluster be maintained. Inside Twitter, the underlying storage layer runs in two data centers with 1,500 nodes each. The upper-layer brokers are easy to reason about as peers: they store no data, so they are leaderless, with no distinction between master and slave. When multiple replicas land on the underlying storage nodes, those storage nodes are also peers: to write one piece of data, multiple nodes are written concurrently. Consistency within a single node is maintained via CRC, while the multiple copies are written to multiple nodes in parallel, so the storage nodes likewise form a peer-to-peer architecture. Through this mechanism and its consistency design, Pulsar has both the storage-compute separation described above and a foundation of node equivalence, which brings a better experience for scaling, operations, and fault tolerance.
Another feature of Pulsar is Apache BookKeeper, a storage engine dedicated to message streams. BookKeeper is an older system, born around 2008-2009 as a Yahoo open source project. It was originally built to provide HA for the HDFS NameNode: to persist every change to the NameNode, i.e. to store metadata. It therefore has particularly high requirements for consistency, low latency, throughput, and reliability. Yet its model is very simple: an abstraction of a write-ahead log. This matches messaging very well, because the dominant access pattern for messages is append-only, and as time passes old data loses value and is eventually deleted wholesale.
With BookKeeper providing stable service quality and particularly strong consistency, Pulsar can support the MQ scenario just mentioned. At the same time, because of the simple log abstraction, the append-only write pattern delivers high data bandwidth and supports the streaming scenario. Both the MQ and the streaming use cases are thus supported and guaranteed by the underlying storage layer.
With this foundation, it becomes particularly easy to build enterprise features on top of Pulsar that users genuinely need. Pulsar was born out of the need for large clusters and multi-tenancy. At this layer, a topic is no longer a flat, single-level concept for users, but more like a folder in a file system, managed as a two-level directory hierarchy; a full topic name takes the form persistent://tenant/namespace/topic. The first level is the tenant, which mainly provides isolation: each tenant can be given different permissions, and each tenant's administrator manages authorization against other tenants and internal users — for example, whether tenant one may access tenant two's data.
One level down, the namespace layer carries various policies that enable enterprise-level controls such as flow control. The bottom layer is the topic itself. Through this hierarchy and the support for large clusters, it becomes much easier to connect data across the organizations and departments inside a company.
In addition, Pulsar has strong data consistency, and many users adopt it for cross-cluster replication. Pulsar's cross-region replication is built into the broker: a user who needs it can set it up with a few simple Pulsar commands. A built-in producer synchronizes data that has just been persisted locally directly to the remote data center, so timeliness is high. The experience is simple to configure, efficient to use, and very low latency, while still providing good data consistency guarantees. It is therefore widely applied; at companies including Tencent and Suning, many users chose Pulsar for this scenario alone.
Because of these foundations, Pulsar's growth in the community has been particularly notable. There are now 403 contributors, and the number of GitHub stars is close to 9,000. Many thanks to the partners at Tencent Cloud who have run extensive and valuable scenario tests on Pulsar.
2. Pulsar Functions
Pulsar started in the messaging field and connects to the broader ecosystem through the cloud. Today's discussion focuses on the compute layer, and on functions in particular. Common big data computing falls roughly into three types: interactive query, where Presto is a common choice; batch processing, e.g. Spark; and stream processing, e.g. Flink. For those engine-based types, Pulsar provides connectors so the engines can understand Pulsar's schema and read a Pulsar topic directly as a table. The concept I want to focus on today is Functions: a lightweight form of computing, not the same thing as the complex computing scenarios above. A diagram makes it intuitive: Pulsar abstracts the simple computations users commonly perform on messages into the Functions abstraction. An embedded consumer on the left side of the function subscribes to incoming messages; the user-supplied function in the middle performs the computation; and the producer on the right writes the function's results back to the destination topic. In this model, the pieces users would otherwise build themselves — creating instances, managing them, and scheduling the desired number of copies — are provided as unified infrastructure.
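The consume-compute-produce loop just described can be sketched in a few lines. This is purely illustrative — in-memory queues stand in for real Pulsar topics, and the real runtime handles subscriptions, acknowledgements, and scheduling:

```python
# Conceptual sketch of the Functions runtime loop: an embedded consumer
# reads from the input topic, the user's function computes on each
# message, and an embedded producer writes results to the output topic.
from collections import deque

def run_function(input_topic, output_topic, user_fn):
    """Drain the input topic, apply user_fn per message, emit results."""
    while input_topic:
        msg = input_topic.popleft()        # embedded consumer
        result = user_fn(msg)              # user-supplied logic
        if result is not None:
            output_topic.append(result)    # embedded producer

in_topic = deque(["3", "4", "5"])
out_topic = deque()
run_function(in_topic, out_topic, lambda m: str(int(m) * 2))
# out_topic now holds "6", "8", "10"
```

The user supplies only the middle step; everything on either side is the infrastructure Pulsar Functions provides.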
Some students asked: what is a topic? A topic is the core abstraction of the messaging field — a pipeline and a carrier. All data is buffered through topics: producers generate messages and hand them to a topic, and consumers consume from the topic in production order. It is both the buffer that carries the data and the pipeline it flows through.
Rather than being a complex computing engine, Pulsar Functions mainly aims to combine the serverless idea with the messaging system, so that Pulsar itself can handle a large amount of lightweight processing on the messaging and data side. Common simple scenarios such as ETL and aggregation account for roughly 60-70% of all data processing workloads, and in IoT scenarios as much as 80-90%. For such simple cases, a lightweight function is enough: there is no need to build a separate complex cluster, the computation can happen right on the messaging side, and both transmission and computing resources are saved.
Let me give a simple demonstration of what Functions feel like to use. The function to implement is trivial: for a given topic, append an exclamation mark to each piece of data — say "hello" — that the user sends in. For a function like this, whatever the language, our Functions runtime has corresponding support. Users do not need to learn any new logic or new APIs: they write in whatever language they are familiar with, submit it to Pulsar Functions, and the runtime subscribes to all incoming data and applies the function to each message.
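The demo above, written as a native Python Pulsar Function, is literally one line of logic. Pulsar supports "native" Python functions — a plain function named `process` that takes the message payload and returns the output — so there is no new API to learn:

```python
# exclamation.py -- a native Python Pulsar Function.
# The runtime calls process() once per incoming message and publishes
# the return value to the output topic.

def process(input):
    """Append an exclamation mark to every incoming message."""
    return input + "!"
```

A function file like this is then submitted to Pulsar with the `pulsar-admin functions create` command, specifying the Python file plus the input and output topics (exact flags depend on your Pulsar version; see the Pulsar Functions docs).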
Functions are serverless in spirit: they combine well with messaging, processing messages and computation in a serverless way. What differs from generic serverless is that this is data processing, so delivery semantics matter. Pulsar Functions flexibly supports three processing guarantees: at-most-once, at-least-once, and effectively-once.
State storage is also built into Pulsar Functions, persisting intermediate results to BookKeeper itself. For example, to compute statistics, a user sends in a sentence, the function splits it into words, and for each word it counts the occurrences; the counts are recorded inside Pulsar itself. In this way a simple function can compute statistics over a topic and update them in real time. Pulsar also provides a REST-based admin interface, which makes it easier to use, schedule, and manage Pulsar Functions; since it is a REST API underneath, users can call it programmatically and integrate it into their own applications.
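A sketch of that word-count example follows. In the real SDK the class would extend `pulsar.Function` and `context` would be the SDK's Context object whose counters are backed by BookKeeper state storage; here, to keep the sketch self-contained, `context` is any object exposing an `incr_counter` method, and `FakeContext` is a hypothetical stand-in for local illustration only:

```python
# Word-count as a class-based function: split each sentence and bump a
# per-word counter. With the real SDK context, the counters persist in
# BookKeeper and survive restarts.

class WordCount:
    def process(self, input, context):
        """Count occurrences of each word in the incoming sentence."""
        for word in input.split():
            context.incr_counter(word, 1)

# Minimal stand-in for the SDK context (illustration only).
class FakeContext:
    def __init__(self):
        self.counters = {}
    def incr_counter(self, key, amount):
        self.counters[key] = self.counters.get(key, 0) + amount

ctx = FakeContext()
WordCount().process("hello pulsar hello functions", ctx)
# ctx.counters: {"hello": 2, "pulsar": 1, "functions": 1}
```

Because the state lives in Pulsar's own storage layer, the function needs no external database to keep its running totals.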
To sum up, Pulsar Functions aims to give everyone in your application and ecosystem a better experience. For developers, it supports multiple languages, and we have recently been working on additional runtime support. Different run modes are supported: the simplest is to run functions on the broker, in thread or process mode. Deployment is flexible too: if resources are limited, deploy functions on the broker and let them run alongside it; if you need better isolation, split them out into a dedicated cluster that runs your functions. Before Function Mesh, we also provided basic Kubernetes support.
The benefit is that things become easier for users, who need not be big data experts: anyone familiar with a supported language can write the logic in that language. Operations are also particularly simple. Processing big data already requires Pulsar, and since you are familiar with Pulsar, functions integrate well with it: they run alongside the broker without another server. For development and debugging, we also provide a localrun mode, so users can easily debug functions locally. For every user along the computing path, Pulsar Functions provides a good experience and rich tool support.
3. Function Mesh
However, although there was Kubernetes support, it was not native. How did users invoke functions before? Functions can be deployed with brokers: each broker runs a functions worker, which exposes all the management and operations interfaces for functions. Users submit functions to the functions worker, which saves the functions' metadata to a topic inside Pulsar. At scheduling time, Kubernetes is told to fetch the metadata from that topic — how many replicas there are, and so on — and then start the corresponding function instances.
There are some unfriendly aspects to this process. First, the metadata itself is stored in a Pulsar topic, which creates a problem: when a functions worker starts, it must read that topic to obtain the metadata, and if the broker serving that topic is not up yet, there is a circular dependency — the worker cannot start until the broker serving the functions metadata topic is up. Second, metadata is managed in two places: one copy is submitted to the functions worker and saved inside Pulsar, while another copy is handed to Kubernetes when it is invoked, and there is no coordination mechanism between the two sides, which makes metadata management troublesome. Third, scaling, dynamic management, and elasticity are exactly what Kubernetes is good at; reimplementing them here duplicates what Kubernetes already does.
The second problem is one many Pulsar Functions users raised. Pulsar Functions run inside a cluster, but many scenarios are not confined to one cluster and need to span several, at which point the interaction becomes complex. For example, in federated learning, a function may train on the data of user A and write the resulting model to the cluster of user B, which requires cross-cluster operation. The previous operations were bound to a single cluster, so sharing functions across clusters was difficult.
Another problem — the most direct reason we built Function Mesh — is that users rarely use a single function to solve a simple problem. They often need to chain multiple functions together, and they want to operate and control those functions as a whole. In the previous mode, this meant writing many commands, each managed separately; the subscription and output topic relationships between commands were hard to track and could not be described intuitively, so management and operations were particularly troublesome.
The main purpose of Function Mesh is not to build another complex, general-purpose computing framework, but to provide better management and make functions easier to use. For example, when multiple functions are chained to serve users as a whole, we want one unified place to describe the input-output relationships. So in August-September 2020 we put forward a very simple proposal: describe the logic in a YAML file, so that it is obvious at a glance that the output of the first function is the input of the second, and the user can see how the functions compose.
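A hypothetical sketch of such a YAML description is below. The field names are illustrative and abbreviated, not authoritative — consult the Function Mesh documentation for the exact CRD schema — and the class names are made up for the example. The point is that the chaining is visible at a glance: the output topic of the first function is the input topic of the second.

```yaml
# Illustrative Function Mesh resource (field names are a sketch, not
# the exact schema). Two chained functions forming one pipeline.
apiVersion: compute.functionmesh.io/v1alpha1
kind: FunctionMesh
metadata:
  name: word-pipeline
spec:
  functions:
    - name: split-sentences
      className: example.SplitFunction        # hypothetical class
      replicas: 1
      maxReplicas: 3                          # allows elastic scaling
      input:
        topics:
          - persistent://public/default/sentences
      output:
        topic: persistent://public/default/words
    - name: count-words
      className: example.CountFunction        # hypothetical class
      replicas: 1
      input:
        topics:
          - persistent://public/default/words  # = output of the first
      output:
        topic: persistent://public/default/counts
```

Submitted to Kubernetes (e.g. with `kubectl apply -f word-pipeline.yaml`), this single file describes the whole pipeline, and the Function Mesh controller reconciles pods to match it.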
If that logic is integrated natively with Kubernetes, it can be combined with Kubernetes' existing scheduling and elasticity strategies to give users a better management and usage experience. Function Mesh takes Kubernetes CRDs as its core. Each resource has a type: an ordinary function; a source, which brings data in from an external system such as a database; and a sink, which writes the output of subscribed topics to a specified destination. Sources and sinks can be seen as special cases of functions.
The CRD describes each function's parallelism, how it runs, and the topic relationships that chain functions together. Alongside the CRD there is a Function Mesh controller, responsible for the actual scheduling and execution. From the user's point of view, starting at the far left: the user describes, in a YAML file handed to Kubernetes, the chaining relationships between functions, the maximum and minimum parallelism, and the resources required. Once the YAML file is submitted, the API server schedules the resources and watches for changes: if the CRD description changes, the pods are changed accordingly — scaled out or in. The relationship between the pods and the Pulsar cluster also becomes much clearer: Pulsar stores none of the function metadata and serves only as the source or destination of data, a pure data pipeline. As noted, the goal is to combine with Kubernetes for a better experience — with Kubernetes, for example, CPU-based elastic scaling comes naturally.
Kubernetes' flexible scheduling gives function operations a better experience: once the CRD changes, pods are added, deleted, or modified according to the CRD's description. And because this mode runs on Kubernetes and is completely decoupled from any single Pulsar cluster, functions can be shared and connected across multiple clusters.
Recently we have been working on a function package management tool to make operations easier; it should arrive in version 2.8. The original intent of Function Mesh is to make Pulsar Functions easier to use, so we have also kept backward compatibility: the previous REST-based admin interfaces have been reimplemented on top of the Kubernetes API, so users can still operate through the old interface. Existing users who are not used to submitting CRDs directly can keep the same operating experience as before through this mode.
4. The Pulsar Community
Finally, the state of the community. Tencent is a very important contributor to the Pulsar community. A key business scenario mentioned earlier is Tencent's billing platform: all of that traffic goes through Pulsar, including WeChat red envelopes and billing for many Tencent games. Tencent evaluated other systems at the time and made this trade-off because Pulsar offers strong consistency, good handling of data backlog, solid operability, and especially a cloud native architecture that eases the pain of operating very large clusters.
Another typical scenario is replacing Kafka in big data workloads. Large-cluster Kafka users share some common pain points, which a previous article summarized. Storage and compute are coupled, which complicates operations, and scaling degrades cluster performance. One headache in Kafka is rebalancing: whenever capacity is expanded or reduced, a rebalance is triggered to move topics from one node to another to re-even the data. Moving data can affect online services, because it consumes inter-cluster or network bandwidth, so responses to external requests may be delayed; data loss has occurred; and MirrorMaker has performance and stability problems. The biggest issue is the scaling cost just mentioned: BIGO found that scaling consumed a great deal of manpower, and for these reasons migrated from Kafka clusters to Pulsar clusters. Recently, together with partners at Tencent, they contributed a very important feature to Pulsar called KoP (Kafka-on-Pulsar), which parses the Kafka protocol on the broker side so that users can migrate at zero cost.
This figure mainly introduces some users of Functions. Many of their scenarios are lightweight, especially IoT: EMQ, for example, is an early user of Pulsar Functions, and Tuya Smart and Toyota are IoT scenarios with heavy use of functions in their applications.
One point worth noting in the community's growth is the acceleration from 2019 onward. This is a common phenomenon in open source: behind each open source community there tends to be a commercial company, and ours was founded in 2019. The motivation of a commercial company differs from that of Yahoo, which originally open-sourced Pulsar: Yahoo wanted more users to use Pulsar and help polish it, but had no strong incentive to maintain the community or spend energy developing features that attract new community users. That is precisely what a commercial company does: our commercialization and growth rely on the community. So after the company was founded, we invested heavily in communication and coordination among developers, helping them use Pulsar more conveniently and providing more features to meet user needs.
Finally, some community information. Anyone who wants to learn more is welcome to find Pulsar resources through these channels: rich video resources on Bilibili, the usual Apache mailing lists, and Slack, where there are more than 4,000 users, split roughly evenly between China and the US. On the right are the two WeChat official accounts we maintain, for the Pulsar community and for our company; anyone interested in Pulsar or in community work is welcome to scan the code for more information. That is the main content of today's sharing. Thank you for your time.
Zhai Jia is a co-founder of StreamNative and a Tencent Cloud TVP. Before that, he worked at EMC as technical director of EMC's real-time processing platform in Beijing, mainly engaged in the development of real-time computing and distributed storage systems. He continuously contributes code to open source projects such as Apache BookKeeper and Apache Pulsar, and is a PMC member and committer of both.