Blog recommendation | Chuanzhi Education x Pulsar: the future of Internet education

Time:2022-1-3

About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. Its architecture separates computing from storage, it supports multi-tenancy, persistent storage, and cross-region data replication across multiple data centers, and it offers stream storage characteristics such as strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

Introduction to Chuanzhi Education

Chuanzhi Education (formerly Chuanzhi Podcast) is an IT training company committed to cultivating high-quality software development talent. Its sub-brands include Dark Horse Programmer, Erudite Valley, Chuanzhihui, Kudingyu children’s programming, Chuanzhi Specialized College, Xuebang, and others.

Chuanzhi Education is the first education company to complete an A-share IPO. The company is committed to cultivating highly skilled digital talent, mainly digital professionals in fields such as artificial intelligence, big data, intelligent manufacturing, software, Internet, and blockchain, as well as digital application specialists in areas such as data analysis, online marketing, and new media.

To bring better educational resources to more students, Chuanzhi Education has opened 19 branches across the country and trained more than 300,000 IT practitioners. It has published 111 books, reaching more than 2 million college students nationwide; released 120,000 video tutorials, which are downloaded and played 40 million times a year on average; and held more than 1,500 free live public classes, with an average annual attendance of nearly one million.

Erudite Valley was officially established in July 2016. Building on Chuanzhi Education’s 15 years of experience in IT education, it centers on employment-oriented courses and adopts a personalized, on-demand, adaptive learning model to provide students with online IT learning services that combine beginner-level introduction, skill improvement, and career planning. It focuses on integrating high-quality IT teaching resources to create teaching products and services better suited to online learning.

Problems faced

In 2020, the epidemic brought great changes to our lives and work. Because of epidemic prevention and control measures, many offline courses could not be held as usual, and more users chose to build their knowledge and professional skills through online learning. Erudite Valley, which provides online teaching services, became the choice of a growing number of users. As user consultations and learning activity rose sharply, the load on Erudite Valley’s online system increased, which posed new challenges for the original system:

  • The original system only supports offline synchronization and responds slowly.
  • We need to synchronize the historical data collected by the original system, collect new data both offline and in real time, and run end-to-end data cleaning and aggregation analysis over all of the data.
  • Business tables are currently synchronized with Alibaba Cloud DTS (Data Transmission Service), which is expensive and cannot clean or transform data during synchronization.

Facing this growth in scale and change in business model, Erudite Valley needed a more flexible and efficient system to process rapidly growing business data, keep the business systems running normally, support changes to the business model, and make more use of data in decision analysis.

Why Pulsar?

We hoped to solve these challenges with message-oriented middleware. Our team members have experience with RabbitMQ and Kafka: RabbitMQ is better suited to lightweight scenarios, and Apache Kafka to scenarios with large volumes of logs. We needed a solution that covers our application scenarios more comprehensively and whose source code is approachable to read. During our investigation, we learned of another popular messaging system, Apache Pulsar. For the operations team, learning three kinds of message middleware carries a real cost, and infrastructure is hard to change once it is in place, so we researched the middleware selection for Chuanzhi Education thoroughly. Our evaluation criteria mainly included:

  • Support for message stream processing with guaranteed processing order
  • Support for exactly-once message processing semantics
  • Support for permanent message persistence, with storage that is easy to scale
  • Cloud-native-friendly deployment and low operation and maintenance cost
  • Good source code quality and an active community

We found that Pulsar is a cloud-native messaging and event streaming platform, and many of its built-in features meet our needs. For example, Pulsar separates computing from storage: data is stored on Apache BookKeeper, and pub/sub-related computation runs on the brokers, which provides I/O isolation. Compared with traditional messaging platforms such as Kafka, Pulsar’s architecture has clear advantages:

  • Brokers and bookies are independent of each other. They can scale and tolerate faults independently, which improves system availability.
  • Partitioned storage is not limited by the storage capacity of a single node, and data is distributed more evenly.
  • BookKeeper storage is safe and reliable, which ensures that messages are not lost. It also supports batched flushing to disk for higher throughput.
  • Read peaks do not affect write performance. Reads and writes use different physical storage, and persisting data becomes more convenient and cheaper.

From April to September 2020, we ran functional tests on Pulsar, covering ordered message consumption, data consistency, and loss rate. The results show that Pulsar consumes messages in order and without data loss. In scenarios where ordering does not matter, Pulsar can be used directly as a message queue. It offers multiple subscription modes, and the mode applies to the subscription rather than to the topic, so multiple consumers can consume the same topic in order or out of order at the same time.
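The difference between ordered and unordered consumption can be illustrated with a small simulation. This is a simplified model in plain Python, not Pulsar’s dispatcher: the `dispatch` function, the mode names, and the consumer names are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Assign messages to consumers under a simplified subscription model.

    "exclusive": a single consumer receives every message in publish order.
    "shared": messages are handed out round-robin, so no single consumer
    sees the full publish order.
    """
    received = defaultdict(list)
    if mode == "exclusive":
        if len(consumers) != 1:
            raise ValueError("an exclusive subscription allows one consumer")
        received[consumers[0]] = list(messages)
    elif mode == "shared":
        for message, consumer in zip(messages, cycle(consumers)):
            received[consumer].append(message)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return dict(received)


messages = ["m1", "m2", "m3", "m4"]
# Two independent subscriptions on the same topic: each sees every message.
ordered = dispatch(messages, ["analytics"], "exclusive")
balanced = dispatch(messages, ["worker-a", "worker-b"], "shared")
```

Because each subscription tracks its own position, an ordered consumer and a pool of load-balanced workers can read the same topic without affecting each other.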

In terms of operation and maintenance, we can use Kubernetes (Helm) to deploy Pulsar, Pulsar IO, and Pulsar Functions, and use pulsar-admin to reduce the deployment and management complexity for the operations team.

In a commercial company, adopting any new technology (including open source technology) carries some risk, even when the technology has significant advantages. After careful consideration and thorough research, we decided to introduce Apache Pulsar.

Practical application of Pulsar at Chuanzhi Education

As an online education platform, we need to exchange a lot of data with external systems. We use the third-party messaging system Ronglian Qimo to collect online customer service data, and the Zhuge IO system to collect user behavior data for analysis. We therefore need a system that aggregates this external data, processes it a second time, persists it into the data warehouse, and ultimately produces a data set suited to business analysis.

We built the Erudite Valley data processing system on Apache Pulsar, isolated the data and configuration of different applications through multiple namespaces, and implemented data collection and processing with Pulsar IO and Pulsar Functions. According to business needs, some namespaces are configured so that messages never expire and are retained permanently. Thanks to the separation of computing and storage in Pulsar, the system can expand its storage capacity flexibly. The Pulsar currently deployed in production is a modified version of the official v2.6.1 release; all of our fixes have been shared with the community through GitHub and will be included in future releases.

We build source clusters to collect multi-dimensional data and use Pulsar Functions to clean the collected data in real time. Throughout the pipeline, Pulsar topics use persistent storage, and Pulsar SQL[1] makes it easy to trace the data at each stage. Sink clusters persist the cleaned data.

In the pipeline above, we use a Pulsar delay topic to detect when a session has completed, and a dead letter topic records the messages that the sink side failed to consume.
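The dead letter pattern can be sketched in a few lines of plain Python. This is a minimal model under stated assumptions, not the Pulsar client API: `consume_with_dead_letter`, `write_to_sink`, and the retry limit are hypothetical names and values chosen for illustration.

```python
def consume_with_dead_letter(messages, handler, max_redeliveries=2):
    """Deliver each message up to 1 + max_redeliveries times.

    Messages that still fail are appended to the dead-letter list instead
    of being dropped, so they can be inspected or replayed later.
    """
    processed, dead_letter = [], []
    for message in messages:
        for _attempt in range(1 + max_redeliveries):
            try:
                handler(message)
            except Exception:
                continue  # treat as a negative acknowledgement and retry
            processed.append(message)
            break
        else:
            # Every attempt failed: route the message to the dead-letter list.
            dead_letter.append(message)
    return processed, dead_letter


def write_to_sink(record):
    # Hypothetical sink write that rejects records without an id.
    if "id" not in record:
        raise ValueError("missing id")


ok, failed = consume_with_dead_letter(
    [{"id": 1}, {"text": "no id"}, {"id": 2}], write_to_sink
)
```

Keeping failed messages on a separate topic lets the main pipeline keep moving while the failures remain available for diagnosis.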

During development, we found that in the real-time (ordered) streaming scenario, Pulsar Functions does not interrupt processing after receiving a failure response. We contacted the Pulsar community, submitted an issue and a PR, and received a quick response and support from the StreamNative team. The problem is currently marked to be fixed in Pulsar 2.8.0, and we have fixed it internally on top of Pulsar 2.6.1.

Online consultation lead analysis

The Erudite Valley system uses a third-party online customer service system to provide online consultation on the web and on mobile. Previously, our use of online consultation session data was constrained by the third-party service interface. As the business grew and the business model changed, the team wanted to combine this data with the customer management system (CMS) to better understand customer needs and improve the efficiency of consultation and feedback.

The third-party system exposes its data query interface to integrators through an HTTP API and rate-limits access to it, which restricts how the CMS can use session data.

After analysis and discussion, we designed and developed an HTTP polling source component and a common JDBC sink component based on Pulsar IO to capture session data efficiently into an internal MySQL database for persistent storage. We also support cleaning and transforming data during collection, which greatly improves how efficiently session data can be used and broadens the scenarios it can serve.

The HTTP polling source is a data collection source based on an HTTP polling mechanism. It repeatedly executes HTTP requests built from configuration templates, updates the offset in state storage after each request, and writes the request results to the downstream Pulsar topic.
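The request, offset update, and emit cycle described above might look roughly like the following. This is a sketch in plain Python rather than the actual connector: `poll_once`, `fake_fetch`, the `offset` key, and the in-memory `SESSIONS` data are all hypothetical stand-ins for the HTTP request, the state storage, and the third-party API.

```python
def poll_once(fetch, state, emit, page_size=2):
    """Run one polling cycle and return the number of records emitted.

    fetch(offset, limit) stands in for the rate-limited HTTP request,
    state is a dict standing in for the connector's state storage, and
    emit(record) stands in for publishing to the downstream topic.
    """
    offset = state.get("offset", 0)
    records = fetch(offset, page_size)
    for record in records:
        emit(record)
    # Persist the new offset only after the whole page has been emitted,
    # so a restart resumes from the last completed request.
    state["offset"] = offset + len(records)
    return len(records)


# Hypothetical in-memory stand-in for the third-party session API.
SESSIONS = [{"session": i} for i in range(5)]


def fake_fetch(offset, limit):
    return SESSIONS[offset:offset + limit]


state, topic = {}, []
while poll_once(fake_fetch, state, topic.append):
    pass  # keep polling until a request returns no new records
```

Storing the offset separately from the polling loop is what lets a restarted source continue where it stopped instead of re-reading the whole history through a rate-limited interface.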

The common JDBC sink uses the JDBC interface to persist structured object data and supports general-purpose structured data storage and processing across a variety of JDBC drivers. It covers all data types of the H2, MySQL, MariaDB, and PostgreSQL databases, and supports insert, update, delete, and schema migration operations.
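How a sink turns change records into database writes can be sketched with Python’s built-in `sqlite3` module standing in for a JDBC connection. The `apply_change` function, the operation names, and the `sessions` table are assumptions for illustration; the real component targets H2, MySQL, MariaDB, and PostgreSQL through JDBC and is not shown here.

```python
import sqlite3


def apply_change(conn, table, op, row, key="id"):
    """Apply one change record to a table.

    op is "upsert" (insert or update) or "delete"; row maps column -> value
    and must contain the key plus at least one other column for an upsert.
    Table and column names are interpolated, so they must come from trusted
    configuration; values are always bound as parameters.
    """
    if op == "delete":
        conn.execute(f"DELETE FROM {table} WHERE {key} = ?", (row[key],))
        return
    columns = list(row)
    placeholders = ", ".join("?" for _ in columns)
    updates = ", ".join(f"{c} = excluded.{c}" for c in columns if c != key)
    conn.execute(
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders}) "
        f"ON CONFLICT({key}) DO UPDATE SET {updates}",
        [row[c] for c in columns],
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, status TEXT)")
apply_change(conn, "sessions", "upsert", {"id": 1, "status": "open"})
apply_change(conn, "sessions", "upsert", {"id": 1, "status": "closed"})
apply_change(conn, "sessions", "upsert", {"id": 2, "status": "open"})
apply_change(conn, "sessions", "delete", {"id": 2})
rows = conn.execute("SELECT id, status FROM sessions ORDER BY id").fetchall()
```

Writing by key makes the sink safe to re-run: delivering the same record twice leaves the table in the same state.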

User interaction behavior collection

The Erudite Valley system uses a third-party system for client-side user behavior analysis. The behavior analysis features of this commercial system are limited, and its analysis dimensions are hard to combine with the concepts in our business system. We need user behavior data to generate more value so that we can serve customers better.

The commercial system provides a data subscription service based on an early version of Apache Kafka (v0.8), which Pulsar’s built-in Kafka source does not support. After evaluating the options, we wrapped a subscription program that supports Kafka v0.8 as a Pulsar IO source, the legacy Kafka source. It consumes the Kafka v0.8 log message source and efficiently saves the subscribed data to a Pulsar topic, which supports flexible downstream processing such as detecting abnormal behavior and evaluating learning outcomes.

Data change log collection

As the business system evolved, collecting business change logs gradually became a burden for the R&D team. At present, the team records the change history of business data in additional database tables, such as order change records and process flow records. Developers need to be familiar with the design of these tables and carefully adjust the logging logic whenever a table structure changes. To guarantee the integrity of key data, data changes and their logs must be written in the same transaction, which affects system performance.

With a MySQL binlog connector based on the MySQL replication protocol, data change events in the business database are synchronized to a Pulsar topic in real time, and Pulsar’s streaming message processing ensures that messages are processed downstream in order and exactly once. In this way, the data change log is generated automatically, DDL changes are migrated automatically, and downstream consumers persist the business log with a variety of storage backends (MySQL, Elasticsearch, and so on). This reduces intrusion into the business system code and the impact on business system performance.

The MySQL binlog connector has two components: the MySQL binlog source and the MySQL binlog sink. The MySQL binlog source collects the raw binlog event data, sends messages downstream in units of transactions, and saves the binlog filename/position or GTID set to state storage as the synchronization offset. The MySQL binlog sink replays the binlog event messages against the downstream database (again in units of transactions), which synchronizes DML and DDL changes to the downstream database instance.
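Grouping events into transaction-sized messages and saving the offset afterwards can be sketched as follows. The event dictionaries, their keys, and `group_transactions` are a hypothetical simplification; real binlog events have a different structure, and this is not the connector’s code.

```python
def group_transactions(events):
    """Yield (rows, offset) for each committed transaction.

    events is an iterable of dicts with a "type" key ("begin", "row" or
    "commit"); "commit" events carry the binlog "file" and "position".
    Rows from a transaction that never commits are not emitted.
    """
    rows = []
    for event in events:
        if event["type"] == "begin":
            rows = []
        elif event["type"] == "row":
            rows.append(event["data"])
        elif event["type"] == "commit":
            yield rows, (event["file"], event["position"])
            rows = []


events = [
    {"type": "begin"},
    {"type": "row", "data": {"order_id": 7, "status": "paid"}},
    {"type": "row", "data": {"order_id": 7, "amount": 99}},
    {"type": "commit", "file": "binlog.000001", "position": 420},
    {"type": "begin"},
    {"type": "row", "data": {"order_id": 8, "status": "new"}},
]

state, topic = {}, []
for rows, offset in group_transactions(events):
    topic.append(rows)        # one downstream message per transaction
    state["offset"] = offset  # saved only after the message is sent
```

Sending whole transactions keeps the downstream replay consistent: the sink never applies half of an order update, and the saved offset always points at a transaction boundary.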

Real-time data desensitization and synchronization

Data security has been a focus of the R&D team throughout the development of the data processing system. How to extract more value from data while ensuring that sensitive information is not accessed without authorization became an urgent problem. At present, our team uses Alibaba Cloud DTS or internal ETL tools to synchronize business data to an analytical (OLAP) database for data analysis, but these approaches cannot desensitize sensitive information during synchronization.

Building on our work on the data change log collection module, we designed and implemented a real-time data desensitization and synchronization scheme based on the MySQL binlog source. The scheme reads the binlog events saved in Pulsar topics, applies a desensitization function built on Pulsar Functions that selects the desensitization method through a rule engine, and then persists the desensitized data into the analytical database through the common JDBC sink. This improves the scalability and flexibility of the data synchronization scheme.
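Rule-based desensitization of a single row can be sketched in plain Python. The rule table, the table and column names, and the two masking functions below are hypothetical; in the scheme described above this logic would run inside a Pulsar Function, and the real rule engine is not shown.

```python
import hashlib
import re


def mask_phone(value):
    # Keep the first three and last four digits, hide the rest.
    return re.sub(r"^(\d{3})\d+(\d{4})$", r"\1****\2", value)


def hash_value(value):
    # One-way digest, so equal inputs still map to equal outputs.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical rule table: (table, column) -> masking function.
RULES = {
    ("student", "phone"): mask_phone,
    ("student", "id_card"): hash_value,
}


def desensitize(table, row):
    """Return a copy of row with every rule-matched column masked."""
    return {
        column: RULES[(table, column)](value)
        if (table, column) in RULES and value is not None
        else value
        for column, value in row.items()
    }


masked = desensitize(
    "student", {"name": "Li", "phone": "13812345678", "id_card": "X1"}
)
```

Keeping the rules in a lookup table separate from the masking functions means a new sensitive column can be covered by adding one entry rather than changing the processing code.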

With Pulsar, we solved the low collection efficiency and high latency of the original collection system and became compatible with different collection methods across multiple data sources. For synchronizing the production business databases, Pulsar replaces the costly DTS scheme and adds data desensitization to the pipeline, which ensures data security and lets the data analysis team use data better and more efficiently.

Future planning

Based on the overall plan for Chuanzhi Education’s information systems and the actual needs of Erudite Valley, we will continue to extract value from the data processing system and make better use of Apache Pulsar, an excellent messaging system, to support system operation and business development.

  • Simplify the development of business logging through the data change log collection scheme
  • Replace Alibaba Cloud DTS with the real-time data desensitization and synchronization scheme
  • Implement abnormal user behavior detection, learning outcome evaluation, and operation history playback
  • Build a cross-department data exchange system

Acknowledgments

Thanks to the Apache Pulsar community and the StreamNative team for their support. The construction and future development of the Erudite Valley data processing system would not be possible without the excellent contributions of the open source community. The Erudite Valley R&D team will continue to promote the use of Apache Pulsar in the company’s business systems, and will encourage team members to take part in more open source community activities and grow together with everyone.

Summary

While researching and using Pulsar, we made full use of native features such as Pulsar Functions and Pulsar IO, and also optimized parts of them to meet our requirements. As a next-generation cloud-native distributed messaging and streaming platform, Pulsar has a very active and growing community. In the future, we plan to build a multi-dimensional data flow rule engine on Pulsar, use Pulsar to build the basic middleware services of the group’s e-commerce platform, and expand the application scenarios of Pulsar at Chuanzhi Education.

About the authors

Sun Changyu, R&D director of Erudite Valley, Chuanzhi Education

Liu Zilin, infrastructure R&D engineer at Erudite Valley

References

[1] Pulsar SQL: https://pulsar.apache.org/doc…

[2] Official website of Chuanzhi Education: http://www.itcast.cn/

[3] Pulsar official documentation: https://pulsar.apache.org/doc…

[4] Debezium official website: https://debezium.io/

[5] Trino official website: https://trino.io/

[6] Binlog Connector: https://github.com/shyiko/mys…

[7] Ronglian Qimo: https://www.7moor.com/

[8] Zhuge IO: https://zhugeio.com/

[9] DTS: https://help.aliyun.com/produ…
