1、 Business background
Bigo's global audio and video services increasingly depend on real-time data. Data analysts want to see multi-dimensional business data for new and active users in real time, so they can grasp market trends as early as possible. Machine learning engineers want to receive users' browsing and click data in real time and quickly fold user preferences into the model through online learning, so that each user is pushed the content they find most interesting. App development engineers want to monitor app launch success rates and crash rates in real time. All of these capabilities depend on the real-time computing platform, and across the industry the shift to real-time is accelerating. This article introduces the experience and results of building Bigo's real-time computing platform on Flink.
2、 Introduction to the platform
The development of Bigo's real-time computing platform over the past few years can be roughly divided into two stages. Early real-time jobs ran on Spark. Starting in 2018, after comprehensively weighing the advantages of Flink over Spark Streaming, Bigo decided to switch the real-time computing platform to a technology route based on Flink. After nearly two years of development, the Bigo real-time computing platform has matured and now supports essentially all of the company's mainstream real-time computing scenarios. The following figure shows the architecture of the Bigo real-time computing platform.
The data sources for real-time computing fall into two categories: one is users' browsing and click behavior logs from the app or browser, which are collected through Kafka; the other is changes in relational databases caused by user behavior, whose binlogs are extracted by BDP into the real-time computation.
As the figure shows, the bottom layer of the Bigo real-time computing platform uses YARN for cluster resource management; YARN's distributed scheduling capability enables scheduling at large cluster scale. The computing engine of the real-time platform is customized and developed for Bigo's scenarios on top of open-source Flink. The top layer is BigoFlow, a one-stop development platform built by Bigo, where users can easily develop, debug, monitor, and operate jobs. BigoFlow provides complete SQL development capabilities, automatic monitoring configuration, and automatic log collection and query, so that a business job can often be completed with a single SQL statement. It has the following functions:
1. It provides a powerful SQL editor with syntax checking and automatic prompting.
2. It connects to all of the company's data sources and storage systems, eliminating custom integration work for business teams.
3. Logs are automatically collected into Elasticsearch (ES), so users can easily search and query them and quickly locate errors.
4. Key job metrics are automatically wired into the company's monitoring and alerting platform; users do not need to configure them themselves.
5. Resource usage of all jobs is collected and analyzed automatically, helping to identify and manage jobs with unreasonable resource consumption.
The results of real-time computation are written to different storage systems according to business requirements: ETL job results are usually stored in Hive; data that needs ad-hoc queries is usually put into ClickHouse; monitoring, alerting, and similar jobs can write their results directly to the alerting platform's Prometheus database for the alerting platform's direct use.
3、 Business application
As the real-time computing platform has developed, more and more scenarios have moved onto BigoFlow, and real-time computing has brought substantial benefits to them. Below, several typical scenarios illustrate the capability or performance improvements that real-time computing brings.
Real time ETL
Data extraction and transformation (ETL) is a typical real-time scenario. Users' behavior logs in the app and browser are generated continuously in real time; they must be collected, extracted, transformed, and loaded into the warehouse in real time. Bigo's previous ETL data path was usually Kafka > Flume > Hive. This Flume-based path had several problems:
1. Flume has poor fault tolerance, which may lead to data loss or duplication.
2. Flume scales poorly; it is hard to expand capacity immediately when traffic spikes.
3. Once data fields or formats change, Flume is hard to adjust flexibly.
Flink provides powerful state-based fault tolerance that supports end-to-end exactly-once semantics, and its parallelism can be adjusted flexibly. Flink SQL also makes it easy to adjust the transformation logic. For these reasons, most ETL scenarios have been migrated to the Flink architecture.
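To make the ETL pattern concrete, the core of such a job is a per-record transform from a raw Kafka JSON log to a flattened Hive-style row. The following Python sketch is purely illustrative (the field names `uid`, `event`, `ts`, `props` are hypothetical, not Bigo's actual schema); it also shows the dirty-data tolerance a production pipeline needs:

```python
import json

def etl_transform(raw):
    """Parse one Kafka message and flatten it into a Hive-style row.
    Returns None for malformed records so the pipeline can skip them
    instead of failing on dirty data."""
    try:
        log = json.loads(raw)
        return {
            "uid": str(log["uid"]),
            "event": log["event"],
            "event_time": int(log["ts"]),
            # Flatten nested properties into top-level columns.
            "country": log.get("props", {}).get("country", "unknown"),
        }
    except (ValueError, KeyError):
        return None

rows = [etl_transform(m) for m in [
    b'{"uid": 1, "event": "click", "ts": 1600000000, "props": {"country": "SG"}}',
    b'not-json',
]]
```

In a real Flink job this transform would run inside a map operator, with checkpointed state providing the exactly-once guarantee mentioned above.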
Real time statistics
As a company with multiple app products, Bigo needs a large number of statistical indicators to reflect daily active users, revenue, and other product metrics. Traditionally, these indicators were calculated daily or hourly by offline Spark jobs. Offline computing makes it hard to guarantee timely data delivery, and delays in important indicators occurred frequently. We have therefore been gradually moving important indicators to real-time computation, which greatly improves the timeliness of data delivery. The most notable case was an important indicator whose frequent delays pushed its downstream outputs into the afternoon, causing considerable trouble for data analysts. After being converted to a real-time link, the final indicator is available by 7:00 a.m., ready for analysts when they start work.
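The kind of indicator involved can be sketched as a tumbling event-time window aggregation, which in Flink would run continuously instead of once a day. A toy Python stand-in (window size and event shape are illustrative assumptions):

```python
from collections import defaultdict

WINDOW = 3600  # 1-hour tumbling windows, keyed by event time (seconds)

def hourly_active_users(events):
    """events: iterable of (user_id, event_ts) pairs.
    Returns {window_start: distinct_user_count}, a toy stand-in for a
    Flink tumbling event-time window with a COUNT(DISTINCT uid) aggregate."""
    windows = defaultdict(set)
    for uid, ts in events:
        windows[ts - ts % WINDOW].add(uid)  # assign event to its window
    return {w: len(users) for w, users in windows.items()}

stats = hourly_active_users([(1, 10), (2, 20), (1, 30), (3, 3700)])
# window 0 saw users {1, 2}; window 3600 saw user {3}
```

The streaming version emits each window's result as soon as its watermark passes, which is what moves indicator delivery from the afternoon to early morning.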
Online machine learning
With the explosive growth of information, users' interests shift faster and faster, which requires machine learning to recommend videos based on what the user is doing right now. Traditional machine learning is based on batch processing and at best updates the model at hourly granularity. Today, sample training based on real-time computation continuously feeds samples into the model, truly realizing online learning and updating recommendations at minute-level granularity based on user behavior. Machine learning jobs now account for more than 50% of the real-time computing cluster.
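The essential difference from batch training is that the model is updated one sample at a time as events arrive. A minimal sketch of such a per-sample update, using logistic-regression SGD as a generic example (not Bigo's actual trainer; features, labels, and the learning rate are made up):

```python
import math

def sgd_update(weights, features, label, lr=0.1):
    """One online-learning step: a logistic-regression SGD update on a
    single sample, the kind of per-record refresh a streaming training
    job applies as behavior events arrive."""
    z = sum(w * x for w, x in zip(weights, features))
    pred = 1.0 / (1.0 + math.exp(-z))       # current model's prediction
    grad = pred - label                      # gradient of the log loss
    return [w - lr * grad * x for w, x in zip(weights, features)]

w = [0.0, 0.0]
for feats, y in [([1.0, 0.5], 1), ([0.2, 1.0], 0), ([1.0, 0.4], 1)]:
    w = sgd_update(w, feats, y)  # model refreshed per sample, no batch wait
```

Because each update touches only one sample, the serving model can be pushed out at minute granularity rather than waiting for an hourly batch job.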
Real time monitoring
Real-time monitoring is another very important real-time scenario. App developers need to monitor app launch success rate and other indicators in real time, and be alerted promptly when anomalies occur. Previously, the raw data was stored in Hive or ClickHouse; based on rules configured in a Grafana-based monitoring platform, Presto or ClickHouse was queried at fixed intervals, and the computed results determined whether an alert was needed. This approach has several problems:
1. Although Presto and ClickHouse are OLAP engines with good performance, they do not guarantee cluster high availability or real-time delivery, and monitoring requires both.
2. Each time an indicator is calculated, all of the day's data must be scanned again, which wastes a great deal of computation.
With a monitoring scheme based on real-time computation, indicators are calculated incrementally in real time and written directly to Grafana's database, which both ensures real-time delivery and reduces the amount of data computed by thousands of times.
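The saving comes from keeping running aggregates instead of re-scanning the day's data on every rule evaluation. An illustrative sketch (the metric name and alert threshold are invented for the example):

```python
class SuccessRateMonitor:
    """Incremental success-rate aggregator: keeps running counters so the
    current rate can be emitted to the dashboard at any moment, instead of
    re-querying the whole day's raw data on every evaluation."""
    def __init__(self, alert_below=0.95):
        self.ok = 0
        self.total = 0
        self.alert_below = alert_below

    def record(self, success):
        """Process one app-launch event as it arrives from the stream."""
        self.total += 1
        if success:
            self.ok += 1

    def evaluate(self):
        """Return (current rate, whether the alert should fire)."""
        rate = self.ok / self.total if self.total else 1.0
        return rate, rate < self.alert_below

m = SuccessRateMonitor()
for outcome in [True, True, False, True]:
    m.record(outcome)
rate, should_alert = m.evaluate()  # 3/4 = 0.75, below 0.95, so alert fires
```

Each event is processed exactly once; the periodic Presto/ClickHouse full scans disappear entirely.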
4、 Bigo real time platform features
In the course of its development, the Bigo real-time computing platform has formed its own characteristics and advantages based on how Bigo's internal businesses use it. These are mainly reflected in the following aspects:
Unified metadata integration
A common situation is that data producers and consumers are not the same group. Data is reported to Kafka or Hive by the teams responsible for it, while data analysts use that data for computation; the analysts often do not know the Kafka details, only the Hive table name. To reduce the friction of using real-time computing, BigoFlow integrates its metadata with Kafka, Hive, ClickHouse, and other storage systems. Users can reference Hive and ClickHouse tables directly in their jobs without writing DDL; BigoFlow parses the references and, using the metadata, automatically converts them into Flink DDL statements, greatly reducing users' development work. This benefit comes from Bigo's unified planning of its computing platforms, and is hard to achieve in companies where offline and real-time systems are run separately.
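Mechanically, this means rendering a Flink `CREATE TABLE` statement from catalog metadata. A simplified Python sketch of such a generator (the table, columns, topic, and broker address are invented; the connector options shown are standard Flink Kafka connector options, but a real implementation would derive many more from the catalog):

```python
def hive_meta_to_flink_ddl(table, columns, kafka_topic, brokers):
    """Render a Flink CREATE TABLE statement from catalog metadata,
    sketching the kind of automatic DDL generation described above."""
    type_map = {"string": "STRING", "bigint": "BIGINT", "double": "DOUBLE"}
    cols = ",\n  ".join(f"`{name}` {type_map[t]}" for name, t in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {cols}\n) WITH (\n"
        f"  'connector' = 'kafka',\n"
        f"  'topic' = '{kafka_topic}',\n"
        f"  'properties.bootstrap.servers' = '{brokers}',\n"
        f"  'format' = 'json'\n)"
    )

ddl = hive_meta_to_flink_ddl(
    "user_click", [("uid", "bigint"), ("event", "string")],
    "user_click_topic", "kafka:9092")
```

The user writes only the query against `user_click`; the platform injects the generated DDL before submitting the job.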
End to end product solution
BigoFlow is not only a real-time computing platform; to make adoption and migration easier, it also provides end-to-end solutions for specific business scenarios. Take the monitoring scenario described above: users had many monitoring services to migrate. To minimize that work, BigoFlow provides a dedicated solution for monitoring. Users only need to migrate the SQL that calculates the monitoring indicators to Flink SQL; everything else, including the Flink job's DDL and sinking the data to the monitoring platform, is implemented automatically by BigoFlow, and users' existing alert rules need no changes. This lets users complete the migration with minimal effort.
In addition, as mentioned above, BigoFlow automatically adds alerts on key job metrics, which meets the needs of the vast majority of users and lets them focus on business logic without worrying about anything else. Users' logs are also automatically collected into ES for viewing, where saved queries summarize common troubleshooting questions, so a user can run the matching query directly based on the symptom they see.
Strong Hive capability
Since most of Bigo's data is stored in Hive, real-time jobs often need to write their results to Hive, and many scenarios also need to read data from Hive. The integration between BigoFlow and Hive has therefore always been at the forefront of the industry. Before community version 1.11 was officially released, Bigo had already implemented writing data to Hive with dynamic metadata updates. On top of 1.11, we developed streaming reads of Hive tables with event-time support, dynamic partition filtering, and TXT-format compression, all ahead of the open-source community.
This is a unified batch-stream scenario implemented with Flink for ABTest. Under normal circumstances, Flink consumes real-time data from Kafka and stores the computed results in Hive. However, the business often adjusts its logic and then needs to replay historical data to reconcile results. Because of the large data volume, replaying from Kafka would put great pressure on Kafka and affect online stability. Since a copy of the data is already stored in Hive, we read from Hive when replaying. In this way, the same code runs both offline and online, minimizing the impact of data replay on production.
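The key property is that the business logic is source-agnostic: the same transform consumes either the live stream or the stored copy. A toy Python sketch (both sources and their records are stand-ins for the real Kafka topic and Hive table):

```python
def compute_metrics(records):
    """Shared business logic: count events per experiment group.
    Identical whether records stream in live or are replayed from storage."""
    counts = {}
    for group, event in records:
        counts[group] = counts.get(group, 0) + 1
    return counts

def kafka_source():   # stand-in for the online Kafka stream
    yield from [("A", "click"), ("B", "click"), ("A", "view")]

def hive_source():    # stand-in for replaying the copy stored in Hive
    yield from [("A", "click"), ("B", "click"), ("A", "view")]

online = compute_metrics(kafka_source())
replay = compute_metrics(hive_source())   # same code path, same result
```

In Flink this corresponds to swapping the source table under an unchanged SQL query, which is why replay does not touch Kafka at all.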
Automatic ETL job generation
Flink now handles most ETL scenarios. The logic of an ETL job is generally simple, but there are many such jobs, and the format of the data users report changes frequently, with fields added or removed. To reduce the cost of developing and maintaining ETL jobs, we built a function that generates ETL jobs automatically. Users only need to provide the topic and format of the reported data; an ETL job is generated automatically and its results written to Hive. When the reported data's format or fields change, the job is also updated automatically. Various data formats such as JSON and Protobuf are currently supported.
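Automatic job updates hinge on detecting schema changes in the reported data. A minimal sketch of that trigger (the schemas here are invented examples):

```python
def diff_schema(old, new):
    """Detect fields added or removed between two reported schemas; this is
    the trigger an auto-generated ETL job would use to rebuild its column
    list and redeploy itself."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    return added, removed

old = {"uid": "bigint", "event": "string"}
new = {"uid": "bigint", "event": "string", "country": "string"}
added, removed = diff_schema(old, new)
if added or removed:
    pass  # regenerate the job's field mapping and Hive columns, then redeploy
```

Because the job is generated rather than hand-written, a schema change only requires regenerating it from the new metadata, with no manual maintenance.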
5、 Future outlook
With the rapid development of Bigo's business, the BigoFlow real-time computing platform keeps growing and improving, but there is still much room for improvement. In the future, Bigo will focus on two aspects: platform improvement and business support.
Platform improvement: focus on raising the product maturity of the platform. This mainly includes: developing automatic resource configuration and automatic tuning, so that a job's resources are configured automatically according to its real-time data volume, scaling out at traffic peaks and scaling in at troughs; supporting table-level lineage display, so users can analyze dependencies between jobs; and supporting multiple clusters across regions, since many key businesses on Flink need high SLA guarantees, whose reliability we will ensure through multi-region, multi-datacenter deployment. We will also explore scenarios such as batch-stream unification and data lakes.
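The planned automatic resource configuration boils down to sizing parallelism from observed input rate. A toy heuristic of the kind such a feature might use (the capacity figures and bounds are invented for illustration, not platform values):

```python
import math

def suggest_parallelism(records_per_sec, per_task_capacity=5000,
                        min_p=1, max_p=256):
    """Toy autoscaling heuristic: derive job parallelism from the observed
    input rate, clamped to sane bounds. A real implementation would also
    consider state size, backpressure, and rescaling cost."""
    need = math.ceil(records_per_sec / per_task_capacity)
    return max(min_p, min(max_p, need))

peak = suggest_parallelism(120_000)   # traffic peak: scale out
trough = suggest_parallelism(2_000)   # low traffic: scale in
```

Hysteresis (only rescaling when the suggestion changes substantially) would be needed in practice to avoid thrashing on noisy traffic.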
Support more business scenarios: develop more machine learning and real-time data warehouse scenarios, and further promote the adoption of Flink SQL.
6、 Team profile
The Bigo big data team focuses on fast analysis of massive data. It is responsible for building, for all of the company's businesses, EB-scale distributed file storage, message queues averaging trillions of messages per day, and 50 PB-scale big data computing, covering batch, stream, MPP, and other computing architectures, and spanning the full technology stack from data definition, pipelines, storage, and computation to data warehousing and BI. The team has a strong engineering culture with many open-source contributors, and looks forward to excellent talents joining us!
This article comes from Bigo Technology's WeMedia account.