Diversified Exploration and Practice of Apache Flink at Bilibili

Time: 2021-07-24

This article is shared by Zheng Zhisheng, head of the big data real-time platform at Bilibili. The core of this sharing is to explain how the trillion-scale transmission and distribution architecture was implemented, and how a complete real-time preprocessing pipeline was built on Flink for the AI field. The sharing covers the following four aspects:

1、 The past and present of real-time computing at Bilibili

2、 Flink on YARN incremental pipeline solution

3、 Some engineering practices in Flink and AI

4、 Future development and thinking

1、 The past and present of real-time computing at Bilibili

1. Scenario coverage across the ecosystem

Speaking of the future of real-time computing, the keyword is data freshness. Looking at the evolution of the whole big data ecosystem, the core scenarios it covers have shifted: in the early stage of big data, the core was offline computing at day granularity. At that time, most data freshness was measured in days, and the focus was on balancing time and cost.

As data applications, data analysis and data warehouses became more widespread and mature, more and more users raised higher requirements on data freshness. For example, for real-time recommendations, the freshness of the data determines its value. This is the context in which real-time computing scenarios were born.

In actual operation, however, we also encountered many scenarios that do not really require very high freshness, so new scenarios inevitably appear between the millisecond/second level and the day level: incremental computing, mostly at minute granularity. Offline computing pays more attention to cost; real-time computing pays more attention to value and freshness; incremental computing balances cost against overall value and time.

2. Data timeliness at Bilibili

How is Bilibili divided across these three dimensions? At present, about 75% of the data is supported by offline computing, another 20% of the scenarios are handled by real-time computing, and 5% by incremental computing.

  • Real-time computing mainly serves real-time machine learning, real-time recommendation, advertising and search, data applications, real-time channel analysis and delivery, reporting, OLAP, monitoring, etc.;
  • Offline computing covers the widest range of data, mainly the data warehouse;
  • Incremental computing serves some new scenarios launched this year, such as the binlog incremental upsert scenario.

3. Poor ETL timeliness

In fact, we encountered many timeliness pain points in the early stage, focused on three aspects:

  • First, the transmission pipeline lacked computing power. In the early scheme, data basically landed in ODS by day, and the DW layer scanned all of the previous day's ODS data, which means the data could not be pre-cleaned during transmission;
  • Second, jobs were highly concentrated: a large number of jobs burst out right after midnight, putting great pressure on resource scheduling;
  • Third, the gap between real-time and offline was hard to bridge, because for most data pure real-time is too costly while pure offline is not fresh enough. At the same time, the timeliness of loading MySQL data into the warehouse was insufficient. For example, the volume of Bilibili's danmaku (bullet-comment) data is huge; synchronizing this business table often took more than ten hours and was very unstable.

4. AI real-time engineering is complex

Besides timeliness, we also ran into the complexity of AI real-time engineering in the early stage:

  • The first is the computing efficiency of feature engineering. The same real-time computing scenario also needs data backtracking in the offline scenario, so the computing logic was developed repeatedly;
  • Second, the whole real-time link is very long. A complete real-time recommendation link is composed of N real-time jobs and M offline jobs; when a problem is found, the operation, maintenance and governance cost of the whole link is very high;
  • Third, as more AI and algorithm engineers joined, it was difficult to scale experiment iteration horizontally.

5. Ecological practice based on Flink

Against these key pain points, we focused on Flink ecological practice, covering the real-time data warehouse, the incremental ETL pipeline, and machine-learning scenarios for AI. This sharing focuses more on the incremental pipeline and on Flink plus AI. The figure below shows the overall scale: at the trillion-message scale, transmission and computing currently involve 30,000+ compute cores, 1,000+ jobs and more than 100 users.

2、 Flink on YARN incremental pipeline solution

1. Early architecture

Let's take a look at the early architecture of the pipeline. As can be seen from the figure below, data was mainly carried by Flume, which consumed Kafka and then wrote to HDFS. Flume used its transaction mechanism to ensure consistency from source to channel to sink. After the data landed on HDFS, the downstream scheduler judged whether the data was ready by scanning for TMP files in the directory, and then scheduled and launched the downstream offline ETL jobs.

2. Pain points

Many pain points were encountered in the early stage:

  • The first is data quality.

    • MemoryChannel was used first, which could lose data. We then tried FileChannel, but its performance could not meet the requirements. In addition, when HDFS was unstable, Flume's transaction mechanism rolled data back into the channel, which led to repeated data duplication; when HDFS was extremely unstable, the duplication rate could reach the percent level;
    • LZO row storage. In the early stage, records were transmitted as delimiter-separated text; such a delimiter-based schema is only weakly constrained and does not support nested formats.
  • The second point is data timeliness: the pipeline could not provide minute-level queries, because Flume has no checkpoint-based file-cutting mechanism like Flink's and instead controls file closing through an idle mechanism;
  • The third point is downstream ETL linkage. As mentioned earlier, readiness was detected mainly by scanning the TMP directory, so the scheduler made a large number of Hadoop list API calls to the NameNode, putting great pressure on the NameNode.

3. Stability-related pain points

There are also many problems in stability:

  • First, Flume is stateless; after a node crashed or restarted, TMP files could not be closed normally;
  • Second, in the early stage Flume was not deployed on the big data environment but on physical machines, which made resource scaling hard to control and relatively costly;
  • Third, Flume and HDFS had communication problems. For example, when writing to HDFS was blocked, the blockage of one node back-pressured to the channel, so the source stopped consuming from Kafka and stopped advancing its offsets; to some extent this triggered Kafka rebalancing and finally caused the global offsets to stop moving forward, resulting in data backlog.

4. Trillion incremental pipeline DAG view

Given the above pain points, the core solution was to build a trillion-scale incremental pipeline based on Flink. The figure below is the DAG view of the whole runtime.

Firstly, under the Flink architecture, the Kafka source eliminates rebalance avalanches: even if writing to HDFS blocks at some parallel instance in the DAG, it does not block all Kafka partitions globally. In addition, the essence of the whole solution is to implement extensible nodes through transform modules.

  • The first layer of nodes is the parser, which mainly performs parsing operations such as data decompression and deserialization;
  • The second layer is the customized ETL module provided to users, which allows customized data cleaning inside the pipeline;
  • The third layer is the exporter module, which supports exporting data to different storage media. For example, when writing to HDFS it exports Parquet; when writing to Kafka it exports Protobuf. At the same time, a ConfigBroadcast module is introduced on the whole DAG link to solve real-time updating and hot loading of pipeline metadata. Throughout the link, a checkpoint is performed every minute to append the incremental data, so minute-level queries can be provided. A minimal sketch of this three-layer skeleton is shown after this list.
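Below is a minimal, illustrative PyFlink sketch of the parser → custom ETL → exporter skeleton described above. It assumes PyFlink with the legacy FlinkKafkaConsumer connector (and its Kafka connector jar) available; the topic name, broker address and the trivial parse/ETL functions are placeholders, not the production logic.

```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60 * 1000)  # checkpoint every minute -> minute-level visibility downstream

# layer 0: Kafka source (illustrative topic/broker)
source = FlinkKafkaConsumer(
    "raw_events",
    SimpleStringSchema(),
    {"bootstrap.servers": "kafka:9092", "group.id": "incremental-pipeline"},
)

def parse(raw):
    # layer 1 (parser): decompression / deserialization would happen here
    return raw.strip()

def user_etl(record):
    # layer 2 (custom ETL): user-defined cleaning logic plugged into the pipeline
    return record

stream = env.add_source(source).map(parse).map(user_etl)
# layer 3 (exporter): the real job writes Parquet to HDFS or Protobuf to Kafka; print() is a stand-in
stream.print()
env.execute("incremental-pipeline-sketch")
```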

5. Overall view of the trillion-scale incremental pipeline

The overall view is divided by tracking-event terminal: each Kafka topic represents the distribution of one data terminal, and a dedicated Flink job handles the writes for each terminal type. The view also shows that a pipeline has likewise been assembled for binlog data, and that one pipeline can be operated by multiple nodes.

6. Technical highlights

Next, let's look at the core technical highlights of the whole architecture. The first three are features at the real-time functional level, and the last three are mainly optimizations at the non-functional level.

  • For the data model, format convergence is realized mainly through Parquet and the mapping from Protobuf to Parquet;
  • Partition notification: since one pipeline actually processes multiple streams, the core is a partition-ready notification mechanism for multi-stream data;
  • The CDC pipeline uses binlog and Hudi to solve the upsert problem;
  • Small files: file merging is solved through the DAG topology at runtime;
  • HDFS communication: optimization of many key problems at the trillion scale;
  • Finally, some optimizations for partition fault tolerance.

6.1 Data model

In the early stage, business development assembled data records for reporting mainly by concatenating strings. Later, data is organized through the definition and management of models: the platform entrance lets users register each stream and table, and the schema generates Protobuf files. Users can download the model files corresponding to the Protobuf from the platform, so client-side development is constrained by the strong schema derived from the Pb definition.

Let's look at the runtime process. First, the Kafka source consumes each RawEvent record transmitted from upstream; a RawEvent contains a PBEvent object, and each PBEvent is a Protobuf record. The data flows from the source to the parser module, and after parsing a PBEvent is formed. The schema models that users register on the platform are stored on the OSS object store, and the exporter module dynamically loads model changes. The Pb file is then used to reflectively generate the concrete event object, which can be mapped into Parquet format. A lot of caching of this reflection work has been done here, which improved the dynamic Pb parsing performance roughly six-fold. Finally, the data lands on HDFS in Parquet format.
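The following is a minimal sketch, in plain Python, of the caching idea behind that six-fold speed-up: the schema-reflection work is done once per schema version instead of once per record. The model store and the decode step are stubbed stand-ins, not the real Protobuf reflection code.

```python
from functools import lru_cache

# Stand-ins for the platform's model store and the Protobuf decode step (illustrative only).
MODEL_STORE = {"v1": ["uid", "event_time", "action"]}

def decode_record(raw_bytes, fields):
    # placeholder: the real code reflects the generated Protobuf class from the Pb file
    values = raw_bytes.decode("utf-8").split("|")
    return dict(zip(fields, values))

@lru_cache(maxsize=128)
def event_parser(schema_version):
    """Built once per schema version; caching this reflection step is what avoids
    re-resolving the schema for every single record."""
    fields = MODEL_STORE[schema_version]
    def parse(raw_bytes):
        decoded = decode_record(raw_bytes, fields)
        return {f: decoded.get(f) for f in fields}  # flat dict maps 1:1 to Parquet columns
    return parse

print(event_parser("v1")(b"42|1626000000000|play"))
```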

6.2 Partition notification optimization

As mentioned earlier, one pipeline processes hundreds of streams. In the early Flume architecture, it was difficult for each Flume node to sense its own processing progress, let alone the global progress. With Flink, this can be solved through the watermark mechanism.

Firstly, the source generates a watermark based on the event time in the message, and the watermark is passed through each processing layer down to the sink. Finally, a committer module aggregates the watermark progress of all streams in a single thread. When it finds that the global watermark has advanced into the next hour's partition, it sends a message to the Hive Metastore, or writes to Kafka, to announce that the previous hour's partition data is ready, so the downstream scheduler can launch jobs faster in a message-driven manner.
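A minimal sketch of that committer logic follows, assuming hour partitions and a pluggable notify callback (an update to the Hive Metastore or a Kafka message in practice); the class and parameter names are illustrative.

```python
HOUR_MS = 3600 * 1000

class PartitionCommitter:
    def __init__(self, expected_streams, notify):
        self.expected_streams = expected_streams
        self.watermarks = {}        # stream name -> latest watermark (epoch ms)
        self.committed_hour = None  # last hour partition already announced
        self.notify = notify        # e.g. update Hive Metastore or publish a Kafka message

    def on_watermark(self, stream, watermark_ms):
        self.watermarks[stream] = max(self.watermarks.get(stream, 0), watermark_ms)
        if len(self.watermarks) < self.expected_streams:
            return                                  # wait until every stream has reported once
        global_wm = min(self.watermarks.values())   # global progress = slowest stream
        ready_hour = global_wm // HOUR_MS - 1       # the hour whose data is now complete
        if self.committed_hour is None:
            self.committed_hour = ready_hour        # don't back-fill old partitions on start-up
        while self.committed_hour < ready_hour:
            self.committed_hour += 1
            self.notify(self.committed_hour)        # downstream scheduler reacts to this message

committer = PartitionCommitter(expected_streams=1, notify=lambda h: print("partition ready:", h))
committer.on_watermark("stream_a", 1626000000000)
committer.on_watermark("stream_a", 1626000000000 + HOUR_MS)
```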

6.3 CDC pipeline optimization

The right side of the figure below is the complete link of the CDC pipeline. Realizing a complete mapping from MySQL data to Hive data requires solving both stream and batch processing.

First, the full MySQL data is synchronized to HDFS once through DataX. Next, a Spark job initializes the data into the initial Hudi snapshot; Canal then pulls MySQL binlog data into a Kafka topic; a Flink job applies the incremental data onto the initial snapshot, finally forming the Hudi table.

The whole link must guarantee that data is neither lost nor duplicated. The key point is that the Kafka producer used by Canal has its transaction mechanism enabled, ensuring that data lands in the Kafka topic without loss or duplication during transmission. Duplication and loss may still occur in the hops above that, and this is handled with a globally unique ID plus a millisecond timestamp: in the streaming job, records are de-duplicated on the global ID and ordered by the millisecond timestamp, which ensures the data is applied to Hudi in an orderly manner, as sketched below.
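Here is a minimal, illustrative sketch of that de-duplication/ordering step. In the real job the per-ID state lives in Flink's state backend; a plain dict stands in for it here, and the record field names are assumptions.

```python
class UpsertDeduplicator:
    """Keep only the newest change per global record ID, so replayed or duplicated
    binlog events cannot overwrite fresher state in the Hudi table."""
    def __init__(self):
        self.latest_ts = {}   # global record id -> newest millisecond timestamp seen

    def process(self, record):
        rid, ts = record["id"], record["ts_ms"]
        if ts <= self.latest_ts.get(rid, -1):
            return None                # stale or duplicated change: drop it
        self.latest_ts[rid] = ts
        return record                  # forward to the Hudi upsert writer

dedup = UpsertDeduplicator()
events = [
    {"id": "u1", "ts_ms": 100, "op": "insert"},
    {"id": "u1", "ts_ms": 100, "op": "insert"},   # duplicate from an at-least-once hop
    {"id": "u1", "ts_ms": 250, "op": "update"},
]
print([e for e in events if dedup.process(e)])    # only the two distinct, ordered changes survive
```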

In addition, a trace system based on ClickHouse stores the number of records entering and leaving each node, so that the data can be reconciled precisely.

6.4 Stability – merging small files

As mentioned earlier, after the switch to Flink we checkpoint every minute, and the number of files was amplified severely. The solution is to introduce a merge operator into the DAG to merge files. The merge is horizontal, based on parallelism, with one merger corresponding to one writer. In this way, the twelve files that the five-minute checkpoints produce within an hour are merged, which keeps the number of files within a reasonable range.
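As a small aside, the sketch below shows one way to merge an hour's worth of small Parquet files into a single file using pyarrow. This is only an offline illustration of the effect of the merge, not the in-DAG merge operator itself, and it assumes all small files share the same schema.

```python
import pyarrow as pa
import pyarrow.parquet as pq

def merge_hour(small_files, merged_path):
    """Merge one writer's small Parquet files for an hour (e.g. twelve 5-minute files)
    into a single Parquet file."""
    tables = [pq.read_table(path) for path in sorted(small_files)]
    pq.write_table(pa.concat_tables(tables), merged_path)
    return merged_path

# merge_hour(["part-000.parquet", "part-001.parquet"], "hour=13/merged.parquet")
```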

6.5 HDFS communication

In actual operation, we often encountered serious backlog in the job; analysis showed that it was mainly related to HDFS communication.

In fact, HDFS communication involves four key steps: initializing state, invoke, snapshot and notify checkpoint complete.

The core problem occurs mainly in the invoke phase: when invoke reaches the file-rolling condition, it triggers flush and close, and the close call, which actually communicates with the NameNode, is often blocked.

There is also a problem in the snapshot stage: once the hundreds of streams in a pipeline trigger a snapshot, the serial execution of flush and close is very slow.

Core optimization focuses on three aspects:

  • First, reduce file cutting, i.e. the frequency of close. In the snapshot phase, files are no longer closed; instead the same file keeps being appended to. The price is that in the initialize-state phase, the file must be truncated for recovery;
  • Second, close is made asynchronous, so the close action no longer blocks the processing of the whole link. Files closed during invoke and snapshot are tracked in state, and recovered through initialize state;
  • Third, for multiple streams, the snapshot is also parallelized. With a checkpoint every 5 minutes, multiple streams mean multiple buckets, which used to be processed serially in a loop; after a multi-threaded change, checkpoint timeouts are greatly reduced (a sketch of this parallel flush follows this list).
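The sketch below illustrates the third point: flushing the buckets of many streams in parallel during the snapshot instead of serially. The flush function and parallelism are illustrative parameters, not the values used in production.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def snapshot_buckets(buckets, flush_fn, parallelism=16):
    """buckets: in-flight file handles, one or more per stream/bucket.
    flush_fn: callable that flushes a single bucket (the slow HDFS interaction)."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(flush_fn, bucket) for bucket in buckets]
        done, _ = wait(futures)
        for f in done:
            f.result()   # surface any flush failure so the checkpoint fails instead of silently losing data

# snapshot_buckets(open_buckets, flush_fn=lambda b: b.flush())
```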

6.6 Partition fault-tolerance optimizations

When there are multiple streams in a pipeline, the data of some streams is not continuous in every hour.

This leads to partitions whose watermark cannot advance normally, causing the empty-partition problem. So during pipeline operation we introduced a PartitionRecover module, which advances the partition notification according to the watermark. If a stream's watermark has not been updated within the idle timeout, the recover module adds the partition on its behalf. As the end of each partition approaches, it adds a delay and then scans the watermarks of all streams, acting as a final fallback.

In addition, when the Flink job restarts during transmission, a wave of zombie files appears. We clean up and delete these zombie files at the DAG's commit node, before the partition notification is sent. These belong to optimizations at the non-functional level.

3、 Some engineering practices in Flink and AI

1. Architecture evolution timeline

The following figure shows the complete timeline of the real-time architecture in the AI direction.

  • As early as 2018, algorithm engineers' experiment development was workshop-style: each engineer picked a language they were familiar with — Python, PHP or C++ — to develop different experimental projects. Maintenance costs were very high and failures were common;
  • In the first half of 2019, engineering support for algorithms was mainly based on the Flink jar-package mode; in that half year the focus was more on stability and generality;
  • In the second half of 2019, the self-developed BSQL greatly lowered the threshold of model-training development and solved the real-time generation of labels and instances, improving the efficiency of experiment iteration;
  • In the first half of 2020, the work revolved around feature computation, unifying stream and batch computation and improving feature-engineering efficiency;
  • In the second half of 2020, the focus moved to the workflow of experiments and the introduction of AIFlow to support stream-batch DAGs.

2. AI Engineering Architecture Review

Looking back at the AI engineering as a whole, the early architecture diagram reflects the AI architecture view at the beginning of 2019. Its essence was to support the model-training link with isolated tasks and computing nodes written in a mixture of languages. After the 2019 iteration, the whole near-line training was completely replaced by the BSQL mode for development and iteration.

3. Current pain points

At the end of 2019, we actually encountered some new problems, which mainly focused on the functional and non functional dimensions.

  • At the functional level:

    • Firstly, the whole link from label and instance streams to model training, online prediction and even the real experiment effect is very long and complex;
    • Second, integrating real-time features, offline features and stream-batch computation involves a large number of jobs, so the whole link is very complex. Meanwhile, features have to be computed both for experiments and for online serving, and inconsistent results lead to problems in the final effect. In addition, it is hard to find where a feature lives, and there is no way to trace it.

  • At the non-functional level, algorithm students often run into problems: they don't know what a checkpoint is, whether to enable it, or how to configure it. In addition, online problems are not easy to troubleshoot because the link is very long.

    • So the third point is that a complete experiment involves a lot of resources, but the algorithm side does not know what these resources are or how much is needed. These problems all cause great confusion on the algorithm side.

4. Pain point resolution

In the final analysis, the pain points focus on three aspects:

  • The first is consistency. From data preprocessing to model training to prediction, every stage is fragmented, including inconsistencies between data and computation logic;
  • Second, the whole experiment iteration is very slow. For a complete experiment link, an algorithm student needs to master a lot of things, and the material behind experiments cannot be shared — for example, some features have to be re-developed for every experiment;
  • Third, the cost of operation, maintenance and governance is relatively high.

A complete experiment link is composed of a real-time link and an offline link, which makes online problems difficult to troubleshoot.

5. Prototype of real-time AI Engineering

Under these pain points, in 2020 we mainly focused on building the prototype of real-time AI engineering. The core was to make breakthroughs in the following three aspects.

  • First, in the capabilities of BSQL, we hope algorithms can be developed with SQL, so as to reduce engineering investment;
  • Second, feature engineering: solve the core problems of feature computation in order to support feature requirements;
  • Third, collaboration across the whole experiment. The purpose of algorithm work is the experiment; we hope to build end-to-end experiment collaboration and ultimately achieve "one-click experiments" for algorithm engineers.

6. Feature engineering – difficulties

We have encountered some difficulties in feature engineering.

  • The first is real-time feature computation: since its results are used by the online prediction service, the requirements on latency and stability are very high;
  • Second, keeping the real-time and offline computation logic consistent. We often encounter a real-time feature that also needs to be back-filled from the past 30 to 60 days of offline data — how can the real-time feature-computation logic be reused in offline feature computation?
  • Third, it is difficult to unify stream and batch for offline features. The computation logic of real-time features often involves streaming concepts such as windows and timers, but offline features do not have these semantics.

7. Real-time features

Let's take a look at how we do real-time features. On the right of the figure is the most typical scenario: for example, counting in real time how many times a user has played each uploader's (UP主's) related videos in the last 1 minute, 6 hours, 12 hours and 24 hours. For such a scenario, there are two points:

  • First, it needs a sliding window to compute the user's whole playback history. In addition, during the sliding computation, the uploader's basic-information dimension table has to be joined in order to obtain the uploader's videos and count their play counts. This surfaced two big pain points:

    • with Flink's native sliding windows, minute-level slides produce too many windows and a large performance loss;
    • at the same time, fine-grained windows also create too many timers, and cleanup is inefficient.
  • The second is dimension-table lookup: multiple keys need to look up multiple corresponding values in HBase, which requires support for concurrent array lookups.

Given these two pain points, the sliding window was reworked into a group-by pattern plus an AGG UDF, which stores the 1-hour, 6-hour, 12-hour and 24-hour window data in RocksDB. With the UDF approach, data triggering on top of group by can be record-level, and both the semantics and the timeliness improve greatly. Inside the AGG UDF, RocksDB acts as the state store to maintain the data life cycle of the UDF. In addition, SQL was extended to support array-level dimension-table lookup. In the end, this super-large-window approach supports all kinds of computation scenarios in the real-time feature direction.
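The plain-Python sketch below illustrates the kind of logic that lives inside such an AGG UDF: minute-granularity buckets are kept in state (RocksDB in the real job, a dict here), and every incoming record triggers an updated answer for the 1h/6h/12h/24h counts. The window sizes and field handling are illustrative assumptions.

```python
from collections import defaultdict

WINDOWS_MIN = {"1h": 60, "6h": 360, "12h": 720, "24h": 1440}

class MultiWindowCounter:
    """Record-level triggering over a 'super-large window': one instance per group-by key."""
    def __init__(self):
        self.buckets = defaultdict(int)   # epoch minute -> play count (RocksDB state in practice)

    def add(self, event_time_ms):
        now_min = event_time_ms // 60000
        self.buckets[now_min] += 1
        oldest = now_min - WINDOWS_MIN["24h"]
        for m in [m for m in self.buckets if m <= oldest]:
            del self.buckets[m]           # prune beyond the largest window to bound state size
        return {
            name: sum(c for m, c in self.buckets.items() if m > now_min - size)
            for name, size in WINDOWS_MIN.items()
        }

counter = MultiWindowCounter()
print(counter.add(1626000000000))   # {'1h': 1, '6h': 1, '12h': 1, '24h': 1}
```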

8. Offline features

Next, let's look at the offline side. The upper part of the left view is the complete real-time computation link. To let the same SQL be reused for offline computation, the computation IO must be reusable: for example, streaming reads input data from Kafka while offline reads it from HDFS; streaming looks up dimension tables in KV engines such as KFC or AVBase, while offline has to rely on Hive. In the final analysis, three problems need to be solved:

  • First, the ability to simulate streaming consumption is needed, so that HDFS data can be consumed in the offline scenario;
  • Second, ordered partition consumption of HDFS data needs to be solved, similar to Kafka's partition consumption;
  • Third, KV-engine dimension-table consumption must be simulated on top of Hive. One more problem has to be solved here: when each record pulled from HDFS looks up the Hive table, there must be a corresponding snapshot — in effect, each record's timestamp must be looked up against the partition matching that timestamp.

9. Optimization

9.1 Offline – ordered partition

The ordered-partition scheme is based mainly on modifications made upstream, where the data lands on HDFS. Before landing, the transmission pipeline consumes data through Kafka. After Flink's job pulls data from Kafka, the watermark of the data is extracted from the event time, and each parallel instance of the Kafka source reports its watermark to the GlobalWatermark module in the JobManager. GlobalAgg aggregates the watermark progress of every instance to compute the global watermark progress. Based on this global progress, it determines which instances' watermarks are advancing too fast and sends control information back to the Kafka source through GlobalAgg: instances that are too fast slow down their partition advancement. As a result, in the HDFS sink module, the event times of the records received within the same time slice are basically ordered, and when the data lands on HDFS the file name also carries its partition and its time-slice range. Finally, an ordered data-partition layout is achieved under the HDFS partition directories.
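The rule that GlobalAgg applies can be sketched very simply: a source subtask whose local watermark runs too far ahead of the global watermark gets told to slow its partitions down. The skew threshold below is an illustrative number, not the production setting.

```python
MAX_SKEW_MS = 5 * 60 * 1000   # illustrative bound on how far one subtask may run ahead

def should_throttle(local_watermark_ms, global_watermark_ms, max_skew_ms=MAX_SKEW_MS):
    """Decision evaluated per Kafka source subtask by the JobManager-side GlobalAgg;
    the resulting pause/resume control message is sent back to the source."""
    return local_watermark_ms - global_watermark_ms > max_skew_ms

print(should_throttle(1_626_000_900_000, 1_626_000_000_000))  # True: this subtask should slow down
```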

9.2 Offline – incremental partition consumption

After the data on HDFS is incremental and ordered, an HDFS streaming source was implemented. It runs a fetcher per file partition: there is a fetcher thread for each file, and each fetcher thread tracks each file's offset with a progress cursor and updates that cursor into state following the checkpoint process.

In this way, file consumption can be advanced in order. When backtracking historical data, the offline job has to stop at some point, so a partition-end marker is introduced into the FileFetcher module. As each thread works through its partition, it senses the end of that partition; the partition-end states are aggregated into the CancellationManager and further summarized to the JobManager, which updates the progress of the global partitions. When all global partitions have reached the end cursor, the whole Flink job is cancelled and shut down.
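A minimal sketch of that bookkeeping follows: per-file cursors that would be checkpointed as state, plus the "all partitions finished" check that lets the job cancel itself. File reading is stubbed, and the names are illustrative.

```python
class FileFetcher:
    def __init__(self, path, size):
        self.path, self.size, self.cursor = path, size, 0   # cursor is stored in checkpointed state

    def fetch(self, batch_size):
        end = min(self.cursor + batch_size, self.size)
        chunk = (self.path, self.cursor, end)                # placeholder for reading bytes [cursor, end)
        self.cursor = end
        return chunk

    def finished(self):
        return self.cursor >= self.size                      # partition-end marker reached

def all_partitions_finished(fetchers):
    # mirrors the CancellationManager -> JobManager summary that finally cancels the job
    return all(f.finished() for f in fetchers)

fetchers = [FileFetcher("part-0", 100), FileFetcher("part-1", 50)]
while not all_partitions_finished(fetchers):
    for f in fetchers:
        if not f.finished():
            f.fetch(batch_size=40)
print("all partitions consumed, job can be cancelled")
```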

9.3 Offline – snapshot dimension table

As mentioned earlier, the offline data actually lives in Hive. The HDFS data of a Hive table carries a lot of table fields, but offline features need very little of this information. So the offline fields are first pruned in the Hive process, cleaning an ODS table into a DW table. The DW table is finally consumed by a Flink job, which contains a reload scheduler: according to the partition of the watermark currently reached by the data, it periodically pulls the table information of the corresponding Hive partition, downloads the data from the HDFS directory of Hive, and reloads it into an in-memory RocksDB file. RocksDB is the component that then serves the dimension-table KV queries.

The component maintains the build of multiple RocksDB instances, depending mainly on the event time of the data flow: when it finds that the advancing event time is approaching the end of the current hour partition, it proactively reloads and builds the RocksDB partition for the next hour in lazy-loading mode, and then switches the RocksDB instance being read.
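The sketch below captures that switching behavior, with a plain dict standing in for each RocksDB instance and a stubbed loader standing in for the Hive/HDFS download. The 5-minute pre-load margin is an illustrative assumption.

```python
HOUR_MS = 3600 * 1000
PRELOAD_MARGIN_MS = 5 * 60 * 1000   # illustrative: start building the next hour this early

class SnapshotDimTable:
    def __init__(self, load_partition):
        self.load_partition = load_partition   # hour -> {key: row}, downloaded from Hive/HDFS in practice
        self.snapshots = {}                    # hour -> in-memory snapshot (one RocksDB instance per hour)

    def lookup(self, key, event_time_ms):
        hour = event_time_ms // HOUR_MS
        if hour not in self.snapshots:
            self.snapshots[hour] = self.load_partition(hour)
        next_hour_start = (hour + 1) * HOUR_MS
        if next_hour_start - event_time_ms < PRELOAD_MARGIN_MS and hour + 1 not in self.snapshots:
            self.snapshots[hour + 1] = self.load_partition(hour + 1)   # lazy pre-build of the next hour
        return self.snapshots[hour].get(key)

dim = SnapshotDimTable(load_partition=lambda h: {"up_1001": {"name": "demo_up", "hour": h}})
print(dim.lookup("up_1001", 1626000000000))
```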

10. Stream-batch unification of experiments

Based on the three optimizations above — ordered incremental partitions, Kafka-like partition fetch consumption, and the dimension-table snapshot — real-time features and offline features finally share one set of SQL, unifying the stream and batch computation of features. Next, let's look at the whole experiment: the complete stream-batch unified link. As the figure shows, the top lane is the complete offline computation process and the second is the whole near-line process; the computation semantics used in the offline process are completely consistent with the real-time consumption semantics used near-line, with Flink providing the SQL computation for both.

Looking at the near-line part: the label join uses a Kafka click stream and a display (impression) stream, while the offline computation link uses an HDFS click directory and an HDFS display directory. Feature-data processing is the same: the real-time side uses Kafka playback data and manuscript (video metadata) data from HBase, while the offline side uses Hive manuscript data and Hive playback data. Besides connecting the offline and near-line flows, the real-time effect data produced near-line is aggregated into an OLAP engine, and Superset provides real-time metric visualization. As the figure shows, the complete stream-batch unified computation link contains a very large number of complex computation nodes.

11. Experimental collaboration – Challenges

The challenge of the next stage is more about experiment collaboration. The figure below is the abstraction obtained by simplifying the link above. As can be seen, the three dashed boxes are one offline link plus two real-time links; together these three links form the stream-batch composition of jobs, which is the most basic unit of a workflow. A complete workflow abstraction is needed, including the stream-batch event-driven mechanism. For algorithm engineers in the AI field, the preferred way is to use Python to define the complete flow. In addition, the inputs, the outputs and the computation itself tend to be templated, which makes it convenient to clone a whole experiment.

12. Introducing AIFlow

For the whole workflow, in the second half of the year we cooperated closely with the community and introduced the AIFlow solution.

On the right is the DAG view of the whole AIFlow link. There is no restriction on the node types it supports: a node can be a streaming node or an offline node. In addition, the communication edges between nodes support both data-driven and event-driven modes. The main benefit of introducing AIFlow is that it provides Python semantics for conveniently defining a complete workflow, as well as scheduling of the workflow's progress.

On the node edges, compared with native workflow schemes in the industry, it also supports an event-driven mechanism. The advantage is that, driven by the progress of the watermark over data partitions in Flink, an event message can be sent between two Flink jobs to pull up the next offline or real-time job.

In addition, it comes with surrounding supporting services, including a notification message module, a metadata service, and a model-center service for the AI field.

13. Defining workflows in Python

Let's take a look at how AIFlow finally defines a workflow in Python. The view on the right is the workflow definition of a project before it goes online. The first part is the definition of a Spark job; its downstream dependency is described by configuring the dependency, so that an event-driven message is sent to pull up the following Flink streaming job, and the streaming job can in turn pull up the following Spark job in a message-driven way. The whole definition is very simple: it only takes four steps — configure each node's conf information, define each node's operation, declare its dependencies, and finally run the topology view of the whole flow. A small illustrative sketch of this pattern follows.
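The sketch below only illustrates the four-step pattern just described (configure, define the operation, declare the dependency, run); the function names are hypothetical stand-ins and are not the actual AIFlow API.

```python
def build_workflow():
    nodes, edges = {}, []

    # step 1 + 2: configure a node and define its operation
    def job(name, conf, run):
        nodes[name] = {"conf": conf, "run": run}
        return name

    # step 3: declare an (event-driven) dependency between two nodes
    def depends_on(downstream, upstream, trigger="event"):
        edges.append((upstream, downstream, trigger))

    batch = job("spark_init", {"engine": "spark"}, run=lambda: print("building initial snapshot"))
    stream = job("flink_stream", {"engine": "flink"}, run=lambda: print("starting streaming job"))
    depends_on(stream, batch)   # the Spark job's completion event pulls up the Flink job

    # step 4: "run" the topology (here we simply return it)
    return nodes, edges

print(build_workflow())
```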

14. Event-driven stream-batch scheduling

Next, let's look at the complete driving mechanism of stream-batch scheduling. The right side of the figure below is the driving view of three working nodes. The first goes from source to SQL to sink. The yellow box introduced is the extended supervisor, which collects the global watermark progress. When the streaming job finds that the watermark can be advanced to the next hour's partition, it sends a message to the notify service; after the notify service receives this message, it forwards it to the next job. The next job introduces a flow operator into its Flink DAG: before the operator receives the message sent by the upstream job, it blocks the whole job from running; once the message arrives — which means the upstream partition for the last hour has been completed — the next flow node can be driven to start running. Similarly, the next workflow node introduces a GlobalWatermark collector module to aggregate its own processing progress; when its last-hour partition is complete, it also sends a message to the notify service, which drives the module calling AIScheduler to pull up the Spark offline job and finish the offline computation. As you can see, the whole link supports four scenarios: batch to batch, batch to stream, stream to stream, and stream to batch.
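A minimal sketch of the blocking "flow operator" behavior follows, with a queue.Queue standing in for the notify-service channel and the partition label being an illustrative value.

```python
import queue

def flow_operator(notify_channel, run_job, expected_partition):
    """Block the downstream job until the upstream 'partition finished' event arrives."""
    while True:
        msg = notify_channel.get()                     # blocks the downstream node
        if msg == ("partition_finished", expected_partition):
            return run_job(expected_partition)         # upstream hour complete: safe to run

channel = queue.Queue()
channel.put(("partition_finished", "2021-07-24-13"))   # sent by the notify service in practice
flow_operator(channel, lambda p: print("running downstream job for", p), "2021-07-24-13")
```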

15. Prototype of real-time AI full link

Based on the workflow definition and scheduling of stream and batch described above, the prototype of the real-time AI full link was preliminarily built in 2020, with experiments at its core. Algorithm students can develop nodes based on SQL, define a complete DAG workflow in Python, and get monitoring, alerting, operation and maintenance in an integrated way.

At the same time, it connects offline to real-time, data processing to model training, and model training to experiment effects, end to end. On the right is the link of the near-line experiment; below it are the services that feed the material data produced by the experiment link to online prediction and training. Overall, there are three aspects of support:

  • First, basic platform functions, including experiment management, model management, feature management and so on;
  • Second, the underlying services of AIFlow;
  • Third, platform-level metadata services.

4、 Some prospects for the future

In the coming year, we will focus more on two aspects.

  • The first is the data-lake direction. We will focus on incremental computing scenarios from the ODS to the DW layer, and on breakthroughs in scenarios from the DW to the ADS layer. The core will combine Flink with Iceberg and Hudi as the landing for this direction.
  • The second is the real-time AI platform. We will go further toward experiments and provide a real-time AI collaboration platform, with the core goal of an efficient engineering platform that streamlines and simplifies the work of algorithm engineers.
