More Than 1 Billion Records per Day: Building a Real-Time User Portrait System Based on Flink

Time: 2020-02-13

Authors: Yang Yi, Mu Chaofeng, He Xiaobing, Hu Xi

Introduction: The pace of life keeps accelerating. Faced with an ever-growing flood of information, enterprises are increasingly troubled by inefficient information filtering and processing. Lacking fine-grained user targeting, many apps push inappropriate or unwanted messages, which badly hurts the user experience and can even drive users away. Against this background, Youxin Jinfu (Youxin Financial Services) implemented a group-wide data strategy: by connecting and integrating the data of every business line in the group and applying big data and artificial intelligence technologies, it built unified data assets such as ID mapping and user tags. The user portrait project of Youxin Jinfu was established in this context to realize the group strategy of "data-driven business and operation". At present, the system processes more than 1 billion records per day and integrates hundreds of compliant data sources.

1、 Technology selection

The traditional Hadoop-based scheme of offline data storage and computing is widely used in the industry, but because offline computing has high latency, more and more data application scenarios are moving from offline to real-time processing. The table below compares the current mainstream real-time computing frameworks.

[Table: comparison of mainstream real-time computing frameworks]

Apache Storm's fault-tolerance mechanism requires every record to be acknowledged (ACK), so its throughput suffers significantly and problems arise in high-throughput scenarios; it therefore does not fit the requirements of this project.

Apache Spark has the most complete overall ecosystem and currently leads in the integration and application of machine learning, but its streaming engine is still micro-batching underneath.

Apache Flink has clear advantages in stream computing. First, its streaming model is true event-at-a-time processing: every incoming record triggers a computation, which clearly distinguishes it from Spark's micro-batch stream processing. Second, Flink's fault-tolerance mechanism is lighter weight and has little impact on throughput, allowing Flink to achieve high throughput. Finally, Flink is easy to use and simple to deploy. Weighing these points, we decided to adopt a Flink-based architecture.

2、 User portrait business architecture

The user portrait system currently provides real-time tag data services for the group's online business. To do so, our service must connect a variety of data sources, and clean, cluster, and analyze massive amounts of data in real time so that it can be abstracted into tags, finally delivering high-quality tag services to applications. In this context, the overall architecture of our user portrait system is as follows:

[Figure: overall architecture of the user portrait system]

The overall architecture is divided into five layers:

  1. Access layer: ingests and processes raw data from sources such as Kafka, Hive, and files.
  2. Computing layer: Flink serves as the real-time computing framework that cleans and correlates the real-time data.
  3. Storage layer: stores the cleaned data. We built a layered model of the real-time user portrait and store the data for different application scenarios in Phoenix, HBase, HDFS, Kafka, etc.
  4. Service layer: provides a unified external data query service, supporting multi-dimensional computation from bottom-level detail data up to aggregated data.
  5. Application layer: supports the data scenarios of each business line through the unified query service. Current business data mainly includes user interest scores, user quality scores, and user fact information.

3、 User portrait data processing flow

After completing the overall architecture design, we also designed a detailed processing scheme for the data. In the data ingestion stage, given Kafka's high throughput and high stability, we use Kafka as the distributed publish-subscribe messaging system of the user portrait system. In the data cleaning stage, Flink performs unique user identification, behavioral data cleaning, and similar tasks, and removes redundant data. This process supports interactive computation and a variety of complex algorithms, as well as both real-time and offline data computation. So far we have iterated through two versions of the data processing flow; the details are as follows.

Version 1.0 data processing flow

Processing flow across the access, computing, and storage layers

Overall, there are two kinds of data sources:

  1. Historical data: massive historical business data ingested from external data sources. After ingestion, it is processed by ETL and written into the base tables of the user portrait.
  2. Real-time data: real-time business data ingested from external data sources, such as user behavior tracking data and risk control data.

Depending on the indicator requirements of different businesses, we either extract data directly from the group data warehouse into Kafka, or have the business side write into Kafka directly via CDC (Change Data Capture). In the computing layer, Flink consumes this data and uses the DataStream API to generate ID mappings, user tag fragments, and other data; the results are stored in JanusGraph (a graph database using HBase as its backend storage) and Kafka. The user tag fragments written to Kafka are consumed by another Flink job that aggregates them into the latest user tag fragments. (User tag fragments are the fragmented blocks of tag data the user portrait system obtains from the various channels.)
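
To make the computing-layer step concrete, here is a minimal sketch of a Flink DataStream job that consumes business events from Kafka and maps them to tag fragments. It is an illustration under assumptions: the broker address, topic name, and the TagFragment model with its CSV payload are all hypothetical stand-ins for the project's own classes.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class TagFragmentJob {

    /** Placeholder for the project's tag-fragment model: one user ID plus one tag. */
    public static class TagFragment {
        public String userId;
        public String tag;

        public static TagFragment parse(String line) {
            // Assume a simple "userId,tag" CSV payload for illustration.
            String[] parts = line.split(",", 2);
            TagFragment f = new TagFragment();
            f.userId = parts[0];
            f.tag = parts.length > 1 ? parts[1] : "";
            return f;
        }

        @Override
        public String toString() {
            return userId + " -> " + tag;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // hypothetical broker
        props.setProperty("group.id", "user-portrait");

        // Consume raw business events from Kafka (topic name is hypothetical).
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("business-events", new SimpleStringSchema(), props));

        // Turn each event into a user tag fragment.
        DataStream<TagFragment> fragments = events.map(TagFragment::parse);

        // In the real pipeline these would be written to JanusGraph and back to Kafka.
        fragments.print();

        env.execute("user-tag-fragment-job");
    }
}
```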

[Figure: version 1.0 data processing flow]

Data service layer processing flow

The service layer runs TinkerPop OLAP computation over the user tag fragment data in the storage layer, using JanusGraph's Spark-on-YARN mode to generate a file listing all user YIDs (the YID is the group-level user ID defined by the user portrait system). Combining this YID list file, a Flink batch job reads HBase and aggregates the fragments into complete user portrait data, writing the result as HDFS files; another Flink batch job then computes user ratings and prediction labels from the newly generated data and writes them into Phoenix, after which the data can be retrieved through the unified data service interface. The figure below shows the whole process.

[Figure: data service layer processing flow]
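
The last hop of this flow, writing user ratings and prediction labels into Phoenix, can be sketched with Phoenix's standard JDBC interface. The ZooKeeper quorum in the URL, the table, and the columns below are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PhoenixUpsertExample {
    public static void main(String[] args) throws Exception {
        // The Phoenix JDBC URL points at the HBase ZooKeeper quorum (hypothetical host).
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             PreparedStatement ps = conn.prepareStatement(
                     // Table and column names are hypothetical.
                     "UPSERT INTO USER_SCORE (YID, QUALITY_SCORE) VALUES (?, ?)")) {
            ps.setString(1, "yid-000123");
            ps.setDouble(2, 0.87);
            ps.executeUpdate();
            conn.commit(); // Phoenix connections default to autoCommit = false
        }
    }
}
```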

ID mapping data structure

To integrate user tags and strongly associate user IDs, we treat each user identifier as a vertex of a graph and each ID-pair relationship as an edge. For example, when a user whose browser cookie has already been identified logs in to the company website with a mobile phone number, a <cookie, mobile> relationship is formed. In this way, all user identifiers make up one large graph, in which each small connected subgraph (connected component) contains all the ID information of a single user.

The ID mapping data is modeled as a graph. Graph nodes include types such as userkey, device, idcard, and phone, representing the user's business ID, device ID, ID card number, phone number, and so on. Edges between nodes are generated by analyzing the node information carried in the data stream and connecting the nodes in a fixed priority order. For example, after the Android ID of the user's phone has been identified, the user logs in to the company app with an email address; the system finds the business-line uid, forms the ID pairs <android_id, mail> and <mail, uid>, and then orders the nodes by type priority to generate the android_id-mail-uid relationship graph. The graph structure model of the data is shown in the following figure:

[Figure: ID mapping graph structure, visualized with Gephi]
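
To make the edge-generation rule concrete, the following is a minimal JanusGraph/Gremlin sketch that inserts one ID pair as two typed vertices joined by an edge. The vertex labels, the "idValue" property key, and the edge label are hypothetical, and an in-memory backend is used for illustration instead of the HBase backend described above.

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class IdMappingWriter {
    public static void main(String[] args) throws Exception {
        // In-memory backend for illustration; production uses the HBase backend.
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "inmemory")
                .open();
        GraphTraversalSource g = graph.traversal();

        // Insert the <android_id, mail> ID pair: two typed vertices and one edge.
        Vertex device = g.addV("device").property("idValue", "a1b2c3d4").next();
        Vertex mail = g.addV("mail").property("idValue", "user@example.com").next();
        g.addE("same_user").from(device).to(mail).iterate();

        graph.tx().commit();
        graph.close();
    }
}
```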

Performance bottlenecks of the version 1.0 data processing flow

In the early days of the system, the version 1.0 data processing flow met our daily needs, but as data volume grew the scheme ran into several performance bottlenecks:

  1. First, this version of the data processing was implemented as a self-developed Java program. As data volume exploded, the JVM memory footprint of the program became uncontrollable and its maintenance cost was very high, so we decided to migrate all processing logic to Flink in the new version.
  2. Second, during user tag generation, many abnormally large connected subgraphs appeared in the ID mapping (as shown in the figure below). This usually happens because user behavior data is fairly random and discrete, causing some nodes to be linked incorrectly. Such subgraphs not only increase the difficulty of data maintenance but also "pollute" some of the data; in addition, they sharply degrade the query performance of JanusGraph and HBase.

[Figure: an abnormally large connected subgraph, visualized with Gephi]

  3. Finally, in this scheme the data is serialized with Protocol Buffers (PB) before being stored in HBase, which leads to many merge/update passes over the user portrait tag fragments: producing one tag requires reading JanusGraph and HBase several times, which undoubtedly increases the read pressure on HBase. Moreover, since the data is PB-serialized, its stored format is not human-readable, which makes troubleshooting harder.

To address these problems, we designed version 2.0, in which we try to solve the three issues above through optimizations such as HBase columnar storage and a modified graph data structure.

Version 2.0 data processing flow

Optimization points of the new version

As shown in the figure below, the version 2.0 data processing flow largely inherits from version 1.0. The new version optimizes the flow in the following respects:

[Figure: data processing flow of version 2.0]

  1. The offline backfill of historical data was migrated from the Java service to Flink.
  2. To optimize the data structure model of the user portrait, the way edges are connected was changed. Previously we determined the node types and connected multiple nodes according to a preset priority order; in the new scheme, all connections are made through a central userkey node. With this change, the former large connected subgraphs (Figure 6) are reduced to the small connected subgraphs below (Figure 8), which resolves the data pollution problem and ensures data accuracy. It also greatly alleviates the version 1.0 situation in which a single record required more than ten HBase reads: under the new scheme a record needs only three HBase reads on average, cutting the read pressure on HBase by a factor of six to seven.

[Figure: small connected subgraphs after the userkey-centric change, visualized with Gephi]

  3. In the old version, Protocol Buffers objects were used to store the user portrait data, each generated and stored in HBase as one whole column. In the new version, a map stores the user portrait tag data: every key-value pair in the map is an individual tag, and each pair becomes a separate column once stored in HBase. Under this storage format, HBase column appends and merges directly produce complete user portrait data, removing the Flink merge/update step for portrait tags and streamlining the processing flow. The tag data stored in HBase also becomes ad-hoc queryable, meaning that specific tag values can be inspected directly in HBase under given conditions; this is the basic precondition for data governance functions such as verifying data quality, managing the data life cycle, and data security. A minimal sketch of this per-tag column layout follows this list.
  4. In the data service layer, we use Flink to batch-read the Hive external table over HBase and generate user quality grading data, which is then stored in Phoenix. In the old scheme, full HBase scans by Spark created so much read pressure that cluster nodes went down; the new scheme effectively reduces the read pressure on HBase. Our online verification shows that the new scheme lowers the HBase read load by dozens of times. (This optimization differs from optimization 2: it belongs to the service layer.)
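
The per-tag column layout from optimization 3 can be sketched with the plain HBase client API; the table name, column family, row key, and tag names below are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TagColumnWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_portrait"))) { // hypothetical table

            // The row key is the user's YID; each tag is written as its own column,
            // so new tags merge into the row without rewriting a serialized blob.
            Put put = new Put(Bytes.toBytes("yid-000123"));
            put.addColumn(Bytes.toBytes("tags"), Bytes.toBytes("interest_score"), Bytes.toBytes("0.87"));
            put.addColumn(Bytes.toBytes("tags"), Bytes.toBytes("quality_level"), Bytes.toBytes("A"));
            table.put(put);
        }
    }
}
```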

4、 Problems and solutions

Most of the data in the user portrait system deployed online comes from real-time Kafka data. As data volume grows, the pressure on the system also increases, leading to problems such as Flink backpressure and checkpoint timeouts, which cause Flink to fail to commit Kafka offsets and thus affect data consistency. These production issues drew our attention to Flink's reliability, stability, and performance. We analyzed the problems in detail and, in combination with our own business characteristics, explored and practiced some corresponding solutions.

Checkpointing: process analysis and performance optimization

Checkpointing process analysis

The following figure shows the execution flow of checkpointing in Flink:

[Figure: checkpoint execution process in Flink]

  1. The checkpoint coordinator sends a barrier to all source nodes.
  2. After a task receives the barriers from all of its inputs, it writes its own state to persistent storage and forwards the barrier downstream.
  3. After a task finishes persisting its state, it reports the address of the stored state to the coordinator.
  4. Once the coordinator has collected the state acknowledgements of all tasks and written the storage paths of the state to persistent storage, the checkpoint is complete.
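
As a reference point for the optimizations below, here is a minimal sketch of enabling and tuning checkpointing for a job; the interval and timeout values are illustrative, not the project's settings:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger an exactly-once checkpoint every 60 seconds (illustrative value).
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Fail a checkpoint that does not complete within 10 minutes.
        env.getCheckpointConfig().setCheckpointTimeout(600_000);

        // Leave at least 30 seconds between the end of one checkpoint and the next.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
    }
}
```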

Performance optimization scheme

From the process analysis above, checkpoint performance can be improved in three ways:

  1. Choose an appropriate checkpoint storage method (state backend)
  2. Increase task parallelism reasonably
  3. Shorten the length of operator chains

Choose the right checkpoint storage method

Flink provides three state backends for storing checkpoints: MemoryStateBackend, FsStateBackend, and RocksDBStateBackend. According to the official documentation, the performance and safety of the different state backends vary considerably. In general, MemoryStateBackend is suitable for test environments, while RocksDBStateBackend is the best choice for production.

There are two reasons for this. First, RocksDBStateBackend keeps state in external storage, while the other two backends keep state on the JVM heap, so the JVM heap size constrains checkpoint state size and safety to some extent. Second, RocksDBStateBackend supports incremental checkpoints, which record only the changes relative to the previously completed checkpoint instead of producing a full snapshot. Compared with full checkpoints, incremental checkpoints significantly reduce checkpointing time, at the cost of a longer recovery time.
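
A minimal sketch of switching a job to RocksDBStateBackend with incremental checkpoints enabled, using the Flink 1.x API this article is based on; the HDFS checkpoint path is hypothetical:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The second constructor argument enables incremental checkpoints;
        // the checkpoint URI below is a hypothetical HDFS path.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        env.enableCheckpointing(60_000);
    }
}
```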

Increase the task parallelism reasonably

A checkpoint has to collect the state of every task, and the more state a single task holds, the slower the checkpoint. We can therefore shorten checkpoint time by increasing task parallelism, which reduces the amount of state held by each individual task. The sketch below shows where parallelism can be set.
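
A small sketch of the two levels at which parallelism can be raised; the values are illustrative only:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(8); // job-wide default (illustrative value)

        env.fromElements(1, 2, 3)
           .map(x -> x * 2)
           .setParallelism(16) // raise parallelism for a state-heavy operator
           .print();

        env.execute("parallelism-example");
    }
}
```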

Shorten the length of operator chains

[Figure: Flink operator chain]

The longer a Flink job's operator chain, the more tasks it contains, the more state data accumulates, and the slower checkpoints become. By shortening the operator chain we can reduce the number of tasks, thereby reducing the total amount of state in the system and indirectly optimizing checkpoints. Flink merges adjacent operators into a chain (a single task) when all of the following rules hold:

  1. The upstream and downstream operators have the same parallelism
  2. The in-degree of the downstream node is 1 (it has only one input)
  3. The upstream and downstream nodes are in the same slot sharing group
  4. The chaining strategy of the downstream node is ALWAYS
  5. The chaining strategy of the upstream node is ALWAYS or HEAD
  6. The data partitioning between the two nodes is FORWARD
  7. Chaining has not been disabled by the user

Following these rules, we merged closely related processing logic at the code level, which reduced the average operator chain length by at least 60%-70%. The sketch below shows the chaining preconditions and the knobs Flink exposes for controlling them.
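
A small sketch of what the chaining rules mean in code, assuming a trivial pipeline: the two maps below satisfy the rules and are fused into one task, and the comments name the operators Flink provides for breaking a chain explicitly:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // uniform parallelism is one precondition for chaining

        env.fromElements("a", "b", "c")
           // These two maps satisfy the chaining rules (same parallelism,
           // FORWARD partitioning, ...), so Flink fuses them into one task.
           .map(String::trim)
           .map(String::toUpperCase)
           // If an operator must run as its own task, chaining can be broken:
           //   .startNewChain()   starts a new chain at this operator
           //   .disableChaining() excludes this operator from any chain
           .print();

        env.execute("chaining-example");
    }
}
```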

Flink backpressure: analysis and solutions

Analysis of how backpressure arises

While a Flink job runs, each operator consumes an intermediate stream, transforms it, and produces a new stream. A helpful analogy is that Flink uses bounded buffers that behave like blocking queues: just as with a blocking queue in Java, once a queue reaches its capacity limit, a slower consumer blocks the producer from putting new messages or events into the queue. The following figure shows how data is transferred between two operators in Flink and how backpressure is sensed:

[Figure: data transfer between two operators and how backpressure is sensed]

First, an event from the source enters Flink, is processed by Operator 1, and is serialized into a buffer; Operator 2 then reads the event from that buffer. When Operator 2's processing capacity is insufficient, Operator 1's output cannot be placed into the buffer, and backpressure forms. There are two likely causes of backpressure:

  1. The processing capacity of downstream operators is insufficient;
  2. The data is skewed.

Backpressure solutions

In practice, we mitigate backpressure in the following ways. First, we shorten operator chains, which merges operators sensibly and saves resources; this also reduces switching between tasks (threads), message serialization/deserialization, and the number of data exchanges through buffers, improving the overall throughput of the system. Second, based on the characteristics of the data, we filter out data that is unnecessary or not yet needed, and process the rest separately according to business requirements; for example, some data sources must be processed in real time while other data can tolerate delay. Finally, we use keyBy together with appropriately sized Flink time windows to merge as much data as possible in upstream operators, reducing the processing pressure on downstream operators. A sketch of this pre-aggregation appears below.
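
A minimal sketch of the keyBy-plus-window pre-aggregation described above, assuming a stream of (userId, count) pairs; the 10-second window is an illustrative size:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PreAggregationExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                Tuple2.of("user-1", 1L),
                Tuple2.of("user-1", 1L),
                Tuple2.of("user-2", 1L))
           // Group by user and pre-aggregate in 10-second windows so the
           // downstream operator sees one record per user per window
           // instead of every raw event.
           .keyBy(value -> value.f0)
           .timeWindow(Time.seconds(10))
           .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
           .print();

        env.execute("pre-aggregation-example");
    }
}
```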

Optimization results

After the above optimizations, the user portrait system processes information in real time without sustained backpressure at a volume of 100 million records per day, and the average checkpoint duration is stable within 1 second.

5、 Thoughts and outlook on future work

End-to-end real-time stream processing

At present, part of the user portrait data is obtained from the Hive data warehouse in T+1 mode, so it carries a large delay. To improve data freshness, the pipeline needs end-to-end real-time stream processing.

End-to-end means that raw data is collected at one end while the data is presented and applied at the other end in the form of reports, tags, or interfaces, with a real-time stream connecting the two ends. In follow-up work, we plan to switch all existing non-real-time data sources to real-time sources, process them uniformly through Kafka and Flink, and then load them into Phoenix / JanusGraph / HBase. Forcing all source data through Kafka improves the stability and availability of the overall pipeline: first, Kafka acts as a buffer for downstream systems, shielding the real-time computation from downstream anomalies and smoothing out load peaks; second, Flink has officially supported end-to-end exactly-once semantics with Kafka since version 1.4, so consistency is better guaranteed.

[Figure: end-to-end real-time stream processing]
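
A minimal sketch of the exactly-once Kafka integration referred to above, using the FlinkKafkaProducer from the Flink 1.x Kafka connector; the broker address and topic are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceSinkExample {

    /** Writes each string to the (hypothetical) user-tags topic. */
    public static class TagSchema implements KafkaSerializationSchema<String> {
        @Override
        public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
            return new ProducerRecord<>("user-tags", element.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Exactly-once Kafka sinks require checkpointing to be enabled.
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // hypothetical broker
        // Must not exceed the broker-side transaction.max.timeout.ms.
        props.setProperty("transaction.timeout.ms", "600000");

        env.fromElements("tag-fragment-1", "tag-fragment-2")
           .addSink(new FlinkKafkaProducer<>(
                   "user-tags", new TagSchema(), props,
                   FlinkKafkaProducer.Semantic.EXACTLY_ONCE));

        env.execute("exactly-once-sink-example");
    }
}
```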

About the authors:
Yang Yi: Java engineer, Youxin Jinfu Computing Platform Department
Mu Chaofeng: senior data development engineer, Youxin Jinfu Computing Platform Department
He Xiaobing: data development engineer, Youxin Jinfu Computing Platform Department
Hu Xi: technical director, Youxin Jinfu Computing Platform Department