DTStack product sharing: building a real-time big data processing platform based on StreamWorks

Time:2021-5-11

DTStack has an interesting open source project on GitHub and Gitee: FlinkX. FlinkX is a unified batch and stream data synchronization tool based on Flink. It can collect both static data and data that changes in real time, making it a global, heterogeneous synchronization engine for batch and stream data alike. If you like it, please give us a star! star! star!

GitHub open source project:https://github.com/DTStack/fl…

Gitee open source project:https://gitee.com/dtstack_dev…

During the 2020 Spring Festival, a sudden epidemic spread across the country and disrupted everyone’s rhythm of work and life. During the epidemic, people staying at home could check the real-time big data epidemic map at any time and scroll through the latest short videos on Douyin (TikTok) at any time. The key technology behind all of this is real-time big data processing.

Now that the epidemic is nearly over, the state has proposed speeding up the construction of new infrastructure such as big data centers, and building a real-time big data processing platform has become an increasingly important part of enterprises’ digital transformation.

1、 What is real time computing

In big data processing, tasks are usually divided into real-time computing and offline computing according to the nature of the data. Take temperature sensors as an example: suppose a city has installed a large number of temperature sensors, each of which uploads its reading every minute, and the Meteorological Center collects the readings and updates its results every 5 minutes. This data is generated continuously and never stops. Real-time computing is mainly used in scenarios where “data is generated continuously and never stops, and the results must be obtained with minimum delay”. That delay is usually on the order of seconds or minutes.

Real-time computing is usually adopted to cope with large data volumes and strict timeliness requirements. Because its data flows in continuously, real-time computing processes data in a very different way from offline computing.
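
To make the sensor scenario concrete, here is a minimal sketch in plain Python (not Flink code) of a 5-minute window average over per-minute readings. The event layout, the sensor name `sensor-1` and the function `windowed_average` are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

WINDOW_SECONDS = 5 * 60  # results are refreshed every 5 minutes


def windowed_average(
    readings: Iterable[Tuple[int, str, float]],
) -> Iterator[Tuple[int, Dict[str, float]]]:
    """Consume (timestamp_seconds, sensor_id, temperature) events in time order
    and emit (window_start, {sensor_id: average}) each time a window closes."""
    current_window = None
    values: Dict[str, List[float]] = defaultdict(list)
    for ts, sensor_id, temperature in readings:
        window = ts - ts % WINDOW_SECONDS
        if current_window is not None and window != current_window:
            # The previous window is complete: emit its averages and reset the state.
            yield current_window, {s: sum(v) / len(v) for s, v in values.items()}
            values.clear()
        current_window = window
        values[sensor_id].append(temperature)
    # A real stream never ends; this final flush exists only because the demo input is finite.
    if current_window is not None and values:
        yield current_window, {s: sum(v) / len(v) for s, v in values.items()}


# One reading per minute from one hypothetical sensor, for ten minutes.
demo = [(minute * 60, "sensor-1", 20.0 + minute) for minute in range(10)]
results = list(windowed_average(demo))
```

Each result is emitted as soon as its window closes, rather than after all the data has been loaded, which is the low-delay behavior described above.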

Figure 1 difference between real-time computing and offline computing

Offline computing is batch-oriented, high-latency and actively initiated. Real-time computing is continuous, low-latency and event-triggered. In offline computing, the data is loaded first, then the offline task is submitted, and finally the result is returned. In real-time computing, the streaming task is submitted first, then it waits for real-time stream data to arrive, and then it continuously produces a stream of results.

Figure 2 difference between real-time computing and offline computing (image)

Figuratively speaking, offline computing is like taking a boat out to fish in a lake (the database), while real-time computing is like building a dam on a river (the data stream) to generate electricity. Furthermore, lakes are formed by rivers, and a river bounded upstream and downstream is a lake. In this sense, offline computing can be understood as a special case of real-time computing.
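
The contrast between the two modes, and the idea that offline computing is a special case of real-time computing, can be sketched in a few lines of Python. The function names `batch_total` and `streaming_total` and the sample data are hypothetical.

```python
from typing import Iterable, Iterator, List


def batch_total(loaded_data: List[int]) -> int:
    """Offline style: the data is already loaded in full, and one job returns one result."""
    return sum(loaded_data)


def streaming_total(events: Iterable[int]) -> Iterator[int]:
    """Real-time style: the job is set up first, then emits an updated result
    for every event that arrives."""
    total = 0
    for value in events:
        total += value
        yield total


orders = [5, 3, 7]
# A bounded stream (the "lake") ends, so list() terminates here.
running = list(streaming_total(orders))
```

When the stream is bounded, the last streaming result equals the batch result, which is one way to read the lake and river analogy.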

2、 Problems that real time computing can solve

Figure 3 problems that real time computing can solve

Technically, real-time computing is mainly used in the following scenarios:

  • Real-time data ETL based on data pipelines: the goal is to move data from point A to point B in real time, possibly cleaning and integrating it along the way. Examples include real-time index building for a search system and the ETL process of a real-time data warehouse.
  • Real-time data analysis: data analysis is the process of extracting and integrating information from raw data according to business objectives, for example the top 10 items by daily sales, average warehouse turnover time, average page click-through rate, or push notification open rate. Real-time data analysis performs this process in real time, and its results usually appear as real-time reports or real-time dashboards.
  • Data-driven, event-driven applications: systems that process or respond to a series of subscribed events. Event-driven applications usually rely on internal state, as in click fraud detection, risk control, and operations anomaly detection systems. When a user’s behavior triggers a risk control point, the system captures the event and analyzes it together with the user’s current and previous behavior to decide whether to apply risk control to that user.
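
As an illustration of the third scenario, the sketch below keeps per-user internal state and raises an alert when a user clicks too often within a time window. It is plain Python rather than a Flink job, and the threshold, window length and function name are hypothetical values chosen for the example.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

WINDOW_SECONDS = 60  # hypothetical look-back window
MAX_CLICKS = 3       # hypothetical threshold


def detect_click_fraud(clicks: Iterable[Tuple[int, str]]) -> Iterator[Tuple[int, str]]:
    """Consume (timestamp_seconds, user_id) click events in time order and emit an
    alert whenever a user exceeds MAX_CLICKS within WINDOW_SECONDS."""
    recent: Dict[str, Deque[int]] = defaultdict(deque)  # internal per-user state
    for ts, user in clicks:
        history = recent[user]
        history.append(ts)
        # Forget clicks that have fallen out of the look-back window.
        while history and history[0] <= ts - WINDOW_SECONDS:
            history.popleft()
        if len(history) > MAX_CLICKS:
            yield ts, user


events = [(0, "u1"), (10, "u1"), (15, "u2"), (20, "u1"), (30, "u1"), (200, "u1")]
alerts = list(detect_click_fraud(events))
```

The decision for each event depends on the user's current click and on earlier clicks held in state, which is what distinguishes this kind of application from a stateless ETL step.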

3、 Whole link process of real time development

Figure 4 whole link process of real time development

Real-time collection: a streaming data collection tool collects data and transmits it in real time to big data message storage (Kafka, etc.). As the upstream of real-time computing, the streaming data storage supplies a continuous stream of data that triggers streaming jobs; in other words, stream data is the trigger source that drives real-time computing. A real-time computing job must therefore use at least one data stream as its source, and each incoming record directly triggers one round of stream computation. The data is processed and analyzed in the real-time computing system and then written immediately to downstream data storage. The downstream database is generally business-related and serves data consumers such as real-time reports and real-time dashboards.
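
The three stages of this link (collection, computation, downstream storage) can be sketched with in-memory stand-ins. In the sketch below, a `queue.Queue` plays the role of Kafka and a dictionary plays the role of the downstream database; all names and the sample records are assumptions for illustration.

```python
import queue
from typing import Dict, Iterable

message_store: "queue.Queue[dict]" = queue.Queue()  # stands in for Kafka
downstream_store: Dict[str, int] = {}               # stands in for the downstream database


def collect(records: Iterable[dict]) -> None:
    """Real-time collection: push each source record into the message store."""
    for record in records:
        message_store.put(record)


def process(event: dict) -> None:
    """Real-time computation: each event triggers one update, written downstream at once."""
    page = event["page"]
    downstream_store[page] = downstream_store.get(page, 0) + 1


def run_job() -> None:
    """Drain the message store. A real job would keep waiting for new events instead of stopping."""
    while not message_store.empty():
        process(message_store.get())


collect([{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}])
run_job()
```

A report or dashboard would then read `downstream_store`, never the raw stream.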

4、 Real-time acquisition: the key to the full-link real-time development platform

In full-link real-time development, real-time acquisition sits upstream of real-time computing. Many enterprises already have data storage systems, but most of these are offline relational databases. How to feed the real-time incremental data of these databases into real-time computation and analysis is an urgent problem. The figure below shows the functional architecture of the Kangaroo Cloud real-time data collection tool.

Figure 5 data flow of flinkx real time data acquisition tool

As a module of the StreamWorks platform, Kangaroo Cloud real-time data acquisition has the following features.

  • FlinkX supports both batch extraction and real-time capture of changed data from sources such as MySQL, Oracle and SQL Server, unifying batch and stream collection.
  • The underlying layer is based on Flink’s distributed architecture and supports high-volume, high-concurrency synchronization, with better performance and stability than single-node synchronization.
  • It supports real-time synchronization by reading the database binlog directly, as well as by interval polling.
  • It supports resuming from a breakpoint and recording dirty data, and displays real-time acquisition metrics as curves.
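
The interval-polling mode, together with breakpoint resumption and dirty data recording, can be sketched as follows. This is not FlinkX code: it uses Python's built-in `sqlite3` as a stand-in for the source database, reads "breakpoint continuation" as resuming from a saved checkpoint, and every table, variable and function name is hypothetical.

```python
import sqlite3
from typing import Dict, List, Tuple

source = sqlite3.connect(":memory:")  # stands in for the relational source database
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount TEXT)")

checkpoint: Dict[str, int] = {"last_id": 0}  # persisted in a real system so a restart can resume
synced: List[Tuple[int, float]] = []         # stands in for the downstream target
dirty: List[Tuple[int, str]] = []            # rows that failed conversion


def poll_once() -> int:
    """Read rows added since the checkpoint, sync them, and advance the checkpoint."""
    rows = source.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (checkpoint["last_id"],),
    ).fetchall()
    for row_id, amount in rows:
        try:
            synced.append((row_id, float(amount)))
        except ValueError:
            dirty.append((row_id, amount))  # record the dirty row instead of failing the job
        checkpoint["last_id"] = row_id
    return len(rows)


source.executemany("INSERT INTO orders (amount) VALUES (?)", [("10.5",), ("oops",)])
first = poll_once()
source.execute("INSERT INTO orders (amount) VALUES (?)", ("7",))
second = poll_once()
```

A scheduler would call `poll_once` at a fixed interval. Because only rows beyond `last_id` are read, a job restarted with the saved checkpoint continues where it stopped instead of re-reading the whole table.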

5、 Introduction of streamworks real time development platform

Kangaroo Cloud’s real-time development platform (StreamWorks) is a cloud-native, one-stop big data stream computing platform based on Apache Flink, covering the whole link from real-time data acquisition to real-time data ETL. It offers sub-second processing latency, supports the development of DataStream API jobs, and is compatible with existing big data components, helping enterprises carry out their real-time, intelligent data transformation and supporting the construction of new infrastructure.

In traditional data development stacks, SQL can solve most business problems. The core feature of StreamWorks is streaming data analysis with SQL semantics (FlinkStreamSQL), which lowers the development threshold. It also provides exactly-once processing semantics to ensure the accuracy and consistency of business data.

Figure 6 function architecture of streamworks

As shown in the above figure, streamworks consists of the following modules:

  • Real-time collection: supports real-time data collection from MySQL, SQL Server, Oracle, PolarDB, Kafka, EMQ and other data sources, and lets users control the collection process more precisely through rate and concurrency settings.
  • Data development: supports FlinkSQL and Flink task types. FlinkSQL jobs provide visual storage configuration, job development, syntax checking and other functions; Flink tasks support uploading a JAR package to run real-time jobs.
  • Task operation and maintenance: task monitoring, data curves, run logs, data delay, checkpoints, failures, attribute parameters, alarm configuration and other functions.
  • Project management: user management, role management, overall project configuration, project member management, etc.

6、 Advantages of streamworks real time big data development platform

Figure 7 streamworks platform level

As shown in the figure above, the StreamWorks real-time big data development platform is built on the Apache Flink computing engine, with a layer of SQL encapsulation on top of it and an online development IDE at the top. The platform has the following advantages:

  • Easy to use: an online IDE and development tools customized for FlinkSQL!
  • Visual DDL: a visual table-building tool; configure parameters to complete the DDL!
  • Built-in functions: a rich set of built-in FlinkSQL functions to simplify development!
  • Efficient operation and maintenance: dozens of runtime metrics to solve the operation and maintenance problems of open source!
  • Real-time acquisition: real-time acquisition tools that support a full-link real-time development platform!
  • FlinkX: a self-developed batch and stream data acquisition tool, now open source!
Figure 8 traditional development mode vs streamworks development mode

7、 Fourteen lines of code for real time business development

Having said so much about how our product simplifies the development of real-time business logic, let’s illustrate with the most common example, website traffic analysis. Suppose a website needs to analyze where its visits come from:

As shown in the figure below, the job reads the site access log from the log service, parses the source from each log entry, checks whether the source is in the list of websites of interest (similar to a whitelist of source websites, stored in MySQL), counts the traffic (PV) from each website, and writes the final result to MySQL.
Figure 9 business logic flow chart

With StreamSQL the code is very simple: 14 lines of pseudo code are enough.

CREATE TABLE
log_source(dt STRING, …)
WITH (type=kafka);
CREATE TABLE
mysql_dim(url STRING, …, PRIMARY KEY(url))
WITH (type=mysql);
CREATE TABLE
mysql_result(url STRING, …, PRIMARY KEY(url))
WITH (type=mysql);
INSERT INTO mysql_result
SELECT
l.url, COUNT(*) AS pv …
FROM log_source l JOIN mysql_dim d ON l.url = d.url
GROUP BY l.url;
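
For readers who want to see what this pseudo code computes, the following plain-Python sketch walks through the same logic event by event. It is only an illustration, not how StreamSQL executes the job; the example URLs and the function name `run` are hypothetical, and the tables are replaced by an in-memory set and dictionary.

```python
from typing import Dict, Iterable

mysql_dim = {"https://a.example", "https://b.example"}  # hypothetical whitelist of source sites
mysql_result: Dict[str, int] = {}                        # url -> pv, keyed like PRIMARY KEY(url)


def run(log_source: Iterable[dict]) -> None:
    """Process access-log events one at a time, as a streaming job would."""
    for event in log_source:
        url = event["url"]
        if url in mysql_dim:  # JOIN mysql_dim d ON l.url = d.url
            # COUNT(*) ... GROUP BY l.url, upserted into the result table
            mysql_result[url] = mysql_result.get(url, 0) + 1


run([
    {"dt": "2021-05-11", "url": "https://a.example"},
    {"dt": "2021-05-11", "url": "https://c.example"},  # not in the whitelist, ignored
    {"dt": "2021-05-11", "url": "https://b.example"},
    {"dt": "2021-05-11", "url": "https://a.example"},
])
```

Each incoming log line updates the PV of exactly one URL, so the result table always reflects the traffic seen so far.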

8、 Building real time recommendation system based on streamworks

Recommendation systems are commonly built on tags. Tag-based recommendation is widely used; Douyin (TikTok), for example, relies heavily on tags. Such recommendation systems have many advantages, such as simple implementation and good interpretability. How can real-time product or content recommendation be achieved through tags?

First, a new user fills in some relatively fixed data when registering an app account, such as age and occupation. This information can be analyzed offline to produce long-term interest tags, which are stored in a long-term interest tag library. The content the user has recently shown interest in (for example, the items they viewed in the last 10 minutes) is analyzed by real-time computing to produce short-term interest tags. Using the stream-to-dimension-table join of real-time development, the short-term interest tags are then associated with the long-term interest tag library, and new recommended content is generated and pushed to the client. This forms a closed loop of user data and yields a simple real-time recommendation system. The specific process is shown in the figure below.


Figure 10 real time recommendation system based on streamworks
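
The flow described in this section can be sketched in plain Python as below. The tag library, the content catalogue, the scoring rule (sum of matching tag counts) and all names are assumptions made for the example; a real system would join a live stream against a dimension table rather than in-memory dictionaries.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

SHORT_TERM_WINDOW = 10 * 60  # "the last 10 minutes", in seconds

# Hypothetical long-term tag library (computed offline) and content catalogue.
long_term_tags: Dict[str, Set[str]] = {"u1": {"tech"}}
catalogue: Dict[str, Set[str]] = {
    "article-1": {"tech", "flink"},
    "article-2": {"sports"},
    "article-3": {"cooking"},
}


def short_term_tags(events: Iterable[Tuple[int, str, str]], user: str, now: int) -> Counter:
    """Count the tags of content the user viewed within the last SHORT_TERM_WINDOW seconds."""
    counts: Counter = Counter()
    for ts, event_user, tag in events:
        if event_user == user and now - SHORT_TERM_WINDOW < ts <= now:
            counts[tag] += 1
    return counts


def recommend(events: Iterable[Tuple[int, str, str]], user: str, now: int) -> List[str]:
    """Join short-term tags with the long-term library and rank content by tag overlap."""
    interests = short_term_tags(events, user, now)
    for tag in long_term_tags.get(user, set()):
        interests[tag] += 1
    scored = [(sum(interests[t] for t in tags), item) for item, tags in catalogue.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [item for score, item in ranked if score > 0]


events = [
    (100, "u1", "cooking"),  # older than 10 minutes, ignored
    (500, "u1", "sports"),
    (900, "u1", "flink"),
    (950, "u1", "flink"),
    (960, "u2", "cooking"),  # another user
]
recommended = recommend(events, "u1", now=1000)
```

New views by the user change the short-term counts on the next call, so the recommendation list follows recent behavior while the long-term tags keep a stable baseline.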

9、 Conclusion: turn the future into the present

The epidemic will soon be over, and life will go on. As the construction of “new infrastructure” deepens, more and more real-time scenarios will appear in our lives. As a new infrastructure solution provider, Kangaroo Cloud’s slogan is to turn the future into the present, enabling more enterprises to carry out their real-time transformation.