- Lesson 01: Flink’s application scenario and architecture model
- Lesson 02: Introduction to Flink WordCount and SQL implementation
- Lesson 03: Flink’s programming model compared with other frameworks
- Lesson 04: Flink’s commonly used DataSet and DataStream APIs
- Lesson 05: Flink SQL & Table programming and cases
- Lesson 06: Flink cluster installation, deployment and HA configuration
- Lesson 07: Analysis of common core concepts of Flink
- Lesson 08: Flink window, time and watermark
- Lesson 09: Flink state and fault tolerance
Hello, welcome to session 01. In this session, we mainly introduce the application scenario and architecture model of Flink.
The best era of real-time computing
Over the past decade, real-time computing technologies for the data age have arrived one after another. First came Storm, then Spark rose suddenly and quickly took over the whole field of real-time computing. Then, at the end of January 2019, Alibaba officially open-sourced its internal version of Flink. One stone stirred up a thousand waves: the news immediately went viral on social media. A big data computing field that had long been dominated by Spark instantly became a contest between two powers.
Apache Flink (hereinafter referred to as Flink) has attracted much attention for its advanced design and powerful computing capability. How to apply Flink quickly in production, combine it well with the existing big data ecosystem, and fully tap the potential of data has become a hard problem for many developers.
Flink application scenarios
Since Alibaba open-sourced its internal version in early 2019, Flink has rapidly become a hot framework in the field of big data real-time computing. As a major contributor to Flink, Alibaba took the lead in promoting its use throughout the group. In addition, thanks to its native streaming nature and more advanced architecture, Flink set off an adoption boom in major companies.
Alibaba, Tencent, Baidu, ByteDance, Didi, Huawei and many other Internet companies have made Flink a key focus of their future technology and are actively upgrading to it and promoting its use internally. Meanwhile, Flink has become one of the most active projects in the Apache Software Foundation and on GitHub.
Let’s take a look at the many application scenarios supported by Flink.
Real time data calculation
If you are familiar with big data technology, you should be familiar with the following demand scenarios:
Alibaba runs a live broadcast every Double 11. How can the big-screen dashboard be kept up to date in real time?
The company wants to see the top 5 best-selling products during a promotion. How can that be done?
I work in the company’s operation and maintenance department. How can I receive server load data in real time?
We can see that these data calculation scenarios need to extract valuable information and indicators from raw data, such as the real-time sales, the top 5 products by sales, and the server load mentioned above.
The traditional approach is usually a batch query, or recording events (generally messages in production) and building the application on a bounded data set (table) formed from them. To get results for the latest data, you must first write the data into the table and re-execute the SQL query, then write the results into a storage system such as MySQL, and then generate the report.
Apache Flink supports both streaming and batch analysis applications, which is what we call unified batch and stream processing. In the scenarios above, Flink takes on real-time data acquisition, real-time calculation and delivery to downstream systems.
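To make the contrast with the re-run-the-query approach concrete, here is a minimal sketch of the top-N scenario in plain Python. It is not Flink API code; the function name `running_top_n` and the sample orders are made up for illustration. The point it shows is that the ranking is updated incrementally from running state as each event arrives.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def running_top_n(
    orders: Iterable[Tuple[str, int]], n: int = 5
) -> Iterator[List[Tuple[str, int]]]:
    """Yield the current top-n (product, units) ranking after every order event."""
    totals: Counter = Counter()  # running state: units sold per product
    for product, units in orders:
        totals[product] += units  # incremental update, no full re-query
        # Sort by units descending, then by name so that ties are deterministic.
        yield sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical order events: (product, units sold)
orders = [("phone", 3), ("laptop", 1), ("phone", 2), ("headset", 4), ("laptop", 5)]
rankings = list(running_top_n(orders, n=2))
print(rankings[-1])  # [('laptop', 6), ('phone', 5)]
```

Each element of `rankings` is the result a dashboard would show after the corresponding event, so the latest answer is always available without waiting for a batch job.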
Real time data warehouse and ETL
The purpose of ETL (extract, transform, load) is to load data from business systems into the data warehouse after extraction, cleaning and transformation.
The traditional offline data warehouse stores business data centrally and, after ETL and other modeling steps, produces reports and other applications with fixed calculation logic. It is mainly used to build T + 1 offline data: scheduled tasks pull incremental data every day, then build the subject and dimension data for each business and provide a T + 1 data query interface.
The figure above shows the difference between offline data warehouse ETL and the real-time data warehouse. It can be seen that the offline data warehouse is weak in both computation and timeliness. The value of data gradually weakens as time passes, so data must reach users as soon as possible. This is where the demand for building a real-time data warehouse comes from.
Building a real-time data warehouse is not only an essential part of “data intelligence BI”, but also an unavoidable challenge in large-scale data applications.
Flink has natural advantages in real-time data warehouse and real-time ETL:
- State management: the real-time data warehouse involves many aggregation calculations, which need to access and manage state. Flink supports powerful state management;
- Rich APIs: Flink provides rich multi-level APIs, including the stream API, Table API and Flink SQL;
- Mature ecosystem: the real-time data warehouse is used in many places, and Flink supports a variety of storage systems (HDFS, ES, etc.);
- Unified batch and stream processing: Flink is already unifying the APIs of stream computing and batch computing.
Event-driven applications
Have you run into needs like these?
Our company has tens of thousands of servers. We want to extract CPU, MEM and load information from the messages the servers report, analyze it, and then trigger custom alarm rules.
As a security operations engineer, I want to identify crawlers from the daily access logs and restrict their IPs.
An event-driven application is a stateful application. It ingests data from one or more event streams and triggers computation, state updates or other external actions according to the incoming events.
In the traditional architecture, we need to read and write a remote transactional database such as MySQL. In event-driven applications, data and computation are not separated. The application only needs to access local storage (memory or disk) to obtain data, so it achieves higher throughput and lower latency.
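The server-monitoring need above can be sketched as a small stateful loop. This is plain Python rather than Flink code, and the rule itself (alert after three consecutive CPU readings above 0.9), the host names and the function name `detect_overload` are all assumptions made for illustration. What it shows is that the state lives next to the computation, keyed by host, so no remote database is read per event.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def detect_overload(
    events: Iterable[Tuple[str, float]],
    threshold: float = 0.9,
    consecutive: int = 3,
) -> List[str]:
    """Return alerts for hosts whose CPU stays above threshold for N reports in a row."""
    streak: Dict[str, int] = defaultdict(int)  # local keyed state: host -> high readings in a row
    alerts: List[str] = []
    for host, cpu in events:
        if cpu > threshold:
            streak[host] += 1
            if streak[host] == consecutive:  # fire once when the rule is first met
                alerts.append(f"ALERT {host}: cpu > {threshold} for {consecutive} reports")
        else:
            streak[host] = 0  # a normal reading resets the state for this host
    return alerts


# Hypothetical (host, cpu usage) reports
events = [
    ("web-1", 0.95), ("web-2", 0.50), ("web-1", 0.97),
    ("web-1", 0.99), ("web-2", 0.95), ("web-1", 0.40),
]
alerts = detect_overload(events)
print(alerts)
```

In a real Flink job this per-host counter would be held in managed keyed state so that it survives failures; here it is an ordinary in-memory dictionary.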
The following features of Flink are a good fit for event-driven applications:
- Efficient state management: Flink’s built-in state backend stores intermediate state reliably;
- Rich window support: Flink supports tumbling windows, sliding windows and other window types;
- Multiple time semantics: Flink supports event time, processing time and ingestion time;
- Different levels of fault tolerance: Flink supports at-least-once and exactly-once guarantees.
Apache Flink supports application development for many different scenarios from the ground up.
Flink’s main features include unified batch and stream processing, exactly-once semantics and powerful state management. Flink can also run on a variety of resource management frameworks, including YARN, Mesos and Kubernetes. Alibaba took the lead in promoting Flink across the whole group, and practice has shown that Flink can scale to thousands of cores and to state at the TB level while still maintaining high throughput and low latency.
So Flink has become our first choice in the field of real-time computing.
Flink’s architecture model
Flink’s hierarchical model
Flink provides different levels of abstraction for developing streaming or batch programs. The above figure describes the four levels of abstraction supported by Flink.
For developers, most applications do not need the lowest-level abstraction in the figure above. Instead, they are written against the core APIs, such as the DataStream API (bounded / unbounded streams) and the DataSet API (bounded data sets). These fluent APIs provide common building blocks for data processing, such as user-specified transformations, joins, aggregations, windows and state.
The Table API and SQL are higher-level APIs provided by Flink. Flink SQL is a development language that follows standard SQL semantics, designed to simplify the computing model and lower the barrier to using real-time computing.
Flink data flow model
The basic building blocks of a Flink program are streams and transformations. Each data flow starts from one or more Sources and ends in one or more Sinks. A data flow resembles a directed acyclic graph (DAG).
Let’s take the classic WordCount program as an example:
In the figure above, the program consumes data from Kafka, which is the Source part.
The computation is then carried out by methods such as map, keyBy and timeWindow. This is the Transformation part, and methods such as map, keyBy and timeWindow are called operators. Usually there is a one-to-one correspondence between the transformations in the program and the operators in the data flow, but sometimes one transformation may consist of multiple operators.
Finally, the computed data is written to the file we specify, which is the Sink part.
In fact, in complex production environments, Flink jobs mostly run in parallel, distributed across computing nodes. During execution, each data stream has one or more stream partitions, and each operator has one or more operator subtasks running in parallel. The number of subtasks of an operator is that operator’s parallelism. Setting the parallelism is an important way of tuning Flink jobs and will be explained in detail in later lessons.
As can be seen from the above figure, the data streams are redistributed between map and keyBy / window, and between keyBy / window and sink, because of the difference in parallelism.
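The Source, Transformation and Sink stages of WordCount can be sketched in plain Python as follows. This is a simulation rather than Flink code: the in-memory `source()` stands in for Kafka, `WINDOW_SECONDS` is a hypothetical window size, and `sink()` returns formatted lines instead of writing a file.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

WINDOW_SECONDS = 5  # hypothetical window size


def source() -> Iterator[Tuple[int, str]]:
    """Stand-in for the Kafka source: (timestamp in seconds, line of text)."""
    yield from [(1, "hello flink"), (3, "hello world"), (6, "flink flink")]


def word_count(
    records: Iterable[Tuple[int, str]], window_seconds: int = WINDOW_SECONDS
) -> Dict[int, Dict[str, int]]:
    """Transformation: split lines into words, group by word, count per time window."""
    windows: Dict[int, Counter] = defaultdict(Counter)  # window start -> word counts
    for ts, line in records:
        window_start = ts - ts % window_seconds  # the window this record falls into
        for word in line.split():  # map step: line -> words
            windows[window_start][word] += 1  # keyBy(word) + count inside the window
    return {start: dict(counts) for start, counts in sorted(windows.items())}


def sink(result: Dict[int, Dict[str, int]]) -> List[str]:
    """Stand-in for the file sink: format one output line per (window, word)."""
    return [
        f"[{start}, {start + WINDOW_SECONDS}) {word}={count}"
        for start, counts in result.items()
        for word, count in sorted(counts.items())
    ]


result = word_count(source())
for line in sink(result):
    print(line)
```

With the sample input, the window starting at 0 counts `hello` twice and `flink` and `world` once each, and the window starting at 5 counts `flink` twice.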
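The redistribution caused by keyBy can be illustrated with a simple hash partitioner. This is a deliberately simplified sketch in plain Python: Flink’s real key-to-subtask assignment works differently internally, and the CRC32 hash, the function names and the sample records here are assumptions chosen only to show the idea that every record with the same key lands on the same subtask.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def subtask_for(key: str, parallelism: int) -> int:
    """Map a key to a subtask index using a stable hash (illustrative, not Flink's scheme)."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def redistribute(
    records: Iterable[Tuple[str, int]], parallelism: int
) -> Dict[int, List[Tuple[str, int]]]:
    """Route each (key, value) record to the subtask that owns its key."""
    subtasks: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for key, value in records:
        subtasks[subtask_for(key, parallelism)].append((key, value))
    return dict(subtasks)


# Hypothetical (word, 1) records coming out of the map step
records = [("hello", 1), ("flink", 1), ("hello", 1), ("world", 1), ("flink", 1)]
assignment = redistribute(records, parallelism=2)
for index, owned in sorted(assignment.items()):
    print(index, owned)
```

Because a key always maps to the same index, each subtask can keep the counts for its own keys locally, which is what makes per-key aggregation possible at a parallelism greater than one.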
Windows and time in Flink
Windows and time are among the core concepts in Flink. In a real production environment, an aggregation over a data stream needs a window to delimit its scope, such as “the count over the past 5 minutes” or “the sum of the last 100 elements”.
Flink supports a variety of window models, such as tumbling windows, sliding windows and session windows.
The following figure shows the various window models supported by Flink:
Meanwhile, Flink supports three kinds of time semantics, event time, ingestion time and processing time, to meet the different requirements on time in real production.
In addition, Flink supports more advanced features such as stateful operators, fault tolerance, checkpoints and exactly-once semantics to meet users’ needs in different business scenarios.
Starting from the background of real-time computing, this lesson introduced how the field has developed so far. As a dark horse in real-time computing, Flink, with its advanced design, strong performance and support for a rich set of business scenarios, has become a skill that developers need to learn, and one of the sharpest tools in the field.
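How an event is assigned to the three window models can be sketched with a few lines of arithmetic on its timestamp. This is plain Python rather than Flink API code; the function names are made up, timestamps are treated as event time in whole seconds, and the session boundary rule (an event exactly one gap after the previous one starts a new session) is a choice made for this sketch.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end)


def tumbling_window(ts: int, size: int) -> Window:
    """Fixed, non-overlapping windows: each timestamp belongs to exactly one."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> List[Window]:
    """Overlapping windows: a timestamp belongs to every window that covers it."""
    windows = []
    start = ts - ts % slide  # latest window start at or before ts
    while start > ts - size:  # walk back while the window still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Activity-based windows: a session closes after `gap` seconds without events."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            sessions[-1] = (sessions[-1][0], ts + gap)  # extend the open session
        else:
            sessions.append((ts, ts + gap))  # start a new session
    return sessions


print(tumbling_window(7, size=5))             # (5, 10)
print(sliding_windows(7, size=10, slide=5))   # [(0, 10), (5, 15)]
print(session_windows([1, 2, 3, 10, 11], 3))  # [(1, 6), (10, 14)]
```

Tumbling and sliding windows depend only on the timestamp and the configured sizes, while session windows depend on the data itself, which is why they have no fixed start or end.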