Learning Flink from 0 to 1 – Introduction to Apache Flink



Preface

Flink is a stream computing framework. Why did I come into contact with it? I am currently responsible for the alerting part of a monitoring platform. The collected monitoring data is written directly into Kafka; on the alerting side, I need to read that monitoring data from the Kafka topic in real time, aggregate/transform/compute it, compare the results against the thresholds in the alert rules, and then take the corresponding alerting actions (DingTalk group, email, SMS, phone call, etc.). A simple diagram looks as follows:

[Figure: monitoring data → Kafka → real-time aggregation → threshold comparison → alert notification]
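To make this concrete, here is a minimal sketch (not the actual production job) of what such a pipeline looks like in Flink's DataStream API. The topic name `metrics`, the `name,value` message format, and the threshold value 90.0 are all hypothetical, and the sketch assumes the `flink-connector-kafka` dependency:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class AlertJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "alerting");

        // read monitoring data from Kafka in real time (topic name is hypothetical)
        env.addSource(new FlinkKafkaConsumer<>("metrics", new SimpleStringSchema(), props))
           // parse each message into a numeric reading; a "name,value" format is assumed
           .map(line -> Double.parseDouble(line.split(",")[1]))
           // compare against the alert-rule threshold (90.0 is a made-up value)
           .filter(value -> value > 90.0)
           // the real job would notify a DingTalk group / email / SMS here
           .print();

        env.execute("alert-pipeline-sketch");
    }
}
```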

That is the current structure of the alerting system. When I first joined the company, the architecture stored all monitoring data directly in Elasticsearch, and the alerting side searched Elasticsearch for the data our monitoring metrics required. Fortunately, Elasticsearch's search capability is strong enough. But notice the problem: all monitoring data is computed/transformed/aggregated from the collected raw data, then passed through the Kafka message queue into Elasticsearch, and only then do we query Elasticsearch for our monitoring data and apply the alert strategies. The whole process seems reasonable for monitoring, but for alerting, a problem in any intermediate link, such as a delay in the Kafka queue, a long time for monitoring data to be written into Elasticsearch, or a badly written query, means the data found in Elasticsearch is delayed. It may be 30 seconds, a minute, or more, which renders the alert message meaningless.

Why do I say that? Why do we need a monitoring and alerting platform at all? Simply because we hope to discover problems and raise alerts as early as possible, so that developers and operations staff can handle and resolve online problems in time and avoid huge losses for the company.

What's more, many companies now do early warning. How? By using big data and machine learning techniques to analyze periodic historical data and derive trend charts of monitoring metrics over various periods (one day / seven days / one month / one quarter / one year). The current value of a monitoring metric can then be compared against its trend chart, and when the threshold of an alert rule is about to be reached, an early warning can be raised in advance, letting operations staff know about and find the problem ahead of time, thereby avoiding the loss or at least minimizing it. Of course, this is what I intend to do, and I should be able to learn a lot from it.

That is how I came to work with the stream computing framework Flink, which is comparable to the widely used Spark.

I have been working with Flink for some time now. At present there is only one very thin book on it in Chinese, and no more than three in English.

I have compiled some Flink learning materials and put them all in my WeChat official account. You can follow my official account zhisheng and reply with the keyword Flink to get them, no strings attached.

In addition, some blogs are recommended here:

1. Official website: https://flink.apache.org/

2. GitHub: https://github.com/apache/flink

3. https://blog.csdn.net/column/…

4. https://blog.csdn.net/lmalds/…

5. http://wuchong.me/

6. https://blog.csdn.net/liguohu…

The introduction below draws on many of the materials above; thanks to their authors! Before introducing Flink, let's first look at the types of data sets and of data operation models.

What are the data set types:

  • Infinite data set: an unbounded set of continuously appended data
  • Bounded data set: a finite set of data that does not change

So what are the common infinite data sets?

  • Real-time interaction data between users and clients
  • Application logs generated in real time
  • Real-time transactions in financial markets

What are the data operation models:

  • Streaming: the computation runs continuously as long as data is being produced
  • Batch: the computation runs within a predefined time and releases computing resources when it finishes

Flink can process bounded data sets and unbounded data sets, i.e. both streaming and batch data.
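As a quick illustration (a minimal sketch, not from the original post), the same idea in code: Flink's classic DataSet API runs a bounded computation once, while the DataStream API keeps computing as long as data arrives:

```java
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedVsUnbounded {
    public static void main(String[] args) throws Exception {
        // Batch: a bounded data set, computed once; resources are released afterwards
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        batchEnv.fromElements(1, 2, 3)
                .map(n -> n * 2)
                .print(); // print() triggers execution in the DataSet API

        // Streaming: an unbounded stream; the job keeps computing as data arrives
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv.socketTextStream("localhost", 9000) // e.g. fed by `nc -lk 9000`
                 .map(String::toUpperCase)
                 .print();
        streamEnv.execute("unbounded-sketch");
    }
}
```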

What is Flink?

[Figures: three slides introducing what Flink is]

The three slides above are taken from the "Flink technology introduction and future outlook" talk at the Yunxie Chengdu event; they will be removed upon request.

From bottom to top, the overall structure of Flink

[Figure: the Flink stack, from deployment up to libraries]

Bottom up:

1. Deployment: Flink supports local execution, can run on a standalone cluster or on clusters managed by YARN or Mesos, and can also be deployed in the cloud.

2. Runtime: Flink's core is a distributed streaming dataflow engine, which processes data one event at a time.

3. APIs: DataStream, DataSet, Table, and SQL.

4. Libraries: Flink also ships dedicated libraries for complex event processing, machine learning, graph processing, and Apache Storm compatibility.

Flink's dataflow programming model

Levels of abstraction

Flink provides different levels of abstraction to develop streaming or batch applications.

[Figure: Flink's levels of abstraction, from stateful stream processing up to SQL]

  • The lowest level of abstraction offers stateful streams. It is embedded in the DataStream API via the ProcessFunction and allows users to freely process events from one or more streams while using consistent, fault-tolerant state. In addition, users can register callbacks on event time and processing time, which lets programs implement complex computations.
  • The DataStream / DataSet APIs are the core APIs Flink provides. The DataSet API handles bounded data sets, while the DataStream API handles bounded or unbounded data streams. Users can transform and compute data through various methods (map / flatMap / window / keyBy / sum / max / min / avg / join, etc.).
  • The Table API is a declarative DSL centered on tables, which may be dynamically changing (when representing streams). It offers operations such as select, project, join, group by, and aggregate, and is simpler to use (less code).

Tables can be converted to and from DataStream / DataSet seamlessly, and a program can mix the Table API with the DataStream and DataSet APIs.

  • The highest level of abstraction Flink provides is SQL. This level is similar to the Table API in syntax and expressiveness, but represents programs as SQL query expressions. The SQL abstraction interacts closely with the Table API, and SQL queries can be executed directly on tables defined with the Table API. A short sketch of these two levels follows this list.
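As a hedged illustration of the two upper abstraction levels, here is a minimal sketch that lifts a DataStream into a table and queries it with SQL. The exact API varies across Flink versions; this assumes a Flink 1.11-era `flink-table-api-java-bridge` dependency, and the field name `word` is chosen for the example:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class TableSqlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        DataStream<String> words = env.fromElements("flink", "kafka", "flink");

        // Table API level: lift the stream into a (dynamic) table with one column
        Table table = tEnv.fromDataStream(words).as("word");
        tEnv.createTemporaryView("words", table);

        // SQL level: the same computation expressed as a SQL query over that table
        Table counts = tEnv.sqlQuery(
            "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word");

        // back to a DataStream; a retract stream is needed because counts update
        tEnv.toRetractStream(counts, Row.class).print();
        env.execute("table-sql-sketch");
    }
}
```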

Flink program and data flow structure

[Figure: the structure of a Flink program — Source, Transformation, Sink]

The Flink application structure is as shown in the figure above:

1. Source: the data source. Flink's sources for stream and batch processing fall into four categories: sources based on local collections, sources based on files, sources based on network sockets, and custom sources. Common custom sources include Apache Kafka, Amazon Kinesis Streams, RabbitMQ, the Twitter Streaming API, Apache NiFi, etc.; of course, you can also define your own source.

2. Transformation: the various data transformation operations, including map / flatMap / filter / keyBy / reduce / fold / aggregations / window / windowAll / union / window join / split / select / project, etc. There are many operations, with which you can transform and compute the data into the shape you want.

3. Sink: the receiver, where Flink sends the transformed data; you may need to store it. Flink's common sinks fall roughly into these categories: writing to files, printing to stdout, writing to sockets, and custom sinks. Common custom sinks include Apache Kafka, RabbitMQ, MySQL, Elasticsearch, Apache Cassandra, the Hadoop file system, etc.; similarly, you can also define your own sink. A minimal sketch of this three-part structure follows.
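Putting the three parts together, a minimal runnable sketch using a collection-based source, chained transformations, and the built-in print sink:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceTransformSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("flink", "kafka", "storm") // 1. Source: a local collection
           .map(String::toUpperCase)                // 2. Transformation: map
           .filter(s -> s.startsWith("F"))          // 2. Transformation: filter
           .print();                                // 3. Sink: print to stdout

        env.execute("source-transform-sink-sketch");
    }
}
```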

Why Flink?

Flink is an open source distributed streaming framework:

① It provides accurate results, even for out-of-order or late-arriving data.

② It is stateful and fault-tolerant, and can recover seamlessly from failures while preserving application state.

③ It runs at large scale, with good throughput and low latency on thousands of nodes.

Earlier, we discussed how data set types (bounded vs. infinite) match up with operation models (batch vs. streaming). Flink's streaming computation model enables many functional features, such as state management, handling out-of-order data, and flexible windows. These features are all important for computing accurate results over infinite data sets.

  • Flink guarantees strong consistency (exactly-once state semantics) for stateful computations. "Stateful" means the application can maintain aggregates of the data it has processed over time, and Flink's checkpointing mechanism ensures the consistency of the application state in the event of a failure. A sketch of enabling checkpointing follows the figure below.

[Figure: checkpointed application state]
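A minimal sketch of turning checkpointing on (the 10-second interval is an example value; exactly-once is in fact Flink's default mode, set explicitly here for clarity):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // snapshot all operator state every 10 seconds
        env.enableCheckpointing(10_000);

        // exactly-once is the default consistency mode; shown here for clarity
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
    }
}
```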

  • Flink supports stream processing and windowing with event-time semantics. The event-time mechanism makes it possible to compute accurate results over streams whose events arrive out of order or even late. A watermark sketch follows the figure below.

[Figure: event-time processing of out-of-order events]
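A minimal sketch of assigning event-time timestamps and watermarks, assuming the Flink 1.11+ `WatermarkStrategy` API; the `Event` type and the 5-second out-of-orderness bound are hypothetical:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeSketch {
    // hypothetical event carrying its own event-time timestamp
    public static class Event {
        public long timestampMillis;
        public String payload;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(new Event())
           .assignTimestampsAndWatermarks(
               WatermarkStrategy
                   // tolerate events arriving up to 5 seconds out of order
                   .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                   // tell Flink where the event-time timestamp lives
                   .withTimestampAssigner((event, ts) -> event.timestampMillis))
           .print();

        env.execute("event-time-sketch");
    }
}
```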

  • Besides data-driven windows, Flink supports flexible windows based on time, count, sessions, etc. Windows can be customized with flexible trigger conditions to support sophisticated streaming patterns, which makes it possible to model the environment in which the data is actually created. A window sketch follows the figure below.

[Figure: time, count, and session windows]
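A minimal sketch of those three window flavors on a keyed stream (the 10-second, 100-element, and 30-second values are arbitrary examples):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KeyedStream<Tuple2<String, Integer>, String> keyed =
            env.fromElements(Tuple2.of("flink", 1), Tuple2.of("kafka", 1))
               .keyBy(t -> t.f0);

        // time window: sum counts per key over 10-second tumbling windows
        keyed.window(TumblingProcessingTimeWindows.of(Time.seconds(10))).sum(1).print();

        // count window: emit a sum every 100 elements per key
        keyed.countWindow(100).sum(1).print();

        // session window: close the window after a 30-second gap in activity
        keyed.window(ProcessingTimeSessionWindows.withGap(Time.seconds(30))).sum(1).print();

        env.execute("window-sketch");
    }
}
```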

  • Flink's fault tolerance is lightweight, allowing the system to sustain high throughput while providing strong consistency guarantees at the same time. Flink recovers from failures with zero data loss, without trading off reliability against latency.

[Figure: lightweight fault tolerance]

  • Flink can satisfy the requirements of high throughput and low latency (computing large amounts of data quickly). The figure below shows a performance comparison of Apache Flink and Apache Storm on a distributed stream data cleansing task.

[Figure: throughput comparison of Apache Flink and Apache Storm]

  • Flink savepoints provide a state versioning mechanism, making it possible to update applications or reprocess historical data with no state loss and minimal downtime.

[Figure: savepoints for application versioning]

  • Flink is designed to run on large clusters with thousands of nodes. Besides standalone cluster deployment, Flink also supports deployment on YARN and Mesos.
  • Flink programs are inherently parallel and distributed. Data streams can be partitioned into stream partitions, and operators are divided into operator subtasks. These subtasks run independently of each other, in different threads on different machines or containers. The number of operator subtasks is the parallelism of that particular operator, and different operators of the same program may have different parallelism. As shown in the figure below, the source operator has a parallelism of 2 while the final sink operator has a parallelism of 1; a short sketch follows the figure.

[Figure: a parallel dataflow with source parallelism 2 and sink parallelism 1]
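A minimal sketch matching the figure, assuming the classic DataStream API (where `generateSequence` is a parallel source):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2); // default parallelism for all operators: 2

        env.generateSequence(1, 1000)   // source runs as 2 parallel subtasks
           .map(n -> n * n)             // map also runs with parallelism 2
           .print()
           .setParallelism(1);          // but the sink is reduced to parallelism 1

        env.execute("parallelism-sketch");
    }
}
```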

  • Its own memory management

    Flink implements its own memory management inside the JVM, making it independent of Java's default garbage collector. It manages memory efficiently by using hashing, indexing, caching, and sorting.

  • Rich libraries

    Flink ships with rich libraries for machine learning, graph processing, relational data processing, and more. Thanks to its architecture, it is easy to perform complex event processing and alerting.

Distributed execution

Flink's job submission process is shown in the following figure:

[Figure: Flink job submission flow]

1. Program code: the Flink application code we write.

2. Job Client: the Job Client is not an internal part of Flink's program execution, but it is the starting point of the execution. It is responsible for accepting the user's program code, creating a dataflow from it, and submitting the dataflow to the Job Manager for further execution. Once execution completes, the Job Client returns the results to the user.

[Figure: the Job Client turning program code into a dataflow]

3. Job Manager: the master process coordinates and manages the execution of the program. Its main responsibilities include scheduling tasks, managing checkpoints, and failure recovery. There must be at least one master in the cluster; the master is responsible for scheduling tasks and coordinating checkpointing and disaster recovery. With a high-availability setup there can be several masters, but exactly one of them is the leader while the others are standbys. The Job Manager contains three important components: the actor system, the scheduler, and checkpointing.

4. Task Manager: receives the tasks to be deployed from the Job Manager. A Task Manager is a worker node that executes tasks in one or more threads inside a JVM. The parallelism of task execution is determined by the task slots available on each Task Manager. Each task slot represents a fixed share of the Task Manager's resources: for example, a Task Manager with four slots dedicates 25% of its managed memory to each slot. One or more threads may run in a task slot; threads in the same slot share the same JVM, and tasks in the same JVM share TCP connections and heartbeat messages. A slot in the Task Manager thus stands for a fixed amount of memory available to its threads; note that slots isolate only memory, not CPU. By default, Flink allows subtasks to share slots, even subtasks of different tasks, as long as they come from the same job. This sharing gives better resource utilization.

[Figure: Task Managers, task slots, and slot sharing]

Final words

This article mainly explains why I came into contact with Flink, then starts from data set types and data operation models, and introduces what Flink is, its overall architecture, the APIs it provides, its advantages, and how its distributed jobs run. It is a light introductory piece; I hope it gives you a first idea of Flink.

Follow me

Please credit the original address when reposting: http://www.54tianzhisheng.cn/2018/10/13/flink-introduction/

In addition, I have compiled some Flink learning materials and put them all in my WeChat official account. You can add me on WeChat: zhisheng_tian, and reply with the keyword Flink to get them, no strings attached.


Related articles

1. Learning Flink from 0 to 1 – Introduction to Apache Flink

2. Learning Flink from 0 to 1 – Setting up the Flink 1.6.0 environment on Mac and building and running a simple program

3. Learning Flink from 0 to 1 – A detailed look at the Flink configuration file

4. Learning Flink from 0 to 1 – Introduction to Data Sources

5. Learning Flink from 0 to 1 – How to customize a Data Source?

6. Learning Flink from 0 to 1 – Introduction to Data Sinks

7. Learning Flink from 0 to 1 – How to customize a Data Sink?