Learning Flink from 0 to 1 – Introduction to data source

Time:2021-11-28

Preface

What is a data source? As the name suggests, it is where the data comes from.

As a stream computing framework, Flink can be used for batch processing, that is, processing static and historical data sets; it can also be used for stream processing, that is, processing real-time data streams as they arrive and producing results in real time. As long as data keeps flowing in, Flink keeps computing. The data source is where that data comes from.

In Flink, you can use StreamExecutionEnvironment.addSource(sourceFunction) to add a data source to your program.

Flink ships with several ready-made source functions. You can also write your own: implement SourceFunction for a non-parallel source, or implement the ParallelSourceFunction interface or extend RichParallelSourceFunction for a parallel source.

The following ready-made stream sources are available from StreamExecutionEnvironment:

[Figure: stream sources available from StreamExecutionEnvironment]

Generally speaking, they fall into the following categories:

Collection based

1. fromCollection(Collection) – creates a data stream from a java.util.Collection. All elements in the collection must be of the same type.

2. fromCollection(Iterator, Class) – creates a data stream from an iterator. The class specifies the type of the elements returned by the iterator.

3. fromElements(T…) – creates a data stream from a given sequence of objects. All objects must be of the same type.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Event> input = env.fromElements(
    new Event(1, "barfoo", 1.0),
    new Event(2, "start", 2.0),
    new Event(3, "foobar", 3.0),
    ...
);

4. fromParallelCollection(SplittableIterator, Class) – creates a parallel data stream from an iterator. The class specifies the type of the elements returned by the iterator.

5. generateSequence(from, to) – creates a parallel data stream that generates the sequence of numbers in the given interval.

File based

1. readTextFile(path) – reads a text file, that is, a file that conforms to the TextInputFormat specification, line by line and returns the lines as strings.

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> text = env.readTextFile("file:///path/to/file");

2. readFile(fileInputFormat, path) – reads the file (once) according to the specified file input format.

3. readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo) – this is the method called internally by the two methods above. It reads the files under the given path according to the given fileInputFormat. Depending on the watchType provided, the source either monitors the path periodically (every interval milliseconds) for new data (FileProcessingMode.PROCESS_CONTINUOUSLY), or processes the data currently in the path once and exits (FileProcessingMode.PROCESS_ONCE). You can use pathFilter to further exclude files from being processed.

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<MyEvent> stream = env.readFile(
        myFormat, myFilePath, FileProcessingMode.PROCESS_CONTINUOUSLY, 100,
        FilePathFilter.createDefaultFilter(), typeInfo);

Implementation:

Internally, Flink divides the file reading process into two subtasks: directory monitoring and data reading. Each subtask is implemented by a separate entity. Directory monitoring is performed by a single non-parallel task (parallelism of 1), while data reading is performed by multiple tasks running in parallel. The parallelism of the latter equals the parallelism of the job. The single monitoring task scans the directory (periodically or only once, depending on the watchType), finds the files to be processed, divides them into splits, and assigns these splits to the downstream readers. The readers are responsible for reading the actual data. Each split is read by only one reader, but one reader can read multiple splits, one after another.
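The division of labour described above can be illustrated with a small plain-Java sketch. It does not use Flink at all: the Split class, the fixed split size, and the round-robin assignment are assumptions made purely for illustration, and Flink's real monitoring task and readers may well create and distribute splits differently. The only properties it tries to show are that a file is cut into splits and that each split goes to exactly one reader, while a reader may receive several.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAssignmentSketch {

    /** A contiguous byte range of one file (hypothetical stand-in for an input split). */
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return file + "[" + start + "+" + length + "]";
        }
    }

    /** Monitoring step: cut one file into splits of at most splitSize bytes. */
    static List<Split> createSplits(String file, long fileSize, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            splits.add(new Split(file, start, Math.min(splitSize, fileSize - start)));
        }
        return splits;
    }

    /** Assignment step: every split goes to exactly one reader (round-robin is an assumption). */
    static List<List<Split>> assign(List<Split> splits, int readerCount) {
        List<List<Split>> perReader = new ArrayList<>();
        for (int i = 0; i < readerCount; i++) {
            perReader.add(new ArrayList<>());
        }
        for (int i = 0; i < splits.size(); i++) {
            perReader.get(i % readerCount).add(splits.get(i));
        }
        return perReader;
    }

    public static void main(String[] args) {
        // A 250-byte file cut into 100-byte splits and handed to two readers.
        List<Split> splits = createSplits("a.log", 250, 100);
        System.out.println(assign(splits, 2));
    }
}
```

With two readers and three splits, one reader ends up with two splits and reads them one after the other, which matches the "one reader, many splits" relationship in the text.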

Important notes:

If watchType is set to FileProcessingMode.PROCESS_CONTINUOUSLY, a file's contents are reprocessed in full whenever the file is modified. This can break “exactly once” semantics, because appending data to the end of a file causes all of its contents to be reprocessed.

If watchType is set to FileProcessingMode.PROCESS_ONCE, the source scans the path only once and then exits without waiting for the readers to finish reading the file contents. The readers, of course, continue until they have read everything. After the source is closed, there will be no more checkpoints. This may result in slower recovery after a node failure, because the job will resume reading from the last checkpoint.
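To see why appending to a file leads to reprocessing in continuous mode, here is a minimal plain-Java sketch of a monitor that decides what to process by looking at modification times. It is not Flink's implementation: the class name, the per-file timestamp map, and the scan method are all hypothetical, chosen only to make the behaviour described in the first note visible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ContinuousMonitorSketch {

    // Modification time observed for each file during the previous scans.
    private final Map<String, Long> lastSeen = new HashMap<>();

    /** One scan: returns the files that are new or modified since the last scan. */
    List<String> scan(Map<String, Long> currentModTimes) {
        List<String> toProcess = new ArrayList<>();
        for (Map.Entry<String, Long> entry : currentModTimes.entrySet()) {
            Long previous = lastSeen.get(entry.getKey());
            if (previous == null || entry.getValue() > previous) {
                // The monitor only knows "this file changed", so the whole
                // file is handed out again, not just the appended part.
                toProcess.add(entry.getKey());
                lastSeen.put(entry.getKey(), entry.getValue());
            }
        }
        return toProcess;
    }

    public static void main(String[] args) {
        ContinuousMonitorSketch monitor = new ContinuousMonitorSketch();
        Map<String, Long> directory = new LinkedHashMap<>();

        directory.put("events.log", 100L);
        System.out.println(monitor.scan(directory)); // first scan: file is new

        System.out.println(monitor.scan(directory)); // nothing changed

        directory.put("events.log", 200L);           // simulate an append
        System.out.println(monitor.scan(directory)); // file is scheduled again
    }
}
```

In this sketch the third scan schedules events.log a second time, so records that were already read once would be read again. That is the effect the note warns about.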

Socket based:

socketTextStream(String hostname, int port) – reads from a socket. Elements can be separated by a delimiter.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Tuple2<String, Integer>> dataStream = env
        .socketTextStream("localhost", 9999) // read data from port 9999 on localhost
        .flatMap(new Splitter())
        .keyBy(0)
        .timeWindow(Time.seconds(5))
        .sum(1);

This socket-based word count program is the one used in the article "Learning Flink from 0 to 1 – an introduction to building a Flink 1.6.0 environment and building and running simple programs on Mac".

Custom:

addSource – attaches a new source function. For example, you can use addSource(new FlinkKafkaConsumer011<>(…)) to read data from Apache Kafka.

Let’s look at the characteristics of each category:

1. Collection based: bounded data sets, best suited to local testing.

2. File based: suited to monitoring files for changes and reading their contents.

3. Socket based: connects to a port on a host and reads data from the socket.

4. Custom addSource: in most scenarios the data is unbounded and keeps flowing in. For example, to consume data from a Kafka topic you need addSource. Probably because this is so common, Flink provides FlinkKafkaConsumer011 and similar classes ready to use. It is worth looking at the base class FlinkKafkaConsumerBase, which is the foundation of Kafka consumption in Flink.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<KafkaEvent> input = env
        .addSource(
            new FlinkKafkaConsumer011<>(
                parameterTool.getRequired("input-topic"), // get the topic passed in as a parameter
                new KafkaEventSchema(),
                parameterTool.getProperties())
            .assignTimestampsAndWatermarks(new CustomWatermarkExtractor()));

Flink currently supports the following common sources:

[Figure: common sources currently supported by Flink]

What if you want to customize your own source?

Then you need to understand the SourceFunction interface. It is the root interface of all stream sources and extends Function, a marker interface (an empty interface).

SourceFunction defines two interface methods:

[Figure: the two methods defined by the SourceFunction interface]

1. run: starts the source, that is, connects to an external data source and then emits elements to form a stream (in most cases by running a while loop inside this method).

2. cancel: cancels the source, that is, stops the loop in run that emits elements.

Normally, implementing these two methods is all a SourceFunction needs. In fact, the two methods also imply a fixed implementation template.

For example, to implement an XxxSourceFunction, the general template is as follows (taken directly from the Flink source code):

[Figure: source function template from the Flink source code]
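As a rough illustration of the run/cancel template described above, here is a self-contained plain-Java sketch. It is not the excerpt from the Flink source code: Collector and SimpleSource are hypothetical stand-ins for Flink's source context and SourceFunction, reduced to what the template needs, and CountingSource is an invented example. The pattern it tries to show is a loop in run guarded by a volatile flag that cancel switches off.

```java
import java.util.ArrayList;
import java.util.List;

class SourceTemplateSketch {

    /** Hypothetical stand-in for the context a source emits elements into. */
    interface Collector<T> {
        void collect(T element);
    }

    /** Hypothetical stand-in for SourceFunction, reduced to run and cancel. */
    interface SimpleSource<T> {
        void run(Collector<T> ctx);

        void cancel();
    }

    /** Emits the numbers from..to until it is finished or cancelled. */
    static final class CountingSource implements SimpleSource<Long> {
        // volatile because cancel() is typically called from another thread than run()
        private volatile boolean isRunning = true;
        private final long from;
        private final long to;

        CountingSource(long from, long to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public void run(Collector<Long> ctx) {
            long next = from;
            // The emit loop: keeps going while the flag is set and data remains.
            while (isRunning && next <= to) {
                ctx.collect(next);
                next++;
            }
        }

        @Override
        public void cancel() {
            // Flipping the flag makes the loop in run() exit on its next check.
            isRunning = false;
        }
    }

    /** Runs the source to completion and returns everything it emitted. */
    static List<Long> drain(long from, long to) {
        List<Long> out = new ArrayList<>();
        new CountingSource(from, to).run(out::add);
        return out;
    }

    /** Cancels the source once `limit` elements have been emitted. */
    static List<Long> emitUntilCancelled(long from, long to, int limit) {
        List<Long> out = new ArrayList<>();
        CountingSource source = new CountingSource(from, to);
        source.run(element -> {
            out.add(element);
            if (out.size() == limit) {
                source.cancel(); // in a real job the framework would make this call
            }
        });
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(1, 5));
        System.out.println(emitUntilCancelled(1, 100, 3));
    }
}
```

A real custom source would implement Flink's own SourceFunction and emit through the context Flink passes to run, but the shape stays the same: a loop in run, and a flag that cancel flips to end it.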

Finally

This article mainly covers the common Flink sources and briefly introduces how to customize a source.

Follow me

When reprinting, please cite the original address: http://www.54tianzhisheng.cn/2018/10/28/flink-sources/

In addition, I have compiled some Flink learning materials and put them all on my WeChat official account. You can add me on WeChat (zhisheng_tian) and reply with the keyword Flink to get them for free.

Related articles

1、Learning Flink from 0 to 1 – Introduction to Apache Flink

2、Learning Flink from 0 to 1 — an introduction to building Flink 1.6.0 environment and building and running simple programs on MAC

3、Learn Flink from 0 to 1 – detailed explanation of Flink profile

4、Learning Flink from 0 to 1 – Introduction to data source

5、Learn Flink from 0 to 1 – how to customize the data source?

6、Learning Flink from 0 to 1 – Introduction to data sink

7、Learn Flink from 0 to 1 – how to customize data sink?
