Learning Flink from 0 to 1 – Flink data transformation

Time: 2021-9-26


Preface

In the first article about Flink, Learning Flink from 0 to 1 – Introduction to Apache Flink, we talked about the structure of a Flink program.

[Figure: Flink application structure]

The Flink application structure is as shown in the figure above:

1. Source: the data source. Flink's sources for stream and batch processing fall into four categories: sources based on local collections, sources based on files, sources based on network sockets, and custom sources. Common custom sources include Apache Kafka, Amazon Kinesis Streams, RabbitMQ, the Twitter Streaming API, Apache NiFi, etc.; of course, you can also define your own source.

2. Transformation: the various data conversion operations, including map / flatMap / filter / keyBy / reduce / fold / aggregations / window / windowAll / union / window join / split / select / project, etc. There are many operations, and they let you transform and compute the data into whatever you need.

3. Sink: the receiver, i.e. where Flink sends the transformed data; you may need to store it. Flink's common sinks fall into the following categories: write to file, print to standard output, write to socket, and custom sinks. Common custom sinks include Apache Kafka, RabbitMQ, MySQL, Elasticsearch, Apache Cassandra, the Hadoop file system, etc.; similarly, you can also define your own sink. A minimal code skeleton tying the three parts together is sketched after this list.
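The three parts line up one after another in code. Below is a minimal sketch of that structure, using nothing beyond Flink's built-in socket source and print sink (the host and port are placeholders, not from this article):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// 1. Source: a built-in socket source as a stand-in
DataStream<String> source = env.socketTextStream("localhost", 9000);

// 2. Transformation: uppercase every record
DataStream<String> transformed = source.map(String::toUpperCase);

// 3. Sink: print to standard output
transformed.print();

// execute() throws Exception; call it from a main that declares it
env.execute("source -> transformation -> sink");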

Sources and sinks were introduced in the previous four articles:

1. Learning Flink from 0 to 1 – Introduction to data source

2. Learn Flink from 0 to 1 – how to customize the data source?

3. Learning Flink from 0 to 1 – Introduction to data sink

4. Learn Flink from 0 to 1 – how to customize data sink?

In this article, let's take a look at Flink data transformation. There are quite a lot of transformation operations, and they deserve a proper explanation!
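The demos below all reuse the Student class and the student DataStream from the earlier data source articles. The real definitions live in the GitHub repo linked at the end; a minimal sketch consistent with the fields used in this article would be:

// Minimal sketch of the Student POJO assumed by the examples
// (only the fields the examples touch; the real class is in the repo)
public class Student {
    public int id;
    public String name;
    public String password;
    public int age;

    // Flink POJOs need a public no-argument constructor
    public Student() {
    }
}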

Transformation

Map

This is one of the simplest transformations: the input is a data stream and the output is also a data stream, with each input element mapped to exactly one output element.

Let's reuse the Student example from the earlier articles and map the data:

SingleOutputStreamOperator<Student> map = student.map(new MapFunction<Student, Student>() {
    @Override
    public Student map(Student value) throws Exception {
        Student s1 = new Student();
        s1.id = value.id;
        s1.name = value.name;
        s1.password = value.password;
        s1.age = value.age + 5; // add 5 to every student's age
        return s1;
    }
});
map.print();

Everyone's age is increased by 5 years; everything else stays the same.
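For reference, the same transformation written as a Java 8 lambda (a sketch with identical behavior):

SingleOutputStreamOperator<Student> map = student.map(value -> {
    Student s1 = new Student();
    s1.id = value.id;
    s1.name = value.name;
    s1.password = value.password;
    s1.age = value.age + 5; // same age bump as above
    return s1;
});
map.print();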


FlatMap

FlatMap takes one record and outputs zero, one, or more records.

SingleOutputStreamOperator<Student> flatMap = student.flatMap(new FlatMapFunction<Student, Student>() {
    @Override
    public void flatMap(Student value, Collector<Student> out) throws Exception {
        if (value.id % 2 == 0) {
            // forward only students with an even id; odd ids emit nothing
            out.collect(value);
        }
    }
});
flatMap.print();

Here, only the students with an even ID are collected and sent downstream.
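The example above only ever emits zero or one record per input. To show the "more" case, here is a sketch that splits each line of a hypothetical DataStream<String> named lines into words:

SingleOutputStreamOperator<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String line, Collector<String> out) throws Exception {
        // one input line can produce many output records
        for (String word : line.split("\\s+")) {
            out.collect(word);
        }
    }
});
words.print();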


Filter

The filter function evaluates a condition for each record and keeps only the records for which it returns true.

SingleOutputStreamOperator<Student> filter = student.filter(new FilterFunction<Student>() {
    @Override
    public boolean filter(Student value) throws Exception {
        // keep only the students whose id is greater than 95
        return value.id > 95;
    }
});
filter.print();

Here, the students with an ID greater than 95 are kept and printed.


KeyBy

KeyBy logically partitions a stream based on a key. Internally it uses a hash function to partition the stream, and it returns a KeyedStream.

KeyedStream<Student, Integer> keyBy = student.keyBy(new KeySelector<Student, Integer>() {
    @Override
    public Integer getKey(Student value) throws Exception {
        return value.age; // partition the stream by the age field
    }
});
keyBy.print();

Above, the stream is partitioned by the students' age using the keyBy operation.
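In this version of the API, the key can also be given as a POJO field name; with field expressions, the key type of the returned KeyedStream is a generic Tuple:

// equivalent keyBy using a field expression on the Student POJO
KeyedStream<Student, Tuple> keyByAge = student.keyBy("age");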

Reduce

Reduce returns a single result value, and the reduce operation always creates a new value for each element it processes. Commonly used aggregations such as average, sum, min, max, and count can all be implemented with the reduce method.

SingleOutputStreamOperator<Student> reduce = student.keyBy(new KeySelector<Student, Integer>() {
    @Override
    public Integer getKey(Student value) throws Exception {
        return value.age;
    }
}).reduce(new ReduceFunction<Student>() {
    @Override
    public Student reduce(Student value1, Student value2) throws Exception {
        Student student1 = new Student();
        student1.name = value1.name + value2.name;             // concatenate names
        student1.id = (value1.id + value2.id) / 2;             // pairwise average of ids
        student1.password = value1.password + value2.password; // concatenate passwords
        student1.age = (value1.age + value2.age) / 2;          // pairwise average of ages
        return student1;
    }
});
reduce.print();

First we keyBy the data stream, because the reduce operation can only be applied to a KeyedStream; then the Student objects are merged pairwise, averaging their ages. Note that the pairwise (value1 + value2) / 2 is a rolling pairwise average, not the true arithmetic mean of all elements for a key.

Fold

Fold rolls up a KeyedStream by combining the last folded value with the current record, and it emits a data stream. (Note: fold was deprecated in later Flink versions in favor of aggregate.)

keyedStream.fold("1", new FoldFunction<Integer, String>() {
    @Override
    public String fold(String accumulator, Integer value) throws Exception {
        return accumulator + "=" + value;
    }
});
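To make this concrete, here is a runnable sketch on hypothetical data, keyed into even/odd numbers and folding each group into a string (env is the StreamExecutionEnvironment):

DataStream<String> folded = env.fromElements(1, 2, 3, 4, 5)
        .keyBy(new KeySelector<Integer, Integer>() {
            @Override
            public Integer getKey(Integer value) throws Exception {
                return value % 2; // 0 = even, 1 = odd
            }
        })
        .fold("start", new FoldFunction<Integer, String>() {
            @Override
            public String fold(String accumulator, Integer value) throws Exception {
                return accumulator + "-" + value;
            }
        });
folded.print(); // the odd key emits start-1, start-1-3, start-1-3-5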

Aggregations

The DataStream API supports various aggregations, such as min, max, sum, etc. These functions can be applied on a KeyedStream to get rolling aggregates:

keyedStream.sum(0)
keyedStream.sum("key")
keyedStream.min(0)
keyedStream.min("key")
keyedStream.max(0)
keyedStream.max("key")
keyedStream.minBy(0)
keyedStream.minBy("key")
keyedStream.maxBy(0)
keyedStream.maxBy("key")

The difference between max and maxBy is that max returns the maximum value in the stream, while maxBy returns the element that has the maximum value. The same applies to min and minBy.
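A small sketch on hypothetical data makes the difference visible: max(2) keeps a running maximum of field 2 while the non-aggregated fields typically stay as in the first record seen, whereas maxBy(2) emits the whole element that holds the maximum (env is the StreamExecutionEnvironment):

DataStream<Tuple3<String, String, Integer>> in = env.fromElements(
        Tuple3.of("a", "first", 1),
        Tuple3.of("a", "second", 3));

// max: only field 2 is aggregated, so the second emitted record
// is typically (a,first,3) rather than (a,second,3)
in.keyBy(0).max(2).print();

// maxBy: the whole element with the maximum field 2 is emitted,
// so the second emitted record is (a,second,3)
in.keyBy(0).maxBy(2).print();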

Window

The window function allows grouping an existing KeyedStream by time or other criteria. The following windows a keyed stream into 10-second time windows:

inputStream.keyBy(0).timeWindow(Time.seconds(10)); // shorthand for a time-based window assigner

Flink uses windows to slice a (potentially infinite) data stream into finite chunks, and transformations can then be applied to each chunk. To window a stream, we need to assign a key on which the stream can be partitioned, and a function that describes what transformation to perform on the windowed stream.

To slice a stream into windows, we can use Flink's built-in window assigners, with options such as tumbling windows, sliding windows, global windows, and session windows. Flink also allows you to write a custom window assigner by extending the WindowAssigner class. A later article will explain how these different windows work.
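As a slightly fuller sketch with an explicit assigner, here is a tumbling processing-time window applied to a hypothetical keyed (word, count) stream named wordCounts, summing the counts every 10 seconds:

DataStream<Tuple2<String, Integer>> windowedCounts = wordCounts
        .keyBy(0)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .sum(1); // per-word sum of the counts within each 10-second window
windowedCounts.print();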

WindowAll

The windowAll function allows grouping a regular (non-keyed) data stream. Generally, this is a non-parallel transformation because it runs on a non-partitioned stream.

Similar to the regular stream functions, there are windowed stream functions; the only difference is that they work on windowed streams. So window reduce works like the reduce function, window fold works like the fold function, and there are aggregations as well.

inputStream.timeWindowAll(Time.seconds(10)); // applies to the whole non-keyed stream

Union

The union function combines two or more data streams into one. The streams are combined in parallel. If we union a stream with itself, each record is output twice.

inputStream.union(inputStream1, inputStream2, ...);

Window join

We can join two data streams on a key within a common window.

inputStream.join(inputStream1)
           .where(0).equalTo(1) // doc-style shorthand for the keys; see the runnable sketch below
           .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
           .apply(new JoinFunction() {...});

The above example joins two streams within a 5-second window, where the join condition is that the first attribute of the first stream equals the second attribute of the other stream.
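Here is a more complete sketch on hypothetical streams, orders as (userId, amount) and users as (name, userId), joined on the user id within 5-second windows (env is the StreamExecutionEnvironment):

DataStream<Tuple2<String, Integer>> orders = env.fromElements(
        Tuple2.of("u1", 100), Tuple2.of("u2", 50));
DataStream<Tuple2<String, String>> users = env.fromElements(
        Tuple2.of("Alice", "u1"), Tuple2.of("Bob", "u2"));

DataStream<String> joined = orders.join(users)
        .where(new KeySelector<Tuple2<String, Integer>, String>() {
            @Override
            public String getKey(Tuple2<String, Integer> order) throws Exception {
                return order.f0; // first attribute of the first stream
            }
        })
        .equalTo(new KeySelector<Tuple2<String, String>, String>() {
            @Override
            public String getKey(Tuple2<String, String> user) throws Exception {
                return user.f1; // second attribute of the second stream
            }
        })
        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, String>, String>() {
            @Override
            public String join(Tuple2<String, Integer> order, Tuple2<String, String> user) throws Exception {
                return user.f0 + " spent " + order.f1;
            }
        });
joined.print();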

Split

This function splits a stream into two or more streams based on a condition. You can use this when you have a mixed stream and want to process each kind of data separately. (Note: split/select were deprecated in later Flink versions; the side-output alternative is sketched at the end of the Select section.)

SplitStream<Integer> split = inputStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        List<String> output = new ArrayList<String>(); 
        if (value % 2 == 0) {
            output.add("even");
        }
        else {
            output.add("odd");
        }
        return output;
    }
});

Select

This function lets you select a specific stream from the split stream.

// the SplitStream obtained from the split() call above
DataStream<Integer> even = split.select("even"); 
DataStream<Integer> odd = split.select("odd"); 
DataStream<Integer> all = split.select("even","odd");
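Since split/select were deprecated in later Flink versions, the recommended replacement is side outputs via a ProcessFunction. A sketch on the same inputStream (the tag name and variable names are illustrative):

// odd numbers go to a side output, even numbers to the main output
final OutputTag<Integer> oddTag = new OutputTag<Integer>("odd") {};

SingleOutputStreamOperator<Integer> even = inputStream
        .process(new ProcessFunction<Integer, Integer>() {
            @Override
            public void processElement(Integer value, Context ctx, Collector<Integer> out) throws Exception {
                if (value % 2 == 0) {
                    out.collect(value);        // main output: even
                } else {
                    ctx.output(oddTag, value); // side output: odd
                }
            }
        });
DataStream<Integer> odd = even.getSideOutput(oddTag);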

Project

The project function lets you select a subset of attributes from an event stream of tuples and sends only the selected fields downstream.

DataStream<Tuple4<Integer, Double, String, String>> in = // [...] 
DataStream<Tuple2<String, String>> out = in.project(3,2);

The above selects the attributes at (zero-based) positions 3 and 2, in that order, from each record. Sample input and output records:

(1,10.0,A,B)=> (B,A)
(2,20.0,C,D)=> (D,C)

Finally

This article mainly introduced Flink's common data transformation operators: map, flatMap, filter, keyBy, reduce, fold, aggregations, window, windowAll, union, window join, split, select, project, etc. It also used simple demos to show how they are used and how to transform a data stream into the format we want; in a real project, this has to be handled according to the actual situation.

Follow me

Please cite the original address when reprinting: http://www.54tianzhisheng.cn/2018/11/04/Flink-Data-transformation/

WeChat official account: zhisheng

In addition, I have compiled some Flink learning materials and put them all in the WeChat official account. You can add my WeChat: zhisheng_tian, then reply with the keyword: Flink, and you can get them unconditionally.


GitHub code repository

https://github.com/zhisheng17/flink-learning/

Going forward, all the code for this project will be put in this repository, including demos and blog posts for learning Flink.

Related articles

1. Learning Flink from 0 to 1 – Introduction to Apache Flink

2. Learning Flink from 0 to 1 – an introduction to building a Flink 1.6.0 environment and building and running simple programs on Mac

3. Learn Flink from 0 to 1 – detailed explanation of the Flink configuration file

4. Learning Flink from 0 to 1 – Introduction to data source

5. Learn Flink from 0 to 1 – how to customize the data source?

6. Learning Flink from 0 to 1 – Introduction to data sink

7. Learn Flink from 0 to 1 – how to customize data sink?

8. Learning Flink from 0 to 1 – Flink data transformation

9. Learning Flink from 0 to 1 – introducing stream windows in Flink

10. Learning Flink from 0 to 1 – detailed explanation of the several notions of time in Flink

11. Learning Flink from 0 to 1 – Flink writes data to Elasticsearch

12. Learning Flink from 0 to 1 – how does a Flink project run?

13. Learning Flink from 0 to 1 – Flink writes data to Kafka