Quickly Setting Up a Java Development Environment for the New Version of Flink

Date: 2020-09-29

Flink is a computing framework well suited to unified stream and batch processing. Flink 1.10.0 completes the integration of Alibaba's Blink, supports cross-task resource sharing in YARN mode, and strengthens Hive support. Let's use a simple example to walk through setting up a Flink development environment.

Creating a project with Maven

Flink provides a Maven archetype for generating a template project directly. Run this command in a terminal:

mvn archetype:generate                               \
      -DarchetypeGroupId=org.apache.flink              \
      -DarchetypeArtifactId=flink-quickstart-java      \
      -DarchetypeVersion=1.10.0

During execution it will prompt you for a groupId, an artifactId, and a package name. Enter them as prompted, and the project will be created.
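Alternatively, you can skip the prompts by passing the values on the command line in Maven's batch mode (the groupId, artifactId, and package below are illustrative, chosen to match this article's example):

mvn archetype:generate                               \
      -DarchetypeGroupId=org.apache.flink              \
      -DarchetypeArtifactId=flink-quickstart-java      \
      -DarchetypeVersion=1.10.0                        \
      -DgroupId=com.eqxiu                              \
      -DartifactId=eqxiu-flink                         \
      -Dversion=1.0-SNAPSHOT                           \
      -Dpackage=com.eqxiu                              \
      -DinteractiveMode=false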

Enter the directory and you can see that the project has been created. The structure is as follows:

# tree
.
├── pom.xml
└── src
    └── main
        ├── java
        │   └── com
        │       └── eqxiu
        │           ├── BatchJob.java
        │           └── StreamingJob.java
        └── resources
            └── log4j.properties

6 directories, 4 files

The project contains two classes, BatchJob and StreamingJob, plus a log4j.properties configuration file. You can then import the project into IDEA.

You can compile the project by running mvn clean package in this directory. After a successful build, a job jar is generated under the target directory. This job cannot actually run, though: the main method of the StreamingJob class only creates a StreamExecutionEnvironment and then calls the execute method, which is not an executable Flink job. Submitting it through the Flink UI therefore also reports an error.

Upload the jar:

Running it reports the following error:

Server Response Message:
Internal server error.

We can see the reason in the Flink JobManager log:

2020-03-27 14:36:30,150 ERROR org.apache.flink.runtime.webmonitor.handlers.JarRunHandler    - Unhandled exception.
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: No operators defined in streaming topology. Cannot execute.

The error is expected: we still need to add the job's operators before calling the execute method. The complete code is given below.
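For reference, a minimal sketch of what makes a job executable: any operator chain between the environment and execute() is enough (the class name MinimalJob is just for illustration).

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MinimalJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // One source plus one sink defines a non-empty topology,
        // so execute() no longer fails with "No operators defined".
        env.fromElements("hello", "flink").print();
        env.execute("minimal runnable job");
    }
}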

Creating a project with IDEA

In practice a project usually consists of multiple jobs, with all the code managed in one place. The Maven archetype above suits a single, standalone job; when multiple people collaborate, they still create their jobs under the same parent project, with one module per Flink job. Next, let's see how to create such a Flink project with IDEA.

First, add the following properties to the parent project's pom.xml (encoding, Flink version, JDK version, Scala version, and Maven compiler versions):

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <!-- Flink version -->
    <flink.version>1.10.0</flink.version>
    <!-- JDK version -->
    <java.version>1.8</java.version>
    <!-- Scala version 2.11 -->
    <scala.binary.version>2.11</scala.binary.version>
    <maven.compiler.source>${java.version}</maven.compiler.source>
    <maven.compiler.target>${java.version}</maven.compiler.target>
</properties>

Then add dependencies:

<dependencies>
    <!-- Apache Flink dependencies -->
    <!-- These dependencies are provided, because they should not be packaged into the JAR file. -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>

    <!-- Add logging framework, to produce console output when running in the IDE. -->
    <!-- These dependencies are excluded from the application JAR by default. -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.7</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
        <scope>runtime</scope>
    </dependency>
</dependencies>

Among the dependencies above, flink-java and flink-streaming-java are Flink's core dependencies. Why set their scope to provided (the default is compile)?

Because the Flink installation already ships these jars in the lib folder of the installation directory (lib/flink-dist_2.11-1.10.0.jar). So when we add dependencies to our Flink job, we do not want these duplicated classes packaged into the final jar. Excluding them has two benefits:

  • It reduces the size of the Flink job jar we build
  • It avoids class-loading conflicts caused by mismatched versions of Flink's core dependencies

But this raises a new problem: we also need to debug the job in IDEA, and with the scope set to provided, running it locally fails:

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/flink/api/common/ExecutionConfig$GlobalJobParameters
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.flink.api.common.ExecutionConfig$GlobalJobParameters
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more

With the default compile scope, there is no error when debugging locally.

Testing shows that with provided the jar is only 7.5 KB, while with compile it is 45 MB. Remember this is just a simple WordCount program, yet the difference is already that large. When a Flink job is packaged as a fat jar, the upload time to the UI differs noticeably (the smaller the jar, the faster the upload). So it is worth setting the scope to provided.
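You can verify this yourself by listing the contents of the packaged jar (the jar name follows this project's artifactId):

jar tf target/eqxiu-flink-1.0-SNAPSHOT.jar | head

With provided scope the listing contains essentially only your own classes; with compile scope it is dominated by org/apache/flink/... entries.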

Some readers will object: doesn't this conflict with the point above? What if I want both a small jar and the ability to run and debug jobs locally in IDEA? One solution is to add the following profile to the parent project's pom.xml:

<profiles>
    <profile>
        <id>add-dependencies-for-IDEA</id>

        <activation>
            <property>
                <name>idea.version</name>
            </property>
        </activation>

        <dependencies>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-java</artifactId>
                <version>${flink.version}</version>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
                <version>${flink.version}</version>
                <scope>compile</scope>
            </dependency>
        </dependencies>
    </profile>
</profiles>

When you run a job inside IDEA, this profile adds flink-java and flink-streaming-java with compile scope; when you package the jar, the profile does not apply. If you still get the error after adding the profile, IDEA may not have picked it up; verify the following two settings in IDEA (either of the two is enough):

1. Check whether the profile is enabled (checked) in the Maven tool window; if not, enable it manually.

2. Check whether "Include dependencies with 'provided' scope" is enabled in the run configuration; if not, enable it manually.

WordCount streaming application code

Back to the point: after creating the WordCount application in IDEA, we can start writing the code.

Main class

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class Main {
    public static void main(String[] args) throws Exception {
        //Create the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().setGlobalJobParameters(ParameterTool.fromArgs(args));
        env.fromElements(WORDS)
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                        //Split on runs of non-word characters; note the escaped backslash
                        String[] splits = value.toLowerCase().split("\\W+");

                        for (String split : splits) {
                            if (split.length() > 0) {
                                out.collect(new Tuple2<>(split, 1));
                            }
                        }
                    }
                })
                .keyBy(0)
                .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
                        //Add the counts of the two records for the same word
                        return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
                    }
                })
                .print();
        //A streaming program must call execute(), otherwise nothing runs and there is no output
        env.execute("## word count streaming demo");
    }

    private static final String[] WORDS = new String[]{
            "To be, or not to be,--that is the question:--",
            "Whether 'tis nobler in the mind to suffer"
    };
}
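When the job runs, print() prefixes every record with the index of the subtask that produced it, so the output looks roughly like this (exact prefixes and ordering depend on parallelism; records for the same word always come from the same subtask):

1> (to,1)
2> (be,1)
3> (or,1)
4> (not,1)
1> (to,2)
2> (be,2)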

Add the build plugins below to the pom.xml file, and replace mainClass with the entry class of your own job:

<build>
    <plugins>
        <!-- Java Compiler -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>${java.version}</source>
                <target>${java.version}</target>
            </configuration>
        </plugin>

        <!-- Create a fat jar with all the necessary dependencies using the Maven Shade plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.0.0</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                            <excludes>
                                <exclude>org.apache.flink:force-shading</exclude>
                                <exclude>com.google.code.findbugs:jsr305</exclude>
                                <exclude>org.slf4j:*</exclude>
                                <exclude>log4j:*</exclude>
                            </excludes>
                        </artifactSet>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <!-- Note: change this to the class containing your own job's main method -->
                                <mainClass>com.eqxiu.StreamingJob</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Note: remember to add the build plugins above, otherwise the jar will be incomplete and submitting it will fail with a ClassNotFoundException. Beginners run into this very often; many readers have asked the author about it.

Running the WordCount application

Running in the local IDE

After the WordCount program compiles, run the job by right-clicking the main method in IDEA and choosing Run. The results are as follows:

In the output, each word and its current count are printed line by line. Running in the local IDEA works fine. Next, run mvn clean package to build the jar (eqxiu-flink-1.0-SNAPSHOT.jar), then upload it to the Flink UI and run it to see the effect.

Running the job from the UI

On the http://localhost:8081/#/submit page, upload eqxiu-flink-1.0-SNAPSHOT.jar, then click Submit to run it.

The UI to run the job is as follows:

The job's results appear in the TaskManager's stdout.
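Besides the UI, you can also submit the jar with Flink's command-line client; a sketch, assuming the WordCount Main class lives in the com.eqxiu package:

./bin/flink run -c com.eqxiu.Main eqxiu-flink-1.0-SNAPSHOT.jar

The -c flag selects the entry class, overriding the mainClass recorded in the jar's manifest by the shade plugin.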

WordCount application code analysis

We have now written the WordCount program and run the job both in IDEA and on the Flink UI, with correct results in each case.

Let's walk through the WordCount code step by step:

1. Create the stream execution environment

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
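getExecutionEnvironment() returns an environment appropriate to the context: a local one when run from the IDE, the cluster environment when the job is submitted to a cluster. If you want a local environment explicitly, for example with a fixed parallelism, a sketch:

// Explicitly create a local environment with parallelism 2 (value is illustrative)
StreamExecutionEnvironment localEnv = StreamExecutionEnvironment.createLocalEnvironment(2);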

2. Set the global job parameters for the streaming program (parsed from the args array)

env.getConfig().setGlobalJobParameters(ParameterTool.fromArgs(args));
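Parameters registered this way can be read back inside any rich function. A sketch of the usual pattern (the parameter name "name" is illustrative):

// Inside a rich function such as a RichMapFunction:
ParameterTool parameters = (ParameterTool)
        getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
String name = parameters.get("name", "default");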

3. Build a data source; WORDS is an array of strings

env.fromElements(WORDS)
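fromElements is convenient for demos; in a real job the source is usually external. Flink ships a simple socket source, for example (assuming something is listening on localhost:9999):

// Read a stream of lines from a socket instead of a fixed array (host/port illustrative)
DataStream<String> lines = env.socketTextStream("localhost", 9999);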

4. Split each string and collect the pieces. Each emitted record has the form (word, 1), where 1 marks one occurrence of the word

flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
        String[] splits = value.toLowerCase().split("\\W+");

        for (String split : splits) {
            if (split.length() > 0) {
                out.collect(new Tuple2<>(split, 1));
            }
        }
    }
})
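The pattern \\W+ matches one or more non-word characters, so spaces and punctuation both act as separators. A quick illustration:

// Yields ["to", "be", "or", "not", "to", "be"]: punctuation disappears in the split
String[] splits = "To be, or not to be,--".toLowerCase().split("\\W+");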

5. Group by the word key (0 means group by the first field of the tuple, i.e., the word)

keyBy(0)
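keyBy(0) keys by tuple position. An equivalent, more readable form uses a key selector, which the same API also accepts:

// Key by the word field via a key selector instead of a positional index
.keyBy(value -> value.f0)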

6. Aggregate the count for each word

reduce(new ReduceFunction<Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
        return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
    }
})
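For a plain count, the reduce can also be replaced by the built-in field aggregation on the keyed stream:

// Sum the tuple field at position 1 per key — same result as the reduce above
.sum(1)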

7. Print the data stream in the format (word, count), where count is the number of times the word has appeared so far

print()

8. Start job execution

env.execute("## word count streaming demo");

Author: Yarn, engineer at Yiqixiu