PyFlink development environment tool: Zeppelin notebook

Time: 2022-05-11

As the Python-language entry point of Flink, PyFlink is indeed simple and easy to learn, but a PyFlink development environment is not easy to set up: one careless step can leave the environment in a mess that is hard to diagnose. Today I'd like to introduce a powerful tool that solves these problems for the PyFlink development environment: Zeppelin notebook. The main contents are as follows:

  1. Preparation
  2. Build the PyFlink environment
  3. Summary and future

You may have heard of Zeppelin for a long time, but previous articles focused on how to develop Flink SQL in Zeppelin. Today, let's look at how to efficiently develop PyFlink jobs in Zeppelin, and especially how to solve PyFlink's environment problems.

To summarize the theme of this article in one sentence: use conda to create a Python env in a Zeppelin notebook and automatically deploy it to the YARN cluster. You don't need to manually install any PyFlink packages on the cluster, and you can run multiple isolated versions of PyFlink in one YARN cluster at the same time. The end result looks like this:

1. You can use third-party Python libraries on the PyFlink client (JobManager side), such as Matplotlib;


2. You can use third-party Python libraries in PyFlink UDFs (running on the TaskManagers), such as pandas.


Next, let’s see how to implement it.

1. Preparation

Step 1.

Prepare the latest version of Zeppelin; how to build it is not covered here. If you have any questions, you can join the Flink on Zeppelin DingTalk group (34517043) for help. In addition, note that Zeppelin must be deployed on Linux: a conda environment built on a Mac cannot be used in the YARN cluster, because conda packages are not compatible across operating systems.

Step 2.

Download Flink 1.13. Note that the features described in this article only work with Flink 1.13 and above. Then:

  • Copy the flink-python-*.jar package into Flink's lib folder;
  • Copy the opt/python folder into Flink's lib folder.

Step 3.

Install the following software, which is used to create the conda env:

  • miniconda: https://docs.conda.io/en/latest/miniconda.html
  • mamba: https://github.com/mamba-org/mamba
  • conda-pack (used below to package the env)

2. Build the PyFlink environment

Next, you can build and use the PyFlink environment in Zeppelin.

Step 1. Create the PyFlink conda environment for the JobManager side

Because Zeppelin comes with shell support, you can use shell paragraphs to build the PyFlink environment in Zeppelin. Note that the third-party Python packages here are the ones needed by the PyFlink client (JobManager side), such as Matplotlib, and make sure that at least the following packages are installed:

  • A version of Python (3.7 is used here)
  • apache-flink (1.13.1 is used here)
  • jupyter, grpcio, protobuf (these three packages are needed by Zeppelin)

The remaining packages can be specified as needed:

%sh

# make sure you have conda and mamba installed.
# install miniconda: https://docs.conda.io/en/latest/miniconda.html
# install mamba: https://github.com/mamba-org/mamba

echo "name: pyflink_env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pip
  - pip:
    - apache-flink==1.13.1
  - jupyter
  - grpcio
  - protobuf
  - matplotlib
  - pandasql
  - pandas
  - scipy
  - seaborn
  - plotnine
 " > pyflink_env.yml
    
mamba env remove -n pyflink_env
mamba env create -f pyflink_env.yml

Run the following code to package the PyFlink conda environment and upload it to HDFS (note that the packaged file format is tar.gz):

%sh

rm -rf pyflink_env.tar.gz
conda pack --ignore-missing-files -n pyflink_env -o pyflink_env.tar.gz

hadoop fs -rmr /tmp/pyflink_env.tar.gz
hadoop fs -put pyflink_env.tar.gz /tmp
# The Python conda tar should be publicly accessible, so we need to change its permissions here.
hadoop fs -chmod 644 /tmp/pyflink_env.tar.gz

Step 2. Create the PyFlink conda environment for the TaskManager side

Run the following code to create the PyFlink conda environment for the TaskManagers. This environment must contain at least the following two packages:

  • A version of Python (3.7 is used here)
  • apache-flink (1.13.1 is used here)

The remaining packages are whatever your Python UDFs depend on; for example, pandas is specified here:

echo "name: pyflink_tm_env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pip
  - pip:
    - apache-flink==1.13.1
  - pandas
 " > pyflink_tm_env.yml
    
mamba env remove -n pyflink_tm_env
mamba env create -f pyflink_tm_env.yml

Run the following code to package this conda environment and upload it to HDFS (note that the zip format is used here):

%sh

rm -rf pyflink_tm_env.zip
conda pack --ignore-missing-files --zip-symlinks -n pyflink_tm_env -o pyflink_tm_env.zip

hadoop fs -rmr /tmp/pyflink_tm_env.zip
hadoop fs -put pyflink_tm_env.zip /tmp
# The Python conda zip should be publicly accessible, so we need to change its permissions here.
hadoop fs -chmod 644 /tmp/pyflink_tm_env.zip

Step 3. Use the conda environments in PyFlink

Next, you can use the conda environments created above in Zeppelin. First, you need to configure Flink in Zeppelin; the main configuration options are:

  • flink.execution.mode is yarn-application; the approach described in this article only works in yarn-application mode;
  • Specify yarn.ship-archives, zeppelin.pyflink.python and zeppelin.interpreter.conda.env.name to configure the PyFlink conda environment on the JobManager side;
  • Specify python.archives and python.executable to configure the PyFlink conda environment on the TaskManager side;
  • Specify other optional Flink configurations, such as flink.jm.memory and flink.tm.memory.
%flink.conf


flink.execution.mode yarn-application

yarn.ship-archives /mnt/disk1/jzhang/zeppelin/pyflink_env.tar.gz
zeppelin.pyflink.python pyflink_env.tar.gz/bin/python
zeppelin.interpreter.conda.env.name pyflink_env.tar.gz

python.archives hdfs:///tmp/pyflink_tm_env.zip
python.executable pyflink_tm_env.zip/bin/python3.7

flink.jm.memory 2048
flink.tm.memory 2048
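Before running any jobs, a quick sanity check (a sketch of mine, not from the original note) can confirm that the client interpreter really runs from the shipped conda env:

%flink.ipyflink

import sys

# If the archive was shipped correctly, the executable path points into
# the unpacked pyflink_env.tar.gz rather than the system Python.
print(sys.version)
print(sys.executable)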

Now you can use PyFlink with the specified conda environments in Zeppelin, as promised at the beginning. There are two scenarios:

  • On the PyFlink client (JobManager side), you can use the JobManager-side conda environment created above; for example, Matplotlib is used in the first sketch below.

  • In a PyFlink UDF, you can use libraries from the TaskManager-side conda environment created above; for example, pandas is used in the second sketch below.
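Since the original screenshots cannot be reproduced here, below is a minimal sketch of the first scenario. It assumes a note using Zeppelin's %flink.ipyflink interpreter, and the plotted data is made up for illustration; Matplotlib itself comes from pyflink_env.tar.gz rather than from anything installed on the cluster:

%flink.ipyflink

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Matplotlib (and its numpy dependency) come from the JobManager-side
# conda env shipped via yarn.ship-archives.
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.title("Matplotlib on the PyFlink client (JobManager side)")
plt.show()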
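And a minimal sketch of the second scenario, a pandas UDF; again, the data and the names (plus_one, column v) are made up for illustration, and bt_env is the batch TableEnvironment that Zeppelin's Flink interpreter predefines:

%flink.ipyflink

from pyflink.table import DataTypes
from pyflink.table.expressions import col
from pyflink.table.udf import udf

# func_type="pandas" makes PyFlink hand the column in as a pandas.Series;
# pandas itself comes from the TaskManager-side env (pyflink_tm_env.zip).
@udf(result_type=DataTypes.DOUBLE(), func_type="pandas")
def plus_one(series):
    return series + 1

t = bt_env.from_elements([(1.0,), (2.0,), (3.0,)], ['v'])
t.select(plus_one(col('v'))).execute().print()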

3. Summary and future

This article showed how to use conda to create a Python env in a Zeppelin notebook and automatically deploy it to the YARN cluster. There is no need to manually install any PyFlink packages on the cluster, and multiple isolated versions of PyFlink can be used in one YARN cluster at the same time.

Each PyFlink environment is isolated, and the conda environment can be customized and changed at any time. You can download the following note and import it into Zeppelin to reproduce today's content: http://23.254.161.240/#/noteb…

In addition, there are many areas for improvement:

  • At present we need to create two conda envs, because Zeppelin supports the tar.gz format while Flink only supports the zip format. Once the two sides are unified, a single conda env will be enough;
  • The apache-flink package currently bundles Flink's jar files, which makes the conda env very large and makes yarn container initialization take a long time. This requires the Flink community to provide a lightweight Python package (without the Flink jars), which would greatly reduce the size of the conda env.
