How to use Python UDF in Apache Flink 1.10?

Time: 2020-10-17

Author: Sun Jincheng (Jinzhu)

In Apache Flink 1.9, we introduced the PyFlink module to support the Python Table API, allowing Python users to perform data transformation and data analysis. However, PyFlink 1.9 does not support the definition of Python UDFs, which can be inconvenient for Python users who want to extend the system's built-in functionality.

In the newly released Apache Flink 1.10, PyFlink adds support for Python UDFs. This means that you can now write UDFs in Python and extend the functionality of the system. In addition, this release also supports Python UDF environment and dependency management, so you can use third-party libraries in your UDFs and take advantage of Python's rich ecosystem of third-party libraries.

The architecture of Python UDF support in PyFlink

Before diving into how to define and use Python UDFs, we explain the architecture and background of how UDFs work in PyFlink and provide some details about the underlying implementation.

Beam on Flink

Apache Beam is a unified programming model framework with which batch and stream processing jobs can be developed in any language and run on any execution engine. This is made possible by Beam's portability framework, as shown in the following figure:

[Figure: Portability Framework]

The figure above shows the architecture of Beam's portability framework and describes how Beam supports multiple languages and multiple engines. With the Flink runner, we get "Beam on Flink". So what does this have to do with PyFlink's support for Python UDFs? That is described next, in "Flink on Beam".

Flink on Beam

Apache Flink is an open-source project, and its community also makes good use of other excellent open-source projects. For example, Python UDF support in PyFlink is built on top of the "luxury sports car" Apache Beam.

[Figure: Flink on Beam]

For PyFlink to support Python UDFs, the management of the Python execution environment and the communication between the Python VM and the JVM are essential. Fortunately, Apache Beam's portability framework solves this problem perfectly. The resulting architecture of PyFlink on the Beam portability framework is as follows:

[Figure: PyFlink on Beam Portability Framework]

The Beam portability framework is a mature multi-language support framework. It abstracts the communication protocol between languages (gRPC), defines the data transmission format (Protobuf), and abstracts the various services required by a general-purpose stream computing framework, such as the data service, state service, and metrics service.

On top of such a mature framework, PyFlink can quickly build its own Python operators and reuse the existing SDK harness components of the Apache Beam portability framework, which supports a variety of Python execution modes, such as process and Docker. This makes PyFlink's support for Python UDFs very easy to build, and the functionality in Apache Flink 1.10 is also very stable and complete. So why do Apache Flink and Apache Beam build this jointly? Because I found that there is still a lot of room for optimization in the current Apache Beam portability framework, so I raised these optimizations for discussion in the Beam community and contributed 30+ optimization patches there.

Communication between the JVM and the Python VM

Since Python UDFs cannot run directly in the JVM, a Python process, started by the Apache Flink operator at initialization time, is required to prepare the Python execution environment. The Python environment service is responsible for starting, managing, and terminating Python processes. As shown in the figure below, the communication between the Apache Flink operator and the Python execution environment involves multiple components:

[Figure: Communication between JVM and Python VM]

  • Environment management service: responsible for starting and terminating the Python execution environment.
  • Data service: responsible for transmitting input data from the Apache Flink operator to the Python execution environment and for receiving the results of the user's UDF execution.
  • Log service: a mechanism for supporting log output from user UDFs. It forwards the logs produced by user UDFs to the Apache Flink operator and integrates them with the Apache Flink logging system.

Note: The metrics service is planned to be supported in Apache Flink 1.11.

The following figure shows the high-level flow of initializing and executing UDFs, from the Java operator to the Python process.

[Figure: High-level flow between Python VM and JVM]

The overall process can be summarized as follows:

  • Initialize the Python execution environment.

    • The Python UDF runner starts the required gRPC services, such as the data service and the log service.
    • The Python UDF runner launches another process, which starts the Python execution environment.
    • The Python worker registers with the Python UDF runner.
    • The Python UDF runner sends the user-defined functions that need to be executed in the Python process to the Python worker.
    • The Python worker converts the user-defined functions into Beam executors (note: PyFlink currently uses Beam's portability framework [1] to execute Python UDFs).
    • gRPC connections, such as the data connection and the log connection, are established between the Python worker and the Flink operator.
  • Process input elements.

    • The Python UDF runner sends input elements to the Python worker for execution via the gRPC data service.
    • During execution, the Python user-defined functions can also send logs and metrics to the Python UDF runner via the gRPC log service and metrics service.
    • The execution results are sent back to the Python UDF runner via the gRPC data service.

How to use Python UDFs in PyFlink in Apache Flink 1.10


This section shows how to install PyFlink, how to define, register, and invoke UDFs in PyFlink, and how to execute jobs.

Install PyFlink

We first need to install PyFlink, which is available on PyPI and can be installed conveniently with pip.

Note: Python 3.5 or later is required to install and run pyflink.

$ python -m pip install apache-flink
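
To confirm that the installation succeeded, you can (optionally; this extra check is not part of the original steps) ask pip for the installed package's metadata:

$ python -m pip show apache-flink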

Define a UDF

There are many ways to define a Python UDF besides extending the base class ScalarFunction. The following examples show the different ways to define a Python UDF that takes two columns of BIGINT type as input parameters and returns their sum as the result.

  • Option 1: extending the base class ScalarFunction
# common imports for the UDF examples in this section (PyFlink 1.10)
from pyflink.table import DataTypes
from pyflink.table.udf import udf, ScalarFunction

class Add(ScalarFunction):
  def eval(self, i, j):
    return i + j

add = udf(Add(), [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT())
  • Option 2: Python function
@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
def add(i, j):
  return i + j
  • Option 3: lambda function
add = udf(lambda i, j: i + j, [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT())
  • Option 4: callable function
class CallableAdd(object):
  def __call__(self, i, j):
    return i + j

add = udf(CallableAdd(), [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT())
  • Option 5: partial function
import functools

def partial_add(i, j, k):
  return i + j + k

add = udf(functools.partial(partial_add, k=1), [DataTypes.BIGINT(), DataTypes.BIGINT()],
          DataTypes.BIGINT())

Register a UDF

  • Register the Python UDF
table_env.register_function("add", add)
  • Invoke the Python UDF
my_table.select("add(a, b)")
  • Example Code

Here is a complete example of using Python UDF.

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, DataTypes
from pyflink.table.descriptors import Schema, OldCsv, FileSystem
from pyflink.table.udf import udf

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
t_env = StreamTableEnvironment.create(env)

t_env.register_function("add", udf(lambda i, j: i + j, [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT()))

t_env.connect(FileSystem().path('/tmp/input')) \
    .with_format(OldCsv()
                 .field('a', DataTypes.BIGINT())
                 .field('b', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('a', DataTypes.BIGINT())
                 .field('b', DataTypes.BIGINT())) \
    .create_temporary_table('mySource')

t_env.connect(FileSystem().path('/tmp/output')) \
    .with_format(OldCsv()
                 .field('sum', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('sum', DataTypes.BIGINT())) \
    .create_temporary_table('mySink')

t_env.from_path('mySource')\
    .select("add(a, b)") \
    .insert_into('mySink')

t_env.execute("tutorial_job")
  • Submit the job

First, you need to prepare the input data in the /tmp/input file. For example:

$ echo "1,2" > /tmp/input

Next, you can run this example on the command line:

$ python python_udf_sum.py

With this command, you can build and run the Python Table API program in a local mini cluster. You can also submit the Python Table API program to a remote cluster using different command lines.
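
For example, a minimal sketch of submitting the same script with the Flink CLI (assuming a Flink 1.10 distribution whose bin directory provides the flink command with the -py option; any cluster-specific options are omitted) could be:

$ ./bin/flink run -py python_udf_sum.py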

Finally, you can view the execution results on the command line:

$ cat /tmp/output
3

Dependency management for Python UDFs

In many cases, you may want to use third-party dependencies in your Python UDFs. The following example shows how to manage such dependencies.

Suppose you want to use mpmath to compute the sum of the two numbers in the example above. The Python UDF logic could look like this:

@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
def add(i, j):
    from mpmath import fadd # add third-party dependency
    return int(fadd(i, j))

To run it on a worker node where this dependency is not installed, you can specify the dependencies with the following API:

# echo mpmath==1.1.0 > requirements.txt
# pip download -d cached_dir -r requirements.txt --no-binary :all:
t_env.set_python_requirements("/path/of/requirements.txt", "/path/of/cached_dir")

You need to provide a requirements.txt file that declares the third-party dependencies used. If the dependencies cannot be installed in the cluster (for example, because of network problems), the parameter requirements_cached_dir can be used to specify a directory containing the installation packages of these dependencies, as shown in the example above. The dependencies will then be uploaded to the cluster and installed offline.

Here is a complete example of using dependency management:

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, DataTypes
from pyflink.table.descriptors import Schema, OldCsv, FileSystem
from pyflink.table.udf import udf

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
t_env = StreamTableEnvironment.create(env)

@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
def add(i, j):
    from mpmath import fadd
    return int(fadd(i, j))

t_env.set_python_requirements("/tmp/requirements.txt", "/tmp/cached_dir")
t_env.register_function("add", add)

t_env.connect(FileSystem().path('/tmp/input')) \
    .with_format(OldCsv()
                 .field('a', DataTypes.BIGINT())
                 .field('b', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('a', DataTypes.BIGINT())
                 .field('b', DataTypes.BIGINT())) \
    .create_temporary_table('mySource')

t_env.connect(FileSystem().path('/tmp/output')) \
    .with_format(OldCsv()
                 .field('sum', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('sum', DataTypes.BIGINT())) \
    .create_temporary_table('mySink')

t_env.from_path('mySource')\
    .select("add(a, b)") \
    .insert_into('mySink')

t_env.execute("tutorial_job")
  • Submit the job

First, you need to prepare the input data in the /tmp/input file. For example:

echo "1,2" > /tmp/input
1
2

Second, you can prepare the dependency requirements file and cache directory:

$ echo "mpmath==1.1.0" > /tmp/requirements.txt
$ pip download -d /tmp/cached_dir -r /tmp/requirements.txt --no-binary :all:

Next, you can run this example on the command line:

$ python python_udf_sum.py

Finally, you can view the execution results on the command line:

$ cat /tmp/output
3

Get started quickly

PyFlink also provides a very convenient development experience: the PyFlink shell. After successfully running python -m pip install apache-flink, you can execute pyflink-shell.sh local to launch a PyFlink shell for interactive development, as shown below:
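
As an illustration only, here is a minimal sketch of such a session; the pre-defined streaming table environment variable st_env is an assumption based on the PyFlink shell documentation rather than output captured from the original article:

$ pyflink-shell.sh local
>>> from pyflink.table import DataTypes
>>> from pyflink.table.udf import udf
>>> # define a simple UDF and register it against the pre-defined streaming table environment
>>> add = udf(lambda i, j: i + j, [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT())
>>> st_env.register_function("add", add)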

More scenarios

PyFlink supports not only simple ETL scenarios but also the business requirements of many more complex use cases, such as the Double 11 real-time big screen that we are all familiar with, shown below:

[Figure: Double 11 real-time big screen]

For more details on the above example, see:

https://enjoyment.cool/2019/1…

Summary and future planning

In this post, we introduced the Python UDF architecture in PyFlink and gave examples of how to define, register, invoke, and run UDFs. The 1.10 release gives Python users many more possibilities for writing Python job logic. At the same time, we have been actively working with the community to continuously improve the functionality and performance of PyFlink. In the future, we plan to introduce support for Pandas in scalar and aggregate functions, add support for Python UDFs in the SQL Client to broaden where Python UDFs can be used, and make further performance improvements. There was recently a discussion about new feature support on the mailing list, where you can find more details.

With the continuous efforts of community contributors, PyFlink can quickly grow from a seedling into a big tree, as shown in the figure below:

[Figure: PyFlink growing from a seedling into a big tree]

PyFlink needs you

PyFlink is a new component, and there is still a lot of work to be done. Everyone is welcome to contribute to PyFlink, including asking questions, submitting bug reports, proposing new features, joining discussions, and contributing code or documentation. We look forward to seeing you in PyFlink!