Author: Fu Dian
This article introduces the objectives and development history of the PyFlink project, as well as its current core features, including the Python Table API, Python UDFs, vectorized Python UDFs, Python UDF metrics, PyFlink dependency management, and Python UDF execution optimization. It also demonstrates these features with related demos. The article is divided into four parts:
- Introduction to PyFlink
- PyFlink features
- PyFlink feature demonstration
- PyFlink's next step
Introduction to PyFlink
PyFlink is a submodule of Flink and part of the overall Flink project. Its main purpose is to provide Python language support for Flink. In the fields of machine learning and data analysis, Python is a very important, arguably the most important, development language. To meet the needs of more users and broaden the Flink ecosystem, we launched the PyFlink project.
The PyFlink project has two main goals. The first is to make Flink's computing power available to Python users. In other words, we provide a series of Python APIs in Flink so that users familiar with Python can conveniently develop Flink jobs.
The second is to run the Python ecosystem on top of Flink. Although we provide a series of Python APIs in Flink, there is still a learning cost: users need to learn how to use Flink's Python API and understand the purpose of each call. We therefore hope that users can keep using the familiar APIs of their Python libraries at the API layer while the underlying computation runs on Flink, reducing their learning cost. This is future work and is currently in the start-up stage.
The figure below shows the development history of the PyFlink project. Three versions have been released so far, and each release has supported richer functionality.
PyFlink features
This section introduces the following PyFlink features: the Python Table API, Python UDFs, vectorized Python UDFs, Python UDF metrics, PyFlink dependency management, and Python UDF execution optimization.
Python Table API
The purpose of the Python Table API is to enable users to develop Flink jobs in Python. Flink has three types of API: the Process API, the Function API, and the Table API. The first two are lower-level APIs, and jobs based on them are executed strictly according to the user-defined behavior. The Table API is a higher-level API, and jobs based on it go through a series of optimizations before execution.
The Python Table API, as the name suggests, provides Python language support for the Table API.
The following is a Flink job developed with the Python Table API. The job reads a file, computes a word count, and writes the results to another file. The example is simple, but it covers all the basic steps of developing a Python Table API job.
First, we define the execution mode of the job: batch or streaming, its parallelism, and its configuration. Next, we define the source table and the sink table. The source table defines where the job's input data comes from and what format it is in; the sink table defines where the job's results are written and in what format. Finally, we define the execution logic of the job; in this example, we compute the word count.
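The three-step structure described above (source, logic, sink) can be sketched in plain Python without the Flink runtime. This is a minimal illustration of the job's shape only; the file paths, connectors, and Flink configuration are omitted, and the `source`/`sink` names are our own, not PyFlink API names.

```python
from collections import Counter

def source():                      # source table: where the input rows come from
    yield "flink flink pyflink"

def word_count(lines):             # execution logic: split lines and count words
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

sink = word_count(source())        # sink table: where the results are written
print(sink)                        # {'flink': 2, 'pyflink': 1}
```

In the real job, the same three pieces are expressed declaratively and the Flink engine drives the data flow between them.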
The following is a partial screenshot of the Python Table API. You can see that it provides a fairly complete set of operations.
Python UDF
The Python Table API is a relational API whose functionality is comparable to SQL, and user-defined functions are a very important SQL feature that greatly expands its range of use. The main purpose of Python UDFs is to let users develop user-defined functions in Python, expanding the usage scenarios of the Python Table API. Moreover, Python UDFs can be used not only in Python Table API jobs, but also in Java Table API jobs and SQL jobs.
PyFlink supports many ways to define a Python UDF. Users can define a Python class that inherits from ScalarFunction, or use an ordinary Python function or a lambda function to implement the logic of the user-defined function. In addition, we also support defining Python UDFs via callable objects and partial functions. Users can choose whichever way best suits their needs.
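The definition styles listed above are all just different kinds of Python callables. The sketch below illustrates them in plain Python; the `ScalarFunction` class here is a stand-in for PyFlink's real base class, not the actual import.

```python
from functools import partial

class ScalarFunction:              # stand-in for PyFlink's ScalarFunction base class
    pass

class Add(ScalarFunction):         # 1. class inheriting ScalarFunction
    def eval(self, i, j):
        return i + j

def add_fn(i, j):                  # 2. ordinary Python function
    return i + j

add_lambda = lambda i, j: i + j    # 3. lambda function

class AddCallable:                 # 4. callable object (defines __call__)
    def __call__(self, i, j):
        return i + j

add_partial = partial(lambda base, i, j: base + i + j, 0)  # 5. partial function

# All five styles implement the same UDF logic:
results = [Add().eval(1, 2), add_fn(1, 2), add_lambda(1, 2),
           AddCallable()(1, 2), add_partial(1, 2)]
print(results)                     # [3, 3, 3, 3, 3]
```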
PyFlink provides several ways to use Python UDFs, covering the Python Table API, the Java Table API, and SQL.
To use a Python UDF in the Python Table API, the user first needs to register it: calling the registration method on the table environment gives it a name, and the UDF can then be referenced by that name in the job.
In the Java Table API, usage is similar, but registration works differently: a Java Table API job registers the UDF through DDL statements.
In addition, users can use Python UDFs in SQL jobs. As in the previous two cases, the Python UDF must first be registered, either through DDL statements in the SQL script or in the environment configuration file of the SQL client.
Python UDF architecture
This section briefly introduces the architecture of Python UDF execution. Flink is written in Java and runs in the Java Virtual Machine, while Python UDFs run in a Python process, so the Java process and the Python process need to communicate data. In addition, they need to exchange state, logs, and metrics, so the transport protocol between them needs to support these four types.
Vectorized Python UDF
The main purpose of vectorized Python UDFs is to enable Python users to develop high-performance Python UDFs using libraries commonly used in data analysis, such as Pandas or NumPy.
Vectorized Python UDFs are defined in contrast to ordinary Python UDFs; the figure below shows the difference between the two.
The following figure shows the execution of a vectorized Python UDF. First, on the Java side, Java buffers multiple rows of data, converts them into Arrow format, and sends them to the Python process. After receiving the data, the Python process converts it into a Pandas data structure and then invokes the user-defined vectorized Python UDF. The execution result of the Python UDF is likewise converted into Arrow-format data and sent back to the Java process.
In terms of usage, vectorized Python UDFs are similar to ordinary Python UDFs, with the following slight differences. When declaring a vectorized Python UDF, you need to add a UDF type field marking it as vectorized, and the input and output of the UDF are pandas Series.
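The semantic difference is worth making concrete: an ordinary UDF is invoked once per row, while a vectorized UDF is invoked once per batch. The sketch below illustrates this in plain Python, with lists standing in for pandas Series so no extra libraries are needed.

```python
calls = {"scalar": 0, "vectorized": 0}

def scalar_add(i, j):               # ordinary UDF: scalars in, scalar out
    calls["scalar"] += 1
    return i + j

def vectorized_add(i_col, j_col):   # vectorized UDF: whole columns in and out
    calls["vectorized"] += 1
    return [i + j for i, j in zip(i_col, j_col)]

i_col, j_col = [1, 2, 3, 4], [10, 20, 30, 40]

# The engine drives the scalar UDF row by row...
scalar_out = [scalar_add(i, j) for i, j in zip(i_col, j_col)]
# ...but hands the vectorized UDF a whole batch at once.
vectorized_out = vectorized_add(i_col, j_col)

print(scalar_out, calls["scalar"])          # [11, 22, 33, 44] 4
print(vectorized_out, calls["vectorized"])  # [11, 22, 33, 44] 1
```

The single batched call is what lets pandas and NumPy apply their optimized columnar operations.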
Python UDF Metrics
As mentioned earlier, Python UDFs can be defined in many ways, but to use metrics in a Python UDF, the UDF must be defined as a class inheriting from ScalarFunction. The open method of the Python UDF receives a function context parameter, through which users can register metrics and then report values through the registered metric objects.
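The pattern, register in `open()`, report in `eval()`, can be sketched as follows. The `ScalarFunction`, `FunctionContext`, `MetricGroup`, and `Counter` classes below are simplified stand-ins written for this illustration, not PyFlink's real classes; only the shape of the pattern is the point.

```python
class Counter:                          # stand-in metric object
    def __init__(self):
        self.count = 0
    def inc(self, n=1):
        self.count += n

class MetricGroup:                      # stand-in metric group
    def __init__(self):
        self.counters = {}
    def counter(self, name):
        return self.counters.setdefault(name, Counter())

class FunctionContext:                  # stand-in function context
    def __init__(self):
        self._group = MetricGroup()
    def get_metric_group(self):
        return self._group

class ScalarFunction:                   # stand-in UDF base class
    pass

class CountedAdd(ScalarFunction):
    def open(self, function_context):   # register metrics once, at startup
        self.invocations = function_context.get_metric_group().counter("invocations")
    def eval(self, i, j):               # report through the registered metric
        self.invocations.inc()
        return i + j

udf = CountedAdd()
udf.open(FunctionContext())
results = [udf.eval(i, i) for i in range(3)]
print(results, udf.invocations.count)   # [0, 2, 4] 3
```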
PyFlink dependency management
PyFlink dependencies mainly fall into the following types: ordinary Python files, archive files, third-party libraries, the Python interpreter, and Java JAR packages. For each type of dependency, PyFlink provides two solutions: an API and a command-line option; you can use either.
Python UDF execution optimization
Python UDF execution optimization mainly covers two aspects: execution plan optimization and runtime optimization. This is very similar to SQL: a job containing Python UDFs first generates an optimal execution plan through predefined rules; then, once the plan is determined, other optimization techniques are applied during actual execution to achieve the highest possible efficiency.
Python UDF execution plan optimization
Execution plan optimization mainly follows three ideas. The first is splitting UDFs of different types: a node may contain multiple types of UDFs at the same time, and different types of UDFs cannot be executed together in one block. The second is filter push-down, whose main purpose is to reduce the amount of input data to nodes containing Python UDFs, improving the performance of the whole job. The third is Python UDF chaining: the communication cost between the Java process and the Python process, as well as the serialization and deserialization cost, is relatively high, and chaining minimizes the communication between the two processes.
Splitting UDFs of different types
Suppose a job contains two UDFs, where add is an ordinary Python UDF and subtract is a vectorized Python UDF. By default, the execution plan of this job has one Project node containing both UDFs. The problem with this plan is that an ordinary Python UDF processes one row at a time, while a vectorized Python UDF processes a batch of rows at a time, so the plan cannot be executed.
By splitting, however, we can split this Project node into two: the first contains only the ordinary Python UDF, and the second contains only the vectorized Python UDF. Once the different types of Python UDFs are split into different nodes, each node contains only one type of UDF, and each operator can choose the most appropriate execution mode for the UDF type it contains.
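The incompatibility, and the fix, can be sketched in plain Python. Driving both UDFs in a single per-row loop would hand the vectorized `subtract` a scalar instead of a batch, so the split plan runs them in two passes (the function names mirror the example above; the batch handling is a simplified stand-in for the real operator).

```python
def add(i, j):                  # ordinary UDF: scalars in, scalar out
    return i + j

def subtract(col_a, col_b):     # vectorized UDF: whole columns in and out
    return [a - b for a, b in zip(col_a, col_b)]

col_x, col_y = [5, 7, 9], [1, 2, 3]

# Pass 1 (first Project node): per-row execution of the ordinary UDF.
added = [add(x, y) for x, y in zip(col_x, col_y)]
# Pass 2 (second Project node): one batched call to the vectorized UDF.
result = subtract(added, col_y)
print(result)                   # [5, 7, 9]
```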
Filter push-down
The main purpose of filter push-down is to push the filter operator down in front of the Python UDF node, minimizing the amount of data entering the Python UDF node.
Suppose we have a job whose original execution plan includes a Project node containing add and subtract, followed by a Filter node. This plan can run, but it is not optimal: the Python UDFs are computed before the filter, because the Python node sits in front of the Filter node. If we push the filter down in front of the Python UDF node, we can greatly reduce the amount of input data to the Python UDF node.
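The effect is easy to quantify with a small sketch: filtering first cuts the number of UDF invocations without changing the result (the predicate and UDF here are illustrative, chosen so that the filter can be applied on either side of the UDF).

```python
calls = {"udf": 0}

def expensive_udf(x):          # stands in for a costly Python UDF
    calls["udf"] += 1
    return x * 2

rows = list(range(10))
keep = lambda x: x >= 8        # the filter predicate (on the UDF's input)

# Without push-down: the UDF runs on every row, the filter runs afterwards.
calls["udf"] = 0
out_naive = [y for y in (expensive_udf(x) for x in rows) if keep(y // 2)]
naive_calls = calls["udf"]     # 10 invocations

# With push-down: filter first, UDF only on the surviving rows.
calls["udf"] = 0
out_pushed = [expensive_udf(x) for x in rows if keep(x)]
pushed_calls = calls["udf"]    # 2 invocations

print(out_naive == out_pushed, naive_calls, pushed_calls)  # True 10 2
```

The optimizer can only do this when the filter does not depend on the UDF's output, which is exactly the condition for push-down.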
Python UDF Chaining
Suppose we have a job containing two UDFs, add and subtract, both ordinary Python UDFs. Its execution plan has two Project nodes: the first computes subtract, then passes the result to the second Project node, which computes add.
The main problem is that, since subtract and add sit on two different nodes, the result of subtract must be sent from the Python process back to the Java process, and then sent by the Java process to the Python process of the second node for execution. The data makes a round trip between the Java and Python processes, incurring unnecessary communication, serialization, and deserialization overhead. We can therefore optimize the execution plan as shown in the figure on the right: put add and subtract in one node, so that the result of subtract is passed directly to add.
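The saving can be counted in a small sketch. Here `cross_boundary` is a stand-in for one serialize/IPC/deserialize hop between the Java and Python processes; the unchained plan makes four hops per row, the chained plan two.

```python
crossings = {"n": 0}

def cross_boundary(value):     # stands in for serialize + IPC + deserialize
    crossings["n"] += 1
    return value

def subtract(i, j):
    return i - j

def add(i, j):
    return i + j

row = (10, 3)

# Unchained plan: Java -> subtract -> Java -> add -> Java (4 crossings).
crossings["n"] = 0
s = cross_boundary(subtract(*cross_boundary(row)))
unchained = cross_boundary(add(*cross_boundary((s, row[1]))))
unchained_crossings = crossings["n"]

# Chained plan: both UDFs run inside one Python node (2 crossings).
crossings["n"] = 0
i, j = cross_boundary(row)
chained = cross_boundary(add(subtract(i, j), j))
chained_crossings = crossings["n"]

print(unchained, chained, unchained_crossings, chained_crossings)  # 10 10 4 2
```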
Python UDF runtime optimization
At present, there are three main ways to improve the runtime efficiency of Python UDF execution: the first is optimizing the Python code itself to improve its execution efficiency; the second is custom serializers and deserializers between the Java process and the Python process, to improve serialization and deserialization efficiency; the third is providing vectorized Python UDFs.
PyFlink feature demonstration
First, open this page, which provides some PyFlink demos. These demos run in Docker, so to run them you need Docker installed on your machine.
Then we can run the command, which starts a PyFlink cluster; the PyFlink examples we run later will be submitted to this cluster for execution.
The first example is word count. We first define the environment, source, sink, and so on, and then run the job.
This is the result of the run. You can see that flink appears twice and pyflink appears once.
Next, run a Python UDF example. This example is similar to the previous one: first we define that it uses PyFlink, runs in batch mode, and has a parallelism of 1. The difference is that this job defines a UDF whose input is two columns, both of type BIGINT, and whose output type is BIGINT as well. The logic of this UDF is to add the two columns.
We run the job, and the result is 3.
Next, we run a Python UDF with dependencies. The UDF in the previous job did not contain any dependencies; it simply added its two input columns. In this example, the UDF references a third-party dependency, which can be specified through the set-Python-requirements API.
Next, we run the job; its result is the same as before, because the logic of the two jobs is similar.
Next, let's look at an example of a vectorized Python UDF. When defining the UDF, we add a UDF type field declaring it as a vectorized Python UDF; the rest of the logic is similar to an ordinary Python UDF. Its execution result is also 3, because its logic is the same as before: it computes the sum of the two columns.
Let's take another example: using a Python UDF in a Java Table API job. In this job, we use a Python UDF that is registered through a DDL statement and then used in an execute-SQL statement.
Next, let's look at an example of using a Python UDF in a pure SQL job. In the resource file, we declare a UDF named ADD1; its type is Python, and we can also see the location of the UDF definition.
Next we run it and the result is 234.
PyFlink's next step
At present, PyFlink only supports the Python Table API. We plan to support the DataStream API, Python UDAFs, and Pandas UDAFs in the next version. In addition, we will continue to optimize PyFlink's execution efficiency in the execution layer.
Here are links to some resources, including the PyFlink documentation.
- Python Table API documentation
- PyFlink documentation
- PyFlink playground
OK, that's all for today's sharing. Welcome to continue following our courses.
Copyright notice: The content of this article is spontaneously contributed by Alibaba Cloud real-name registered users, and the copyright belongs to the original author. The Alibaba Cloud developer community does not own its copyright, nor does it bear the corresponding legal responsibility. For specific rules, please refer to the user service agreement of the Alibaba Cloud developer community and the guidelines for intellectual property protection of the Alibaba Cloud developer community. If you find any suspected plagiarism in the community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.