Review of new features of Flink 1.12 resource management

Time: 2021-09-10

Introduction: This article introduces some of the resource management features of Flink 1.12, including memory management, resource scheduling, and the extended resource framework.
This article was compiled by community volunteer Chen Zhengyu from a talk shared by Song Xintong, Apache Flink Committer and technical expert at Alibaba, and Guo Minze, Apache Flink contributor and senior development engineer at Alibaba. It mainly introduces the resource management features of Flink 1.12, in four parts:

Memory management
Resource scheduling
Extended resource framework
Future planning

1、 Memory management

First, let us review the evolution of Flink’s memory model. The figure below shows the memory models introduced in Flink 1.10 and Flink 1.11. Although many modules are involved, 80% – 90% of users only need to focus on four of them: task heap memory, task off-heap memory, network memory, and managed memory.

Most of the other modules are Flink’s framework memory, which normally does not need adjustment; even when problems occur, they can be solved by following the community documentation. Besides, “how much memory does a job actually need in production” is also a question we have to face: for example, whether metrics indicate that the job’s performance suffers from insufficient memory, or whether resources are being wasted.

[Figure: Flink’s memory model in Flink 1.10 / 1.11]

In response to the above, the community provided a brand-new Web UI for the TaskManager and the JobManager in Flink 1.12.

[Figure: the new TaskManager / JobManager memory Web UI]

In the new Web UI, the configured value and actual usage of each memory metric are mapped directly onto the memory model and displayed intuitively. On this basis, you can understand more clearly how the job is running, how to tune it, and which configuration parameters to adjust (the community also provides documentation to support this). With the new Web UI, memory management becomes much more convenient.

1. Managed memory

Flink’s managed memory is a kind of native memory unique to Flink: it is not managed by the JVM and its garbage collector, but by Flink itself.

The characteristics of managed memory are mainly reflected in two aspects:

  • On the one hand, slot-level budget planning ensures that operators or tasks will not fail during job execution because of insufficient memory, and that resources are not wasted by reserving memory that is never used. At the same time, Flink guarantees that memory is released exactly when a task finishes, so that there is enough memory available when the TaskManager executes new tasks.
  • On the other hand, resource adaptability is another important characteristic of managed memory: the memory demand of an operator is dynamically adjustable. With adaptability, an operator will neither waste resources because it was given too much memory, nor fail to run the whole job because it was given too little, keeping memory usage within a reasonable range.

Of course, when the allocated memory is small, jobs may be constrained; for example, keeping the job running may require frequent spilling to disk, which can affect performance.

Currently, Flink uses managed memory in the following scenarios:

  • RocksDB state backend: in stream processing scenarios, all stateful operators in a slot share the same underlying RocksDB cache;
  • Flink built-in operators: including the operators of batch processing, Table/SQL, the DataSet API, and so on; each operator has an independent resource budget that is not shared with the others;
  • Python processes: when users define UDFs in Python with PyFlink, a Python worker process needs to be started.

2. Managed memory planning

Flink’s management of managed memory is mainly divided into two phases.

2.1 Job graph compilation phase

At this stage, three questions need attention:

The first question: which operators or tasks in a slot will execute at the same time? This determines how memory should be planned within the slot, and whether other tasks that need managed memory require corresponding reservations. In streaming jobs this question is relatively simple, because all operators must execute simultaneously to ensure that upstream output can be consumed downstream in time and data can flow through the whole job graph. In batch scenarios, however, there are two data shuffle modes:

One is the pipelined mode. It works the same way as streaming, i.e., the bounded stream processing mentioned earlier: upstream and downstream operators run at the same time, with the upstream producing and the downstream consuming continuously.
[Figure: pipelined vs. blocking shuffle in the job topology]

The other is the so-called blocking mode, which requires the upstream to produce and write out all of its data before the downstream can start reading it.

These two modes determine which tasks can execute simultaneously. Currently, based on the type of each edge in the job topology (as shown in the figure), Flink defines a concept called the pipelined region: a subgraph connected by pipelined edges. By identifying these subgraphs, we can judge which tasks will execute at the same time.

The second question: which managed memory usage scenarios appear in the slot? We just introduced the three usage scenarios of managed memory. At this stage, Python UDFs and stateful operators may appear in streaming jobs. Note that we cannot be certain that a stateful operator will use managed memory, because that depends on its state backend type.

If it uses the RocksDB state backend, it needs managed memory; if it uses the heap state backend, it does not. However, at job graph compilation time, the job does not yet know the state backend type; this is something to pay attention to.

The third question: for batch jobs, besides the usage scenarios, we also need to remember that, as mentioned earlier, batch operators use managed memory exclusively rather than sharing it within the slot, so we must know how much memory each operator should be allocated. Currently, Flink sets this automatically when compiling the job graph.

2.2 Execution phase

[Figure: deciding managed memory shares for streaming jobs]

The first step is to determine, according to the state backend type, whether RocksDB is involved. As shown in the figure above, suppose a slot contains three operators A, B, and C: Python is used in both B and C, and a stateful operator is used in C. In the heap case we take the upper branch, and the whole slot then has only one managed memory consumer, namely Python. The other branch is taken with the RocksDB state backend. After this first judgment, the second step decides, according to the user’s configuration, how the slot’s managed memory is shared among the different consumers:

In this streaming example, the weight of Python is 30% and the weight of the state backend is 70%. In the heap state backend branch of streaming, Python is the only consumer, so it naturally gets 100% of the managed memory.

In the RocksDB state backend branch of streaming, operators B and C share 30% of the memory for their Python UDFs, while C gets the other 70% for the RocksDB state backend. Finally, Flink determines the actual amount of memory available to each operator according to the TaskManager’s resource configuration and the amount of managed memory per slot.
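To make this concrete with illustrative numbers: if a slot ends up with 1000 MB of managed memory under these weights, then in the RocksDB branch the Python processes of B and C together get 300 MB while C’s RocksDB state backend gets 700 MB; in the heap branch, Python gets the full 1000 MB.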
[Figure: deciding managed memory shares for batch jobs]

Batch processing differs from stream processing in two ways. First, there is no need to check the state backend type, which is a simplification. Second, as mentioned above, each batch operator has its own exclusive resource budget: we first calculate, from the configured weights, how much memory each usage scenario should share, and then further subdivide each share among the operators within that scenario.

3. Parameter configuration

[Figure: managed memory configuration options]
The table above lists the configuration options for managed memory. Its size can be configured in two ways:

  • one is an absolute value;
  • the other is a fraction of the TaskManager’s total memory.

taskmanager.memory.managed.consumer-weights is a newly added configuration item. Its value is a map, i.e., a key:value pair followed by a comma and the next key:value pair. Two consumer keys are currently supported (see the configuration sketch after this list):

  • one is DATAPROC, which covers both the state backend memory in stream processing and the built-in batch operators in batch processing;
  • the other is PYTHON.
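Putting these options together, a minimal sketch of a managed memory configuration could look like this (the values are illustrative, not recommendations):

# Managed memory as an absolute value...
taskmanager.memory.managed.size: 1gb
# ...or as a fraction of total Flink memory (configure one of the two)
taskmanager.memory.managed.fraction: 0.4
# Split managed memory between consumers by weight
taskmanager.memory.managed.consumer-weights: DATAPROC:70,PYTHON:30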

2、 Resource scheduling

Some resource scheduling features are frequently asked about in release discussions and on the mailing lists, so we also introduce them here.

1. Maximum number of slots

[Figure: limiting the maximum number of slots]

Flink 1.12 supports limiting the maximum number of slots (slotmanager.number-of-slots.max). As mentioned earlier, streaming jobs require all operators to execute simultaneously so that data keeps flowing, so a job’s parallelism determines how many slots and resources it needs.

This does not hold for batch processing, however. Batch jobs often have high parallelism but do not need that many resources: with very few resources, tasks run one after another, and slots are freed for subsequent tasks once the previous ones finish. Executing tasks serially in this way avoids occupying excessive resources on the YARN / Kubernetes cluster. Currently this parameter is supported on YARN / Mesos / native Kubernetes.
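For example, capping a cluster at 100 slots could be configured like this (the value is illustrative):

# Cap the total number of slots the cluster will allocate
slotmanager.number-of-slots.max: 100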

2. TaskManager fault tolerance

In real production, program errors, network jitter, hardware failures, and other problems may make a TaskManager unreachable or even crash it outright; we then often see errors like TaskManagerLost in the logs. In such cases the job has to be restarted, and during the restart, resources must be re-requested and the TaskManager processes restarted, which is very costly.

For jobs with high stability requirements, Flink 1.12 provides a new feature: the Flink cluster can keep a small number of redundant TaskManagers, which are used to recover quickly from single points of failure without waiting for a new resource request to complete.

[Figure: redundant TaskManagers]

Redundant TaskManagers are enabled by configuring slotmanager.redundant-taskmanager-num. Redundancy here does not mean running, say, two TaskManagers with no load at all; it means the cluster keeps that many more TaskManagers than the total number of resources the job needs.

Tasks are distributed relatively evenly across all of them, which achieves a good load balance while still making use of the otherwise idle TaskManagers. When a failure occurs, tasks can be quickly rescheduled onto the surviving TaskManagers while a new round of resource requests is issued. Currently this parameter is supported on YARN / Mesos / native Kubernetes.
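For example, keeping two spare TaskManagers could be configured like this (the value is illustrative):

# Keep two redundant TaskManagers for fast failover
slotmanager.redundant-taskmanager-num: 2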

3. Spreading tasks out evenly

Uneven task distribution mainly occurs in Flink standalone mode, or in Kubernetes deployments of older versions. In these modes, the number of TaskManagers and the number of slots per TaskManager are defined in advance, which often leads to unbalanced scheduling: some TaskManagers may be packed full of tasks while others are left nearly empty.

Version 1.11 introduced the parameter cluster.evenly-spread-out-slots, which makes scheduling relatively balanced.
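Enabling it is a single switch:

# Prefer spreading slots evenly across registered TaskManagers
cluster.evenly-spread-out-slots: true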

[Figure: even slot distribution]

Note:

First, this parameter only takes effect in standalone mode, because on YARN and Kubernetes the number of TaskManagers is determined by the job’s requirements: the requests come first and TaskManagers are started to satisfy them, rather than TaskManagers existing first and slot requests then being scheduled onto them.

Moreover, each scheduling decision can only see the TaskManagers registered at that moment; Flink cannot know how many TaskManagers will register later. This is why many people ask why the feature does not seem to work well after being enabled. That is the first point.

The second point to note is that we can only decide how many free slots remain on each TaskManager; we cannot control the parallelism of each operator. Flink cannot guarantee that the instances of an operator are evenly spread across TaskManagers, because in Flink’s resource scheduling logic, tasks are completely invisible at the slot allocation layer.

3、 Extended resource framework

1. Background

In recent years, with the continuous development of artificial intelligence, deep learning models have been applied to a variety of production needs such as recommendation systems, ad serving, and intelligent risk control, all of which are scenarios where Flink is widely used. Supporting AI has therefore long been one of the Flink community’s goals, and many third-party open source extensions already exist for it. Alibaba’s open source work in this area mainly includes two projects:

  • one is Flink AI Extended, a deep learning extension framework based on Flink. It currently supports integrating TensorFlow, PyTorch, and other frameworks, and allows users to run TensorFlow as an operator inside a Flink task.
  • the other is Alink, a general-purpose algorithm platform based on Flink with many common machine learning algorithms built in.

These two projects extend Flink in terms of functionality. In terms of computing power, however, deep learning models and machine learning algorithms are usually the computational bottleneck of the whole task, and GPUs are the resource most widely used in this field to accelerate training and inference. Supporting GPU resources to accelerate computation is therefore essential for Flink’s development in the AI direction.

2. Use extended resources

At present, the only resource dimensions users can configure in Flink are CPU and memory. In practice there is demand not only for GPUs but also for other resources, such as SSDs or network acceleration devices like RDMA. We therefore want to provide a general extended resource framework to which any extended resource can be added as a plug-in; GPU is just one of them.

For the use of extension resources, two general requirements can be abstracted:

  • We need to support configuring and scheduling such extended resources. Users can declare their demand for them in the configuration, e.g., one GPU card per TaskManager, and when Flink runs on a resource base such as Kubernetes / YARN, the demand must be forwarded so that the requested container / pod actually contains the corresponding extended resource.
  • We need to provide the operator with runtime information about the extended resource. Users may need this information in their user-defined operators to actually use the resource; taking GPU as an example, the operator needs to know which GPU card its internal model can be deployed on, so this information must be handed to the operator.

3. How to use the extended resource framework

Using the extended resource framework can be divided into the following three steps:

  • first, set the relevant configuration for the extended resource;
  • then, prepare the plug-in for the required extended resource in the extended resource framework;
  • finally, in the operator, obtain the extended resource information from the RuntimeContext and use the resource.

3.1 Configuration parameters

# Define the name of the extended resource, "gpu"
external-resources: gpu
# Define the number of GPUs required per TaskManager
external-resource.gpu.amount: 1
# Define the config keys for the extended resource on YARN and Kubernetes
external-resource.gpu.yarn.config-key: yarn.io/gpu
external-resource.gpu.kubernetes.config-key: nvidia.com/gpu
# Define the factory class of the plug-in (GPUDriver)
external-resource.gpu.driver-factory.class: org.apache.flink.externalresource.gpu.GPUDriverFactory

The above is an example configuration for using GPU resources:

  • For any extended resource, the user first needs to add its name to external-resources; this name is also used as the prefix of the resource’s other configuration options. In the example, we define a resource named gpu.
  • At the scheduling layer, extended resource requirements can currently be configured at TaskManager granularity. In the example, we set the number of GPU devices per TaskManager to 1.
  • When Flink is deployed on Kubernetes or YARN, the extended resource’s config key on the corresponding resource base must also be configured, so that Flink can forward the resource requirement. The keys for GPU are shown in the example.
  • If a plug-in is provided, its factory class name must also be put into the configuration.

3.2 Preparations

Before actually using an extended resource, some preparatory work is needed. Taking GPU as an example:

  • In standalone mode, the cluster administrator must ensure that GPU resources are visible to the TaskManager processes.
  • In Kubernetes mode, the cluster must support device plugins [6], the Kubernetes version must be at least 1.10, and the GPU device plugin must be installed in the cluster.
  • In YARN mode, GPU scheduling requires the cluster’s Hadoop version to be at least 2.10 or 3.1, with resource-types.xml and the other relevant files configured correctly.

3.3 Extended resource framework plug-ins

After the extended resource has been scheduled, a user-defined operator may still need the resource’s runtime information in order to use it. The plug-ins in the extended resource framework are responsible for obtaining this information; their interfaces are as follows:

public interface ExternalResourceDriverFactory {
  /**
   * Create a driver for this external resource based on the given configuration.
   */
  ExternalResourceDriver createExternalResourceDriver(Configuration config) throws Exception;
}

public interface ExternalResourceDriver {
  /**
   * Retrieve the information of the external resources for the required amount.
   */
  Set<? extends ExternalResourceInfo> retrieveResourceInfo(long amount) throws Exception;
}

An ExternalResourceDriver is started on each TaskManager. The extended resource framework calls each driver’s retrieveResourceInfo interface to obtain the extended resource information on that TaskManager, and passes the obtained information into the operator’s RuntimeContext. ExternalResourceDriverFactory is the plug-in’s factory class.
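To make the contract concrete, here is a minimal sketch of a plug-in for a hypothetical resource named "fpga" (the resource, the class names, and the fabricated "index" property are invented for illustration; the package paths match Flink 1.12 but should be verified against your version):

import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

import org.apache.flink.api.common.externalresource.ExternalResourceDriver;
import org.apache.flink.api.common.externalresource.ExternalResourceDriverFactory;
import org.apache.flink.api.common.externalresource.ExternalResourceInfo;
import org.apache.flink.configuration.Configuration;

public class FpgaDriverFactory implements ExternalResourceDriverFactory {

    @Override
    public ExternalResourceDriver createExternalResourceDriver(Configuration config) {
        return new FpgaDriver();
    }

    // Hypothetical driver: fabricates `amount` devices, each exposing an "index" property.
    static class FpgaDriver implements ExternalResourceDriver {
        @Override
        public Set<? extends ExternalResourceInfo> retrieveResourceInfo(long amount) {
            Set<ExternalResourceInfo> infos = new HashSet<>();
            for (long i = 0; i < amount; i++) {
                infos.add(new FpgaInfo(String.valueOf(i)));
            }
            return infos;
        }
    }

    static class FpgaInfo implements ExternalResourceInfo {
        private final String index;

        FpgaInfo(String index) {
            this.index = index;
        }

        @Override
        public Optional<String> getProperty(String key) {
            // Only the device index is exposed, mirroring the GPU plug-in's "index" property.
            return "index".equals(key) ? Optional.of(index) : Optional.empty();
        }

        @Override
        public Collection<String> getKeys() {
            return Collections.singleton("index");
        }
    }
}

The factory class would then be registered via external-resource.fpga.driver-factory.class, analogous to the GPU configuration shown earlier.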

4. GPU plug-in

Flink currently ships with a built-in plug-in for GPU resources. Internally, it obtains the GPUs available in the current environment by executing a script called the discovery script; at present, this information consists of the indexes of the GPU devices.

Flink provides a default script, located in the plugins/external-resource-gpu/ directory of the distribution. Users can also implement a custom discovery script and specify it through configuration (see the sketch below the protocol description). The protocol between the script and the GPU plug-in is:

  • When the script is invoked, the required number of GPUs is passed as the first argument, followed by a list of user-defined arguments.
  • If the script executes normally, it outputs a list of GPU indexes, separated by commas.
  • If the script fails or the result is not as expected, it exits with a non-zero code. This causes TaskManager initialization to fail, and the script’s error message is printed in the log.

The default script provided by Flink uses the nvidia-smi tool to obtain the number of GPUs available on the current machine together with their indexes, and returns a list of GPU indexes according to the required amount. If the required number of GPUs cannot be obtained, the script exits with a non-zero code.
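A custom script can be wired in through the GPU plug-in’s parameters. A sketch (the option names follow the external-resource.&lt;name&gt;.param.* pattern of the GPU plug-in; the script path and argument are made up for illustration):

# Use a custom discovery script instead of the default one
external-resource.gpu.param.discovery-script.path: /opt/flink/scripts/my-gpu-discovery.sh
# User-defined arguments, passed to the script after the GPU amount
external-resource.gpu.param.discovery-script.args: --strategy round-robin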

A GPU device’s resources have two dimensions, the stream processors and the video memory, and the video memory only supports exclusive use. When multiple TaskManagers run on the same machine and one GPU is used by several processes, the video memory may run out (OOM). Therefore, standalone mode needs a TaskManager-level resource isolation mechanism.

The default script provides a coordination mode to support GPU resource isolation among multiple TaskManager processes on a single machine. This mode uses a file lock to synchronize GPU usage information across processes, coordinating how the TaskManager processes on the same machine use the GPU resources.

5. Obtaining extended resource information in the operator

In a user-defined operator, the resource name defined in external-resources can be used to call the getExternalResourceInfos interface of the RuntimeContext to obtain information about the corresponding extended resource. Taking GPU as an example, each ExternalResourceInfo obtained represents one GPU card, and its property named "index" is the device index of that card.

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.flink.api.common.externalresource.ExternalResourceInfo;
import org.apache.flink.api.common.functions.RichMapFunction;

public class ExternalResourceMapFunction extends RichMapFunction<String, String> {
  private static final String RESOURCE_NAME = "gpu";

  @Override
  public String map(String value) {
    Set<ExternalResourceInfo> gpuInfos = getRuntimeContext().getExternalResourceInfos(RESOURCE_NAME);
    List<String> indexes = gpuInfos.stream()
          .map(gpuInfo -> gpuInfo.getProperty("index").get())
          .collect(Collectors.toList());
    // Map function logic using the GPUs identified by `indexes`...
    return value;
  }
}

6. MNIST Demo

The following figure illustrates how GPUs can be used to accelerate a Flink job, taking the recognition task on the MNIST dataset as an example.

[Figure: MNIST handwritten digit recognition demo]

As shown in the figure above, MNIST is a dataset of handwritten digit images, each of which can be represented as a 28 × 28 matrix. In this task, we use a pre-trained DNN model: the image input passes through a single fully connected layer to produce a 10-dimensional vector, and the index of the vector’s largest element is the recognition result.
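Concretely, writing the flattened 784-dimensional image as a vector $x$ and the fully connected layer’s weights as a matrix $W$ (a bias-free layer, matching the single matrix-vector product in the map code below):

$$y = Wx, \qquad \hat{d} = \operatorname*{arg\,max}_{0 \le i < 10} y_i$$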

We start a standalone cluster with two TaskManager processes on an ECS instance with two GPU cards. With the coordination mode provided by the default script, we can ensure that each TaskManager uses exactly one of the two GPU cards.

The core operator of the job is the image recognition function MNISTClassifier, whose core implementation is as follows:

class MNISTClassifier extends RichMapFunction<List<Float>, Integer> {

  @Override
  public void open(Configuration parameters) {
    // Get the GPU information and select the first GPU
    Set<ExternalResourceInfo> externalResourceInfos = getRuntimeContext().getExternalResourceInfos(resourceName);
    final Optional<String> firstIndexOptional = externalResourceInfos.iterator().next().getProperty("index");
    // Initialize the JCuda components with the index of the first GPU
    JCuda.cudaSetDevice(Integer.parseInt(firstIndexOptional.get()));
    JCublas.cublasInit();
  }
}

In the open method, we obtain the GPUs available to the current TaskManager from the RuntimeContext, select the first one, and use its index to initialize the JCuda and JCublas libraries.

class MNISTClassifier extends RichMapFunction<List<Float>, Integer> {
    @Override
    public Integer map(List<Float> value) {
        // Use JCublas to perform the matrix-vector multiplication on the GPU
        JCublas.cublasSgemv('n', DIMENSIONS.f1, DIMENSIONS.f0, 1.0f,
                matrixPointer, DIMENSIONS.f1, inputPointer, 1, 0.0f, outputPointer, 1);

        // Copy the multiplication result back from GPU memory
        JCublas.cublasGetVector(DIMENSIONS.f1, Sizeof.FLOAT, outputPointer, 1, Pointer.to(output), 1);

        JCublas.cublasFree(inputPointer);
        JCublas.cublasFree(outputPointer);

        // The index of the largest element is the recognized digit
        int result = 0;
        for (int i = 0; i < DIMENSIONS.f1; ++i) {
            result = output[i] > output[result] ? i : result;
        }
        return result;
    }
}

In the map method, the pre-trained model parameters and the input matrix are first copied into GPU memory, JCublas performs the matrix-vector multiplication on the GPU, and finally the result vector is read back from GPU memory to obtain the recognized digit.

For the full demonstration, you can watch the video of the talk or try it out via the GitHub link.

4、 Future plans

In addition to the released features described above, the Apache Flink community is actively working on more resource management optimizations, which will be available in future versions:

  • Passive resource scheduling mode: managed memory allows Flink tasks to adapt flexibly to different TaskManager / slot resources, make full use of the resources available, and provide the best computing power under given resource constraints. However, the user still needs to specify the job’s parallelism, and Flink must acquire TaskManagers / slots matching that parallelism before the job can run. Passive resource scheduling will enable Flink to dynamically adjust the parallelism according to the available resources, processing data on a best-effort basis when resources are insufficient and restoring the specified parallelism once resources are sufficient, thereby guaranteeing processing performance.
  • Fine-grained resource management: Flink’s current slot-based resource management and scheduling mechanism assumes that all slots have the same specification. For some complex large-scale production jobs, it is often necessary to split the computation into multiple subgraphs, each executed in its own slot. When resource requirements differ greatly between subgraphs, slots of a uniform specification make it hard to achieve good resource efficiency, especially for expensive extended resources such as GPUs. Fine-grained resource management will allow users to specify resource requirements per job subgraph, and Flink will use TaskManagers / slots of different specifications to execute them, thereby optimizing resource efficiency.

5、 Summary

Through this article, I believe you now have a clearer understanding of Flink resource management:

  • First, starting from managed (native) memory, the job graph compilation phase, and the execution phase, we explained how memory is planned and allocated in each phase and how TaskManager memory allocation is controlled through the new configuration parameters;
  • Then, we went through the resource scheduling problems we commonly encounter, including the maximum number of slots, TaskManager fault tolerance, and how to spread tasks out evenly;
  • Finally, since GPUs are widely used for accelerated computing in machine learning and deep learning, we showed how Flink 1.12 uses the extended resource framework, together with a demo. Regarding resource utilization, we also presented the community’s future plans, including the passive resource scheduling mode and fine-grained resource management.

Original link
This article is original content of Alibaba Cloud and may not be reproduced without permission.

The extension in swift is somewhat similar to the category in OC Extension can beenumeration、structural morphology、class、agreementAdd new features□ you can add methods, calculation attributes, subscripts, (convenient) initializers, nested types, protocols, etc What extensions can’t do:□ original functions cannot be overwritten□ you cannot add storage attributes or add attribute observers to existing attributes□ cannot add parent […]