Flink on YARN (2): common problems and troubleshooting ideas


Flink supports standalone deployment as well as cluster deployment modes such as YARN, Kubernetes, and Mesos, of which deployment on YARN is more and more widely used in China. The Flink community has launched a two-part series interpreting Flink on YARN applications. The first part walked through the whole startup process of a Flink on YARN application based on the resource scheduling model refactored in FLIP-6. Drawing on feedback from the community, this second part answers common questions about the client and the Flink cluster, and shares troubleshooting ideas for the related problems.
Common client problems and troubleshooting ideas
▼ The application submission console shows the exception: Could not build the program from JAR file.
This message is misleading: most of the time the problem is not with the jar file specified for the run, but with an exception raised during the submission process, which must be investigated further from the log. The most common cause is that the Hadoop dependency jars were not added to the classpath, so a required class cannot be found (for example ClassNotFoundException: org.apache.hadoop.yarn.exceptions.YarnException) and loading the client entry class (FlinkYarnSessionCli) fails.
▼ How does a Flink on YARN application find the specified YARN cluster when it is submitted?
A Flink on YARN client usually needs the two environment variables HADOOP_CONF_DIR and HADOOP_CLASSPATH so that it can load the Hadoop configuration and dependency jars. Example (assuming the environment variable HADOOP_HOME already points to the Hadoop deployment directory):
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath`
▼ Where is the client log and how is it configured?
The client log is usually in the log folder of the Flink deployment directory: ${FLINK_HOME}/log/flink-${USER}-client-<hostname>.log, and it is configured through log4j: ${FLINK_HOME}/conf/log4j-cli.properties.
When the client environment is complex and the log location or configuration is hard to pin down, you can set the following environment variable to enable log4j's own debug output and trace log4j initialization and the detailed configuration loading process: export JVM_ARGS="-Dlog4j.debug=true"
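As a concrete illustration, here is a minimal sketch of raising the client log level from INFO to DEBUG in that file. It runs against a stand-in copy; the file path and the exact rootLogger line are assumptions about a default log4j-cli.properties layout, so adjust them to your deployment.

```shell
# Sketch: raise the client log level from INFO to DEBUG.
# "demo-log4j-cli.properties" is a stand-in; the real file is
# ${FLINK_HOME}/conf/log4j-cli.properties, and its rootLogger line
# may differ from the one assumed here.
CONF="demo-log4j-cli.properties"
printf 'log4j.rootLogger=INFO, file\n' > "$CONF"   # stand-in for the real file
sed -i.bak 's/^log4j.rootLogger=INFO/log4j.rootLogger=DEBUG/' "$CONF"
cat "$CONF"   # prints: log4j.rootLogger=DEBUG, file
```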
Client troubleshooting ideas
When the client log does not allow the problem to be located, change the log level in the log4j configuration file from INFO to DEBUG and run again to see whether the debug output helps. For problems with no log output or incomplete log information, code-level debugging may be required; modifying the source code, repackaging, and replacing jars is too cumbersome, so the Java bytecode injection tool Byteman is recommended (for detailed syntax see the Byteman documentation). For example:
(1) Write a debugging script, e.g. to print the client class Flink actually uses. The following script prints the return value when CliFrontend#getActiveCustomCommandLine exits:
RULE test
CLASS org.apache.flink.client.cli.CliFrontend
METHOD getActiveCustomCommandLine
AT EXIT
DO traceln("------->CliFrontend#getActiveCustomCommandLine return: " + $!);
ENDRULE
(2) Set environment variables to use the Byteman javaagent:
export BYTEMAN_HOME=/path/to/byte-home
export TRACE_SCRIPT=/path/to/script
export JVM_ARGS="-javaagent:${BYTEMAN_HOME}/lib/byteman.jar=script:${TRACE_SCRIPT}"
(3) Run a test command, e.g. bin/flink run -m yarn-cluster -p 1 ./examples/streaming/WordCount.jar; the console outputs something like:
------->CliFrontend#getActiveCustomCommandLine return: org.apache.flink.yarn.cli.FlinkYarnSessionCli@...
Common Flink cluster problems and troubleshooting ideas
▼ Version conflicts between user application and framework jars
This problem usually throws exceptions such as NoSuchMethodError / ClassNotFoundException / IncompatibleClassChangeError. To solve it:
1. First locate the dependency library containing the exception class, then run mvn dependency:tree in the project to display the whole dependency chain as a tree and find the conflicting library in it. You can also add the parameter -Dincludes to restrict which packages are displayed, in the format [groupId]:[artifactId]:[type]:[version]; wildcard matching is supported and multiple patterns are separated by commas, for example: mvn dependency:tree -Dincludes=power,javassist;
2. After locating the conflicting package, decide how to handle it. A simple solution is to use an exclusion to remove the dependency pulled in transitively from another project. However, some application scenarios require multiple versions to coexist, with different components depending on different versions; in that case consider shading the dependency with the Maven Shade Plugin. For details, see the Maven Shade Plugin documentation.
▼ When multiple versions of a dependency's jars coexist on the classpath, how do you determine which jar a class is actually loaded from?
Many applications run with several versions of the same dependency's jars on the classpath, so the version actually used depends on the loading order. When troubleshooting, you often need to determine the source jar of a class. Flink supports configuring JVM parameters for the JM/TM processes, so you can print every loaded class and its source (output goes to the .out log) through one of the following three configuration items, chosen according to the process you need:
env.java.opts=-verbose:class // configures both JobManager and TaskManager
env.java.opts.jobmanager=-verbose:class // configures JobManager only
env.java.opts.taskmanager=-verbose:class // configures TaskManager only
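With -verbose:class enabled, the JVM writes one "[Loaded <class> from <source>]" line per class to the .out file, so a quick grep pinpoints the jar. A sketch against a stand-in log line (the file name, class, and jar here are assumptions, not real Flink output):

```shell
# Sketch: find which jar a suspect class was loaded from in a -verbose:class log.
# "demo-taskmanager.out" and the class/jar names are stand-ins.
OUT="demo-taskmanager.out"
printf '[Loaded org.example.Foo from file:/tmp/lib/foo-1.0.jar]\n' > "$OUT"
grep 'org.example.Foo' "$OUT"
# prints: [Loaded org.example.Foo from file:/tmp/lib/foo-1.0.jar]
```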
▼ How to view the complete logs of a Flink application?
The JM/TM logs of a running Flink application can be viewed in the web UI, but troubleshooting usually requires analyzing the complete logs, so it helps to understand YARN's log retention mechanism. Where container logs are stored on YARN depends on the application state:
1. If the application has not finished, container logs are kept on the nodes where the containers ran; even after a container has exited, its log can still be found on that node under the configured directory ${yarn.nodemanager.log-dirs}/<application_id>/<container_id>, or accessed directly from the NodeManager web UI: http://<NM address>/node/containerlogs/<container_id>/<user>
2. If the application has finished and the cluster has log aggregation enabled (yarn.log-aggregation-enable=true), the NM uploads all of the application's logs to distributed storage (usually HDFS) and deletes the local files after the application ends (incremental upload can also be configured). You can view all logs of the application with yarn logs -applicationId <applicationId> -appOwner <user>, or add -containerId <containerId> -nodeAddress <nodeHttpAddress> to view the log of a single container. You can also access the distributed storage directory directly: ${yarn.nodemanager.remote-app-log-dir}/${user}/${yarn.nodemanager.remote-app-log-dir-suffix}/<applicationId>
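To locate the aggregated files on HDFS directly, the path is assembled from the two yarn-site.xml values just quoted. A sketch using the common defaults ("/tmp/logs" and "logs"); the user and application id are stand-ins, and the defaults are not guaranteed for your cluster:

```shell
# Sketch: build the HDFS path where YARN aggregates an application's logs.
# REMOTE_DIR and SUFFIX are the common defaults for
# yarn.nodemanager.remote-app-log-dir and ...remote-app-log-dir-suffix;
# verify them in your cluster's yarn-site.xml. User/app id are stand-ins.
REMOTE_DIR="/tmp/logs"
SUFFIX="logs"
APP_USER="alice"
APP_ID="application_1234567890123_0001"
echo "${REMOTE_DIR}/${APP_USER}/${SUFFIX}/${APP_ID}"
# prints: /tmp/logs/alice/logs/application_1234567890123_0001
```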
▼ Troubleshooting ideas for Flink application resource allocation
If a Flink application cannot start normally and reach the RUNNING state, troubleshoot as follows:
1. First check the current state of the application. From the startup process described in the previous part, we know:
If it is in NEW_SAVING state, application information is being persisted; if it stays in this state, check whether the RM state store (usually a ZooKeeper cluster) is healthy.
If it is in SUBMITTED state, some time-consuming operation holding the RM read/write lock may be causing event backlog in the RM; locate the cause further from the YARN cluster logs;
If it is in ACCEPTED state, first check whether the AM is normal; go to step 2;
If it is already in RUNNING state but not all resources have been obtained and the job cannot run normally, go to step 3;
2. Check whether the AM is normal. View the diagnostics information on the YARN application page (http://<RM address>/cluster/app/<applicationId>) or via the YARN application REST API (http://<RM address>/ws/v1/cluster/apps/<applicationId>), and identify the cause and solution from the keywords:

  • Queue's AM resource limit exceeded. The queue's maximum available AM resources have been reached; that is, the queue's used AM resources plus the newly requested AM resources exceed the queue's AM resource limit. You can appropriately raise the queue's AM resource percentage via the configuration item yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent.
  • User's AM resource limit exceeded. The maximum AM resources available to this application's user in the queue have been reached; that is, the user's used AM resources in the queue plus the newly requested AM resources exceed that user's AM resource limit. Raising the user's share can resolve this; related configuration items: yarn.scheduler.capacity.<queue-path>.user-limit-factor and yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent.
  • AM container is launched, waiting for AM container to register with RM. The AM has been started but its internal initialization has not finished; there may be problems such as a ZK connection timeout. Check the AM log for the specific cause and fix accordingly.
  • Application is activated, waiting for resources to be assigned for AM. The application is waiting for the scheduler to allocate resources for the AM; check the scheduler allocation as in step 4.

3. Confirm whether there is a resource request that YARN has failed to satisfy: from the application list page, click the problematic application ID to enter the application page, then click the application attempt ID in the list below to enter the attempt page, and check whether there are pending resources in the Total Outstanding Resource Requests list. If there are none, YARN has allocated everything; exit this check and move on to checking the AM. If there are, the scheduler has not completed the allocation; go to step 4;
4. Scheduler allocation troubleshooting. YARN-9050 adds automatic diagnosis of application problems in the web UI and through the REST API; it will be released in Hadoop 3.3.0. Earlier versions still need manual troubleshooting:
Check cluster and queue resources. In the tree view of the scheduler page, expand the leaf queue and look at its resource information (Effective Max Resources, Used Resources): (1) check whether the cluster's resources, the queue's resources, or its parent queue's resources are used up; (2) check whether any single resource dimension of the leaf queue is close to or at its upper limit;
Check for resource fragmentation: (1) check the proportion of used plus reserved resources in the total cluster resources. When the cluster is nearly full (for example above 90%), resource fragmentation is likely and application allocation slows down, because most machines have no free resources left and machines with insufficient free resources get reserved; beyond a certain scale most machine resources may be locked by reservations and subsequent allocation slows further; (2) check the distribution of free resources across NMs. Even when overall cluster utilization is not high, an uneven distribution across resource dimensions can block allocation: for example, if on half the nodes memory is nearly full while plenty of CPU remains, and on the other half CPU is nearly full while plenty of memory remains, then a request whose value in one resource dimension is configured too large may be impossible to satisfy.
Check whether a high-priority application is frequently requesting and immediately releasing resources. This keeps the scheduler busy satisfying that application's requests while other applications are not considered.
Check whether containers fail to start or exit immediately after starting. Examine the container logs (including the localization log and the launch log), the YARN NM log, or the YARN RM log.
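The diagnostics field consulted in step 2 can also be pulled from the REST API without the web UI. A sketch against a stand-in JSON body; in practice RESP would come from curl against http://<RM address>/ws/v1/cluster/apps/<applicationId>, and the diagnostics text here is invented for illustration:

```shell
# Sketch: extract the "diagnostics" field from a YARN RM REST API response.
# RESP is a stand-in for the JSON a real RM would return; fetch it with curl
# in practice. A JSON-aware tool like jq is preferable when available.
RESP='{"app":{"state":"ACCEPTED","diagnostics":"AM resource limit exceeded"}}'
echo "$RESP" | grep -o '"diagnostics":"[^"]*"'
# prints: "diagnostics":"AM resource limit exceeded"
```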
▼ TaskManager startup exception:
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired. current time is … found …
This exception is thrown when the Flink AM asks a YARN NM to start a container whose token has expired. The usual cause is that the Flink AM did not start the container until long after receiving it from the YARN RM (by default, more than 10 minutes after the container was handed to the AM); a secondary cause is that Flink starts the containers serially after receiving the resources returned by the YARN RM.
When many containers need to be started and the distributed file storage (such as HDFS) performs slowly (the TaskManager configuration is uploaded before each start), container start requests easily pile up inside Flink. FLINK-13184 optimizes this: first, a validity check is added before startup to avoid a pointless configuration upload; second, the startup is made asynchronous and multi-threaded to speed it up.
▼ Failover exception 1:
java.util.concurrent.TimeoutException: Slot allocation request timed out for …
The cause is that the requested TaskManager resources cannot be allocated. Follow step 4 of the Flink application resource allocation troubleshooting ideas above.
▼ Failover exception 2:
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id timed out.
The direct cause of the exception is that the TaskManager's heartbeat timed out. Deeper causes may include:
The process has already exited, possibly because of an error or because of preemption by the YARN RM or NM; trace further in the TaskManager log or the YARN RM/NM logs;
The process is still running but a cluster network problem caused the connection to be lost; the process exits by itself after the connection times out, and the JobManager recovers from the failure on its own (re-requesting resources and starting a new TaskManager);
The process's GC pauses are too long, possibly because of a memory leak or an unreasonable memory configuration; locate the specific cause from the logs or by analyzing the process's memory.
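For the GC case, one way to gather evidence is to enable GC logging for TaskManagers through the env.java.opts.taskmanager key mentioned earlier. A sketch against a stand-in config file; the -XX flags assume a JDK 8 JVM (JDK 9+ uses -Xlog:gc instead):

```shell
# Sketch: enable TaskManager GC logging so long pauses show up in the logs.
# "demo-flink-conf.yaml" is a stand-in for ${FLINK_HOME}/conf/flink-conf.yaml;
# the -XX flags assume a JDK 8 JVM.
CONF="demo-flink-conf.yaml"
echo 'env.java.opts.taskmanager: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps' >> "$CONF"
grep 'env.java.opts.taskmanager' "$CONF"
```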
▼ Failover exception 3:
java.lang.Exception: Container released on a lost node
The cause is that the node the container ran on was marked lost in the YARN cluster; all containers on that node are actively released by the YARN RM, and the AM is notified. The JobManager recovers from this exception on its own (re-requesting resources and starting a new TaskManager); the leftover TaskManager process exits by itself after a timeout.
▼ Flink cluster troubleshooting ideas
First, analyze and locate the problem from the JobManager/TaskManager logs; for obtaining the complete logs, see "How to view the complete logs of a Flink application" above. If you need DEBUG information, modify the JobManager/TaskManager log4j configuration (${FLINK_HOME}/conf/log4j.properties) and resubmit the job. For a running process, the Java bytecode injection tool Byteman is recommended for inspecting the process's internal state; for details see: How Do I Install The Agent Into A Running Program?
Reference material
There are jumps in the green font part of the article. Please refer to the following link for details:
Byteman Documents
Maven Shade Plugin
How Do I Install The Agent Into A Running Program?
The two parts of this Flink on YARN series walk through the whole startup process of a Flink on YARN application and provide troubleshooting ideas for common client and Flink cluster problems, for your reference; we hope they are helpful in practice.
Author: Yang Chen (Boyuan)
Original link: https://yq.aliyun.com/article…
This is original content of the Yunqi community and may not be reproduced without permission.