The problem of a Spark executor being killed by YARN

Time: 2020-10-24

One of our Spark jobs kept losing executors at runtime. At first I suspected the data volume was too large and the executors simply did not have enough memory, but after estimating the data size that explanation did not hold up. So the first step was to watch the executor's memory distribution with jvisualvm.
[jvisualvm view of the executor JVM's memory]

The process dies even though the old generation never fills up, so this is not a JVM-level OOM.
A closer look at the log of the corresponding NodeManager turns up the following:

WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=1151,containerID=container_1578970174552_5615_01_000003] is running beyond physical memory limits. Current usage: 4.3 GB of 4 GB physical memory used; 7.8 GB of 8.4 GB virtual memory used. Killing container.
Dump of the process-tree for container_1578970174552_5615_01_000003 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 1585 1175 1585 1151 (python) 50 67 567230464 8448 python -m pyspark.daemon
        |- 1596 1585 1585 1151 (python) 1006 81 1920327680 303705 python -m pyspark.daemon
        |- 1175 1151 1151 1151 (java) ...
        |- 1151 1146 1151 1151 (bash) ...
INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 1152 for container-id container_1578970174552_5615_01_000004: 4.3 GB of 4 GB physical memory used; 7.8 GB of 8.4 GB virtual memory used

According to the log, the container's process tree used more physical memory than its threshold, so YARN killed it. Note that the memory statistics are taken over the whole process tree. Our Spark job starts Python worker processes and ships data to them through PySpark, so the same data lives in both the JVM and the Python processes; counted per process tree, it is accounted for at least twice, which makes the threshold easy to exceed.
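Rather than disabling YARN's checks (shown further below), the Spark side can also reserve explicit headroom for the Python workers so that the whole process tree fits inside the container. The following is a minimal sketch with illustrative values only; spark.executor.memoryOverhead exists since Spark 2.3 (older releases use spark.yarn.executor.memoryOverhead), and spark.executor.pyspark.memory requires Spark 2.4 or later.

from pyspark.sql import SparkSession

# Illustrative values only; tune them to your own job and cluster.
spark = (
    SparkSession.builder
    .appName("pyspark-memory-headroom")
    # Extra non-heap memory added to the container request; the Python
    # workers' RSS has to fit in here (default is max(384 MB, 10% of
    # the executor memory)).
    .config("spark.executor.memoryOverhead", "2g")
    # Optional (Spark 2.4+): per-executor cap for the Python workers,
    # also added to the container request.
    .config("spark.executor.pyspark.memory", "1g")
    .getOrCreate()
)

Both settings enlarge the container Spark requests from YARN, so the Python daemons are no longer squeezed into space that was sized only for the JVM heap.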

In YARN, the NodeManager monitors each container's resource usage and enforces upper limits on both physical and virtual memory. If either limit is exceeded, the container is killed.

Maximum virtual memory = maximum physical memory x yarn.nodemanager.vmem-pmem-ratio (default 2.1)
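Plugging the numbers from the log above into this formula: the container's physical limit is 4 GB, so with the default ratio of 2.1 the virtual limit is 4 x 2.1 = 8.4 GB, which is exactly the "8.4 GB virtual memory" figure reported. A small illustrative sketch of the check (not YARN's actual code):

GB = 1024 ** 3

pmem_limit = 4 * GB            # container physical memory limit
vmem_limit = pmem_limit * 2.1  # yarn.nodemanager.vmem-pmem-ratio (default 2.1) -> 8.4 GB

pmem_used = 4.3 * GB           # RSS summed over the container's whole process tree
vmem_used = 7.8 * GB

if pmem_used > pmem_limit:     # 4.3 GB > 4 GB, so the container is killed
    print("is running beyond physical memory limits. Killing container.")
elif vmem_used > vmem_limit:
    print("is running beyond virtual memory limits. Killing container.")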

You can turn off these checks with two switches; note that they have to be set on the NodeManager side, in its yarn-site.xml:

<!-- Physical memory check -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
  <description>Whether physical memory limits will be enforced for containers.</description>
</property>
<!-- Virtual memory check -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
  <description>Whether virtual memory limits will be enforced for containers.</description>
</property>

Another possible cause is related to Spark's own memory settings:

Why does Spark on YARN run over its memory limit and get the container killed?

Other references:

[Hadoop] "Container is running beyond physical memory limits" error when running an MR job
How to set yarn.nodemanager.pmem-check-enabled?