Diagnosing and fixing Linux kernel problems encountered when testing TiDB Operator in K8s

Time: 2019-09-12

Author: Zhang Wenbo

Kubernetes (K8s) is an open-source container orchestration system that automates application deployment, scaling, and management. It is the operating system of the cloud-native world. Any defect in K8s or in the operating system itself may put user processes at risk. While testing TiDB Operator (a tool for creating and managing TiDB clusters) in K8s, we, the PingCAP EE (efficiency engineering) team, found two Linux kernel bugs. These bugs had troubled us for a long time and have not been completely fixed across the K8s community.

After extensive investigation and diagnosis, we identified ways to deal with these problems. In this article, we share these solutions with you. Useful as they are, we consider them only temporary measures, and we believe more elegant solutions will appear in the future. We also hope that the K8s community, RHEL, and CentOS can completely fix these problems in the near future.

Bug #1: Diagnosis and repair of unstable Kmem Accounting

Key words: SLUB: Unable to allocate memory on node-1

Community-related Issues:

  • https://github.com/kubernetes/kubernetes/issues/61937
  • https://github.com/opencontainers/runc/issues/1725
  • https://support.mesosphere.com/s/article/Critical-Issue-KMEM-MSPH-2018-0006

Origin of problem

The Schrodinger platform is an automated testing framework based on K8s that we developed in-house. It provides various Chaos capabilities, as well as automated bench testing, anomaly monitoring, alerting, and automatic test-report generation. We found that I/O performance jitter occasionally occurred when TiKV ran OLTP tests on the Schrodinger platform, but no abnormality was found in the following items:

  • Logs of TiKV and RocksDB
  • CPU utilization
  • Load information such as memory and disk

Only occasionally did the output of the dmesg command contain some “SLUB: Unable to allocate memory on node-1” messages.
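A quick way to confirm this symptom is to scan the kernel ring buffer directly, for example:

$ dmesg -T | grep "SLUB: Unable to allocate memory"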

Problem analysis

We used funcslower in perf-tools to trace kernel functions that executed slowly, and adjusted the threshold of the kernel parameter hung_task_timeout_secs, capturing some kernel path information while TiKV was performing write operations.
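For reference, a sketch of how such a trace can be set up with perf-tools; the traced function name and the thresholds below are illustrative assumptions, not the exact values we used:

# report hung tasks sooner by lowering the detection threshold (default is 120 s)
$ echo 10 > /proc/sys/kernel/hung_task_timeout_secs

# report invocations of a kernel function that take longer than 10000 microseconds
$ ./funcslower ext4_writepages 10000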

From the captured call stacks, we could see that the I/O jitter was related to the file system executing writepage. At the same time, when the performance jitter was captured, the node had sufficient memory resources, yet dmesg again returned a large amount of “SLUB: Unable to allocate memory on node-1” messages.

Combining the call stack output from hung_task with the kernel code, we found that when the kernel executes the bvec_alloc function to allocate bio_vec objects, it first tries to allocate through kmem_cache_alloc. If kmem_cache_alloc fails, it falls back to allocating from the mempool; inside the mempool it first tries the pool->alloc callback. When pool->alloc also fails, the kernel sets the process to an uninterruptible state and puts it in a wait queue. Only when another process returns memory to the mempool, or the timer times out (5 s), does the process scheduler wake the process up to retry. This wait time matches the jitter latency seen in our business monitoring.

However, we did not set a kmem limit when we created the Docker containers. Why was kmem still running short? To determine whether a kmem limit was set, we went into the cgroup memory controller to check the container's kmem information and found that kmem accounting statistics were enabled, but the limit value was set very large.
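The check itself is just a read of the container's memory cgroup files; the pod UID and container ID below are placeholders:

$ cd /sys/fs/cgroup/memory/kubepods/burstable/pod<pod-uid>/<container-id>
$ cat memory.kmem.usage_in_bytes   # non-zero usage means kmem accounting is active
$ cat memory.kmem.limit_in_bytes   # a very large value means the limit is effectively unlimited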

We knew that kmem accounting was unstable in the RHEL 3.10 kernel, so we suspected that the SLUB allocation failures were caused by a kernel bug. Searching the kernel patch information, we found that it was indeed a kernel bug, which had been fixed in later community kernels:

slub: make dead caches discard free slabs immediately

There is also a namespace leak problem associated with kmem accounting:

mm: memcontrol: fix cgroup creation failure after many small jobs

So who turned on kmem accounting? We used the opensnoop tool in BCC to monitor the kmem configuration file and captured runc as the modifier. From the K8s code we confirmed that runc, which K8s depends on, enables kmem accounting by default.
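A sketch of that capture, assuming bcc-tools installs opensnoop under /usr/share/bcc/tools (the path may differ by distribution):

$ /usr/share/bcc/tools/opensnoop | grep memory.kmem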

Solution

From the above analysis, we could either upgrade to a newer kernel or disable kmem accounting when starting containers. runc now provides a conditional compilation option, so we can disable kmem accounting through Build Tags. After disabling it, we found that the I/O jitter, the namespace leak problem, and the SLUB allocation failures all disappeared.
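For illustration, building runc itself with kmem accounting compiled out looks roughly like the following; the exact BUILDTAGS set depends on the runc version and the features you need:

$ git clone https://github.com/opencontainers/runc && cd runc
$ make BUILDTAGS="seccomp nokmem"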

Operation steps

We need to disable kmem accounting in both kubelet and Docker.

  1. kubelet needs to be recompiled; the method differs by version.

    If the kubelet version is v1.14 or above, kmem accounting can be disabled by adding Build Tags when compiling kubelet:

    $ git clone --branch v1.14.1 --single-branch --depth 1 https://github.com/kubernetes/kubernetes
    $ cd kubernetes
    
    $ KUBE_GIT_VERSION=v1.14.1 ./build/run.sh make kubelet GOFLAGS="-tags=nokmem"

    However, if the kubelet version is v1.13 or below, kmem accounting cannot be disabled by adding Build Tags when compiling kubelet; instead, you need to modify the source code manually and then recompile. The steps are as follows.

    First download the Kubernetes code:

    $ git clone --branch v1.12.8 --single-branch --depth 1 https://github.com/kubernetes/kubernetes
    $ cd kubernetes

    Then manually replace the two functions that enable kmem accounting with the following no-op stubs:

    func EnableKernelMemoryAccounting(path string) error {
        return nil
    }
    
    func setKernelMemory(path string, kernelMemoryLimit int64) error {
        return nil
    }
    

    Then recompile kubelet:

    $ KUBE_GIT_VERSION=v1.12.8 ./build/run.sh make kubelet

    The compiled kubelet is at ./_output/dockerized/bin/$GOOS/$GOARCH/kubelet.

  2. At the same time, docker-ce needs to be upgraded to 18.09.1 or later; in this version, the runc bundled with Docker has kmem accounting disabled (a quick version check is shown after this list).
  3. Finally, the machine needs to be restarted.
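Before restarting each node, it is worth confirming the Docker version and putting the rebuilt kubelet in place. The install path below is only illustrative and depends on how kubelet is deployed in your environment:

$ docker version --format '{{.Server.Version}}'   # should print 18.09.1 or later
$ systemctl stop kubelet
$ cp ./_output/dockerized/bin/linux/amd64/kubelet /usr/bin/kubelet   # illustrative install path
$ systemctl start kubelet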

To validate, check that kmem accounting has been disabled for all containers of a newly created pod. If it is disabled, you will see the following result:

$ cat /sys/fs/cgroup/memory/kubepods/burstable/pod<pod-uid>/<container-id>/memory.kmem.slabinfo
cat: memory.kmem.slabinfo: Input/output error

Bug #2: Diagnosis and repair of a network device reference count leak

Key words: kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

Community-related Issues:

  • https://github.com/kubernetes/kubernetes/issues/64743
  • https://github.com/projectcalico/calico/issues/1109
  • https://github.com/moby/moby/issues/5618

Origin of problem

After our Schrodinger distributed test cluster had been running for a period of time, the message “kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 1” frequently appeared and persisted, causing multiple processes to enter an uninterruptible state. The only way out was to restart the server.
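When the problem hit, it could be confirmed from the kernel log and from the pile-up of uninterruptible (D state) processes, for example:

$ dmesg -T | grep "unregister_netdevice: waiting for"
$ ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'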

Problem analysis

By using the crash tool to analyze the vmcore, we found that the kernel threads were blocked in the netdev_wait_allrefs function, looping forever waiting for dev->refcnt to drop to 0. Since the pod had already been released, we suspected a reference count leak. We searched the K8s issues and found that the problem lay in the kernel, but there was no simple, stable, and reliable way to reproduce it, and it still occurred in later community kernels.
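For reference, a typical crash session for this kind of analysis looks roughly like the following; the debuginfo and vmcore paths are placeholders for whatever your kdump setup produces:

$ crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<host>-<timestamp>/vmcore
crash> foreach UN bt   # back-traces of all uninterruptible tasks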

To avoid having to restart the server every time the problem occurs, we developed a kernel module: when a leaked net_device reference count is found, the module clears the reference count to zero and then removes the network device (taking care not to remove, by mistake, other network devices whose reference counts have not leaked). To avoid doing this cleanup manually each time, we wrote a monitoring script that performs the operation automatically at regular intervals (a sketch of this watcher appears after the list below). However, this scheme still has shortcomings:

  • There is a delay between the reference count leaking and the monitoring detecting it, and during this delay the K8s system may run into other problems.
  • It is hard to tell inside a kernel module whether a reference count has really leaked. netdev_wait_allrefs keeps republishing NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL messages to all subscribers through Notification Chains, and to rule out misjudgment we would have to trace the handling logic of every callback registered by the 22 subscribers, which is not easy.
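A minimal sketch of that watcher script, assuming the actual cleanup is done by our kernel module (not shown here); only the detection loop is illustrated:

#!/bin/bash
# Periodically look for the leak signature in the kernel log, then alert and trigger cleanup.
while true; do
    if dmesg -T | tail -n 1000 | grep -q "unregister_netdevice: waiting for"; then
        echo "$(date) net_device refcount leak detected on $(hostname)" >> /var/log/netdev-leak.log
        # here we would load the cleanup module / page the on-call engineer
    fi
    sleep 60
done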

Solution

While preparing to dig into the callback logic registered by each subscriber, we kept following the progress of kernel patches and RHEL. We found that RHEL solution 3659011 had an update, referring to a patch submitted upstream:

route: set the deleted fnhe fnhe_daddr to 0 in ip_del_fnhe to fix a race

After applying this patch to the kernel as a hotfix, we continued testing for a week and the problem did not recur. We fed the test results back to RHEL and learned that they had started backporting the patch.

Operation steps

The recommended kernel version is CentOS 7.6 kernel-3.10.0-957 or above.

  1. Install the dependencies for kpatch and kpatch-build:

    UNAME=$(uname -r)
    sudo yum install gcc kernel-devel-${UNAME%.*} elfutils elfutils-devel
    sudo yum install pesign yum-utils zlib-devel \
      binutils-devel newt-devel python-devel perl-ExtUtils-Embed \
      audit-libs audit-libs-devel numactl-devel pciutils-devel bison
    
    # enable CentOS 7 debug repo
    sudo yum-config-manager --enable debug
    
    sudo yum-builddep kernel-${UNAME%.*}
    sudo debuginfo-install kernel-${UNAME%.*}
    
    # optional, but highly recommended - enable EPEL 7
    sudo yum install ccache
    ccache --max-size=5G
    
  2. Install kpatch and kpatch-build:

    git clone https://github.com/dynup/kpatch && cd kpatch
    make 
    sudo make install
    systemctl enable kpatch
  3. Download and build the hot patch kernel module:

    curl -SOL https://raw.githubusercontent.com/pingcap/kdt/master/kpatchs/route.patch
    kpatch-build -t vmlinux route.patch
    mkdir -p /var/lib/kpatch/${UNAME}
    cp -a livepatch-route.ko /var/lib/kpatch/${UNAME}
    systemctl restart kpatch   # load the kernel module
    kpatch list                # check the loaded module

Summary

Although we have worked around these kernel bugs, better solutions should come along eventually. For Bug #1, we hope the K8s community can provide a kubelet parameter that lets users disable or enable kmem accounting. For Bug #2, the best solution is for RHEL and CentOS to fix the kernel bugs; we hope that after upgrading CentOS to a new version, TiDB users will no longer have to worry about this problem.
