Core dump caused by Kafka log cleanup

Time: 2021-09-15

Original link: http://fxbing.github.io/2021/…

The team developed a Kafka-on-HDFS feature that stores Kafka data on HDFS. During use, however, core dumps occurred on some machines.

This article is based on Kafka version 0.10.2.

Troubleshooting process

1. Analyzing the core dump files

Since core dumps occurred on many machines, the first step was to analyze the core dump files. They showed two common patterns:

  1. The crashed threads all have HDFS-related names, which suggests the problem lies in the HDFS-related functionality.
  2. The crash sites are all in the index-reading logic, which suggests a connection to the index cleanup logic.

2. Analyzing the code

Reading Kafka's local log deletion code shows that local segments are deleted via asyncDeleteSegment. When deleting, asyncDeleteSegment first renames the local log and index files, then deletes them only after a configurable delay (config.fileDeleteDelayMs, i.e. the broker's file.delete.delay.ms setting).

/** a file that is scheduled to be deleted */
  val DeletedFileSuffix = ".deleted"


/**
   * Perform an asynchronous delete on the given file if it exists (otherwise do nothing)
   *
   * @throws KafkaStorageException if the file can't be renamed and still exists
   */
  private def asyncDeleteSegment(segment: LogSegment) {
    segment.changeFileSuffixes("", Log.DeletedFileSuffix)
    def deleteSeg() {
      info("Deleting segment %d from log %s.".format(segment.baseOffset, name))
      segment.delete()
    }
    scheduler.schedule("delete-file", deleteSeg, delay = config.fileDeleteDelayMs)
  }


/**
   * Change the suffix for the index and log file for this log segment
   */
  def changeFileSuffixes(oldSuffix: String, newSuffix: String) {

    def kafkaStorageException(fileType: String, e: IOException) =
      new KafkaStorageException(s"Failed to change the $fileType file suffix from $oldSuffix to $newSuffix for log segment $baseOffset", e)

    try log.renameTo(new File(CoreUtils.replaceSuffix(log.file.getPath, oldSuffix, newSuffix)))
    catch {
      case e: IOException => throw kafkaStorageException("log", e)
    }
    try index.renameTo(new File(CoreUtils.replaceSuffix(index.file.getPath, oldSuffix, newSuffix)))
    catch {
      case e: IOException => throw kafkaStorageException("index", e)
    }
    try timeIndex.renameTo(new File(CoreUtils.replaceSuffix(timeIndex.file.getPath, oldSuffix, newSuffix)))
    catch {
      case e: IOException => throw kafkaStorageException("timeindex", e)
    }
  }

However, the HDFS-related cleanup code deletes the log and index files directly, with no rename and no delay. Since Kafka's index files are memory-mapped, the suspicion was that forcibly deleting an index while another thread is still reading it invalidates the mapping under the reader and crashes the JVM, producing the core dump.
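
For contrast, the HDFS cleanup path did something close to the following. This is a simplified, hypothetical sketch (HdfsLogSegment and deleteHdfsSegment are illustrative names, not the actual internal code):

  /** Hypothetical sketch of the problematic cleanup: the segment files are removed
   *  immediately, with no rename step and no deletion delay, so a concurrent
   *  index lookup may still be reading them. */
  private def deleteHdfsSegment(segment: HdfsLogSegment): Unit = {
    segment.delete() // removes the log and index files right away
  }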

3. Test conclusion

A unit test that forcibly deletes the index file while a local index lookup is in progress does reproduce the core dump.
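
For reference, a test of this kind can be sketched as follows. This is an illustrative approximation, not the team's actual test: it assumes the crash comes from the index's memory mapping being invalidated while a reader is still using it, and it releases the mapping with the JDK 8 internal DirectBuffer cleaner (similar in spirit to the forced-unmap logic in Kafka's index classes). Running it is expected to kill the JVM with SIGSEGV/SIGBUS and leave a core dump.

import java.io.RandomAccessFile
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

object IndexUnmapCrashSketch {
  def main(args: Array[String]): Unit = {
    // Create and memory-map a small file, the way Kafka maps its index files.
    val raf = new RandomAccessFile("/tmp/crash-test.index", "rw")
    raf.setLength(4096)
    val mmap: MappedByteBuffer = raf.getChannel.map(FileChannel.MapMode.READ_WRITE, 0, 4096)
    raf.close()

    // Reader thread: keeps reading the mapped buffer, like an in-flight index lookup.
    val reader = new Thread(new Runnable {
      override def run(): Unit = { while (true) mmap.getInt(0) }
    })
    reader.setDaemon(true)
    reader.start()

    Thread.sleep(100)
    // Forcibly release the mapping while the reader is still using it
    // (JDK 8 internal API; later JDKs encapsulate it).
    mmap.asInstanceOf[sun.nio.ch.DirectBuffer].cleaner().clean()
    Thread.sleep(2000) // the reader typically hits SIGSEGV/SIGBUS here and the process core dumps
  }
}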

Solution

Follow the local log cleanup strategy in the HDFS-related logic as well: instead of deleting the files directly, rename them first and delete them only after a delay.
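
A minimal sketch of the fix, mirroring asyncDeleteSegment above, is shown below. The names asyncDeleteHdfsSegment, HdfsLogSegment and the "delete-hdfs-file" job name are assumptions for illustration; only the rename-then-delayed-delete pattern is taken from the local code path.

  /** Sketch: rename the HDFS segment files first, then physically delete them only
   *  after config.fileDeleteDelayMs, so in-flight reads have time to finish. */
  private def asyncDeleteHdfsSegment(segment: HdfsLogSegment): Unit = {
    // Step 1: take the files out of the normal read path by renaming them.
    segment.changeFileSuffixes("", Log.DeletedFileSuffix)
    // Step 2: schedule the physical deletion after the configured delay.
    def deleteSeg(): Unit = {
      info("Deleting HDFS segment %d from log %s.".format(segment.baseOffset, name))
      segment.delete()
    }
    scheduler.schedule("delete-hdfs-file", deleteSeg, delay = config.fileDeleteDelayMs)
  }

With file.delete.delay.ms at its default of 60000 ms, any read that started before the rename has a full minute to finish before the files actually disappear.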

Summary

  1. This problem is intermittent: it only occurs when a Kafka consumer lags far enough behind that it is still reading a log segment that is about to be deleted.
  2. When analyzing core dump problems, look for the commonalities across multiple cases.