Kafka source learning: KafkaApis LEADER_AND_ISR

Time: 2021-06-14

Link to the original text: https://fxbing.github.io/2021…
This analysis is based on the Kafka 0.10.2 source code.

Whenever the controller changes state, it calls the sendRequestsToBrokers method to send a LeaderAndIsrRequest. This article mainly introduces the logic and flow of how the Kafka server handles this request.

LEADER_AND_ISR

Overall logical process

case ApiKeys.LEADER_AND_ISR => handleLeaderAndIsrRequest(request)

After the server receives a LEADER_AND_ISR request, it is processed by the handleLeaderAndIsrRequest method. The processing flow of this method is shown in the figure below.

[Figure: handleLeaderAndIsrRequest processing flow]

Source code

handleLeaderAndIsrRequest

The logic of the handleLeaderAndIsrRequest function is mainly divided into the following parts:

  1. Construct the callback function onLeadershipChange, which calls back the coordinator to handle partitions that become leaders or followers
  2. Verify the request's permissions. If authorization succeeds, call replicaManager.becomeLeaderOrFollower(correlationId, leaderAndIsrRequest, metadataCache, onLeadershipChange) for the subsequent processing [this is the main path of the function]; otherwise, directly return the error code Errors.CLUSTER_AUTHORIZATION_FAILED.code
def handleLeaderAndIsrRequest(request: RequestChannel.Request) {
    // ensureTopicExists is only for client facing requests
    // We can't have the ensureTopicExists check here since the controller sends it as an advisory to all brokers so they
    // stop serving data to clients for the topic being deleted
    val correlationId = request.header.correlationId
    val leaderAndIsrRequest = request.body.asInstanceOf[LeaderAndIsrRequest]

    try {
      def onLeadershipChange(updatedLeaders: Iterable[Partition], updatedFollowers: Iterable[Partition]) {
        // for each new leader or follower, call coordinator to handle consumer group migration.
        // this callback is invoked under the replica state change lock to ensure proper order of
        // leadership changes
        updatedLeaders.foreach { partition =>
          if (partition.topic == Topic.GroupMetadataTopicName)
            coordinator.handleGroupImmigration(partition.partitionId)
        }
        updatedFollowers.foreach { partition =>
          if (partition.topic == Topic.GroupMetadataTopicName)
            coordinator.handleGroupEmigration(partition.partitionId)
        }
      }

      val leaderAndIsrResponse =
        if (authorize(request.session, ClusterAction, Resource.ClusterResource)) {
          val result = replicaManager.becomeLeaderOrFollower(correlationId, leaderAndIsrRequest, metadataCache, onLeadershipChange)
          new LeaderAndIsrResponse(result.errorCode, result.responseMap.mapValues(new JShort(_)).asJava)
        } else {
          val result = leaderAndIsrRequest.partitionStates.asScala.keys.map((_, new JShort(Errors.CLUSTER_AUTHORIZATION_FAILED.code))).toMap
          new LeaderAndIsrResponse(Errors.CLUSTER_AUTHORIZATION_FAILED.code, result.asJava)
        }

      requestChannel.sendResponse(new Response(request, leaderAndIsrResponse))
    } catch {
      case e: KafkaStorageException =>
        fatal("Disk error during leadership change.", e)
        Runtime.getRuntime.halt(1)
    }
  }
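
For context on the onLeadershipChange callback: Topic.GroupMetadataTopicName is the internal __consumer_offsets topic, so the coordinator is only involved when leadership of a group-metadata partition moves. The two coordinator calls roughly do the following; this is a paraphrased sketch of the 0.10.x GroupCoordinator, not a verbatim copy, so check the actual source for the exact code.

// Paraphrased sketch (not verbatim) of the GroupCoordinator methods used by onLeadershipChange.
// This broker became leader of an __consumer_offsets partition: load the consumer group
// metadata and offsets stored in that partition so this broker can serve those groups.
def handleGroupImmigration(offsetTopicPartitionId: Int) {
  groupManager.loadGroupsForPartition(offsetTopicPartitionId, onGroupLoaded)
}

// This broker became a follower of an __consumer_offsets partition: unload the group
// metadata it was serving, since the new leader will take over those groups.
def handleGroupEmigration(offsetTopicPartitionId: Int) {
  groupManager.removeGroupsForPartition(offsetTopicPartitionId, onGroupUnloaded)
}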

becomeLeaderOrFollower

The main work done by ReplicaManager.becomeLeaderOrFollower is as follows:

  1. Verify that the controller epoch is valid; only topic-partitions whose leader epoch in the request is higher than the local leader epoch and that have a local replica are processed
  2. Call the makeLeaders and makeFollowers methods to construct the new leader partitions and follower partitions
  3. If this is the first such request received, start the thread that periodically checkpoints the HW
  4. Stop idle fetcher threads
  5. Invoke the callback function so that the coordinator handles the new leader and follower partitions
def becomeLeaderOrFollower(correlationId: Int, leaderAndISRRequest: LeaderAndIsrRequest,
                           metadataCache: MetadataCache,
                           onLeadershipChange: (Iterable[Partition], Iterable[Partition]) => Unit): BecomeLeaderOrFollowerResult = {
    leaderAndISRRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
        stateChangeLogger.trace("Broker %d received LeaderAndIsr request %s correlation id %d from controller %d epoch %d for partition [%s,%d]"
                                .format(localBrokerId, stateInfo, correlationId,
                                        leaderAndISRRequest.controllerId, leaderAndISRRequest.controllerEpoch, topicPartition.topic, topicPartition.partition))
    }
    //Main logic: process the request and construct the result under the replica state change lock
    replicaStateChangeLock synchronized {
        val responseMap = new mutable.HashMap[TopicPartition, Short]
        //If the controller epoch in the request is stale, directly return the Errors.STALE_CONTROLLER_EPOCH.code error code
        if (leaderAndISRRequest.controllerEpoch < controllerEpoch) {
            stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d since " +
                                    "its controller epoch %d is old. Latest known controller epoch is %d").format(localBrokerId, leaderAndISRRequest.controllerId,
                                                                                                                  correlationId, leaderAndISRRequest.controllerEpoch, controllerEpoch))
            BecomeLeaderOrFollowerResult(responseMap, Errors.STALE_CONTROLLER_EPOCH.code)
        } else {
            val controllerId = leaderAndISRRequest.controllerId
            controllerEpoch = leaderAndISRRequest.controllerEpoch

            // First check partition's leader epoch
            //Verify each partition's state; there are three cases:
            //1. The request's leader epoch is higher than the local one and this broker is in the assigned replica list: record the partition state for further processing
            //2. The request's leader epoch is higher but this broker is not in the assigned replica list: return Errors.UNKNOWN_TOPIC_OR_PARTITION.code
            //3. The request's leader epoch is not higher than the local one: return Errors.STALE_CONTROLLER_EPOCH.code
            val partitionState = new mutable.HashMap[Partition, PartitionState]()
            leaderAndISRRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
                val partition = getOrCreatePartition(topicPartition)
                val partitionLeaderEpoch = partition.getLeaderEpoch
                // If the leader epoch is valid record the epoch of the controller that made the leadership decision.
                // This is useful while updating the isr to maintain the decision maker controller's epoch in the zookeeper path
                if (partitionLeaderEpoch < stateInfo.leaderEpoch) {
                    if (stateInfo.replicas.contains(localBrokerId))
                        partitionState.put(partition, stateInfo)
                    else {
                        stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d " +
                                                "epoch %d for partition [%s,%d] as itself is not in assigned replica list %s")
                                               .format(localBrokerId, controllerId, correlationId, leaderAndISRRequest.controllerEpoch,
                                                       topicPartition.topic, topicPartition.partition, stateInfo.replicas.asScala.mkString(",")))
                        responseMap.put(topicPartition, Errors.UNKNOWN_TOPIC_OR_PARTITION.code)
                    }
                } else {
                    // Otherwise record the error code in response
                    stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d " +
                                            "epoch %d for partition [%s,%d] since its associated leader epoch %d is not higher than the current leader epoch %d")
                                           .format(localBrokerId, controllerId, correlationId, leaderAndISRRequest.controllerEpoch,
                                                   topicPartition.topic, topicPartition.partition, stateInfo.leaderEpoch, partitionLeaderEpoch))
                    responseMap.put(topicPartition, Errors.STALE_CONTROLLER_EPOCH.code)
                }
            }
            //Split into leader and follower replicas, and construct partitionsBecomeLeader and partitionsBecomeFollower for the callback (coordinator processing)
            val partitionsTobeLeader = partitionState.filter { case (_, stateInfo) =>
                stateInfo.leader == localBrokerId
            }
            val partitionsToBeFollower = partitionState -- partitionsTobeLeader.keys

            //Main calls: construct the new leader and follower partition sets
            val partitionsBecomeLeader = if (partitionsTobeLeader.nonEmpty)
                makeLeaders(controllerId, controllerEpoch, partitionsTobeLeader, correlationId, responseMap)
            else
                Set.empty[Partition]
            val partitionsBecomeFollower = if (partitionsToBeFollower.nonEmpty)
                makeFollowers(controllerId, controllerEpoch, partitionsToBeFollower, correlationId, responseMap, metadataCache)
            else
                Set.empty[Partition]

            // we initialize highwatermark thread after the first leaderisrrequest. This ensures that all the partitions
            // have been completely populated before starting the checkpointing there by avoiding weird race conditions
            //After the first such request is received, start the scheduler that periodically checkpoints the HW
            if (!hwThreadInitialized) {
                startHighWaterMarksCheckPointThread()
                hwThreadInitialized = true
            }
            //Since partition metadata was updated above, stop fetcher threads that are no longer needed
            replicaFetcherManager.shutdownIdleFetcherThreads()
            //Callback
            onLeadershipChange(partitionsBecomeLeader, partitionsBecomeFollower)
            BecomeLeaderOrFollowerResult(responseMap, Errors.NONE.code)
        }
    }
}
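
Regarding step 3 above, startHighWaterMarksCheckPointThread (not shown in this article) simply registers a periodic task that writes each partition's HW to the replication-offset-checkpoint file in every log directory. A paraphrased sketch follows; the field and config names are taken from this version of the source but may differ slightly.

// Paraphrased sketch of ReplicaManager.startHighWaterMarksCheckPointThread (0.10.x);
// field and config names are close to, but not guaranteed identical to, the real source.
def startHighWaterMarksCheckPointThread() = {
  // Only the first LeaderAndIsr request starts the thread (compareAndSet guards against races)
  if (highWatermarkCheckPointThreadStarted.compareAndSet(false, true))
    scheduler.schedule("highwatermark-checkpoint", checkpointHighWatermarks,
      period = config.replicaHighWatermarkCheckpointIntervalMs, unit = TimeUnit.MILLISECONDS)
}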

makeLeaders

Processing of new leader partitions:

  1. Stop the fetcher threads for these partitions
  2. Update the partition information so that these partitions become leaders
  3. Construct and return the new set of leader partitions
private def makeLeaders(controllerId: Int,
                          epoch: Int,
                          partitionState: Map[Partition, PartitionState],
                          correlationId: Int,
                          responseMap: mutable.Map[TopicPartition, Short]): Set[Partition] = {
    //Populate the responseMap returned by becomeLeaderOrFollower
    for (partition <- partitionState.keys)
      responseMap.put(partition.topicPartition, Errors.NONE.code)

    val partitionsToMakeLeaders: mutable.Set[Partition] = mutable.Set()

    try {
      // First stop fetchers for all the partitions
      //Stop the fetcher threads for these partitions
      replicaFetcherManager.removeFetcherForPartitions(partitionState.keySet.map(_.topicPartition))
      // Update the partition information to be the leader
      //Construct a new leader partition set
      partitionState.foreach { case (partition, partitionStateInfo) =>
        if (partition.makeLeader(controllerId, partitionStateInfo, correlationId))
          partitionsToMakeLeaders += partition
        else
          stateChangeLogger.info(("Broker %d skipped the become-leader state change after marking its partition as leader with correlation id %d from " +
            "controller %d epoch %d for partition %s since it is already the leader for the partition.")
            .format(localBrokerId, correlationId, controllerId, epoch, partition.topicPartition))
      }
    } catch {
      case e: Throwable =>
        partitionState.keys.foreach { partition =>
          val errorMsg = ("Error on broker %d while processing LeaderAndIsr request correlationId %d received from controller %d" +
            " epoch %d for partition %s").format(localBrokerId, correlationId, controllerId, epoch, partition.topicPartition)
          stateChangeLogger.error(errorMsg, e)
        }
        // Re-throw the exception for it to be caught in KafkaApis
        throw e
    }

    partitionsToMakeLeaders
  }

partition.makeLeader(controllerId, partitionStateInfo, correlationId) updates the partition metadata and the HW. It calls the maybeIncrementLeaderHW function, which tries to advance the HW: the new HW is the minimum log end offset of the replicas that are either in the ISR or not lagging far behind the leader, and the HW only moves if that value is larger than the previous HW. This slows the growth of the HW so that replicas that are close behind can still catch up and rejoin the ISR (a simplified sketch of this rule follows the makeLeader code below).

def makeLeader(controllerId: Int, partitionStateInfo: PartitionState, correlationId: Int): Boolean = {
    val (leaderHWIncremented, isNewLeader) = inWriteLock(leaderIsrUpdateLock) {
      val allReplicas = partitionStateInfo.replicas.asScala.map(_.toInt)
      // record the epoch of the controller that made the leadership decision. This is useful while updating the isr
      // to maintain the decision maker controller's epoch in the zookeeper path
      controllerEpoch = partitionStateInfo.controllerEpoch
      // add replicas that are new
      //Create local Replica objects for all assigned replicas and construct the new ISR
      allReplicas.foreach(replica => getOrCreateReplica(replica))
      val newInSyncReplicas = partitionStateInfo.isr.asScala.map(r => getOrCreateReplica(r)).toSet
      // remove assigned replicas that have been removed by the controller
      //Remove assigned replicas that have been removed from the replica list by the controller
      (assignedReplicas.map(_.brokerId) -- allReplicas).foreach(removeReplica)
      inSyncReplicas = newInSyncReplicas
      leaderEpoch = partitionStateInfo.leaderEpoch
      zkVersion = partitionStateInfo.zkVersion
      //Whether this broker is becoming the leader of this partition for the first time
      val isNewLeader =
        if (leaderReplicaIdOpt.isDefined && leaderReplicaIdOpt.get == localBrokerId) {
          false
        } else {
          leaderReplicaIdOpt = Some(localBrokerId)
          true
        }
      val leaderReplica = getReplica().get
      val curLeaderLogEndOffset = leaderReplica.logEndOffset.messageOffset
      val curTimeMs = time.milliseconds
      // initialize lastCaughtUpTime of replicas as well as their lastFetchTimeMs and lastFetchLeaderLogEndOffset.
      //New leader initialization
      (assignedReplicas - leaderReplica).foreach { replica =>
        val lastCaughtUpTimeMs = if (inSyncReplicas.contains(replica)) curTimeMs else 0L
        replica.resetLastCaughtUpTime(curLeaderLogEndOffset, curTimeMs, lastCaughtUpTimeMs)
      }
      // we may need to increment high watermark since ISR could be down to 1
      if (isNewLeader) {
        // construct the high watermark metadata for the new leader replica
        leaderReplica.convertHWToLocalOffsetMetadata()
        // reset log end offset for remote replicas
        assignedReplicas.filter(_.brokerId != localBrokerId).foreach(_.updateLogReadResult(LogReadResult.UnknownLogReadResult))
      }
      //Try to advance the HW: it only moves to the minimum LEO of replicas that are in the ISR or not far behind the leader, and only if that is larger than the previous HW, so replicas that are close behind can rejoin the ISR
      (maybeIncrementLeaderHW(leaderReplica), isNewLeader)
    }
    // some delayed operations may be unblocked after HW changed
    //After the HW is updated, some delayed requests may become completable
    if (leaderHWIncremented)
      tryCompleteDelayedRequests()
    isNewLeader
  }
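
The maybeIncrementLeaderHW function itself is not reproduced in this article. The following self-contained sketch models the rule described above using plain Longs instead of Kafka's Replica and LogOffsetMetadata types; the names ReplicaState and replicaLagTimeMaxMs here are illustrative assumptions, not the actual Kafka API.

// Simplified, self-contained model of the HW-advancement rule in maybeIncrementLeaderHW;
// not the actual Kafka code, just the idea it implements.
object HighWatermarkSketch {

  // Minimal stand-in for one replica of the partition (hypothetical type for illustration).
  case class ReplicaState(logEndOffset: Long, lastCaughtUpTimeMs: Long, inIsr: Boolean)

  // New HW = minimum LEO over replicas that are in the ISR or have recently caught up.
  // The HW only moves if that minimum is larger than the current HW, so replicas that
  // are close behind the leader hold the HW back until they rejoin the ISR.
  def maybeIncrementLeaderHW(currentHW: Long,
                             replicas: Seq[ReplicaState], // must include the leader itself
                             nowMs: Long,
                             replicaLagTimeMaxMs: Long): Option[Long] = {
    val eligibleLeos = replicas
      .filter(r => r.inIsr || nowMs - r.lastCaughtUpTimeMs <= replicaLagTimeMaxMs)
      .map(_.logEndOffset)
    val newHW = eligibleLeos.min // never empty: the leader is always in the ISR
    if (newHW > currentHW) Some(newHW) else None
  }

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    val replicas = Seq(
      ReplicaState(logEndOffset = 120, lastCaughtUpTimeMs = now, inIsr = true),        // leader
      ReplicaState(logEndOffset = 118, lastCaughtUpTimeMs = now - 2000, inIsr = true), // in-sync follower
      ReplicaState(logEndOffset = 100, lastCaughtUpTimeMs = now - 5000, inIsr = false) // lagging but recently caught up
    )
    // Prints Some(100): the HW advances only to the slowest eligible replica, not to 118 or 120.
    println(maybeIncrementLeaderHW(currentHW = 90, replicas = replicas, nowMs = now, replicaLagTimeMaxMs = 10000))
  }
}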

makeFollowers

Processing of new follower partitions:

  1. Remove these partitions from the set of leader partitions
  2. Mark the replicas as followers so that no more data is accepted from producer clients
  3. Stop the fetcher threads for these partitions
  4. Truncate the local logs of these partitions to the HW
  5. Try to complete pending delayed produce and fetch requests
  6. If the broker is not shutting down, add fetchers to start fetching data from the new leaders
private def makeFollowers(controllerId: Int,
                          epoch: Int,
                          partitionState: Map[Partition, PartitionState],
                          correlationId: Int,
                          responseMap: mutable.Map[TopicPartition, Short],
                          metadataCache: MetadataCache) : Set[Partition] = {
    partitionState.keys.foreach { partition =>
        stateChangeLogger.trace(("Broker %d handling LeaderAndIsr request correlationId %d from controller %d epoch %d " +
                                 "starting the become-follower transition for partition %s")
                                .format(localBrokerId, correlationId, controllerId, epoch, partition.topicPartition))
    }

    //Populate the responseMap returned by becomeLeaderOrFollower
    for (partition <- partitionState.keys)
      responseMap.put(partition.topicPartition, Errors.NONE.code)

    val partitionsToMakeFollower: mutable.Set[Partition] = mutable.Set()

    try {

        // TODO: Delete leaders from LeaderAndIsrRequest
        partitionState.foreach{ case (partition, partitionStateInfo) =>
            val newLeaderBrokerId = partitionStateInfo.leader
            metadataCache.getAliveBrokers.find(_.id == newLeaderBrokerId) match {
                // Only change partition state when the leader is available
                case Some(_) =>
                //Only record partitions whose leader actually changed
                if (partition.makeFollower(controllerId, partitionStateInfo, correlationId))
                partitionsToMakeFollower += partition
                else
                stateChangeLogger.info(("Broker %d skipped the become-follower state change after marking its partition as follower with correlation id %d from " +
                                        "controller %d epoch %d for partition %s since the new leader %d is the same as the old leader")
                                       .format(localBrokerId, correlationId, controllerId, partitionStateInfo.controllerEpoch,
                                               partition.topicPartition, newLeaderBrokerId))
                case None =>
                // The leader broker should always be present in the metadata cache.
                // If not, we should record the error message and abort the transition process for this partition
                stateChangeLogger.error(("Broker %d received LeaderAndIsrRequest with correlation id %d from controller" +
                                         " %d epoch %d for partition %s but cannot become follower since the new leader %d is unavailable.")
                                        .format(localBrokerId, correlationId, controllerId, partitionStateInfo.controllerEpoch,
                                                partition.topicPartition, newLeaderBrokerId))
                // Create the local replica even if the leader is unavailable. This is required to ensure that we include
                // the partition's high watermark in the checkpoint file (see KAFKA-1647)
                partition.getOrCreateReplica()
            }
        }
        //Remove the fetcher threads for these partitions
        replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.map(_.topicPartition))
        //Truncate according to the new HW
        logManager.truncateTo(partitionsToMakeFollower.map { partition =>
            (partition.topicPartition, partition.getOrCreateReplica().highWatermark.messageOffset)
        }.toMap)
        //The partition state changed; try to complete pending delayed produce/fetch requests
        partitionsToMakeFollower.foreach { partition =>
            val topicPartitionOperationKey = new TopicPartitionOperationKey(partition.topicPartition)
            tryCompleteDelayedProduce(topicPartitionOperationKey)
            tryCompleteDelayedFetch(topicPartitionOperationKey)
        }

        if (isShuttingDown.get()) {
            partitionsToMakeFollower.foreach { partition =>
                stateChangeLogger.trace(("Broker %d skipped the adding-fetcher step of the become-follower state change with correlation id %d from " +
                                         "controller %d epoch %d for partition %s since it is shutting down").format(localBrokerId, correlationId,
                                                                                                                     controllerId, epoch, partition.topicPartition))
            }
        }
        else {
            // we do not need to check if the leader exists again since this has been done at the beginning of this process
            //Compute the initial fetch offset for each partition and add fetchers pointing at the new leaders
            val partitionsToMakeFollowerWithLeaderAndOffset = partitionsToMakeFollower.map(partition =>
                                                                                           partition.topicPartition -> BrokerAndInitialOffset(
                                                                                               metadataCache.getAliveBrokers.find(_.id == partition.leaderReplicaIdOpt.get).get.getBrokerEndPoint(config.interBrokerListenerName),
                                                                                               partition.getReplica().get.logEndOffset.messageOffset)).toMap
            replicaFetcherManager.addFetcherForPartitions(partitionsToMakeFollowerWithLeaderAndOffset)
        }
    } catch {
        case e: Throwable =>
        val errorMsg = ("Error on broker %d while processing LeaderAndIsr request with correlationId %d received from controller %d " +
                        "epoch %d").format(localBrokerId, correlationId, controllerId, epoch)
        stateChangeLogger.error(errorMsg, e)
        // Re-throw the exception for it to be caught in KafkaApis
        throw e
    }

    partitionsToMakeFollower
}
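
partition.makeFollower(controllerId, partitionStateInfo, correlationId), which is not reproduced above, is the counterpart of makeLeader: it records the controller epoch and the new leader, rebuilds the local replica set, clears the ISR (only the leader maintains an ISR), and returns true only if the leader actually changed, which is why the log message above mentions "the new leader is the same as the old leader". A paraphrased sketch based on the 0.10.x Partition class (not a verbatim copy) looks roughly like this:

// Paraphrased sketch of Partition.makeFollower in this version; see the source for the exact code.
def makeFollower(controllerId: Int, partitionStateInfo: PartitionState, correlationId: Int): Boolean = {
    inWriteLock(leaderIsrUpdateLock) {
      val allReplicas = partitionStateInfo.replicas.asScala.map(_.toInt)
      val newLeaderBrokerId: Int = partitionStateInfo.leader
      // record the epoch of the controller that made this leadership decision
      controllerEpoch = partitionStateInfo.controllerEpoch
      // create local Replica objects for newly assigned replicas
      allReplicas.foreach(r => getOrCreateReplica(r))
      // remove assigned replicas that have been removed by the controller
      (assignedReplicas.map(_.brokerId) -- allReplicas).foreach(removeReplica)
      // a follower does not maintain an ISR, so clear it
      inSyncReplicas = Set.empty[Replica]
      leaderEpoch = partitionStateInfo.leaderEpoch
      zkVersion = partitionStateInfo.zkVersion

      // return true only if the leader actually changed; otherwise the caller skips this partition
      if (leaderReplicaIdOpt.isDefined && leaderReplicaIdOpt.get == newLeaderBrokerId) {
        false
      } else {
        leaderReplicaIdOpt = Some(newLeaderBrokerId)
        true
      }
    }
}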