Kafka growth 5: source code principles of the producer's preliminary message serialization and partition routing

Time: 2021-11-21


In the first four sections of Kafka's growth story, we used the KafkaProducer HelloWorld example to analyze the producer's configuration parsing, component composition, and metadata-pulling principles.

However, we have not yet analyzed the code that actually sends the message. So far, we have reached the location shown in the figure below:

[Figure: where the previous sections left off in the producer send flow]

Next, let's continue. In this section we mainly analyze the source code principles of the preliminary serialization of messages and of partition routing.

How messages are preliminarily serialized by the configured serializer

When producer.send() enters doSend(), what is the overall flow after waitOnMetadata() successfully pulls the metadata?

private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
    TopicPartition tp = null;
    try {
        // first make sure the metadata for the topic is available
        long waitedOnMetadataMs = waitOnMetadata(record.topic(), this.maxBlockTimeMs);
        long remainingWaitMs = Math.max(0, this.maxBlockTimeMs - waitedOnMetadataMs);
        byte[] serializedKey;
        try {
            serializedKey = keySerializer.serialize(record.topic(), record.key());
        } catch (ClassCastException cce) {
            throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() +
                    " to class " + producerConfig.getClass(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG).getName() +
                    " specified in key.serializer");
        }
        byte[] serializedValue;
        try {
            serializedValue = valueSerializer.serialize(record.topic(), record.value());
        } catch (ClassCastException cce) {
            throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
                    " to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
                    " specified in value.serializer");
        }
        int partition = partition(record, serializedKey, serializedValue, metadata.fetch());
        int serializedSize = Records.LOG_OVERHEAD + Record.recordSize(serializedKey, serializedValue);
        ensureValidRecordSize(serializedSize);
        tp = new TopicPartition(record.topic(), partition);
        long timestamp = record.timestamp() == null ? time.milliseconds() : record.timestamp();
        log.trace("Sending record {} with callback {} to topic {} partition {}", record, callback, record.topic(), partition);
        // producer callback will make sure to call both 'callback' and interceptor callback
        Callback interceptCallback = this.interceptors == null ? callback : new InterceptorCallback<>(callback, this.interceptors, tp);
        RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs);
        if (result.batchIsFull || result.newBatchCreated) {
            log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
            this.sender.wakeup();
        }
        return result.future;
        // handling exceptions and record the errors;
        // for API exceptions return them in the future,
        // for other exceptions throw directly
    } catch (ApiException e) {
        log.debug("Exception occurred during message send:", e);
        if (callback != null)
            callback.onCompletion(null, e);
        this.errors.record();
        if (this.interceptors != null)
            this.interceptors.onSendError(record, tp, e);
        return new FutureFailure(e);
    } catch (Exception e) {
        throw e;
    }
    //Omit various other exception catches
}

The main flow is:

1) waitOnMetadata() waits for the metadata pull

2) keySerializer.serialize() and valueSerializer.serialize() serialize the record's key and value into byte[] arrays

3) partition() routes the message, selecting one partition under the topic according to a routing policy

4) accumulator.append() puts the message into the memory buffer

5) sender.wakeup() wakes the Sender thread blocked on Selector.select() so it can start processing the data in the memory buffer

The whole context is shown in the figure below:

[Figure: the five main steps of doSend()]

The second step uses the configured serializer to convert the message into byte[] arrays. Let's look at that logic first.

The first question is: where does the message serializer come from? It is set through the configuration parameters. Remember the KafkaProducerHelloWorld code?

  // KafkaProducerHelloWorld.java
  public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "mengfanmao.org:9092");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        ProducerRecord<String, String> record = new ProducerRecord<>("test-topic", "test-key", "test-value");
        producer.send(record).get();
        Thread.sleep(5 * 1000);
        producer.close();
    }

In the earlier KafkaProducerHelloWorld.java we did not set the serializer parameters at first. As a result, the program failed with the following stack trace:

Exception in thread "main" org.apache.kafka.common.config.ConfigException: Missing required configuration "key.serializer" which has no default value.
    at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:421)
    at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:55)
    at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:62)
    at org.apache.kafka.clients.producer.ProducerConfig.<init>(ProducerConfig.java:336)
    at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:188)
    at org.mfm.learn.kafka.KafkaProducerHelloWorld.main(KafkaProducerHelloWorld.java:20)

Does the stack above look familiar? These are exactly the configuration-parsing source classes we studied before: ProducerConfig, AbstractConfig, and ConfigDef.

If you open ConfigDef, you will find that when it parses the configuration and the serializer settings are missing, it throws an exception directly in the new KafkaProducer() step, before any message can be sent.

Here you can already feel one of the benefits of reading the source code.

Then you can configure the serialization parameters as follows:

  // KafkaProducerHelloWorld.java
  public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "mengfanmao.org:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        ProducerRecord<String, String> record = new ProducerRecord<>("test-topic", "test-key", "test-value");
        producer.send(record).get();
        Thread.sleep(5 * 1000);
        producer.close();
    }

The message was sent successfully! The serializer we added is the StringSerializer provided by the client jar. Now that we have a serializer, let's see how it serializes the key and value.

The core of the second step boils down to the following code:

// KafkaProducer.java#doSend
ProducerRecord<String, String> record = new ProducerRecord<>("test-topic", "test-key", "test-value");
keySerializer.serialize(record.topic(), record.key());
valueSerializer.serialize(record.topic(), record.value());

//StringSerializer.java
public byte[] serialize(String topic, String data) {
    try {
        if (data == null)
            return null;
        else
            return data.getBytes(encoding);
    } catch (UnsupportedEncodingException e) {
        throw new SerializationException("Error when serializing string to byte[] due to unsupported encoding " + encoding);
    }
}

You can see that StringSerializer's serialize() method is very simple: it just calls String's native getBytes() method. (PS: the first parameter, topic, is not used…)
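Because the serializer is pluggable, you can also supply your own. Below is a minimal sketch, assuming only the same Serializer interface that StringSerializer itself implements (configure/serialize/close); the class UpperCaseStringSerializer is made up for illustration:

// UpperCaseStringSerializer.java -- a hypothetical custom serializer sketch
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.common.serialization.Serializer;

public class UpperCaseStringSerializer implements Serializer<String> {
    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // no configuration needed for this sketch
    }

    @Override
    public byte[] serialize(String topic, String data) {
        // mirror StringSerializer: null stays null, otherwise encode the
        // (upper-cased) string; the topic parameter is unused here too
        return data == null ? null : data.toUpperCase().getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public void close() {
        // nothing to release
    }
}

You would then register it through props.put("value.serializer", ...) exactly as we registered StringSerializer above.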

Is serialization really finished here? Definitely not. The byte[] data must eventually be sent over the network, and this is only a preliminary serialization. The final wire format of the message will be studied in depth when we look at how Kafka uses native Java NIO to solve packet sticking and splitting.

For now, we can summarize with the following figure:

[Figure: key and value preliminarily serialized to byte[] by StringSerializer]

Source code principles of routing a message to a topic partition

When a message is sent, the producer pulls the metadata and preliminarily serializes the message into byte[] arrays. It then routes the message using the metadata, selecting one partition of the topic to send to. Partition routing relies on the Cluster metadata held inside Metadata, so let's first review its structure.

Review of the in-memory structure of the Cluster metadata class

List<Node> nodes: the Kafka broker nodes, mainly each broker's IP and port.

Map<Integer, Node> nodesById: key is the broker id, value is that broker's Node information.

Map<String, List<PartitionInfo>> partitionsByTopic: which partitions each topic has; key is the topic name, value is the partition info list.

Map<String, List<PartitionInfo>> availablePartitionsByTopic: which partitions of each topic are currently available; key is the topic name, value is the partition info list.

Map<Integer, List<PartitionInfo>> partitionsByNode: which partitions sit on each broker; key is the broker id, value is the partition info list.

Set<String> unauthorizedTopics: the topics this client is not authorized to access. Permission control is rarely used in message queues, so this can almost be ignored.

You can set a breakpoint and inspect the data, as shown below:

[Figure: Cluster metadata inspected at a breakpoint]

Notice that the cluster metadata is stored in different data structures depending on the need, use, and scenario. The Kafka producer deliberately designs these structures, and we can often borrow the same idea in our own code.
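To make these structures concrete, here is a small sketch that reads them through Cluster's accessors, assuming the getters mirror the maps listed above (the topic name and broker id are just examples):

// Reading the Cluster metadata through its accessors (sketch)
Cluster cluster = metadata.fetch();
List<Node> nodes = cluster.nodes();                          // all broker nodes
Node broker0 = cluster.nodeById(0);                          // broker with id 0, assumed to exist
List<PartitionInfo> partitions = cluster.partitionsForTopic("test-topic");
List<PartitionInfo> available = cluster.availablePartitionsForTopic("test-topic");
List<PartitionInfo> onBroker0 = cluster.partitionsForNode(0);
Set<String> unauthorized = cluster.unauthorizedTopics();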

Having reviewed the metadata, the client can of course route based on it. So how exactly is the routing done? The code is as follows:

// KakfaProducer.java
private final Partitioner partitioner;
//#doSend()
int partition = partition(record, serializedKey, serializedValue, metadata.fetch());
//#partition()
private int partition(ProducerRecord<K, V> record, byte[] serializedKey , byte[] serializedValue, Cluster cluster) {
    Integer partition = record.partition();
    if (partition != null) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(record.topic());
        int lastPartition = partitions.size() - 1;
        // they have given us a partition, use it
        if (partition < 0 || partition > lastPartition) {
            throw new IllegalArgumentException(String.format("Invalid partition given with record: %d is not in the range [0...%d].", partition, lastPartition));
        }
        return partition;
    }
    return this.partitioner.partition(record.topic(), record.key(), serializedKey, record.value(), serializedValue,
                                      cluster);
}

The logic of this method is very simple and hinges on whether the record specifies a partition:

1) If the record specifies a partition, that partition number is used directly after being validated against the cluster metadata.
2) If the record does not specify a partition, the Partitioner component's partition() method determines the partition number.

As shown below:

[Figure: partition() uses the record's partition if present, otherwise delegates to the Partitioner]

In the previous section we noted that the ProducerRecord's timestamp and partition are optional and null by default. That is, by default execution takes the Partitioner component's branch.
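For reference, here is a sketch of pinning the partition explicitly with the ProducerRecord constructor that takes a partition number (partition 0 is just an example):

// Explicitly routing to partition 0 of test-topic; a specified partition
// takes priority over the key in partition(), as described above
ProducerRecord<String, String> record =
        new ProducerRecord<>("test-topic", 0, "test-key", "test-value");
producer.send(record).get();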

But that raises a question: when was the Partitioner initialized?

Since partitioner is a member variable of KafkaProducer, you can search for where it is assigned. You will find:

private KafkaProducer(ProducerConfig config, Serializer<K> keySerializer, Serializer<V> valueSerializer) {
    //Omit other code
    this.partitioner = config.getConfiguredInstance(ProducerConfig.PARTITIONER_CLASS_CONFIG, Partitioner.class);
    //Omit other code
}

So it is initialized in the constructor, obtained through configuration resolution like everything else, and it has a default value: DefaultPartitioner.
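That also means you can swap in your own routing policy through the same configuration mechanism. A sketch, where com.example.MyPartitioner is a hypothetical class implementing the Partitioner interface:

// Overriding the default DefaultPartitioner via configuration (sketch)
props.put("partitioner.class", "com.example.MyPartitioner");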

Knowing this, let's see how the default partitioner routes a message:

//DefaultPartitioner.java
private final AtomicInteger counter = new AtomicInteger(new Random().nextInt());

public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();
    if (keyBytes == null) {
        int nextValue = counter.getAndIncrement();
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (availablePartitions.size() > 0) {
            int part = DefaultPartitioner.toPositive(nextValue) % availablePartitions.size();
            return availablePartitions.get(part).partition();
        } else {
            // no partitions are available, give a non-available partition
            return DefaultPartitioner.toPositive(nextValue) % numPartitions;
        }
    } else {
        // hash the keyBytes to choose a partition
        return DefaultPartitioner.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}

This method mainly does the following:

1) Get all partitions of the topic and the partition count from the map in the cluster metadata

2) If the message has no key, start from a random number and increment it by 1 on each send via an AtomicInteger, then take it modulo the number of available partitions (or all partitions if none are available) to get the partition number

3) If the message has a key, run the murmur2 hash over the key's byte array to get an int, then take it modulo the partition count to get the partition number (see the sketch after this list)
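A tiny sketch of the keyed branch, assuming nothing beyond the code above (the & 0x7fffffff mask is what DefaultPartitioner.toPositive() does, and 3 is just an example partition count):

// Keyed routing: murmur2-hash the key bytes, clear the sign bit,
// then take the result modulo the partition count
byte[] keyBytes = "test-key".getBytes(StandardCharsets.UTF_8);
int numPartitions = 3; // example value; really partitions.size()
int partition = (Utils.murmur2(keyBytes) & 0x7fffffff) % numPartitions;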

The whole process is shown in the figure below:

[Figure: DefaultPartitioner — round-robin without a key, murmur2 hash of the key otherwise]

From the routing strategy above you can see that a Kafka message can be sent with only the topic specified; the key and partition are optional. However, this may cause messages to arrive out of order.

As for guaranteeing the order of messages, besides specifying the partition or key, other configuration is required. For example, max.in.flight.requests.per.connection defaults to 5 and needs to be set to 1; otherwise retries can also reorder messages. We will analyze this later.
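For reference, that setting is also just a producer configuration parameter; a sketch:

// Limit in-flight requests per connection to 1 so retries cannot reorder messages
props.put("max.in.flight.requests.per.connection", "1");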

Summary

Today we mainly explored the preliminary serialization of messages and the message routing strategy. A brief summary:

1) The preliminary serialization of Kafka messages must be specified through configuration parameters, commonly StringSerializer; if it is missing, the producer cannot even be created and no message can be sent

2) When sending a Kafka message, the topic must be specified, while the key and the partition under the topic are optional.

3) The default partition routing strategy covers three cases: a partition is specified, only a key is specified, or neither is specified:

a. If the partition is specified (with or without a key), it has higher priority than the key, so the message is routed directly to the specified partition number.

b. If only the key is specified, the murmur2 hash of the key's byte array gives an int, which is taken modulo the partition count to get the partition number.

c. If neither is specified, a counter starting from a random number is incremented by 1 on each send via an AtomicInteger and taken modulo the number of available partitions (or all partitions) to get the partition number.

The material in this section is relatively easy; I hope you have it well in hand. With this analysis of the producer, we are slowly unveiling its mystery. In the next two sections we will analyze the memory buffer used for sending messages: how memory regions are allocated, and how the queue mechanism plus the batch mechanism sends messages in batches. After that, we will analyze how Kafka solves packet splitting and sticking with native Java NIO. At that point the basic source code principles of the producer will be mostly covered.

See you next time!
