4. Deep diving into the Kafka producer

Time: 2021-6-23

In the previous lesson we learned that when a business thread sends a message with KafkaProducer.send(), the message is first written into the RecordAccumulator for buffering. Once the messages cached in the RecordAccumulator reach a certain threshold, the IO thread assembles them into batched requests and sends them to the Kafka cluster. In this lesson we focus on the structure of this buffer, the RecordAccumulator.
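For reference, the KafkaProducer.send() call discussed here is the standard producer API. A minimal usage sketch (topic name and broker address are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // send() only appends the record to the RecordAccumulator; the Sender IO thread
    // batches it and transmits it to the brokers later
    producer.send(new ProducerRecord<>("test-topic", "key", "value"));
}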

First of all, as the figure above shows, the RecordAccumulator is written by business threads and read by the Sender thread. This is a classic producer-consumer pattern, so the RecordAccumulator must be thread safe.
The RecordAccumulator maintains a ConcurrentMap<TopicPartition, Deque<ProducerBatch>> collection, where the key is a TopicPartition identifying the target partition and the value is an ArrayDeque<ProducerBatch> queue that buffers the messages to be sent to that partition. ArrayDeque itself is not a thread-safe collection, which is why we will see locking operations around it later.

Each ProducerBatch maintains a MemoryRecordsBuilder object, and the MemoryRecordsBuilder is where the messages are actually stored. The relationship among the three core classes, RecordAccumulator, ProducerBatch and MemoryRecordsBuilder, is shown in the following figure.

Client module
Message format
Now that we are going to analyze the Kafka producer in depth, we need to understand the format of messages inside Kafka rather than simply treating a message as an opaque object. Kafka currently has three versions of the message format:

V0: before Kafka 0.10
V1: Kafka 0.10 ~ 0.11
V2: Kafka 0.11.0 and later
V0 version
When the V0 message format is used, messages are simply piled up in the RecordAccumulator without any aggregation. Each message carries its own metadata, as shown in the figure below.

The only field that needs explanation is attributes: the lower three bits identify the compression algorithm in use, and the upper five bits are unused.

V1 version
The V1 format is basically the same as V0, except that a timestamp field is added. The specific structure is as follows:

The lower three bits of attributes still identify the compression algorithm, and the fourth bit identifies the type of the timestamp.

The timestamp was introduced in V1 mainly to solve the following problems:

More accurate log retention. Earlier we briefly described the retention policy based on how long a message has existed. With V0, the Kafka broker decides whether to delete a segment file purely by the file's last-modified time on disk. The major drawback is that if a replica is migrated or a new replica is added, the segment files in the new replica are all newly created, so the old messages they contain will not be deleted in time.
More accurate log rolling. We mentioned earlier that segment files are rolled both by time and by size. If V0 used the segment creation time for time-based rolling, the same problem would exist, leading either to a single oversized file or to very small segment files that get rolled even though no messages were written.
Compression in the V1 version
For common compression algorithms, the more data is compressed at once, the better the compression ratio, but a single message is usually not very long. To resolve this contradiction, multiple messages should be compressed together, and that is exactly what Kafka does: in V1, Kafka uses a wrapper message to improve compression efficiency. Put simply, a wrapper message is itself a message whose value is a collection of ordinary messages, which are called inner messages. As shown in the figure below:

To further reduce overhead, Kafka records the complete offset only in the wrapper message; the offset in each inner message is just an offset relative to the wrapper message's offset, as shown in the following figure:

When a wrapper message arrives at the Kafka broker, the broker does not need to decompress it and can store it directly; when a consumer pulls the message, it is also delivered as-is, and the actual decompression happens on the consumer. This saves the broker the cost of decompressing and recompressing.

Timestamps in the V1 version
The timestamp type of a V1 message is identified by the fourth bit of attributes. There are two types: CreateTime and LogAppendTime.

CreateTime: the timestamp field records when the producer created the message.
LogAppendTime: the timestamp field records when the broker wrote the message to the segment file.
When the producer generates a message, the timestamp in the message is the CreateTime, and the timestamp in the wrapper message is the maximum of all inner message timestamps.

When a message is delivered to the broker, the broker may adjust the timestamp of the wrapper message according to its own log.message.timestamp.type configuration (or the topic-level message.timestamp.type configuration), whose default value is CreateTime. If the broker uses CreateTime, we can also configure log.message.timestamp.difference.max.ms: when the difference between the timestamp in the message and the broker's local time exceeds this value, the broker refuses to write the message.

If the broker or topic uses LogAppendTime, the broker's local time is written directly into the timestamp field of the message and the timestamp-type bit in attributes is set to 1. If the message is compressed, only the timestamp and timestamp type of the wrapper message are modified, not those of the inner messages; this avoids decompressing and recompressing. In other words, the broker only cares about the wrapper message's timestamp and ignores the timestamps of the inner messages.

When a message is pulled by a consumer, the consumer processes it according to the timestamp type alone. If the wrapper message uses CreateTime, the consumer takes each inner message's own timestamp as its CreateTime; if the wrapper message uses LogAppendTime, the consumer uses the wrapper message's timestamp as the LogAppendTime of all inner messages and ignores their own timestamp values.

Finally, the timestamp in a message is also an important basis for the timestamp index; we will cover it in detail when we introduce the broker.

V2 version
Starting with Kafka 0.11, the V2 message format is used; it remains compatible with V0 and V1 messages, although some newer Kafka features cannot be used with the old formats.

The V2 message format borrows some ideas from Protocol Buffers and introduces varints (variable-length integers) and ZigZag encoding. Varints serialize an integer with one or more bytes: the smaller the value, the fewer bytes it occupies, which reduces the size of a message. ZigZag encoding solves the problem that varints encode negative numbers inefficiently: it maps signed integers to unsigned integers so that negative numbers with small absolute values are also encoded compactly, as shown in the following figure:
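As a concrete illustration of the idea (not Kafka's own code, but the same mapping), the sketch below ZigZag-encodes a signed int and then writes it as a varint:

// ZigZag maps ... -2, -1, 0, 1, 2 ... to ... 3, 1, 0, 2, 4 ..., so small absolute values stay small
static int zigZag(int n) {
    return (n << 1) ^ (n >> 31);
}

// Varint: emit 7 bits per byte, setting the high bit while more bytes follow
static void writeVarint(int value, java.io.ByteArrayOutputStream out) {
    int v = zigZag(value);
    while ((v & 0xFFFFFF80) != 0) {  // more than 7 significant bits remain
        out.write((v & 0x7F) | 0x80); // low 7 bits plus continuation bit
        v >>>= 7;
    }
    out.write(v); // last byte, continuation bit clear
}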

With the theoretical basis of the V2 format out of the way, let's look at the format of a V2 message (also called a record):

Note that all the length fields are varints (or varlongs), i.e. variable-length fields, and both the timestamp and the offset are stored as deltas. In addition, none of the bits in the record-level attributes field are used any more, and record headers are introduced.

Besides the basic record format, V2 also defines a record batch structure. Compared with V1, the record is the inner structure and the record batch is the outer structure, as shown in the figure below.

A record batch contains quite a few fields:

baseOffset: the starting offset of the current RecordBatch. Adding a record's offset delta to baseOffset gives the record's real offset. While the RecordBatch is still on the producer side, the offset is a value assigned by the producer, not by the partition; don't confuse the two.
batchLength: the total length of the RecordBatch.
partitionLeaderEpoch: marks the epoch of the leader replica of the target partition. We will see how this value is used when we look at the concrete implementation later.
magic: the magic number, 2 in V2.
CRC: covers all the data from the attributes field to the end of the RecordBatch. partitionLeaderEpoch is excluded from the CRC because the broker assigns it every time it receives a RecordBatch; including it would force the CRC to be recalculated. This will be discussed later.
attributes: expanded from 8 bits in V1 to 16 bits. Bits 0-2 indicate the compression type, bit 3 the timestamp type, and bit 4 whether this is a transactional record. "Transactions" are a newer Kafka feature; after a transaction is started, a transactional consumer can only see the records once the transaction has been committed. Bit 5 indicates whether this is a control record; control records always appear alone inside a control record batch and are used to mark things like "the transaction has been committed" or "the transaction has been aborted". They are processed only on the broker and are never delivered to consumers or producers, i.e. they are transparent to clients.
lastOffsetDelta: the offset delta of the last record in the RecordBatch, used by the broker to verify that the records in the RecordBatch are assembled correctly.
firstTimestamp: the timestamp of the first record in the RecordBatch.
maxTimestamp: the largest timestamp in the RecordBatch, normally the timestamp of the last record, also used by the broker to verify that the records are assembled correctly.
producerId: the producer ID, used to support idempotence (exactly-once semantics); see KIP-98 (Exactly Once Delivery and Transactional Messaging).
producerEpoch: the producer epoch, used to support idempotence (exactly-once semantics).
baseSequence: the base sequence number, used to support idempotence (exactly-once semantics) by detecting duplicate records.
records count: the number of records in the batch.
From the V2 message format we can see that Kafka messages not only enable new features such as transactions and idempotence, but are also well optimized for space, which is a significant overall improvement.

MemoryRecordsBuilder
After understanding the evolution of the Kafka message format, let’s go back to the Kafka producer code.

Each MemoryRecordsBuilder relies on a ByteBuffer underneath to store messages (we will introduce how the Kafka producer manages ByteBuffers later). Inside MemoryRecordsBuilder, the ByteBuffer is wrapped in a ByteBufferOutputStream, which implements OutputStream so that we can write data as a stream; ByteBufferOutputStream also automatically expands the underlying ByteBuffer when needed.

Another field to pay attention to is compressionType, which specifies which compression algorithm the MemoryRecordsBuilder uses to compress the data in the ByteBuffer. Kafka currently supports four compression algorithms: gzip, snappy, lz4 and zstd. Note that zstd is only supported by the V2 message format.
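On the producer side, the algorithm comes from the compression.type configuration; an illustrative setting (the default is "none"):

props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // or "gzip", "snappy", "zstd"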

public MemoryRecordsBuilder(ByteBuffer buffer, ...) { // other parameters omitted
    // Wrap the ByteBuffer associated with this MemoryRecordsBuilder in a ByteBufferOutputStream
    this(new ByteBufferOutputStream(buffer), ...);
}

public MemoryRecordsBuilder(ByteBufferOutputStream bufferStream, ...) { // other parameters omitted
    // ... initialization of other fields omitted
    this.bufferStream = bufferStream;
    // Wrap bufferStream with a compression stream, then with a DataOutputStream
    this.appendStream = new DataOutputStream(compressionType.wrapForOutput(this.bufferStream, magic));
}
In this way, we end up with the appendStream layering shown in the following figure:

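As a rough analogy using plain JDK streams (illustrative only, not the Kafka classes themselves), bytes written at the top pass through an optional compression stream before landing in the underlying buffer:

// inside a method declared to throw IOException
ByteArrayOutputStream buffer = new ByteArrayOutputStream();        // stands in for ByteBufferOutputStream
OutputStream compressed = new GZIPOutputStream(buffer);            // stands in for compressionType.wrapForOutput(...)
DataOutputStream appendStream = new DataOutputStream(compressed);  // what the builder actually writes to
appendStream.writeLong(42L);                                       // record fields pass through all three layers
appendStream.close();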
After understanding the underlying storage of MemoryRecordsBuilder, let's look at its core methods. The first is the appendWithOffset() method. The logic is not complicated; what needs to be clear is that a ProducerBatch corresponds to a V2 RecordBatch, and each piece of data we write corresponds to a V2 record.

public Long append(long timestamp, ByteBuffer key, ByteBuffer value, Header[] headers) {
    return appendWithOffset(nextSequentialOffset(), timestamp, key, value, headers);
}

private long nextSequentialOffset() {
    // baseOffset here is the baseOffset of the RecordBatch; lastOffset records the offset of the record
    // written most recently. lastOffset - baseOffset is later used to compute the offset delta.
    // lastOffset is incremented every time a new record is written.
    return lastOffset == null ? baseOffset : lastOffset + 1;
}

private Long appendWithOffset(long offset, boolean isControlRecord, long timestamp,
                              ByteBuffer key, ByteBuffer value, Header[] headers) {
    if (isControlRecord != this.isControlBatch) // check that the control flag is consistent
        throw new IllegalArgumentException("...");

    if (lastOffset != null && offset <= this.lastOffset) // offsets must be increasing
        throw new IllegalArgumentException("...");

    if (timestamp < 0 && timestamp != RecordBatch.NO_TIMESTAMP) // check the timestamp
        throw new IllegalArgumentException("...");

    // Check: only V2 messages can have headers
    if (magic < RecordBatch.MAGIC_VALUE_V2 && headers != null && headers.length > 0)
        throw new IllegalArgumentException("...");

    if (this.firstTimestamp == null) // update firstTimestamp
        this.firstTimestamp = timestamp;

    if (magic > RecordBatch.MAGIC_VALUE_V1) { // V2 write path
        appendDefaultRecord(offset, timestamp, key, value, headers);
        return null;
    } else { // V0 and V1 write path
        return appendLegacyRecord(offset, timestamp, key, value, magic);
    }
}
The appendDefaultRecord() method computes the offset delta and timestamp delta of the record, writes the record, and then updates the metadata of the RecordBatch:

private void appendDefaultRecord(long offset, long timestamp, ByteBuffer key, ByteBuffer value,
                                 Header[] headers) throws IOException {
    ensureOpenForRecordAppend(); // check the state of appendStream
    // Compute the offset delta
    int offsetDelta = (int) (offset - this.baseOffset);
    // Compute the timestamp delta
    long timestampDelta = timestamp - this.firstTimestamp;
    // DefaultRecord is a utility class; its writeTo() method writes this record to appendStream in the V2 record format
    int sizeInBytes = DefaultRecord.writeTo(appendStream, offsetDelta, timestampDelta, key, value, headers);
    // Update the metadata of the RecordBatch, e.g. the number of records (numRecords)
    recordWritten(offset, timestamp, sizeInBytes);
}
Another method worth attention in MemoryRecordsBuilder is hasRoomFor(), which estimates whether the current MemoryRecordsBuilder still has room for the record being written. The implementation is as follows:

public boolean hasRoomFor(long timestamp, ByteBuffer key, ByteBuffer value, Header[] headers) {
    // Two checks: the state of appendStream, and whether the estimated number of bytes already written
    // exceeds the limit recorded in the writeLimit field, i.e. the maximum number of bytes this
    // MemoryRecordsBuilder is allowed to write
    if (isFull())
        return false;

    // Every RecordBatch can hold at least one record, so if it is still empty we can keep writing
    if (numRecords == 0)
        return true;

    final int recordSize;
    if (magic < RecordBatch.MAGIC_VALUE_V2) { // V0 and V1
        recordSize = Records.LOG_OVERHEAD + LegacyRecord.recordSize(magic, key, value);
    } else {
        // Estimate the size of the record being written
        int nextOffsetDelta = lastOffset == null ? 0 : (int) (lastOffset - baseOffset + 1);
        long timestampDelta = firstTimestamp == null ? 0 : timestamp - firstTimestamp;
        recordSize = DefaultRecord.sizeInBytes(nextOffsetDelta, timestampDelta, key, value, headers);
    }
    // Bytes already written plus the size of this record must not exceed writeLimit
    return this.writeLimit >= estimatedBytesWritten() + recordSize;
}
ProducerBatch
Next, let's move up one layer and look at the implementation of ProducerBatch. Its core method is tryAppend(), whose main steps are:

Use MemoryRecordsBuilder.hasRoomFor() to check whether the current ProducerBatch has enough space for the record to be written.
Call MemoryRecordsBuilder.append() to append the record to the underlying ByteBuffer.
Create a FutureRecordMetadata object. FutureRecordMetadata implements the Future interface and corresponds to the sending of this record.
Wrap the FutureRecordMetadata object and the Callback associated with the record into a Thunk object and record it in thunks (a List).
The following is the implementation of ProducerBatch.tryAppend():

public FutureRecordMetadata tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers, Callback callback, long now) {
    // Check whether the MemoryRecordsBuilder still has room
    if (!recordsBuilder.hasRoomFor(timestamp, key, value, headers)) {
        return null; // no room left, return null to signal the failure
    } else {
        // Call append() to write the record
        Long checksum = this.recordsBuilder.append(timestamp, key, value, headers);
        // Update maxRecordSize and lastAppendTime
        this.maxRecordSize = Math.max(this.maxRecordSize, AbstractRecords.estimateSizeInBytesUpperBound(magic(),
                recordsBuilder.compressionType(), key, value, headers));
        this.lastAppendTime = now;
        // Create the FutureRecordMetadata object
        FutureRecordMetadata future = new FutureRecordMetadata(this.produceFuture,
                                      this.recordCount, timestamp, checksum,
                                      key == null ? -1 : key.length,
                                      value == null ? -1 : value.length, Time.SYSTEM);
        // Record the Callback and the FutureRecordMetadata in thunks
        thunks.add(new Thunk(callback, future));
        this.recordCount++; // update the recordCount field
        return future; // return the FutureRecordMetadata
    }
}
Besides the MemoryRecordsBuilder, a ProducerBatch records a lot of other key information:

Here we first focus on ProduceRequestResult, which maintains a CountDownLatch object (with a count of 1) and implements Future-like behaviour. When the request formed from the ProducerBatch gets a response from the broker (a normal response, a timeout or an error response), or when the KafkaProducer is closed, ProduceRequestResult.done() is called, which in turn calls countDown() on the CountDownLatch and wakes up the threads blocked on its await() method. Those threads can then use the error field of ProduceRequestResult to determine whether the request succeeded or failed.

ProduceRequestResult also has a baseOffset field, which records the offset assigned by the broker to the first record in the associated ProducerBatch. The real offset of each record can then be computed from its position in the ProducerBatch (real offset of a record = ProduceRequestResult.baseOffset + relativeOffset).

Next, look at FutureRecordMetadata, which implements the JDK Future interface and represents the state of one record. Besides a reference to the associated ProduceRequestResult, it maintains a relativeOffset field, which records the offset of the corresponding record within the ProducerBatch.

Two methods of FutureRecordMetadata are worth noting. One is get(), which relies on the CountDownLatch in ProduceRequestResult for blocking waits and calls the value() method to build the returned RecordMetadata object:

public RecordMetadata get(long timeout, TimeUnit unit) throws InterruptedException, ExecutionException, TimeoutException {
    // Compute the deadline
    long now = time.milliseconds();
    long timeoutMillis = unit.toMillis(timeout);
    long deadline = Long.MAX_VALUE - timeoutMillis < now ? Long.MAX_VALUE : now + timeoutMillis;
    // Blocking wait, backed by the CountDownLatch in ProduceRequestResult
    boolean occurred = this.result.await(timeout, unit);
    if (!occurred)
        throw new TimeoutException("Timeout after waiting for " + timeoutMillis + " ms.");
    if (nextRecordMetadata != null) // nextRecordMetadata can be ignored for now; we will come back to it when we cover batch splitting
        return nextRecordMetadata.get(deadline - time.milliseconds(), TimeUnit.MILLISECONDS);
    // Call valueOrError(), which uses the value() method to build the RecordMetadata object
    return valueOrError();
}
The other is the value() method, which packages the partition information, baseOffset, relativeOffset, timestamp (LogAppendTime or CreateTime) and other data into a RecordMetadata object:

RecordMetadata value() {
    if (nextRecordMetadata != null) // ignore nextRecordMetadata for now
        return nextRecordMetadata.value();
    // Package the partition information, baseOffset, relativeOffset, timestamp
    // (LogAppendTime or CreateTime) and other data into a RecordMetadata object
    return new RecordMetadata(result.topicPartition(), this.result.baseOffset(),
                              this.relativeOffset, timestamp(), this.checksum,
                              this.serializedKeySize, this.serializedValueSize);
}
Finally, let's look at the thunks collection in ProducerBatch. Each Thunk object corresponds to one record and stores the Callback associated with that record together with its FutureRecordMetadata object.

Now that we know what a ProducerBatch stores, let's focus on its done() method. KafkaProducer calls ProducerBatch.done() when it receives a normal response, when the request times out, or when the producer is closed. In done(), the ProducerBatch first updates its finalState and then invokes completeFutureAndFireCallbacks() to trigger the Callback of each record. The implementation is as follows:

public boolean done(long baseOffset, long logAppendTime, RuntimeException exception) {
    // Determine the final state of the ProducerBatch from the exception argument
    final FinalState tryFinalState = (exception == null) ? FinalState.SUCCEEDED : FinalState.FAILED;
    // Update finalState with a CAS operation; completeFutureAndFireCallbacks() is only triggered on the first update
    if (this.finalState.compareAndSet(null, tryFinalState)) {
        completeFutureAndFireCallbacks(baseOffset, logAppendTime, exception);
        return true;
    }
    // done() may be called once or twice; switching from SUCCEEDED to a failed state throws an exception
    if (this.finalState.get() != FinalState.SUCCEEDED) {
        // log output omitted
    } else {
        throw new IllegalStateException("...");
    }
    return false;
}
completeFutureAndFireCallbacks() traverses the thunks collection, triggers each record's Callback, updates the baseOffset, logAppendTime and error fields of the ProduceRequestResult, and calls its done() method to release the threads blocked on it. The implementation is as follows:

private void completeFutureAndFireCallbacks(long baseOffset, long logAppendTime, RuntimeException exception) {
    // Update the baseOffset, logAppendTime and error fields of the ProduceRequestResult
    produceFuture.set(baseOffset, logAppendTime, exception);
    // Traverse the thunks collection and trigger each record's Callback
    for (Thunk thunk : thunks) { // try/catch block omitted
        if (exception == null) {
            RecordMetadata metadata = thunk.future.value();
            if (thunk.callback != null)
                thunk.callback.onCompletion(metadata, null);
        } else {
            if (thunk.callback != null)
                thunk.callback.onCompletion(null, exception);
        }
    }
    // Call the underlying CountDownLatch.countDown() to release the threads blocked on it
    produceFuture.done();
}
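The Callback stored in each Thunk is the one business code passed to send(); a simplified sketch of the standard callback form:

producer.send(record, (metadata, exception) -> {
    if (exception == null) {
        System.out.println("acked at offset " + metadata.offset()); // success path of completeFutureAndFireCallbacks()
    } else {
        exception.printStackTrace(); // e.g. timeout or broker-side error passed through done()
    }
});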
BufferPool
As mentioned earlier, MemoryRecordsBuilder stores the written record data in a ByteBuffer, but creating ByteBuffer objects is relatively expensive, so the Kafka producer uses a BufferPool to manage ByteBuffers centrally. Put simply, a BufferPool is a pool of ByteBuffers: when a ByteBuffer is needed we take one from the pool, and after use we return it to the pool.

BufferPool is a fairly simple pool implementation: it only pools ByteBuffers of one specific size (the poolableSize field) and does not pool ByteBuffers of other sizes (the buffer pool in Netty is much more sophisticated; we will cover it when we walk through the Netty source code).

Normally we tune the ProducerBatch size (the batch.size configuration, in bytes, roughly the expected number of records times the estimated size of a single record) so that each ProducerBatch can cache multiple records. However, when a single record is larger than the whole ProducerBatch, the producer does not try to get a ByteBuffer from the BufferPool; instead it allocates a new ByteBuffer directly and discards it after use for GC to reclaim.
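For reference, these sizes map to producer configuration; a sketch with illustrative values (my reading is that poolableSize corresponds to batch.size and the pool's total memory to buffer.memory):

props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);            // bytes per ProducerBatch / pooled ByteBuffer
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64 * 1024 * 1024L); // total memory managed by the BufferPool
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60_000L);            // how long an allocation may block before failing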

Let's take a look at the core fields of BufferPool.

The core logic for allocating a ByteBuffer is in the allocate() method. The logic is not complex; the code and comments are given directly below.

public ByteBuffer allocate(int size, long maxTimeToBlockMs) throws InterruptedException {
    if (size > this.totalMemory) // first check that the requested size does not exceed the pool's total memory
        throw new IllegalArgumentException("...");

    ByteBuffer buffer = null;
    this.lock.lock(); // acquire the lock

    // Check the state of the BufferPool; if it has already been closed, throw an exception (omitted)

    try {
        // If the requested size is exactly poolableSize and the free list is not empty,
        // a ByteBuffer from the free list can be reused directly
        if (size == poolableSize && !this.free.isEmpty())
            return this.free.pollFirst();

        // Compute the total ByteBuffer space held in the free list
        int freeListSize = freeSize() * this.poolableSize;
        // If the pool can free up enough space for the requested size, allocate directly via freeUp()
        if (this.nonPooledAvailableMemory + freeListSize >= size) {
            freeUp(size);
            this.nonPooledAvailableMemory -= size;
        } else {
            int accumulated = 0;
            // The pool cannot currently provide the requested space, so the current thread must block
            Condition moreMemory = this.lock.newCondition();
            try {
                // Compute the maximum time the current thread may block
                long remainingTimeToBlockNs = TimeUnit.MILLISECONDS.toNanos(maxTimeToBlockMs);
                this.waiters.addLast(moreMemory);
                while (accumulated < size) { // loop until the allocation succeeds
                    long startWaitNs = time.nanoseconds();
                    long timeNs;
                    boolean waitingTimeElapsed;
                    try {
                        // Block and wait; a false return value means the wait timed out
                        waitingTimeElapsed = !moreMemory.await(remainingTimeToBlockNs, TimeUnit.NANOSECONDS);
                    } finally {
                        long endWaitNs = time.nanoseconds();
                        timeNs = Math.max(0L, endWaitNs - startWaitNs);
                    }
                    // Check the state of the BufferPool; if it has been closed, throw an exception (omitted)
                    if (waitingTimeElapsed) {
                        // The requested space was not obtained within the allowed time, so throw an exception
                        throw new BufferExhaustedException("...");
                    }

                    remainingTimeToBlockNs -= timeNs;
                    // The requested size is poolableSize and a free ByteBuffer has appeared in the free list
                    if (accumulated == 0 && size == this.poolableSize && !this.free.isEmpty()) {
                        buffer = this.free.pollFirst();
                        accumulated = size;
                    } else {
                        // Take whatever space is available now and keep waiting for the rest
                        freeUp(size - accumulated);
                        int got = (int) Math.min(size - accumulated, this.nonPooledAvailableMemory);
                        this.nonPooledAvailableMemory -= got;
                        accumulated += got;
                    }
                }
                accumulated = 0;
            } finally {
                // If the while loop above did not finish normally, accumulated is non-zero and is returned to the pool here
                this.nonPooledAvailableMemory += accumulated;
                this.waiters.remove(moreMemory);
            }
        }
    } finally {
        // If the pool still has free space, wake up the next waiting thread so it can try to allocate
        try {
            if (!(this.nonPooledAvailableMemory == 0 && this.free.isEmpty()) && !this.waiters.isEmpty())
                this.waiters.peekFirst().signal();
        } finally {
            lock.unlock(); // release the lock
        }
    }

    if (buffer == null)
        // Allocation succeeded, but no ByteBuffer from the free list could be reused
        // (the requested size may not be poolableSize, or the free list was empty)
        return safeAllocateByteBuffer(size);
    else
        return buffer; // a poolableSize ByteBuffer from the free list is reused directly
}

// The freeUp() method keeps releasing idle ByteBuffers from the free list to top up nonPooledAvailableMemory
private void freeUp(int size) {
    while (!this.free.isEmpty() && this.nonPooledAvailableMemory < size)
        this.nonPooledAvailableMemory += this.free.pollLast().capacity();
}
After a ByteBuffer allocated from the BufferPool has been used, deallocate() is called to release it. The implementation is as follows:

public void deallocate(ByteBuffer buffer, int size) {
    lock.lock(); // acquire the lock
    try {
        // If the ByteBuffer being released has size poolableSize, put it straight back into the free list
        if (size == this.poolableSize && size == buffer.capacity()) {
            buffer.clear();
            this.free.add(buffer);
        } else {
            // Otherwise leave the ByteBuffer to the JVM GC and increase nonPooledAvailableMemory
            this.nonPooledAvailableMemory += size;
        }
        // Wake up the first thread blocked in waiters
        Condition moreMem = this.waiters.peekFirst();
        if (moreMem != null)
            moreMem.signal();
    } finally {
        lock.unlock(); // release the lock
    }
}
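The general discipline around the pool, simplified and using only the allocate()/deallocate() signatures shown above (in the real producer the buffer is handed to a ProducerBatch and only released once that batch is finished):

ByteBuffer buffer = bufferPool.allocate(batchSize, maxBlockMs); // may block, or throw if memory stays exhausted
try {
    // ... wrap the buffer in a MemoryRecordsBuilder, fill the ProducerBatch, send it ...
} finally {
    bufferPool.deallocate(buffer, batchSize); // poolable-sized buffers go back to the free list, others to GC
}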
RecordAccumulator
Having analyzed the write paths of MemoryRecordsBuilder, ProducerBatch and BufferPool, let's now look at the implementation of RecordAccumulator.

When analyzing a class, we should look at its data structure first and then at its behaviour (methods). The key fields of RecordAccumulator are as follows:

When KafkaProducer.doSend() sends a message, it calls RecordAccumulator.append() directly, and that is where ProducerBatch.tryAppend() is invoked to append the message to the underlying MemoryRecordsBuilder. Let's look at the core logic of RecordAccumulator.append():

Find the ArrayDeque corresponding to the target partition in the batches collection; if the lookup fails, create a new ArrayDeque and add it to batches.
Lock the ArrayDeque obtained in step 1, using a synchronized block.
Call tryAppend() to try to write the record into the last ProducerBatch of the ArrayDeque.
private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers, Callback callback, Deque<ProducerBatch> deque, long nowMs) {
    // Get the last ProducerBatch in the ArrayDeque<ProducerBatch>
    ProducerBatch last = deque.peekLast();
    if (last != null) {
        // Try to write the message into the ProducerBatch pointed to by last
        FutureRecordMetadata future = last.tryAppend(timestamp, key, value, headers, callback, nowMs);
        if (future == null)
            // The write failed: close the ProducerBatch pointed to by last and return null to signal the failure
            last.closeForRecordAppends();
        else
            // The write succeeded: return a RecordAppendResult object
            return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false, false);
    }
    return null;
}
When the synchronized block ends, the lock is released automatically.
If the append in step 3 succeeds, the RecordAppendResult is returned.
If the append in step 3 fails, it is probably because the ProducerBatch currently in use is full. The abortOnNewBatch parameter is then checked: if it is true, a RecordAppendResult with its abortForNewBatch field set to true is returned immediately, and an abortForNewBatch of true causes RecordAccumulator.append() to be called again.
If abortOnNewBatch is not true, a new ByteBuffer is allocated from the BufferPool and wrapped in a new ProducerBatch object.
Lock the ArrayDeque again, try to append the record to the new ProducerBatch, and append the new ProducerBatch to the tail of the corresponding Deque.
Add the new ProducerBatch to the incomplete collection. The synchronized block ends, releasing the lock automatically.
Return the RecordAppendResult. Its batchIsFull and newBatchCreated fields are used as conditions for waking up the Sender thread. In KafkaProducer.doSend(), the code fragment that wakes up the Sender thread is as follows:
if (result.batchIsFull || result.newBatchCreated) {
    // When this write fills up a ProducerBatch or a new ProducerBatch has been created,
    // wake up the Sender thread to send the data
    this.sender.wakeup();
}
Now let's look at the concrete implementation of RecordAccumulator.append():

public RecordAppendResult append(TopicPartition tp, long timestamp,
        byte[] key, byte[] value, Header[] headers, Callback callback,
        long maxTimeToBlock, boolean abortOnNewBatch, long nowMs) throws InterruptedException {
    // Track the number of records currently being appended to the RecordAccumulator
    appendsInProgress.incrementAndGet();
    ByteBuffer buffer = null;
    if (headers == null) headers = Record.EMPTY_HEADERS;
    try {
        // Find the ArrayDeque<ProducerBatch> for the target partition in the batches collection;
        // if the lookup fails, create a new ArrayDeque<ProducerBatch> and add it to batches
        Deque<ProducerBatch> dq = getOrCreateDeque(tp);
        synchronized (dq) { // lock the ArrayDeque<ProducerBatch>
            // If the append succeeds, a RecordAppendResult is returned
            RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
            if (appendResult != null) // a non-null result means the append succeeded
                return appendResult;
        }

        // The append failed, probably because the current ProducerBatch is full. Depending on the
        // abortOnNewBatch parameter, decide whether to return a RecordAppendResult immediately;
        // if abortForNewBatch in the returned RecordAppendResult is true, append() will be called again
        if (abortOnNewBatch) {
            return new RecordAppendResult(null, false, false, true);
        }

        byte maxUsableMagic = apiVersions.maxUsableProduceMagic();
        int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));
        // Allocate a ByteBuffer from the BufferPool
        buffer = free.allocate(size, maxTimeToBlock);

        nowMs = time.milliseconds();
        synchronized (dq) { // lock the ArrayDeque<ProducerBatch> again
            if (closed)
                throw new KafkaException("Producer closed while send in progress");
            // Retry tryAppend() to append the record
            RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
            if (appendResult != null) {
                return appendResult;
            }
            // Wrap the ByteBuffer in a MemoryRecordsBuilder
            MemoryRecordsBuilder recordsBuilder = recordsBuilder(buffer, maxUsableMagic);
            // Create a ProducerBatch object
            ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, nowMs);
            // Append the record to the ProducerBatch via tryAppend()
            FutureRecordMetadata future = Objects.requireNonNull(batch.tryAppend(timestamp, key, value, headers,
                    callback, nowMs));
            // Add the ProducerBatch to the ArrayDeque<ProducerBatch>
            dq.addLast(batch);
            // Add the ProducerBatch to the incomplete collection
            incomplete.add(batch);

            buffer = null; // clear the buffer reference
            return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true, false);
        }
    } finally {
        if (buffer != null) // a non-null buffer means an exception occurred during the write; release the ByteBuffer here
            free.deallocate(buffer);
        // The current record has been written; decrement appendsInProgress
        appendsInProgress.decrementAndGet();
    }
}
Here we can clearly see the locking around the ArrayDeque; since it is not a thread-safe collection, the locking itself is understandable. But why lock twice? The reason is that requesting a new ByteBuffer from the BufferPool may block. Suppose all the operations above were done inside one synchronized block: thread A, sending a large record, needs to request new space from the BufferPool; if the pool does not have enough space, thread A blocks on the BufferPool while still holding the lock on the ArrayDeque. Thread B, sending a small record that would fit into the remaining space of the last ProducerBatch in that ArrayDeque, would then also have to wait because thread A has not released the lock. That would block thread B unnecessarily and reduce throughput. The essence of the optimization is to shorten the time the lock is held.

Besides the two separate locking sections, we also see that the second synchronized block retries tryAppend(). This is mainly to prevent several threads from requesting space from the BufferPool concurrently and causing internal fragmentation. The scenario is shown in the figure below: thread A finds that the last ProducerBatch has no space left, requests a new ByteBuffer, creates a new ProducerBatch and appends it to the tail of the ArrayDeque; thread B does the same concurrently and appends another new ProducerBatch to the tail. As the tryAppend() logic above shows, subsequent writes only go to the ProducerBatch at the tail of the ArrayDeque, so ProducerBatch 3 in the figure would never be written to again, resulting in internal fragmentation.

After understanding how the RecordAccumulator supports record writes, let's look at the RecordAccumulator.ready() method, which the Sender thread calls before sending records to the Kafka brokers. Based on the cluster metadata, it determines the set of nodes that have records ready to be sent. The filtering conditions are as follows:

The ArrayDeque for the partition in the batches collection holds more than one ProducerBatch, or its first ProducerBatch is full.
The batch has waited long enough: if it is being retried, the retry.backoff.ms back-off must have elapsed; otherwise the wait specified by linger.ms must have elapsed (linger.ms defaults to 0; see the configuration snippet after this list).
Some thread is waiting for the BufferPool to free space.
Some thread has called flush() and is waiting for the flush operation to complete.
The producer has been closed.
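The timing values used in these checks come from producer configuration; an illustrative snippet (with linger.ms at its default of 0, a batch is sendable as soon as the Sender gets to it):

props.put(ProducerConfig.LINGER_MS_CONFIG, 10);          // wait up to 10 ms to accumulate more records per batch
props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 100L); // back-off between retries of a failed batch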
Now let's look at the ready() method itself. It traverses each partition in the batches collection. First it finds the node hosting the leader replica of the target partition; only with that node information does the KafkaProducer know where to send the data. Then each ArrayDeque is examined, and if the conditions above are met, the corresponding node is recorded in the readyNodes set. Finally, ready() returns a ReadyCheckResult object, which records the set of nodes that satisfy the sending conditions, the topics whose leader replica could not be found during the traversal, and the delay before ready() should be called again.

public ReadyCheckResult ready(Cluster cluster, long nowMs) {
    // Records the nodes that data can be sent to
    Set<Node> readyNodes = new HashSet<>();
    // Records the delay before the next call to ready()
    long nextReadyCheckDelayMs = Long.MAX_VALUE;
    // Records the topics whose leader replica cannot be found in the cluster metadata
    Set<String> unknownLeaderTopics = new HashSet<>();
    // Whether any thread is blocked waiting for the BufferPool to free space
    boolean exhausted = this.free.queued() > 0;

    // Traverse the batches collection and examine the node hosting the leader replica of each partition
    for (Map.Entry<TopicPartition, Deque<ProducerBatch>> entry : this.batches.entrySet()) {
        Deque<ProducerBatch> deque = entry.getValue();
        synchronized (deque) {
            ProducerBatch batch = deque.peekFirst();
            if (batch != null) {
                TopicPartition part = entry.getKey();
                // Find the node hosting the leader replica of the target partition
                Node leader = cluster.leaderFor(part);
                if (leader == null) {
                    // The leader replica cannot be found; treat this as abnormal, the message cannot be sent
                    unknownLeaderTopics.add(part.topic());
                } else if (!readyNodes.contains(leader) && !isMuted(part)) {
                    boolean full = deque.size() > 1 || batch.isFull();
                    long waitedTimeMs = batch.waitedTimeMs(nowMs);
                    boolean backingOff = batch.attempts() > 0 && waitedTimeMs < retryBackoffMs;
                    long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
                    boolean expired = waitedTimeMs >= timeToWaitMs;
                    // Check the five conditions above to find the nodes involved in this send
                    boolean sendable = full || expired || exhausted || closed || flushInProgress();
                    if (sendable && !backingOff) {
                        readyNodes.add(leader);
                    } else {
                        long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
                        // Record the delay before the next call to ready()
                        nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);
                    }
                }
            }
        }
    }
    return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);
}
After RecordAccumulator.ready() returns the readyNodes collection, this collection is further filtered by the NetworkClient (covered in detail when we introduce the Sender thread) to obtain the set of nodes that can actually be sent to.

After that, the Sender thread calls RecordAccumulator.drain() to collect, based on that node set, the ProducerBatches to be sent, and returns a Map<Integer, List<ProducerBatch>> where the key is the ID of the target node and the value is the collection of ProducerBatches to be sent to it. The business logic on top of KafkaProducer produces data per topic-partition: it only cares which topic-partition the data goes to, not which node those topic-partitions live on. At the network IO level, however, the producer sends message data to nodes: it establishes connections to nodes and sends bytes, without caring which topic the data belongs to. The core job of drain() is therefore to convert the TopicPartition -> ProducerBatch mapping into a Node -> ProducerBatch mapping. Here is the core code of drain():

public Map<Integer, List<ProducerBatch>> drain(Cluster cluster, Set<Node> nodes, int maxSize, long now) {
    if (nodes.isEmpty())
        return Collections.emptyMap();
    // After the conversion, the key is the ID of the target node and the value is the
    // collection of ProducerBatches to be sent to that node
    Map<Integer, List<ProducerBatch>> batches = new HashMap<>();
    for (Node node : nodes) {
        // Collect the ProducerBatches for this node
        List<ProducerBatch> ready = drainBatchesForOneNode(cluster, node, maxSize, now);
        batches.put(node.id(), ready);
    }
    return batches;
}

private List<ProducerBatch> drainBatchesForOneNode(Cluster cluster, Node node, int maxSize, long now) {
    int size = 0;
    // Get the partitions hosted on the current node
    List<PartitionInfo> parts = cluster.partitionsForNode(node.id());
    // Records the ProducerBatches to be sent to the target node
    List<ProducerBatch> ready = new ArrayList<>();
    // drainIndex is an index into parts; it records where the previous drain stopped so that the next drain
    // continues from there. If we always started from index 0, the first few partitions might be drained
    // over and over again, starving the other partitions.
    int start = drainIndex = drainIndex % parts.size();
    do {
        // Get the partition metadata
        PartitionInfo part = parts.get(drainIndex);
        TopicPartition tp = new TopicPartition(part.topic(), part.partition());
        this.drainIndex = (this.drainIndex + 1) % parts.size();
        // Check whether the ArrayDeque for the target partition is empty (omitted)
        synchronized (deque) {
            // Get the first ProducerBatch in the ArrayDeque
            ProducerBatch first = deque.peekFirst();
            if (first == null)
                continue;
            // For a retried batch, check whether the back-off time has elapsed (omitted)
            if (size + first.estimatedSizeInBytes() > maxSize && !ready.isEmpty()) {
                // The data for this request is full; end the loop
                break;
            } else {
                if (shouldStopDrainBatchesForPartition(first, tp))
                    break;
                // Remove the first ProducerBatch from the ArrayDeque
                ProducerBatch batch = deque.pollFirst();
                // Transaction-related handling (omitted)
                // Close the underlying output stream and put the ProducerBatch into a read-only state
                batch.close();
                size += batch.records().sizeInBytes();
                // Record the ProducerBatch in the ready collection
                ready.add(batch);
                // Set the drainedMs timestamp of the ProducerBatch
                batch.drained(now);
            }
        }
    } while (start != drainIndex);
    return ready;
}
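One note on the maxSize parameter: as far as I understand, it corresponds to the producer's max.request.size setting, which caps how much data drainBatchesForOneNode() packs into a single request; for example:

props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1048576); // 1 MB, the default upper bound per request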
Summary
This lesson first introduced the evolution of the Kafka message format and analyzed in detail how the format changed across V0, V1 and V2.

It then introduced the RecordAccumulator, the core of the Kafka producer and the data transfer station between the business threads and the Sender thread, covering the underlying components MemoryRecordsBuilder, ProducerBatch and BufferPool as well as the core methods of RecordAccumulator itself.

In the next lesson, we will introduce the Sender thread of the Kafka producer.

The article and video for this lesson will also be published on the following channels:

WeChat official account:

Bilibili: Yang Sizheng