ByteBufferMessageSet in the Kafka message store



MessageSet is an important low-level abstraction through which Kafka operates on messages. As the name suggests, it is a collection of messages, but the processing logic in the code is mostly concerned with handling nested (compressed) messages. The main job of a MessageSet is to provide sequential reads and batch writes of messages; the basic unit it operates on is MessageAndOffset.


First, let's look at how messages are laid out inside a MessageSet. Each message is preceded by a Long giving the message's offset within the set, followed by an Int giving the size of the message. The MessageSet splits the byte stream into individual messages by reading these two fields.
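As a sketch of this layout, here is a minimal, hypothetical Java example (the class and method names are mine, not Kafka's) that writes and re-splits [offset (Long)][size (Int)][payload] entries using java.nio.ByteBuffer:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MessageSetLayout {
    // Per-entry overhead: 8-byte offset (long) + 4-byte size (int)
    static final int LOG_OVERHEAD = 8 + 4;

    // Append one [offset][size][payload] entry to the buffer.
    static void writeEntry(ByteBuffer buf, long offset, byte[] payload) {
        buf.putLong(offset);
        buf.putInt(payload.length);
        buf.put(payload);
    }

    // Split a message set back into "offset:payload" strings by
    // reading the offset and size prefixes.
    static List<String> readEntries(ByteBuffer buf) {
        List<String> out = new ArrayList<>();
        while (buf.remaining() >= LOG_OVERHEAD) {
            long offset = buf.getLong();
            int size = buf.getInt();
            if (buf.remaining() < size)
                break; // trailing message is truncated: stop parsing
            byte[] payload = new byte[size];
            buf.get(payload);
            out.add(offset + ":" + new String(payload, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        writeEntry(buf, 0L, "a".getBytes(StandardCharsets.UTF_8));
        writeEntry(buf, 1L, "bb".getBytes(StandardCharsets.UTF_8));
        buf.flip();
        System.out.println(readEntries(buf)); // [0:a, 1:bb]
    }
}
```

Note the truncation check: like the real iterator, a partial trailing entry simply ends the parse rather than raising an error.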

MessageSet also implements the Iterable interface. Its two main methods return an iterator over the set and write the set to a channel; the relevant code is pasted below.

  /** Write the messages in this set to the given channel starting at the given offset byte.
    * Less than the complete amount may be written, but no more than maxSize can be. The number
    * of bytes written is returned */
  def writeTo(channel: GatheringByteChannel, offset: Long, maxSize: Int): Int

  /** Provides an iterator over the message/offset pairs in this set */
  def iterator: Iterator[MessageAndOffset]

  /** Gives the total size of this message set in bytes */
  def sizeInBytes: Int

Here I would like to raise a few questions:

  • Why implement Iterable instead of implementing the Iterator interface directly?

  • MessageSet could fold more APIs into this abstract class, such as inserting a message or message set at a given offset, or deleting the message at a given offset. Should such APIs be introduced, and if so, to what extent?

Let's look at the first question. It is no accident that Java's Collection interface likewise extends Iterable: the core methods next() and hasNext() of the Iterator interface depend on the iterator's current position. If the collection implemented Iterator directly, the collection object itself would have to carry the current iteration position (a cursor), and as the collection was passed between methods the result of next() would become unpredictable, since the cursor would not be reset.
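A toy illustration of this point (a hypothetical ToyMessageSet class, not from Kafka): because the set implements Iterable, every call to iterator() hands out a fresh cursor, so separate traversals do not disturb each other.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// A toy message set: implementing Iterable means each caller gets
// its own Iterator with an independent cursor.
public class ToyMessageSet implements Iterable<String> {
    private final List<String> messages;

    public ToyMessageSet(String... messages) {
        this.messages = Arrays.asList(messages);
    }

    @Override
    public Iterator<String> iterator() {
        return messages.iterator(); // fresh cursor per call
    }

    public static void main(String[] args) {
        ToyMessageSet set = new ToyMessageSet("m0", "m1");
        Iterator<String> a = set.iterator();
        Iterator<String> b = set.iterator();
        a.next(); // advancing one iterator...
        System.out.println(b.next()); // ...does not move the other: prints m0
    }
}
```

Had ToyMessageSet implemented Iterator itself, the two traversals above would share one cursor and interfere with each other, which is exactly the unpredictability described above.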

As for the second question, my current thinking is that because of nested messages, both insertion and deletion would require complex validation. For a message queue, message production and consumption must be sequential, happening at the head and the tail, so adding such APIs to the abstract base class would be too expensive. This is just my personal view and is not necessarily correct.

Generation of the iterator

Next, let's look at how ByteBufferMessageSet implements sequential reading. We go straight to the most complex case: nested messages that must be parsed and decompressed.

1. Read a Long from the head of the buffer to get the offset of the message
2. Read an Int to get the size of the message, then validate it: if the size is smaller than the minimum message header size, raise an error; if the bytes remaining in the buffer are fewer than the size, the last message is truncated and parsing is complete
3. Slice out a MessageAndOffset of the given size. If the message is uncompressed, return it directly as the next message; otherwise go to step 4
4. Build an inner iterator responsible for iterating over the messages inside the nested message. Two points deserve attention here: first, the timestamp and timestamp type of each inner message are overridden according to those of the outer (wrapper) message; second, the offsets must be converted. I will focus on the offset conversion below
5. The inner iterator first decompresses and reads all inner messages from the compressed byte stream, then hands them out one by one as the outer layer calls next(), until the inner iterator is exhausted and control returns to the outer iteration logic
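The steps above can be sketched roughly as follows, as a simplified Java model with hypothetical types (real Kafka works on byte buffers and decompression streams; here a wrapper entry simply carries its inner payloads directly, and offsets and timestamps are omitted):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class DeepIteration {
    // Hypothetical stand-in for MessageAndOffset: a payload plus an
    // optional list of nested (inner) payloads when the entry is a
    // compressed wrapper. Requires Java 16+ for records.
    record Entry(String payload, List<String> inner) {}

    // Flatten a set the way the deep iterator does: plain entries are
    // yielded directly; wrapper entries are expanded one level and
    // their inner messages drained before moving to the next outer entry.
    static List<String> deepIterate(List<Entry> set) {
        List<String> out = new ArrayList<>();
        Deque<String> innerQueue = new ArrayDeque<>();
        for (Entry e : set) {
            if (e.inner() == null) {
                out.add(e.payload());          // uncompressed: yield as-is
            } else {
                innerQueue.addAll(e.inner());  // compressed: expand inner set
                while (!innerQueue.isEmpty())  // drain the inner iterator
                    out.add(innerQueue.poll());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> set = List.of(
            new Entry("a", null),
            new Entry("wrapper", Arrays.asList("b", "c")));
        System.out.println(deepIterate(set)); // [a, b, c]
    }
}
```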

The offset is what we need to focus on here. First, what does this offset mean? By design there are two options: 1. something like a sequence number, counting messages; 2. the starting position of the message in the byte stream. To better serve the processing logic of the upper layers after a read, Kafka chose the message sequence number. The main complication is nested messages: after decompression, the inner messages carry relative offsets (sequence numbers relative to the wrapper), which must be converted to absolute offsets. In addition, what should the offset of the wrapper message itself be? If we only considered the outer layer, the offsets of all subsequent messages would have to be shifted after decompression, which is unreasonable. The workable choice is the absolute offset of the last inner message. Writing RO for the relative offset, IO for the inner offset, and AO for the absolute offset:

RO = IO_of_message - IO_of_last_inner_message
AO = AO_of_last_inner_message + RO
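A small worked example of these two formulas (a hypothetical helper mirroring the relativeOffset/absoluteOffset computation in makeNext): a wrapper stored at absolute offset 104 holds five inner messages with inner offsets 0..4, so the last inner offset is 4 and the inner messages recover absolute offsets 100..104.

```java
public class OffsetConversion {
    // AO = AO_of_last_inner_message + (IO_of_message - IO_of_last_inner_message)
    // The wrapper's own offset IS the absolute offset of its last inner message.
    static long absoluteOffset(long wrapperOffset, long innerOffset, long lastInnerOffset) {
        long relativeOffset = innerOffset - lastInnerOffset; // RO (non-positive)
        return wrapperOffset + relativeOffset;               // AO
    }

    public static void main(String[] args) {
        long wrapperOffset = 104;  // absolute offset of the wrapper entry
        long lastInnerOffset = 4;  // inner offsets run 0..4
        for (long io = 0; io <= 4; io++)
            System.out.println(absoluteOffset(wrapperOffset, io, lastInnerOffset));
        // prints 100, 101, 102, 103, 104
    }
}
```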


override def makeNext(): MessageAndOffset = {
  messageAndOffsets.pollFirst() match {
    case null => allDone()
    case nextMessage @ MessageAndOffset(message, offset) =>
      if (wrapperMessage.magic > MagicValue_V0) {
        val relativeOffset = offset - lastInnerOffset
        val absoluteOffset = wrapperMessageOffset + relativeOffset
        /** Very neat: since wrapperMessageOffset is the absolute offset of the last
          * inner message, lastInnerOffset effectively acts as the relative offset
          * of the entire wrapper message */
        new MessageAndOffset(message, absoluteOffset)
      } else {
        nextMessage
      }
  }
}

Constructors and write functions

Writing out a MessageSet is straightforward: first write the offset, then the size, and finally the message body itself; for a compressed set, prepend a wrapper header. The design of the constructors, however, reflects what this class is actually for. Where is ByteBufferMessageSet used?

  • Parsing raw data straight out of a buffer as a MessageSet, i.e. the read path

  • Being assembled from a series of messages (with an optional wrapper timestamp and magic value) and eventually written to a channel, i.e. the write path

The most particular use of ByteBufferMessageSet is reading and writing nested messages. It assigns relative offsets to inner messages, checks that all inner messages share the same magic value, converts the timestamps of inner messages on read, compresses the inner message set, and prepends the wrapper message header.
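As an illustration of the write side (hypothetical and simplified: a real Kafka wrapper value is itself a full Message with CRC, attributes, and timestamp, whereas here the compressed bytes are stored bare), inner messages can be written with relative offsets into a GZIP stream and wrapped in a single outer entry whose offset is the absolute offset of the last inner message:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.GZIPOutputStream;

public class NestedWrite {
    // Build a compressed wrapper entry: each inner message is written as
    // [relativeOffset][size][payload] into a GZIP stream, and the compressed
    // bytes become the value of one outer [offset][size][value] entry.
    static ByteBuffer wrap(long firstAbsoluteOffset, byte[][] payloads) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(compressed))) {
            for (int i = 0; i < payloads.length; i++) {
                out.writeLong(i);                 // relative (inner) offset: 0, 1, ...
                out.writeInt(payloads[i].length); // size
                out.write(payloads[i]);           // body
            }
        }
        byte[] value = compressed.toByteArray();
        // Wrapper offset = absolute offset of the LAST inner message
        long wrapperOffset = firstAbsoluteOffset + payloads.length - 1;
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 + value.length);
        buf.putLong(wrapperOffset);
        buf.putInt(value.length);
        buf.put(value);
        buf.flip();
        return buf;
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer wrapper = wrap(100, new byte[][]{ "a".getBytes(), "b".getBytes() });
        System.out.println(wrapper.getLong(0)); // wrapper offset: 101
    }
}
```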

With all this in mind, the design of the constructors follows naturally:

  • Constructed directly from a buffer or bytes, mainly serving the read path

  • Constructed from a series of messages, where you can specify whether they form a nested message, i.e. the compression codec for the whole set and whether to compress at all; more importantly, you can specify the offsets

  • If compression is required, the header-related properties of the outer wrapper message must be specified; for this part, refer to the constructor of Message

Below is the construction code for the latter case:

private def create(offsetAssigner: OffsetAssigner,
                   compressionCodec: CompressionCodec,
                   wrapperMessageTimestamp: Option[Long],
                   timestampType: TimestampType,
                   messages: Message*): ByteBuffer = {
    if (messages.isEmpty)
      MessageSet.Empty.buffer
    else if (compressionCodec == NoCompressionCodec) {
      // No nesting: just lay the messages out one by one, with the
      // expected offsets supplied by the offsetAssigner
      val buffer = ByteBuffer.allocate(MessageSet.messageSetSize(messages))
      for (message <- messages) writeMessage(buffer, message, offsetAssigner.nextAbsoluteOffset())
      buffer.rewind()
      buffer
    } else {
      val magicAndTimestamp = wrapperMessageTimestamp match {
        case Some(ts) => MagicAndTimestamp(messages.head.magic, ts)
        case None => MessageSet.magicAndLargestTimestamp(messages)
      }
      var offset = -1L
      val messageWriter = new MessageWriter(math.min(math.max(MessageSet.messageSetSize(messages) / 2, 1024), 1 << 16))

      // Write the wrapper (nested-message) header
      messageWriter.write(codec = compressionCodec, timestamp = magicAndTimestamp.timestamp, timestampType = timestampType, magicValue = magicAndTimestamp.magic) { outputStream =>
        val output = new DataOutputStream(CompressionFactory(compressionCodec, magicAndTimestamp.magic, outputStream))
        try {
          for (message <- messages) {
            offset = offsetAssigner.nextAbsoluteOffset()
            if (message.magic != magicAndTimestamp.magic)
              throw new IllegalArgumentException("Messages in the message set must have same magic value")
            // Use inner offset if magic value is greater than 0:
            // note that inner messages are written with RELATIVE offsets
            if (magicAndTimestamp.magic > Message.MagicValue_V0)
              output.writeLong(offsetAssigner.toInnerOffset(offset))
            else
              output.writeLong(offset)
            output.writeInt(message.size)
            output.write(message.buffer.array, message.buffer.arrayOffset, message.buffer.limit)
          }
        } finally {
          output.close()
        }
      }
      val buffer = ByteBuffer.allocate(messageWriter.size + MessageSet.LogOverhead)

      // Write the size and offset of the wrapper entry: the wrapper's offset
      // is the absolute offset of its last inner message
      writeMessage(buffer, messageWriter, offset)
      buffer.rewind()
      buffer
    }
  }

It is fair to say that the code above captures nearly all of the design intent and the read/write mechanics of ByteBufferMessageSet. This passage is very instructive; many of my questions were only answered once I understood it.

Verification and correction

The validation code mainly performs the following tasks:

  • Check the timestamp and timestamp type

  • For the inner messages of a nested message, check whether each has a key

  • Reset or modify the timestamp type and timestamp

  • If offset correction is required, set a starting offset for the whole set and re-check that the offsets of all messages are consistent

In the Kafka 0.10.0 code these tasks are all mixed into a single traversal. For a compressed MessageSet, many operations such as offset correction cannot be done in place; a new, decompressed-and-reassembled MessageSet has to be returned. Personally, I suspect this is not ideal: these functions could be separated, with the necessary validation and correction performed step by step and the results written back one by one. Take the Kafka 0.8.0 code as an example of how offset correction is done:

  /** Update the offsets for this message set. This method attempts to do an in-place conversion
    * if there is no compression, but otherwise recopies the messages */
  private[kafka] def assignOffsets(offsetCounter: AtomicLong, codec: CompressionCodec): ByteBufferMessageSet = {
    if (codec == NoCompressionCodec) {
      // do an in-place conversion
      var position = 0
      buffer.mark()
      while (position < sizeInBytes - MessageSet.LogOverhead) {
        buffer.position(position)
        buffer.putLong(offsetCounter.getAndIncrement())
        position += MessageSet.LogOverhead + buffer.getInt()
      }
      buffer.reset()
      this
    } else {
      // messages are compressed, crack open the messageset and recompress with correct offset
      val messages = this.internalIterator(isShallow = false).map(_.message)
      new ByteBufferMessageSet(compressionCodec = codec, offsetCounter = offsetCounter, messages = messages.toBuffer: _*)
    }
  }