Kafka producer essentials

Time:2020-9-26

A Kafka producer is an application responsible for sending messages to the Kafka service. This article tells no stories; it focuses on principles and reasoning, so it may be difficult and dry for readers who know nothing about Kafka.

KafkaProducer

KafkaProducer is thread-safe, so multiple threads can share the same instance.

ProducerRecord

public class ProducerRecord<K, V> {
    private final String topic;
    private final Integer partition;
    private final Headers headers;
    private final K key;
    private final V value;
    private final Long timestamp;
}

timestamp is the timestamp of the message. It has two types, CreateTime and LogAppendTime: the former is the time when the message was created, and the latter is the time when the message was appended to the log file.

If the value is null, the message is called a tombstone message.

What are headers for?

The key and value generally follow a fixed data format, while headers can vary freely to meet application-specific needs, such as carrying data used to validate the message. Header values are raw bytes and are not passed through the key or value serializer.

Sending of messages

There are three modes of sending messages:

  • Fire-and-forget
  • Synchronous
  • Asynchronous

Synchronous mode has the highest reliability but much worse performance: you must block and wait for one message to be sent before sending the next.

In asynchronous mode, Kafka provides a callback that is invoked once the send has completed. Callbacks for the same partition are invoked in the order the messages were sent.

After a message is sent, the result is returned as a RecordMetadata object:

public final class RecordMetadata {
    private final long offset;
    private final long timestamp;
    private final int serializedKeySize;
    private final int serializedValueSize;
    private final TopicPartition topicPartition;
    private volatile Long checksum;
}

If the topic uses LogAppendTime, the returned timestamp is the time at which the broker appended the message.

The return value of KafkaProducer.send() is a Future object. Its get() method accepts a timeout for blocking; if the timeout expires, an exception is thrown.
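The three sending modes differ only in what the caller does with the Future returned by send(). The sketch below models this with a JDK CompletableFuture so that it runs without a broker. The fakeSend method is a hypothetical stand-in, and the long it returns plays the role of the offset in RecordMetadata. With the real client the corresponding calls would be producer.send(record), producer.send(record).get(timeout, unit) and producer.send(record, callback).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of the three sending modes; not the Kafka client itself.
class SendModesSketch {
    private static final AtomicLong NEXT_OFFSET = new AtomicLong();

    // Stand-in for KafkaProducer.send(): completes with the assigned offset.
    // The value is ignored in this model.
    static CompletableFuture<Long> fakeSend(String value) {
        return CompletableFuture.supplyAsync(NEXT_OFFSET::getAndIncrement);
    }

    // Fire-and-forget: the result is ignored, so failures go unnoticed.
    static void fireAndForget(String value) {
        fakeSend(value);
    }

    // Synchronous: block until the send completes or the timeout expires.
    static long sendSync(String value, long timeoutMs) {
        try {
            return fakeSend(value).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("no acknowledgement within " + timeoutMs + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("send failed", e.getCause());
        }
    }

    // Asynchronous: register a callback and return immediately.
    static CompletableFuture<Long> sendAsync(String value, StringBuilder log) {
        return fakeSend(value).whenComplete((offset, error) -> {
            if (error != null) {
                log.append("failed: ").append(error);
            } else {
                log.append("acked at offset ").append(offset);
            }
        });
    }
}
```

The synchronous variant pays one full round trip per message, which is why it is the slowest of the three.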

Exception handling

KafkaProducer distinguishes two types of exceptions: retriable exceptions and non-retriable exceptions.

Retriable exceptions: NetworkException, LeaderNotAvailableException

Non-retriable exceptions: RecordTooLargeException

A retriable exception can be handled by retrying. Kafka provides the corresponding parameter:

props.put(ProducerConfig.RETRIES_CONFIG, 10);
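How the retries setting interacts with the two kinds of exceptions can be sketched as a simple loop. RetriableError and FatalError are hypothetical stand-ins for exceptions such as NetworkException and RecordTooLargeException; the real client performs its retries internally, so this is only a model of the behavior, not its implementation.

```java
import java.util.function.Supplier;

// Simplified model of retry handling; the exception types are hypothetical.
class RetrySketch {
    static class RetriableError extends RuntimeException {}
    static class FatalError extends RuntimeException {}

    // Runs the attempt once, then up to `retries` more times on retriable errors.
    static <T> T sendWithRetries(Supplier<T> attempt, int retries) {
        for (int tries = 0; ; tries++) {
            try {
                return attempt.get();
            } catch (RetriableError e) {
                if (tries >= retries) {
                    throw e; // retries exhausted, give up
                }
            }
            // FatalError is not caught here: it propagates on the first attempt.
        }
    }
}
```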

Overall structure

The producer client runs two cooperating threads: the main thread and the Sender thread.

Interceptor

A producer interceptor can be used to filter or preprocess messages according to some rules.

A custom interceptor implements the ProducerInterceptor interface:

public interface ProducerInterceptor<K, V> extends Configurable {
    public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record);
    public void onAcknowledgement(RecordMetadata metadata, Exception exception);
    public void close();
}

KafkaProducer calls the interceptor's onAcknowledgement method when the message has been acknowledged or when sending fails. This call runs before the Callback set by the user.

Loading interceptors

Interceptors are configured through the interceptor.classes configuration item. Multiple interceptors are supported, separated by commas.
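When several interceptors are configured they form a chain, and onSend is applied in the order in which the classes are listed. The sketch below models that chain on plain strings. The Interceptor interface here is a hypothetical simplification, not the real ProducerInterceptor.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of an interceptor chain operating on plain strings.
class InterceptorChainSketch {
    interface Interceptor {
        String onSend(String value); // returns the (possibly modified) message
    }

    // Applies every interceptor in configuration order.
    static String onSend(List<Interceptor> chain, String value) {
        String current = value;
        for (Interceptor interceptor : chain) {
            current = interceptor.onSend(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Interceptor> chain = Arrays.<Interceptor>asList(
                v -> "prefix-" + v,
                v -> v + "-suffix");
        System.out.println(onSend(chain, "hello")); // prints prefix-hello-suffix
    }
}
```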

Partitioner

The partitioner determines which partition a message is sent to.

  • If the partition field is specified in the ProducerRecord, the partitioner is not needed.
  • Otherwise, if the key is not null, the default partitioner hashes the key (MurmurHash2, which is fast and has a low collision rate) to pick a partition. The result can be any partition of the topic, whether or not it is currently available, so the send may fail and be retried.
  • If the key is null, messages are sent to the available partitions of the topic in round-robin fashion.
  • If the number of partitions increases (Kafka supports adding partitions after a topic is created, but does not support reducing them), the mapping between keys and partitions is hard to preserve.
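The partition choice described above can be sketched as follows. This is a simplified model, not the default partitioner: Arrays.hashCode is used as a stand-in for MurmurHash2, and the list of available partitions is passed in by the caller.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified partition selection. Arrays.hashCode stands in for MurmurHash2.
class PartitionSketch {
    private final AtomicInteger counter = new AtomicInteger();

    // Keyed message: hash the key bytes and take the result modulo ALL partitions.
    static int partitionForKey(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return (hash & 0x7fffffff) % numPartitions; // the mask keeps the value non-negative
    }

    // Null key: rotate over the currently available partitions only.
    int partitionForNullKey(List<Integer> availablePartitions) {
        int next = counter.getAndIncrement() & 0x7fffffff;
        return availablePartitions.get(next % availablePartitions.size());
    }

    public static void main(String[] args) {
        System.out.println(partitionForKey("a", 3)); // prints 2
        System.out.println(partitionForKey("a", 4)); // prints 0
    }
}
```

In this model, growing the topic from 3 to 4 partitions moves key "a" from partition 2 to partition 0, which is the mapping problem mentioned in the last bullet.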

RecordAccumulator (message accumulator)

It caches messages so that the Sender thread can send them in batches, which reduces network overhead and improves performance. The cache size is set by buffer.memory; the default value is 32 MB.

If the producer creates messages faster than they can be sent to the server, the buffer runs out of space and a call to send() either blocks or throws an exception. The maximum blocking time is set by max.block.ms; the default is 60 seconds.

The data structure inside the RecordAccumulator is a double-ended queue (Deque<ProducerBatch>): batches are appended at the tail and taken from the head.

The queues are stored in a ConcurrentMap<TopicPartition, Deque<ProducerBatch>>.
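A minimal sketch of this layout, assuming string keys such as "topic-0" in place of TopicPartition and plain strings in place of ProducerBatch. It shows only the tail-append and head-drain discipline, not the real accumulator.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Simplified accumulator: one deque per partition, keyed by a "topic-partition" string.
class AccumulatorSketch {
    private final ConcurrentMap<String, Deque<String>> batches = new ConcurrentHashMap<>();

    // Main thread: append at the tail of the partition's deque.
    void append(String topicPartition, String batch) {
        Deque<String> deque = batches.computeIfAbsent(topicPartition, tp -> new ArrayDeque<>());
        synchronized (deque) {
            deque.addLast(batch);
        }
    }

    // Sender thread: take from the head, or return null when nothing is queued.
    String drain(String topicPartition) {
        Deque<String> deque = batches.get(topicPartition);
        if (deque == null) {
            return null;
        }
        synchronized (deque) {
            return deque.pollFirst();
        }
    }
}
```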

ProducerBatch

A ProducerBatch is a collection of ProducerRecords, that is, a message batch. Batching makes the byte layout more compact and reduces network overhead, which improves performance.

linger.ms is how long a batch waits for more messages. When that time is reached, the batch is handed to the Sender thread.

batch.size is the nominal maximum size of a ProducerBatch (a single large message can exceed it). When this size is reached, the batch is handed to the Sender thread.

BufferPool

It is mainly used to reuse ByteBuffer instances so that the cache is used efficiently.

However, the BufferPool only manages ByteBuffers of one specific size. ByteBuffers of other sizes are not cached in the BufferPool.

That size is specified by the batch.size parameter. The default value is 16 KB.
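The reuse rule can be sketched like this. It is a simplified, single-threaded model that ignores the buffer.memory limit and any blocking; poolableSize plays the role of batch.size, and the class and method names are illustrative.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified pool: only buffers of exactly poolableSize (batch.size) are reused.
class BufferPoolSketch {
    private final int poolableSize;
    private final Deque<ByteBuffer> free = new ArrayDeque<>();

    BufferPoolSketch(int poolableSize) {
        this.poolableSize = poolableSize;
    }

    ByteBuffer allocate(int size) {
        if (size == poolableSize && !free.isEmpty()) {
            return free.pollFirst();      // reuse a cached buffer
        }
        return ByteBuffer.allocate(size); // otherwise allocate fresh memory
    }

    void deallocate(ByteBuffer buffer) {
        if (buffer.capacity() == poolableSize) {
            buffer.clear();
            free.addLast(buffer);         // standard size: keep for reuse
        }
        // Any other size is dropped and left to the garbage collector.
    }

    int cachedBuffers() {
        return free.size();
    }
}
```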

Relationship between ProducerBatch and BufferPool

When a message flows into the RecordAccumulator, the accumulator first finds the deque for the message's partition and takes the ProducerBatch at the tail of that deque. If the batch still has room for the message, the message is written into it. If not, a new ProducerBatch is created and the message is written there. The size of a new ProducerBatch depends on batch.size: if the message is no larger than batch.size, the batch is created with batch.size bytes, and that memory can be reused through the BufferPool afterwards. If the message is larger than batch.size, the batch is created with the message's actual size, and that memory is not reused.
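The append path for a single partition can be sketched as follows. Sizes are plain byte counts and the Batch class is a hypothetical simplification of ProducerBatch that tracks only capacity and usage.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified append path for one partition; sizes are in bytes.
class BatchAppendSketch {
    static class Batch {
        final int capacity;
        int used;

        Batch(int capacity) {
            this.capacity = capacity;
        }

        boolean hasRoomFor(int size) {
            return used + size <= capacity;
        }
    }

    private final int batchSize;
    final Deque<Batch> deque = new ArrayDeque<>();

    BatchAppendSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    void append(int recordSize) {
        Batch last = deque.peekLast();
        if (last == null || !last.hasRoomFor(recordSize)) {
            // New batch: batch.size normally, or the message's own size if it is larger.
            last = new Batch(Math.max(batchSize, recordSize));
            deque.addLast(last);
        }
        last.used += recordSize;
    }
}
```

Only batches whose capacity equals batch.size would be returned to the pool for reuse; an oversized batch is allocated and released on its own.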

InFlightRequests

InFlightRequests caches requests that have been sent but have not yet received a response.

The data structure is Map<NodeId, Deque<NetworkClient.InFlightRequest>>.

The max.in.flight.requests.per.connection parameter limits the number of unanswered requests per connection. The default value is 5, so at most 5 requests without a response can be cached per connection.

If requests pile up, check whether there is a network problem.

InFlightRequests can also be used to find the leastLoadedNode, that is, the node with the lightest load among all nodes. Sending a request to the leastLoadedNode lets it go out as soon as possible.

When a request has waited for its response longer than request.timeout.ms, the retry mechanism is triggered.
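A minimal model of this bookkeeping is sketched below, with string node ids and request names as illustrative placeholders. It only counts unanswered requests; connection state and timeouts are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified in-flight tracking; node ids and request names are illustrative.
class InFlightSketch {
    private final int maxInFlightPerConnection;
    private final Map<String, Deque<String>> requests = new LinkedHashMap<>();

    InFlightSketch(int maxInFlightPerConnection) {
        this.maxInFlightPerConnection = maxInFlightPerConnection;
    }

    void addNode(String nodeId) {
        requests.putIfAbsent(nodeId, new ArrayDeque<>());
    }

    boolean canSendMore(String nodeId) {
        return requests.get(nodeId).size() < maxInFlightPerConnection;
    }

    void sent(String nodeId, String request) {
        if (!canSendMore(nodeId)) {
            throw new IllegalStateException("too many in-flight requests for " + nodeId);
        }
        requests.get(nodeId).addFirst(request);
    }

    // A response arrived: the oldest outstanding request is completed.
    String completeNext(String nodeId) {
        return requests.get(nodeId).pollLast();
    }

    // The node with the fewest unanswered requests (first one wins on a tie).
    String leastLoadedNode() {
        String best = null;
        for (Map.Entry<String, Deque<String>> e : requests.entrySet()) {
            if (best == null || e.getValue().size() < requests.get(best).size()) {
                best = e.getKey();
            }
        }
        return best;
    }
}
```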

Metadata update

Metadata updates are carried out by the Sender thread. Because the main thread also reads the metadata, consistency is ensured with synchronized and final.

The update interval is set by metadata.max.age.ms. The default is 5 minutes.

When the metadata needs to be updated, the leastLoadedNode is selected first, and a MetadataRequest is then sent to this node to obtain the metadata.
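The freshness check behind this can be sketched as a small synchronized class. The names are illustrative and the explicit requestUpdate() trigger is an assumption added for the example; times are passed in as milliseconds so the logic stays deterministic.

```java
// Simplified metadata freshness check; times are in milliseconds.
class MetadataSketch {
    private final long maxAgeMs;
    private long lastRefreshMs;
    private boolean updateRequested;

    MetadataSketch(long maxAgeMs, long nowMs) {
        this.maxAgeMs = maxAgeMs;
        this.lastRefreshMs = nowMs;
    }

    // Main thread: ask for a refresh ahead of schedule.
    synchronized void requestUpdate() {
        updateRequested = true;
    }

    // Sender thread: is a MetadataRequest due?
    synchronized boolean needsUpdate(long nowMs) {
        return updateRequested || nowMs - lastRefreshMs >= maxAgeMs;
    }

    // Sender thread: record a successful refresh.
    synchronized void update(long nowMs) {
        lastRefreshMs = nowMs;
        updateRequested = false;
    }
}
```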

Summary

The requests in InFlightRequests are not only for sending messages but also for fetching metadata.

When a message for a topic is sent, the request goes to the node that hosts the leader replica of the target partition.

Parameter description

acks

Kafka provides three message acknowledgement levels (acks), set through the acks property (request.required.acks in the legacy producer).

  • acks = 0: the producer keeps sending messages without waiting for the broker to return an acknowledgement.
  • acks = 1: the default. The producer waits for the leader replica to write the message to its log file successfully. This reduces the risk of data loss to some extent, but if the leader goes down before the followers have replicated the data, the data cannot be recovered from the failed leader and is lost.
  • acks = -1: the acknowledgement is sent to the producer only after the leader replica and all replicas in the in-sync replica list have stored the data. For this to be meaningful there must be more than one in-sync replica, which is controlled by min.insync.replicas. When there are not enough in-sync replicas, the producer throws an exception. This mode reduces the producer's sending speed and throughput.
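For reference, the settings discussed in this article can be collected into one Properties object. This sketch uses the plain string keys of the Java producer and only the JDK. The broker address and the choice of StringSerializer are assumptions made for the example, and the numeric values are the defaults quoted above.

```java
import java.util.Properties;

// Producer settings mentioned in this article, as plain string properties.
class ProducerConfigSketch {
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: a local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");                 // same meaning as -1
        props.put("retries", "10");
        props.put("buffer.memory", "33554432");   // 32 MB
        props.put("max.block.ms", "60000");       // 60 seconds
        props.put("batch.size", "16384");         // 16 KB
        props.put("max.in.flight.requests.per.connection", "5");
        props.put("metadata.max.age.ms", "300000"); // 5 minutes
        return props;
    }
}
```

With the kafka-clients library on the classpath, the result can be passed to new KafkaProducer<>(props).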

Description of producer configuration

Property | Default value | Description
message.send.max.retries | 3 | The number of times the producer retries before the message is lost
retry.backoff.ms | 100 | The time the producer waits before refreshing the topic metadata to check whether a new leader has been elected
queue.buffering.max.ms | 1000 | When this time is reached, messages start to be sent in batches. If batch.num.messages is also configured, reaching either of the two thresholds starts a batch send
queue.buffering.max.messages | 10000 | In asynchronous mode, the maximum number of unsent messages that can be cached in the queue before the producer must block or drop data, that is, the initial length of the message queue
batch.num.messages | 200 | The maximum number of messages per batch sent in asynchronous mode
request.timeout.ms | 1500 | When acks are required, the producer waits for the broker to reply. If there is no response within this time, an error is returned to the client
topic.metadata.refresh.interval.ms | 5min | The interval at which the producer regularly refreshes the topic metadata. If set to 0, the metadata is refreshed after each message is sent
client.id | Console.producer | An identifier specified by the producer. It is included in each request to trace calls, so that the application that issued a request can be identified
queue.enqueue.timeout.ms | 2147483647 | 0 means a message is queued directly if the queue is not full and discarded immediately if it is full. A negative value blocks unconditionally and never discards. A positive value means a QueueFullException is thrown after blocking for that long
