[Big Data Practice] Kafka Producer Programming (4): Detailed Explanation of ProducerConfig (1)

Time: 2020-10-27

Preface

The previous article gave a general introduction to the producer's sending process and its customizable classes. This article continues with Kafka producer programming. The ProducerConfig class stores the configurable items of the producer client together with their documentation strings. Based on those descriptions, this article analyzes some of Kafka's internal mechanisms and principles.

Remarks:

  • This article targets kafka-clients version 1.1.0.
  • The ProducerConfig class is in the package org.apache.kafka.clients.producer.

ProducerConfig configuration items

bootstrap.servers

Importance: high
Type: List
Default value: Collections.emptyList()

The bootstrap address list that the producer uses to find the cluster.

As the name suggests, this item is a bootstrap server list: a list of host:port pairs used to discover all brokers in the Kafka cluster. The producer establishes its initial connection to the cluster through these host:port pairs. The addresses are used only to discover the brokers in the cluster, because the set of brokers may change dynamically. Since any broker can be queried for all the other brokers, bootstrap.servers does not need to list every host:port; in principle one is enough. However, to improve availability and avoid a failed lookup when that one broker is down, you can configure several.

The configuration format is:

host1:port1,host2:port2,...
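As a minimal sketch of this format, the snippet below puts a bootstrap list into a plain java.util.Properties object and splits it back into host:port entries. The host names and the helper class BootstrapServersExample are made up for illustration; the real client does its own parsing and validation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class BootstrapServersExample {
    // Builds producer properties holding a comma-separated bootstrap list.
    static Properties producerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        return props;
    }

    // Splits "host1:port1,host2:port2" into trimmed host:port entries.
    static List<String> splitServers(String bootstrapServers) {
        List<String> servers = new ArrayList<>();
        for (String entry : bootstrapServers.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                servers.add(trimmed);
            }
        }
        return servers;
    }

    public static void main(String[] args) {
        // Hypothetical broker addresses; two entries so one broker being down is tolerated.
        Properties props = producerProps("broker1.example:9092, broker2.example:9092");
        System.out.println(splitServers(props.getProperty("bootstrap.servers")));
    }
}
```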

metadata.max.age.ms

Importance: low
Type: long
Default value: 300000 ms (5 minutes)

Maximum age of the metadata. Every metadata.max.age.ms the producer client forces a metadata refresh, even if it has not seen any partition leadership change, so that it proactively discovers new brokers or new partitions.
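The periodic refresh can be pictured with a small calculation. This is a simplified sketch, not the client's actual code; the class and method names are hypothetical, and the real update strategy has more inputs (for example, refreshes requested after errors).

```java
final class MetadataAgeSketch {
    // Milliseconds until a forced refresh is due; 0 means the metadata is already too old.
    static long timeToNextRefresh(long lastRefreshMs, long nowMs, long maxAgeMs) {
        return Math.max(lastRefreshMs + maxAgeMs - nowMs, 0L);
    }

    public static void main(String[] args) {
        long maxAgeMs = 300000L; // metadata.max.age.ms
        // Metadata refreshed at t=0, checked again at t=100 s.
        System.out.println(timeToNextRefresh(0L, 100000L, maxAgeMs));
    }
}
```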

metadata

The metadata class org.apache.kafka.clients#Metadata records some information related to its own update strategy (which deserves a separate article) and also stores information about the Kafka cluster. See the org.apache.kafka.common#Cluster class, which holds:

  • The list of all broker nodes in the cluster. Each Node records the node's host, port and rack information.

    Rack information: the rack a broker sits in. As in Hadoop, it lets the cluster exploit locality and reduce network overhead within the cluster. If rack information is specified (broker.rack), Kafka takes it into account when assigning replicas to a partition and tries to place the replicas on brokers in different racks.
  • For each TopicPartition in the cluster, the corresponding partition information (PartitionInfo). org.apache.kafka.common#PartitionInfo mainly records:

    • The topic to which the partition belongs.
    • Partition number.
    • The node where the leader of the partition is located.
    • List of partition replica nodes.
    • List of in-sync replica (ISR) nodes.
    • List of offline replica nodes.
  • The controller node of the cluster.

    The controller broker manages the state of partitions and replicas across the whole cluster. For example, if the leader replica of a partition fails, the controller elects a new leader for the partition; when a change in the ISR list is detected, the controller tells all brokers in the cluster to update their metadata caches; and when partitions are added to a topic, the controller also manages partition reassignment.
  • The list of all partitions of each topic in the cluster, i.e. indexed by topic.
  • The list of available partitions of each topic in the cluster.
  • The list of all partitions on each broker node in the cluster, i.e. indexed by broker.id.
  • The node information for each node ID (broker.id) in the cluster.
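To make the "indexed by topic" idea concrete, here is a rough sketch of how such lookup tables could be built from a flat list of partitions. PartitionView is a hypothetical, stripped-down stand-in for PartitionInfo, and treating "available" as "has a leader" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ClusterIndexSketch {
    // Hypothetical, simplified stand-in for a partition's metadata.
    static final class PartitionView {
        final String topic;
        final int partition;
        final Integer leaderId; // null when the partition currently has no leader

        PartitionView(String topic, int partition, Integer leaderId) {
            this.topic = topic;
            this.partition = partition;
            this.leaderId = leaderId;
        }
    }

    // Index: topic -> all partitions of that topic.
    static Map<String, List<PartitionView>> byTopic(List<PartitionView> all) {
        Map<String, List<PartitionView>> index = new HashMap<>();
        for (PartitionView p : all) {
            index.computeIfAbsent(p.topic, t -> new ArrayList<>()).add(p);
        }
        return index;
    }

    // Index: topic -> partitions that have a leader (assumed meaning of "available").
    static Map<String, List<PartitionView>> availableByTopic(List<PartitionView> all) {
        Map<String, List<PartitionView>> index = new HashMap<>();
        for (PartitionView p : all) {
            if (p.leaderId != null) {
                index.computeIfAbsent(p.topic, t -> new ArrayList<>()).add(p);
            }
        }
        return index;
    }
}
```

Building the indexes once per metadata update means a lookup such as "all partitions of topic X" is a single map access instead of a scan over every partition.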

batch.size

Importance: medium
Type: int
Default value: 16384 bytes, i.e. 16 KB

Limit on the size of a record batch. When sending records to the cluster, the Kafka producer tries to group records destined for the same partition into a batch. A request then carries several batches rather than a single record, and each batch may contain multiple records. This reduces the number of network requests and improves the performance of both the producer client and the Kafka cluster.

batch.size sets the maximum size of a batch in bytes. Setting it to 0 disables batching entirely. If batch.size is too large, memory may be wasted, because a buffer of batch.size bytes is always allocated in advance for each partition that records are sent to.

acks

Importance: high
Type: String
Default value: “1”

Acknowledgement setting. The producer considers a record delivered only after it has received the specified acknowledgements from the server. This setting controls the durability of sent records. It accepts the following values:

  • acks = 0: the producer does not wait for any response from the server. The record is put into the send buffer and immediately considered sent. In this case there is no guarantee that the server received the record, and the retries setting has no effect because the producer cannot know whether a send failed. In addition, the offset returned for each record is always -1.
  • acks = 1: the partition leader writes the record to its local log and returns the acknowledgement without waiting for confirmation from the followers. In this case the record may not yet be replicated when the acknowledgement is sent, so it can be lost if the leader fails before the followers copy it.
  • acks = all: the leader returns the acknowledgement only after all replicas in the ISR have confirmed the record. In this case the record will not be lost as long as at least one in-sync replica remains alive. This is the safest setting, but also the slowest.
  • acks = -1: equivalent to acks = all.
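The equivalence of all and -1 can be shown with a tiny helper that turns the string setting into a number. This is only an illustrative sketch; AcksSketch and parseAcks are names chosen here, and the real client also validates the value.

```java
import java.util.Properties;

final class AcksSketch {
    // Maps the acks setting to a number: "all" becomes -1, other values are parsed as-is.
    static short parseAcks(String acks) {
        String value = acks.trim();
        return value.equalsIgnoreCase("all") ? (short) -1 : Short.parseShort(value);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("acks", "all"); // strongest durability setting
        System.out.println(parseAcks(props.getProperty("acks")));
    }
}
```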

linger.ms

Importance: medium
Type: long
Default value: 0 ms, which means no delay, send immediately.

As mentioned above and in previous articles, the producer groups records bound for the same partition into a batch when sending. Normally this only happens when records arrive faster than they can be sent. The reason is simple: if records can be sent faster than they arrive, each record is sent immediately and there is never a chance to group several records into one batch.

Often, however, even when sending is faster than arrival we still prefer to send in batches rather than one record at a time, to reduce the number of requests and improve the performance of the producer client and the server. To do so we add a small artificial send delay, linger.ms: the producer waits up to that long before sending, so that more records can arrive and be sent together in one batch. This is analogous to Nagle's algorithm in TCP.

Note:

  • linger.ms sets the upper bound on the send delay, but batch.size also controls when a batch is sent. Once the batch accumulated for a partition reaches batch.size bytes, it is sent immediately, even if linger.ms has not yet elapsed.
  • Conversely, once linger.ms has elapsed, the batch is sent even if it has not reached batch.size bytes.
  • linger.ms applies to the batch of each partition separately, so batches for different partitions are not necessarily sent at the same time.
  • Added latency is a performance cost, so linger.ms must be chosen as a trade-off between latency and throughput.
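The interplay of the two settings can be summed up as a single condition. The sketch below is a simplification under my own naming (BatchReadySketch, isSendable); the real producer considers further conditions that are out of scope here.

```java
final class BatchReadySketch {
    // A batch is sendable when it is full or has waited at least linger.ms.
    static boolean isSendable(int batchBytes, int batchSize, long waitedMs, long lingerMs) {
        boolean full = batchBytes >= batchSize;
        boolean expired = waitedMs >= lingerMs;
        return full || expired;
    }

    public static void main(String[] args) {
        // With linger.ms = 0 every batch counts as expired right away, so it is sent immediately.
        System.out.println(isSendable(100, 16384, 0L, 0L));
        // With linger.ms = 5 a small batch that has waited 3 ms is held back.
        System.out.println(isSendable(100, 16384, 3L, 5L));
    }
}
```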

client.id

Importance: medium
Type: String
Default value: "" (empty string)

The producer client ID is passed to the Kafka server with each request. Its purpose is to track and record the source of requests. The server could track the source by IP/port, but IP/port carries no business meaning, so client.id lets you set a name with business semantics (such as PDK game), which helps later analysis and logging.

send.buffer.bytes

Importance: medium
Type: int
Default value: 131072 bytes, i.e. 128K.

Size of the TCP send buffer (SO_SNDBUF). If send.buffer.bytes is set to -1, the operating system default is used.

receive.buffer.bytes

Importance: medium
Type: int
Default value: 32768 bytes, that is, 32K.

Size of the TCP receive buffer (SO_RCVBUF). If receive.buffer.bytes is set to -1, the operating system default is used.

max.request.size

Importance: medium
Type: int
Default value: 1048576 bytes, i.e. 1 MB.

The maximum size of a request in bytes. It limits the number of record batches the producer sends in a single request, to avoid sending overly large requests.

max.request.size & batch.size

  • A request may contain multiple record batches.
  • max.request.size also effectively caps the size of a record batch: when batch.size is greater than max.request.size, the upper limit for a batch becomes max.request.size.

reconnect.backoff.ms

Importance: low
Type: long
Default value: 50 ms.

The base amount of time to wait before trying to reconnect to a broker. This keeps the producer client from reconnecting to a Kafka broker in a tight loop. The value applies to every connection attempt from the client to a broker.

reconnect.backoff.max.ms

Importance: low
Type: long
Default: 1000 ms

The maximum time the producer client waits when reconnecting to a Kafka broker that has repeatedly failed to connect. With each consecutive connection failure the reconnect backoff grows exponentially, up to this maximum, and 20% random jitter is applied to each computed backoff to avoid connection storms.

Connection storms

When applications start, the number of connections to each server may spike abnormally. Suppose a connection pool is configured with a minimum of 3 and a maximum of 10 connections, and normal business load needs about 5. When the application restarts, each instance's connection count may jump to 10, and some instances may briefly fail to obtain a connection. After start-up completes, the count slowly falls back to the normal level. This is called a connection storm.
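The exponential growth and the jitter can be sketched as below. This is an illustration only: the doubling factor, the order of capping and jittering, and the names ReconnectBackoffSketch, cappedBackoffMs and withJitter are assumptions of this sketch, not taken from the client source.

```java
import java.util.Random;

final class ReconnectBackoffSketch {
    // Deterministic part: base * 2^(failures - 1), capped at maxMs.
    static long cappedBackoffMs(long baseMs, long maxMs, int failures) {
        double factor = Math.pow(2, Math.max(failures - 1, 0));
        return (long) Math.min(baseMs * factor, (double) maxMs);
    }

    // Multiplies by a random factor in [0.8, 1.2) so clients do not retry in lockstep.
    static long withJitter(long backoffMs, Random random) {
        double factor = 0.8 + 0.4 * random.nextDouble();
        return (long) (backoffMs * factor);
    }

    public static void main(String[] args) {
        Random random = new Random();
        // reconnect.backoff.ms = 50, reconnect.backoff.max.ms = 1000
        for (int failures = 1; failures <= 7; failures++) {
            long base = cappedBackoffMs(50L, 1000L, failures);
            System.out.println(failures + " failures: wait about " + withJitter(base, random) + " ms");
        }
    }
}
```

Because every client draws its own random factor, clients that lost their connection at the same moment end up retrying at slightly different times, which is what dampens the storm.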

max.block.ms

Importance: medium
Type: long
Default: 60000 ms

This setting controls the maximum time that KafkaProducer.send() and KafkaProducer.partitionsFor() will block. Both methods can block either because the send buffer is full or because metadata is unavailable. Time spent blocking in user-supplied serializers or in a custom partitioner is not counted against this timeout.
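To tie the items together, here is a sketch of how the settings discussed in this article might be collected in a java.util.Properties object. All values, host names and the client.id are illustrative choices, not recommendations. To actually create a producer you would also need key.serializer and value.serializer and then pass the properties to the KafkaProducer constructor, which is left out so the snippet runs without the Kafka library.

```java
import java.util.Properties;

final class ProducerPropsSketch {
    // Collects the configuration items covered in this article; values are examples only.
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1.example:9092,broker2.example:9092"); // hypothetical hosts
        props.setProperty("client.id", "order-service-producer"); // hypothetical business name
        props.setProperty("acks", "all");
        props.setProperty("batch.size", "16384");
        props.setProperty("linger.ms", "5");
        props.setProperty("max.request.size", "1048576");
        props.setProperty("send.buffer.bytes", "131072");
        props.setProperty("receive.buffer.bytes", "32768");
        props.setProperty("reconnect.backoff.ms", "50");
        props.setProperty("reconnect.backoff.max.ms", "1000");
        props.setProperty("metadata.max.age.ms", "300000");
        props.setProperty("max.block.ms", "60000");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```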

Summary

The above summarizes some of the configuration items in the ProducerConfig class. For reasons of space, the remaining items will be covered in a later article. In addition, writing this article left me with a question:

What exactly is the mechanism by which the producer sends messages to the broker? From the descriptions above, batch.size, max.request.size and linger.ms all affect when a send happens.

I am noting the question here and will update the article once I understand it. If you know the answer, please share it in the comments.