Detailed introduction to Kafka

Time: 2020-02-17


Kafka core concepts

What is Kafka

Kafka is an open-source stream-processing platform developed by the Apache Software Foundation and written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its persistence layer is essentially a “massive-scale publish/subscribe message queue built on a distributed transaction-log architecture”, which makes it very valuable as enterprise-grade infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data import/export) through Kafka Connect, and it provides Kafka Streams, a Java stream-processing library. The design is heavily influenced by transaction logs.

Basic concepts

Kafka is a distributed streaming platform that can run on a single server or be deployed across multiple servers as a cluster. It provides publish and subscribe functionality: users can send data to Kafka or read data from it (for subsequent processing). Kafka is characterized by high throughput, low latency, and high fault tolerance. Here are some basic concepts commonly used in Kafka:

  • Broker

A common concept in message queuing; in Kafka it refers to a server node on which a Kafka instance is deployed.

  • Topic

A topic is used to distinguish different categories of data. For example, if application A subscribes to topic t1 and application B subscribes to topic t2 but not to t1, then data sent to t1 can be read by application A but not by application B.

  • Partition

Each topic can have one or more partitions. A partition is a physical concept: different partitions correspond to different data files. Kafka uses partitions to support concurrent writes and reads at the physical level, which greatly improves throughput.

  • Record

A record is a message that is actually written to Kafka and can be read back. Each record contains a key, a value, and a timestamp.

  • Producer

A producer sends data (records) to Kafka.

  • Consumer

A consumer reads data (records) from Kafka.

  • Consumer Group

A consumer group contains one or more consumers. Combining multiple partitions with multiple consumers in a group can greatly increase the speed at which data is processed downstream.

Kafka core terms explained

  • Topic: every message sent to a Kafka cluster has a category, called its topic. Messages of different topics are stored separately. If a topic has a large amount of data, it can be distributed across multiple brokers. A topic can also be thought of as a queue: every message must specify a topic, i.e., which queue it should be put into. With traditional message queues, messages are usually deleted once they have been consumed, whereas a Kafka cluster retains all messages, whether consumed or not. Of course, disk space is limited, so it is neither possible nor necessary to keep all data forever; Kafka therefore provides two policies for deleting old data, one based on time and one based on partition file size (a hedged configuration sketch follows the partition description below).
  • Broker: a Kafka server is called a broker. A cluster consists of multiple brokers, and a broker can host multiple topics.
  • Partition: to scale Kafka's throughput linearly, a topic is physically divided into one or more partitions, each of which is an ordered queue. Each partition corresponds to a folder on disk that stores all the messages and index files of that partition.

A partition's directory is named “topic name – partition id”. Each log file is a sequence of log entries; each log entry contains a 4-byte integer (with value N + 5), a 1-byte “magic value”, a 4-byte CRC checksum, and then the N-byte message body. The log is not a single file but is split into multiple segments; each segment is named after the offset of its first message, with a “.log” suffix, and has an accompanying index file recording the offset range of the log entries it contains. Every message has a 64-bit offset that is unique within its partition and indicates where the message starts. Kafka only guarantees that the data within a single partition is delivered to consumers in order; it does not guarantee ordering across the partitions of a whole topic.
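
As a rough illustration of the two retention policies mentioned above (time-based and size-based), the sketch below uses the Kafka Java AdminClient, the same client used later in this article, to apply a retention limit to a topic. The topic name and broker address are assumptions, and note that alterConfigs replaces a topic's existing dynamic overrides rather than merging with them.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;

public class TopicRetentionExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; replace with your own cluster
        props.put("bootstrap.servers", "node-1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "test-topic");
            Config retention = new Config(Arrays.asList(
                    // Time-based policy: delete segments older than 7 days
                    new ConfigEntry("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)),
                    // Size-based policy: cap each partition at roughly 1 GB
                    new ConfigEntry("retention.bytes", String.valueOf(1024L * 1024 * 1024))));
            // Note: alterConfigs replaces all existing topic-level overrides
            admin.alterConfigs(Collections.singletonMap(topic, retention)).all().get();
        }
    }
}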

  • Replicas: imagine that once a broker goes down, none of the partition data on it can be consumed, so partitions need to be backed up with replicas. When one replica goes down, the remaining replicas must be able to keep serving without causing data duplication or data loss.

Without a leader, all replicas could read and write data at the same time, and every replica would have to synchronize data with every other replica (N × N paths). Guaranteeing data consistency and ordering in that scheme is very difficult, and it greatly increases the complexity of the replication implementation and the probability of anomalies. With a leader, only the leader handles reads and writes, and the followers simply fetch data from the leader in order (N paths), making the system simpler and more efficient.
With a replication factor of N, each partition has N replicas. For example, if a topic on broker1 has a partition topic-1 and the replication factor is 2, then a topic-1 directory exists in the data directories of two brokers, one of which holds the leader. Among a partition's replicas, one is elected leader; producers and consumers interact only with the leader, and the other replicas act as followers and copy data from the leader.
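
A minimal sketch, using the Java AdminClient, of inspecting which replica is the leader and which are the followers (in-sync replicas) for each partition of a topic; the topic name and broker address are assumptions:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class ReplicaInfoExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; replace with your own cluster
        props.put("bootstrap.servers", "node-1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singletonList("test-topic"))
                    .all().get().get("test-topic");
            for (TopicPartitionInfo partition : description.partitions()) {
                // The leader serves all reads and writes; the followers replicate from it
                System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                        partition.partition(), partition.leader(),
                        partition.replicas(), partition.isr());
            }
        }
    }
}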

  • Producer: a producer publishes messages to a specified topic and also determines which partition each message belongs to.
  • Consumer: in essence, Kafka only supports topics. Each consumer belongs to a consumer group, and each consumer group can contain multiple consumers. A message sent to a topic is consumed by only one consumer within each group that subscribes to the topic. If all consumers share the same group, the behaviour is very similar to a queue: messages are balanced across the consumers. If the consumers are all in different groups, the behaviour is broadcast: the message is delivered to every group subscribed to the topic, so every consumer consumes it. Kafka's design means that, for a given topic, the number of consumers in one group should not exceed the number of partitions; otherwise some consumers will receive no messages.
  • Offset: an offset is tied to a partition and a consumer group. It records how far a consumer group has consumed within a given partition.
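
A minimal sketch of reading these committed per-group, per-partition offsets with the Java AdminClient; the group id "test" matches the consumer example later in this article, while the broker address is an assumption:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;

public class GroupOffsetExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; replace with your own cluster
        props.put("bootstrap.servers", "node-1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // One committed offset per (consumer group, partition): the position the group resumes from
            Map<TopicPartition, OffsetAndMetadata> offsets = admin
                    .listConsumerGroupOffsets("test")
                    .partitionsToOffsetAndMetadata().get();
            offsets.forEach((partition, offset) ->
                    System.out.printf("%s -> committed offset %d%n", partition, offset.offset()));
        }
    }
}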

Kafka usage scenarios

At present, the main use scenarios are as follows:

  • Message queuing (MQ)

Message queues (MQ) are widely used in system architecture design. An MQ is a cross-process communication mechanism used to pass messages between upstream and downstream systems. With an MQ, upstream and downstream are decoupled: the upstream sender only needs to depend on the MQ and has no logical or physical dependency on downstream services. Common MQ use cases include traffic peak shaving and data-driven task dependencies. In the MQ field, besides Kafka, there are traditional message queues such as ActiveMQ and RabbitMQ.

  • Track website activity

Kafka was originally designed to track website activity (such as page views, unique visitors, and search records). Different activities can be written to different topics for subsequent real-time computation and monitoring, or imported into a data warehouse for offline processing and report generation.

  • Metrics

Kafka is often used to transmit monitoring data: it aggregates statistics from distributed applications so they can be analysed and displayed centrally after collection.

  • Log aggregation

Many people use Kafka as a log aggregation solution. Log aggregation usually means collecting logs from different servers into a central place, such as a file server or a directory in HDFS, for subsequent analysis and processing. Kafka performs better than other log aggregation tools such as Flume and Scribe.

Kafka cluster construction

Install Kafka cluster

Since Kafka depends on a ZooKeeper environment, install ZooKeeper first: ZK install

Installation environment


linux: CentOS-7.5_x64
java: jdk1.8.0_191
zookeeper: zookeeper-3.4.10
kafka: kafka_2.11-2.1.0
Download
$ wget http://mirrors.hust.edu.cn/apache/kafka/2.1.0/kafka_2.11-2.1.0.tgz

Decompression
$ tar -zxvf kafka_2.11-2.1.0.tgz

#Edit the configuration file to modify several configurations
$ vim $KAFKA_HOME/config/server.properties

#The broker.id of each server cannot be the same. It can only be a number
broker.id=1

#Change to your server's IP or hostname
advertised.listeners=PLAINTEXT://node-1:9092

#Set the ZooKeeper connection addresses; change the hostnames below to your own IPs or hostnames
zookeeper.connect=node-1:2181,node-2:2181,node-3:2181

Start Kafka cluster and test

$ cd $KAFKA_HOME

#Start the Kafka service on each node (-daemon means run in the background)
$ bin/kafka-server-start.sh -daemon config/server.properties

#Create a topic named test-topic: --partitions 3 sets the number of partitions to 3, --replication-factor 2 sets the number of replicas to 2
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic test-topic

#View topic
$ bin/kafka-topics.sh --list --zookeeper localhost:2181

#View topic status and details
$ bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test-topic

#Modify topic information
$ bin/kafka-topics.sh --alter --topic test-topic --zookeeper localhost:2181 --partitions 5

#Delete topic (this only marks the topic for deletion unless delete.topic.enable=true on the brokers)
$ bin/kafka-topics.sh --delete --topic test-topic --zookeeper localhost:2181

#Create a producer on a server
$ bin/kafka-console-producer.sh --broker-list node-1:9092,node-2:9092,node-3:9092 --topic test-topic

#Create a consumer on a server
$ bin/kafka-console-consumer.sh --bootstrap-server node-1:9092,node-2:9092,node-3:9092 --topic test-topic --from-beginning

#Now type anything into the producer console and you will see it consumed on the consumer side.

Connecting to Kafka from a Java client

Plain Java

  • pom.xml
<dependencies>

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.25</version>
    </dependency>

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.25</version>
    </dependency>

</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.8.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
    </plugins>
</build>
  • JavaKafkaConsumer.java consumer
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

/**
 * <p>
 *
 * @author leone
 * @since 2018-12-26
 **/
public class JavaKafkaConsumer {

    private static Logger logger = LoggerFactory.getLogger(JavaKafkaConsumer.class);

    private static Producer<String, String> producer;

    private final static String TOPIC = "kafka-test-topic";

    private static final String ZOOKEEPER_HOST = "node-2:2181,node-3:2181,node-4:2181";

    private static final String KAFKA_BROKER = "node-2:9092,node-3:9092,node-4:9092";

    private static Properties properties;

    static {
        properties = new Properties();
        properties.put("bootstrap.servers", KAFKA_BROKER);
        properties.put("group.id", "test");
        properties.put("enable.auto.commit", "true");
        properties.put("auto.commit.interval.ms", "1000");
        properties.put("key.deserializer", StringDeserializer.class.getName());
        properties.put("value.deserializer", StringDeserializer.class.getName());
    }

    public static void main(String[] args) {

        final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);

        consumer.subscribe(Collections.singletonList(TOPIC), new ConsumerRebalanceListener() {

            public void onPartitionsRevoked(Collection<TopicPartition> collection) {

            }

            public void onPartitionsAssigned(Collection<TopicPartition> collection) {
                //Set offset to start
                consumer.seekToBeginning(collection);
            }
        });
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                logger.info("offset: {}, key: {}, value: {}", record.offset(), record.key(), record.value());
            }
        }
    }

}
  • JavaKafkaProducer.java producer

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Properties;
import java.util.UUID;

/**
 * <p>
 *
 * @author leone
 * @since 2018-12-26
 **/
public class JavaKafkaProducer {

    private static Logger logger = LoggerFactory.getLogger(JavaKafkaProducer.class);

    private static Producer<String, String> producer;

    private final static String TOPIC = "kafka-test-topic";

    private static final String ZOOKEEPER_HOST = "node-2:2181,node-3:2181,node-4:2181";

    private static final String KAFKA_BROKER = "node-2:9092,node-3:9092,node-4:9092";

    private static Properties properties;

    static {
        properties = new Properties();
        properties.put("bootstrap.servers", KAFKA_BROKER);
        properties.put("acks", "all");
        properties.put("retries", 0);
        properties.put("batch.size", 16384);
        properties.put("linger.ms", 1);
        properties.put("buffer.memory", 33554432);
        properties.put("key.serializer", StringSerializer.class.getName());
        properties.put("value.serializer", StringSerializer.class.getName());
    }

    public static void main(String[] args) {

        Producer<String, String> producer = new KafkaProducer<>(properties);

        for (int i = 0; i < 200; i++) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            String uuid = UUID.randomUUID().toString();
            producer.send(new ProducerRecord<>(TOPIC, Integer.toString(i), uuid));
            logger.info("send message success key: {}, value: {}", i, uuid);
        }
        producer.close();
    }

}
  • KafkaClient.java
import kafka.admin.AdminUtils;
import kafka.admin.RackAwareMode;
import kafka.server.ConfigType;
import kafka.utils.ZkUtils;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.CreateTopicsResult;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.security.JaasUtils;
import org.junit.Test;

import java.util.*;

/**
 * <p>
 *
 * @author leone
 * @since 2018-12-26
 **/
public class KafkaClient {

    private final static String TOPIC = "kafka-test-topic";

    private static final String ZOOKEEPER_HOST = "node-2:2181,node-3:2181,node-4:2181";

    private static final String KAFKA_BROKER = "node-2:9092,node-3:9092,node-4:9092";

    private static Properties properties = new Properties();

    static {
        properties.put("bootstrap.servers", KAFKA_BROKER);
    }

    /**
     *Create topic
     */
    @Test
    public void createTopic() {
        AdminClient adminClient = AdminClient.create(properties);
        List<NewTopic> newTopics = Arrays.asList(new NewTopic(TOPIC, 1, (short) 1));
        CreateTopicsResult result = adminClient.createTopics(newTopics);
        try {
            result.all().get();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }


    /**
     *Create topic
     */
    @Test
    public void create() {
        ZkUtils zkUtils = ZkUtils.apply(ZOOKEEPER_HOST, 30000, 30000, JaasUtils.isZkSecurityEnabled());
        //Create topic t1 with 3 partitions and 2 replicas
        AdminUtils.createTopic(zkUtils, "t1", 3, 2, new Properties(), RackAwareMode.Enforced$.MODULE$);
        zkUtils.close();
    }

    /**
     *Query topic
     */
    @Test
    public void listTopic() {
        ZkUtils zkUtils = ZkUtils.apply(ZOOKEEPER_HOST, 30000, 30000, JaasUtils.isZkSecurityEnabled());
        //Get all properties of topic
        Properties props = AdminUtils.fetchEntityConfig(zkUtils, ConfigType.Topic(), "streaming-topic");

        Iterator it = props.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry entry = (Map.Entry) it.next();
            System.err.println(entry.getKey() + " = " + entry.getValue());
        }
        zkUtils.close();
    }

    /**
     *Modify topic
     */
    @Test
    public void updateTopic() {
        ZkUtils zkUtils = ZkUtils.apply(ZOOKEEPER_HOST, 30000, 30000, JaasUtils.isZkSecurityEnabled());
        Properties props = AdminUtils.fetchEntityConfig(zkUtils, ConfigType.Topic(), "log-test");
        //Add topic level attribute
        props.put("min.cleanable.dirty.ratio", "0.4");
        //Delete topic level attribute
        props.remove("max.message.bytes");
        //Modify the properties of topic 'test'
        AdminUtils.changeTopicConfig(zkUtils, "log-test", props);
        zkUtils.close();

    }

    /**
     *Delete topic't1'
     */
    @Test
    public void deleteTopic() {
        ZkUtils zkUtils = ZkUtils.apply(ZOOKEEPER_HOST, 30000, 30000, JaasUtils.isZkSecurityEnabled());
        AdminUtils.deleteTopic(zkUtils, "t1");
        zkUtils.close();
    }


}
  • log4j.properties log configuration
log4j.rootLogger=info, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%5p [%t] - %m%n

Integrating Kafka with Spring Boot

  • pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <artifactId>spring-boot-kafka</artifactId>
    <groupId>com.andy</groupId>
    <version>1.0.7.RELEASE</version>
    
    <packaging>jar</packaging>
    <modelVersion>4.0.0</modelVersion>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>io.spring.platform</groupId>
                <artifactId>platform-bom</artifactId>
                <version>Cairo-SR5</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <dependencies>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-amqp</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.kafka</groupId>
            <artifactId>spring-kafka</artifactId>
        </dependency>

    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <version>2.0.3.RELEASE</version>
                <configuration>
                    <!--<mainClass>${start-class}</mainClass>-->
                    <layout>ZIP</layout>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
  • application.yml
spring:
  application:
    name: spring-jms
  kafka:
    bootstrap-servers: node-2:9092,node-3:9092,node-4:9092
    producer:
      retries:
      batch-size: 16384
      buffer-memory: 33554432
      compression-type: snappy
      acks: all
    consumer:
      group-id: 0
      auto-offset-reset: earliest
      enable-auto-commit: true
  • Message.java message

import lombok.ToString;

import java.util.Date;

/**
 * <p>
 *
 * @author leone
 * @since 2018-12-26
 **/
@ToString
public class Message<T> {

    private Long id;

    private T message;

    private Date time;

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    public T getMessage() {
        return message;
    }

    public void setMessage(T message) {
        this.message = message;
    }

    public Date getTime() {
        return time;
    }

    public void setTime(Date time) {
        this.time = time;
    }
}
  • KafkaController.java controller
import com.andy.jms.kafka.service.KafkaSender;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

/**
 * <p> 
 *
 * @author leone
 * @since 2018-12-26
 **/
@Slf4j
@RestController
public class KafkaController {

    @Autowired
    private KafkaSender kafkaSender;

    @GetMapping("/kafka/{topic}")
    public String send(@PathVariable("topic") String topic, @RequestParam String message) {
        kafkaSender.send(topic, message);
        return "success";
    }

}
  • KafkaReceiver.java
import lombok.extern.slf4j.Slf4j;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

import java.util.Optional;

/**
 * <p> 
 *
 * @author leone
 * @since 2018-12-26
 **/
@Slf4j
@Component
public class KafkaReceiver {


    @KafkaListener(topics = {"order"})
    public void listen(ConsumerRecord<?, ?> record) {
        Optional<?> kafkaMessage = Optional.ofNullable(record.value());
        if (kafkaMessage.isPresent()) {
            Object message = kafkaMessage.get();
            log.info("record:{}", record);
            log.info("message:{}", message);
        }
    }
}
  • KafkaSender.java
import com.andy.jms.kafka.commen.Message;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

import java.util.Date;

/**
 * <p>
 *
 * @author leone
 * @since 2018-12-26
 **/
@Slf4j
@Component
public class KafkaSender {

    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    @Autowired
    private ObjectMapper objectMapper;

    /**
     *
     * @param topic
     * @param body
     */
    public void send(String topic, Object body) {
        Message<String> message = new Message<>();
        message.setId(System.currentTimeMillis());
        message.setMessage(body.toString());
        message.setTime(new Date());
        String content = null;
        try {
            content = objectMapper.writeValueAsString(message);
        } catch (JsonProcessingException e) {
            e.printStackTrace();
        }
        kafkaTemplate.send(topic, content);
        log.info("send {} to {} success!", message, topic);
    }
}
  • Startup class
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

/**
 * @author Leone
 * @since 2018-04-10
 **/
@SpringBootApplication
public class JmsApplication {
    public static void main(String[] args) {
        SpringApplication.run(JmsApplication.class, args);
    }
}
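
With the application running (the default server port 8080 is an assumption, since application.yml does not set one), a GET request such as http://localhost:8080/kafka/order?message=hello makes KafkaController call KafkaSender, which wraps the text in a Message, serializes it to JSON, and publishes it to the order topic; KafkaReceiver, which listens on that topic, then logs the record.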

GitHub address