Big Data Hadoop (10): Kafka

Time: 2021-09-27

1、 Kafka overview

1. Definition

Kafka is a distributed message queue based on the publish/subscribe model, mainly used in the field of big data real-time processing.

2. Basic architecture


(1) Producer: the message producer, i.e. the client that sends messages to the Kafka broker.
(2) Consumer: the message consumer, i.e. the client that fetches messages from the Kafka broker.
(3) Consumer Group (CG): a group consisting of multiple consumers. Each consumer in the group consumes data from different partitions, and within a group one partition can only be consumed by one consumer. Consumer groups do not affect each other. Every consumer belongs to some consumer group; in other words, the consumer group is the logical subscriber.
(4) Broker: a Kafka server is a broker. A cluster consists of multiple brokers, and one broker can host multiple topics.
(5) Topic: can be understood as a queue; both producers and consumers work against a topic.
(6) Partition: for scalability, a very large topic can be spread across multiple brokers (i.e. servers). A topic can be divided into multiple partitions, and each partition is an ordered queue.
(7) Replica: to ensure that partition data on a node is not lost when the node fails and that Kafka can keep working, Kafka provides a replica mechanism. Each partition of a topic has several replicas: one leader and several followers.
(8) Leader: the "master" among the replicas of each partition. Producers send data to the leader, and consumers consume data from the leader.
(9) Follower: a "slave" among the replicas of each partition. It synchronizes data from the leader in real time and stays consistent with it. When the leader fails, a follower becomes the new leader.

We can understand the above concepts as follows: different topics are like different highways, partitions are like lanes on a highway, and messages are the vehicles running in the lanes. If traffic is heavy, more lanes are added; otherwise, lanes are reduced. Consumers are like toll stations on the highway: the more toll stations are open, the faster vehicles pass through.

As for consumer groups, the rule is that multiple consumers in the same group are not allowed to consume messages from the same partition, while consumers in different groups can consume the same partition at the same time. In other words, within one consumer group the relationship between partitions and consumers is many-to-one rather than one-to-many.
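As a minimal command-line sketch of these concepts (it assumes the three-broker cluster built in the next part, a broker reachable at centos01:9092, and a hypothetical topic named test), you can create a topic with several partitions and replicas, inspect its leaders and followers, and attach consumers in groups:

# Create a hypothetical topic "test" with 2 partitions, each kept on 2 brokers
$ bin/kafka-topics.sh --create --bootstrap-server centos01:9092 \
    --topic test --partitions 2 --replication-factor 2

# Describe the topic: every partition lists its leader broker and follower replicas
$ bin/kafka-topics.sh --describe --bootstrap-server centos01:9092 --topic test

# Consumers started with the same --group value split the partitions of "test"
# between them (each partition goes to exactly one of them); a consumer started
# with a different group name independently receives all messages again.
$ bin/kafka-console-consumer.sh --bootstrap-server centos01:9092 --topic test --group g1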


2、 Installing Kafka on the cluster

1. Download and install

Kafka relies on a ZooKeeper cluster, so a ZooKeeper cluster must be set up before building the Kafka cluster. We have already set up a ZooKeeper cluster in an earlier part of this series.

Correspondence between Kafka and ZooKeeper versions:
(table: Kafka / ZooKeeper version correspondence)

Download Kafka from the official Apache website (address: http://kafka.apache.org/downl…). Based on our ZooKeeper version we download the stable release kafka_2.12-2.5.0.tgz (Kafka is written in Scala and Java; 2.12 is the Scala version number and 2.5.0 is the Kafka version number).


On the centos01 node, switch to the directory /opt/softwares/, download the archive, and then extract it to /opt/modules/:

$ cd /opt/softwares/
$ wget https://archive.apache.org/dist/kafka/2.5.0/kafka_2.12-2.5.0.tgz
$ tar -zxvf kafka_2.12-2.5.0.tgz -C /opt/modules/

2. Edit the configuration file

Switch to the installation directory, which is named kafka_2.12-2.5.0:

cd /opt/modules/kafka_2.12-2.5.0

Create a logs folder under the /opt/modules/kafka_2.12-2.5.0 directory:

[root@centos01 kafka_2.12-2.5.0]# mkdir logs
[root@centos01 kafka_2.12-2.5.0]# ls -l
total 56
drwxr-xr-x. 3 root root  4096 Apr  8  2020 bin
drwxr-xr-x. 2 root root  4096 Apr  8  2020 config
drwxr-xr-x. 2 root root  8192 Jul 29 23:17 libs
-rw-r--r--. 1 root root 32216 Apr  8  2020 LICENSE
drwxr-xr-x. 2 root root     6 Jul 29 23:49 logs
-rw-r--r--. 1 root root   337 Apr  8  2020 NOTICE
drwxr-xr-x. 2 root root    44 Apr  8  2020 site-docs
[root@centos01 kafka_2.12-2.5.0]#

Modify the configuration file config/server.properties.

Modified content:

# Globally unique ID of the broker; must not be duplicated
broker.id=1
# Default number of partitions per topic on this broker (default is 1). The partition count of a topic can be increased later, but not decreased
num.partitions=2
# Socket listener address used by the broker to accept producer and consumer requests; if not configured, the host name is obtained through the Java API by default
listeners=PLAINTEXT://centos01:9092
# Directory where Kafka stores its log (message) data
log.dirs=/opt/modules/kafka_2.12-2.5.0/logs
# ZooKeeper cluster connection string
zookeeper.connect=centos01:2181,centos02:2181,centos03:2181

The complete modified configuration file:

[root@centos01 config]# cat server.properties
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from 
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://centos01:9092

# Hostname and port the broker will advertise to producers and consumers. If not set, 
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3

# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600


############################# Log Basics #############################

# A comma separated list of directories under which to store log files
log.dirs=/opt/modules/kafka_2.12-2.5.0/logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=2

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings  #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=centos01:2181,centos02:2181,centos03:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=18000


############################# Group Coordinator Settings #############################

# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0

3. Copy the installation to the other nodes

After the installation on the centos01 node is complete, copy the entire Kafka installation directory to the centos02 and centos03 nodes. The commands are as follows:

# Copy to centos02 and centos03 as the hadoop user (you will be prompted for that user's password on each host)
$ scp -r /opt/modules/kafka_2.12-2.5.0  hadoop@centos02:/opt/modules/
$ scp -r /opt/modules/kafka_2.12-2.5.0  hadoop@centos03:/opt/modules/

4. Modify other node configurations

cd /opt/modules/kafka_2.12-2.5.0/config

Edit server.properties with vi. On centos02 the configuration is changed to:

# Globally unique ID of the broker; must not be duplicated
broker.id=2
# Socket listener address used by the broker to accept producer and consumer requests; if not configured, the host name is obtained through the Java API by default
listeners=PLAINTEXT://centos02:9092

The centos03 configuration file is modified in the same way, using broker.id=3 and listeners=PLAINTEXT://centos03:9092.
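If you prefer not to edit the file by hand, the same change can be scripted, for example with sed (a sketch for centos02; on centos03 substitute broker id 3 and host name centos03):

$ cd /opt/modules/kafka_2.12-2.5.0/config
# replace the broker id copied from centos01 and point the listener at this host
$ sed -i 's/^broker.id=1$/broker.id=2/' server.properties
$ sed -i 's/centos01:9092/centos02:9092/' server.properties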

5. Start the ZooKeeper cluster

Execute the following commands on each of the three nodes to start the ZooKeeper cluster (you need to enter the ZooKeeper installation directory):

cd /opt/modules/zookeeper-3.5.9/bin
[root@centos01 bin]# ./zkServer.sh start
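Optionally, you can check that each ZooKeeper node is running and see whether it is the leader or a follower:

[root@centos01 bin]# ./zkServer.sh status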

6. Start the Kafka cluster

Execute the following commands on the three nodes to start the Kafka cluster (you need to enter the Kafka installation directory)

cd /opt/modules/kafka_2.12-2.5.0
bin/kafka-server-start.sh -daemon config/server.properties

To stop Kafka:

cd /opt/modules/kafka_2.12-2.5.0/bin
./kafka-server-stop.sh

After the cluster is started, execute the jps command on each node to view the running Java processes:

[root@centos01 kafka_2.12-2.5.0]# bin/kafka-server-start.sh -daemon config/server.properties
[root@centos01 kafka_2.12-2.5.0]# jps
7356 QuorumPeerMain
8142 Jps
8111 Kafka
[root@centos01 kafka_2.12-2.5.0]#

You can see that Kafka has started successfully^_^
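To further verify the cluster end to end, a quick sketch (the topic name verify-test is arbitrary): create a topic, send a few messages with the console producer, and read them back with the console consumer in another terminal.

$ cd /opt/modules/kafka_2.12-2.5.0
# create a test topic with 2 partitions, each replicated on 2 brokers
$ bin/kafka-topics.sh --create --bootstrap-server centos01:9092 \
    --topic verify-test --partitions 2 --replication-factor 2
# type a few lines, then press Ctrl+C to exit the producer
$ bin/kafka-console-producer.sh --broker-list centos01:9092 --topic verify-test
# the consumer should print the same lines back
$ bin/kafka-console-consumer.sh --bootstrap-server centos01:9092 --topic verify-test --from-beginning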

7. Kafka cluster start script (paths need to be modified to match your installation)

#!/bin/bash
# Start Kafka on every host listed in the Hadoop slaves file
# (adjust the slaves file path and the Kafka installation path to match your environment)
for i in `cat /opt/module/hadoop-2.7.2/etc/hadoop/slaves`
do
  echo "========== $i =========="
  ssh $i '/opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties'
  echo $?
done
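A matching stop script can follow the same pattern (a sketch, assuming the same host list and installation path as the start script above):

#!/bin/bash
# Stop Kafka on every host listed in the Hadoop slaves file
for i in `cat /opt/module/hadoop-2.7.2/etc/hadoop/slaves`
do
  echo "========== $i =========="
  ssh $i '/opt/module/kafka/bin/kafka-server-stop.sh'
  echo $?
done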