Hadoop framework: building a distributed environment in cluster mode

Time: 2021-09-30

Source code of this article: GitHub · click here || Gitee · click here

1、 Basic environment configuration

1. Three servers

Prepare three CentOS 7 servers, each cloned from the basic pseudo-distributed environment.

192.168.37.133 hop01, 192.168.37.134 hop02, 192.168.37.136 hop03

2. Set hostnames

##Set the hostname (repeat on each node with its own name)
hostnamectl set-hostname hop01
##Restart
reboot -f

3. Hostname resolution

vim /etc/hosts
#Add the service nodes
192.168.37.133 hop01
192.168.37.134 hop02
192.168.37.136 hop03
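
To confirm the mapping works, a quick check from any node (a minimal sketch; assumes ICMP is permitted between the hosts):

#Hostnames should resolve to the addresses above
ping -c 1 hop02
ping -c 1 hop03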

4. Passwordless SSH login

Configure passwordless SSH login among the three servers.

[root@hop01 ~]# ssh-keygen -t rsa
... press Enter through every prompt
[root@hop01 ~]# cd .ssh
... copy the public key to each cluster node
[root@hop01 .ssh]# ssh-copy-id hop01
[root@hop01 .ssh]# ssh-copy-id hop02
[root@hop01 .ssh]# ssh-copy-id hop03
... passwordless login from hop01 to hop02 now works
[root@hop01 ~]# ssh hop02

As with hop01, perform the same operation on hop02 and hop03.
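
The per-host ssh-copy-id calls can also be collapsed into a loop; a sketch, assuming the key was already generated on the current node:

#Distribute the public key to every node in one pass
for host in hop01 hop02 hop03; do
    ssh-copy-id $host
done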

5. Time synchronization

NTP component installation

#Installation
yum install ntpdate ntp -y
#Check
rpm -qa|grep ntp

Basic management commands

#View status
service ntpd status
#Start
service ntpd start
#Enable at boot
chkconfig ntpd on

Configure the time service on hop01

#Modify NTP configuration
vim /etc/ntp.conf
#Add content
restrict 192.168.0.0 mask 255.255.255.0 nomodify notrap
server 127.0.0.1
fudge 127.0.0.1 stratum 10

Modify the time configuration on hop02 and hop03 so that they synchronize from hop01, and comment out the public network time sources.

server 192.168.37.133
# server 0.centos.pool.ntp.org iburst
# server 1.centos.pool.ntp.org iburst
# server 2.centos.pool.ntp.org iburst
# server 3.centos.pool.ntp.org iburst

Write a scheduled task on hop02 and hop03

[root@hop02 ~]# crontab -e
*/10 * * * * /usr/sbin/ntpdate hop01

Modify the time on hop02 and hop03

#Set an arbitrary time
date -s "2018-05-20 13:14:55"
#View the time
date

The time will then be continuously corrected back into sync with the hop01 service.
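
To verify the correction without waiting for the 10-minute cron interval, a sync can be triggered manually on hop02 or hop03 (same ntpdate binary as the scheduled task):

#Force an immediate sync against hop01
/usr/sbin/ntpdate hop01
#The output of date should now match hop01 again
date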

6. Environment cleanup

Since the three CentOS 7 servers were cloned from the virtual machine of the pseudo-distributed environment, delete the data and logs folders left over from the original Hadoop configuration.

[root@hop01 hadoop2.7]# rm -rf data/ logs/
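
The cleanup is needed on all three clones, so it can also be run remotely from one node; a sketch, assuming the passwordless SSH set up earlier and the /opt/hadoop2.7 install path used below:

#Clean the leftover data and logs on every node
for host in hop01 hop02 hop03; do
    ssh $host "rm -rf /opt/hadoop2.7/data /opt/hadoop2.7/logs"
done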

2、 Cluster environment construction

1. Cluster configuration overview

Service   HDFS        YARN           Unique service
hop01     DataNode    NodeManager    NameNode
hop02     DataNode    NodeManager    ResourceManager
hop03     DataNode    NodeManager    SecondaryNameNode

2. Modify configuration

vim core-site.xml

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hop01:9000</value>
</property>

All three servers must use the same value here, pointing fs.defaultFS at the NameNode host hop01.
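
If the pseudo-distributed clone also set a working directory, keep it identical on all nodes as well; a hypothetical example (the path is an assumption, not taken from the original setup):

<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop2.7/data/tmp</value>
</property>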

vim hdfs-site.xml

<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>

<property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>hop03:50090</value>
</property>

Here the replication factor is set to 3 and the SecondaryNameNode is placed on the hop03 service; apply the same configuration on all three servers.

vim yarn-site.xml

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hop02</value>
</property>

Specify that the ResourceManager service is on hop02.
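
Unless it was already carried over from the pseudo-distributed configuration, yarn-site.xml also needs the shuffle auxiliary service for MapReduce jobs to run on YARN:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>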

vim mapred-site.xml

<!-- JobHistory server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hop01:10020</value>
</property>

<!-- JobHistory web address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hop01:19888</value>
</property>

This places the JobHistory server and its web UI on the hop01 service.
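
Two related notes: MapReduce only runs on YARN when mapred-site.xml declares the framework (the pseudo-distributed clone likely already contains this), and the JobHistory server configured above is not started by start-dfs.sh or start-yarn.sh; in Hadoop 2.7 it is launched separately:

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

#Start the JobHistory server on hop01
[root@hop01 hadoop2.7]# sbin/mr-jobhistory-daemon.sh start historyserver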

3. Cluster service configuration

Path: /opt/hadoop2.7/etc/hadoop

File: vim slaves

hop01
hop02
hop03

This lists all three servers as cluster workers. Apply the same configuration on the other servers; one way to sync it is sketched below.
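
A way to keep the three copies identical, assuming the same install path on every node:

#Push the configuration directory from hop01 to the other nodes
scp /opt/hadoop2.7/etc/hadoop/* hop02:/opt/hadoop2.7/etc/hadoop/
scp /opt/hadoop2.7/etc/hadoop/* hop03:/opt/hadoop2.7/etc/hadoop/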

4. Format namenode

Note that the namenode is configured on the hop01 service.

[root@hop01 hadoop2.7]# bin/hdfs namenode -format

5. Start HDFS

[root@hop01 hadoop2.7]# sbin/start-dfs.sh
Starting namenodes on [hop01]
hop01: starting namenode
hop03: starting datanode
hop02: starting datanode
hop01: starting datanode
Starting secondary namenodes [hop03]
hop03: starting secondarynamenode

Note that the startup output matches the configuration exactly: the NameNode starts on hop01 and the SecondaryNameNode on hop03. Each service can be verified with the jps command.

6. Start YARN

Note that YARN's ResourceManager is configured on the hop02 service, so execute the start command on hop02.

[root@hop02 hadoop2.7]# sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager
hop03: starting nodemanager
hop01: starting nodemanager
hop02: starting nodemanager

Note the startup log. At this point all of the planned cluster services are running.

[root@hop01 hadoop2.7]# jps
4306 NodeManager
4043 DataNode
3949 NameNode
[root@hop02 hadoop2.7]# jps
3733 ResourceManager
3829 NodeManager
3613 DataNode
[root@hop03 hadoop2.7]# jps
3748 DataNode
3928 NodeManager
3803 SecondaryNameNode

The processes on each server match the cluster plan.
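
Beyond jps, HDFS itself can report whether all three DataNodes have registered (standard Hadoop command, run on the NameNode host):

#Show live DataNodes and capacity from hop01
[root@hop01 hadoop2.7]# bin/hdfs dfsadmin -report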

7. Web interfaces

NameNode: http://hop01:50070
SecondaryNameNode: http://hop03:50090
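
As a final check, a minimal read/write smoke test against the new cluster (the file and path names here are arbitrary):

#Create a directory, upload a file, and list it back
[root@hop01 hadoop2.7]# bin/hdfs dfs -mkdir -p /tmp/test
[root@hop01 hadoop2.7]# bin/hdfs dfs -put etc/hadoop/core-site.xml /tmp/test
[root@hop01 hadoop2.7]# bin/hdfs dfs -ls /tmp/test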

3、 Source code address

GitHub · address
https://github.com/cicadasmile/big-data-parent
Gitee · address
https://gitee.com/cicadasmile/big-data-parent
