Introduction to Hadoop deployment

Time: 2021-03-08

Hadoop overview and deployment

Reference: http://hadoop.apache.org/docs…

1. Hadoop overview

  • What is Hadoop?

Hadoop is a distributed system infrastructure developed by the Apache Foundation, used to solve the problems of storing and analyzing massive data sets.

  • Advantages of Hadoop

    • High reliability: Hadoop's underlying storage keeps multiple copies of each piece of data
    • High scalability: nodes can easily be added to a deployed cluster
    • Efficiency: Hadoop processes tasks in parallel, which speeds them up
    • High fault tolerance: failed tasks are automatically reassigned

2. Hadoop composition

  • The 2.x and 3.x versions are composed as follows:

    • MapReduce: computing
    • Yarn: resource scheduling
    • HDFS: data storage
    • Common: auxiliary tool

Note: in version 1.x there is no YARN; MapReduce is responsible for both computing and resource scheduling

3. Deployment planning

  • Three virtual machines
IP              Host name  OS          Configuration                    Nodes
192.168.122.10  hadoop10   CentOS 7.5  1 core / 4 GB RAM / 50 GB disk   NameNode, DataNode, NodeManager
192.168.122.11  hadoop11   CentOS 7.5  1 core / 4 GB RAM / 50 GB disk   ResourceManager, DataNode, NodeManager
192.168.122.12  hadoop12   CentOS 7.5  1 core / 4 GB RAM / 50 GB disk   SecondaryNameNode, DataNode, NodeManager
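For the host names above to resolve on each machine, /etc/hosts on all three VMs needs to map them to the planned IPs (a sketch based on the table above; adjust to your own network):

```
192.168.122.10 hadoop10
192.168.122.11 hadoop11
192.168.122.12 hadoop12
```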

4. Cluster deployment

4.1 System update and passwordless SSH configuration

  • Update and upgrade
yum install  -y epel-release
yum update
  • Configure passwordless SSH login
[v2admin@hadoop10 ~]$ ssh-keygen -t rsa
// press Enter through the prompts to generate the private key id_rsa and public key id_rsa.pub
// my user is v2admin, and all subsequent operations are performed as this user
[v2admin@hadoop10 ~]$ ssh-copy-id hadoop10
[v2admin@hadoop10 ~]$ ssh-copy-id hadoop11
[v2admin@hadoop10 ~]$ ssh-copy-id hadoop12
// do the same on hadoop11 and hadoop12
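As a side note, the key generation can also be scripted instead of pressing Enter through the prompts; a minimal sketch that writes a throwaway key pair (the /tmp/demo_rsa path is just an example):

```shell
# generate a throwaway RSA key pair non-interactively (empty passphrase),
# equivalent to accepting all of ssh-keygen's prompts
rm -f /tmp/demo_rsa /tmp/demo_rsa.pub
ssh-keygen -q -t rsa -N "" -f /tmp/demo_rsa
ls -1 /tmp/demo_rsa /tmp/demo_rsa.pub
```

For the real cluster you would of course keep the default ~/.ssh/id_rsa path so ssh-copy-id picks it up.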
  • Upload the JDK and Hadoop packages to the /home/v2admin directory of all three virtual machines
// my own workstation runs Ubuntu 18.04, so I upload directly with scp
// on Windows you can install lrzsz or upload to the virtual machines over FTP
scp jdk-8u212-linux-x64.tar.gz hadoop-3.1.3.tar.gz v2admin@192.168.122.10:/home/v2admin

scp jdk-8u212-linux-x64.tar.gz hadoop-3.1.3.tar.gz v2admin@192.168.122.11:/home/v2admin

scp jdk-8u212-linux-x64.tar.gz hadoop-3.1.3.tar.gz v2admin@192.168.122.12:/home/v2admin

4.2 Installing the JDK

[v2admin@hadoop10 ~]$ tar zxvf jdk-8u212-linux-x64.tar.gz
[v2admin@hadoop10 ~]$ sudo mv jdk1.8.0_212/ /usr/local/jdk8

4.3 Installing Hadoop

[v2admin@hadoop10 ~]$ sudo tar zxvf hadoop-3.1.3.tar.gz -C /opt
[v2admin@hadoop10 ~]$ sudo chown -R v2admin:v2admin /opt/hadoop-3.1.3   // change the owner and group to the current user

4.4 Configuring JDK and Hadoop environment variables

[v2admin@hadoop10 ~]$ sudo vim /etc/profile   // append the following at the end
......
# set jdk hadoop env
export JAVA_HOME=/usr/local/jdk8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib

export HADOOP_HOME=/opt/hadoop-3.1.3
export PATH=${PATH}:${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
....

[v2admin@hadoop10 ~]$ source /etc/profile
[v2admin@hadoop10 ~]$ java -version   // verify the JDK
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)

[v2admin@hadoop10 ~]$ hadoop version   // verify Hadoop
Hadoop 3.1.3
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r ba631c436b806728f8ec2f54ab1e289526c90579
Compiled by ztang on 2019-09-12T02:47Z
Compiled with protoc 2.5.0
From source with checksum ec785077c385118ac91aadde5ec9799
This command was run using /opt/hadoop-3.1.3/share/hadoop/common/hadoop-common-3.1.3.jar

4.5 A script for distributing files

The configuration files are identical on all three virtual machines; without a script you would have to configure each machine by hand, which is very cumbersome.
The script file is named xrsync.sh.
Give it execute permission and put it in the bin directory so it can be called directly like any other shell command.

#!/bin/bash
if [ $# -lt 1 ]
then
        echo "Not enough arguments!"
        exit 1
fi

for host in hadoop10 hadoop11 hadoop12
do
        echo ============= $host ============
        for file in "$@"
        do
                if [ -e $file ]
                then
                        pdir=$(cd -P $(dirname $file); pwd)
                        fname=$(basename $file)
                        ssh $host "mkdir -p $pdir"
                        rsync -av $pdir/$fname $host:$pdir
                else
                        echo $file does not exist!
                fi
        done
done
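The `cd -P` in the script resolves symlinks so that rsync uses the file's physical absolute path on both ends. A small local illustration of that logic, using a throwaway directory in the current working directory (the xrsync_demo names are only for this demo):

```shell
# reproduce the pdir/fname derivation from xrsync.sh on a file
# reached through a symlinked directory
mkdir -p xrsync_demo/real
ln -sfn "$(pwd -P)/xrsync_demo/real" xrsync_demo/link
touch xrsync_demo/real/sample.txt
file=xrsync_demo/link/sample.txt
pdir=$(cd -P $(dirname $file); pwd)   # physical parent directory, symlinks resolved
fname=$(basename $file)
echo "$pdir/$fname"
```

Even though the file was addressed through xrsync_demo/link, pdir ends in xrsync_demo/real, so the same on-disk path is created on the remote host.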

4.6 Cluster configuration

  • 4.6.1 Configure Hadoop's JAVA_HOME
[v2admin@hadoop10 ~]$ cd /opt/hadoop-3.1.3/etc/hadoop
[v2admin@hadoop10 ~]$ vim hadoop-env.sh
// modify the JAVA_HOME entry
export JAVA_HOME=/usr/local/jdk8

[v2admin@hadoop10 ~]$ xrsync.sh hadoop-env.sh   // sync the configuration file to the other two hosts
  • 4.6.2 Core configuration

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- address of the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop10:9820</value>
    </property>

    <!-- storage directory for Hadoop data -->
    <property>
        <name>hadoop.data.dir</name>
        <value>/opt/hadoop-3.1.3/data</value>
    </property>

    <property>
        <name>hadoop.proxyuser.v2admin.hosts</name>
        <value>*</value>
    </property>

    <property>
        <name>hadoop.proxyuser.v2admin.groups</name>
        <value>*</value>
    </property>

    <!-- static user for web UI access -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>v2admin</value>
    </property>
</configuration>
  • 4.6.3 HDFS configuration

hdfs-site.xml

<configuration>
    <!-- storage directory for NameNode data -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file://${hadoop.data.dir}/name</value>
    </property>

    <!-- storage directory for DataNode data -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file://${hadoop.data.dir}/data</value>
    </property>

    <!-- storage directory for SecondaryNameNode (2NN) data -->
    <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>file://${hadoop.data.dir}/namesecondary</value>
    </property>

    <property>
        <name>dfs.client.datanode-restart.timeout</name>
        <value>30</value>
    </property>

    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop10:9870</value>
    </property>
</configuration>
  • 4.6.4 YARN configuration

yarn-site.xml

<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop11</value>
    </property>

    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>

    <!-- enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>

    <property>
        <name>yarn.log.server.url</name>
        <value>http://hadoop10:19888/jobhistory/logs</value>
    </property>

    <!-- keep aggregated logs for one week -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
</configuration>
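The retention value above is simply one week expressed in seconds:

```shell
# 604800 s = 7 days × 24 h × 3600 s,
# the value of yarn.log-aggregation.retain-seconds above
week_seconds=$(( 7 * 24 * 3600 ))
echo $week_seconds
```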
  • 4.6.5 MapReduce configuration

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>

    <!-- history server address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop10:10020</value>
    </property>

    <!-- history server web address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop10:19888</value>
    </property>
</configuration>
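One more file worth checking before distributing: in Hadoop 3.x, start-dfs.sh and start-yarn.sh read etc/hadoop/workers to learn which hosts run DataNodes and NodeManagers. Assuming the node layout planned above, it would contain:

```
hadoop10
hadoop11
hadoop12
```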
  • 4.6.6 Use the script to distribute the finished configuration files across the cluster

4.7 A script to start the cluster

To start the cluster you would otherwise have to run the relevant start commands on each server. To start the whole cluster and view startup information from one place, write a startup script, startMyCluster.sh. Note: before the very first start, format the NameNode once on hadoop10 with hdfs namenode -format (reformatting later wipes the HDFS metadata).

#!/bin/bash
if [ $# -lt 1 ]
then
    echo "Not enough arguments input!"
    exit 1
fi

case $1 in
# start up
"start")
    echo "==========start hdfs============="
    ssh hadoop10 /opt/hadoop-3.1.3/sbin/start-dfs.sh
    echo "==========start historyServer============"
    ssh hadoop10 /opt/hadoop-3.1.3/bin/mapred --daemon start historyserver
    echo "==========start yarn============"
    ssh hadoop11 /opt/hadoop-3.1.3/sbin/start-yarn.sh
;;
# shut down
"stop")
    echo "==========stop hdfs============="
    ssh hadoop10 /opt/hadoop-3.1.3/sbin/stop-dfs.sh
    echo "==========stop yarn============"
    ssh hadoop11 /opt/hadoop-3.1.3/sbin/stop-yarn.sh
    echo "==========stop historyserver===="
    ssh hadoop10 /opt/hadoop-3.1.3/bin/mapred --daemon stop historyserver
;;
# view startup information
"jps")
    for i in hadoop10 hadoop11 hadoop12
    do
        echo "==============$i jps================"
        ssh $i /usr/local/jdk8/bin/jps
    done
;;
*)
    echo "Input args error!"
;;
esac

Put it in the bin directory as well so it can be called directly.

4.8 Start the cluster and view startup information

[v2admin@hadoop10 ~]$ startMyCluster.sh start   // start
==========start hdfs=============
Starting namenodes on [hadoop10]
Starting datanodes
Starting secondary namenodes [hadoop12]
==========start historyServer============
==========start yarn============
Starting resourcemanager
Starting nodemanagers
[v2admin@hadoop10 ~]$ startMyCluster.sh jps   // view startup information
==============hadoop10 jps================
1831 NameNode
2504 Jps
2265 JobHistoryServer
1980 DataNode
2382 NodeManager
==============hadoop11 jps================
1635 DataNode
1814 ResourceManager
2297 Jps
1949 NodeManager
==============hadoop12 jps================
1795 NodeManager
1590 DataNode
1927 Jps
1706 SecondaryNameNode

4.9 Possible problems

After installation and deployment, you may hit this error on startup: NoClassDefFoundError: javax/activation/DataSource
I had not used 2.x before and installed 3.x directly this time. The error occurs because the activation jar is missing from YARN's lib directory.
Solution:

cd /opt/hadoop-3.1.3/share/hadoop/yarn/lib
wget https://repo1.maven.org/maven2/javax/activation/activation/1.1.1/activation-1.1.1.jar