Building a Hadoop Cluster on Alibaba Cloud

Time: 2020-09-24

1 Introduction

Before describing how to build a Hadoop cluster, we should first clarify the two terms Hadoop and cluster. Hadoop is a distributed system infrastructure developed by the Apache Foundation; put simply, it is the foundation on which big data applications are built. A cluster can be understood as multiple servers running Hadoop. The purpose of building a Hadoop cluster is to manage multiple servers and coordinate the work among them. This article uses three Alibaba Cloud servers. The figure below gives a general picture of the overall big data architecture.

[Figure: overall big data architecture]

Hadoop mainly consists of HDFS (a distributed file storage system), YARN (cluster resource management and scheduling), and MapReduce (a distributed computing framework). A Hadoop cluster is divided into a master and slaves; in this article one Alibaba Cloud server is configured as the master and the other two as slaves. For HDFS, the process running on the master is the NameNode and the process on each slave is a DataNode. The NameNode maintains the HDFS file system tree and the metadata of all files and folders in that tree; it can be thought of as the information shown in a folder's Properties dialog on Windows. The DataNodes are where the data is actually stored and retrieved; they can be thought of as the actual contents of a folder on Windows.

[Figure: HDFS architecture (NameNode and DataNodes)]
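Once the cluster from section 2 is up, the split between metadata and data can be observed directly. The following is a minimal sketch; the path / is just an example:

#The NameNode answers metadata queries such as the file system listing
hdfs dfs -ls /
#fsck reports which blocks make up each file and which DataNodes hold them
hdfs fsck / -files -blocks -locations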

For YARN (cluster resource management and scheduling), the process on the master is the ResourceManager and the process on each slave is a NodeManager. The ResourceManager is the global resource manager, responsible for resource management and allocation across the whole system. A NodeManager is the resource and task manager on its node, and it regularly reports the node's resource usage to the ResourceManager.
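As a quick illustration, once the cluster is running (section 2.4), the ResourceManager can list the NodeManagers that have registered with it:

#List the NodeManagers known to the ResourceManager and their states
yarn node -list -all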

MapReduce, the distributed computing framework, splits a large job into many tasks that are processed in parallel; the details of the computation model are easy to look up online.
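To get a feel for the computation, Hadoop ships with example MapReduce jobs. A minimal word-count run might look like the sketch below; the jar path assumes the install location used later in this article, and /input is a hypothetical HDFS directory that must already contain some text files:

#Submit the built-in wordcount example to YARN
hadoop jar /usr/local/hadoop-3.2.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar \
    wordcount /input /output
#Inspect the result
hdfs dfs -cat /output/part-r-00000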

2 Hadoop cluster building


2.1 Server system settings

Most online tutorials create multiple Linux systems in virtual machines to build a Hadoop cluster. Since virtual machines have their drawbacks, I use Alibaba Cloud servers instead. The purchase and configuration of the Alibaba Cloud servers will be introduced later.

  • Hosts file modification

hosts is a system file without an extension. Its basic function is to maintain an associated "database" between commonly used hostnames and their corresponding IP addresses. When a user enters an address, the system first looks for the corresponding IP in the hosts file; if it is found, the corresponding page is opened immediately, and if not, the name is submitted to a DNS server for resolution. For example, when visiting the local machine, entering 127.0.0.1 and localhost have the same effect. Modifying the hosts file here means adding the IP-to-hostname mappings of the three servers.

vi /etc/hosts

Add the following mappings:

172.27.66.8 master
172.27.66.10 slave1
172.27.66.9 slave2

Afterwards, accessing a server by hostname or by IP has the same effect.
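A quick way to confirm the mapping works is to ping each hostname from any of the servers:

#Each name should resolve to the IP added above
ping -c 1 master
ping -c 1 slave1
ping -c 1 slave2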

Set the hostname of each of the three servers to match its role, running the corresponding command on each server:

hostnamectl set-hostname master
hostnamectl set-hostname slave1
hostnamectl set-hostname slave2
  • SSH passwordless login

To allow the servers to access each other over SSH without a password, run the following command on each of the three servers. After it finishes, a key pair is generated under /root/.ssh.

ssh-keygen -t rsa
#On slave1 and slave2, send id_rsa.pub to the master and rename it
scp id_rsa.pub root@master:~/.ssh/id_rsa.pub.slave1
scp id_rsa.pub root@master:~/.ssh/id_rsa.pub.slave2

On the master, under /root/.ssh, append id_rsa.pub, id_rsa.pub.slave1, and id_rsa.pub.slave2 to authorized_keys.

cat id_rsa.pub >> authorized_keys 
cat id_rsa.pub.slave1 >> authorized_keys 
cat id_rsa.pub.slave2 >> authorized_keys

Then copy authorized_keys back to slave1 and slave2:

scp authorized_keys root@slave1:~/.ssh 
scp authorized_keys root@slave2:~/.ssh

Finally, adjust the file permissions.

chmod 755 ~
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys  
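To verify passwordless login, SSH from the master to each slave; no password prompt should appear:

#Should print the remote hostname without asking for a password
ssh slave1 hostname
ssh slave2 hostname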

2.2 Hadoop installation

Download addresses for Hadoop and its component software:

Tsinghua mirror: https://mirrors.tuna.tsinghua…

  • Modifying the configuration files of each module

The Hadoop version is 3.2.1. Extract the package hadoop-3.2.1.tar.gz to /usr/local:

tar -zxvf hadoop-3.2.1.tar.gz -C /usr/local

After extraction, enter /usr/local/hadoop-3.2.1/etc/hadoop and modify the following configuration files: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and workers.

#Command to open a file
vi core-site.xml

In core-site.xml, add the following configuration between the <configuration> and </configuration> tags, and adjust the hadoop.tmp.dir path to match your own system.

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop-3.2.1/tmp</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
</configuration>

In hdfs-site.xml, add the following configuration between the <configuration> and </configuration> tags, and adjust the dfs.namenode.name.dir and dfs.datanode.data.dir paths. dfs.replication is the replica count, set here to match the number of DataNodes.

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop-3.2.1/hdfs/namenode</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop-3.2.1/hdfs/datanode</value>
    </property>

    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>

    <property>
      <name>dfs.permissions</name>
      <value>false</value>
      <description>need not permissions</description>
    </property>

    <property>
        <name>dfs.namenode.http-address</name>
        <value>master:50070</value>
    </property>
</configuration>
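Hadoop normally creates these directories when the NameNode is formatted, but they can also be created up front so the paths referenced in core-site.xml and hdfs-site.xml definitely exist (a sketch for the paths used above):

#Create the temporary, NameNode, and DataNode directories
mkdir -p /usr/local/hadoop-3.2.1/tmp
mkdir -p /usr/local/hadoop-3.2.1/hdfs/namenode
mkdir -p /usr/local/hadoop-3.2.1/hdfs/datanode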

In yarn-site.xml, add the following configuration between the <configuration> and </configuration> tags.

<configuration>
<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>

    <property>
        <description>The address of the applications manager interface in the RM.</description>
        <name>yarn.resourcemanager.address</name>
        <value>${yarn.resourcemanager.hostname}:8032</value>
    </property>

    <property>
        <description>The address of the scheduler interface.</description>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>${yarn.resourcemanager.hostname}:8030</value>
    </property>

   <property>
        <description>The http address of the RM web application.</description>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>${yarn.resourcemanager.hostname}:18088</value>
   </property>

   <property>
        <description>The https address of the RM web application.</description>
        <name>yarn.resourcemanager.webapp.https.address</name>
        <value>${yarn.resourcemanager.hostname}:18090</value>
   </property>

   <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>${yarn.resourcemanager.hostname}:8031</value>
   </property>

   <property>
        <description>The address of the RM admin interface.</description>
        <name>yarn.resourcemanager.admin.address</name>
        <value>${yarn.resourcemanager.hostname}:8033</value>
   </property>

   <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

In mapred-site.xml, add the following configuration between the <configuration> and </configuration> tags.

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
 
    <property>
      <name>mapred.job.tracker</name>
      <value>master:9001</value>
    </property>
</configuration>

In workers, add the following lines.

slave1
slave2
  • Environment variable setting

In both hadoop-env.sh and yarn-env.sh, add the Java installation path as an environment variable so that Hadoop can find the JDK:

export JAVA_HOME=/usr/local/jdk1.8.0_261
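One way to do this is to append the line to both files directly (paths assume the install locations used in this article):

#Add JAVA_HOME to hadoop-env.sh and yarn-env.sh
echo 'export JAVA_HOME=/usr/local/jdk1.8.0_261' >> /usr/local/hadoop-3.2.1/etc/hadoop/hadoop-env.sh
echo 'export JAVA_HOME=/usr/local/jdk1.8.0_261' >> /usr/local/hadoop-3.2.1/etc/hadoop/yarn-env.sh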


2.3 JDK 8 installation

The Java package is jdk-8u261-linux-x64.tar.gz; extract it to /usr/local:

tar -zxvf jdk-8u261-linux-x64.tar.gz -C /usr/local

Set the environment variables:

vi /etc/profile
#Add the following
JAVA_HOME=/usr/local/jdk1.8.0_261
CLASSPATH=$JAVA_HOME/lib/
PATH=$PATH:$JAVA_HOME/bin
export PATH JAVA_HOME CLASSPATH
#Immediate effect environment variable
source /etc/profile
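To confirm the JDK is picked up correctly, check the version (a minimal sanity check):

#Should report java version "1.8.0_261"
java -version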

2.4 Hadoop cluster test

#Copy the configured Hadoop directory from the master to the two slaves
scp -r /usr/local/hadoop-3.2.1 root@slave1:/usr/local
scp -r /usr/local/hadoop-3.2.1 root@slave2:/usr/local

#Add the Hadoop environment variables (e.g. in /etc/profile, as with the JDK)
export HADOOP_HOME=/usr/local/hadoop-3.2.1
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
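Reload the profile and confirm Hadoop is on the PATH:

#Make the variables take effect and verify
source /etc/profile
hadoop version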

#Format namenode under the host
/usr/local/hadoop-3.2.1/bin/hdfs namenode -format
#Starting and shutting down HDFS
start-dfs.sh
stop-dfs.sh
#Start and shut down yarn
start-yarn.sh
stop-yarn.sh
#Start all
start-all.sh
stop-all.sh
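One caveat when everything runs as root, as in this setup: the Hadoop 3 start scripts may refuse to start the daemons and report an error about operating on the namenode as root. If that happens, declare the run-as users, for example in /etc/profile (these values assume a root-only setup):

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root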

After starting the cluster, running jps on the master shows:

[Screenshot: jps output on the master]

Running jps on a slave shows:

[Screenshot: jps output on a slave]

Running hdfs dfsadmin -report shows:

[Screenshot: hdfs dfsadmin -report output]
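As a final smoke test, you can write a file into HDFS and read it back (the directory and file names here are only examples):

#Create a directory, upload a local file, and list it
hdfs dfs -mkdir -p /test
hdfs dfs -put /etc/hosts /test/
hdfs dfs -ls /test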

Summary

Building the Hadoop cluster step by step is not particularly difficult.


This article is published by the WeChat official account "Big Data Analyst Knowledge Sharing".