CentOS 7 Hadoop 3.3.1 Installation (standalone, pseudo-distributed, fully distributed)





Official image download addresses: CentOS, Hadoop, Java
CentOS: CentOS-7-x86_64-DVD-2009
Hadoop: hadoop-3.3.1.tar.gz
Java: jdk-8u301-linux-x64.tar.gz
PS: Hadoop 3.x requires Java 1.8 as the minimum version
This is a plain vanilla installation, not CDH, HDP, or Ambari


Modify the hostname

Modify the hostname in /etc/sysconfig/network:

vi /etc/sysconfig/network

Or:

hostnamectl set-hostname hadoop1
# press Ctrl+D in SecureCRT and reconnect

Modify the /etc/hosts file

vi /etc/hosts
# In addition to the localhost entries, add the IPs and corresponding names of the other servers, e.g.:
127.0.0.1   localhost hadoop1 localhost4 localhost4.localdomain4
::1         localhost hadoop1 localhost6 localhost6.localdomain6
# <server-ip>  hadoop2
# <server-ip>  hadoop3

Restart the server

Turn off firewall

#Shut down
systemctl stop firewalld
#Prohibit startup and self start
systemctl disable firewalld

Create Hadoop user

# Create the user and use /bin/bash as its login shell
$ useradd -m hadoop -s /bin/bash

# Set a password for the hadoop user. If told the password is invalid, don't worry; just enter it again
$ passwd hadoop

# Grant sudo permission to hadoop
$ visudo
# type :98 to jump to line 98 and add the line "hadoop ALL=(ALL) ALL" below the root entry
root   ALL=(ALL) ALL
hadoop ALL=(ALL) ALL

Install SSH and set up passwordless login

Single-machine passwordless login — configuring SSH passwordless login on Linux

Check whether SSH is installed

systemctl status sshd

If it is installed, the SSH service status (active) will be displayed. Otherwise, run the following commands to install it:

# -y agrees to all prompts so you don't have to press y each time
yum install openssh-clients -y
yum install openssh-server -y

Test available

# Confirm the connection at the prompt, then enter the current user's password. If the user has no password yet, you will be prompted to create one
ssh localhost

Set password free login

# ~ is the user's home directory, i.e. /home/<username>; if your username is hadoop, ~ means /home/hadoop/
cd ~/.ssh/                         # if this directory does not exist, run ssh localhost once first
ssh-keygen -t rsa                  # press Enter at every prompt
cat id_rsa.pub >> authorized_keys  # add the key to the authorized keys
chmod 600 ./authorized_keys        # fix the file permissions

At this point you can log in with the ssh localhost command without entering a password.

Configuring Java variables in Linux Environment

Check whether a Java installation directory already exists.
If so, remove the built-in Java environment first.

[root@hadoop1 ~]# java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
[root@hadoop1 ~]# which java
[root@hadoop1 ~]# ls -lrt /usr/bin/java
lrwxrwxrwx. 1 root root 22 Aug  6  2020 /usr/bin/java -> /etc/alternatives/java
[root@hadoop1 ~]# ls -lrt /etc/alternatives/java
lrwxrwxrwx. 1 root root 73 Aug  6  2020 /etc/alternatives/java -> /usr/lib/jvm/java-1.8.0-openjdk-

Additional knowledge
The installation directory is /usr/lib/jvm
1. For a JDK installed through rpm, there is no environment-variable configuration in the profile file, yet java commands execute normally. The packages are installed strictly according to Linux's path conventions, and the relevant commands are linked into the /usr/bin directory; wherever we run them, the system finds them in /usr/bin.

/usr/bin — installation directory for executable commands
/usr/lib — location of the library functions used by programs
/usr/share/doc — location of basic software user manuals
/usr/share/man — location of help files
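The symlink chain the alternatives system creates (described above) can be followed with readlink; this is just a check, and assumes nothing about where the JDK actually lives:

```shell
# Resolve the full symlink chain /usr/bin/java -> /etc/alternatives/java -> JDK
readlink -f /usr/bin/java 2>/dev/null || echo "java is not installed"
```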

Installing java-1.8.0-openjdk.x86_64 with yum or up2date only installs the JRE by default; commands such as jps and jstat are missing. You also need the development package: look for the -devel suffix and run yum install java-1.8.0-openjdk-devel.x86_64.
A note on uninstalling Java:

rpm -qa | grep jdk
# the output will list packages like the one below

# perform the uninstall
yum -y remove copy-jdk-configs-3.3-10.el7_5.noarch

If there is no Java environment yet:

Configuring Java environment variables

One option is to modify /etc/profile, which applies globally;
the other applies only to the current user: modify ~/.bash_profile.

Add the following contents to the file:

export JAVA_HOME=/usr/local/java/jdk1.8.0_301
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib

After saving and exiting, run source /etc/profile (or source the corresponding ~/.bash_profile if you edited the per-user file) to refresh the environment variables.
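As a small illustration of why `source` is needed — it applies the new variables to the current shell — here is a sketch using a temporary file in place of /etc/profile (the JDK path matches the example above):

```shell
# Write a mini profile and source it into the current shell
cat > /tmp/profile_demo.sh <<'EOF'
export JAVA_HOME=/usr/local/java/jdk1.8.0_301
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
EOF
source /tmp/profile_demo.sh
echo "$JAVA_HOME"
```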

Install hadoop-3.3.0.tar.gz

Upload the archive to the server, or download it directly there.
Choose a directory to extract into, such as /opt

# -C specifies the extraction directory
tar -zxvf hadoop-3.3.0.tar.gz -C /opt

After decompression, check whether Hadoop is available

#Switch to the Hadoop installation directory for execution. If successful, the version information will be displayed
$ cd /opt/hadoop-3.3.0
$ ./bin/hadoop version

We can also add Hadoop to the environment variables here, so we don't have to run commands from the bin directory every time

# edit the profile file
vi /etc/profile
# add the Hadoop environment variables
export HADOOP_HOME=/opt/hadoop-3.3.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# refresh the environment variables after saving
source /etc/profile


Standalone (non-distributed) mode

It is mainly used for debugging.
Hadoop ships with a wealth of examples (run ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar to see them all), including wordcount, terasort, join, grep, etc.
Here we experiment with grep:

$ mkdir ./input
$ cp ./etc/hadoop/*.xml ./input   # use the configuration files as the input
$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ./input ./output 'dfs[a-z.]+'
$ cat ./output/*

Pseudo-distributed mode

1. Modify the configuration files

The Hadoop configuration files are located under <installation directory>/etc/hadoop/

[root@hadoop1 hadoop]# pwd

Two configuration files need to be modified for pseudo-distributed mode: core-site.xml and hdfs-site.xml

# Modify the configuration file hadoop-env.sh

# set to the root of your Java installation
export JAVA_HOME=/usr/local/java/jdk1.8.0_301

Modify the configuration file core-site.xml.

Modify the configuration file hdfs-site.xml.
Once the cluster is set up, Hadoop provides its own web UI access page.
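A minimal sketch of the usual pseudo-distributed settings for these two files (the hadoop.tmp.dir path is an assumption — adjust it to your install):

```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-3.3.0/tmp</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single datanode, so replication factor 1 -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```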


2. Format the namenode

./bin/hdfs namenode -format

3. Start the namenode and datanode processes

./sbin/start-dfs.sh

# after startup completes, use the jps command to check whether startup succeeded
[hadoop@hadoop1 hadoop-3.3.0]$ jps
32081 NameNode
32561 Jps
32234 DataNode


If you see
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Method 1
Find the sbin folder in the Hadoop installation directory
and modify four files in it.

1. For the start-dfs.sh and stop-dfs.sh files, add the following parameters below the #!/usr/bin/env bash line (root here is an assumption; use whichever user runs the daemons):

HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

2. For the start-yarn.sh and stop-yarn.sh files, add the following parameters below the #!/usr/bin/env bash line:

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

Then restart.
The second method is recommended (practical):

cd /opt/hadoop-3.3.0/etc/hadoop
vim hadoop-env.sh
# add the user exports for the daemons there
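Assuming the daemons run as root (substitute your own user if not), the exports to add to hadoop-env.sh would be:

```shell
# Users for the HDFS and YARN daemons -- root is an assumption here;
# use whichever user actually runs the services
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
```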

The log files are written to the logs folder under the installation directory.
You can now access the web page configured earlier: http://localhost:9870/

4. Operate the cluster

  1. Create an input folder on the HDFS file system
./bin/hdfs dfs -mkdir -p /user/test/input
  2. Upload the test file contents to the file system
./bin/hdfs dfs -put input/core-site.xml /user/test/input/
  3. Check that the uploaded file is correct
./bin/hdfs dfs -ls /user/test/input/

./bin/hdfs dfs -cat /user/test/input/core-site.xml
  4. Run the MapReduce program
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /user/test/input/core-site.xml /user/test/output/

./bin/hdfs dfs -cat /user/test/output/*

View the result in the browser.

5. Start YARN (optional in pseudo-distributed mode)

Above we used ./sbin/start-dfs.sh to start Hadoop, which only brings up the HDFS environment for MapReduce. We can also start YARN and let it take charge of resource management and task scheduling.

  1. Modify the configuration file mapred-site.xml
  2. Modify the configuration file yarn-site.xml

PS: there is no need to configure yarn.resourcemanager.hostname in pseudo-distributed mode.
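A minimal sketch of the usual settings for these two files:

```xml
<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml: enable the shuffle auxiliary service -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```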
3. Start YARN
Before starting, make sure the namenode and datanode are already running:

# start YARN
./sbin/start-yarn.sh

You can access the ResourceManager at http://localhost:8088/
4. Start the history server
View task execution on the web at http://localhost:19888/

mapred --daemon start historyserver


# the history server must be stopped separately
mapred --daemon stop historyserver

6. Run test examples

$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/input
# use the configuration files as the input files
$ bin/hdfs dfs -put etc/hadoop/*.xml /user/input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /user/input/core-site.xml /user/output/

$ bin/hdfs dfs -cat /user/output/*

# the output directory must not already exist when Hadoop runs a job
$ bin/hdfs dfs -rm -r /user/output

7. Configure log aggregation to HDFS

After an application runs, uploading its run logs to HDFS makes it easy to review execution details for development and debugging.

Modify the configuration file yarn-site.xml
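A sketch of the log-aggregation properties usually added to yarn-site.xml (the 7-day retention value is an assumption):

```xml
<configuration>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <!-- keep aggregated logs for 7 days (604800 seconds) -->
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
</configuration>
```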


Note: enabling log aggregation requires restarting the NodeManager, ResourceManager, and JobHistory server.
Re-run the wordcount above, and you can then find the corresponding logs in JobHistory.


Fully distributed

Prepare three servers and repeat the preliminary steps above on each (note that passwordless SSH must be configured); get the downloads and installation ready. (If using VMware, shut the machine down before cloning.)
Attention!!! All three machines must be configured for passwordless SSH.
The ssh-copy-id command is used here to copy the local host's public key into the remote host's authorized_keys file; ssh-copy-id also sets appropriate permissions on the remote host's home directory, ~/.ssh, and ~/.ssh/authorized_keys.

[root@hadoop1 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub root@hadoop2
[root@hadoop1 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub root@hadoop3
# and likewise from the other machines

Linux manual — ssh-copy-id


Cluster plan:

       hadoop1              hadoop2                        hadoop3
HDFS   NameNode, DataNode   DataNode                       SecondaryNameNode, DataNode
YARN   NodeManager          ResourceManager, NodeManager   NodeManager

Cluster: modify the host names

[root@hadoop1 ~] hostnamectl set-hostname hadoop1
[root@hadoop2 ~] hostnamectl set-hostname hadoop2
[root@hadoop3 ~] hostnamectl set-hostname hadoop3

If you are in SecureCRT, press Ctrl+D and log in again to see the new hostname.

Modify the configuration files

They are located in <hadoop directory>/etc/hadoop
In hadoop-env.sh, find line 52 and configure your own JDK path

  1. hadoop-env.sh
cd /opt/{hadoop}/etc/hadoop
# specify the JDK path
export JAVA_HOME=/usr/local/java/jdk1.8.0_301
  2. core-site.xml
  3. hdfs-site.xml
  4. yarn-site.xml
  5. mapred-site.xml
  6. workers — add all master and slave nodes and delete localhost

Before Hadoop 3, the workers file was called slaves.
Modify these on the master node [hadoop1], then transfer them to the other two nodes.
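A sketch of what this looks like, assuming the install path /opt/hadoop-3.3.0 (adjust to yours):

```shell
# etc/hadoop/workers lists every node; delete the default localhost line.
# Written to the current directory here purely as an illustration:
cat > workers <<'EOF'
hadoop1
hadoop2
hadoop3
EOF

# Then push the whole config directory from hadoop1 to the other nodes:
# scp -r /opt/hadoop-3.3.0/etc/hadoop root@hadoop2:/opt/hadoop-3.3.0/etc/
# scp -r /opt/hadoop-3.3.0/etc/hadoop root@hadoop3:/opt/hadoop-3.3.0/etc/
```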

Single-node startup

First format the namenode; execute on hadoop1:

hdfs namenode -format

Note: format the namenode as rarely as possible

Run start-all.sh on the master node:

sh /export/servers/hadoop-3.3.1/sbin/start-all.sh
# or cd into the sbin path and run it from there
If an error like this occurs when starting the services:

Starting namenodes on [hadoop1]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.
Starting secondary namenodes [hadoop3]
ERROR: Attempting to operate on hdfs secondarynamenode as root
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.
Starting resourcemanager
ERROR: Attempting to operate on yarn resourcemanager as root
ERROR: but there is no YARN_RESOURCEMANAGER_USER defined. Aborting operation.
Starting nodemanagers
ERROR: Attempting to operate on yarn nodemanager as root
ERROR: but there is no YARN_NODEMANAGER_USER defined. Aborting operation.

Solution 1
vi /etc/profile
# add the following configuration to the environment variables (root is an assumption; use the user that runs the daemons)
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

Make it take effect:

source /etc/profile
Solution 2:

In start-dfs.sh and stop-dfs.sh (in the sbin directory of the Hadoop installation), add the following parameters at the top of both files:

HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

In start-yarn.sh and stop-yarn.sh (in the sbin directory of the Hadoop installation), add the following parameters at the top of both files:

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

Start corresponding services as planned

# hadoop1
bin/hdfs --daemon start namenode
bin/hdfs --daemon start datanode

bin/yarn --daemon start nodemanager

If a datanode fails to start, or goes down by itself after starting:
1. Clean up the Hadoop data and temp files
2. Reset the namenode clusterID to its initial value
Check the log, and look under /var/data/hadoop/dfs/data/current

# hadoop2
bin/hdfs --daemon start datanode

bin/yarn --daemon start resourcemanager
bin/yarn --daemon start nodemanager
# hadoop3
bin/hdfs --daemon start datanode
bin/hdfs --daemon start secondarynamenode

bin/yarn --daemon start nodemanager

Note: the 127.0.0.1 mapping in /etc/hosts should be commented out here; if it is not, the nodes cannot connect to each other during startup.

Cluster startup

Execute sbin/start-dfs.sh directly on the namenode node

# hadoop1
sbin/start-dfs.sh

# hadoop2 and hadoop3 start their datanodes separately
bin/hdfs --daemon start datanode

# hadoop2 (where the ResourceManager runs)
sbin/start-yarn.sh

# hadoop1, hadoop3
bin/yarn --daemon start nodemanager

Collecting data to HDFS on a schedule with a shell script

Find the corresponding data/logs directory; if it does not exist, create it in advance

Configure the environment variables

Create upload2HDFS.sh in this directory. Writing the Java environment variables into the script is mainly to improve reliability: the script can run even if the environment variables are not configured globally

vi upload2HDFS.sh

Enter the following:

export JAVA_HOME="your JDK installation path"
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

export HADOOP_HOME="your Hadoop path"
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH

Prepare the log storage directory and the files to be uploaded

## the file directories need to be created in advance
# directory where logs are stored
log_src_dir="custom"                 # e.g. /export/data/logs/log/
# staging directory for files waiting to be uploaded
log_toupload_dir="custom/toupload"   # e.g. /export/data/logs/toupload/

Set the HDFS path for uploading the log files

The file names end with a timestamp; print some information as we go

# set the date
date1=`date -d last-day +%Y_%m_%d`
# root directory on HDFS to upload the log files to
hdfs_root_dir="custom/$date1/"       # e.g. /data/clickLog/$date1/
# print environment variable information
echo "envs: hadoop_home: $HADOOP_HOME"
# read the log file directory and check whether there are files to upload
echo "log_src_dir:"$log_src_dir

Implement the file upload

ls $log_src_dir | while read fileName
do
  if [[ "$fileName" == access.log.* ]]; then
      date=`date +%Y_%m_%d_%H_%M_%S`
      # move the file to the staging directory and rename it
      echo "moving $log_src_dir$fileName to $log_toupload_dir"xxxxx_click_log_$fileName"$date"
      mv $log_src_dir$fileName $log_toupload_dir"xxxxx_click_log_$fileName"$date
      # append the path of the file to be uploaded to a list file, willDoing
      echo $log_toupload_dir"xxxxx_click_log_$fileName"$date >> $log_toupload_dir"willDoing."`date +%Y_%m_%d`
  fi
done

Finally, transfer the files from the staging directory to HDFS

# locate the list file willDoing
ls $log_toupload_dir | grep will | grep -v "_COPY_" | grep -v "_DONE_" | while read line
do
    # print information
    echo "toupload is in file:"$line
    # rename the list file to willDoing_COPY_
    mv $log_toupload_dir$line $log_toupload_dir$line"_COPY_"
    # read the contents of willDoing_COPY_ (the file names to be uploaded, one per line)
    # here each inner line is the path of one file to be uploaded
    cat $log_toupload_dir$line"_COPY_" | while read line
    do
        # print information
        echo "puting...$line to hdfs path...$hdfs_root_dir"
        hadoop fs -mkdir -p $hdfs_root_dir
        hadoop fs -put $line $hdfs_root_dir
    done
    mv $log_toupload_dir$line"_COPY_" $log_toupload_dir$line"_DONE_"
done

It is now executed once a day at 12 midnight; a Linux crontab expression performs the scheduled task

0 0 * * * /shell/upload2HDFS.sh

The six crontab fields are minute, hour, day of month, month, day of week, and the command

Summary of startup commands

sbin/start-all.sh   starts all Hadoop daemons: namenode, secondarynamenode, datanode, resourcemanager, nodemanager
sbin/stop-all.sh    stops all Hadoop daemons: namenode, secondarynamenode, datanode, resourcemanager, nodemanager

sbin/start-dfs.sh   starts the Hadoop HDFS daemons: namenode, secondarynamenode, datanode
sbin/stop-dfs.sh    stops the Hadoop HDFS daemons: namenode, secondarynamenode, datanode

sbin/start-yarn.sh  starts resourcemanager and nodemanager
sbin/stop-yarn.sh   stops resourcemanager and nodemanager

# individual start/stop
bin/hdfs --daemon start namenode
bin/hdfs --daemon start datanode
bin/hdfs --daemon start secondarynamenode

bin/yarn --daemon start resourcemanager
bin/yarn --daemon start nodemanager

mapred --daemon start historyserver
mapred --daemon stop historyserver


# old-style commands
sbin/hadoop-daemons.sh start namenode            start the namenode daemon individually
sbin/hadoop-daemons.sh stop namenode             stop the namenode daemon individually
sbin/hadoop-daemons.sh start datanode            start the datanode daemon individually
sbin/hadoop-daemons.sh stop datanode             stop the datanode daemon individually
sbin/hadoop-daemons.sh start secondarynamenode   start the secondarynamenode daemon individually
sbin/hadoop-daemons.sh stop secondarynamenode    stop the secondarynamenode daemon individually
sbin/yarn-daemon.sh start resourcemanager        start the resourcemanager individually
sbin/yarn-daemons.sh start nodemanager           start the nodemanager individually
sbin/yarn-daemon.sh stop resourcemanager         stop the resourcemanager individually
sbin/yarn-daemons.sh stop nodemanager            stop the nodemanager individually
sbin/mr-jobhistory-daemon.sh start historyserver start jobhistory manually
sbin/mr-jobhistory-daemon.sh stop historyserver  stop jobhistory manually

Cloud servers
For virtual machines, NAT mode is recommended; configure the network adapter accordingly

Hadoop 3 port number changes

Category   Application   Hadoop 2.x port   Hadoop 3 port
NN ports   Namenode      8020              9820
NN ports   NN HTTP UI    50070             9870
NN ports   NN HTTPS UI   50470             9871
SNN ports  SNN HTTP      50091             9869
SNN ports  SNN HTTP UI   50090             9868
DN ports   DN IPC        50020             9867
DN ports   DN            50010             9866
DN ports   DN HTTP UI    50075             9864
DN ports   DN HTTPS      50475             9865
New features of Hadoop 3
  • Based on JDK 1.8 (minimum version requirement)
  • Removes outdated APIs and implementations; HFTP is replaced by WebHDFS
  • Classpath isolation: new, to prevent conflicts between different versions of jar packages
  • Shell script rewrite (fixes bugs in the Hadoop 2 scripts; the startup commands differ too — running the Hadoop 3 scripts is recommended, roughly one third is different)
  • Supports HDFS erasure coding: the default EC policy can save 50% of storage space and tolerate more storage failures (adds recovery on top of Hadoop 2)
  • A disk balancer is added inside the datanode for load balancing between disks (suppose a server's disks are full of datanode data; you can plug in a new disk, but the old disks stay full while the new one is empty, causing data skew — Hadoop 3's disk balancer automatically redistributes data from the full disks to the others)
  • MapReduce task-level native optimization
  • Automatic inference of MapReduce memory parameters
    • mapreduce.{map,reduce}.memory.mb and mapreduce.{map,reduce}.java.opts (these had to be configured in Hadoop 2; in 3, the required memory is inferred automatically per task, so 3 runs faster than 2)
    • cgroup-based memory isolation and IO disk isolation
    • Supports resizing of allocated resource containers

Reference: https://blog.csdn.net/qq_35975685/article/details/84311627
