Detailed process of building a YARN (hadoop-2.2.0) environment

Time: 2021-1-20

Sharpening the axe will not delay the cutting of firewood. Before we start, a question:

What is the relationship between MapReduce and YARN?

A: YARN is not the next-generation MapReduce (MRv2). The next-generation MapReduce has exactly the same programming interface and data-processing engine (MapTask and ReduceTask) as the first generation (MRv1); MRv2 can be seen as reusing those modules of MRv1. The difference lies in resource management and job management. In MRv1, both resource management and job management were implemented by the JobTracker, which combined the two functions. In MRv2 the two are separated: job management is handled by an ApplicationMaster, while resource management is done by the new system, YARN. Because YARN is general-purpose, it can also serve as the resource-management system for computing frameworks other than MapReduce, such as Spark and Storm. Frameworks running on YARN are generally called "X on YARN", e.g. "MapReduce on YARN", "Spark on YARN", "Storm on YARN".

Hadoop 2.0 consists of three subsystems: HDFS, YARN and MapReduce. YARN is a new resource-management system, while MapReduce is just one application running on YARN. If YARN is regarded as a cloud operating system, MapReduce can be regarded as an app running on that operating system.

That is the relationship between MapReduce and YARN. Now let's formally build the environment.

Environment preparation: see steps 1 to 6 of 《Building the hadoop-0.20.2 environment》.

System: ubuntu-12.04 (other versions are also available)

Mode: pseudo-distributed

Build user: hadoop

Hadoop-2.2.0 download address: http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.2.0/
Choose the installation package you need; here we choose hadoop-2.2.0.tar.gz.
Apache Hadoop mirror list: http://www.apache.org/dyn/closer.cgi/hadoop/common/

Note 1: the directory where I configure hadoop-2.2.0 is /home/hadoop.
Note 2: a yarn directory is created under /home/hadoop; both the hadoop-2.2.0 directory and the Hadoop data directory live under this yarn directory.
Note 3: in the steps below you can replace /home/hadoop with your own directory.

Step 1: upload hadoop-2.2.0.tar.gz and extract it into the /home/hadoop/yarn directory; this produces the hadoop-2.2.0 directory under yarn:

tar -zxvf hadoop-2.2.0.tar.gz
sudo chown -R hadoop:hadoop hadoop-2.2.0

Create Hadoop data directory:

mkdir -p /home/hadoop/yarn/yarn_data/hdfs/namenode
mkdir -p /home/hadoop/yarn/yarn_data/hdfs/datanode

Before editing the configuration files, here is a brief tour of the folders in the hadoop-2.2.0 directory; note the differences between hadoop-2.2.0 and hadoop-1.

The outer startup scripts are in the sbin directory

The inner scripts they call are in the bin directory

Native .so files are all in the lib/native directory

The configuration helper scripts are placed in libexec

The configuration files are all in the etc directory, corresponding to the conf directory of earlier versions

All jar packages are under the share/hadoop directory

Step 2: configure environment variables

I did not make the environment variables global, so I did not configure hadoop-2.2.0 in the system-wide /etc/profile.
If you do configure it there, execute source /etc/profile to make it take effect.
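If you do want the variables to be global, entries like the following in /etc/profile are typical (a sketch only; the variable names follow common Hadoop conventions and the paths assume the /home/hadoop/yarn layout described above):

```shell
# Hypothetical global environment variables for this layout
export HADOOP_HOME=/home/hadoop/yarn/hadoop-2.2.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Put the bin and sbin directories on PATH so hadoop/hdfs commands work anywhere
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```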

Step 3: configure core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml

Next, we configure these files in the /home/hadoop/yarn/hadoop-2.2.0/etc/hadoop directory.

core-site.xml:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
        <description>Specify the IP address and port of the NameNode</description>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Number of replicas</description>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/yarn/yarn_data/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/yarn/yarn_data/hdfs/datanode</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property> 
        <name>mapreduce.framework.name</name> 
        <value>yarn</value> 
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>localhost:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>localhost:19888</value>
    </property>
</configuration>        

yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>localhost:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>localhost:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>localhost:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>localhost:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>localhost:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

Step 4: slaves configuration

Because it is pseudo-distributed, the only node is localhost
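For reference, the slaves file lives at etc/hadoop/slaves and lists one worker hostname per line; in pseudo-distributed mode it holds only localhost. A minimal sketch of writing it (HADOOP_CONF_DIR is an assumed variable pointing at hadoop-2.2.0/etc/hadoop; the sketch falls back to the current directory so it can run anywhere):

```shell
# Pseudo-distributed mode: the slaves file lists the single worker, localhost.
# HADOOP_CONF_DIR is assumed to point at hadoop-2.2.0/etc/hadoop; fall back
# to the current directory for illustration.
conf_dir="${HADOOP_CONF_DIR:-.}"
echo "localhost" > "$conf_dir/slaves"
```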

Step 5: distribute the configured hadoop-2.2.0 to each data node

Because it is pseudo-distributed, skip this step.

Step 6: format namenode

Execute command:

bin/hdfs namenode -format

or

bin/hadoop namenode -format

Step 7: start HDFS and YARN

Start HDFS:

sbin/start-dfs.sh

Start YARN:

sbin/start-yarn.sh

Alternatively, execute

sbin/start-all.sh

to start HDFS and YARN together.

In addition, you need to start the history server, otherwise the history links in the web UI cannot be opened.

sbin/mr-jobhistory-daemon.sh start historyserver

Next, use the jps command to view the started processes:

4504 ResourceManager
4066 DataNode
4761 NodeManager
5068 JobHistoryServer
4357 SecondaryNameNode
3833 NameNode
5127 Jps

Step 8: Test

HDFS test:

Create a directory in HDFS: bin/hadoop fs -mkdir /wordcount
Upload a file to HDFS: bin/hadoop fs -put /home/hadoop/file2.txt /wordcount
List the HDFS root directory: bin/hdfs dfs -ls /
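The file2.txt used here can be any text file; given the word counts shown in the results below, its contents would have been something like this (an illustrative reconstruction, not the original file):

```shell
# Create a sample wordcount input whose counts match the results shown later:
# hadoop 1, hello 2, java 4, jsp 1
cat > file2.txt <<'EOF'
hello java
java jsp
hello java java hadoop
EOF
```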

YARN test: run the wordcount example program:

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /wordcount /output2

View the results:

bin/hadoop fs -cat /output2/*

The results are as follows:

hadoop  1
hello   2
java    4
jsp 1

That completes the hadoop-2.2.0 environment setup. The configuration files should be adjusted to your specific requirements; there may be improper settings here, and if you spot any, please point them out.