Hadoop cluster setup

Time:2021-1-27

Setup for cluster

Add User

    sudo addgroup hadoop
    sudo adduser --ingroup hadoop hadoop
    sudo usermod -aG sudo hadoop    # add the hadoop user to the sudo group

Set env

    export JAVA_HOME=jdk_path         # e.g. /usr/lib/jvm/java-6-sun
    export PATH=${JAVA_HOME}/bin:${JAVA_HOME}/jre/bin:${PATH}
    export HADOOP_HOME=hadoop_root    # e.g. /usr/local/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin
    unalias fs &> /dev/null
    alias fs="hadoop fs"
    unalias hls &> /dev/null
    alias hls="fs -ls"
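Before going further it can help to confirm that both variables point at real directories. The following is a minimal sketch; the function name check_home_var is made up for this note and is not part of Hadoop.

```shell
# check_home_var NAME (hypothetical helper)
# Succeeds only if the variable called NAME is set and points at a directory.
check_home_var() {
    _name="$1"
    eval "_value=\${$_name:-}"    # read the variable whose name is in $_name
    if [ -z "$_value" ]; then
        echo "$_name is not set" >&2
        return 1
    fi
    if [ ! -d "$_value" ]; then
        echo "$_name=$_value is not a directory" >&2
        return 1
    fi
    echo "$_name=$_value"
}

# Usage, after the exports above have been sourced:
#   check_home_var JAVA_HOME && check_home_var HADOOP_HOME
```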

Hadoop config

  • hadoop-env.sh
    vi etc/hadoop/hadoop-env.sh
    # set JAVA_HOME
    export JAVA_HOME=jdk_path    # e.g. /usr/lib/jvm/java-6-sun
  • yarn-env.sh
    vi etc/hadoop/yarn-env.sh
    # set JAVA_HOME
    export JAVA_HOME=jdk_path    # e.g. /usr/lib/jvm/java-6-sun
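When the same change has to be made on several machines, editing by hand in vi gets tedious. Below is a sketch of a non-interactive alternative; set_java_home is a hypothetical helper name, and it only rewrites an uncommented "export JAVA_HOME=" line (or appends one if none is found).

```shell
# set_java_home FILE JDK_PATH (hypothetical helper)
# Replace the uncommented "export JAVA_HOME=..." line in FILE, or append one.
set_java_home() {
    _file="$1"
    _jdk="$2"
    _tmp="$(mktemp)" || return 1
    if grep -q '^[[:space:]]*export JAVA_HOME=' "$_file"; then
        # Rewrite only the export line; every other line passes through unchanged.
        awk -v jdk="$_jdk" '
            /^[ \t]*export JAVA_HOME=/ { print "export JAVA_HOME=" jdk; next }
            { print }
        ' "$_file" > "$_tmp"
    else
        cat "$_file" > "$_tmp"
        echo "export JAVA_HOME=$_jdk" >> "$_tmp"
    fi
    cat "$_tmp" > "$_file"    # write back through the original file to keep its permissions
    rm -f "$_tmp"
}

# Usage, from the Hadoop root directory:
#   set_java_home etc/hadoop/hadoop-env.sh /usr/lib/jvm/java-6-sun
#   set_java_home etc/hadoop/yarn-env.sh   /usr/lib/jvm/java-6-sun
```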

Configuring all machines

  • generate an SSH key on every machine (as the hadoop user)
    su hadoop
    ssh-keygen -t rsa -P ""
    cd ~/.ssh
    cat id_rsa.pub >> authorized_keys
  • edit /etc/hosts on all machines
    192.168.202.92  master    # must be the machine's hostname
    192.168.202.13  slave

Attention: 1. master and slave must be the machines' actual hostnames, because
MapReduce addresses nodes by hostname; 2. remove any other bindings for the
master/slave hostnames, as in this example:

    127.0.0.1    localhost
    #127.0.1.1  sh030  (attention: with this binding in place, the slave cannot connect to the master at sh030:54310)
    192.168.202.92  sh030
    192.168.202.13  zxx-desktop
    192.168.0.62    jack-desktop
  • copy master id_rsa.pub to slave authorized_keys
    cat id_rsa.pub | ssh hadoop@slave "cat >> /home/hadoop/.ssh/authorized_keys"
  • configure master only
    cat etc/hadoop/masters
    master (hostname)
    cat etc/hadoop/slaves    # used only by helper scripts such as sbin/start-dfs.sh
    master
    slave

Attention: the master and slave names must match the names used in the hosts
file.
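The loopback binding problem described in the hosts section is easy to miss, so a quick check on each machine may be worthwhile. The sketch below is only an illustration: loopback_bound is a hypothetical helper that scans a hosts file for an uncommented 127.x.x.x line naming the given hostname.

```shell
# loopback_bound HOSTS_FILE HOSTNAME (hypothetical helper)
# Exit 0 if HOSTNAME appears on an uncommented 127.x.x.x line of HOSTS_FILE.
loopback_bound() {
    awk -v host="$2" '
        /^[ \t]*#/ { next }                  # skip commented-out lines
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # stop at a trailing comment
                if ($i == host) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Usage on the master (sh030 is the hostname from the example hosts file):
#   if loopback_bound /etc/hosts sh030; then
#       echo "sh030 is bound to a loopback address; remove that line"
#   fi
```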

etc/hadoop/*-site.xml for all machines

  • core-site.xml
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/hadoop</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://master:54310</value>
      <description>The name of the default file system.  A URI whose
      scheme and authority determine the FileSystem implementation.  The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class.  The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>
  • hdfs-site.xml
    <property>
      <name>dfs.replication</name>
      <value>2</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
      </description>
    </property>  

    <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>testHadoop-162:50090</value>
    </property>

    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///data/hdfs/name</value>
    </property>

    <property>
      <name>dfs.namenode.checkpoint.dir</name>
      <value>file:///data/hdfs/checkpoint</value>
    </property>

    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///data/hdfs/data</value>
    </property>

    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>

    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>

    <property>
      <name>dfs.support.broken.append</name>
      <value>true</value>
    </property>
  • yarn-site.xml
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>sh030:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>sh030:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>sh030:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>sh030:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>sh030:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
  • mapred-site.xml
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.address</name>
            <value>master-hadoop:10020</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>master-hadoop:19888</value>
        </property>
    </configuration>
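Since the *-site.xml files have to agree on every machine, it can be handy to read a single property back from a file and compare the result across hosts. The sketch below uses a hypothetical helper, get_prop. It is line-oriented rather than a real XML parser, and it assumes each <name> and <value> sits on its own line, as in the snippets above.

```shell
# get_prop FILE NAME (hypothetical helper)
# Print the <value> that follows <name>NAME</name> in FILE.
get_prop() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { want = 1; next }
        want && /<value>/ {
            sub(/^.*<value>/, "")      # drop everything up to the opening tag
            sub(/<\/value>.*$/, "")    # drop the closing tag and anything after it
            print
            exit
        }
    ' "$1"
}

# Usage, from the Hadoop root directory on each machine:
#   get_prop etc/hadoop/core-site.xml fs.defaultFS
#   get_prop etc/hadoop/hdfs-site.xml dfs.replication
```

On a working installation, `hdfs getconf -confKey fs.defaultFS` reports the value Hadoop actually resolved, which is the more reliable check once the daemons can start.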

Formatting the HDFS filesystem

    bin/hdfs namenode -format

Attention: when converting a single-node setup to a cluster, first delete all
files under /data/hadoop; otherwise the slave DataNode cannot start.

    rm -fr /data/hadoop/*
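A bare rm -fr with a glob is unforgiving if the path is mistyped or a variable turns out to be empty. The following is a cautious sketch of the same cleanup; clear_dir_contents is a hypothetical helper, and unlike the glob it also removes hidden files. It relies on the -mindepth and -delete options of find, which GNU and BSD find provide.

```shell
# clear_dir_contents DIR (hypothetical helper)
# Delete everything inside DIR but keep DIR itself.
clear_dir_contents() {
    _dir="$1"
    # Refuse obviously dangerous targets.
    case "$_dir" in
        ""|"/") echo "refusing to clear '$_dir'" >&2; return 1 ;;
    esac
    [ -d "$_dir" ] || { echo "$_dir is not a directory" >&2; return 1; }
    # -mindepth 1 excludes DIR itself; -delete removes children before parents.
    find "$_dir" -mindepth 1 -delete
}

# Usage on every node, with the daemons stopped:
#   clear_dir_contents /data/hadoop
```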

Launch HDFS

    ./sbin/start-dfs.sh
    # launches NameNode, SecondaryNameNode and DataNode (the master also runs a DataNode)

MapReduce

    ./sbin/start-yarn.sh
    jps    # on the master
    29252 DataNode             (the master also acts as a slave)
    29940 NodeManager          (the master also acts as a slave)
    29051 NameNode
    29732 ResourceManager
    29515 SecondaryNameNode

    jps    # on the slave
    27858 DataNode
    28116 NodeManager
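Comparing the jps listing against the expected daemons by eye works for two machines but scales poorly. Here is a small sketch that does the comparison; missing_daemons is a hypothetical helper, and the daemon names are the ones shown in the listings above.

```shell
# missing_daemons JPS_OUTPUT DAEMON... (hypothetical helper)
# Print each expected daemon name that does not appear in the given jps output.
missing_daemons() {
    _out="$1"
    shift
    for _d in "$@"; do
        # jps prints "<pid> <MainClass>"; compare the second column exactly,
        # so NameNode is not satisfied by SecondaryNameNode.
        if ! printf '%s\n' "$_out" |
            awk -v d="$_d" '$2 == d { ok = 1 } END { exit (ok ? 0 : 1) }'; then
            echo "$_d"
        fi
    done
}

# Usage on the master:
#   missing_daemons "$(jps)" NameNode SecondaryNameNode DataNode ResourceManager NodeManager
# Usage on the slave:
#   missing_daemons "$(jps)" DataNode NodeManager
```

An empty result means every expected daemon is running.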

Stopping the cluster

  • MapReduce
    ./sbin/stop-yarn.sh 
  • HDFS
    ./sbin/stop-dfs.sh

Test

  • put data
    hadoop fs -mkdir /testdata
    hadoop fs -put -f ./*.txt /testdata
  • mapreduce
    hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount /testdata /testdata-output
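A MapReduce job refuses to start if its output directory already exists, so rerunning the test needs a cleanup step, and the result still has to be read back from HDFS. The sketch below wraps both steps. The function names and the HADOOP_CMD variable are hypothetical; HADOOP_CMD exists only so the functions can be dry-run with a stand-in command, and it defaults to the real hadoop launcher.

```shell
# HADOOP_CMD is a hypothetical override hook; it defaults to the real launcher.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# reset_output OUTPUT_DIR: remove a previous job's output directory, if any.
reset_output() {
    $HADOOP_CMD fs -rm -r -f "$1"
}

# show_wordcount OUTPUT_DIR [N]: print the first N lines (default 20) of the reducer output.
show_wordcount() {
    # The glob is quoted so that HDFS, not the local shell, expands it.
    $HADOOP_CMD fs -cat "$1/part-r-*" | head -n "${2:-20}"
}

# Usage:
#   reset_output /testdata-output      # before rerunning the job
#   show_wordcount /testdata-output    # after the job has finished
```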