Hadoop cluster setup


Setup for cluster

Add User

    sudo addgroup hadoop
    sudo adduser --ingroup hadoop hadoop
    sudo usermod -aG sudo hadoop (adds the hadoop user to the sudo group)

Set env

    export JAVA_HOME=jdk_path (eg: /usr/lib/jvm/java-6-sun)
    export PATH=${JAVA_HOME}/bin:${JAVA_HOME}/jre/bin:${PATH}
    export HADOOP_HOME=hadoop_root (eg: export HADOOP_HOME=/usr/local/hadoop)
    export PATH=$PATH:$HADOOP_HOME/bin
    unalias fs &> /dev/null
    alias fs="hadoop fs"
    unalias hls &> /dev/null
    alias hls="fs -ls"
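
To make these settings survive a new login, they can be appended to the hadoop user's shell rc file. A minimal sketch with an idempotency guard (RC_FILE and the JDK/Hadoop paths are the example values from above; adjust for your machine):

```shell
# Append the Hadoop environment setup once (the marker line keeps
# repeated runs from duplicating the block).
RC_FILE="${RC_FILE:-$HOME/.bashrc}"
MARKER='# >>> hadoop env >>>'
if ! grep -qF "$MARKER" "$RC_FILE" 2>/dev/null; then
  cat >> "$RC_FILE" <<'EOF'
# >>> hadoop env >>>
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/usr/local/hadoop
export PATH=${JAVA_HOME}/bin:${JAVA_HOME}/jre/bin:${PATH}:${HADOOP_HOME}/bin
alias fs="hadoop fs"
alias hls="fs -ls"
# <<< hadoop env <<<
EOF
fi
```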

Hadoop config

  • hadoop-env.sh
    vi etc/hadoop/hadoop-env.sh
    change JAVA_HOME
    export JAVA_HOME=jdk_path (eg: /usr/lib/jvm/java-6-sun)
  • yarn-env.sh
    vi etc/hadoop/yarn-env.sh
    change JAVA_HOME
    export JAVA_HOME=jdk_path (eg: /usr/lib/jvm/java-6-sun)
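
Both env scripts can be patched non-interactively with sed instead of editing them by hand. The sketch below works on scratch stand-in files so it is safe to try; point CONF at $HADOOP_HOME/etc/hadoop on a real install (the JDK path is the example one from above):

```shell
# Patch JAVA_HOME in hadoop-env.sh and yarn-env.sh in one pass.
# CONF points at scratch copies here; use $HADOOP_HOME/etc/hadoop for real.
CONF="${CONF:-/tmp/hadoop-conf-demo}"
mkdir -p "$CONF"
# stand-in files carrying the shipped default line:
echo 'export JAVA_HOME=${JAVA_HOME}' > "$CONF/hadoop-env.sh"
echo 'export JAVA_HOME=${JAVA_HOME}' > "$CONF/yarn-env.sh"
for f in hadoop-env.sh yarn-env.sh; do
  sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-6-sun|' "$CONF/$f"
done
```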

Configuring all machines

  • configure all machines
    su hadoop
    ssh-keygen -t rsa -P ""
    cd ~/.ssh
    cat id_rsa.pub >> authorized_keys
  • modify /etc/hosts on all machines so the master and every slave hostname
    resolves to its real address

Attention: 1. use the hostname (not a raw IP) for the master/slave entries,
because MapReduce addresses nodes by hostname; 2. remove any other binding of
the master/slave hostnames. For example, if /etc/hosts also binds the master
hostname to a loopback address (127.0.0.1 localhost sh030), the slaves cannot
connect to the master at sh030:54310; only the real entries for sh030,
zxx-desktop and jack-desktop should remain.
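
The hosts pitfall above can be checked mechanically. The sketch below uses a demo hosts file with placeholder addresses (only 10.1.1.231 appears elsewhere in these notes; the others are illustrative):

```shell
# Demo /etc/hosts for a three-node cluster (addresses are placeholders).
cat > /tmp/hosts.demo <<'EOF'
127.0.0.1   localhost
10.1.1.230  sh030         # master (do NOT also bind sh030 to 127.x)
10.1.1.231  zxx-desktop   # slave
10.1.1.232  jack-desktop  # slave
EOF
# Fail if the master hostname is bound to a loopback address -- that is
# exactly the binding that stops slaves reaching the master at sh030:54310.
if grep -Eq '^127\.[0-9.]+[[:space:]]+.*sh030' /tmp/hosts.demo; then
  echo 'BAD: sh030 bound to loopback'
else
  echo 'OK: sh030 resolves to a routable address'
fi
```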
  • copy master id_rsa.pub to slave authorized_keys
    cat id_rsa.pub | ssh [email protected] "cat >> /home/hadoop/.ssh/authorized_keys"
  • configure master only
    cat etc/hadoop/masters
    master (hostname)
    cat etc/hadoop/slaves (this file is used only by helper scripts such as bin/start-dfs.sh)

Attention: the names in masters/slaves must match the hostnames configured in /etc/hosts
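
One more detail for the passwordless-SSH steps above: sshd silently ignores authorized_keys unless the permissions are strict, which is a common reason the master still prompts for a password. A minimal fix, run as the hadoop user on every node:

```shell
# sshd requires ~/.ssh to be 700 and authorized_keys to be 600 (not
# group/world writable), otherwise public-key login is refused.
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"
touch "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```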

etc/hadoop/*-site.xml for all machines

  • core-site.xml (the property wrappers below are a reconstruction: the
    surviving descriptions match the standard hadoop.tmp.dir and
    fs.default.name properties, and the values are inferred from the
    /data/hadoop path and sh030:54310 address used elsewhere in this guide)
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://sh030:54310</value>
        <description>The name of the default file system.  A URI whose
        scheme and authority determine the FileSystem implementation.  The
        uri's scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class.  The uri's authority is used to
        determine the host, port, etc. for a filesystem.</description>
      </property>
  • hdfs-site.xml (wrapper reconstructed around the surviving description;
    the value is an example and should not exceed the number of DataNodes)
      <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
      </property>







  • yarn-site.xml
  • mapred-site.xml
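
The notes leave these last two files empty. A minimal working pair for Hadoop 2.x might look like the sketch below, assuming sh030 (the master) runs the ResourceManager:

```xml
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>sh030</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```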

Formatting the HDFS filesystem

    bin/hadoop namenode -format

Attention: when moving from a single-node setup to a cluster, first delete
everything under /data/hadoop on every node; otherwise the slave DataNodes
cannot start (the old namespace IDs no longer match the re-formatted NameNode):
    rm -rf /data/hadoop/
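
To avoid repeating that cleanup by hand on each node, the loop below generates (but does not run) a cleanup script. The hostnames are the ones used in this guide, and the destructive rm stays inside the generated file so it can be reviewed before running:

```shell
# Emit a reviewable script that wipes /data/hadoop on every node over SSH.
NODES='sh030 zxx-desktop jack-desktop'
OUT=/tmp/clean-datadirs.sh
{
  echo '#!/bin/sh'
  for node in $NODES; do
    echo "ssh hadoop@$node 'rm -rf /data/hadoop'"
  done
} > "$OUT"
chmod +x "$OUT"
# review the file, then run:  sh /tmp/clean-datadirs.sh
```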

Launch HDFS and YARN

    sbin/start-dfs.sh
    sbin/start-yarn.sh

    start-dfs.sh launches the NameNode, SecondaryNameNode and the DataNodes
    (the master also runs as a DataNode); start-yarn.sh launches the
    ResourceManager and the NodeManagers


    jps on Master
    29252 DataNode (Master also as slave)
    29940 NodeManager (Master also as slave)
    29051 NameNode
    29732 ResourceManager
    29515 SecondaryNameNode

    jps on Slave
    27858 DataNode
    28116 NodeManager
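
A quick way to confirm all five master daemons came up is to scan the jps listing for each expected name. The sample output from above is reused here for illustration; on a live master, set JPS_OUT="$(jps)" instead:

```shell
# Check that every expected daemon name appears in the jps listing.
JPS_OUT='29252 DataNode
29940 NodeManager
29051 NameNode
29732 ResourceManager
29515 SecondaryNameNode'
missing=0
for d in NameNode SecondaryNameNode DataNode ResourceManager NodeManager; do
  if echo "$JPS_OUT" | grep -qw "$d"; then
    echo "$d: up"
  else
    echo "$d: MISSING"
    missing=1
  fi
done
test "$missing" -eq 0 && echo 'all master daemons running'
```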

Stopping the cluster

  • MapReduce (YARN): sbin/stop-yarn.sh
  • HDFS: sbin/stop-dfs.sh


  • put data
    hadoop fs -mkdir /testdata
    hadoop fs -put -f ./*.txt /testdata
  • mapreduce
    hadoop jar ./share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.3.0-sources.jar org.apache.hadoop.examples.WordCount /testdata /testdata-output
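
For orientation, WordCount's map/shuffle/reduce pipeline computes the same result as this classic shell one-liner does locally (illustration only; the real job reads /testdata from HDFS):

```shell
# Local sketch of what the WordCount job computes.
printf 'hello world\nhello hadoop\n' > /tmp/wc-input.txt
tr -s ' ' '\n' < /tmp/wc-input.txt | sort | uniq -c | sort -rn
# the real job's result can afterwards be read with:
#   hadoop fs -cat /testdata-output/part-r-00000
```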