Hadoop’s “Hello world” — wordcount

Time: 2020-11-25

After installing and configuring the Hadoop environment, you need to run an example to verify that the configuration is correct. Hadoop ships with a simple wordcount program, which counts the number of occurrences of each word in a set of input files. This program can be regarded as the “Hello world” of Hadoop.

MapReduce

Principle

MapReduce is based on the idea of divide and conquer: a large dataset is split up and handed to a number of nodes, each node processes its own part, and the partial results are then merged into the final result. Because the nodes can process their data in parallel, the overall job takes far less time than doing all the work on a single machine.

Process

As the name suggests, MapReduce consists of two stages, map and reduce, each implemented as a function. Each map function produces intermediate output, and the reduce function then takes the intermediate outputs produced by the map functions and combines them into the final output.
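
To make the idea concrete before touching Hadoop at all, here is a minimal sketch in plain Java (the class name MiniWordCount is just for illustration) that applies the same map-then-reduce pattern to the two lines used later in this post:

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniWordCount {
    public static void main(String[] args) {
        String[] lines = { "Hello World", "Hello Hadoop" };

        // "Map" stage: turn every line into a list of <word, 1> pairs.
        List<SimpleEntry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                intermediate.add(new SimpleEntry<>(word, 1));
            }
        }

        // "Reduce" stage: group the pairs by word and sum the counts.
        Map<String, Integer> counts = new TreeMap<>();
        for (SimpleEntry<String, Integer> pair : intermediate) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }

        System.out.println(counts); // prints {Hadoop=1, Hello=2, World=1}
    }
}

Hadoop does essentially the same thing, except that the map and reduce stages run as separate tasks spread across the cluster, and the intermediate <word, 1> pairs are shuffled between them.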

Wordcount demo

Preliminary work

Start Hadoop

cd /usr/hadoop/hadoop-2.6.2/
sbin/start-dfs.sh
sbin/start-yarn.sh

Create local data files

cd ~/
mkdir ~/file
cd file
echo "Hello World" > test1.txt
echo "Hello Hadoop" > test2.txt

This creates two txt files, each containing one line: “Hello World” and “Hello Hadoop”. The result we expect from wordcount is: Hello 2, World 1, Hadoop 1.

Create input folder on HDFS

hadoop fs -mkdir /input

Once it is created, we can run

hadoop fs -ls /

to check that the /input directory now exists.


Transfer the data file to the input directory

hadoop fs -put ~/file/test*.txt /input

Similarly, we can run

hadoop fs -ls /input 

to check whether the upload was successful.


If no files show up, there is a problem with the Hadoop configuration, or the firewall has not been turned off, so the nodes cannot connect to each other.

Run the program

Run wordcount

hadoop jar /usr/hadoop/hadoop-2.6.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.2.jar wordcount /input /output

After running this command, Hadoop starts a JVM to run the MapReduce job and creates an /output folder on HDFS to store the results.

While the job runs, the terminal prints the progress of the map and reduce tasks.


Note:

  • The path to the examples jar must match your Hadoop installation, otherwise Hadoop will report that the jar does not exist.

  • The /output folder must not already exist: Hadoop creates it itself and refuses to run if the output path is already present.

View results

There are now two files in the /output folder (one of them is the empty _SUCCESS marker). The result we need is in the file part-r-00000.

hadoop fs -cat /output/part-r-00000

We can see the final wordcount result: Hadoop 1, Hello 2, World 1.


Wordcount source code analysis

Map process

Source code:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1); // each word is counted once
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the line held in "value" into words and emit a <word, 1> pair for each of them.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

TokenizerMapper inherits from the Mapper class and overrides the map method.

We know that data in MapReduce is passed around as <key, value> pairs. To see what the key and value actually contain, we can print them to the console by adding the following code to the map method:

System.out.println ("key= "+ key.toString ()); // view the key value
System.out.println ("value= "+ value.toString ()); // view value value

After running the program, we can find the printed keys and values in the console output.


We can see that value in the map method holds one line of the text file, and key is the offset of the first character of that line from the beginning of the file. For our test1.txt, for example, the first (and only) line gives key = 0 and value = "Hello World".
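
To make the offset concrete, here is a small stand-alone Java sketch (not part of the Hadoop job; the class name LineOffsets is just for illustration) that prints each line of a local file together with the offset of its first character. It can be run as, e.g., java LineOffsets test1.txt:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineOffsets {
    public static void main(String[] args) throws IOException {
        long offset = 0; // offset of the first character of the current line
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println("key=" + offset + " value=" + line);
                offset += line.length() + 1; // assumes ASCII text with a single '\n' line ending
            }
        }
    }
}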

In the program, the StringTokenizer class splits each line into words, and the map method outputs <word, 1> pairs as its result.
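
As a quick stand-alone illustration (the class name TokenizerDemo is just for this example), this is all StringTokenizer does here: split a line on whitespace and hand back one word at a time.

import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        // StringTokenizer splits on whitespace by default.
        StringTokenizer itr = new StringTokenizer("Hello Hadoop Hello World");
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken()); // prints Hello, Hadoop, Hello, World (one per line)
        }
    }
}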

Reduce process

Source code:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all of the counts emitted for this word by the map tasks.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // emit <word, total count>
    }
}

Similarly, the reduce side inherits from the Reducer class and overrides the reduce method.

We can see that the input parameters of reduce are a Text key and an Iterable<IntWritable> values. The key is a word, and values is the list of counts emitted for that word by the mappers. Since values implements the Iterable interface, it can be understood as containing multiple IntWritable integers, which are the individual counts.

We then just need to traverse values and sum them up to get the total number of occurrences of each word. For example, for the word Hello the reducer receives the values [1, 1] and writes out <Hello, 2>.

Execute MapReduce

We have written the map function and reduce function, and now we are going to execute MapReduce.

Source code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Separate generic Hadoop options from the remaining input/output path arguments.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // map stage
        job.setCombinerClass(IntSumReducer.class);   // local combine on each mapper
        job.setReducerClass(IntSumReducer.class);    // reduce stage
        job.setOutputKeyClass(Text.class);           // output key type: the word
        job.setOutputValueClass(IntWritable.class);  // output value type: the count
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));   // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); // e.g. /output
        System.exit(job.waitForCompletion(true) ? 0 : 1); // submit the job and wait for it to finish
    }
}

In the code, the job.set*() methods configure the job (which classes to use for the map, combine and reduce stages, and the output key/value types), and job.waitForCompletion() is then called to submit the job and wait for it to finish.

Link to the original text: http://axuebin.com/blog/2016/02/14/hadoop-wordcount/