MapReduce in Hadoop (3): the shuffle mechanism and partitioning

Time: 2021-2-27

1. Shuffle mechanism

Shuffle refers to the data processing that happens after the map() method finishes and before reduce() starts.

Its job is to distribute the output of each map task to the reduce tasks according to the partitioning rules, partitioning and sorting the data along the way.
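When no custom partitioner is configured, Hadoop's default HashPartitioner chooses a partition from the hash of the key. The following is a minimal plain-Java sketch of that formula (the class name is hypothetical, and the Hadoop types are left out for simplicity):

```java
public class DefaultPartitionSketch {
    // Mirrors Hadoop's HashPartitioner: mask off the sign bit,
    // then take the remainder modulo the number of reduce tasks
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always hashes to the same partition,
        // which is what guarantees it reaches the same reducer
        System.out.println(getPartition("class1", 3));
        System.out.println(getPartition("class1", 3));
    }
}
```

Because the default routing depends only on the key's hash, keys that must land in specific output files need a custom Partitioner, as shown below.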

2. Partition

In a MapReduce job, the final output sometimes needs to be split across different files. This is where partitioning comes in.

For instance, suppose we have the following data:

class1_aaa 50
class2_bbb 100
class3_ccc 80
class1_ddd 10
class2_eee 100
class3_fff 70
class1_hhh 150
class2_lll 100
class3_www 80

The requirement is to group the records by their class prefix, sum the values in each group, and write each group's result to its own output file. This calls for a custom partition.
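As a quick sanity check on what the job should produce, the per-class sums can be computed from the sample data with plain Java (the class name is hypothetical; no Hadoop is involved):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExpectedSums {
    // Group by the class prefix and sum the values,
    // mirroring what the MapReduce job below will do
    static Map<String, Integer> sums(String[] lines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            String cls = parts[0].split("_")[0];
            result.merge(cls, Integer.parseInt(parts[1]), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // The sample records from above
        String[] lines = {
            "class1_aaa 50", "class2_bbb 100", "class3_ccc 80",
            "class1_ddd 10", "class2_eee 100", "class3_fff 70",
            "class1_hhh 150", "class2_lll 100", "class3_www 80"
        };
        System.out.println(sums(lines)); // {class1=210, class2=300, class3=230}
    }
}
```

So the three output files should contain class1 210, class2 300, and class3 230 respectively.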

Map code example

public class ClMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Output key and value objects, reused across calls to avoid churn
    private final Text outk = new Text();
    private final IntWritable outv = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line looks like "class1_aaa 50"
        String line = value.toString();
        String[] contents = line.split(" ");
        // Use the class prefix (e.g. "class1") as the output key
        String outkey = contents[0].split("_")[0];

        outk.set(outkey);
        outv.set(Integer.parseInt(contents[1]));

        context.write(outk, outv);
    }
}

Partition code example

public class ClPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
        // The map output key is already the class prefix, so the split is a safety net
        String strStarts = text.toString().split("_")[0];

        // Route each class to its own partition;
        // the return value must be in [0, numPartitions)
        if ("class1".equals(strStarts)) {
            return 0;
        } else if ("class2".equals(strStarts)) {
            return 1;
        } else {
            return 2;
        }
    }
}

Reduce code example

public class ClReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable outv = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all values for this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        outv.set(sum);
        context.write(key, outv);
    }
}

Driver code example

public class ClDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);

        job.setJarByClass(ClDriver.class);

        job.setMapperClass(ClMap.class);
        job.setReducerClass(ClReduce.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //The number of reduce tasks must match the number of partitions (3 here)
        job.setNumReduceTasks(3);
        //Set the custom partitioner
        job.setPartitionerClass(ClPartitioner.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        boolean b = job.waitForCompletion(true);

        System.exit(b ? 0:1);

    }
}