MapReduce in Hadoop (4) serialization and sorting

Time:2021-2-26

1. Hadoop serialization and implementing custom serialization

Serialization converts objects in memory into a byte sequence (or another transfer format) so that they can be sent over the network or persisted to disk.
Deserialization is the reverse process: it converts a received byte sequence, or data persisted on disk, back into objects in memory.
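
As a minimal sketch of this round trip, the plain-JDK example below writes an int and a string into a byte array and reads them back. It relies only on java.io, whose DataOutput and DataInput interfaces are the ones Hadoop's Writable methods receive. The class name SerializationRoundTrip and its two helper methods are made up for illustration and are not part of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class SerializationRoundTrip {
    // Serialization: values in memory -> byte sequence.
    static byte[] serialize(int id, String name) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(id);    // always 4 bytes
            out.writeUTF(name);  // 2-byte length prefix + encoded characters
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialization: byte sequence -> values in memory, read in the order they were written.
    static String deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int id = in.readInt();
            String name = in.readUTF();
            return id + ":" + name;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(42, "abc");
        // prints "9 bytes -> 42:abc" (4 bytes for the int, 2 + 3 for the string)
        System.out.println(data.length + " bytes -> " + deserialize(data));
    }
}
```

The compact, fixed layout is the point: nothing but the field values is written, so the reader has to know the layout in advance.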

Hadoop has its own serialization mechanism and ships with ready-made serializable types such as BooleanWritable, ByteWritable, IntWritable, FloatWritable, LongWritable, DoubleWritable, Text, MapWritable, and ArrayWritable, which we can use directly.

But sometimes these built-in types can't meet our needs, so we have to implement a serializable type ourselves.

How do we do that? Let's look at how Hadoop itself does it, taking IntWritable as an example:

public class IntWritable implements WritableComparable<IntWritable> 

It implements the WritableComparable interface, and our custom class can implement it too. Let's first see which methods this interface requires:

public interface WritableComparable<T> extends Writable, Comparable<T> {
}

Awkwardly, this interface declares nothing of its own. Let's look at the two interfaces it extends.
The first is the Writable interface:

public interface Writable {
    //Serialization method
    void write(DataOutput var1) throws IOException;
    //Deserialization method
    void readFields(DataInput var1) throws IOException;
}

There are two methods to implement: one for serialization and one for deserialization.

The second is the Comparable interface:

public interface Comparable<T> {
    public int compareTo(T o);
}

Why implement this interface?

During the shuffle phase of MapReduce, keys must be sorted, so if a custom serializable class is used as a key, it must implement the method in this interface. If the class is only ever used as a value, implementing Writable alone is enough and Comparable can be omitted.

Example: a serializable StuInfoBean

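import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class StuInfoBean implements WritableComparable<StuInfoBean> {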
    private Integer stuid;
    private String name;
    private Integer age;
    private Boolean sex;

    public StuInfoBean(){}

    public Integer getStuid() {
        return stuid;
    }

    public void setStuid(Integer stuid) {
        this.stuid = stuid;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Integer getAge() {
        return age;
    }

    public void setAge(Integer age) {
        this.age = age;
    }

    public Boolean getSex() {
        return sex;
    }

    public void setSex(Boolean sex) {
        this.sex = sex;
    }

    @Override
    public String toString() {
        return "StuInfoBean{" +
                "stuid=" + stuid +
                ", name='" + name + '\'' +
                ", age=" + age +
                ", sex=" + sex +
                '}';
    }

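    /**
     * Sort order used when this bean is a MapReduce key: ascending stuid.
     * A bean used only as a value can implement Writable instead and drop this method.
     */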
    @Override
    public int compareTo(StuInfoBean o) {
        return this.stuid.compareTo(o.getStuid());
    }

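    // Fields must be non-null here: unboxing a null Integer/Boolean, or calling writeUTF(null), throws NullPointerException.
    @Override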
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(stuid);
        dataOutput.writeUTF(name);
        dataOutput.writeInt(age);
        dataOutput.writeBoolean(sex);
    }

    /**
     * Note: the fields must be read in exactly the same order in which write() wrote them.
     */
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        stuid = dataInput.readInt();
        name = dataInput.readUTF();
        age = dataInput.readInt();
        sex = dataInput.readBoolean();
    }
}
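
One way to see why the read order matters is to round-trip a bean through an in-memory byte array, once correctly and once with the reads swapped. The sketch below uses MiniStuInfo, a hypothetical two-field stand-in for StuInfoBean. It deliberately does not implement Hadoop's Writable so that it runs on a bare JDK, but write and readFields have the same signatures as the interface methods. readFieldsWrongOrder and roundTrip are illustration-only helpers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class MiniStuInfo {
    int stuid;
    String name;

    MiniStuInfo() {}

    MiniStuInfo(int stuid, String name) {
        this.stuid = stuid;
        this.name = name;
    }

    // Same signature as Writable.write
    public void write(DataOutput out) throws IOException {
        out.writeInt(stuid);
        out.writeUTF(name);
    }

    // Same signature as Writable.readFields; mirrors the order used in write()
    public void readFields(DataInput in) throws IOException {
        stuid = in.readInt();
        name = in.readUTF();
    }

    // Deliberately wrong: reads name before stuid
    void readFieldsWrongOrder(DataInput in) throws IOException {
        name = in.readUTF();
        stuid = in.readInt();
    }

    // Serializes to a byte array and deserializes into a fresh instance
    static MiniStuInfo roundTrip(MiniStuInfo original, boolean wrongOrder) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            DataInput in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            MiniStuInfo copy = new MiniStuInfo();
            if (wrongOrder) {
                copy.readFieldsWrongOrder(in);
            } else {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MiniStuInfo original = new MiniStuInfo(1, "abc");
        MiniStuInfo good = roundTrip(original, false);
        MiniStuInfo bad = roundTrip(original, true);
        System.out.println(good.stuid + " [" + good.name + "]"); // prints "1 [abc]"
        System.out.println(bad.stuid + " [" + bad.name + "]");   // prints "65539 []"
    }
}
```

With the swapped reads, the first two bytes of the int are taken as the string length, so the copy silently holds garbage. With other field values the same mistake can surface as an exception instead.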

2. Sorting

MapTask collects the key-value pairs emitted by our map() method and puts them into an in-memory buffer. This is a ring (circular) data structure backed by a byte array, called kvbuffer.
It holds both the record data and the index metadata, kvmeta.
The size of kvbuffer is configurable but finite. When it fills up (by default, once it is about 80% full), the data in memory is flushed to disk; this process is called a spill. A task may produce more than one spill file, and multiple spill files are merged into one large file at the end.
During this process, the Partitioner is called to assign each record to a partition, and the records are sorted by key.
With that as background, let's talk about sorting.
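
The buffer, spill, and merge cycle can be modeled in a few lines of plain Java. This is a deliberately simplified sketch, not Hadoop's implementation: SpillMergeSketch is a hypothetical class, the "buffer" is a list with a fixed record capacity instead of a byte array with a percentage threshold, spills stay in memory instead of going to disk, and partitions are ignored (the real MapTask sorts by partition first and by key second).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class SpillMergeSketch {
    // Collects keys into a bounded buffer; each full buffer is sorted and "spilled" as one run.
    static List<List<String>> spill(List<String> keys, int bufferCapacity) {
        List<List<String>> spills = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (String key : keys) {
            buffer.add(key);
            if (buffer.size() == bufferCapacity) {
                Collections.sort(buffer);
                spills.add(buffer);
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) {          // flush whatever is left at the end
            Collections.sort(buffer);
            spills.add(buffer);
        }
        return spills;
    }

    // Merges the sorted runs into one sorted output (k-way merge using a heap).
    static List<String> merge(List<List<String>> spills) {
        // Each heap entry is {spill index, position inside that spill}.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> spills.get(a[0]).get(a[1]).compareTo(spills.get(b[0]).get(b[1])));
        for (int i = 0; i < spills.size(); i++) {
            heap.add(new int[] {i, 0});   // spill() never produces an empty run
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = spills.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> spills = spill(Arrays.asList("d", "a", "c", "b", "e"), 2);
        System.out.println(spills);        // prints [[a, d], [b, c], [e]]
        System.out.println(merge(spills)); // prints [a, b, c, d, e]
    }
}
```

Because every run is already sorted, the merge only ever compares the current head of each run, which is why sorting at spill time makes the final merge cheap.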

MapTask and ReduceTask both sort data by key. Hadoop does this for every job, whether or not the business logic needs it.

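By default, keys are sorted in dictionary (lexicographic) order for Text keys, with other key types following their own compareTo, and the in-memory sort is implemented with quicksort.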
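
Dictionary order has a consequence that is easy to miss when keys contain numbers. The plain-Java sketch below (KeyOrderSketch is a hypothetical name) uses String and int as stand-ins for Text and IntWritable. For ASCII keys like these, String ordering should give the same result as Text's byte-wise comparison, but that equivalence is an assumption of the sketch.

```java
import java.util.Arrays;

class KeyOrderSketch {
    // Lexicographic order: compared character by character, as for string keys.
    static String[] sortedAsText(String... keys) {
        String[] copy = keys.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Numeric order: compared by value, as for integer keys.
    static int[] sortedAsInts(int... keys) {
        int[] copy = keys.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // prints [class1, class10, class2]: "class10" sorts before "class2"
        System.out.println(Arrays.toString(sortedAsText("class2", "class10", "class1")));
        // prints [1, 2, 10]
        System.out.println(Arrays.toString(sortedAsInts(2, 10, 1)));
    }
}
```

If numeric order is what the job needs, the usual choices are a numeric key type or a custom key with its own compareTo.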

For a custom key type, the sort order is defined by its compareTo method.

    @Override
    public int compareTo(StuInfoBean o) {
        return this.stuid.compareTo(o.getStuid());
    }
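
compareTo is also where a multi-field order would go. As a hedged variation on the method above, the sketch below sorts by age in descending order and breaks ties by ascending stuid. StudentKey is a hypothetical plain-Java class that leaves out the Writable methods so it runs without Hadoop, and sortedIds is only a helper for showing the resulting order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class StudentKey implements Comparable<StudentKey> {
    final int stuid;
    final int age;

    StudentKey(int stuid, int age) {
        this.stuid = stuid;
        this.age = age;
    }

    @Override
    public int compareTo(StudentKey other) {
        // Primary: age, descending (note the swapped arguments)
        int byAge = Integer.compare(other.age, this.age);
        if (byAge != 0) {
            return byAge;
        }
        // Secondary: stuid, ascending
        return Integer.compare(this.stuid, other.stuid);
    }

    // Returns the stuids in the order defined by compareTo.
    static List<Integer> sortedIds(List<StudentKey> keys) {
        List<StudentKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        List<Integer> ids = new ArrayList<>();
        for (StudentKey key : copy) {
            ids.add(key.stuid);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<StudentKey> keys = Arrays.asList(
                new StudentKey(3, 20), new StudentKey(1, 22), new StudentKey(2, 20));
        System.out.println(sortedIds(keys)); // prints [1, 2, 3]
    }
}
```

Integer.compare is used rather than subtraction so that the comparison cannot overflow.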

3. Combiner

A combiner is also a form of merging. Each map task may produce a large amount of local output. The combiner merges that output on the map side first, which reduces the data transferred between the map and reduce nodes and improves network I/O performance. It is one of the optimization techniques available in MapReduce.
There is one precondition: the combiner must not change the final business result. The combiner's output key-value types must also match the reducer's input key-value types.
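
The sketch below illustrates that precondition in plain Java, with no Hadoop involved: summing partial sums gives the same answer as summing everything, but averaging partial averages does not. The split of values across two map tasks is an assumption made up for the example, and CombinerSafetySketch is a hypothetical class name.

```java
class CombinerSafetySketch {
    static int sum(int... values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    static double average(double... values) {
        double total = 0;
        for (double value : values) {
            total += value;
        }
        return total / values.length;
    }

    public static void main(String[] args) {
        // Assume map task A saw 50 and 10, and map task B saw 150, all for the same key.

        // Sum: combining locally first does not change the result.
        System.out.println(sum(sum(50, 10), sum(150))); // prints 210
        System.out.println(sum(50, 10, 150));           // prints 210

        // Average: combining locally first changes the result.
        System.out.println(average(average(50, 10), average(150))); // prints 90.0
        System.out.println(average(50, 10, 150));                   // prints 70.0
    }
}
```

Sum, count, min, and max can be combined this way; an average cannot be combined directly, only by carrying sum and count separately.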

A combiner is a component of an MR program separate from the Mapper and the Reducer, and its parent class is Reducer.

For example, given the following input:

class1_aaa 50
class2_bbb 100
class3_ccc 80
class1_ddd 10
class2_eee 100
class3_fff 70
class1_hhh 150
class2_lll 100
class3_www 80

If you need the sum of the values for each class, you can write a custom MyCombiner that sums the values locally on each map task and then passes the partial sums on to the reducer.
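
To make the expected result concrete, the sketch below computes the per-class totals for the sample input in plain Java, using the same key extraction (the part before the underscore) as the job. It is a local check only, not part of the MapReduce job, and ClassSumSketch is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;

class ClassSumSketch {
    // Sums the value of each "className_student value" line per class name.
    static Map<String, Integer> sumByClass(String... lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String[] contents = line.split(" ");
            String className = contents[0].split("_")[0];
            totals.merge(className, Integer.parseInt(contents[1]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = sumByClass(
                "class1_aaa 50", "class2_bbb 100", "class3_ccc 80",
                "class1_ddd 10", "class2_eee 100", "class3_fff 70",
                "class1_hhh 150", "class2_lll 100", "class3_www 80");
        System.out.println(totals); // prints {class1=210, class2=300, class3=230}
    }
}
```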

Mapper code

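import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class ClMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reusable output key (class name) and value (score)
    private final Text outk = new Text();
    private final IntWritable outv = new IntWritable();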

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] contents = line.split(" ");
        String outkey = contents[0].split("_")[0];

        outk.set(outkey);
        outv.set(Integer.parseInt(contents[1]));

        context.write(outk,outv);
    }
}

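MyCombiner code

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable v = new IntWritable();
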
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum+=value.get();
        }
        v.set(sum);
        context.write(key,v);
    }
}

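Reducer code

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class ClReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable outv = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Each map task (or spill) may contribute its own partial sum for this key,
        // so the reducer must add them up rather than keep only the last value.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        outv.set(sum);
        context.write(key, outv);
    }
}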

Driver code

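import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class ClDriver {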
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);

        job.setJarByClass(ClDriver.class);

        job.setMapperClass(ClMap.class);
        job.setReducerClass(ClReduce.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Use MyCombiner as the combiner
        job.setCombinerClass(MyCombiner.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean b = job.waitForCompletion(true);

        System.exit(b ? 0:1);

    }
}