Hadoop series (4) MapReduce passing custom beans and sorting the results, analyzing the concurrent number of mapper and reducer

Time: 2020-2-6

Case requirements: compute per-phone-number statistics of upstream and downstream traffic, and group the results by number range
The test data is as follows:

(screenshot: sample test data)

Analysis:
In general, the input and output types of a mapper and reducer are LongWritable, Text, and so on. If we want to pass a custom bean instead, it has to conform to Hadoop's serialization specification. Looking at the source code of LongWritable, we can see that it implements the WritableComparable<LongWritable> interface:

/** A WritableComparable for longs. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class LongWritable implements WritableComparable<LongWritable> {...}

Similarly, for a custom bean to be passed through Hadoop's MapReduce framework, it needs to implement the same interface. The WritableComparable interface is in fact a combination of the Writable and Comparable interfaces, which make a bean serializable and comparable respectively:

@InterfaceAudience.Public
@InterfaceStability.Stable
public interface WritableComparable<T> extends Writable, Comparable<T> {
}

Unlike the default JDK serialization, Hadoop does not serialize a bean's inheritance structure or the interfaces it implements; only the bean's fields are written to the stream, which saves network bandwidth.

Next, we implement a bean that conforms to the Hadoop serialization specification.

FlowBean:
(the getter and setter methods for the fields are omitted)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FlowBean implements Writable {
    
    private String phone;
    private long upStream;
    private long downStream;
    private long sumStream;
    
    /**
     * During deserialization, the reflection mechanism needs a no-argument constructor to call
     */
    public FlowBean() {}
    
    public FlowBean(String phone, long upStream, long downStream) {
        super();
        this.phone = phone;
        this.upStream = upStream;
        this.downStream = downStream;
        this.sumStream = upStream + downStream;
    }

    /**
     * Deserialize the object's fields from the data stream.
     * Fields must be read in the same order in which they were written during serialization.
     */
    public void readFields(DataInput input) throws IOException {
        phone = input.readUTF();
        upStream = input.readLong();
        downStream = input.readLong();
        sumStream = input.readLong();
    }

    /**
     * Serialize the object's fields into the output stream
     */
    public void write(DataOutput output) throws IOException {
        output.writeUTF(phone);
        output.writeLong(upStream);
        output.writeLong(downStream);
        output.writeLong(sumStream);
    }

    /**
     * Output format of the reduce results
     */
    @Override
    public String toString() {
        return "" + upStream + "\t" + downStream + "\t" + sumStream;
    }

}

FlowMapper:

public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        
        String line = value.toString();
        String[] fields = StringUtils.split(line, "\t");
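        // the input line is tab-separated; the phone number is field 1,
        // the upstream and downstream byte counts are fields 7 and 8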
        
        String phone = fields[1];
        long upStream = Long.parseLong(fields[7]);
        long downStream = Long.parseLong(fields[8]);
        
        context.write(new Text(phone), new FlowBean(phone, upStream, downStream));
        
    }

}

FlowReducer:

public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
    
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context)
            throws IOException, InterruptedException {
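        // accumulate the upstream and downstream traffic of all records for this phone number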
        
        long upStreamCounter = 0;
        long downStreamCounter = 0;
        
        for (FlowBean bean: values) {
            upStreamCounter += bean.getUpStream();
            downStreamCounter += bean.getDownStream();
        }
        
        context.write(key, new FlowBean(key.toString(), upStreamCounter, downStreamCounter));
        
    }

}

FlowRunner:
The standard way to implement a runner is to extend the Configured class and implement the Tool interface.

public class FlowRunner extends Configured implements Tool{

    public int run(String[] args) throws Exception {
        
        // use the Configuration that ToolRunner passes in rather than creating a new one
        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        
        job.setJarByClass(FlowRunner.class);
        
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);
        
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        return job.waitForCompletion(true)?0:1;
    }
    
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FlowRunner(), args);
        System.exit(res);
    }

}

To sort the output, for example by total traffic from high to low, FlowBean can directly implement the WritableComparable interface:

public class FlowBean implements WritableComparable<FlowBean> {...}

It must also override the compareTo method (here sorting by total traffic in descending order):

    public int compareTo(FlowBean o) {
        // descending order by total traffic; returning 0 for equal values keeps the comparison consistent
        return Long.compare(o.sumStream, this.sumStream);
    }

Because the MapReduce framework sorts records by the map output key during the shuffle, the FlowBean is now emitted as the key (with NullWritable as the value). Modify the mapper and reducer as follows:

public class SortMR {
    
    public static class SortMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            
            String line = value.toString();
            String[] fields = StringUtils.split(line, "\t");
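            // each input line comes from the previous job's output: phone \t upStream \t downStream \t sumStream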
            String phone = fields[0];
            long upStream = Long.parseLong(fields[1]);
            long downStream = Long.parseLong(fields[2]);
            
            context.write(new FlowBean(phone, upStream, downStream), NullWritable.get());
        }
    }
    
    public static class SortReducer extends Reducer<FlowBean, NullWritable, Text, FlowBean> {
        @Override
        protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            
            String phone = key.getPhone();
            context.write(new Text(phone), key);
        }
    }
    
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        
        job.setJarByClass(SortMR.class);
        
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
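
        // the map output types (FlowBean key, NullWritable value) differ from the final
        // output types, so both pairs must be set explicitly: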
        
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        System.exit(job.waitForCompletion(true)?0:1);
    }
}

The results of the job submitted to YARN are as follows:

(screenshot: job output)

If we want to group the results, that is, to write the traffic statistics for phone numbers in different number ranges to different output files, we need to set the number of concurrent reduce tasks and provide a custom partitioner.
First, define a partitioner class, as follows:

public class AreaPartitioner<KEY, VALUE> extends Partitioner<KEY, VALUE>{
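
    // phone-number prefixes (first three digits) mapped to partition numbers;
    // any prefix not in the map falls into partition 5 (see getPartition)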

    private static HashMap<String,Integer> areaMap = new HashMap<String, Integer>();
    
    static{
        areaMap.put("135", 0);
        areaMap.put("136", 1);
        areaMap.put("137", 2);
        areaMap.put("138", 3);
        areaMap.put("139", 4);
    }
    
    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {
        // look up the partition number for the phone-number prefix (first three digits);
        // prefixes that are not in the map fall into partition 5
        Integer areaCoder = areaMap.get(key.toString().substring(0, 3));
        return areaCoder == null ? 5 : areaCoder;
    }

}

Then add the following to the job configuration in the driver:

    // set the custom partitioning logic
    job.setPartitionerClass(AreaPartitioner.class);

    // Set the number of reduce tasks to match the number of partitions.
    // If it is larger than the number of partitions, the extra reducers simply produce empty
    // output files and no error is reported; if it is smaller (but greater than 1), an error is
    // reported; if it is set to 1, the behaviour is the same as the default: a single reducer
    // runs and produces a single result file.
    job.setNumReduceTasks(6);

The result of job execution will be as follows:

(screenshot: partitioned output files)

If the test data is copied into four copies, as follows:

(screenshot: input directory with the copied test data)

Submit the job to YARN and look at the Java processes after the map tasks have started and before they complete:

(screenshot: running YarnChild processes)

It can be seen that five YarnChild processes execute map tasks at the same time. Because each small file occupies at least one block, and each block is handled by its own map task, the more input files there are, the more map task processes are started, the more resources are consumed, and the lower the efficiency.

In fact, the number of concurrent map tasks is determined by the number of input splits: one map task is started for each split. A split is a logical concept that refers to an offset range of the data within a file, and its size should be tuned according to the size of the files being processed. The process that moves data from the map tasks to the reduce tasks is called the shuffle.
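Although the original example does not change it, a standard way to avoid starting one map task per small file is to switch the input format in the driver. A minimal sketch, assuming the FlowRunner driver above and using CombineTextInputFormat from org.apache.hadoop.mapreduce.lib.input (the 4 MB split limit below is only an example value, not taken from this article):

    // pack many small input files into shared splits so that fewer map tasks
    // (YarnChild processes) are started; add this in run() before submitting the job
    job.setInputFormatClass(CombineTextInputFormat.class);
    // upper bound on the amount of input data assigned to one split (4 MB here, an example value)
    CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);

With this input format, several small files that together stay under the size limit end up in the same split and are processed by a single map task.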
