Parsing parquet logs with Java MapReduce

Time: 2020-11-24

1. Single input format

The input format class tells the job how to read its input files. For Parquet, set ParquetInputFormat and register a ReadSupport implementation:

// Specify the mapper and the Parquet input format
job.setMapperClass(ParquetMap.class);
job.setInputFormatClass(ParquetInputFormat.class);
ParquetInputFormat.addInputPath(job, new Path(args[1]));
ParquetInputFormat.setReadSupportClass(job, CheckLevelRunner.MyReadSupport.class);

// Defines how records are read: delegate to the built-in GroupReadSupport
public static final class MyReadSupport extends DelegatingReadSupport<Group> {
    public MyReadSupport() {
        super(new GroupReadSupport());
    }

    @Override
    public org.apache.parquet.hadoop.api.ReadSupport.ReadContext init(InitContext context) {
        return super.init(context);
    }
}

// Mapper that parses Parquet records (each record arrives as a Group)
static class ParquetMap extends Mapper<Void, Group, Text, Text> {
    @Override
    protected void map(Void key, Group value, Context context) {
        try {
            // Read the first occurrence of the "key1" field
            String md5sha1 = value.getString("key1", 0);
            // Emit the parsed field as the key; the value here is illustrative
            context.write(new Text(md5sha1), new Text(value.toString()));
        } catch (Exception e) {
            // Skip records that are missing the field or otherwise malformed
        }
    }
}
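
Tying the pieces together, a minimal driver sketch might look like the following; the job name, output format, and output-path handling are assumptions for illustration, not from the original post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class CheckLevelRunner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "parse-parquet-logs");
        job.setJarByClass(CheckLevelRunner.class);

        // Parquet input, read through MyReadSupport as Group records
        job.setMapperClass(ParquetMap.class);
        job.setInputFormatClass(ParquetInputFormat.class);
        ParquetInputFormat.addInputPath(job, new Path(args[1]));
        ParquetInputFormat.setReadSupportClass(job, CheckLevelRunner.MyReadSupport.class);

        // Map-only job with plain-text output (an assumption for this sketch)
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // MyReadSupport and ParquetMap from above would be nested here
}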

When parsing Parquet, an empty file in the input directory causes the map task that reads it to fail with an exception.
In this case you can set a MapReduce fault-tolerance parameter:
mapreduce.map.failures.maxpercent: if the percentage of failed map tasks exceeds this value, the whole job fails. The default is 0. With the setting below, the job still succeeds as long as the maps that hit empty files account for no more than 5% of all map tasks, and fails otherwise.

job.getConfiguration().set("mapreduce.map.failures.maxpercent", "5");

2. Multiple input formats

One input directory contains text files and the other contains Parquet files. Use MultipleInputs to assign a separate mapper to each input source; the text-side mapper is sketched after the snippet below.

// Set up multiple inputs, each with its own input format and mapper
MultipleInputs.addInputPath(job, new Path(path1), TextInputFormat.class, NormalMap.class);
MultipleInputs.addInputPath(job, new Path(path2), ParquetInputFormat.class, ParquetMap.class);
ParquetInputFormat.setReadSupportClass(job, CheckLevelRunner.MyReadSupport.class);
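
The original post does not show NormalMap, the mapper for the text input. A minimal sketch, assuming each input line is a tab-separated key/value pair (that layout is an assumption):

// Hypothetical text-side mapper; assumes lines of the form "key<TAB>value"
// (imports needed: java.io.IOException, org.apache.hadoop.io.LongWritable,
//  org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Mapper)
static class NormalMap extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t", 2);
        // Skip lines that do not contain a tab separator
        if (fields.length == 2) {
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }
}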

3. Problems calling an HTTP interface from MapReduce

This error was reported when the program was deployed to the server:
Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE
Investigation showed two versions of httpclient on the classpath: the 4.x version introduced by our own dependency, and the 3.1 version bundled with the org.apache.hadoop packages. The two versions conflicted. In the end we removed the version we had introduced ourselves and used the httpclient 3.1 that ships with the Hadoop package.
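
When debugging this kind of conflict, it can help to print which jar actually supplies the class in question. A small standalone check, not from the original post; run it with the same classpath as the job:

// Prints the jar that the conflicting class was loaded from.
// org.apache.http.client.HttpClient is the 4.x entry point; swap in
// org.apache.commons.httpclient.HttpClient to locate the 3.1 copy.
public class WhichJar {
    public static void main(String[] args) throws ClassNotFoundException {
        Class<?> c = Class.forName("org.apache.http.client.HttpClient");
        System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
    }
}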
