Serialization in Spark 2 (JavaSerializer / KryoSerializer)

Time: 2019-11-21

Environment

JDK    1.8.0 
Hadoop 2.6.0
Scala  2.11.8  
Spark  2.1.2
Oozie  4.1
Hue    3.9 

Brief explanation

  • Official documentation: Data Serialization
  • Spark's default serializer is JavaSerializer, which can automatically handle any object that implements java.io.Serializable, but it is inefficient.
  • KryoSerializer is much more efficient than JavaSerializer, but it does not support every object out of the box: custom classes must be registered manually. If a class is not registered, Kryo writes its full class name with every instance, and performance can end up worse than with JavaSerializer.
  • You can set spark.kryo.registrationRequired=true so that Spark raises an error whenever an unregistered custom class is serialized, instead of silently falling back.
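Custom classes can also be registered purely through configuration, without touching job code, via the spark.kryo.classesToRegister property. A minimal spark-defaults.conf sketch (the com.example.* class names are illustrative placeholders, not part of this article's example):

```properties
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  true
# Comma-separated list of custom classes to register with Kryo
spark.kryo.classesToRegister     com.example.MyRecord,com.example.MyKey
```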

Example

  • Related configuration items
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired=true
  • WordCount example in Java
import dw.common.util.HdfsHelper;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;
import java.util.Arrays;
public class WordCount {
    public static void main(String[] args) throws ClassNotFoundException {
        //Input file
        String wordFile = "/user/qhy/input/wordcount/idea.txt";
        SparkConf sparkConf = new SparkConf();
        sparkConf.registerKryoClasses(new Class<?>[]{
                java.lang.Class.class,
                Object[].class,
                Class.forName("scala.reflect.ClassTag$$anon$1")
        });
        SparkSession spark = SparkSession.builder()
                .appName("WordCount")
                .config(sparkConf)
                .config("spark.executor.instances", 10)
                .config("spark.executor.memory", "4g")
                .config("spark.executor.cores", 1)
                .config("spark.hadoop.mapreduce.output.fileoutputformat.compress", false)
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        JavaRDD<String> hdfstext = jsc.textFile(wordFile);
        // Split each line into words (note the escaped regex "\\s+")
        JavaRDD<String> words = hdfstext.flatMap(line ->
                                        Arrays.asList(line.split("\\s+")).iterator());
        // Map each word to (word, 1)
        JavaPairRDD<String, Integer> pairs = words.mapToPair(word -> new Tuple2<>(word, 1));
        // Sum the counts for each word
        JavaPairRDD<String, Integer> wordCounts = pairs.reduceByKey((v1, v2) -> v1 + v2);
        // Swap K and V so the RDD can be sorted by count
        JavaPairRDD<Integer, String> swapWordCounts = wordCounts.mapToPair(tuple2 -> tuple2.swap());
        // Sort by count in descending order, then collapse to a single partition
        swapWordCounts = swapWordCounts.sortByKey(false, 1).repartition(1);
        // Format each (count, word) pair as one line of text.
        // RDDs are immutable, so the result of map() must be assigned and saved.
        JavaRDD<String> lines = swapWordCounts.map(tuple -> tuple._1 + "\t" + tuple._2);
        // Save the result to HDFS, deleting any previous output first
        String outDir = "/user/qhy/output/wordcount";
        HdfsHelper.deleteDir(jsc, outDir);
        lines.saveAsTextFile(outDir);
        jsc.close();
    }
}

FAQ

  • Kryo errors: the log may contain the following keywords
java.io.EOFException
java.io.IOException: java.lang.NullPointerException
java.lang.IndexOutOfBoundsException
com.esotericsoftware.kryo.KryoException
TorrentBroadcast

First, switch

spark.serializer=org.apache.spark.serializer.KryoSerializer

back to the default

spark.serializer=org.apache.spark.serializer.JavaSerializer

to confirm that the error is actually caused by the serializer and not by something else.
One possible cause is that two Kryo jars, kryo-2.21.jar and kryo-shaded-3.0.3.jar, are both present in the execution environment at the same time; deleting kryo-2.21.jar fixes it.
If Oozie scheduling is used, Oozie must then be restarted, otherwise an error may be reported (JA008: File does not exist).
In Walker's actual environment, the HDFS directory containing the two jars is .../share/lib/lib_20190930130812/spark

  • If the following error is reported, try switching from JavaSerializer to KryoSerializer:
ERROR scheduler.TaskSetManager: Task 0.0 in stage 1.0 (TID 0) had a not serializable result: org.apache.hadoop.io.Text
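The underlying cause is that org.apache.hadoop.io.Text, like other Hadoop Writable types, does not implement java.io.Serializable, so JavaSerializer rejects it, while Kryo does not require that interface. A JDK-only sketch of the same failure mode, using a hypothetical TextLike stand-in class rather than Text itself:

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

public class NotSerializableDemo {
    // Stand-in for org.apache.hadoop.io.Text: a plain class that
    // does NOT implement java.io.Serializable.
    static class TextLike {
        byte[] bytes = new byte[0];
    }

    public static void main(String[] args) throws Exception {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new ByteArrayOutputStream())) {
            // JavaSerializer performs the same writeObject call internally
            out.writeObject(new TextLike());
        } catch (NotSerializableException e) {
            // The same failure TaskSetManager reports for Text results
            System.out.println("NotSerializableException: " + e.getMessage());
        }
    }
}
```

Mapping such values to plain Java types (e.g. via Text.toString()) before collecting them is another common workaround when switching serializers is not an option.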

Related reading

  • Oozie (Hue) schedules Spark 2
  • Solutions to Spark errors: EOFException, Kryo, and SerializedLambda
  • spark-compression-and-serialization
  • Oozie is looking for wrong version of jar files

This article is a snapshot from Walker.