Using foreach / foreachPartition of a DataFrame in Spark with Java

Time: 2021-3-11

Note: Spark has since moved on to 2.x, where DataFrame has been folded into Dataset (DataFrame is now just Dataset[Row]) and the API was unified accordingly, so this article no longer applies to version 2.0.0 and above.


DataFrame natively supports writing directly to JDBC, but if the target table has auto-increment fields (such as an ID column), the DataFrame cannot be written directly, because DataFrame.write().jdbc() requires that the schema of the DataFrame exactly match the table structure of the target table (even the field order must be identical); otherwise it throws an exception. Of course, if you pick Overwrite as the SaveMode, Spark will drop your existing table and then generate a new one from the DataFrame's schema… and the resulting field types can be very, very exotic….
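For illustration, a minimal sketch of that direct write in the Spark 1.x Java API; the URL, table name and properties below are placeholders, not from the original article:

import java.util.Properties;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;

public class DirectJdbcWrite {
    public static void appendTo(DataFrame df, String url, Properties props) {
        // Fails unless df's schema matches the target table exactly,
        // column order included -- an auto-increment ID column breaks this.
        df.write().mode(SaveMode.Append).jdbc(url, "target_table", props);
        // SaveMode.Overwrite would instead drop the table and recreate it
        // from the DataFrame's schema, often with unwanted column types.
    }
}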
So we had to fall back on DataFrame.collect(), pulling the whole DataFrame to the driver as a list of Rows and then writing it with plain JDBC. But if the DataFrame is large, that easily causes a driver OOM (especially since we generally don't give the driver much memory). A real dilemma.
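That collect-then-insert fallback looks roughly like this (a sketch under assumed table and column names, not code from the article):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

public class CollectThenInsert {
    public static void write(DataFrame df, String url, String user, String password) throws Exception {
        Row[] rows = df.collect();  // the whole DataFrame lands in driver memory -> OOM risk
        Connection conn = DriverManager.getConnection(url, user, password);
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO target_table (name, value) VALUES (?, ?)");
        for (Row row : rows) {
            ps.setString(1, row.getString(0));
            ps.setDouble(2, row.getDouble(1));
            ps.addBatch();
        }
        ps.executeBatch();
        ps.close();
        conn.close();
    }
}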
Looking at Spark's JDBC source code, we find that it actually uses the foreachPartition method to insert the rows of each partition of the DataFrame. So why can't we use it directly ourselves?

Part of the source of Spark's JdbcUtils.scala:

  def saveTable(df: DataFrame, url: String, table: String, properties: Properties = new Properties()) {
    val dialect = JdbcDialects.get(url)
    val nullTypes: Array[Int] = df.schema.fields.map { field =>
      dialect.getJDBCType(field.dataType).map(_.jdbcNullType).getOrElse(
        field.dataType match {
          case IntegerType => java.sql.Types.INTEGER
          case LongType => java.sql.Types.BIGINT
          case DoubleType => java.sql.Types.DOUBLE
          case FloatType => java.sql.Types.REAL
          case ShortType => java.sql.Types.INTEGER
          case ByteType => java.sql.Types.INTEGER
          case BooleanType => java.sql.Types.BIT
          case StringType => java.sql.Types.CLOB
          case BinaryType => java.sql.Types.BLOB
          case TimestampType => java.sql.Types.TIMESTAMP
          case DateType => java.sql.Types.DATE
          case t: DecimalType => java.sql.Types.DECIMAL
          case _ => throw new IllegalArgumentException(
            s"Can't translate null value for field $field")
        })
    }

    val rddSchema = df.schema
    val driver: String = DriverRegistry.getDriverClassName(url)
    val getConnection: () => Connection = JDBCRDD.getConnector(driver, url, properties)
    // ****************** here ****************** 
    df.foreachPartition { iterator =>
      savePartition(getConnection, table, iterator, rddSchema, nullTypes)
    }
  }
 

Well… since Scala can do it, Java, which Scala ultimately runs on, should be able to play too!
Let's look at the method signature of foreachPartition:

def foreachPartition(f: Iterator[Row] => Unit)

It's the kind of anonymous function that functional languages love… I hate writing lambdas, so let's implement an anonymous class instead. The abstract class to implement is:
scala.runtime.AbstractFunction1<Iterator<Row>, BoxedUnit>. There are two type parameters. The first is intuitive: the Iterator of Row passed in as the function argument. The second, BoxedUnit, is the function's return value, which may puzzle people unfamiliar with Scala. It is effectively Scala's void: because Scala is functional, a code block must always return something, so Unit was invented to stand in for void (i.e. "nothing"). From Java we can simply use BoxedUnit.UNIT to get hold of that "nothing".
Let's give it a try!

df.foreachPartition(new AbstractFunction1<Iterator<Row>, BoxedUnit>() {
    @Override
    public BoxedUnit apply(Iterator<Row> it) {
        while (it.hasNext()){
            System.out.println(it.next().toString());
        }
        return BoxedUnit.UNIT;
    }
});

Well, the Maven build succeeds, spark-submit~
And it blows up with an exception:
org.apache.spark.SparkException: Task not serializable
Well, when we implemented UDFs before, the UDF1/2/3/4… interfaces all extend Serializable. That is, while Spark runs, the driver serializes the UDF class, and the executor deserializes it and invokes its call method… So this is not hard to understand: the class we pass to foreachPartition must also implement Serializable. Which means we should create an abstract class that extends AbstractFunction1<Iterator<Row>, BoxedUnit> and implements Serializable, and have our anonymous classes extend that instead!

import org.apache.spark.sql.Row;
import scala.collection.Iterator;
import scala.runtime.AbstractFunction1;
import scala.runtime.BoxedUnit;

import java.io.Serializable;

public abstract class JavaForeachPartitionFunc extends AbstractFunction1<Iterator<Row>, BoxedUnit> implements Serializable {
}

But having to return BoxedUnit.UNIT every time is too awkward; it doesn't feel like Java at all. So let's add a Java-style call method:

import org.apache.spark.sql.Row;
import scala.collection.Iterator;
import scala.runtime.AbstractFunction1;
import scala.runtime.BoxedUnit;

import java.io.Serializable;

public abstract class JavaForeachPartitionFunc extends AbstractFunction1<Iterator<Row>, BoxedUnit> implements Serializable {
    @Override
    public BoxedUnit apply(Iterator<Row> it) {
        call(it);
        return BoxedUnit.UNIT;
    }
    
    public abstract void call(Iterator<Row> it);
}

Now we can simply override the call method and write code in plain Java style!

df.foreachPartition(new JavaForeachPartitionFunc() {
    @Override
    public void call(Iterator<Row> it) {
        while (it.hasNext()){
            System.out.println(it.next().toString());
        }
    }
});

Be careful! The method of the anonymous class we implemented actually runs on the executors, so println writes to the stdout of the executor machines. You can see that output through Spark's web UI: open the Executors page of the specific application. (The debugging cluster here is a handful of virtual machines with the horsepower of a walking tractor, so please don't complain about it.)
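With the serializable helper in place, we can finally do what we set out to do: open a JDBC connection per partition on the executor and insert the rows there, letting the database fill in the auto-increment ID. A minimal sketch, assuming a hypothetical table target_table(id auto-increment, name, value) plus url, user and password strings visible to the anonymous class; none of these names come from the original article:

df.foreachPartition(new JavaForeachPartitionFunc() {
    @Override
    public void call(Iterator<Row> it) {
        try {
            // One connection per partition, opened on the executor (needs java.sql.* imports).
            Connection conn = DriverManager.getConnection(url, user, password);
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO target_table (name, value) VALUES (?, ?)");
            while (it.hasNext()) {
                Row row = it.next();
                ps.setString(1, row.getString(0));
                ps.setDouble(2, row.getDouble(1));
                ps.addBatch();
            }
            ps.executeBatch();
            ps.close();
            conn.close();
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
});

Batching one partition at a time keeps the driver out of the data path entirely, which is the whole point of foreachPartition over collect().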

The foreach method works the same way; just replace Iterator<Row> with Row. How exactly? Take your time~~~
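For reference, one possible shape of that foreach helper (my own sketch, not from the original article):

import org.apache.spark.sql.Row;
import scala.runtime.AbstractFunction1;
import scala.runtime.BoxedUnit;

import java.io.Serializable;

public abstract class JavaForeachFunc extends AbstractFunction1<Row, BoxedUnit> implements Serializable {
    @Override
    public BoxedUnit apply(Row row) {
        call(row);          // delegate to the Java-style method
        return BoxedUnit.UNIT;
    }

    public abstract void call(Row row);
}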
have fun~