Batch reading and writing millions of rows between Spark SQL and MySQL

Time:2021-1-27

How Spark SQL reads from MySQL

Spark SQL includes a data source that can read data from other databases using JDBC. It should be preferred over JdbcRDD, because the results are returned as DataFrames, which can be processed easily in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python because it does not require the user to provide a ClassTag.

You can use the Data Sources API to load a table in a remote database as a DataFrame or a Spark SQL temporary view. JDBC connection properties are specified in the data source options; user and password are the usual connection properties for logging in to the data source. In addition to the connection properties, Spark supports the following case-insensitive options (a short read example follows the list):

Property name: Explanation
url: The JDBC URL to connect to.
dbtable: The JDBC table to read from or write to.
query: A query whose result is used as the data to read (an alternative to dbtable).
driver: The class name of the JDBC driver used to connect to the URL.
partitionColumn, lowerBound, upperBound: If any of these options is specified, all of them must be specified. In addition, numPartitions must be specified.
numPartitions: The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, call coalesce(numPartitions) before writing to reduce it to this limit.
queryTimeout: Query timeout in seconds; the default is 0 (no limit).
fetchsize: The JDBC fetch size, which determines how many rows to fetch per round trip. This can help improve the performance of the JDBC driver.
batchsize: The JDBC batch size, which determines how many rows to insert per round trip; the default is 1000. This can help improve the performance of the JDBC driver and applies to writing only.
isolationLevel: The transaction isolation level for the current connection. It can be NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to the standard isolation levels defined by JDBC's Connection object; the default is READ_UNCOMMITTED. This option applies to writing only.
sessionInitStatement: After each database session to the remote database is opened and before reading data starts, this option executes a custom SQL statement, which can be used for session initialization.
truncate: A JDBC writer-related option. When SaveMode.Overwrite is enabled, the contents of the target table are truncated instead of the existing table being dropped and recreated. Defaults to false.
pushDownPredicate: An option to enable or disable predicate push-down to the JDBC data source. The default is true, in which case Spark pushes filters down to the JDBC data source as much as possible.
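
For illustration, here is a minimal read sketch that combines a few of the options above. This is a sketch only: the URL, credentials, and query text are placeholders, the query option assumes Spark 2.4 or later, and an existing SparkSession named spark is assumed.

    val queryDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://127.0.0.1:3306/test")              //placeholder URL
      .option("query", "SELECT id, name FROM mytable WHERE id > 100") //placeholder query
      .option("user", "root")
      .option("password", "root")
      .option("fetchsize", "1000") //rows fetched per round trip
      .load()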

Source code

  • SparkSession
/**
   * Returns a [[DataFrameReader]] that can be used to read non-streaming data in as a
   * `DataFrame`.
   * {{{
   *   sparkSession.read.parquet("/path/to/file.parquet")
   *   sparkSession.read.schema(schema).json("/path/to/file.json")
   * }}}
   *
   * @since 2.0.0
   */
  def read: DataFrameReader = new DataFrameReader(self)
  • DataFrameReader
//... code omitted
  /**
   * Reads all the data into a single RDD partition; with a large table you are
   * likely to hit an OOM. Check the partition count with df.rdd.partitions.size.
   */
  def jdbc(url: String, table: String, properties: Properties): DataFrame = {
    assertNoSpecifiedSchema("jdbc")
    this.extraOptions ++= properties.asScala
    this.extraOptions += (JDBCOptions.JDBC_URL -> url, JDBCOptions.JDBC_TABLE_NAME -> table)
    format("jdbc").load()
  }
  /**
   * @param url the database URL
   * @param table the table name
   * @param columnName the column used for partitioning
   * @param lowerBound the minimum value of `columnName`, used to decide the partition stride
   * @param upperBound the maximum value of `columnName`, used to decide the partition stride
   * @param numPartitions the number of partitions
   * @param connectionProperties other connection parameters
   * @since 1.4.0
   */
  def jdbc(
      url: String,
      table: String,
      columnName: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int,
      connectionProperties: Properties): DataFrame = {
    this.extraOptions ++= Map(
      JDBCOptions.JDBC_PARTITION_COLUMN -> columnName,
      JDBCOptions.JDBC_LOWER_BOUND -> lowerBound.toString,
      JDBCOptions.JDBC_UPPER_BOUND -> upperBound.toString,
      JDBCOptions.JDBC_NUM_PARTITIONS -> numPartitions.toString)
    jdbc(url, table, connectionProperties)
  }

  /**
   * @param predicates the WHERE condition for each partition, e.g.
   *                   "id <= 1000", "score > 1000 and score <= 2000"
   *                   produces two partitions
   * @since 1.4.0
   */
  def jdbc(
      url: String,
      table: String,
      predicates: Array[String],
      connectionProperties: Properties): DataFrame = {
    assertNoSpecifiedSchema("jdbc")
    val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
    val options = new JDBCOptions(url, table, params)
    val parts: Array[Partition] = predicates.zipWithIndex.map { case (part, i) =>
      JDBCPartition(part, i) : Partition
    }
    val relation = JDBCRelation(parts, options)(sparkSession)
    sparkSession.baseRelationToDataFrame(relation)
  }

Examples

private def runJdbcDatasetExample(spark: SparkSession): Unit = {
    
    //Load data from JDBC source
    val jdbcDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://127.0.0.1:3306/test")
      .option("dbtable", "mytable")
      .option("user", "root")
      .option("password", "root")
      .load()

    val connectionProperties = new Properties()
    connectionProperties.put("user", "root")
    connectionProperties.put("password", "root")
    val jdbcDF2 = spark.read
      .jdbc("jdbc:mysql://127.0.0.1:3306/test", "mytable", connectionProperties)
    //Specify the column data types (customSchema) to use when reading
    connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
    val jdbcDF3 = spark.read
      .jdbc("jdbc:mysql://127.0.0.1:3306/test", "mytable", connectionProperties)

  }
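
The predicates-based overload shown in the source above can be used in the same way. A minimal sketch follows (the URL, credentials, and predicate values are placeholders); each predicate becomes exactly one partition:

    val props = new Properties()
    props.put("user", "root")
    props.put("password", "root")

    //Each element of the array becomes the WHERE clause of one partition,
    //so this read produces exactly three partitions
    val predicates = Array(
      "id <= 1000",
      "id > 1000 AND id <= 2000",
      "id > 2000")

    val predicateDF = spark.read
      .jdbc("jdbc:mysql://127.0.0.1:3306/test", "mytable", predicates, props)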

It is worth noting that, when reading this way, Spark uses a single partition by default if no partitioning options are specified, which can lead to an OOM when the amount of data is particularly large. After reading the data, you can call df.rdd.partitions.size to check the number of partitions. A partitioned read is sketched below.
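
A minimal sketch of such a partitioned read, using the columnName/lowerBound/upperBound overload shown earlier (the URL, credentials, bounds, and partition count are illustrative placeholders, not tuned values):

    val props = new Properties()
    props.put("user", "root")
    props.put("password", "root")

    //Partition the read on the numeric column "id" into 10 partitions;
    //lowerBound and upperBound only decide the partition stride, they do not filter rows
    val partitionedDF = spark.read.jdbc(
      "jdbc:mysql://127.0.0.1:3306/test", // url
      "mytable",                          // table
      "id",                               // columnName
      1L,                                 // lowerBound
      1000000L,                           // upperBound
      10,                                 // numPartitions
      props)

    println("number of partitions: " + partitionedDF.rdd.partitions.size) // 10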

Batch writing from Spark SQL to MySQL

The code example is as follows:

object BatchInsertMySQL {
  case class Person(name: String, age: Int)
  def main(args: Array[String]): Unit = {

    //Create a SparkSession object
    val conf = new SparkConf()
      .setAppName("BatchInsertMySQL")
    val spark: SparkSession =  SparkSession.builder()
      .config(conf)
      .getOrCreate()
    import spark.implicits._
    //MySQL connection parameters
    val url = JDBCUtils.url
    val user = JDBCUtils.user
    val pwd = JDBCUtils.password

    //Create the properties object and set the user name and password to connect to MySQL
    val properties: Properties = new Properties()

    properties.setProperty("user", user)    // user name
    properties.setProperty("password", pwd) // password
    properties.setProperty("driver", "com.mysql.jdbc.Driver")
    properties.setProperty("numPartitions","10")

    //Reading table data in MySQL
    val testDF: DataFrame = spark.read.jdbc(url, "test", properties)
    println("number of partitions in testDF: " + testDF.rdd.partitions.size)
    testDF.createOrReplaceTempView("test")
    testDF.persist(StorageLevel.MEMORY_AND_DISK)
    testDF.printSchema()

    val result =
      s"""
         |-- SQL code
         |""".stripMargin

    val resultBatch = spark.sql(result).as[Person]
    println("number of partitions in resultBatch: " + resultBatch.rdd.partitions.size)

    //Batch write to MySQL.
    //It is best to repartition the processed result here: with a large amount of
    //data, each partition would otherwise hold too many rows.
    resultBatch.repartition(500).foreachPartition(record => {

      val list = new ListBuffer[Person]
      record.foreach(person => {
        val name = person.name
        val age = person.age
        list.append(Person(name, age))
      })
      upsertPerson(list) // batch insert the data
    })
    //Method for batch inserting into MySQL
    def upsertPerson(list: ListBuffer[Person]): Unit = {

      var connect: Connection = null
      var pstmt: PreparedStatement = null

      try {
        connect = JDBCUtils.getConnection()
        //Disable auto-commit
        connect.setAutoCommit(false)

        val sql = "REPLACE INTO `person`(name, age)" +
          " VALUES(?, ?)"

        pstmt = connect.prepareStatement(sql)

        var batchIndex = 0
        for (person <- list) {
          pstmt.setString(1, person.name)
          pstmt.setInt(2, person.age)
          //Add batch
          pstmt.addBatch()
          batchIndex +=1
          //Control how often the batch is flushed: keep each submitted batch
          //reasonably small, otherwise the writes can overwhelm MySQL
          if(batchIndex % 1000 == 0 && batchIndex !=0){
            pstmt.executeBatch()
            pstmt.clearBatch()
          }

        }
        //Execute the remaining batch and commit
        pstmt.executeBatch()
        connect.commit()
      } catch {
        case e: Exception =>
          e.printStackTrace()
      } finally {
        JDBCUtils.closeConnection(connect, pstmt)
      }
    }

    spark.close()
  }
}

JDBC connection tool class:

object JDBCUtils {
  val user = "root"
  val password = "root"
  val url = "jdbc:mysql://localhost:3306/mydb"
  Class.forName("com.mysql.jdbc.Driver")
  //Get a connection
  def getConnection() = {
    DriverManager.getConnection(url,user,password)
  }
  //Release the connection and statement
  def closeConnection(connection: Connection, pstmt: PreparedStatement): Unit = {
    try {
      if (pstmt != null) {
        pstmt.close()
      }
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (connection != null) {
        connection.close()
      }
    }
  }
}
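
One related note on the connection URL: MySQL Connector/J sends each statement in a batch individually unless rewriteBatchedStatements=true is set, in which case it can rewrite batched inserts into multi-row statements, which usually speeds up this kind of batch write. The line below is a suggested tweak, not part of the original utility class:

  //A variant of the URL above with batch rewriting enabled
  val url = "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true"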

Summary

When Spark writes a large amount of data to MySQL, repartition the DataFrame before writing so that no single partition holds too much data. When writing, use foreachPartition so that each partition gets its own connection and commits its data in batches; keep each submitted batch small enough that it does not overwhelm the database.
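
For comparison, the same data can also be written with Spark's built-in JDBC writer, which batches rows according to the batchsize option from the table above. A minimal sketch follows (the table name, partition count, and batch size are illustrative); note that it appends plain INSERTs and does not reproduce the REPLACE INTO semantics of the manual approach:

    resultBatch
      .repartition(50)             //also caps the number of concurrent JDBC connections
      .write
      .format("jdbc")
      .option("url", JDBCUtils.url)
      .option("dbtable", "person")
      .option("user", JDBCUtils.user)
      .option("password", JDBCUtils.password)
      .option("batchsize", "1000") //rows per JDBC batch
      .mode("append")
      .save()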
