How to read MySQL with Spark SQL
Spark SQL includes a data source that can read data from other databases over JDBC. It should be preferred over JdbcRDD, because the results are returned as DataFrames, which can be processed directly in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python because it does not require the user to provide a ClassTag.
You can use the Data Sources API to load a table in a remote database as a DataFrame or as a Spark SQL temporary view. JDBC connection properties can be specified in the data source options; user and password are normally supplied as connection properties for logging in to the data source. In addition to the connection properties, Spark supports the following case-insensitive options:
Property name | Explanation |
---|---|
url | The JDBC URL to connect to |
dbtable | The JDBC table to read from or write to |
query | A query whose result is read as a table (an alternative to dbtable) |
driver | The class name of the JDBC driver used to connect to the URL |
partitionColumn, lowerBound, upperBound | If any of these options is specified, all of them must be. numPartitions must also be specified |
numPartitions | The maximum number of partitions that can be used for parallel reading and writing of the table. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, call coalesce(numPartitions) before writing to reduce it to this limit |
queryTimeout | Query timeout in seconds; defaults to 0 (no limit) |
fetchsize | The JDBC fetch size, which determines how many rows are fetched per round trip. This can help improve the performance of JDBC drivers |
batchsize | The JDBC batch size, which defaults to 1000. This can help improve the performance of JDBC drivers; applies to writing only |
isolationLevel | The transaction isolation level for the current connection. It can be NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to the standard isolation levels defined by the JDBC Connection object. The default is READ_UNCOMMITTED. Applies to writing only |
sessionInitStatement | After each database session to the remote database is opened and before reading begins, this option executes a custom SQL statement, which can be used for session initialization |
truncate | A JDBC writer option. When SaveMode.Overwrite is enabled, the contents of the target table are truncated instead of the existing table being dropped and recreated. Defaults to false |
pushDownPredicate | Enables or disables predicate pushdown to the JDBC data source. The default is true, in which case Spark pushes filters down to the JDBC data source as far as possible |
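As an illustration of how these options fit together, the sketch below reads a MySQL table in parallel using the partitioning options. It is a minimal sketch only: the table name mytable, the numeric partition column id, and the bound values are placeholder assumptions.
import org.apache.spark.sql.SparkSession

// Minimal sketch: parallel JDBC read driven by the partitioning options.
// "mytable" and the numeric column "id" are placeholders for illustration.
val spark = SparkSession.builder().appName("JdbcReadOptions").getOrCreate()

val partitionedDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://127.0.0.1:3306/test")
  .option("dbtable", "mytable")
  .option("user", "root")
  .option("password", "root")
  .option("partitionColumn", "id")  // must be a numeric, date, or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")    // also caps the number of concurrent JDBC connections
  .option("fetchsize", "1000")      // rows fetched per round trip
  .load()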
Source code
- SparkSession
/**
* Returns a [[DataFrameReader]] that can be used to read non-streaming data in as a
* `DataFrame`.
* {{{
* sparkSession.read.parquet("/path/to/file.parquet")
* sparkSession.read.schema(schema).json("/path/to/file.json")
* }}}
*
* @since 2.0.0
*/
def read: DataFrameReader = new DataFrameReader(self)
- DataFrameReader
// ... code omitted
/**
 * All data is read through a single RDD partition, so with a large table you
 * are likely to hit an OOM. Check with df.rdd.partitions.size.
 */
def jdbc(url: String, table: String, properties: Properties): DataFrame = {
assertNoSpecifiedSchema("jdbc")
this.extraOptions ++= properties.asScala
this.extraOptions += (JDBCOptions.JDBC_URL -> url, JDBCOptions.JDBC_TABLE_NAME -> table)
format("jdbc").load()
}
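For reference, a call to this overload reads the entire table through a single partition, which can be confirmed by checking the underlying RDD. The snippet below is a sketch that reuses the placeholder connection details from earlier and assumes an existing SparkSession named spark.
import java.util.Properties

val props = new Properties()
props.put("user", "root")
props.put("password", "root")

// The whole table goes through one JDBC connection and one partition.
val singleDF = spark.read.jdbc("jdbc:mysql://127.0.0.1:3306/test", "mytable", props)
println(singleDF.rdd.partitions.size)  // 1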
/**
 * @param url the JDBC database URL
 * @param table the table name
 * @param columnName the name of the partition column
 * @param lowerBound the minimum value of `columnName`, used to compute the partition stride
 * @param upperBound the maximum value of `columnName`, used to compute the partition stride
 * @param numPartitions the number of partitions
 * @param connectionProperties other JDBC connection parameters
 * @since 1.4.0
 */
def jdbc(
url: String,
table: String,
columnName: String,
lowerBound: Long,
upperBound: Long,
numPartitions: Int,
connectionProperties: Properties): DataFrame = {
this.extraOptions ++= Map(
JDBCOptions.JDBC_PARTITION_COLUMN -> columnName,
JDBCOptions.JDBC_LOWER_BOUND -> lowerBound.toString,
JDBCOptions.JDBC_UPPER_BOUND -> upperBound.toString,
JDBCOptions.JDBC_NUM_PARTITIONS -> numPartitions.toString)
jdbc(url, table, connectionProperties)
}
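A usage sketch for this overload is shown below, assuming an existing SparkSession named spark and a table mytable with a numeric id column whose values roughly span 1 to 1,000,000 (all placeholder assumptions). Note that the bounds only shape the partition stride; they do not filter rows.
import java.util.Properties

val props = new Properties()
props.put("user", "root")
props.put("password", "root")

// Ten partitions, each reading a slice of the id range in parallel.
val partitionedDF = spark.read.jdbc(
  url = "jdbc:mysql://127.0.0.1:3306/test",
  table = "mytable",
  columnName = "id",
  lowerBound = 1L,
  upperBound = 1000000L,
  numPartitions = 10,
  connectionProperties = props)
println(partitionedDF.rdd.partitions.size)  // 10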
/**
 * @param predicates the WHERE condition for each partition,
 *                   e.g. "id <= 1000" and "score > 1000 AND score <= 2000"
 *                   produce two partitions
 * @since 1.4.0
 */
def jdbc(
url: String,
table: String,
predicates: Array[String],
connectionProperties: Properties): DataFrame = {
assertNoSpecifiedSchema("jdbc")
val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
val options = new JDBCOptions(url, table, params)
val parts: Array[Partition] = predicates.zipWithIndex.map { case (part, i) =>
JDBCPartition(part, i) : Partition
}
val relation = JDBCRelation(parts, options)(sparkSession)
sparkSession.baseRelationToDataFrame(relation)
}
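A usage sketch for the predicate-based overload follows, again assuming an existing SparkSession named spark and a placeholder table mytable with a numeric score column. Each predicate becomes its own partition.
import java.util.Properties

val props = new Properties()
props.put("user", "root")
props.put("password", "root")

// Each WHERE clause below becomes its own partition, so two partitions in total.
val predicates = Array(
  "score <= 1000",
  "score > 1000 AND score <= 2000")

val predicateDF = spark.read.jdbc(
  "jdbc:mysql://127.0.0.1:3306/test", "mytable", predicates, props)
println(predicateDF.rdd.partitions.size)  // 2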
Examples
private def runJdbcDatasetExample(spark: SparkSession): Unit = {
//Load data from JDBC source
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://127.0.0.1:3306/test")
.option("dbtable", "mytable")
.option("user", "root")
.option("password", "root")
.load()
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "root")
val jdbcDF2 = spark.read
.jdbc("jdbc:mysql://127.0.0.1:3306/test", "mytable", connectionProperties)
//Specify custom data types for the columns being read
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
.jdbc("jdbc:mysql://127.0.0.1:3306/test", "mytable", connectionProperties)
}
It is worth noting that when reading in the ways shown above, if no partitioning is specified, Spark reads the data through a single partition by default, which leads to an OOM when the amount of data is very large. After loading the data, call df.rdd.partitions.size to check the number of partitions.
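For example, continuing the example above (a minimal sketch; jdbcDF2 was loaded without any partition options):
// Loaded without partition options, so Spark uses a single partition.
println(jdbcDF2.rdd.partitions.size)  // 1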
Batch writing from Spark SQL to MySQL
The code example is as follows:
import java.sql.{Connection, PreparedStatement}
import java.util.Properties

import scala.collection.mutable.ListBuffer

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

object BatchInsertMySQL {
case class Person(name: String, age: Int)
def main(args: Array[String]): Unit = {
//Create a SparkSession object
val conf = new SparkConf()
.setAppName("BatchInsertMySQL")
val spark: SparkSession = SparkSession.builder()
.config(conf)
.getOrCreate()
import spark.implicits._
//MySQL connection parameters
val url = JDBCUtils.url
val user = JDBCUtils.user
val pwd = JDBCUtils.password
//Create the properties object and set the user name and password to connect to MySQL
val properties: Properties = new Properties()
properties.setProperty ("user", user) // user name
properties.setProperty ("password", PWD) // password
properties.setProperty("driver", "com.mysql.jdbc.Driver")
properties.setProperty("numPartitions","10")
//Reading table data in MySQL
val testDF: DataFrame = spark.read.jdbc(url, "test", properties)
Println ("number of partitions in testdf):"+ testDF.rdd.partitions .size)
testDF.createOrReplaceTempView("test")
testDF.persist(StorageLevel.MEMORY_AND_DISK)
testDF.printSchema()
val result =
  """
    |-- SQL code goes here (omitted in the original)
    |""".stripMargin
val resultBatch = spark.sql(result).as[Person]
Println ("number of partitions in resultbatch:+ resultBatch.rdd.partitions .size)
//Batch write to MySQL
//It is best to repartition the processed result here;
//with a large amount of data, each partition would otherwise hold too many rows
resultBatch.repartition(500).foreachPartition(record => {
val list = new ListBuffer[Person]
record.foreach(person => {
val name = person.name
val age = person.age
list.append(Person(name,age))
})
upsertPerson(list) // batch insert the data
})
//Method for batch upserting into MySQL
def upsertPerson(list: ListBuffer[Person]): Unit = {
var connect: Connection = null
var pstmt: PreparedStatement = null
try {
connect = JDBCUtils.getConnection()
//Disable auto-commit
connect.setAutoCommit(false)
val sql = "REPLACE INTO `person`(name, age)" +
" VALUES(?, ?)"
pstmt = connect.prepareStatement(sql)
var batchIndex = 0
for (person <- list) {
pstmt.setString(1, person.name)
pstmt.setInt(2, person.age)
//Add batch
pstmt.addBatch()
batchIndex +=1
//Control the size of each submitted batch:
//keep JDBC batches to MySQL reasonably small, otherwise the writes can overwhelm the database
if(batchIndex % 1000 == 0 && batchIndex !=0){
pstmt.executeBatch()
pstmt.clearBatch()
}
}
//Execute the remaining batch and commit the transaction
pstmt.executeBatch()
connect.commit()
} catch {
case e: Exception =>
e.printStackTrace()
} finally {
JDBCUtils.closeConnection(connect, pstmt)
}
}
spark.close()
}
}
JDBC connection tool class:
import java.sql.{Connection, DriverManager, PreparedStatement}

object JDBCUtils {
val user = "root"
val password = "root"
val url = "jdbc:mysql://localhost:3306/mydb"
Class.forName("com.mysql.jdbc.Driver")
//Get a connection
def getConnection() = {
DriverManager.getConnection(url,user,password)
}
//Release the connection
def closeConnection(connection: Connection, pstmt: PreparedStatement): Unit = {
try {
if (pstmt != null) {
pstmt.close()
}
} catch {
case e: Exception => e.printStackTrace()
} finally {
if (connection != null) {
connection.close()
}
}
}
}
Summary
When Spark writes a large amount of data to MySQL, repartition the DataFrame before writing so that no single partition holds too much data. When writing, use foreachPartition so that one connection is obtained per partition, and submit the writes in batches within each partition. Keep each submitted batch reasonably small to avoid overloading the database.
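For comparison, a write can also go through Spark's built-in JDBC writer, which batches inserts itself. The sketch below continues the example above; the target table person_copy, the batch size, and the partition count are placeholder choices, and note that this issues plain INSERTs rather than the REPLACE INTO used in the manual approach.
import java.util.Properties
import org.apache.spark.sql.SaveMode

val writeProps = new Properties()
writeProps.put("user", JDBCUtils.user)
writeProps.put("password", JDBCUtils.password)
writeProps.put("driver", "com.mysql.jdbc.Driver")
writeProps.put("batchsize", "1000")  // rows per JDBC batch

// Reduce the number of concurrent writers first, then let Spark batch the inserts.
resultBatch.coalesce(50)
  .write
  .mode(SaveMode.Append)
  .jdbc(JDBCUtils.url, "person_copy", writeProps)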