Performance tuning of Spark applications


Spark is a memory-based distributed computing engine known for its efficiency and stability. In real application development, however, developers run into a variety of problems, and performance is one of the most common. This article discusses how to improve application performance as far as possible.

Tuning a distributed computing engine focuses on four areas: CPU, memory, network overhead, and I/O. The concrete goals are:

  • Improve CPU utilization.

  • Avoid out-of-memory (OOM) errors.

  • Reduce network overhead.

  • Reduce I/O operations.

Chapter 1 Data skew

Data skew means that one or a few partitions hold far more data than the others, so the tasks that process those partitions take much longer to finish.

If a large amount of data is concentrated in one partition, that partition becomes the bottleneck of the computation. Figure 1 is a schematic diagram of the concurrent execution of a Spark application. In Spark, different stages of the same application are executed serially, while different tasks in the same stage can be executed concurrently. The number of tasks is determined by the number of partitions. If one partition holds a particularly large amount of data, the corresponding task takes a particularly long time to complete, the next stage cannot start until it does, and the whole job takes a long time to finish.

One way to avoid data skew is to choose an appropriate key or to define a custom partitioner. There is also a hard limit to keep in mind: Spark stores a block in a ByteBuffer, and a ByteBuffer can hold at most 2 GB of data. If a single key carries a very large amount of data, calling cache or persist can fail with the exception described in SPARK-1476.
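As an illustration of choosing a different key, the sketch below uses key salting, a common technique that is not taken from the original text. Each key is paired with a random sub-key so that a hot key is aggregated by several tasks, and a second, much smaller aggregation removes the salt. The input data, the saltBuckets value, and the local[2] master are assumptions made only for this example.

```scala
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context in the Spark shell, or creates a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("salting-sketch"))

// Hypothetical skewed input: the key "hot" dominates
val pairs = sc.parallelize(
  Seq.fill(1000)(("hot", 1)) ++ Seq(("cold", 1), ("cold", 1)), 4)

val saltBuckets = 8 // assumed number of sub-keys per original key

// Stage 1: aggregate on (key, salt), so the hot key is spread over several tasks
val partial = pairs
  .map { case (k, v) => ((k, Random.nextInt(saltBuckets)), v) }
  .reduceByKey(_ + _)

// Stage 2: drop the salt and merge the few partial sums that remain per key
val totals = partial
  .map { case ((k, _), v) => (k, v) }
  .reduceByKey(_ + _)
  .collectAsMap()
```

This only works for aggregations that can be computed in two steps, such as sums and counts.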

The following APIs trigger a shuffle, which is where data skew typically shows up:

  1. groupByKey

  2. reduceByKey

  3. aggregateByKey

  4. sortByKey

  5. join

  6. cogroup

  7. cartesian

  8. coalesce (only when called with shuffle = true)

  9. repartition

  10. repartitionAndSortWithinPartitions

Figure 1: Spark task concurrency model

The following wrapper classes sketch one way to keep a sort balanced when keys are skewed: each key is paired with a random Long before sorting, and the sorted records are then spread evenly over the requested number of partitions.

import scala.language.implicitConversions
import scala.reflect.ClassTag
import scala.util.Random

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

object RDDConversions {
  trait RDDWrapper[T] {
    def rdd: RDD[T]
  }

  // Context bounds (Ordering, ClassTag) replace the deprecated view bounds and ClassManifest
  case class DemoPairRDD[K: Ordering: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)]) extends RDDWrapper[(K, V)] {
    // Here we use a single Long to try to ensure the sort is balanced,
    // but for really large datasets we may want to consider
    // using a tuple of many Longs or even a GUID
    def sortByKeyGrouped(numPartitions: Int): RDD[(K, V)] =
      rdd.map(kv => ((kv._1, Random.nextLong()), kv._2)).sortByKey()
        .grouped(numPartitions).map(t => (t._1._1, t._2))
  }

  case class DemoRDD[T: ClassTag](rdd: RDD[T]) extends RDDWrapper[T] {
    def grouped(size: Int): RDD[T] = {
      // TODO Version where withIndex is cached
      val withIndex = rdd.mapPartitions(_.zipWithIndex)

      // Cumulative record counts: startValues(i) is the global index of the
      // first record in partition i, and startValues.last is the total count
      val startValues =
        withIndex.mapPartitionsWithIndex((i, iter) =>
          Iterator((i, iter.size.toLong))).collect().toList
          .sortBy(_._1).map(_._2).scan(0L)(_ + _)
      val total = startValues.last

      withIndex.mapPartitionsWithIndex((i, iter) => iter.map {
        case (value, index) => (startValues(i) + index.toLong, value)
      })
      .partitionBy(new Partitioner {
        def numPartitions: Int = size
        def getPartition(key: Any): Int =
          (key.asInstanceOf[Long] * numPartitions.toLong / total).toInt
      })
      .map(_._2)
    }
  }

  // Implicit conversions between plain RDDs and the wrappers above
  implicit def toDemoRDD[T: ClassTag](rdd: RDD[T]): DemoRDD[T] =
    new DemoRDD[T](rdd)
  implicit def toDemoPairRDD[K: Ordering: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)]): DemoPairRDD[K, V] = DemoPairRDD(rdd)
  implicit def toRDD[T](rdd: RDDWrapper[T]): RDD[T] = rdd.rdd
}

You can then use it in the Spark shell:

import RDDConversions._


Chapter 2 Reducing network communication overhead

Spark’s shuffle process consumes a lot of resources. During a shuffle, each computing node first writes its results to disk, and the subsequent stage then reads the results of the previous stage back. Writing and reading this data means disk I/O, which is very slow compared with in-memory operations.

Use iostat to check disk I/O usage. Frequent disk I/O operations are usually accompanied by high CPU load.

If the data and the computation are on the same machine, network overhead can be avoided; otherwise the data has to be transferred over the network. Use iftop to view network bandwidth usage and to see which nodes carry a large amount of network traffic.

Figure 2 is a schematic diagram of data transmission between Spark nodes. The function computed by a Spark task is sent from the driver to the executor through the Akka channel, while shuffle data is transferred through the Netty network interface. Because the spark.akka.frameSize parameter limits the maximum size of a message that can be sent over the Akka channel, avoid referencing large local variables in Spark tasks. (Spark 2.0 and later no longer use Akka; there the corresponding limit is spark.rpc.message.maxSize.)

Figure 2: Data transmission between Spark nodes

Section 1 Choose the appropriate degree of parallelism

To make a Spark application efficient, CPU utilization should be as high as possible. As a rule of thumb, the degree of parallelism should be about twice the number of available physical CPU cores. If the parallelism is too low, the CPU is not fully utilized; if it is too high, the task startup overhead rises, because Spark has to distribute every task to a computing node.

The degree of parallelism is changed through the configuration parameter spark.default.parallelism. For Spark SQL, modify spark.sql.shuffle.partitions instead.
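A minimal sketch of setting both parameters programmatically follows. The core count of 8 is an assumed value for illustration; the factor of two is the rule of thumb stated above.

```scala
import org.apache.spark.SparkConf

// Assumed cluster size for illustration: 8 physical cores available to the application
val physicalCores = 8
val parallelism = physicalCores * 2

val conf = new SparkConf()
  .setAppName("parallelism-sketch")
  .set("spark.default.parallelism", parallelism.toString)    // used by RDD shuffles
  .set("spark.sql.shuffle.partitions", parallelism.toString) // used by Spark SQL shuffles
```

The same keys can also be passed on the command line with --conf or placed in spark-defaults.conf.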

Item 1 repartition vs. coalesce

Both repartition and coalesce adjust the number of partitions dynamically, but note that repartition always triggers a shuffle, while coalesce by default does not.
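The difference can be observed on a small RDD. This is a sketch that assumes a local[2] master and an RDD built with parallelize; the partition counts are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context in the Spark shell, or creates a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch"))

val rdd = sc.parallelize(1 to 1000, 8)

val narrowed = rdd.coalesce(2)     // merges existing partitions, no shuffle
val widened = rdd.repartition(16)  // full shuffle; same as coalesce(16, shuffle = true)
val unchanged = rdd.coalesce(16)   // without a shuffle, coalesce cannot add partitions

println(Seq(narrowed, widened, unchanged).map(_.getNumPartitions))
```

Because coalesce without a shuffle can only merge partitions, use repartition (or coalesce with shuffle = true) when the number of partitions has to grow.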

Section 2 reduceByKey vs. groupByKey

groupByKey should be avoided as far as possible. First, it may cause a lot of network overhead. Second, it may cause OOM errors. Take word count as an example to see the difference between reduceByKey and groupByKey:

    sc.textFile("").flatMap(l => l.split(",")).map(w => (w, 1)).reduceByKey(_ + _)

Figure 3: Shuffle process of reduceByKey

The shuffle process of reduceByKey is shown in Figure 3, and that of groupByKey in Figure 4.


Figure 4: Shuffle process of groupByKey

Suggestion: prefer reduceByKey, aggregateByKey, foldByKey, and combineByKey wherever possible, because they combine values within each partition before the shuffle.
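The two word-count variants can be compared side by side. This sketch uses a small in-memory list of words instead of a text file, which is an assumption made to keep the example self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context in the Spark shell, or creates a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("wordcount-sketch"))

// Hypothetical input words
val words = sc.parallelize(Seq("a", "b", "a", "c", "a", "b"), 2).map(w => (w, 1))

// reduceByKey sums within each partition first, so at most one record
// per key and partition crosses the network
val viaReduce = words.reduceByKey(_ + _).collectAsMap()

// groupByKey ships every (word, 1) pair and holds all values of a key
// in memory before they are summed
val viaGroup = words.groupByKey().mapValues(_.sum).collectAsMap()
```

Both variants return the same counts; they differ only in how much data is shuffled and buffered.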

Suppose there is an RDD as follows, and we want the mean value for each key:

val data = sc.parallelize(List((0, 2.0), (0, 4.0), (1, 0.0), (1, 10.0), (1, 20.0)))

Method 1: reduceByKey

data.map(r => (r._1, (r._2, 1)))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .map(r => (r._1, r._2._1 / r._2._2))
  .foreach(println)

Method 2: combineByKey

data.combineByKey(
  (value: Double) => (value, 1),
  (x: (Double, Int), value: Double) => (x._1 + value, x._2 + 1),
  (x: (Double, Int), y: (Double, Int)) => (x._1 + y._1, x._2 + y._2)
).map(r => (r._1, r._2._1 / r._2._2)).foreach(println)
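The same mean can also be computed with aggregateByKey, one of the other operators recommended above. The sketch below is an additional variant, not one of the original two methods; it uses the same sample data and assumes a local[2] master.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context in the Spark shell, or creates a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("mean-sketch"))

val data = sc.parallelize(List((0, 2.0), (0, 4.0), (1, 0.0), (1, 10.0), (1, 20.0)))

val means = data
  .aggregateByKey((0.0, 0))(                      // zero value: (running sum, count)
    (acc, value) => (acc._1 + value, acc._2 + 1), // fold one value into a partition-local result
    (a, b) => (a._1 + b._1, a._2 + b._2))         // merge results from different partitions
  .mapValues { case (sum, count) => sum / count }
  .collectAsMap()
```

Compared with combineByKey, the zero value takes the place of the createCombiner function.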

Section 3 BroadcastHashJoin vs. ShuffleHashJoin

Joins between a large table and a small table are very common. To improve efficiency, BroadcastHashJoin broadcasts the content of the small table to every executor in advance, which avoids the shuffle and greatly improves performance.

At its core, BroadcastHashJoin relies on the broadcast function. Once you understand the advantages of broadcast, the advantages of BroadcastHashJoin follow.

Here is a simple example of using broadcast.

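val lst = (1 to 100).toList
val exampleRDD = sc.makeRDD(1 to 20, 2)
val broadcastLst = sc.broadcast(lst)
// Read the broadcast value on the executors instead of capturing lst in the closure
exampleRDD.filter(i => broadcastLst.value.contains(i)).collect().foreach(println)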
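The same idea can be applied by hand to a join at the RDD level. The sketch below is an illustration of the principle, not Spark SQL's BroadcastHashJoin implementation. The table contents and names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context in the Spark shell, or creates a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("broadcast-join-sketch"))

// Hypothetical small table: country code -> country name
val countries = Map("us" -> "United States", "de" -> "Germany")
val countriesB = sc.broadcast(countries)

// Hypothetical large table: (orderId, country code)
val orders = sc.parallelize(Seq((1, "us"), (2, "de"), (3, "fr")), 2)

// Inner join by lookup in the broadcast map; the large table is never shuffled
val joined = orders.flatMap { case (orderId, code) =>
  countriesB.value.get(code).map(name => (orderId, name))
}.collect().toSet
```

This only works while the small table fits comfortably in the memory of every executor.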

Section 4 map vs. mapPartitions

Sometimes the results of a computation have to be stored in an external database, which requires a connection to that database. As many elements as possible should share the same connection, instead of a new connection being established for each element.

In this case, mapPartitions and foreachPartition are much more efficient than map.
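A minimal sketch of the per-partition pattern follows. DemoConnection is a hypothetical stand-in for a real database client, and the accumulator (available from Spark 2.0) is only there to count how many connections were opened.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for a real database client
class DemoConnection {
  def write(record: String): Unit = () // a real client would send the record here
  def close(): Unit = ()
}

// Reuses the existing context in the Spark shell, or creates a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("foreach-partition-sketch"))

val opened = sc.longAccumulator("connectionsOpened")
val results = sc.parallelize(1 to 100, 4).map(i => s"record-$i")

results.foreachPartition { records =>
  val conn = new DemoConnection // one connection per partition, not per element
  opened.add(1)
  try records.foreach(conn.write) finally conn.close()
}
```

With 100 records in 4 partitions, 4 connections are opened instead of 100.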

Section 5 Local data reading

Moving the computation is far cheaper than moving the data.

Each task in Spark needs its input data, so the location of that data is very important for task performance. Ordered by the speed of data access, from fastest to slowest, the locality levels are:

  1. PROCESS_LOCAL

  2. NODE_LOCAL

  3. NO_PREF

  4. RACK_LOCAL

  5. ANY

Spark prefers the fastest level of data access when scheduling tasks. If you want tasks to start on as many machines as possible, lower spark.locality.wait; its default value is 3s.
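As a sketch, the wait can be lowered on the application's SparkConf. The value of 1s is an arbitrary assumption for illustration, not a recommendation.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("locality-sketch")
  // Assumed value: wait at most 1s for a more local slot before falling back
  // to the next locality level
  .set("spark.locality.wait", "1s")
```

A shorter wait starts tasks sooner on less local nodes, at the cost of more data being read over the network.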

Besides HDFS, Spark supports more and more data sources, including well-known NoSQL databases such as Cassandra, HBase, and MongoDB. With the rise of Elasticsearch, combining Spark and Elasticsearch to provide fast queries has also become a useful approach.

All of these external data sources face the same question: how to let Spark read the data quickly. The basic approach is to deploy computing nodes and data nodes together as far as possible. For example, when deploying a Hadoop cluster, the HDFS DataNode and the Spark worker can share one machine.

Take Cassandra as an example. If Spark and Cassandra are not deployed on exactly the same set of machines, lowering spark.locality.wait lets Spark tasks start on machines that do not have Cassandra deployed.

For Cassandra, you can deploy the Spark worker on the machines where Cassandra is deployed. Note that Cassandra’s compaction consumes a lot of CPU, so this needs to be taken into account when configuring the number of CPU cores for the Spark worker.

This part of the logic can be found in the Spark source code, in TaskSetManager::addPendingTask:

private def addPendingTask(index: Int, readding: Boolean = false) {
  // Utility method that adds `index` to a list only if readding=false or it's not already there
  def addTo(list: ArrayBuffer[Int]) {
    if (!readding || !list.contains(index)) {
      list += index
    }
  }

  for (loc <- tasks(index).preferredLocations) {
    loc match {
      case e: ExecutorCacheTaskLocation =>
        addTo(pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer))
      case e: HDFSCacheTaskLocation => {
        val exe = sched.getExecutorsAliveOnHost(loc.host)
        exe match {
          case Some(set) => {
            for (e <- set) {
              addTo(pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer))
            }
            logInfo(s"Pending task $index has a cached location at ${e.host} " +
              ", where there are executors " + set.mkString(","))
          }
          case None => logDebug(s"Pending task $index has a cached location at ${e.host} " +
              ", but there are no executors alive there.")
        }
      }
      case _ => Unit
    }
    addTo(pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer))
    for (rack <- sched.getRackForHost(loc.host)) {
      addTo(pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer))
    }
  }

  if (tasks(index).preferredLocations == Nil) {
    pendingTasksWithNoPrefs += index
  }

  if (!readding) {
    allPendingTasks += index  // No point scanning this whole list to find the old task there
  }
}

If you want Spark to support a new storage source by developing a corresponding RDD, the locality-related part is a custom getPreferredLocations function. Take EsRDD in elasticsearch-hadoop as an example; its implementation is as follows.

override def getPreferredLocations(split: Partition): Seq[String] = {
  val esSplit = split.asInstanceOf[EsPartition]
  val ip = esSplit.esPartition.nodeIp
  if (ip != null) Seq(ip) else Nil
}
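To show where getPreferredLocations sits in a complete custom RDD, here is a minimal sketch. HostPartition, HostAwareRDD, and the host names are hypothetical; a real data source would return actual data in compute and discover the hosts from the storage system.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type that remembers which host stores its data
class HostPartition(val index: Int, val host: String) extends Partition

// Hypothetical RDD with one partition per storage host
class HostAwareRDD(sc: SparkContext, hosts: Seq[String]) extends RDD[String](sc, Nil) {
  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => new HostPartition(i, h) }.toArray[Partition]

  // Placeholder for reading the partition's data from the storage system
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator(s"data stored on ${split.asInstanceOf[HostPartition].host}")

  // The scheduler tries to run the task for this partition on the returned host
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[HostPartition].host)
}

// Reuses the existing context in the Spark shell, or creates a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("preferred-locations-sketch"))

val rdd = new HostAwareRDD(sc, Seq("host-a", "host-b"))
val firstLocation = rdd.preferredLocations(rdd.partitions(0))
```

The returned host names are only hints: if no executor runs on a preferred host, the task is scheduled elsewhere.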

Section 6 Serialization

A good serialization algorithm improves running speed and reduces memory usage.

When Spark shuffles, it first writes the data to disk, and the stored content is serialized. Two basic factors matter for serialization: the speed of serialization and the size of the serialized output.

Compared with the default Java serializer, KryoSerializer has a clear advantage in both serialization speed and output size. It is therefore recommended to use KryoSerializer in the application configuration:

spark.serializer  org.apache.spark.serializer.KryoSerializer
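The same setting can be applied in code. The sketch below also registers an application class with Kryo, which lets Kryo write a compact class identifier instead of the full class name; the Click class is a hypothetical record type used only for illustration.

```scala
import org.apache.spark.SparkConf

// Hypothetical record type that the application shuffles or caches
case class Click(userId: Long, url: String)

val conf = new SparkConf()
  .setAppName("kryo-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Click]))
```

Registration is optional: unregistered classes still work with Kryo, only less compactly.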

By default, cache does not serialize the cached objects; the StorageLevel it uses is MEMORY_ONLY, which means a large amount of memory is used. The cached content can be serialized by passing a serialized storage level to persist.


Note that persist does not cache the data until a job is executed, that is, it is evaluated lazily, whereas unpersist takes effect immediately and clears the cache right away.
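A minimal sketch of serialized caching and of the order of events follows, assuming a local[2] master and a small generated data set.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuses the existing context in the Spark shell, or creates a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("persist-sketch"))

val doubled = sc.parallelize(1 to 1000, 4).map(_ * 2)

// Only marks the RDD for serialized in-memory caching; nothing is cached yet
doubled.persist(StorageLevel.MEMORY_ONLY_SER)
val levelWhileMarked = doubled.getStorageLevel

val n = doubled.count()           // first job: computes the partitions and caches them
val total = doubled.reduce(_ + _) // later jobs can read the cached partitions

// Clears the storage level and asks Spark to drop the cached blocks
doubled.unpersist()
val levelAfterUnpersist = doubled.getStorageLevel
```

Call unpersist only after the last job that needs the cached data; otherwise the next job recomputes it from scratch.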

More can be