RDD partition algorithm in Spark

Time: 2019-11-26



// Computes the (start, end) index range for each partition so that `length`
// elements are spread across `numSlices` partitions as evenly as possible.
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
  (0 until numSlices).iterator.map { i =>
    // Integer division guarantees the slice sizes differ by at most one.
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end) // partition i covers indices [start, end)
  }
}


/**

  • numSlices is the number of partitions.
  • (0 until numSlices).iterator turns the sequence of partition indices into an iterator, and map then transforms each index.
  • Each index i is mapped to a pair (start, end); for example, partition 0 is mapped to (0, n), meaning it reads elements 0 until n of the data set.
  • Inside map, the bounds are computed as:
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
  • Finally an Iterator of (start, end) pairs is returned.
  • In this way the data set is distributed across the partitions as evenly as possible; a worked example follows below.

*/
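
To make the even split concrete, here is a minimal, self-contained sketch of the same computation run outside Spark; the object name PositionsDemo and the sample values (length = 10, numSlices = 3) are chosen here just for illustration.

object PositionsDemo {
  // Same slicing logic as above: split `length` elements into `numSlices` ranges.
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    // 10 elements over 3 partitions: prints (0,3), (3,6), (6,10),
    // so the slice sizes are 3, 3, 4 -- they differ by at most one.
    positions(10, 3).foreach(println)
  }
}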

In a big data cluster, the data usually has to be partitioned and assigned to the nodes of the cluster so that they can execute the same task in parallel; mobilizing cluster resources this way greatly speeds up task execution. A good partition algorithm is therefore an indispensable part of the framework.
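
As a rough illustration of how this plays out in practice, the sketch below parallelizes a small collection with an explicit numSlices and uses glom() to make the resulting slices visible; it assumes a local SparkContext, and the app name and master URL are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute 10 elements over 3 partitions; the slice boundaries
    // match the positions logic above: [0,3), [3,6), [6,10).
    val rdd = sc.parallelize(1 to 10, numSlices = 3)

    // glom() gathers each partition into an array so the split is visible:
    // Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10)
    rdd.glom().collect().foreach(a => println(a.mkString("Array(", ", ", ")")))

    sc.stop()
  }
}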