Spark operations: aggregate and aggregateByKey


1. aggregate function

aggregate first aggregates the elements within each partition, then combines the per-partition results and the initial value with a combine function. The type returned by aggregate does not need to match the element type of the RDD.

The seqOp operation aggregates the elements within each partition, and combOp then aggregates the per-partition results; both operations start from zeroValue. seqOp traverses all elements (of type T) in a partition: the first T is combined with zeroValue, that result is combined with the second T, and so on until the whole partition has been traversed. combOp then merges the per-partition results into the final value. Because aggregate can return a type U different from the RDD's element type T, it needs one operation, seqOp, to merge the elements T within a partition into a U, and another operation, combOp, to merge all the Us.


scala> val rdd = List(1,2,3,4,5,6,7,8,9)
rdd: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> rdd.par.aggregate((0,0))(
     |   (acc,number) => (acc._1 + number, acc._2 + 1),
     |   (par1,par2) => (par1._1 + par2._1, par1._2 + par2._2))
res0: (Int, Int) = (45,9)

scala> res0._1 / res0._2
res1: Int = 5

The process is roughly as follows:

First, the initial value is (0,0), which is used in both of the following steps.
Then, in (acc, number) => (acc._1 + number, acc._2 + 1), number is the T from the function definition, here an element of the List. The accumulation acc._1 + number, acc._2 + 1 proceeds as follows:

1.  0+1,  0+1
2.  1+2,  1+1
3.  3+3,  2+1
4.  6+4,  3+1
5.  10+5,  4+1
6.  15+6,  5+1
7.  21+7,  6+1
8.  28+8,  7+1
9.  36+9,  8+1

The result is (45,9). This is the single-threaded computation. Actual Spark execution is distributed and may split the List into several partitions. With three partitions P1 (1, 2, 3, 4), P2 (5, 6, 7, 8), P3 (9), the per-partition results are (10, 4), (26, 4), (9, 1). Executing (par1, par2) => (par1._1 + par2._1, par1._2 + par2._2) then gives (10 + 26 + 9, 4 + 4 + 1) = (45, 9), from which the average follows easily.
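The three-partition computation described above can be sketched in plain Scala, without a Spark cluster. This is only a simulation of aggregate's semantics on an explicit list of partitions (the helper below is illustrative, not Spark's API; the partition boundaries are assumed as in the text):

```scala
object AggregateSketch {
  // Simulates RDD.aggregate: seqOp folds each partition from zeroValue,
  // then combOp folds the per-partition results, again from zeroValue.
  def aggregate[T, U](partitions: Seq[Seq[T]])(zeroValue: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U =
    partitions
      .map(p => p.foldLeft(zeroValue)(seqOp)) // seqOp inside each partition
      .foldLeft(zeroValue)(combOp)            // combOp across partition results

  def main(args: Array[String]): Unit = {
    // The assumed partitioning P1, P2, P3 from the text
    val parts = Seq(Seq(1, 2, 3, 4), Seq(5, 6, 7, 8), Seq(9))
    val (sum, count) = aggregate(parts)((0, 0))(
      (acc, n) => (acc._1 + n, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2))
    println(s"($sum,$count)") // (45,9)
  }
}
```

Each partition folds to (10,4), (26,4) and (9,1) respectively, and combOp sums them to (45,9), matching the REPL result.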

2. aggregateByKey function:

aggregateByKey aggregates the values of the same key in a PairRDD and, like aggregate, uses a neutral initial value in the process. Also like aggregate, aggregateByKey does not need to return the same type as the values in the RDD. Because aggregateByKey aggregates the values of each key, it ultimately returns a PairRDD of keys and their aggregated values, whereas aggregate directly returns a single non-RDD result.


import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object AggregateByKeyOp {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("AggregateByKey").setMaster("local")
    val sc: SparkContext = new SparkContext(sparkConf)
    val data = List((1, 3), (1, 2), (1, 4), (2, 3))
    val rdd = sc.parallelize(data, 2)

    // Merges values within the same partition: a has the type of zeroValue,
    // b has the type of the original value.
    def seqOp(a: String, b: Int): String = {
      println("seqOp: " + a + "\t" + b)
      a + b
    }

    // Merges results from different partitions: both a and b have the type of zeroValue.
    def combOp(a: String, b: String): String = {
      println("combOp: " + a + "\t" + b)
      a + b
    }

    // zeroValue: neutral value; defines the result type and takes part in the computation
    // seqOp: merges values within the same partition
    // combOp: merges results from different partitions
    val aggregateByKeyRDD = rdd.aggregateByKey("100")(seqOp, combOp)
    aggregateByKeyRDD.foreach(println)
    sc.stop()
  }
}

Operation results:

The data is split into two partitions:

// partition 1: (1,3), (1,2)
// partition 2: (1,4), (2,3)

// Merging values with the same key inside partition 1
seqOp: 100	3    // (1,3): merging starts from the neutral value; result 1003
seqOp: 1003	2    // (1,2): result 10032

// Merging values with the same key inside partition 2
seqOp: 100	4    // (1,4): merging starts from the neutral value; result 1004
seqOp: 100	3    // (2,3): merging starts from the neutral value; result 1003

Merging the results of the two partitions:

// key 2 exists in only one partition, so no merge is needed: (2,1003)

// key 1 exists in both partitions, so its partial results (of the same type) are merged
combOp: 10032	1004

// final result: (1,100321004) and (2,1003)
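The trace above can also be reproduced in plain Scala. The sketch below simulates aggregateByKey's semantics on an explicit list of partitions (the helper is illustrative, not Spark's API; the two-partition split is the one assumed in the trace):

```scala
object AggregateByKeySketch {
  // Simulates RDD.aggregateByKey: within each partition, seqOp folds the
  // values of each key starting from zeroValue; across partitions, combOp
  // merges the per-partition results of the same key.
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]])(zeroValue: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    val perPartition: Seq[Map[K, U]] = partitions.map { p =>
      p.foldLeft(Map.empty[K, U]) { case (m, (k, v)) =>
        m.updated(k, seqOp(m.getOrElse(k, zeroValue), v))
      }
    }
    perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(combOp)
    }
  }

  def main(args: Array[String]): Unit = {
    // Same data and two-partition split as the Spark example above
    val parts = Seq(Seq((1, 3), (1, 2)), Seq((1, 4), (2, 3)))
    val result = aggregateByKey(parts)("100")(
      (a: String, b: Int) => a + b,    // seqOp: append the value digits
      (a: String, b: String) => a + b) // combOp: concatenate partial results
    println(result) // contains 1 -> 100321004 and 2 -> 1003
  }
}
```

Key 1 folds to 10032 in partition 1 and 1004 in partition 2, which combOp joins into 100321004; key 2 appears only once and stays 1003, matching the trace.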

That is the whole content of this article. I hope it is helpful to your study, and I hope you will continue to support developpaer.
