### Abstract

Spark is an Apache project billed as “lightning-fast cluster computing”. It has a thriving open-source community and is currently one of the most active Apache projects. Spark provides a faster and more general data processing platform: compared with Hadoop, a Spark program can run up to 100 times faster in memory, or 10 times faster on disk. At the same time, Spark makes traditional MapReduce job development easier and faster. This article reimplements several classic Hadoop MapReduce examples in Spark, as a way to get familiar with Spark development.

### Max/min problem

Finding the maximum and minimum values of a dataset has always been a classic Hadoop exercise. Here we implement it in Spark, to get a feel for how the MapReduce idea carries over. Without further ado, straight to the code:

```
@Test
def testMaxMin(): Unit = {
  val sconf = new SparkConf().setAppName("test").setMaster("local[*]")
  val sc = new SparkContext(sconf)
  // Initialize test data
  val data = sc.parallelize(Array(10, 7, 3, 4, 5, 6, 7, 8, 1001, 6, 2))
  // Method 1: group everything under one key, then scan the grouped values
  data.map(x => ("key", x)).groupByKey().map(x => {
    var min = Integer.MAX_VALUE
    var max = Integer.MIN_VALUE
    for (num <- x._2) {
      if (num > max) max = num
      if (num < min) min = num
    }
    (max, min)
  }).collect.foreach(x => {
    println("max\t" + x._1)
    println("min\t" + x._2)
  })
  // Method 2: a far simpler and better-performing way using reduce
  val max = data.reduce((a, b) => Math.max(a, b))
  val min = data.reduce((a, b) => Math.min(a, b))
  println("max : " + max)
  println("min : " + min)
  sc.stop()
}
```

Expected results:

```
max: 1001
min: 2
```

The idea mirrors MapReduce in Hadoop: assign every value the same key, aggregate the values together with `groupByKey`, and then scan the grouped values for the maximum and minimum. The second method, using `reduce`, is both simpler and better-performing, since it avoids shuffling all values to a single reducer.
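The single-pass idea behind the `reduce` approach can also be sketched on plain Scala collections, using one `foldLeft` that tracks both extremes at once. This is a local illustration of the logic, not Spark code; the `maxMin` helper name is our own:

```
// Compute (max, min) of a non-empty sequence in a single pass
def maxMin(xs: Seq[Int]): (Int, Int) =
  xs.tail.foldLeft((xs.head, xs.head)) { case ((mx, mn), v) =>
    (Math.max(mx, v), Math.min(mn, v))
  }

println(maxMin(Seq(10, 7, 3, 4, 5, 6, 7, 8, 1001, 6, 2)))  // (1001,2)
```

Unlike the two separate `reduce` calls in the Spark code, this folds over the data only once, carrying both running values in a tuple accumulator.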

### Mean value problem

Computing the average value per key is another common case. When dealing with this kind of problem in Spark, the `combineByKey` function is the usual tool; see the code below:

```
@Test
def testAvg(): Unit = {
  val sconf = new SparkConf().setAppName("test").setMaster("local[*]")
  val sc = new SparkContext(sconf)
  // Initialize test data
  val foo = sc.parallelize(List(("a", 1), ("a", 3), ("b", 2), ("b", 8)))
  // combineByKey builds a (sum, count) accumulator per key
  val results = foo.combineByKey(
    (v) => (v, 1),                                     // create a combiner from the first value
    (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),  // merge a value into a partition-local accumulator
    (acc1: (Int, Int), acc2: (Int, Int)) =>            // merge accumulators across partitions
      (acc1._1 + acc2._1, acc1._2 + acc2._2)
  ).map { case (key, value) => (key, value._1 / value._2.toDouble) }
  results.collect().foreach(println)
  sc.stop()
}
```

Each partition computes a partial (sum, count) pair for every key it holds; after the shuffle, the partial sums and counts for each key are added together, and dividing the total sum by the total count yields the average.
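The same accumulator logic can be sketched locally on plain Scala collections, which makes the (sum, count) idea easy to see. This is an illustration, not Spark code; the `avgByKey` helper is our own:

```
// Per-key average via a (sum, count) accumulator, mirroring the combineByKey logic
def avgByKey(pairs: Seq[(String, Int)]): Map[String, Double] =
  pairs.groupBy(_._1).map { case (k, vs) =>
    val (sum, count) = vs.foldLeft((0, 0)) { case ((s, c), (_, v)) => (s + v, c + 1) }
    (k, sum / count.toDouble)
  }

println(avgByKey(List(("a", 1), ("a", 3), ("b", 2), ("b", 8))))
```

On the sample data this yields a = 2.0 and b = 5.0, the same result the Spark job prints.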

### Top-N problem

The top-N problem is another classic Hadoop case embodying the MapReduce idea. How can it be solved conveniently and quickly in Spark?

```
@Test
def testTopN(): Unit = {
  val sconf = new SparkConf().setAppName("test").setMaster("local[*]")
  val sc = new SparkContext(sconf)
  // Initialize test data
  val foo = sc.parallelize(Array(
    ("a", 1),
    ("a", 2),
    ("a", 3),
    ("b", 3),
    ("b", 1),
    ("a", 4),
    ("b", 4),
    ("b", 2)
  ))
  // For this test, take the top 2 per key
  val groupsSort = foo.groupByKey().map(tu => {
    val key = tu._1
    val sortValues = tu._2.toList.sortWith(_ > _).take(2)
    (key, sortValues)
  })
  // Flatten the format for printing
  val flattenedTopNPerGroup =
    groupsSort.flatMap { case (key, numbers) => numbers.map(key -> _) }
  // collect first, so the output is printed on the driver rather than on the executors
  flattenedTopNPerGroup.collect().foreach(println)
  sc.stop()
}
```

The idea is simple: group the data by key with `groupByKey`, then take the two largest values from each group. Expected results:

```
(a,4)
(a,3)
(b,4)
(b,3)
```

The above briefly showed how three common Hadoop cases can be implemented in Spark. If you have encountered or written some Hadoop MapReduce jobs before, you will probably find them much easier and faster to write in Spark.

More classic Spark cases will be introduced later; stay tuned for the next installment…

**Author information**

Member of the MaxLeap data analysis team: Tan Yang [original]

Originally published at: https://blog.maxleap.cn/archi…

**The author’s previous articles**

Analysis of time series data

Analysis of Apache Spark caching and checkpointing