Using Spark to solve some classic MapReduce problems

Time: 2020-6-29

Abstract

Spark is an Apache project billed as “lightning fast cluster computing”. It has a thriving open source community and is currently the most active Apache project. Spark provides a faster and more general data processing platform: compared with Hadoop, Spark can run programs up to 100 times faster in memory, or 10 times faster on disk. At the same time, Spark makes traditional MapReduce job development easier and faster. This article briefly shows how several classic Hadoop MapReduce (MR) cases can be implemented with Spark, as a way to get familiar with Spark development.

Maximum and minimum

Finding the maximum and minimum values has always been a classic Hadoop case. Here we implement it with Spark to get a feel for how the MR idea maps onto Spark. Without further ado, straight to the code:

//imports required by the snippets below
import org.apache.spark.{SparkConf, SparkContext}

@Test
  def testMaxMin: Unit = {
    val sconf = new SparkConf().setAppName("test")
    val sc = new SparkContext(sconf)
    //Initialize test data
    val data = sc.parallelize(Array(10,7,3,4,5,6,7,8,1001,6,2))
    //Method 1: map everything to one key, group the values together, then scan for max and min
    data.map(x => ("key", x)).groupByKey().map(x => {
      var min = Integer.MAX_VALUE
      var max = Integer.MIN_VALUE
      for(num <- x._2){
        if(num > max){
          max = num
        }
        if(num < min){
          min = num
        }
      }
      (max, min)
    }).collect.foreach(x => {
      println("max\t" + x._1)
      println("min\t" + x._2)
    })

    //Method 2: a simpler trick -- reduce the RDD directly to get the maximum and minimum
    val max = data.reduce((a, b) => Math.max(a, b))
    val min = data.reduce((a, b) => Math.min(a, b))
    println("max : " + max)
    println("min : " + min)
    sc.stop
  }

Expected results:

max: 1001
min: 2

The idea is the same as MR in Hadoop: map every element to a single common key whose value is the number itself, aggregate all values together with groupByKey, and then scan the grouped values for the maximum and minimum. The second method is simpler and performs better, since reduce avoids shuffling all the values to a single key.
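As a further variant (not part of the original post), both values can also be obtained in a single pass with aggregate instead of two separate reduce jobs; a minimal sketch, reusing the data RDD from above:

//Single-pass sketch: carry a (max, min) pair through each partition, then merge the pairs.
val (maxV, minV) = data.aggregate((Integer.MIN_VALUE, Integer.MAX_VALUE))(
  (acc, num) => (Math.max(acc._1, num), Math.min(acc._2, num)),   //fold a value into the pair
  (a, b) => (Math.max(a._1, b._1), Math.min(a._2, b._2))          //merge pairs from partitions
)
println("max : " + maxV)   // 1001
println("min : " + minV)   // 2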

Mean value problem

Finding the average value for each key is another common case. In Spark, the combineByKey function is usually used for this kind of problem; for a detailed introduction, look up its usage, and see the code below:

@Test
  def testAvg(): Unit = {
    val sconf = new SparkConf().setAppName("test")
    val sc = new SparkContext(sconf)
    //Initialize test data
    val foo = sc.parallelize(List(("a", 1), ("a", 3), ("b", 2), ("b", 8)))
    //combineByKey builds a (sum, count) pair for each key
    val results = foo.combineByKey(
      (v) => (v, 1),                                      //create the initial (sum, count) pair from the first value
      (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),   //fold a new value into the pair within a partition
      (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2) //merge pairs across partitions
    ).map{ case (key, value) => (key, value._1 / value._2.toDouble) }
    results.collect().foreach(println)
    sc.stop
  }

Each partition first computes, for every key it holds, the sum and count of that key's values as a (sum, count) pair; after the shuffle, the pairs for each key are added together, and dividing the total sum by the total count gives the average.
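The same result can also be reached without combineByKey; a minimal sketch (an alternative, not from the original) using mapValues plus reduceByKey on the same foo RDD:

//Turn every value into a (sum, count) pair, merge the pairs per key, then divide.
val averages = foo
  .mapValues(v => (v, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, count) => sum / count.toDouble }
averages.collect().foreach(println)   // expected: (a,2.0) and (b,5.0)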

Top N problem

The top N problem is another classic Hadoop case that reflects the MR way of thinking. How can it be solved conveniently and quickly in Spark?

@Test
  def testTopN(): Unit = {
    val sconf = new SparkConf().setAppName("test")
    val sc = new SparkContext(sconf)
    //Initialize test data
    val foo = sc.parallelize(Array(
      ("a", 1),
      ("a", 2),
      ("a", 3),
      ("b", 3),
      ("b", 1),
      ("a", 4),
      ("b", 4),
      ("b", 2)
    ))
    //For this test, take the top 2 of each group
    val groupsSort = foo.groupByKey().map(tu => {
      val key = tu._1
      val values = tu._2
      val sortValues = values.toList.sortWith(_ > _).take(2)
      (key, sortValues)
    })
    //Flatten back to (key, value) pairs for printing
    val flattenedTopNPerGroup =
      groupsSort.flatMap({ case (key, numbers) => numbers.map(key -> _) })
    //collect first so the output is printed on the driver
    flattenedTopNPerGroup.collect().foreach(println)
    sc.stop
  }

The idea is very simple: group the data by key with groupByKey, sort each group's values in descending order, and take the largest two of each group. Expected results:

(a,4)
(a,3)
(b,4)
(b,3)
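One caveat worth noting (an addition to the original): groupByKey pulls every value of a key onto a single executor, so for keys with very large value lists a bounded per-key buffer built with aggregateByKey is a safer pattern; a minimal sketch on the same foo RDD:

//Keep only the N largest values per key while aggregating, so no key's full value list
//ever has to be materialized in memory.
val n = 2
val topNPerKey = foo.aggregateByKey(List.empty[Int])(
  (buf, v) => (v :: buf).sorted(Ordering[Int].reverse).take(n),   //fold a value into a partition-local buffer
  (b1, b2) => (b1 ++ b2).sorted(Ordering[Int].reverse).take(n)    //merge buffers from different partitions
)
topNPerKey.collect().foreach(println)   // e.g. (a,List(4, 3)) and (b,List(4, 3))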

The above briefly introduces how three common Hadoop cases are implemented in Spark. Readers who have written Hadoop MapReduce jobs will probably find the same logic much easier and faster to write in Spark.

More classic Spark cases will be introduced in future posts; stay tuned…


Author information
MaxLeap team, data analysis team member: Tan Yang [original]
First published at: https://blog.maxleap.cn/archi…

The author's previous articles
Analysis of time series data
Analysis of Apache Spark caching and checkpointing
