Using Spark to solve some classic MapReduce problems



Spark is an Apache project billed as “lightning-fast cluster computing”. It has a thriving open-source community and is currently the most active Apache project. Spark provides a faster and more general data processing platform: compared with Hadoop, a Spark program can run up to 100 times faster in memory, or 10 times faster on disk. At the same time, Spark makes traditional MapReduce job development easier and faster. This post re-implements several classic Hadoop MapReduce cases in Spark as a way to get familiar with Spark development.

Max and min

Finding the maximum and minimum values has always been a classic Hadoop example. Implementing it in Spark gives a feel for how MapReduce ideas carry over. Without further ado, straight to the code:

  def testMaxMin(): Unit = {
    val sconf = new SparkConf().setAppName("test")
    val sc = new SparkContext(sconf)
    // Initialize test data
    val data = sc.parallelize(Array(10, 7, 3, 4, 5, 6, 7, 8, 1001, 6, 2))
    // Method 1: map everything to a single key, group, then scan for max and min
    data.map(x => ("key", x)).groupByKey().map(x => {
      var min = Integer.MAX_VALUE
      var max = Integer.MIN_VALUE
      for (num <- x._2) {
        max = Math.max(num, max)
        min = Math.min(num, min)
      }
      (max, min)
    }).collect.foreach(x => {
      println("max : " + x._1)
      println("min : " + x._2)
    })

    // Method 2: a simpler trick -- obtain the max and min directly with reduce
    val max = data.reduce((a, b) => Math.max(a, b))
    val min = data.reduce((a, b) => Math.min(a, b))
    println("max : " + max)
    println("min : " + min)
  }

Expected results:

max: 1001
min: 2

The idea is similar to MapReduce in Hadoop: map every value to a single key, aggregate them together with groupByKey, and then scan the grouped values for the maximum and minimum. The second method is simpler and performs better, since it avoids shuffling every value under one key.
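To see why the reduce-based method is safe to run in parallel, here is a small plain-Scala sketch (no Spark required) that simulates partitioned reduction; `parallelReduce` is a helper name made up for this illustration. Because Math.max and Math.min are associative and commutative, reducing each partition first and then reducing the partial results gives the same answer as a single sequential pass:

```scala
// Simulate Spark's partitioned reduce locally: split the data into
// chunks ("partitions"), reduce each chunk, then reduce the partials.
def parallelReduce(data: Array[Int], partitionSize: Int)(op: (Int, Int) => Int): Int =
  data.grouped(partitionSize).map(_.reduce(op)).reduce(op)

val data = Array(10, 7, 3, 4, 5, 6, 7, 8, 1001, 6, 2)
println("max : " + parallelReduce(data, 4)(Math.max)) // max : 1001
println("min : " + parallelReduce(data, 4)(Math.min)) // min : 2
```

This is exactly the property Spark's `reduce` relies on: the function you pass must be associative and commutative, because partition-level partial results are combined in no guaranteed order.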

Mean value problem

Finding the average value for each key is another common case. In Spark this kind of problem is usually handled with the combineByKey function; for a detailed introduction please look up its documentation. See the code below:

  def testAvg(): Unit = {
    val sconf = new SparkConf().setAppName("test")
    val sc = new SparkContext(sconf)
    // Initialize test data
    val foo = sc.parallelize(List(("a", 1), ("a", 3), ("b", 2), ("b", 8)))
    // combineByKey builds a (sum, count) pair per key, then we divide to get the average
    val results = foo.combineByKey(
      (v: Int) => (v, 1),                                     // createCombiner: first value seen for a key
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),  // mergeValue: fold a value into the pair
      (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2) // mergeCombiners
    ).map { case (key, (sum, count)) => (key, sum.toDouble / count) }
    results.collect.foreach(println)
  }

Each partition first computes a local (sum, count) pair for every key it holds; after the shuffle, the pairs belonging to the same key are added together, and dividing the total sum by the total count yields the average.
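To make the two phases concrete, here is a plain-Scala sketch (no Spark required) of the same (sum, count) logic; `avgByKey`, `mergeValue` and `mergeCombiners` are illustrative names, and the list of lists stands in for the RDD's partitions:

```scala
type Acc = (Int, Int) // (sum, count)

// Fold one more value into a per-key combiner
def mergeValue(acc: Acc, v: Int): Acc = (acc._1 + v, acc._2 + 1)
// Merge two combiners built on different partitions
def mergeCombiners(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

def avgByKey(partitions: Seq[Seq[(String, Int)]]): Map[String, Double] = {
  // Phase 1: each partition builds its own (sum, count) per key
  val perPartition = partitions.map { part =>
    part.foldLeft(Map.empty[String, Acc]) { case (m, (k, v)) =>
      m.updated(k, mergeValue(m.getOrElse(k, (0, 0)), v))
    }
  }
  // Phase 2 (the shuffle step): merge combiners for the same key
  perPartition.flatMap(_.toSeq).groupBy(_._1).map { case (k, accs) =>
    val (sum, count) = accs.map(_._2).reduce(mergeCombiners)
    k -> sum.toDouble / count
  }
}

val parts = Seq(Seq(("a", 1), ("b", 2)), Seq(("a", 3), ("b", 8)))
println(avgByKey(parts)) // averages: a -> 2.0, b -> 5.0
```

Only the small (sum, count) pairs cross the network in phase 2, which is why combineByKey scales better than grouping all raw values per key.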

Top-N problem

The top-N problem is another classic Hadoop case that reflects the MapReduce way of thinking. Here is a convenient and quick way to solve it in Spark:

  def testTopN(): Unit = {
    val sconf = new SparkConf().setAppName("test")
    val sc = new SparkContext(sconf)
    // Initialize test data
    val foo = sc.parallelize(Array(
      ("a", 1),
      ("a", 2),
      ("a", 3),
      ("b", 3),
      ("b", 1),
      ("a", 4),
      ("b", 4),
      ("b", 2)
    ))
    // For this test, take the top 2 of each key
    val groupsSort = foo.groupByKey().map(tu => {
      val key = tu._1
      val values = tu._2
      val sortValues = values.toList.sortWith(_ > _).take(2)
      (key, sortValues)
    })
    // Flatten the format for printing
    val flattenedTopNPerGroup =
      groupsSort.flatMap({ case (key, numbers) => numbers.map(key -> _) })
    flattenedTopNPerGroup.foreach(println)
  }


The idea is very simple: group the data by key with groupByKey, then take the largest two values of each group. Expected results (ordering across keys may vary):

(a,4)
(a,3)
(b,4)
(b,3)

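One caveat: groupByKey pulls every value for a key into memory, which can hurt on skewed keys. A more scalable pattern keeps only an n-element list while folding, the way a MapReduce combiner would. Below is a plain-Scala sketch of that idea (no Spark required); `addToTopN`, `mergeTopN` and `topNByKey` are illustrative names:

```scala
val n = 2

// Fold one value into a running top-n list, kept sorted descending;
// memory use is bounded by n regardless of how many values a key has
def addToTopN(top: List[Int], v: Int): List[Int] =
  (v :: top).sortWith(_ > _).take(n)

// Merge two partial top-n lists (what would happen at shuffle time)
def mergeTopN(a: List[Int], b: List[Int]): List[Int] =
  (a ++ b).sortWith(_ > _).take(n)

def topNByKey(records: Seq[(String, Int)]): Map[String, List[Int]] =
  records.foldLeft(Map.empty[String, List[Int]]) { case (m, (k, v)) =>
    m.updated(k, addToTopN(m.getOrElse(k, Nil), v))
  }

val data = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 3),
               ("b", 1), ("a", 4), ("b", 4), ("b", 2))
println(topNByKey(data)) // top 2 per key: a -> List(4, 3), b -> List(4, 3)
```

In Spark itself, this same pair of functions should plug into aggregateByKey, along the lines of `foo.aggregateByKey(List.empty[Int])(addToTopN, mergeTopN)`, so that only n values per key per partition are shuffled.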
The above briefly introduced Spark implementations of three common Hadoop cases. Readers who have worked on or written Hadoop MapReduce jobs will likely find the same logic much easier and faster to express in Spark.

More classic Spark cases will be introduced in future posts. Stay tuned…

Author information
MaxLeap data analysis team member: Tan Yang [original]
First published at:…

The author’s previous articles
Analysis of time series data
Analysis of Apache Spark caching and checkpointing
