[Mr. Zhao Qiang] calculate aggregation using MapReduce in mongodb

Time:2021-10-19

MapReduce can express very complex aggregation logic and is very flexible. However, it is slow and should not be used for real-time data analysis. MapReduce can run in parallel on multiple servers: each server completes only part of the workload, and the partial results are then sent to the master server, which consolidates them, computes the final result set, and returns it to the client.
The basic idea of MapReduce is shown in the following figure:

[Figure: the basic idea of MapReduce, with the map phase in a blue box and the reduce phase in a red box]

This example uses a summation. The map phase runs first: it divides one large task into several small tasks, and each small task runs on a different node, which is what supports distributed computing (the blue box in the figure). The outputs of the small tasks are then combined in a second round of calculation, which yields the final result, 55. This stage is called reduce (the red box in the figure).
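The summation can be sketched in plain JavaScript. This is only an illustration under stated assumptions: the input is taken to be the integers 1 through 10 (which sum to 55), the chunk size of 3 is arbitrary, the helper names `splitIntoChunks` and `sum` are made up for the example, and the "nodes" are simulated by an ordinary loop.

```javascript
// Assumption: the input is 1..10, split into chunks of 3 (both chosen for illustration).
const numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];

// Split the big task into small tasks (one chunk per simulated node).
function splitIntoChunks(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

const sum = (values) => values.reduce((acc, v) => acc + v, 0);

// Map phase: each small task computes a partial sum over its own chunk.
const partialSums = splitIntoChunks(numbers, 3).map(sum);

// Reduce phase: the partial sums are combined in a second calculation.
const total = sum(partialSums);

console.log(partialSums); // [ 6, 15, 24, 10 ]
console.log(total);       // 55
```

Because addition gives the same answer however the numbers are grouped, the chunk size does not affect the final result.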

MapReduce computes an aggregation in three steps: map, shuffle, and reduce. Map and reduce must be defined explicitly; shuffle is implemented by MongoDB.

  • Map: apply the operation to each document to generate a key and a value
  • Shuffle: group by key and combine the values that share a key into an array
  • Reduce: reduce the value array to a single value
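The three steps can be imitated on an in-memory array to see what each one contributes. The sketch below is plain JavaScript, not MongoDB code: `runMapReduce` is a hypothetical helper, the sample documents are invented, and `emit` is passed to the map function as an argument here, whereas in MongoDB it is available as a global inside the map function.

```javascript
// Hypothetical helper: runs map, shuffle and reduce over an in-memory array.
function runMapReduce(docs, mapFn, reduceFn) {
  // Map: call mapFn with `this` bound to each document and collect the emitted pairs.
  const emitted = [];
  const emit = (key, value) => emitted.push({ key, value });
  for (const doc of docs) {
    mapFn.call(doc, emit);
  }

  // Shuffle: group the emitted values by key into arrays.
  const groups = new Map();
  for (const { key, value } of emitted) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }

  // Reduce: collapse each value array to a single value.
  const results = {};
  for (const [key, values] of groups) {
    results[key] = reduceFn(key, values);
  }
  return results;
}

// Invented sample documents, for illustration only.
const docs = [
  { type: "a", qty: 2 },
  { type: "b", qty: 5 },
  { type: "a", qty: 3 },
];

const totals = runMapReduce(
  docs,
  function (emit) { emit(this.type, this.qty); },
  (key, values) => values.reduce((acc, v) => acc + v, 0)
);

console.log(totals); // { a: 5, b: 5 }
```

One simplification to keep in mind: this sketch calls the reduce function for every key, while MongoDB does not call it for a key that has only a single value.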

The following test data (employee data) is used in the demonstrations below.

db.emp.insert(
[
{_id:7369,ename:'SMITH' ,job:'CLERK'    ,mgr:7902,hiredate:'17-12-80',sal:800,comm:0,deptno:20},
{_id:7499,ename:'ALLEN' ,job:'SALESMAN' ,mgr:7698,hiredate:'20-02-81',sal:1600,comm:300 ,deptno:30},
{_id:7521,ename:'WARD'  ,job:'SALESMAN' ,mgr:7698,hiredate:'22-02-81',sal:1250,comm:500 ,deptno:30},
{_id:7566,ename:'JONES' ,job:'MANAGER'  ,mgr:7839,hiredate:'02-04-81',sal:2975,comm:0,deptno:20},
{_id:7654,ename:'MARTIN',job:'SALESMAN' ,mgr:7698,hiredate:'28-09-81',sal:1250,comm:1400,deptno:30},
{_id:7698,ename:'BLAKE' ,job:'MANAGER'  ,mgr:7839,hiredate:'01-05-81',sal:2850,comm:0,deptno:30},
{_id:7782,ename:'CLARK' ,job:'MANAGER'  ,mgr:7839,hiredate:'09-06-81',sal:2450,comm:0,deptno:10},
{_id:7788,ename:'SCOTT' ,job:'ANALYST'  ,mgr:7566,hiredate:'19-04-87',sal:3000,comm:0,deptno:20},
{_id:7839,ename:'KING'  ,job:'PRESIDENT',mgr:0,hiredate:'17-11-81',sal:5000,comm:0,deptno:10},
{_id:7844,ename:'TURNER',job:'SALESMAN' ,mgr:7698,hiredate:'08-09-81',sal:1500,comm:0,deptno:30},
{_id:7876,ename:'ADAMS' ,job:'CLERK'    ,mgr:7788,hiredate:'23-05-87',sal:1100,comm:0,deptno:20},
{_id:7900,ename:'JAMES' ,job:'CLERK'    ,mgr:7698,hiredate:'03-12-81',sal:950,comm:0,deptno:30},
{_id:7902,ename:'FORD'  ,job:'ANALYST'  ,mgr:7566,hiredate:'03-12-81',sal:3000,comm:0,deptno:20},
{_id:7934,ename:'MILLER',job:'CLERK'    ,mgr:7782,hiredate:'23-01-82',sal:1300,comm:0,deptno:10}
]
);

(case 1) Count the employees in each job in the employee collection

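// map: emit the pair (job, 1) for every employee document
var map1=function(){emit(this.job,1)}
// reduce: counts is the array of 1s emitted for this job; add them up
var reduce1=function(job,counts){return Array.sum(counts)}
// run the job and store the result in the mrdemo1 collection
db.emp.mapReduce(map1,reduce1,{out:"mrdemo1"})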

(case 2) Calculate the total salary of each department in the employee collection

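// map: emit the pair (deptno, sal) for every employee document
var map2=function(){emit(this.deptno,this.sal)}
// reduce: sal is the array of salaries emitted for this department; add them up
var reduce2=function(deptno,sal){return Array.sum(sal)}
// run the job and store the result in the mrdemo2 collection
db.emp.mapReduce(map2,reduce2,{out:"mrdemo2"})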
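As a cross-check on cases 1 and 2, the same groupings can be computed outside MongoDB. The sketch below is plain JavaScript rather than shell code; it copies only the `job`, `deptno` and `sal` fields from the test data, and `groupSum` is a hypothetical helper name. The documents stored in `mrdemo1` and `mrdemo2` should carry the same numbers, in the form `{ _id: <key>, value: <result> }`.

```javascript
// job, deptno and sal copied from the emp test data, in the same order.
const emp = [
  ["CLERK", 20, 800], ["SALESMAN", 30, 1600], ["SALESMAN", 30, 1250],
  ["MANAGER", 20, 2975], ["SALESMAN", 30, 1250], ["MANAGER", 30, 2850],
  ["MANAGER", 10, 2450], ["ANALYST", 20, 3000], ["PRESIDENT", 10, 5000],
  ["SALESMAN", 30, 1500], ["CLERK", 20, 1100], ["CLERK", 30, 950],
  ["ANALYST", 20, 3000], ["CLERK", 10, 1300],
].map(([job, deptno, sal]) => ({ job, deptno, sal }));

// Hypothetical helper: sums valueOf(doc) for each distinct keyOf(doc).
function groupSum(docs, keyOf, valueOf) {
  const out = {};
  for (const doc of docs) {
    const key = keyOf(doc);
    out[key] = (out[key] || 0) + valueOf(doc);
  }
  return out;
}

// Case 1: one unit per employee, grouped by job.
const countByJob = groupSum(emp, (d) => d.job, () => 1);
// Case 2: salary per employee, grouped by department.
const salByDept = groupSum(emp, (d) => d.deptno, (d) => d.sal);

console.log(countByJob);
// { CLERK: 4, SALESMAN: 4, MANAGER: 3, ANALYST: 2, PRESIDENT: 1 }
console.log(salByDept);
// { '10': 8750, '20': 10875, '30': 9400 }
```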

(case 3) Troubleshooting the map function

Define your own emit function, so that the map function can be called directly in the shell and its output inspected:
var emit = function(key, value) {
    print("emit");
    print("key: " + key + "  value: " + tojson(value));
}

Test a single document:
var emp7839=db.emp.findOne({_id:7839})
map2.apply(emp7839)
This prints the following:
emit
key: 10  value: 5000

Test multiple documents:
var myCursor=db.emp.find()
while (myCursor.hasNext()) {
    var doc = myCursor.next();
    print ("document _id= " + tojson(doc._id));
    map2.apply(doc);
    print();
}
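Printing works well for a handful of documents. A variation, sketched here in plain JavaScript, is an emit stub that records each pair in an array so the pairs can be inspected or compared afterwards. The array name `emitted` and the two sample documents are assumptions made for the example; in the shell, the existing `map2` and a cursor over `db.emp` would take their place.

```javascript
// Collecting stub: records every emitted pair instead of printing it.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Same map function as in case 2.
const map2 = function () { emit(this.deptno, this.sal); };

// Two sample documents (fields copied from the test data).
const sampleDocs = [
  { _id: 7839, ename: "KING", sal: 5000, deptno: 10 },
  { _id: 7369, ename: "SMITH", sal: 800, deptno: 20 },
];

// apply() binds `this` to the document, as MongoDB does when it runs map.
for (const doc of sampleDocs) {
  map2.apply(doc);
}

console.log(emitted);
// [ { key: 10, value: 5000 }, { key: 20, value: 800 } ]
```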

(case 4) Troubleshooting the reduce function

A simple test case:
var myTestValues = [ 5, 5, 10 ];
var reduce1=function(key,values){return Array.sum(values)}
reduce1("mykey",myTestValues)

Test a reduce function whose values are objects with several fields.
Test data, salary (sal) and commission (comm):
var myTestObjects = [
                      { sal: 1000, comm: 5 },
                      { sal: 2000, comm: 10 },
                      { sal: 3000, comm: 15 }
                    ];
Define the reduce function:
var reduce2=function(key,values) {
   // declare with var so the accumulator is not an implicit global
   var reducedValue = { sal: 0, comm: 0 };
   for(var i=0;i<values.length;i++) {
     reducedValue.sal += values[i].sal;
     reducedValue.comm += values[i].comm;
   }
   return reducedValue;
}

Test:
reduce2("aa",myTestObjects)
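MongoDB may call the reduce function more than once for the same key, feeding the result of an earlier call back in as one of the values, so a reduce function should return the same kind of object it receives and should not depend on the order or grouping of its input. The sketch below checks this for `reduce2` in plain JavaScript; the variable names `direct`, `partial`, `reReduced` and `reversed` are made up for the example.

```javascript
// Same logic as reduce2 above.
const reduce2 = function (key, values) {
  const reducedValue = { sal: 0, comm: 0 };
  for (let i = 0; i < values.length; i++) {
    reducedValue.sal += values[i].sal;
    reducedValue.comm += values[i].comm;
  }
  return reducedValue;
};

const values = [
  { sal: 1000, comm: 5 },
  { sal: 2000, comm: 10 },
  { sal: 3000, comm: 15 },
];

// Reduce everything in one call.
const direct = reduce2("aa", values);

// Reduce the first two values, then feed that partial result back in.
const partial = reduce2("aa", [values[0], values[1]]);
const reReduced = reduce2("aa", [partial, values[2]]);

// Reduce the same values in reverse order.
const reversed = reduce2("aa", values.slice().reverse());

console.log(direct);    // { sal: 6000, comm: 30 }
console.log(reReduced); // { sal: 6000, comm: 30 }
console.log(reversed);  // { sal: 6000, comm: 30 }
```

All three calls agree, which is what one would hope to see before running the function inside mapReduce.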
