[Mr. Zhao Qiang] Using MapReduce to compute aggregations in MongoDB


MapReduce is very flexible and can express very complex aggregation logic. However, it is slow and should not be used for real-time data analysis. A MapReduce job can run in parallel on multiple servers: each server is responsible for only part of the workload, and the partial results are then sent to the master server, which merges them into the final result set and returns it to the client.
The basic idea of MapReduce is shown in the figure below:

The example here is a summation. First, a large task is split into several small tasks, each of which can run on a different node to support distributed computing. This phase is called map (the blue box in the figure). The outputs of the small tasks are then combined, and the final result, 55, is obtained. This phase is called reduce (the red box in the figure).
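The summation can be mimicked in a few lines of plain JavaScript. This is only a sketch: it assumes the task is summing the integers 1 through 10 (which is consistent with the result 55), the chunk size of 5 and the helper name `sumChunk` are made up for illustration, and the "nodes" are simulated by array slices rather than real servers.

```javascript
// Assumed input: the integers 1..10, whose total is 55.
const numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];

// Hypothetical helper: the work one node would do on its share of the data.
function sumChunk(chunk) {
  return chunk.reduce((acc, n) => acc + n, 0);
}

// "Map" phase: split the large task into small tasks (chunk size is an assumption).
const chunkSize = 5;
const chunks = [];
for (let i = 0; i < numbers.length; i += chunkSize) {
  chunks.push(numbers.slice(i, i + chunkSize));
}
const partialSums = chunks.map(sumChunk); // [15, 40]

// "Reduce" phase: combine the partial results into the final answer.
const total = sumChunk(partialSums);
console.log(partialSums, total); // [ 15, 40 ] 55
```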

Computing an aggregation with MapReduce takes three steps: map, shuffle and reduce. Map and reduce must be defined explicitly; shuffle is carried out by MongoDB.

  • Map: apply the map function to each document to emit a key and a value
  • Shuffle: group by key, collecting the values that share a key into an array
  • Reduce: reduce each value array to a single value
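The three steps above can be imitated outside the database to see what each one produces. The following is a minimal sketch in plain JavaScript, not MongoDB's implementation: `runMapReduce`, the sample documents, and the `emit` passed in as a parameter are all assumptions made for illustration.

```javascript
// Hypothetical in-memory stand-in for the three steps (not MongoDB's code).
function runMapReduce(docs, mapFn, reduceFn) {
  // Map: call mapFn with `this` bound to each document and collect emitted pairs.
  // Unlike the mongo shell, emit is handed to mapFn as a parameter here.
  const pairs = [];
  const emit = (key, value) => pairs.push([key, value]);
  docs.forEach((doc) => mapFn.call(doc, emit));

  // Shuffle: group the values that share a key into one array per key.
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }

  // Reduce: collapse each value array into a single value.
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reduceFn(key, values);
  }
  return { groups, result };
}

// Made-up sample documents.
const docs = [
  { item: 'a', qty: 2 },
  { item: 'b', qty: 5 },
  { item: 'a', qty: 3 },
];
const { groups, result } = runMapReduce(
  docs,
  function (emit) { emit(this.item, this.qty); },
  (key, values) => values.reduce((s, v) => s + v, 0)
);
console.log(Object.fromEntries(groups)); // { a: [ 2, 3 ], b: [ 5 ] }
console.log(result);                     // { a: 5, b: 5 }
```

One difference worth keeping in mind: MongoDB does not call the reduce function for a key that has only a single value, whereas this sketch always calls it.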

The following test data (employee records) is used in the demonstrations below.

{_id:7369,ename:'SMITH' ,job:'CLERK'    ,mgr:7902,hiredate:'17-12-80',sal:800,comm:0,deptno:20},
{_id:7499,ename:'ALLEN' ,job:'SALESMAN' ,mgr:7698,hiredate:'20-02-81',sal:1600,comm:300 ,deptno:30},
{_id:7521,ename:'WARD'  ,job:'SALESMAN' ,mgr:7698,hiredate:'22-02-81',sal:1250,comm:500 ,deptno:30},
{_id:7566,ename:'JONES' ,job:'MANAGER'  ,mgr:7839,hiredate:'02-04-81',sal:2975,comm:0,deptno:20},
{_id:7654,ename:'MARTIN',job:'SALESMAN' ,mgr:7698,hiredate:'28-09-81',sal:1250,comm:1400,deptno:30},
{_id:7698,ename:'BLAKE' ,job:'MANAGER'  ,mgr:7839,hiredate:'01-05-81',sal:2850,comm:0,deptno:30},
{_id:7782,ename:'CLARK' ,job:'MANAGER'  ,mgr:7839,hiredate:'09-06-81',sal:2450,comm:0,deptno:10},
{_id:7788,ename:'SCOTT' ,job:'ANALYST'  ,mgr:7566,hiredate:'19-04-87',sal:3000,comm:0,deptno:20},
{_id:7839,ename:'KING'  ,job:'PRESIDENT',mgr:0,hiredate:'17-11-81',sal:5000,comm:0,deptno:10},
{_id:7844,ename:'TURNER',job:'SALESMAN' ,mgr:7698,hiredate:'08-09-81',sal:1500,comm:0,deptno:30},
{_id:7876,ename:'ADAMS' ,job:'CLERK'    ,mgr:7788,hiredate:'23-05-87',sal:1100,comm:0,deptno:20},
{_id:7900,ename:'JAMES' ,job:'CLERK'    ,mgr:7698,hiredate:'03-12-81',sal:950,comm:0,deptno:30},
{_id:7902,ename:'FORD'  ,job:'ANALYST'  ,mgr:7566,hiredate:'03-12-81',sal:3000,comm:0,deptno:20},
{_id:7934,ename:'MILLER',job:'CLERK'    ,mgr:7782,hiredate:'23-01-82',sal:1300,comm:0,deptno:10}


(case 1) calculate the number of employees in each position in the employee table

var map1=function(){emit(this.job,1)}
var reduce1=function(job,count){return Array.sum(count)}


(case 2) calculate the total salary of each department in the employee table

var map2=function(){emit(this.deptno,this.sal)}
var reduce2=function(deptno,sal){return Array.sum(sal)}
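To check what cases 1 and 2 should return for the test data, the two map/reduce pairs can be replayed in plain JavaScript. This is a sketch under assumptions: `simulate` and the local `emit` are hypothetical helpers, `sum` stands in for the mongo shell's `Array.sum`, and only the `job`, `deptno` and `sal` fields of the 14 employees are copied.

```javascript
// job, deptno and sal copied from the test data (other fields omitted).
const emp = [
  ['CLERK', 20, 800], ['SALESMAN', 30, 1600], ['SALESMAN', 30, 1250],
  ['MANAGER', 20, 2975], ['SALESMAN', 30, 1250], ['MANAGER', 30, 2850],
  ['MANAGER', 10, 2450], ['ANALYST', 20, 3000], ['PRESIDENT', 10, 5000],
  ['SALESMAN', 30, 1500], ['CLERK', 20, 1100], ['CLERK', 30, 950],
  ['ANALYST', 20, 3000], ['CLERK', 10, 1300],
].map(([job, deptno, sal]) => ({ job, deptno, sal }));

// Stand-in for the mongo shell's Array.sum, which plain JavaScript lacks.
const sum = (values) => values.reduce((a, b) => a + b, 0);

// Hypothetical emit: files each value under its key for the current run.
let groups;
function emit(key, value) {
  (groups[key] = groups[key] || []).push(value);
}

// Hypothetical helper: map every document, then reduce each group of values.
function simulate(docs, mapFn, reduceFn) {
  groups = {};
  docs.forEach((doc) => mapFn.apply(doc));
  const out = {};
  for (const key of Object.keys(groups)) out[key] = reduceFn(key, groups[key]);
  return out;
}

// Cases 1 and 2, with Array.sum replaced by the local sum helper.
const map1 = function () { emit(this.job, 1); };
const reduce1 = function (job, count) { return sum(count); };
const map2 = function () { emit(this.deptno, this.sal); };
const reduce2 = function (deptno, sal) { return sum(sal); };

const jobCounts = simulate(emp, map1, reduce1);
const deptSalaries = simulate(emp, map2, reduce2);
console.log(jobCounts);
console.log(deptSalaries);
```

For the test data this gives 4 clerks, 4 salesmen, 3 managers, 2 analysts and 1 president, and salary totals of 8750, 10875 and 9400 for departments 10, 20 and 30. In the mongo shell itself, a pair such as map1 and reduce1 would typically be run with db.emp.mapReduce(map1, reduce1, { out: { inline: 1 } }), assuming the collection is named emp.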


(case 3) troubleshoot the map function

Define your own emit function:
var emit = function(key, value) {
    print("key: " + key + "  value: " + tojson(value));
}

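Test a single document (here KING's record, with map2 from case 2):
var myDoc = db.emp.findOne({_id: 7839});
map2.apply(myDoc);
The result is as follows:
key: 10  value: 5000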

Test multiple documents:
var myCursor=db.emp.find()
while (myCursor.hasNext()) {
    var doc = myCursor.next();
    print ("document _id= " + tojson(doc._id));
    map2.apply(doc);   // assumed: call the map function under test on each document
}
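A map function reads the current document from `this`, so once `emit` has been replaced it can be exercised on one document at a time with `Function.prototype.apply`. A minimal sketch in plain JavaScript, outside the mongo shell (the `emitted` array and the trimmed-down KING document are assumptions for illustration):

```javascript
// Replacement emit that records pairs instead of handing them to MongoDB.
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// map2 from case 2, unchanged.
var map2 = function () { emit(this.deptno, this.sal); };

// Trimmed copy of the KING document from the test data.
const king = { _id: 7839, ename: 'KING', sal: 5000, deptno: 10 };

// apply binds `this` to the document, which is how the map function sees its input.
map2.apply(king);
console.log(emitted); // [ { key: 10, value: 5000 } ]
```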


(case 4) troubleshoot the reduce function

A simple test case:
var myTestValues = [ 5, 5, 10 ];
var reduce1=function(key,values){return Array.sum(values)}
reduce1('myKey', myTestValues)   // 'myKey' is an arbitrary key; returns 20

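Test: each value passed to reduce contains multiple fields
Test data: salary, bonus:
var myTestObjects = [
                      { sal: 1000, comm: 5 },
                      { sal: 2000, comm: 10 },
                      { sal: 3000, comm: 15 }
                    ];
Develop the reduce method:
var reduce2=function(key,values) {
   var reducedValue = { sal: 0, comm: 0 };
   // assumed completion: add up sal and comm over every value
   for(var i=0;i<values.length;i++) {
      reducedValue.sal += values[i].sal;
      reducedValue.comm += values[i].comm;
   }
   return reducedValue;
}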
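Because MongoDB may call the reduce function more than once for the same key, feeding earlier outputs back in as values, a useful extra check is that re-reducing gives the same answer as reducing everything at once. The sketch below is plain JavaScript with a complete summing reducer written for illustration; `sumSalComm` and its loop body are assumptions, not the author's original code.

```javascript
const myTestObjects = [
  { sal: 1000, comm: 5 },
  { sal: 2000, comm: 10 },
  { sal: 3000, comm: 15 },
];

// Hypothetical reducer: adds up sal and comm across all values for a key.
// Its output has the same shape as its inputs, so it can be fed back in.
function sumSalComm(key, values) {
  const reducedValue = { sal: 0, comm: 0 };
  for (let i = 0; i < values.length; i++) {
    reducedValue.sal += values[i].sal;
    reducedValue.comm += values[i].comm;
  }
  return reducedValue;
}

// Reduce everything in one call.
const direct = sumSalComm('myKey', myTestObjects);

// Reduce the first two values, then feed that output back in with the third.
const partial = sumSalComm('myKey', myTestObjects.slice(0, 2));
const rereduced = sumSalComm('myKey', [partial, myTestObjects[2]]);

console.log(direct);    // { sal: 6000, comm: 30 }
console.log(rereduced); // { sal: 6000, comm: 30 }
```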