MongoDB Guide – 16, Aggregation

Time: 2019-7-30

Last article: MongoDB Guide – 15, Special Index and Collection: Geospatial Index, Using GridFS to Store Files
Next article: MongoDB Guide – 17, MapReduce

If you have data stored in MongoDB, you may want to do more than just extract it; you may want to analyze it and use it. This chapter describes the aggregation tools provided by MongoDB:

  • Aggregation framework;
  • MapReduce;
  • Several simple aggregation commands: count, distinct, and group.

Aggregation framework

Documents in a collection can be transformed and combined using an aggregation framework. Basically, you can create a pipeline with multiple components to process a series of documents. These components include filtering, projecting, grouping, sorting, limiting and skipping.
For example, if you have a collection of magazine articles, you may want to find the author who publishes the most articles. Assuming that each article is saved as a document in MongoDB, you can create a pipeline in the following steps.

  1. Project the author field out of each article document.
  2. Group the authors by name, counting the number of articles for each name.
  3. Sort the authors in descending order by their article counts.
  4. Limit the results to the first five.

Each of these steps corresponds to an operator in the aggregation framework:

  1. {"$project" : {"author" : 1}}

This projects the "author" field from each document.
The syntax is similar to a query's field selector: you include a field by specifying "fieldname" : 1, or exclude unneeded fields with "fieldname" : 0. After the "$project" operation, each document in the result set looks like {"_id" : id, "author" : "authorName"}. These intermediate results exist only in memory and are not written to disk.

  2. {"$group" : {"_id" : "$author", "count" : {"$sum" : 1}}}

This groups the authors by name and adds 1 to "count" each time an author's name appears.
First, the field to group by, "author", is specified with "_id" : "$author". You can picture it this way: once the grouping is done, each author corresponds to exactly one result document, so "author" becomes the document's unique identifier ("_id").
The second field adds 1 to "count" for each document in the group. Note that the incoming documents do not have a "count" field; "$group" creates this new field.
After this step, each document in the result set will be structured as follows:

{"_id" : "authorName", "count" : articleCount}。
  3. {"$sort" : {"count" : -1}}

This operation will sort the documents in the result set in descending order according to the “count” field.

  4. {"$limit" : 5}

This operation limits the final return result to the first five documents in the current result.
To run this in MongoDB, pass each of these operations to the aggregate() function:

> db.articles.aggregate({"$project" : {"author" : 1}},
... {"$group" : {"_id" : "$author", "count" : {"$sum" : 1}}},
... {"$sort" : {"count" : -1}},
... {"$limit" : 5})
{
    "result" : [
        {
            "_id" : "R. L. Stine",
            "count" : 430
        }, 
        {
            "_id" : "Edgar Wallace",
            "count" : 175
        },
        {
            "_id" : "Nora Roberts",
            "count" : 145
        },
        {
            "_id" : "Erle Stanley Gardner",
            "count" : 140
        },
        {
            "_id" : "Agatha Christie",
            "count" : 85
        }
    ],
    "ok" : 1
}

aggregate() returns an array of documents containing the five authors who published the most articles.

If the pipeline does not give the expected results, it needs to be debugged. When debugging, start by running the pipeline with only its first operator. If that produces the expected result, add the next operator. For the previous example, first try aggregating with only the "$project" operator; if its output looks right, add the "$group" operator; if that works, add the "$sort"; and finally add the "$limit" operator. This lets you gradually pinpoint which operator is causing the problem.
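For instance, against the articles collection above, the incremental approach might look like this (a sketch; each run adds one more operator):

> // first, check the projection alone
> db.articles.aggregate({"$project" : {"author" : 1}})
> // if that looks right, add the grouping and inspect the counts
> db.articles.aggregate({"$project" : {"author" : 1}},
... {"$group" : {"_id" : "$author", "count" : {"$sum" : 1}}})
> // then add {"$sort" : {"count" : -1}} and finally {"$limit" : 5},
> // checking the output after each addition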

When this book was written, the aggregation framework could not write to the collection, so all the results must be returned to the client. Therefore, aggregation results must be limited to 16 MB (the maximum response message size supported by MongoDB).

Pipeline operator

Each operator receives a stream of documents, performs some type of transformation on those documents, and then passes the transformed documents on to the next operator as its input (for the last pipeline operator, the results are returned to the client).
Different pipeline operators can be combined in any order and can be repeated as many times as needed. For example, you can do a "$match", then a "$group", and then another "$match" (matching a different set of conditions than the first "$match").
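As a rough sketch (the "wordCount" field and the thresholds here are hypothetical), such a pipeline might look like:

> db.articles.aggregate({"$match" : {"wordCount" : {"$gt" : 1000}}},
... {"$group" : {"_id" : "$author", "count" : {"$sum" : 1}}},
... {"$match" : {"count" : {"$gte" : 10}}})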

 $match

"$match" filters the incoming documents so that the rest of the pipeline aggregates only the matching subset. For example, to restrict the aggregation to users from Oregon, you can use {"$match" : {"state" : "OR"}}. "$match" can use all of the normal query operators ("$gt", "$lt", "$in", and so on). The one exception is that geospatial operators cannot be used in "$match".
In practice, "$match" should be placed as early in the pipeline as possible. This has two advantages: unnecessary documents are filtered out quickly, reducing the pipeline's workload; and if "$match" runs before any projection or grouping, the query can use indexes.
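For example, assuming a hypothetical users collection with "state" and "city" fields and an index on "state", placing the "$match" first lets the initial filter use that index:

> db.users.ensureIndex({"state" : 1})
> db.users.aggregate({"$match" : {"state" : "OR"}},
... {"$group" : {"_id" : "$city", "userCount" : {"$sum" : 1}}})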

 $project

Compared with “normal” queries, projection operations in pipelines are more powerful. Using “$project” you can extract fields from subdocuments, rename fields, and perform some interesting operations on these fields.
The simplest "$project" operation selects the desired fields from a document. You can specify fields to include or exclude with a syntax similar to the second argument of a query. If the following code is run on the original collection, each returned document will contain only an "author" field.

> db.articles.aggregate({"$project" : {"author" : 1, "_id" : 0}})

By default, the "_id" field is returned if it exists in the incoming document (it may have been removed by an earlier pipeline operator or projection). You can exclude "_id" from the result documents as shown above. The rules for including and excluding fields are the same as for regular queries.
The projected fields can also be renamed. For example, the “_id” of each user document can be renamed “userId” in the return result:

> db.users.aggregate({"$project" : {"userId" : "$_id", "_id" : 0}})
{
    "result" : [
        {
            "userId" : ObjectId("50e4b32427b160e099ddbee7")
        },
        {
            "userId" : ObjectId("50e4b32527b160e099ddbee8")
        }
        ...
    ],
    "ok" : 1
}

The "$fieldname" syntax is how the aggregation framework refers to the value of fieldname (in the example above, "_id"). For example, "$age" is replaced with the contents of the "age" field (a number or a string, whatever it happens to be), and "$tags.3" is replaced with the fourth element of the tags array. Thus, the "$_id" in the example above is replaced by the value of the "_id" field of each document entering the pipeline.
Note that you must explicitly exclude "_id"; otherwise the value of this field would be returned twice: once labeled "userId" and once labeled "_id". You can use this technique to make multiple copies of a field for later use in a "$group".
When a field is renamed, MongoDB does not track its former name. Therefore, if there is an index on the "originalFieldname" field, the aggregation framework cannot use that index for the sort below, even though it is easy to see that "newFieldname" and "originalFieldname" refer to the same field.

> db.articles.aggregate({"$project" : {"newFieldname" : "$originalFieldname"}},
... {"$sort" : {"newFieldname" : 1}})

Therefore, you should try to take advantage of the index before renaming the field.
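A minimal sketch of that ordering, assuming an index on "originalFieldname" exists: sort on the original field first (straight from the collection, so the index can be used), and only then rename it.

> db.articles.aggregate({"$sort" : {"originalFieldname" : 1}},
... {"$project" : {"newFieldname" : "$originalFieldname"}})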

1. Pipeline expressions

The simplest "$project" expressions are inclusion, exclusion, and plain field names ("$fieldname"). But there are more powerful options as well: you can use expressions to combine multiple literals and variables into a single value.
Several expressions in the aggregation framework can be combined or nested at any depth to create complex expressions.

2. Mathematical expressions

Arithmetic expressions operate on numeric values: you supply a set of values and the expression computes a result from them. For example, the following expression adds the values of the "salary" and "bonus" fields.

> db.employees.aggregate(
... {
...     "$project" : {
...          "totalPay" : {
...              "$add" : ["$salary", "$bonus"]
...          }
...      }
... })

Multiple expressions can be nested to form more complex expressions. Suppose we want to subtract the 401(k) contribution from the total pay. We can use the "$subtract" expression:

401(k) is a retirement savings plan in the United States. (Translator's note)

> db.employees.aggregate(
... {
...     "$project" : {
...         "totalPay" : {
...              "$subtract" : [{"$add" : ["$salary", "$bonus"]}, "$401k"]
...         }
...     }
... })

Expressions can be nested at any level.
The following is the syntax for each operator:

  • “$add” : [expr1[, expr2, …, exprN]]

This operator accepts one or more expressions as parameters and adds them together.

  • “$subtract” : [expr1, expr2]

Accept two expressions as parameters and subtract the second expression from the first expression as a result.

  • “$multiply” : [expr1[, expr2, …, exprN]]

Accept one or more expressions and multiply them.

  • “$divide” : [expr1, expr2]

Accepts two expressions and divides the first by the second, returning the quotient.

  • “$mod” : [expr1, expr2]

Accepts two expressions, divides the first by the second, and returns the remainder.
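As a small illustrative sketch (the "hoursWorked" field is hypothetical), several of these operators can be combined, for example to compute an hourly rate:

> db.employees.aggregate(
... {
...     "$project" : {
...         "hourlyRate" : {
...             "$divide" : [{"$add" : ["$salary", "$bonus"]}, "$hoursWorked"]
...         }
...     }
... })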

3. Date expressions

Many aggregations are time-based: What happened last week? What happened last month? What happened in the past year? Therefore, the aggregation framework contains expressions for extracting date information: "$year", "$month", "$week", "$dayOfMonth", "$dayOfWeek", "$dayOfYear", "$hour", "$minute" and "$second". Date operations can only be performed on date-type fields, not on numeric fields.
Each date expression works in a similar way: it takes a date expression and returns a value. For example, the following code returns the month in which each employee was hired:

> db.employees.aggregate(
... {
...     "$project" : {
...         "hiredIn" : {"$month" : "$hireDate"}
...     }
... })

You can also use date literals. The following code calculates each employee's tenure (in years) at the company:

> db.employees.aggregate(
... {
...     "$project" : { 
...         "tenure" : {
...             "$subtract" : [{"$year" : new Date()}, {"$year" : "$hireDate"}]
...         }
...     }
... })

4. String expressions

There are also some basic string operations that can be used, and their signatures are as follows:

  • “$substr” : [expr, startOffset, numToReturn]

The first parameter, expr, must be a string. This operation returns a substring of it: numToReturn bytes starting from the startOffset-th byte (note that these are bytes, not characters, which is especially important with multi-byte encodings).

  • “$concat” : [expr1[, expr2, …, exprN]]

Connect the given expressions (or strings) together as a return result.

  • “$toLower” : expr

The parameter expr must be a string value, which returns the lowercase form of expr.

  • “$toUpper” : expr

The parameter expr must be a string value, which returns the upper case form of expr.
The case-changing operations are only guaranteed to work correctly on Roman characters.
Here is an example that generates an email address of the form first-initial.lastname@example.com (as the code below spells out). It extracts the first character of "$firstName" and concatenates it with several constant strings and "$lastName":

> db.employees.aggregate(
... {
...     "$project" : {
...         "email" : {
...             "$concat" : [
...                 {"$substr" : ["$firstName", 0, 1]},
...                 ".",
...                 "$lastName",
...                 "@example.com"
...             ]
...         }
...     }
... })
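If the address should also be lowercase, one possible variation (an assumption, not part of the original example) wraps the concatenation in "$toLower":

> db.employees.aggregate(
... {
...     "$project" : {
...         "email" : {
...             "$toLower" : {
...                 "$concat" : [
...                     {"$substr" : ["$firstName", 0, 1]},
...                     ".",
...                     "$lastName",
...                     "@example.com"
...                 ]
...             }
...         }
...     }
... })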

5. Logical expressions

There are also some logical expressions that can be used for control flow.
First, some comparison expressions.

  • “$cmp” : [expr1, expr2]

Compare expr1 with expr2. If expr1 equals expr2, return 0; if expr1 < expr2, return a negative number; if expr1 > expr2, return a positive number.

  • “$strcasecmp” : [string1, string2]

Compares string1 and string2, ignoring case. Only works for strings made up of Roman characters.

  • “$eq”/”$ne”/”$gt”/”$gte”/”$lt”/”$lte” : [expr1, expr2]

Perform corresponding comparison operations on expr1 and expr2, and return the result of comparison (true or false).

Below are several Boolean expressions.

  • “$and” : [expr1[, expr2, …, exprN]]

Returns true if all of the expressions evaluate to true; otherwise returns false.

  • “$or” : [expr1[, expr2, …, exprN]]

As long as any expression has a value of true, it returns true, otherwise it returns false.

  • “$not” : expr

Returns the logical negation of expr.

There are also two control statements.

  • “$cond” : [booleanExpr, trueExpr, falseExpr]

If booleanExpr evaluates to true, returns trueExpr; otherwise returns falseExpr.

  • “$ifNull” : [expr, replacementExpr]

If expr is null, return replacementExpr, otherwise return expr.

With these operators you can include more complex logic in an aggregation, executing different "code paths" and producing different results depending on the shape of the data.
Pipelines are particular about the form of the data they receive, so be careful what you pass in: arithmetic operators must be given numbers, date operators must be given dates, string operators must be given strings, and these operators will report errors if a field is missing or has the wrong type. If your data set is inconsistent, you can use "$ifNull" to detect missing values and fill them in, as in the sketch below.
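For instance, a minimal sketch of filling in a missing "bonus" with 0 before doing arithmetic (reusing the employees collection from the earlier examples):

> db.employees.aggregate(
... {
...     "$project" : {
...         "totalPay" : {
...             "$add" : ["$salary", {"$ifNull" : ["$bonus", 0]}]
...         }
...     }
... })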

6. A projection example

Suppose a professor wants to compute students' grades using a somewhat complicated formula: attendance counts for 10%, the quiz average for 30%, and the final exam for 60% (unless the student is the teacher's pet, in which case the grade is 100). You can use the following code:

> db.students.aggregate(
... {
...     "$project" : {
...         "grade" : {
...             "$cond" : [
...                 "$teachersPet",
...                 100, // if
...                 {    // else
...                     "$add" : [
...                         {"$multiply" : [.1, "$attendanceAvg"]},
...                         {"$multiply" : [.3, "$quizzAvg"]},
...                         {"$multiply" : [.6, "$testAvg"]}
...                     ]
...                 }
...             ]
...         }
...     }
... }) 

$group

The $group operation groups documents according to different values of a particular field. Here are some examples of grouping.

  • If we have per-minute measurements and want to find the average humidity for each day, we can group by the "day" field.
  • If we have a set of students and want to divide them into groups by score level, we can group by the "grade" field.
  • If we have a set of users and want to know how many users live in each city, we can group by both "state" and "city", so that each city/state pair corresponds to one group. We should not group by "city" alone, because different states can have cities with the same name.

Once you have chosen the field or fields to group by, you pass them to the "$group" operator as its "_id" field. For the examples above, the corresponding code is as follows:

{"$group" : {"_id" : "$day"}}
{"$group" : {"_id" : "$grade"}}
{"$group" : {"_id" : {"state" : "$state", "city" : "$city"}}}

If these are run, each group in the result set corresponds to a document with a single field (the grouping key). For example, the result of grouping students by grade might look like: {"result" : [{"_id" : "A+"}, {"_id" : "A"}, {"_id" : "A-"}, ..., {"_id" : "F"}], "ok" : 1}. This gets you every distinct value of a particular field, but all of the earlier examples also required some computation over these groups. Therefore, you can add fields that use grouping operators to compute something from the documents in each group.

1. Grouping operators

Grouping operators allow you to compute a result for each group. The example at the beginning of this chapter showed the "$sum" grouping operator in action: for each document in a group it added 1 to the running total, yielding the number of documents in each group.

2. Arithmetic operators

There are two operators that can be used for computations on numeric fields: "$sum" and "$avg".

  • “$sum” : value

For each document in the group, adds value to the running total. Note that the example above used the literal number 1, but more complex values can be used here as well. For example, if you have a collection of sales data by country, you can use the following code to get the total revenue for each country:

> db.sales.aggregate(
... {
...     "$group" : {
...         "_id" : "$country",
...         "totalRevenue" : {"$sum" : "$revenue"}
...     }
... })
  • “$avg” : value

Returns the average value of each group.
For example, the following code returns the average income of each country and the sales of each country:

> db.sales.aggregate(
... {
...     "$group" : {
...         "_id" : "$country",
...         "totalRevenue" : {"$avg" : "$revenue"},
...         "numSales" : {"$sum" : 1}
...     }
... }) 

3. Extreme operators

The following four operators can be used to get the “edge” values in the data set.

  • "$max" : expr

Returns the maximum value seen in the group.

  • "$min" : expr

Returns the minimum value seen in the group.

  • "$first" : expr

Returns the first value in the group, ignoring all subsequent values. This is only meaningful when the order of the data is known, i.e., after sorting.

  • "$last" : expr

The opposite of "$first": returns the last value in the group.

"$max" and "$min" examine every document in order to find the extreme values. Therefore, they work well even when the data is unsorted; when the data is already sorted, however, scanning every document is wasteful. Suppose we have a data set of student test scores and need to find the highest and lowest scores:

> db.scores.aggregate(
... {
...     "$group" : {
...         "_id" : "$grade",
...         "lowestScore" : {"$min" : "$score"},
...         "highestScore" : {"$max" : "$score"}
...         }
... })

On the other hand, if the dataset is sorted by the desired field, the “$first” and “$last” operators are very useful. The following code yields the same results as the above code:

> db.scores.aggregate(
... {
...     "$sort" : {"score" : 1}
... },
... {
...     "$group" : {
...         "_id" : "$grade",
...         "lowestScore" : {"$first" : "$score"},
...         "highestScore" : {"$last" : "$score"}
...     }
... })

If the data is sorted, then $first and $last are more efficient than $min and $max. If you are not ready to sort the data, then using $min and $max directly is more efficient than sorting first and then using $first and $last.

4. Array operators

There are two operators that can perform array operations.

  • “$addToSet” : expr

Adds expr to the array if it is not already present. In the resulting array, each element appears at most once, and the order of the elements is not guaranteed.

  • “$push” : expr

Adds expr to the array no matter what. Returns an array containing all of the values.
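A rough sketch of both operators, reusing the sales collection from above (the "product" field is hypothetical): it collects the distinct products and every individual revenue figure for each country.

> db.sales.aggregate(
... {
...     "$group" : {
...         "_id" : "$country",
...         "products" : {"$addToSet" : "$product"},
...         "revenues" : {"$push" : "$revenue"}
...     }
... })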

5. Grouping Behavior

"$group" is one of two operators that cannot work in the streaming fashion described earlier. Most operators process documents as they stream in: as soon as a new document arrives, it can be handled. "$group", however, must wait until it has received all of the documents before it can group them and send each group on to the next operator in the pipeline. This means that, in a sharded setup, "$group" is first run on each shard, and then the per-shard groups are sent to mongos for the final grouping; the remaining pipeline stages are also run on mongos (not on the shards).

 $unwind

"$unwind" splits each element of an array into a separate output document. For example, if we have a blog collection in which each post document contains an array of comments, we can use "$unwind" to turn each comment into its own document:

> db.blog.findOne()
{
    "_id" : ObjectId("50eeffc4c82a5271290530be"),
    "author" : "k",
    "post" : "Hello, world!",
    "comments" : [
        {
            "author" : "mark",
            "date" : ISODate("2013-01-10T17:52:04.148Z"),
            "text" : "Nice post"
        },
        {
            "author" : "bill",
            "date" : ISODate("2013-01-10T17:52:04.148Z"),
            "text" : "I agree"
        }
    ]
}
> db.blog.aggregate({"$unwind" : "$comments"})
{
    "results" :
        {
            "_id" : ObjectId("50eeffc4c82a5271290530be"),
            "author" : "k",
            "post" : "Hello, world!",
            "comments" : {
                "author" : "mark",
                "date" : ISODate("2013-01-10T17:52:04.148Z"),
                "text" : "Nice post"
            }
        },
        {
            "_id" : ObjectId("50eeffc4c82a5271290530be"),
            "author" : "k",
            "post" : "Hello, world!",
            "comments" : {
                "author" : "bill",
                "date" : ISODate("2013-01-10T17:52:04.148Z"),
                "text" : "I agree"
            }
        }
    ],
    "ok" : 1
}

This operator is very useful if you want to return particular subdocuments from a query: first "$unwind" the subdocuments, then "$match" only the ones you want. For example, there is no way with a normal query to return all of the comments written by a certain user (just the comments, not the posts they belong to). With project, unwind, and match, however, it becomes easy:

> db.blog.aggregate({"$project" : {"comments" : "$comments"}},
... {"$unwind" : "$comments"},
... {"$match" : {"comments.author" : "Mark"}})

Since the final result is still a “comments” sub-document, you may want to do another projection to make the output more elegant.
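A possible final projection (the output field name "comment" is just an illustrative choice, not from the original):

> db.blog.aggregate({"$project" : {"comments" : "$comments"}},
... {"$unwind" : "$comments"},
... {"$match" : {"comments.author" : "Mark"}},
... {"$project" : {"_id" : 0, "comment" : "$comments"}})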

$sort

"$sort" can sort on any field or fields, using the same syntax as in normal queries. If you are sorting a large number of documents, it is strongly recommended that the sort be done in the first stage of the pipeline, where it can use an index. Otherwise, the sort will be slow and use a great deal of memory.
You can sort on fields that actually exist in the documents, as well as on fields renamed during projection:

> db.employees.aggregate(
... {
...     "$project" : {
...         "compensation" : {
...             "$add" : ["$salary", "$bonus"]
...         },
...         "name" : 1
...     }
... },
... {
...     "$sort" : {"compensation" : -1, "name" : 1}
... })

This example ranks employees so that the final result is ordered by compensation from highest to lowest, and then by name from A to Z.
The sort direction can be 1 (ascending) or -1 (descending).
Like "$group" above, "$sort" is an operator that cannot work in a streaming fashion: it, too, must receive all of the documents before it can sort them. In a sharded environment, the sort is first performed on each shard, and then the per-shard results are sent to mongos for further processing.

 $limit

"$limit" accepts a number, N, and returns the first N documents from the result set.

 $skip

"$skip" also accepts a number, N; it discards the first N documents from the result set and returns the remaining documents. In "normal" queries, skipping a large number of documents is inefficient, and the same is true in aggregation, because all of the skipped documents must first be matched before being thrown away.
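As a small sketch reusing the author counts from the beginning of this chapter, "$skip" and "$limit" can be combined for paging, here returning authors 6 through 10:

> db.articles.aggregate({"$group" : {"_id" : "$author", "count" : {"$sum" : 1}}},
... {"$sort" : {"count" : -1}},
... {"$skip" : 5},
... {"$limit" : 5})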

Using pipelines

You should try to filter out as many documents (and fields) as possible at the start of the pipeline, before any "$project", "$group", or "$unwind" stage. Once the pipeline is no longer working directly with data from the original collection, filtering and sorting can no longer use indexes. Where possible, the aggregation pipeline will reorder stages for you so that indexes can be used effectively.
MongoDB does not allow a single aggregation operation to use an excessive amount of system memory: if it detects that an aggregation is using more than 20% of memory, the operation simply errors out. Allowing output to be piped into a collection (which would minimize the amount of memory required) is planned for the future.
If you can quickly cut down the size of the result set with an initial "$match", the pipeline can be used for real-time aggregation. The longer and more complex a pipeline becomes, the less likely it is that you will be able to get real-time results from it.

Last article: MongoDB Guide – 15, Special Index and Collection: Geospatial Index, Using GridFS to Store Files
Next article: MongoDB Guide – 17, MapReduce