MongoDB Replication in Simple Terms



This article was first published by the author on InfoQ and in the MongoDB Chinese community as "MongoDB Replication in Simple Terms".

Since starting this blog, I have been moving my better articles over to make them easier to browse.


Recently I have run into quite a few replication-related problems in production. Searching online, I found that the official documentation, while systematic, does not go very deep, and the in-depth articles that do exist jump straight into source code, which makes them hard to follow. This article combines the strengths of both approaches to present the whole architecture of MongoDB replication in simple terms. It is divided into five parts:

  • Introduction to MongoDB replication
  • Adding a slave library to MongoDB
  • The MongoDB replication process in detail
  • MongoDB high availability
  • MongoDB replication summary

1. Introduction to MongoDB replication

This chapter briefly introduces some basic concepts of MongoDB replication in preparation for the content that follows.

1.1 basic introduction

MongoDB supports two replication modes: replica set and master-slave. This article covers the replica set mode, since master-slave replication is fully deprecated as of MongoDB 3.6. A MongoDB replica set has three roles: primary, secondary, and arbiter. Here I will introduce the internal principles of data synchronization between primary and secondary. The replica set architecture looks like this:

(Figure: MongoDB replica set architecture)

1.2 MongoDB Oplog

The oplog is the medium of MongoDB replication, both while replication is being established and afterwards: every write operation on the primary is recorded in the oplog, and each secondary pulls the oplog from its sync source and applies it to its own data. The oplog is a collection in MongoDB's local database. It is a capped collection, which means it has a fixed size and is recycled: once full, the oldest entries are overwritten. As shown below:

(Figure: the oplog, a capped collection in the local database)
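Since the oplog is a capped collection, it behaves like a ring buffer: once the size limit is reached, the oldest entries are overwritten. A minimal Python sketch of that recycling behavior (a toy model, not MongoDB's actual implementation; it caps by entry count rather than by bytes):

```python
from collections import deque

class CappedCollection:
    """Toy model of a capped collection: fixed size, oldest entries recycled."""
    def __init__(self, max_entries):
        # A deque with maxlen silently drops the oldest item when full
        self.entries = deque(maxlen=max_entries)

    def insert(self, doc):
        self.entries.append(doc)

    def all(self):
        return list(self.entries)

oplog = CappedCollection(max_entries=3)
for i in range(5):
    oplog.insert({"ts": i, "op": "i"})

# Only the 3 newest entries survive; ts 0 and 1 have been recycled.
print(oplog.all())
```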

The contents and fields of a MongoDB oplog entry:

    "ts" : Timestamp(1446011584, 2),
    "h" : NumberLong("1687359108795812092"),
    "v" : 2,
    "op" : "i",
    "ns" : "test.nosql",
    "o" : { "_id" : ObjectId("563062c0b085733f34ab4129"), "name" : "mongodb", "score" : "100" }

    ts: operation time; the current timestamp plus a counter that is reset every second
    h:  globally unique identifier of the operation
    v:  oplog version
    op: operation type
        i: insert
        u: update
        d: delete
        c: command (such as createDatabase, dropDatabase)
        n: no-op, used for special purposes
    ns: the namespace (database.collection) the operation targets
    o:  the content of the operation (for an update, the modification to apply)
    o2: the query criteria of the operation; only updates contain this field
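To make the fields concrete, here is a toy Python applier (an illustration, not MongoDB's code) that replays i/u/d oplog entries against an in-memory collection keyed by _id. Note how an update uses o2 to locate the document and o for the modification:

```python
def apply_oplog_entry(collection, entry):
    """Apply a single oplog entry to a dict keyed by _id (toy model)."""
    op = entry["op"]
    if op == "i":            # insert: o is the full document
        doc = entry["o"]
        collection[doc["_id"]] = doc
    elif op == "u":          # update: o2 locates the document, o carries the change
        target = collection[entry["o2"]["_id"]]
        target.update(entry["o"].get("$set", entry["o"]))
    elif op == "d":          # delete: o holds the _id of the document to remove
        collection.pop(entry["o"]["_id"], None)
    elif op == "n":          # no-op
        pass
    return collection

coll = {}
apply_oplog_entry(coll, {"op": "i", "ns": "test.nosql",
                         "o": {"_id": 1, "name": "mongodb", "score": "100"}})
apply_oplog_entry(coll, {"op": "u", "ns": "test.nosql",
                         "o2": {"_id": 1}, "o": {"$set": {"score": "99"}}})
print(coll)   # {1: {'_id': 1, 'name': 'mongodb', 'score': '99'}}
```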

1.3 The evolution of MongoDB replication

MongoDB has iterated through many versions. In the figure below, I summarize the important replication improvements in the versions commonly used in production.

(Figure: replication improvements by MongoDB version)

For details, please refer to the official MongoDB release notes: …

2. Adding a slave library to MongoDB

2.1 The add-slave-library command

Adding a slave library to MongoDB is easy. After installing it, simply run rs.add() or the replSetReconfig command on the primary. In fact, both end up calling replSetReconfig; if you are interested, you can read the JavaScript code of the MongoDB client.
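Conceptually, rs.add() is just sugar for replSetReconfig: it appends a member to the current configuration document and bumps the config version. A hedged Python sketch of that bookkeeping (field names follow the replica set config document; the function itself is illustrative, not the shell's code):

```python
def rs_add(config, host):
    """Simulate what rs.add() prepares before calling replSetReconfig:
    append a new member and increment the config version."""
    new_id = max(m["_id"] for m in config["members"]) + 1
    return {
        "_id": config["_id"],
        "version": config["version"] + 1,   # a reconfig must carry a higher version
        "members": config["members"] + [{"_id": new_id, "host": host}],
    }

cfg = {"_id": "shard1", "version": 1,
       "members": [{"_id": 0, "host": "node1:27017"},
                   {"_id": 1, "host": "node2:27017"}]}
cfg2 = rs_add(cfg, "node3:27017")
print(cfg2["version"], [m["host"] for m in cfg2["members"]])
```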

2.2 specific steps

Now let's look at the general steps for adding a new slave library to a replica set, as shown in the figure below; the secondary on the right is the newly added slave.

(Figure: steps for adding a new secondary to a replica set)

From the figure above we can see the steps involved. Let's see what MongoDB does in each:

1. The primary receives the add-slave-library command.
2. The primary updates the replica set configuration and establishes a heartbeat with the new slave.
3. The new slave receives the heartbeat message from the primary and establishes a heartbeat with it in return.
4. The other slaves receive the new replica set configuration from the primary and update their own configuration.
5. The other slaves establish heartbeats with the new slave.
6. The new slave receives the heartbeats from the other slaves and establishes heartbeats with them in return.
7. The new node writes the replica set configuration to the local.system.replset collection. MongoDB polls local.system.replset in a loop for replica set configuration; once it is found, the replication thread is started, which then decides whether a full sync is required. If so, it performs a full sync; otherwise, incremental replication.
8. Synchronization is established.

Note:

All nodes in the replica set heartbeat each other every 2 seconds. Since MongoDB 3.2, the heartbeat frequency can be controlled with the heartbeatIntervalMillis parameter.

The above process can be cross-checked against the replica set node states (the rs.status() command):

  • STARTUP // on startup, mongod loads the replica set configuration, then changes the state to STARTUP2
  • STARTUP2 // after loading the configuration, the node decides whether an initial sync is needed; if so it stays in STARTUP2, otherwise it enters RECOVERING
  • RECOVERING // the node is not yet available for reads; this mainly covers catching up on incremental data after the initial sync.
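The state progression above can be sketched as a tiny state machine (the state names mirror rs.status(); the transition logic is a deliberate simplification):

```python
def next_state(state, needs_initial_sync, initial_sync_done=False):
    """Simplified replica set member state transitions on startup."""
    if state == "STARTUP":       # config loaded -> STARTUP2
        return "STARTUP2"
    if state == "STARTUP2":      # decide whether an initial sync is needed
        if needs_initial_sync and not initial_sync_done:
            return "STARTUP2"    # stay here while initial sync runs
        return "RECOVERING"      # move on to catching up on incremental data
    if state == "RECOVERING":    # caught up -> ready to serve as a secondary
        return "SECONDARY"
    return state

s = "STARTUP"
s = next_state(s, needs_initial_sync=True)                          # -> STARTUP2
s = next_state(s, needs_initial_sync=True, initial_sync_done=True)  # -> RECOVERING
s = next_state(s, needs_initial_sync=True)                          # -> SECONDARY
print(s)
```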

3. The MongoDB replication process in detail

Above we saw the general process of adding a slave library. Now let's look at the specifics of master-slave data synchronization. When a slave is added to the replica set, it first decides whether it needs an initial sync (full synchronization) or incremental synchronization. What are the criteria?

3.1 Deciding between full and incremental synchronization

  • If the oplog collection in the local database is empty, perform a full sync.
  • If the minValid collection stores the _initialSyncFlag, perform a full sync (used to handle initial sync failures).
  • If initialSyncRequested is true, perform a full sync (used by the resync command, which applies only to the master/slave architecture and cannot be used with replica sets).

If any one of the three conditions above holds, a full sync is required.

We can conclude that a slave library newly added to the replica set must start with an initial sync. Let's look at the specific initial sync process.
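The three conditions can be captured in a small predicate (a sketch; the flag names follow the article):

```python
def needs_initial_sync(oplog_empty, initial_sync_flag_set, initial_sync_requested):
    """Full (initial) sync is required if any of the three conditions holds."""
    return (oplog_empty                  # local oplog is empty: brand-new node
            or initial_sync_flag_set     # _initialSyncFlag left in minValid: a prior sync crashed
            or initial_sync_requested)   # resync requested (master/slave architecture only)

# A freshly added node has an empty oplog, so it must run initial sync:
print(needs_initial_sync(True, False, False))   # True
print(needs_initial_sync(False, False, False))  # False -> incremental sync
```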

3.2 Full synchronization process (initial sync)

3.2.1 Finding a sync source

First of all, MongoDB uses chained (cascading) replication by default, which means the primary is not necessarily chosen as a node's sync source. If you do not want chained replication, you can disable it with the chainingAllowed parameter. With chaining enabled, you can also specify the sync source you want with the replSetSyncFrom command. The sync source is, in effect, the slave's own "master". So how is the sync source selected?

A MongoDB slave filters the other nodes of the replica set for a sync source that meets its requirements using the following criteria:

  • If chainingAllowed is false, only the primary can be selected as the sync source.
  • Find the node with the smallest ping time whose data is newer than its own (when the replica set is initialized, or when a new node joins, the new node pings every other node at least twice).
  • Compare the candidate against the primary's latest optime; if the candidate lags the primary by more than 30s, it is not selected.
  • In the first filtering pass, nodes whose data is older than the node's own are eliminated; if no node survives the first pass, those nodes are included in a second pass, so that the node does not end up with no sync source at all.
  • Finally, check whether the candidate is barred from participating in elections; if so, skip it.

After this filtering, the node that remains is used as the new sync source.

In fact, apart from being chosen at initial sync and during incremental replication, the sync source is not fixed forever. It may change in the following cases:

  • Pinging its own sync source fails
  • Its sync source's role has changed
  • The delay between its sync source and some node of the replica set exceeds 30s

3.2.2 Drop all databases except local

3.2.3 Pull the existing data from the sync source

Here we come to the core logic of initial sync. I'll show the specific process with a diagram and the corresponding steps.

(Figure: initial sync process)

Note: this figure applies to versions before MongoDB 3.4.

The synchronization process is as follows:

    0. Add _initialSyncFlag to minValid collection to tell us to restart initial sync if we crash in the middle of this procedure.
    1. Record start time.
    2. Clone.
    3. Set minValid1 to sync target's latest op time.
    4. Apply ops from start to minValid1, fetching missing docs as needed. (Apply Oplog 1)
    5. Set minValid2 to sync target's latest op time.
    6. Apply ops from minValid1 to minValid2. (Apply Oplog 2)
    7. Build indexes.
    8. Set minValid3 to sync target's latest op time.
    9. Apply ops from minValid2 to minValid3. (Apply Oplog 3)
    10. Cleanup minValid collection: remove _initialSyncFlag field, set ts to minValid3 OpTime.

Note: the steps above are taken directly from the comments in the MongoDB source code.
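The clone-then-catch-up loop of steps 2–9 can be simulated: each pass records the source's latest optime, applies ops up to it, and any new writes that arrived in the meantime are picked up by the next pass (a toy model of the source-comment steps above, with optimes modeled as plain integers):

```python
def initial_sync(source_oplog, writes_during_sync):
    """Toy model of initial sync: apply the oplog in successive passes, each up
    to the sync target's latest optime recorded at the start of the pass."""
    applied_up_to = 0
    passes = 0
    while True:
        min_valid = len(source_oplog)       # the source's latest optime for this pass
        applied_up_to = min_valid           # apply ops from applied_up_to to min_valid
        passes += 1
        # new writes may have landed on the source while this pass ran
        if writes_during_sync:
            source_oplog.extend(writes_during_sync.pop(0))
        if applied_up_to == len(source_oplog):   # caught up: no new ops arrived
            return passes, applied_up_to

oplog = list(range(10))              # 10 existing ops on the sync source
incoming = [[10, 11], [12]]          # writes arriving during the first two passes
passes, applied = initial_sync(oplog, incoming)
print(passes, applied)               # 3 13: three passes were needed to catch up
```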

MongoDB 3.4 improves these initial sync steps as follows:

  • Indexes are created together with the collections (matching the sync source). Before MongoDB 3.4, only the _id index was created up front; the other indexes were built after the data copy completed.
  • While creating collections and copying data, the oplog is copied to the local database in parallel; once the data copy completes, the local oplog is applied.
  • A retry mechanism was added for initial syncs that fail due to network problems.
  • If a collection is found to have been renamed during initial sync, the initial sync is restarted.

These four improvements make initial sync more efficient and reliable, so you should use MongoDB 3.4 or 3.6 where possible. MongoDB 3.6 also brings some exciting features that are beyond the scope of this article.
Once the full sync completes, MongoDB enters the incremental synchronization process.

3.3 Incremental synchronization process

Initial sync, introduced above, copies the sync source's existing data. How is the data subsequently written to the primary synchronized? Again, with a figure and concrete steps:

(Figure: incremental synchronization process)

Note: the sync source is not necessarily the primary; as mentioned earlier, it may also be a secondary. The primary is used in the figure for ease of understanding.

There are six steps in the figure above. Here is what each step does:

1. After the secondary finishes initial sync, it starts incremental replication: its producer thread opens a cursor on the sync source's oplog and continuously requests data.
2. The primary (sync source) returns oplog data to the secondary.
3. The secondary reads the oplog entries sent by the primary and writes them to a queue.
4. The secondary's sync thread keeps consuming the queue through the tryPopAndWaitForMore method; each batch completes when one of the following conditions is met:

  • the accumulated data exceeds 100MB, or
  • some data (less than 100MB) has been fetched but the queue is currently empty; in that case the thread blocks and waits for one second, and if no more data arrives the batch is considered complete.

Once one of the two conditions is met, the batch is handed to the prefetchOps method, which partitions the data by database so that multiple threads can later write it in parallel; with the WiredTiger engine, the partitioning is by document id.

5. Finally, the partitioned data is written to the database in batches by multiple threads (the secondary blocks all reads while applying a batch).
6. The oplog entries in the queue are then written to the secondary's own oplog collection.
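The batching rule of step 4 and the partitioning of prefetchOps can be sketched as follows (a simplification: sizes are plain byte counts, the one-second blocking wait is omitted, and the function names are only modeled on the article):

```python
from collections import defaultdict

MAX_BATCH_BYTES = 100 * 1024 * 1024   # batch cutoff: 100MB

def pop_batch(queue):
    """Pop oplog entries until the batch reaches 100MB or the queue runs dry
    (the real tryPopAndWaitForMore also blocks up to 1s waiting for more data)."""
    batch, size = [], 0
    while queue and size < MAX_BATCH_BYTES:
        entry = queue.pop(0)
        batch.append(entry)
        size += entry["bytes"]
    return batch

def prefetch_ops(batch):
    """Partition a batch by database so each database can be applied by its own thread."""
    buckets = defaultdict(list)
    for entry in batch:
        db = entry["ns"].split(".", 1)[0]   # namespace is "database.collection"
        buckets[db].append(entry)
    return dict(buckets)

queue = [{"ns": "test.a", "bytes": 10}, {"ns": "test.b", "bytes": 10},
         {"ns": "admin.c", "bytes": 10}]
buckets = prefetch_ops(pop_batch(queue))
print(sorted(buckets))   # ['admin', 'test']
```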

4. MongoDB high availability

Above we covered data synchronization in MongoDB replication. Besides synchronization, replication has another important job: high availability. With most databases, you have to build your own scheme or adopt a third-party open source one, but MongoDB implements high availability internally. Let me introduce it in detail.

4.1 Scenarios that trigger a failover

First, let's look at the situations that trigger a MongoDB master-slave switch:

1. A new replica set is initialized.
2. A slave cannot reach the primary (for more than 10s by default, controlled by the heartbeatTimeoutSecs parameter); the slave then initiates an election.
3. The primary voluntarily gives up the primary role:

  • rs.stepDown() is executed
  • the primary cannot communicate with a majority of the nodes
  • the replica set configuration is modified (observed in MongoDB 2.6; other versions to be determined)

Specifically, when one of the following configuration options is modified:

  • _id
  • votes
  • priority
  • arbiterOnly
  • slaveDelay
  • hidden
  • buildIndexes

4. A slave library is removed (triggers in MongoDB 2.6 but not in MongoDB 3.4; other versions to be determined).

4.2 heartbeat mechanism

From the trigger scenarios above, we can see that heartbeat information is an important input for MongoDB when judging whether other nodes are alive; when certain conditions are met, the primary or a slave triggers a switch. Let me introduce the heartbeat mechanism in detail.

As we know, all nodes in a MongoDB replica set heartbeat each other, once every 2 seconds by default, controllable via heartbeatIntervalMillis. When a new node is added, every node in the replica set needs to establish a heartbeat with it. So what does the heartbeat message contain?

Heartbeat information content:

    BSONObjBuilder cmdBuilder;
    cmdBuilder.append("replSetHeartbeat", setName);
    cmdBuilder.append("v", myCfgVersion);
    cmdBuilder.append("pv", 1);
    cmdBuilder.append("checkEmpty", checkEmpty);
    cmdBuilder.append("from", from);
    if (me > -1) {
        cmdBuilder.append("fromId", me);
    }

Note: the above is a fragment of the heartbeat-message construction from the MongoDB source code.

The details are shown in the mongodb log as follows:

    command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "shard1", v: 21, pv: 1, checkEmpty: false, from: "", fromId: 3 } ntoreturn:1 keyUpdates:0

By default, every node in the replica set sends the above message to the other nodes every 2 seconds. On receipt, each node handles the heartbeat with the replSetHeartbeat command handler and, after processing, returns the following information:

    result.append("set", theReplSet->name());
    MemberState currentState = theReplSet->state();
    result.append("state", currentState.s);          // current node state
    if (currentState == MemberState::RS_PRIMARY) {
        result.appendDate("electionTime", theReplSet->getElectionTime().asDate());
    }
    result.append("e", theReplSet->iAmElectable());  // whether this node may stand for election
    result.append("hbmsg", theReplSet->hbmsg());
    result.append("time", (long long) time(0));
    result.appendDate("opTime", theReplSet->lastOpTimeWritten.asDate());

    const Member *syncTarget = replset::BackgroundSync::get()->getSyncTarget();
    if (syncTarget) {
        result.append("syncingTo", syncTarget->fullName());
    }

    int v = theReplSet->config().version;
    result.append("v", v);
    if (v > cmdObj["v"].Int())
        result << "config" << theReplSet->config().asBson();

Note: the above information is returned under normal conditions, and there are some abnormal processing scenarios, which will not be described in detail here.

4.3 Failover process

Earlier we learned about the trigger scenarios and the heartbeat mechanism between replica set nodes. Now let's look at the specific switching process:

1. A slave cannot connect to the primary, or the primary gives up the primary role.
2. From the heartbeat messages, the slave obtains each node's current role and compares it with the previous one.
3. If a role has changed, it starts executing the msgCheckNewState method.
4. msgCheckNewState eventually calls the electSelf method (subject to some checks that decide whether electSelf is actually called).
5. electSelf sends the replSetElect command to the other nodes in the replica set to request votes.
The command is as follows:

BSONObj electCmd = BSON(
                       "replSetElect" << 1 <<
                       "set" << rs.name() <<
                       "who" << me.fullName() <<
                       "whoid" << me.hbinfo().id() <<
                       "cfgver" << rs._cfg->version <<
                       "round" << OID::gen() /* this is just for diagnostics */
                   );

The specific logs are as follows:

2017-12-14T10:13:26.917+0800 [conn27669] run command admin.$cmd { replSetElect: 1, set: "shard1", who: "", whoid: 4, cfgver: 27, round: ObjectId('5a31de4601fbde95ae38b4d2') }

6. When the other nodes receive replSetElect, they compare the cfgver, confirm that the sender is a member of the replica set, and confirm that its priority is the highest of all nodes; only when these conditions hold do they send the node their vote.
7. If the initiating node collects more than half of the votes the replica set can cast, it wins the election and becomes the new primary.
8. If the other slaves notice that their sync source's role has changed, they re-select a sync source.
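The vote-granting checks of step 6 and the majority rule of step 7 boil down to a few comparisons. A sketch (illustrative; the real checks live in the server's election code):

```python
def grant_vote(replica_set, candidate, my_cfgver):
    """A node grants its vote only if the config versions match, the candidate
    is a member of the set, and no member has a higher priority (per step 6)."""
    if candidate["cfgver"] != my_cfgver:
        return False
    members = {m["host"]: m for m in replica_set}
    if candidate["host"] not in members:
        return False
    highest = max(m["priority"] for m in replica_set)
    return members[candidate["host"]]["priority"] >= highest

def wins_election(votes_received, total_votes):
    """A candidate becomes primary only with a strict majority of the votes."""
    return votes_received > total_votes // 2

rs = [{"host": "node1", "priority": 1}, {"host": "node2", "priority": 1},
      {"host": "node3", "priority": 0.5}]
ok = grant_vote(rs, {"host": "node2", "cfgver": 27}, my_cfgver=27)
print(ok, wins_election(2, 3))   # True True: 2 of 3 votes is a majority
```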


As we know, data can be lost during a switch: the primary goes down before newly written data has been synchronized to the slaves.

For this case, MongoDB adds a rollback mechanism. After the old primary recovers and rejoins the replica set, it compares its oplog with its sync source's, which leads to one of two situations:
1. The sync source has no oplog entry newer than the old primary's.
2. The latest oplog entry of the sync source differs from the old primary's in optime and oplog hash content.

In both cases MongoDB rolls back: it compares the oplogs in reverse until it finds an entry common to the old primary and the sync source, then records all oplog entries after that point to files in the rollback directory. The rollback is abandoned if any of the following occurs:

  • The optime gap between the old primary and the sync source exceeds 30 minutes.
  • A single oplog entry to roll back exceeds 512MB.
  • A dropDatabase operation is encountered.
  • The total rollback record generated exceeds 300MB.
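The reverse walk to the common point can be sketched: compare oplog tails until an entry with the same (ts, h) pair is found; everything after it on the old primary is what gets written to the rollback directory, and the rollback is abandoned if the optime gap exceeds 30 minutes (a toy model with integer timestamps):

```python
def find_rollback_ops(old_primary_oplog, sync_source_oplog, max_gap_secs=30 * 60):
    """Walk the old primary's oplog backwards to the last entry shared with the
    sync source (same ts and h); return the ops after that point, or None if
    the optime gap exceeds 30 minutes and the rollback must be abandoned."""
    common = {(e["ts"], e["h"]) for e in sync_source_oplog}
    for i in range(len(old_primary_oplog) - 1, -1, -1):
        entry = old_primary_oplog[i]
        if (entry["ts"], entry["h"]) in common:
            gap = old_primary_oplog[-1]["ts"] - entry["ts"]
            if gap > max_gap_secs:
                return None                       # abandon the rollback
            return old_primary_oplog[i + 1:]      # ops to write to the rollback dir
    return None

old = [{"ts": 100, "h": 1}, {"ts": 110, "h": 2}, {"ts": 120, "h": 3}]
src = [{"ts": 100, "h": 1}, {"ts": 110, "h": 2}, {"ts": 115, "h": 9}]
print(find_rollback_ops(old, src))   # [{'ts': 120, 'h': 3}]
```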

Now that we know how MongoDB rollback works, how do we avoid rollbacks in production? After all, rollback is troublesome, and for order-sensitive business logic it is unacceptable. MongoDB provides a corresponding mechanism: writeConcern. I won't elaborate on it here; interested readers can dig deeper. In essence, this is a CAP trade-off.
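The writeConcern idea can be illustrated in miniature: with w: "majority", the primary acknowledges a write only after a majority of members have replicated it, which is exactly what prevents the rollback scenario above. A toy check (a simulation, not the driver API):

```python
def write_acknowledged(replicated_to, members, w):
    """Decide whether a write satisfies the given write concern.
    w may be an integer or the string 'majority' (toy model)."""
    if w == "majority":
        needed = len(members) // 2 + 1
    else:
        needed = w
    return len(replicated_to) >= needed

members = ["node1", "node2", "node3"]
# Replicated only to the primary: not yet safe against a failover.
print(write_acknowledged({"node1"}, members, "majority"))           # False
print(write_acknowledged({"node1", "node2"}, members, "majority"))  # True
```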

5. MongoDB replication summary

This concludes the introduction to the internals of MongoDB replication. Many details are involved beyond what is listed here; if you are interested, you can work through them yourself. Note also that MongoDB iterates quickly, so this article only targets MongoDB 2.6 through 3.4; some versions may differ in the details, but the general logic remains the same. Finally, feel free to contact me with any questions.