The Principle of MongoDB Replica Set Synchronization

Time: 2019-12-16

MongoDB's synchronization mechanism is only briefly covered in the official documentation, and there is not much material about it online. What follows is pieced together from the official documentation, online material, and test logs.
Since each MongoDB shard is itself a replica set, it is enough to understand how a replica set synchronizes.

I. Initial Sync

Broadly speaking, MongoDB replica set synchronization consists of two steps:

1. Initial sync, i.e. full synchronization
2. Replication, i.e. oplog synchronization

First, the full data set is copied via initial sync; afterwards, replication continuously replays the oplog from the primary to apply incremental data. Once full synchronization completes, the member's state changes from STARTUP2 to SECONDARY.
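This state transition can be watched from the mongo shell; a minimal sketch, run against any member of the set:

// Print each member's name and current state (e.g. STARTUP2, SECONDARY)
rs.status().members.forEach(function (m) {
    print(m.name + " -> " + m.stateStr);
});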

1.1 The initial sync process

1) Full synchronization starts; record the latest timestamp T1 on the sync source
2) Copy all collection data and build indexes (time consuming)
3) Record the latest timestamp T2 on the sync source
4) Replay all oplog entries between T1 and T2
5) Full synchronization ends

In short, the node traverses every collection in every database on the primary, copies the data to itself, and then reads and replays the oplog covering the period from the start to the end of the full synchronization.
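T1 and T2 are simply the newest oplog timestamps on the source at those two moments. As an illustration (this is not the internal implementation), the latest oplog timestamp can be read from the mongo shell like this:

// The newest entry in the oplog; its ts field is the T1/T2-style timestamp
var last = db.getSiblingDB("local").oplog.rs.find().sort({$natural: -1}).limit(1).next();
print(last.ts);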

After initial sync completes, the secondary establishes a tailable cursor on the primary's local.oplog.rs collection, continuously fetches newly written oplog entries from the primary, and applies them to itself.
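Conceptually this is just a tailable cursor over a capped collection. A rough sketch in the legacy mongo shell, assuming lastApplied holds the ts of the last entry already applied (the real fetcher is internal server code):

// Tail local.oplog.rs for entries newer than the last one applied
var cur = db.getSiblingDB("local").oplog.rs.find({ts: {$gt: lastApplied}})
            .addOption(DBQuery.Option.tailable | DBQuery.Option.awaitData);
while (cur.hasNext()) {
    printjson(cur.next());   // a real secondary would apply the entry instead
}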

1.2 Scenarios that trigger initial sync

A secondary node must first perform a full synchronization when any of the following conditions holds:

1) Its oplog is empty
2) The initialSyncFlag field in the local.replset.minvalid collection is set to true (used to handle init sync failures)
3) The in-memory flag initialSyncRequested is set to true (used by the resync command; resync applies only to the master/slave architecture and cannot be used on a replica set)

These three scenarios correspond, respectively, to the following (for scenarios 2 and 3, see Zhang Youdong's blog):

1) A newly added node has no oplog at all, so it must first perform an initial sync.
2) At the start of initial sync, the initialSyncFlag field is set to true, and it is set back to false on normal completion. If a node restarts and finds initialSyncFlag still true, the previous full synchronization must have failed partway through, so initial sync has to be restarted.
3) When a user issues the resync command, initialSyncRequested is set to true, forcing initial sync to start over.
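For scenario 2, the flag can be inspected directly from the mongo shell:

// initialSyncFlag: true here means the last initial sync did not finish cleanly
db.getSiblingDB("local").replset.minvalid.findOne();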

1.3 Common questions

1.3.1 During full synchronization, can the oplog on the sync source be rolled over (overwritten), causing the full sync to fail?

Not in 3.4 and later versions.
The following figure summarizes the full synchronization improvements in 3.4 (from Zhang Youdong's blog):

[Figure: improvements to initial sync in MongoDB 3.4]

The official documentation states:

Initial sync builds all collection indexes as the documents are copied for each collection. In earlier versions of MongoDB (before 3.4), only the _id index was built at this stage.
While initial sync copies the data, newly arriving oplog records are also fetched and saved locally (new in 3.4).
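How much headroom there is before the source's oplog rolls over depends on the oplog window, which can be checked on the source with:

// Prints the configured oplog size and the time span ("log length start to end")
// that the current oplog covers
rs.printReplicationInfo();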

II. Replication

2.1 The oplog sync process

Once full synchronization completes, the secondary establishes a tailable cursor starting from the end timestamp, continuously pulls oplog entries from the sync source, and replays them on itself. This is not done by a single thread: to improve sync efficiency, MongoDB splits pulling the oplog and replaying it across different threads.
The specific threads and their functions are as follows (not covered by the official documentation at the moment; taken from Zhang Youdong's blog):

  • Producer thread: continuously pulls oplog entries from the sync source and stores them in a BlockQueue, which holds at most 240MB. Once that threshold is exceeded, pulling pauses until ReplBatcher consumes entries.
  • ReplBatcher thread: takes oplog entries out of the producer's queue one by one and puts them into its own queue. That queue allows at most 5,000 elements with a total size of at most 512MB; when it is full, it waits for oplogApplication to consume entries.
  • oplogApplication takes out all the elements currently in ReplBatcher's queue and distributes them across ReplWriter threads by document _id (or by collection name if the storage engine does not support document-level locking). The ReplWriter threads apply the entries to the node; once all of them have been applied, the oplogApplication thread writes all the entries, in order, to the local.oplog.rs collection.

To make the description above easier to follow, here is a diagram:

[Figure: Producer / ReplBatcher / oplogApplication / ReplWriter pipeline]

Statistics for the producer's buffer and the apply threads can be queried through db.serverStatus().metrics.repl.
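For example (field names as reported by serverStatus; they vary slightly across versions):

// Producer buffer and apply-thread statistics
var repl = db.serverStatus().metrics.repl;
printjson(repl.buffer);   // { count: ..., sizeBytes: ..., maxSizeBytes: ... }
printjson(repl.apply);    // batches and ops applied so far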

2.2 Common questions about the process

2.2.1 Why does oplog replay use so many threads?

As with MySQL replication, each thread does one job: a single thread pulls the oplog, while other threads replay it, and multiple replay threads speed up the apply phase.

2.2.2 Why is the ReplBatcher thread needed as an intermediary?

Oplog replay must preserve ordering. When DDL commands such as create and drop are encountered, they cannot be applied in parallel with ordinary insert, update, delete, and query operations. Enforcing these batching rules is the job of ReplBatcher.
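This distinction is visible in the oplog itself: DDL operations are logged with op: "c" (command), while ordinary writes use op: "i"/"u"/"d", which is why the two kinds cannot share a batch. For example:

// The most recent command-type (DDL) entry in the oplog, e.g. a create or drop
db.getSiblingDB("local").oplog.rs.find({op: "c"}).sort({$natural: -1}).limit(1).pretty();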

2.2.3 What can be done when oplog replay on a secondary cannot keep up with the primary?

Method 1: configure a larger number of replay threads

  • Specify on the mongod command line: mongod --setParameter replWriterThreadCount=32
  • Or specify in the configuration file:

setParameter:
  replWriterThreadCount: 32

Method 2: increase the size of the oplog
Method 3: spread the writeOpsToOplog step across multiple ReplWriter threads so it runs concurrently; according to the official development log, this has been implemented (in 3.4.0-rc2)
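Two quick illustrations from the mongo shell: replWriterThreadCount is a startup-only parameter, but its current value can be confirmed with getParameter; and from 3.6 on, Method 2 can be done online with the replSetResizeOplog command (size in megabytes):

// Method 1: confirm the configured number of replay threads
db.adminCommand({getParameter: 1, replWriterThreadCount: 1});

// Method 2 (3.6+, WiredTiger): grow the oplog to 16GB without a restart
db.adminCommand({replSetResizeOplog: 1, size: 16384});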

2.3 Precautions
  • Initial sync copies data with a single thread and is relatively slow; in production, try to avoid triggering initial sync by sizing the oplog appropriately.
  • When adding a new node, initial sync can be avoided by physical replication: copy the dbpath from the primary to the new node and start it directly.
  • When heavy concurrent writes on the primary cause the secondary to lag, and the sizeBytes value of db.serverStatus().metrics.repl.buffer keeps approaching maxSizeBytes, increase the number of ReplWriter threads on the secondary.
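Replication lag itself can be watched by comparing member optimes; a small sketch using rs.status():

// Print each member's lag (in seconds) behind the newest optime in the set
var s = rs.status();
var newest = Math.max.apply(null, s.members.map(function (m) { return m.optimeDate.getTime(); }));
s.members.forEach(function (m) {
    print(m.name + " lag: " + (newest - m.optimeDate.getTime()) / 1000 + "s");
});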

III. Log Analysis

3.1 Initial sync logs

Set the log verbosity level to 1.
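One way to raise the verbosity, assuming the mongo shell (the same effect can be achieved through the configuration file):

// Raise global log verbosity to 1; a component can be scoped,
// e.g. db.setLogLevel(1, "replication")
db.setLogLevel(1);

Then filter: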
cat mg36000.log | egrep "clone|index|oplog" > b.log
Some of the filtered log entries are excerpted below.

Logs from a node newly added on 3.4.21

Since the full log is far too long to post, only the entries for one collection in the db01 database are shown.
You can see that the collection's indexes are created first, and then the collection data and index data are cloned, completing the clone of that collection; it then moves on to the next collection:
2019-08-21T16:50:10.880+0800 D STORAGE  [InitialSyncInserters-db01.test20] create uri: table:db01/index-27-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "num" : 1 }, "name" : "num_1", "ns" : "db01.test2" }),
2019-08-21T16:50:10.882+0800 I INDEX    [InitialSyncInserters-db01.test20] build index on: db01.test2 properties: { v: 2, key: { num: 1.0 }, name: "num_1", ns: "db01.test2" }
2019-08-21T16:50:10.882+0800 I INDEX    [InitialSyncInserters-db01.test20]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-21T16:50:10.882+0800 D STORAGE  [InitialSyncInserters-db01.test20] create uri: table:db01/index-28-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.test2" }),
2019-08-21T16:50:10.886+0800 I INDEX    [InitialSyncInserters-db01.test20] build index on: db01.test2 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.test2" }
2019-08-21T16:50:10.886+0800 I INDEX    [InitialSyncInserters-db01.test20]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-21T16:50:10.901+0800 D INDEX    [InitialSyncInserters-db01.test20]      bulk commit starting for index: num_1
2019-08-21T16:50:10.906+0800 D INDEX    [InitialSyncInserters-db01.test20]      bulk commit starting for index: _id_
2019-08-21T16:50:10.913+0800 D REPL     [repl writer worker 11] collection clone finished: db01.test2
2019-08-21T16:50:10.913+0800 D REPL     [repl writer worker 11]     collection: db01.test2, stats: { ns: "db01.test2", documentsToCopy: 2000, documentsCopied: 2000, indexes: 2, fetchedBatches: 1, start: new Date(1566377410875), end: new Date(1566377410913), elapsedMillis: 38 }
2019-08-21T16:50:10.920+0800 D STORAGE  [InitialSyncInserters-db01.collection10] create uri: table:db01/index-30-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),

Logs from a node newly added on 3.6.12

The difference from 3.4 is that 3.6 makes explicit which threads copy the data: the repl writer worker threads do the replay (per the documentation, 3.4 already worked this way).
It is also explicit that cursors are used.
Otherwise it is the same as 3.4: indexes are created first, then the data is cloned:
2019-08-22T13:59:39.444+0800 D STORAGE  [repl writer worker 9] create uri: table:db01/index-32-3334250984770678501 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),log=(enabled=true)
2019-08-22T13:59:39.446+0800 I INDEX    [repl writer worker 9] build index on: db01.collection1 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.collection1" }
2019-08-22T13:59:39.446+0800 I INDEX    [repl writer worker 9]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T13:59:39.447+0800 D REPL     [replication-1] Collection cloner running with 1 cursors established.
2019-08-22T13:59:39.681+0800 D INDEX    [repl writer worker 7]      bulk commit starting for index: _id_
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7] collection clone finished: db01.collection1
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7]     database: db01, stats: { dbname: "db01", collections: 1, clonedCollections: 1, start: new Date(1566453579439), end: new Date(1566453579725), elapsedMillis: 286 }
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7]     collection: db01.collection1, stats: { ns: "db01.collection1", documentsToCopy: 50000, documentsCopied: 50000, indexes: 1, fetchedBatches: 1, start: new Date(1566453579440), end: new Date(1566453579725), elapsedMillis: 285 }
2019-08-22T13:59:39.731+0800 D STORAGE  [repl writer worker 8] create uri: table:test/index-34-3334250984770678501 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "test.user1" }),log=(enabled=true)

Logs from a node newly added on 4.0.11

Cursors are used; this is essentially the same as 3.6:
2019-08-22T15:02:13.806+0800 D STORAGE  [repl writer worker 15] create uri: table:db01/index-30--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "num" : 1 }, "name" : "num_1", "ns" : "db01.collection1" }),log=(enabled=false)
2019-08-22T15:02:13.816+0800 I INDEX    [repl writer worker 15] build index on: db01.collection1 properties: { v: 2, key: { num: 1.0 }, name: "num_1", ns: "db01.collection1" }
2019-08-22T15:02:13.816+0800 I INDEX    [repl writer worker 15]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T15:02:13.816+0800 D STORAGE  [repl writer worker 15] create uri: table:db01/index-31--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),log=(enabled=false)
2019-08-22T15:02:13.819+0800 I INDEX    [repl writer worker 15] build index on: db01.collection1 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.collection1" }
2019-08-22T15:02:13.819+0800 I INDEX    [repl writer worker 15]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T15:02:13.820+0800 D REPL     [replication-0] Collection cloner running with 1 cursors established.

3.2 Replication logs

2019-08-22T15:15:17.566+0800 D STORAGE  [repl writer worker 2] create collection db01.collection2 { uuid: UUID("8e61a14e-280c-4da7-ad8c-f6fd086d9481") }
2019-08-22T15:15:17.567+0800 I STORAGE  [repl writer worker 2] createCollection: db01.collection2 with provided UUID: 8e61a14e-280c-4da7-ad8c-f6fd086d9481
2019-08-22T15:15:17.567+0800 D STORAGE  [repl writer worker 2] stored meta data for db01.collection2 @ RecordId(22)
2019-08-22T15:15:17.580+0800 D STORAGE  [repl writer worker 2] db01.collection2: clearing plan cache - collection info cache reset
2019-08-22T15:15:17.580+0800 D STORAGE  [repl writer worker 2] create uri: table:db01/index-43--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection2" }),log=(enabled=false)

References:
https://docs.mongodb.com/v4.0/core/replica-set-sync/
https://docs.mongodb.com/v4.0/tutorial/resync-replica-set-member/#replica-set-auto-resync-stale-member
http://www.mongoing.com/archives/2369



Author: hs2021
