Must MongoDB scan the whole collection to locate an oplog entry?


MongoDB's oplog (similar to MySQL's binlog) records all modification operations on the database. Beyond primary-secondary synchronization, the oplog enables many other tricks, for example:

  1. A full backup plus an incremental backup of the oplog makes it possible to restore MongoDB to any point in time
  2. Through the oplog, besides syncing to secondary nodes, data can also be synchronized to a separate cluster (even a heterogeneous database) for disaster recovery, active-active deployments, and other scenarios. For example, Alibaba Cloud's open-source MongoShake implements incremental synchronization based on the oplog.
  3. MongoDB 3.6+ abstracts the oplog behind the "change stream" interface, which effectively lets applications continuously subscribe to database changes and trigger custom events based on them.
  4. ……

In general, MongoDB integrates with the surrounding ecosystem through the oplog to provide data synchronization, migration, recovery, and other capabilities. Building these capabilities has one common requirement: the tool or application must be able to pull the oplog continuously. This process is usually:

  1. Build a cursor starting from the last pulled position
  2. Iterate the cursor to fetch new oplog entries

So the question is: since the MongoDB oplog itself has no index, does locating the starting point in the oplog require a full collection scan every time?

Implementation details of the oplog

{ "ts" : Timestamp(1563950955, 2), "t" : NumberLong(1), "h" : NumberLong("-5936505825938726695"), "v" : 2, "op" : "i", "ns" : "test.coll", "ui" : UUID("020b51b7-15c2-4525-9c35-cd50f4db100d"), "wall" : ISODate("2019-07-24T06:49:15.903Z"), "o" : { "_id" : ObjectId("5d37ff6b204906ac17e28740"), "x" : 0 } }
{ "ts" : Timestamp(1563950955, 3), "t" : NumberLong(1), "h" : NumberLong("-1206874032147642463"), "v" : 2, "op" : "i", "ns" : "test.coll", "ui" : UUID("020b51b7-15c2-4525-9c35-cd50f4db100d"), "wall" : ISODate("2019-07-24T06:49:15.903Z"), "o" : { "_id" : ObjectId("5d37ff6b204906ac17e28741"), "x" : 1 } }
{ "ts" : Timestamp(1563950955, 4), "t" : NumberLong(1), "h" : NumberLong("1059466947856398068"), "v" : 2, "op" : "i", "ns" : "test.coll", "ui" : UUID("020b51b7-15c2-4525-9c35-cd50f4db100d"), "wall" : ISODate("2019-07-24T06:49:15.913Z"), "o" : { "_id" : ObjectId("5d37ff6b204906ac17e28742"), "x" : 2 } }

The above are sample MongoDB oplog entries. The oplog is itself a collection, but it differs from an ordinary collection:

  1. The oplog is a capped collection: when it exceeds the configured size, the oldest inserted data is deleted
  2. Oplog entries have no _id field; the ts field uniquely identifies an entry, and the oplog data is organized in ts order
  3. The oplog has no indexes at all, so finding a particular entry requires scanning the whole collection

When pulling the oplog, the first pull starts from the beginning, and after each pull the ts field of the last entry is recorded. If the application restarts, it must find the pull starting point based on the last recorded ts, and then continue traversing from there.

The oplogStartHack optimization

Note: the implementation described below requires the WiredTiger storage engine and MongoDB 3.0+.

If MongoDB uses the WiredTiger storage engine underneath, oplog storage is in fact already optimized. MongoDB stores each entry in the WiredTiger engine as a key-value pair, with the ts field as the key and the oplog content as the value. WiredTiger's default storage format is a B-tree, so the oplog data is physically stored inside WT in ts order. Since the storage is ordered, there is room for a binary-search optimization.

The MongoDB find command provides an option, oplogReplay, specifically for optimizing oplog positioning.

In short, if the collection being queried is the oplog and the filter on the ts field uses gte, gt, or eq, MongoDB will optimize the query and quickly locate the starting point via binary search. When a secondary node pulls the oplog for synchronization, it actually uses this option, so every time the secondary restarts it can quickly find the sync starting point from the last synchronized position and then keep syncing from there.

oplogStartHack implementation

Since some readers asked about the internal implementation, the key points are listed briefly here; a thorough understanding requires digging into the details.

// src/mongo/db/query/get_executor.cpp
StatusWith<unique_ptr<PlanExecutor>> getExecutorFind(OperationContext* txn,
                                                     Collection* collection,
                                                     const NamespaceString& nss,
                                                     unique_ptr<CanonicalQuery> canonicalQuery,
                                                     PlanExecutor::YieldPolicy yieldPolicy) {
    // When building the find execution plan, take the optimized path
    // if the oplogReplay option is set
    if (NULL != collection && canonicalQuery->getQueryRequest().isOplogReplay()) {
        return getOplogStartHack(txn, collection, std::move(canonicalQuery));
    }
    // ...
    return getExecutor(
        txn, collection, std::move(canonicalQuery), PlanExecutor::YIELD_AUTO, options);
}

StatusWith<unique_ptr<PlanExecutor>> getOplogStartHack(OperationContext* txn,
                                                       Collection* collection,
                                                       unique_ptr<CanonicalQuery> cq) {
    // ...
    // See if the RecordStore supports the oplogStartHack.
    // If the underlying engine supports it (WiredTiger does, mmapv1 does not),
    // find the start location from the queried ts
    const BSONElement tsElem = extractOplogTsOptime(tsExpr);
    if (tsElem.type() == bsonTimestamp) {
        StatusWith<RecordId> goal = oploghack::keyForOptime(tsElem.timestamp());
        if (goal.isOK()) {
            // Ends up calling src/mongo/db/storage/wiredtiger/
            // wiredtiger_record_store.cpp::oplogStartHack
            startLoc = collection->getRecordStore()->oplogStartHack(txn, goal.getValue());
        }
    }
    // ...
    // Build our collection scan...
    // The collection-scan parameters carry startLoc, so execution can jump
    // straight to that point instead of starting from the head
    CollectionScanParams params;
    params.collection = collection;
    params.start = *startLoc;
    params.direction = CollectionScanParams::FORWARD;
    params.tailable = cq->getQueryRequest().isTailable();
    // ...
}

Author: Zhang Youdong

This article is original content from the Yunqi community and may not be reproduced without permission.