RocksDB code parsing – db.h



  This article is an overall analysis of RocksDB's functionality, focusing on the db.h file. To understand RocksDB's code as a whole, db.h is the place to start. Setting aside the string of structures at the top of the file, its core is undoubtedly class DB, which runs from roughly line 100 to past line 1000.

  This article runs to more than 7000 words and is divided into four parts: overall methods, property methods, parameter analysis, and others.

  The following analyzes the inside of the class. To save space, function parameters are not written out.

Overall methods

The following are the overall methods. The main operations include the basic put, get and delete key-value operations, operations on the database itself, and RocksDB's own extensions.

Open Opens the database (with the default column family) at the path given by the name parameter (a string reference). On success it stores a pointer to a heap-allocated database in *dbptr and returns OK; on error it stores nullptr in *dbptr and returns a non-OK status. When *dbptr is no longer needed, the caller should delete it (delete db). The result comes back directly through dbptr. Generally there is only one database; if you want to partition your data, using column families is much faster.
OpenForReadOnly Opens the database read-only. Column families are not considered here; operations that change data, such as deletes, return an error, and no compaction is triggered.
OpenForReadOnly Opens one or more column families in read-only mode; it is distinguished from the overload above by its parameters. The column family names are passed in a vector. Only the specified column families can be opened; if no column families are specified, the default column family is operated on. The default column family name is stored in kDefaultColumnFamilyName and can be changed. If you want to operate on n column families, you should create n-1 of them (the default one already exists).

OpenAsSecondary With the secondary-instance feature, this method creates a secondary node that dynamically tracks the MANIFEST file of an existing node. The user can call TryCatchUpWithPrimary when needed to make the secondary catch up with the primary. The primary takes precedence over the secondary; once the secondary has started, it can be brought back up to the primary's state as long as the primary's handle has not been destroyed.

The options parameter specifies the options for opening the secondary.

The name parameter specifies the name of the primary database used to open the primary node.

The secondary_path parameter points to the directory where the secondary stores its info logs.

dbptr is the out-arg corresponding to the opened secondary [an external pointer]; it points to a heap-allocated database, which the user should delete after use.

Tracking the primary's WAL in real time is not yet supported on the secondary, but it will be soon.

OpenAsSecondary The overload above opens the secondary with the default column family; this one opens it with the specified column families. The extra parameter is column_families, which specifies the column families to open. [The so-called secondary can be understood in terms of MVCC's multi-version chain.]

Open Opens the database with column families; distinguished from the overloads above by its parameters.

In fact, once you dig into the code you will find that a column family always exists whether you specify one or not; if you pass nothing, it is the default column family.

db_options specifies database-wide options.

column_families is a vector of all the column families in the database, each entry holding a column family name and its options.

-> You can use ListColumnFamilies() to get a list of all column families. You can also open a subset of column families read-only.

The default column family is named default and is stored in rocksdb::kDefaultColumnFamilyName.

If all goes well, on return handles will have the same size as column_families -> handles[i] is the handle used to perform operations on column_family[i].

A database opened with this function requires deleting all the handles with DestroyColumnFamilyHandle() before the database itself is deleted.

Resume Recovers the database after a background error; this is different from the low-level database repair operation, as it resumes operation rather than repairing on-disk state.
Close Closes the database, releasing resources and cleaning up files. To avoid accidents, this method should be called before the destructor; if the database still has unflushed data, it returns failure.
ListColumnFamilies This method opens the database with the name given in the parameters and returns, through the column_families parameter, a list of all the column families in the database.
CreateColumnFamily Creates a column family, returns its handle through the handle parameter, and reports success or failure through the returned Status.
CreateColumnFamilies Creates column families in batch, all sharing the same column family options. The call may partially succeed and partially fail; the ones created successfully are kept.
CreateColumnFamilies This overload also creates column families in batch. The difference is that in the one above all column families share the same options and the parameter passed in is a list of name strings; here column family descriptors are passed in, so each column family's options may differ.
DropColumnFamily Drops a column family through its handle pointer. This call only records a drop record in the MANIFEST and prevents the column family from being flushed and compacted.
DropColumnFamilies Drops column families in batch. This call only records drop records in the MANIFEST and prevents the column families from being flushed and compacted. If an error occurs, the request may be partially successful; the user can call ListColumnFamilies to check the result.

DestroyColumnFamilyHandle Closes the column family specified by the column_family handle and destroys the handle, so it cannot be deleted twice. By default this call deletes the column family handle. Note: use this method to close a column family instead of deleting the handle directly.


Put One of the most basic key-value store operations. Input is in Slice form; whether a specific column family is used is decided by the parameters, and the write finally goes in as a write batch. Updates are also performed through Put.

Delete One of the most basic key-value store operations: deletes by key. If the key does not exist, no error is reported, because a delete is essentially an insertion: a Put of the key carrying a delete marker; the earlier contents are removed in a subsequent compaction.
SingleDelete Deletes by key. It requires that the key exists and has not been overwritten. If the key does not exist, no error is reported. If the key has been overwritten (updated), the result of calling this is undefined: for SingleDelete() to operate correctly, there must have been exactly one Put() of the key since its last SingleDelete(). [Only one record of the key may be in the store] -> this deletion method only removes the latest record, so there must not be multiple records. Currently this feature is an experimental performance optimization for a very specific workload. The caller should ensure that SingleDelete is applied only to keys that were not removed with Delete() [because a Delete is essentially a Put too] and not written with Merge(). Mixing SingleDelete operations with Deletes and Merges can lead to undefined behavior.

DeleteRange Deletes the data in the range ["begin_key", "end_key"), but this is generally not used in production practice.

Reasons: 1. Accumulating many range tombstones in the memtables degrades read performance. (This can be avoided by occasionally flushing manually.) 2. Limiting the maximum number of open files while range tombstones are present degrades read performance. (To avoid this, try setting max_open_files to -1, i.e. no limit.)

Merge The merge operation, which is simply an append to a value that already exists in the database, e.g. key1's value hello becomes helloworld. It is an algorithmic optimization of this kind of operation, which otherwise takes three steps: read, change, write. The semantics of the operation are decided by the merge_operator the user supplies when opening the DB; in effect it decides how the values are consolidated in later compactions.
Write Applies the specified updates to the database. If 'updates' contains no updates but options.sync = true [synchronous], the WAL will still be synced.
Get One of the most basic key-value store operations: looks up by key without affecting the value. If the key is not found, Status::IsNotFound() is returned; other errors return the corresponding Status. Get first queries the tables in memory (first the cache, then the memtable); if that fails, it queries the persistent LSM-tree below, level by level.
GetMergeOperands Returns all the merge operands corresponding to the key. Operands here refers to the individual operations the merge decomposes into.
MultiGet As the name suggests, gets multiple values at once; in terms of parameters, the keys are passed as a vector of Slices. If keys[i] does not exist in the database, the i-th returned status will have Status::IsNotFound() true and *values[i] will be set to some arbitrary value (usually ""). Otherwise, the i-th returned status will have Status::ok() true and *values[i] will hold the value associated with keys[i].
MultiGet An overloaded MultiGet API that improves performance and efficiency by batching work in the read path. However, only the block-based table format with full filters is supported.
KeyMayExist Used to judge whether a key may exist: if the key definitely does not exist in the database, it returns false, otherwise it returns true. If the caller wants to obtain the value when the key is found in memory, value_found must be set to true [default false]. Compared with DB::Get(), this method is lighter and consumes fewer resources (it can avoid I/O).
NewIterator Creates an iterator and returns a heap-allocated database iterator. The result of NewIterator() is initially invalid (the caller must call one of the Seek methods on the iterator before using it). The caller should delete the iterator when it is no longer in use, and all returned iterators should be deleted before the database itself is. If only a read option is passed in, the default column family is operated on; if a column family pointer or an array of column families is passed in, you can operate across column families.
GetSnapshot Returns a handle to the current database state [a snapshot]. Iterators created with this handle observe a stable snapshot of the current database state. When the snapshot is no longer needed, the caller must call ReleaseSnapshot(result) to release it. If the database cannot take a snapshot or does not support snapshots, it returns nullptr.
ReleaseSnapshot Releases the snapshot. Do not use the snapshot again after releasing it.



The Properties struct contains all the valid property names, whose values are obtained through GetProperty(). These properties reflect the current running state of the database along multiple dimensions, although the concrete layout of the underlying storage is not shown. If you want to see that [taking db_bench as an example], you can find the folder beginning with rocksdb under the system's tmp folder, where you can see the operation log, the actual SST files, and the database option files; it is worth getting familiar with. The database option settings largely determine the database's performance. Options can be modified in the relevant files [not recommended], modified at runtime by calling the relevant interfaces, or passed in as external parameters when the database is run.



Returns a string containing the number of files at the given level, expressed in ASCII.


Returns a string containing the compression ratio of level N, expressed in ASCII. Here the compression ratio is uncompressed data size / compressed file size. Returns "-1.0" if there are no files.


Returns a multiline string containing the data described by kCFStats followed by the data described by kDBStats.


Returns a multiline string summarizing the current SST file


It is equivalent to "rocksdb.cfstats-no-file-histogram" combined with "". See the histogram description below for details.


Returns a multiline string containing general column family statistics for each level ("L<n>") of the DB, aggregated over the lifetime of the DB ("Sum") and over the interval since the last retrieval ("Int"). It can also be used to return the statistics in map form; in that case each level and "Sum" will map string keys to double values. When the statistics are retrieved in that form, the "Int" statistics are not affected.


Outputs a histogram of how many file reads go to each level, and of the latency of single requests.


Returns a multiline string containing general database statistics, both cumulative (over the lifetime of the database) and interval (since the last retrieval of kDBStats).


Returns a multiline string containing the number of files per level and the total size of each level in MB.


Returns the number of immutable memtables that have not been flushed.


Returns the number of immutable memtables ready to flush.


If memtable flush is suspended, 1 is returned; Otherwise, 0 is returned.


Returns the number of flushes currently running


Returns 1 if at least one compaction is pending, otherwise 0


Returns the number of compactions in progress


Returns the accumulated number of background errors


Returns the approximate size (in bytes) of the active memtable


Returns the approximate size (in bytes) of the active memtable and the immutable memtable that is not flushed.


Returns the approximate size (in bytes) of the active memtable, the immutable memtable that has not been flushed and the immutable memtable that has been flushed.


Returns the total number of entries in the active memtable.


Returns the total number of entries in immutable memtable that are not flushed.


Returns the total number of deleted entries in the active memtable.


Returns the total number of deleted entries in immutable memtable that are not flushed.


Returns the estimated number of memtables and immutable memtables that are not flushed and the total keys in the persistent store.


Returns the estimated memory used to read the SST table, excluding the memory used in the block cache (such as filters and index blocks).


Returns 0 if delete obsolete files is enabled, otherwise a non-zero number.


Returns the number of unreleased snapshots of the database


Returns a number representing the UNIX timestamp of the oldest unreleased snapshot.


Returns the number of live versions. "Version" is an internal data structure; see version_set.h for details. More live versions usually mean more SST files are being held back from deletion (by iterators or unfinished compactions).


Returns the current LSM version number, a uint64_t integer that is incremented after any change to the LSM tree. The number is not preserved after a restart of the database; it starts again from 0.


Returns an estimate of the amount of live data in bytes


Returns the minimum log number of the log files that should be kept


Returns the minimum file number of the obsolete SST files to keep. If all obsolete files can be deleted, returns the maximum value of uint64_t.


Returns the total size (in bytes) of all SST files. If there are too many files, the speed of online query may be reduced.


Returns the total size (in bytes) of all SST files belonging to the latest LSM tree.


Returns the level to which L0 data will be compacted


Returns the estimated total number of bytes compaction needs to rewrite to bring all levels down below their target size. Not valid for compaction styles other than level-based.


Returns a string representation of the aggregate table property of the target column family.


Same as before, but only for layer n


Returns the current actual delayed write rate. 0 means no delay.


Judges the write status: returns 1 if writes have stopped.


Returns an estimate of the oldest key's timestamp in the database. Currently only applicable to FIFO compaction, and requires compaction_options_fifo.allow_compaction = false.


Returns the block cache capacity


Returns the memory size of the entry residing in the block cache


Returns the memory size of the entries pinned in the block cache.


Returns a description of options.statistics as a multiline string


Next come the functions concerning properties:


The database implementation can export properties about its state through this method. If "property" is a valid property understood by this database implementation (for valid properties, see the Properties struct above), it fills "*value" with its current value and returns true; otherwise it returns false.


As above, but retrieves the property and its value in map form.


Similar to GetProperty() above, but only works for the subset of properties whose return value is an integer.


Same as GetIntProperty(), but this function returns the integer property aggregated over all column families.


Resets the internal statistics of the database and all column families. Note that this does not reset options.statistics, since that is not owned by the database.


GetApproximateSizes This returns the approximate size actually occupied: if the user data compresses ten to one, the returned size will be one tenth of the corresponding user data size.


GetApproximateMemTableStats This method is similar to GetApproximateSizes, except that it returns the approximate number of records in the memtable.


CompactRange Compacts the underlying storage for the key range [*begin, *end]. The actual compaction interval may be a superset of [*begin, *end]. In particular, deleted and overwritten versions are discarded, and the data is rearranged to reduce the cost of the operations needed to access it. Typically this operation should only be invoked by users who understand the underlying implementation. begin == nullptr is treated as a key before all keys in the database, and end == nullptr as a key after all keys, so the following call compacts the entire database: db->CompactRange(options, nullptr, nullptr); Note that after the whole database is compacted, all data is pushed down to the last level containing any data. If the total data size shrinks after compaction, that level may not be appropriate for storing all the files. In that case the client can set options.change_level to true to move the files back down to the minimal level able to hold the data set (or to the given level, specified by a non-negative options.target_level).


SetOptions As the name suggests, sets database options; order does not matter, you only need to pass the option name/value pairs in as a collection.


CompactFiles() takes a list of files specified by file number and compacts them to the specified level. This behavior differs from CompactRange() in that CompactFiles() uses the current thread to execute the compaction job, i.e. 'immediately'.


PauseBackgroundWork This function waits until all currently running background work has completed. After it returns, no background work (operations such as compaction and flush) will run until ContinueBackgroundWork is called.


ContinueBackgroundWork Resumes the database's background work. Because running background tasks are allowed to finish before the pause takes effect, no task will ever be resumed halfway.


EnableAutoCompaction If automatic compaction was previously disabled for the given column families, this enables it again. [Disabled: disable_auto_compactions set to 'true' through the SetOptions() API]



Turns manual compaction off and on. Obviously, after disabling it the user cannot actively invoke compaction. In fact, unless you know RocksDB well enough, enabling it is not recommended.


NumberLevels Gets the number of levels used by this database. [the current maximum number of levels]


MaxMemCompactionLevel Returns the maximum level to which a newly compacted memtable is pushed if it does not create overlap.


Level0StopWriteTrigger The number of files at level 0 at which writes are stopped. If no column family parameter is given, it operates on the default column family.


GetName Gets the database name: exactly the same name that was provided as the parameter to DB::Open().


GetEnv Gets the environment object from the database. Like LevelDB, RocksDB handles the system environment in the form of an object.


GetOptions Gets the options we set for the database.


Flush The flush operation is also one of the differences between RocksDB and LevelDB: it flushes the memtable contents to disk. The basic form flushes the contents of a single column family; if you want to flush multiple column families, use Flush(options, column_families).


FlushWAL Flushes the WAL memory buffer to the file. If sync is true, SyncWAL is called afterwards.


SyncWAL Syncs the WAL. Note that Write() followed by SyncWAL() is not exactly the same as Write() with sync = true: in the latter case, the changes are not visible until the sync is complete. Currently only valid when options.allow_mmap_writes = false.

LockWAL / UnlockWAL

Lock and unlock the WAL. Locking also flushes the WAL, as FlushWAL does.


GetLatestSequenceNumber Returns the sequence number of the most recent write; it is incremented by one every time an actual key is written.


Indicates that the database should preserve deletions with sequence numbers >= the passed-in seqnum. Has no effect if DBOptions.preserve_deletes is set to false. This function assumes the user calls it with monotonically increasing seqnums (otherwise we cannot guarantee that some particular delete operations are preserved); it returns true after the data has been successfully updated, and false if the user attempts to call it with a seqnum <= the current value.


DisableFileDeletions Prevents file deletions. Compactions continue, but obsolete files are not deleted. Calling this many times has the same effect as calling it once.


EnableFileDeletions Allows compactions to delete obsolete files.

If force == true, the call to EnableFileDeletions() guarantees that file deletions are enabled after the call, even if DisableFileDeletions() was called multiple times before.

If force == false, EnableFileDeletions enables file deletions only after being called at least as many times as DisableFileDeletions(), which makes it possible for two threads to call the two methods concurrently without synchronization; that is, file deletions are enabled only after both threads have called EnableFileDeletions().


GetLiveFiles followed by GetSortedWalFiles can produce a lossless backup.

- Retrieves a list of all files in the database.

The files are relative to the dbname, not absolute paths. Although they are relative paths, the file names start with "/".

The valid size of the manifest file is returned in manifest_file_size. The manifest is a growing file, but only the portion specified by manifest_file_size is valid for this snapshot.

Setting flush_memtable to true flushes before recording the live files. When we do not want to wait for a flush, we can set flush_memtable to false, since the flush may have to wait for compactions to complete, which can take an indeterminate amount of time.

If you have multiple column families, even with flush_memtable true, you still need to call GetSortedWalFiles after GetLiveFiles to compensate for new data that arrived while the other column families were being flushed.


GetSortedWalFiles Retrieves the sorted list of all WAL files, starting from the oldest.


GetCurrentWalFile Retrieves information about the current WAL file.

Note that the log may have rolled after this call, in which case current_log_file would not point to the current log file.

In addition, as an optimization, current_log_file->StartSequence is always set to 0.


GetCreationTimeOfOldestFile Retrieves the creation time of the oldest file in the database. This API is only valid when max_open_files = -1; otherwise the returned status is Status::NotSupported(). The file creation time is set using the Env provided to the database.

If the database was created by a very old release, the SST files may not have the file_creation_time attribute; even after moving to a newer release, it is possible that some files were never compacted and so lack the file_creation_time attribute.

In both cases, file_creation_time is treated as 0, which means the API will return creation_time = 0, since no timestamp can be less than 0.


GetUpdatesSince Note: this API is not yet consistent with WritePrepared transactions.

Sets iter to an iterator over the write batches starting from seq_number. If the sequence number does not exist, it returns an iterator positioned at the first available seq_no after the requested seq_no. Status::OK is returned if the iterator is valid. wal_ttl_seconds or wal_size_limit_MB must be set to a large value to use this API, otherwise the WAL files are cleaned up aggressively and the iterator may become invalid before the updates are read.


DeleteFile Deletes the file named 'name' from the DB directory and updates the internal state to reflect that.

- Only deleting SST files and log files is supported.

- 'name' must be a path relative to the DB directory, e.g. 000001.sst, /archive/000003.log


GetLiveFilesMetaData Returns a list of all table files, with their level, start key and end key.


GetColumnFamilyMetaData Gets the metadata of the specified column family in the database. If no column family is specified, the metadata of the default column family is obtained.


IngestExternalFile() loads external SST files into the database and supports two main modes:

(1) Duplicate keys in the new files overwrite existing keys (default)

(2) Duplicate keys are skipped (set ingest_behind = true)

In the first mode, we try to find the lowest level the file can fit into and ingest the file at that level. Files whose key range overlaps the memtable's key range require us to flush the memtable before ingesting the file.

In the second mode, we always ingest at the bottommost level.

In addition: (1) you can use SstFileWriter to create the external SST files. (2) We try to ingest the files at the lowest possible level even if the file's compression does not match the level's compression. (3) If IngestExternalFileOptions->ingest_behind is set to true, we always ingest at the bottommost level, so the bottommost level should be kept reserved (see the DBOptions::allow_ingest_behind flag).


IngestExternalFiles() ingests files for multiple column families and records the results atomically in the MANIFEST. If this function returns OK, the ingestion must have succeeded for all column families. If this function returns anything else, or the process crashes, no files will have been ingested into the database after recovery.

Note that during the execution of this function, applications may observe a mixed state. If you use iterators to perform range scans over the column families, an iterator on one column family may return ingested data while an iterator on another column family still returns old data.

Users can use snapshots to get a consistent view of the data. If your database ingests multiple SST files with this API, i.e. args.size() > 1, RocksDB 5.15 and earlier will not be able to open it.

Requirement: each arg corresponds to a different column family, i.e. for 0 <= i < j < args.size(), args[i].column_family != args[j].column_family.


CreateColumnFamilyWithImport() creates a new column family named column_family_name and imports the external SST files specified in the metadata into that column family.

(1) External SST files can be created with SstFileWriter.

(2) External SST files can be exported from a particular column family of an existing database.

The option in import_options specifies whether the external files are copied or moved (copied by default).

If the option specifies copying, managing the files at external_file_path remains the caller's responsibility. If the option specifies a move, the call ensures that the files at external_file_path are deleted on successful return, and does not modify them on any error return. On an error return, the returned column family handle will be nullptr. The column family will exist on a successful return and will not exist on an error return; the column family may be present after any crash during this call.


AddFile The old method of importing files directly has been deprecated; please use the methods above.


GetPropertiesOfAllTables Gets the table properties of all tables.


SuggestCompactRange Takes the column family and the begin and end keys, and suggests that range for compaction.


StartTrace / EndTrace Start and stop tracing database operations.



StartBlockCacheTrace / EndBlockCacheTrace Start and stop tracing block cache accesses.


GetStatsHistory Given a time range [start_time, end_time), sets a StatsHistoryIterator for accessing the statistics history.


TryCatchUpWithPrimary Makes the secondary catch up with the primary as far as possible.

So far, the content of class DB ends. Below are two functions related to the database as a whole.


DestroyDB Destroys the contents of the specified database. Be very careful with this method.


RepairDB If the database cannot be opened, you can try calling this method to salvage as much of the database's contents as possible. Some data may be lost, so be careful when calling this function on a database that contains important information. With the overload that takes column_families, data belonging to column families not listed in column_families is warned about and skipped.

This concludes the contents of the db.h file. It mainly covers the database's overall operations: column family related, node related, basic database operations, and some operations unique to RocksDB. In addition, you can inspect the running state of the current database in real time and obtain the relevant values through GetProperty. Finally, there are a few general database utility functions for various RocksDB versions. It has to be said that RocksDB's feature set is almost too comprehensive.

