Container Monitoring Practice-Prometheus Storage Mechanism



Prometheus ships with local storage in the form of its own time series database (TSDB). Local storage makes Prometheus simple and efficient to use, and since Prometheus 2.0 its data compression has improved considerably. On a single node it meets the monitoring needs of most users.

However, local storage also limits the scalability of Prometheus and raises problems such as data persistence. Rather than implementing clustered storage itself, Prometheus addresses the limits of single-node storage by providing remote read and write interfaces, so users can choose a suitable time series database to scale Prometheus.

The TSDB in Prometheus 1.x (the V2 storage engine) uses LevelDB for its indexes and a compression scheme modeled on Facebook's Gorilla, which compresses a 16-byte data point to an average of 1.37 bytes.

Prometheus 2.x introduces a new V3 storage engine with higher write and query performance.

Everything below is based on Prometheus 2.7.

Local storage

Storage principle

Prometheus groups samples into two-hour blocks. Each block is a directory containing one or more chunk files (which hold the time series data), a metadata file, and an index file (which locates time series data in the chunk files by metric name and labels).

The most recently written data is kept in an in-memory block and written to disk after two hours. To prevent data loss if the program crashes, Prometheus uses a write-ahead log (WAL). On startup, the WAL is replayed to recover the in-memory data.
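
The write-ahead idea can be illustrated with a minimal sketch. This is not Prometheus's WAL format or code: the class TinyWAL, the JSON-lines log layout, and the file name wal.log are all assumptions made up for illustration. It only shows the ordering that matters: log first, then update memory, and replay the log on startup.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy write-ahead log (hypothetical, not the Prometheus format).

    Every sample is appended to a log file before it is applied to the
    in-memory head block, so the head can be rebuilt after a crash.
    """

    def __init__(self, path):
        self.path = path
        self.head = {}  # series name -> list of (timestamp, value)
        self._replay()
        self._log = open(self.path, "a", encoding="utf-8")

    def _replay(self):
        # On startup, rebuild the in-memory state from the log, if one exists.
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                self._apply(rec["series"], rec["ts"], rec["value"])

    def _apply(self, series, ts, value):
        self.head.setdefault(series, []).append((ts, value))

    def append(self, series, ts, value):
        # Log first and flush, then mutate memory.
        record = {"series": series, "ts": ts, "value": value}
        self._log.write(json.dumps(record) + "\n")
        self._log.flush()
        self._apply(series, ts, value)

    def close(self):
        self._log.close()


def demo_crash_recovery():
    """Write two samples, discard the in-memory state, recover from the log."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "wal.log")
        wal = TinyWAL(path)
        wal.append("requests_total", 1000, 1.0)
        wal.append("requests_total", 2000, 2.0)
        wal.close()  # stands in for a crash: memory is gone, the log remains
        recovered = TinyWAL(path)
        result = recovered.head
        recovered.close()
        return result
```

Calling demo_crash_recovery() rebuilds the head purely from the log file, which is the property the WAL is there to provide.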

When data is deleted, deleted entries are recorded in a separate tombstone file rather than deleted immediately from the chunk file.

Storing all sample data in time-window blocks significantly improves query efficiency. A query for all samples in a given period only needs to read the blocks that fall within that time range.

In the background, these two-hour blocks are compacted into larger blocks: the data is merged into a higher-level block and the lower-level blocks are then deleted. This follows the same idea as LSM-tree stores such as LevelDB and RocksDB.

This design closely resembles Gorilla's, so Prometheus is almost equivalent to a caching TSDB. The nature of its local storage means it is not suited to long-term data storage, only to storing and querying a short window of time series data, and it is not highly available (if the node goes down, historical data cannot be read).

Before the in-memory block data has been written to disk, the block directory mainly holds the WAL files.

Once a block is persisted, the WAL files in its directory are deleted and the time series data is stored in chunk files. The index file records where each time series is located in the chunk files.


Storage configuration

For local storage, Prometheus provides some configuration items, including:

  • --storage.tsdb.path: the directory where data is stored; defaults to data/. Set it explicitly if you mount external storage.
  • --storage.tsdb.retention.time: how long data is kept before it is cleaned up; defaults to 15 days.
  • --storage.tsdb.retention.size: experimental; the maximum size of the stored blocks, not counting WAL files, for example 512MB.
  • --storage.tsdb.retention: deprecated and replaced by --storage.tsdb.retention.time.

Prometheus keeps all chunks currently in use in memory. In addition, it keeps the most recently used chunks in memory, up to a limit configurable through the storage.local.memory-chunks flag. Note that the storage.local.* flags and the prometheus_local_storage_* metrics below belong to the 1.x storage engine and are not available in Prometheus 2.x.

Monitor current memory usage:

  • prometheus_local_storage_memory_chunks
  • process_resident_memory_bytes

Monitor current storage metrics:

  • prometheus_local_storage_memory_series: the current number of series held in memory
  • prometheus_local_storage_memory_chunks: the current number of chunks held in memory
  • prometheus_local_storage_chunks_to_persist: the number of in-memory chunks that still need to be persisted to disk
  • prometheus_local_storage_persistence_urgency_score: the persistence urgency score

Storage upgrade of Prometheus 2.0

Prometheus 2.0 was released on November 8, 2017, with a redesigned and optimized storage engine.

Overall performance improvement:

  • CPU usage is reduced to 20%-40% of that of Prometheus 1.8.
  • Disk space usage is reduced to 33%-50% of that of Prometheus 1.8.
  • Without much query load, disk I/O usually averages less than 1%.

In dynamic environments such as Kubernetes clusters, the data in Prometheus can be pictured as a plane:

  • The vertical dimension represents all stored series
  • The horizontal dimension represents time, along which the samples are spread

For example:

requests_total{path="/status", method="GET", instance=""}
requests_total{path="/status", method="POST", instance=""}
requests_total{path="/", method="GET", instance=""}

Prometheus periodically collects new data points for all series, which means it must perform vertical writes at the right-hand end of the time axis. When querying, however, we may want to access a rectangle anywhere on the plane, selected by arbitrary label conditions.

Therefore, to find the queried series efficiently in a large amount of data, we need an index.

The Prometheus 1.x storage layer handled the vertical write pattern well, but as scale increased its index ran into problems. The storage engine and index were therefore redesigned in version 2.0. The main changes are as follows:

Sample compression

Sample compression in the existing storage layer played an important role in early versions of Prometheus. A single raw data point occupies 16 bytes of storage, so when Prometheus collects hundreds of thousands of data points per second, it can fill a hard disk quickly.

However, samples in the same series (with the same labels) are often very similar, and this similarity can be exploited to compress them effectively. By compressing the samples of a series in batches as chunks, each data point is reduced in memory to an average of 1.37 bytes.

This compression scheme works well and is retained in the design of the new 2.0 storage layer. For details of the compression algorithm, see Facebook's "Gorilla" paper.
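
To see why similar samples compress well, here is a minimal sketch of the two ideas described in the Gorilla paper: delta-of-delta encoding for timestamps and XOR of consecutive float values. It is not Prometheus's actual chunk encoder, which packs these values into variable-length bit fields; the function names are made up for illustration. The sketch only shows that regular scrapes and slowly changing values turn into long runs of zeros and small numbers.

```python
import struct


def delta_of_delta(timestamps):
    """Return (first timestamp, first delta, list of delta-of-deltas)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def restore_timestamps(first, first_delta, dods):
    """Invert delta_of_delta and rebuild the original timestamps."""
    out = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out


def xor_values(values):
    """XOR each float64 bit pattern with its predecessor.

    The first element is the raw bit pattern; identical consecutive
    values produce 0, which a bit-level encoder can store very cheaply.
    """
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [bits[0]] + [b ^ a for a, b in zip(bits, bits[1:])]
```

For timestamps scraped at a nearly fixed interval, almost every delta-of-delta is 0, and for a value that rarely changes, almost every XOR result is 0. That is where the large saving over 16 raw bytes per sample comes from.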

Time slice

The new storage layer is divided into blocks, each of which holds all series for a period of time. Each block acts as a separate database.

In this way, each query only needs to examine the subset of blocks that fall within the requested time range, and query execution time naturally decreases.

This layout also makes it very easy to delete old data, which was a time-consuming operation in the 1.x storage design. In 2.x, once a block's time range falls entirely behind the configured retention boundary, the whole block can simply be discarded.
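
Both operations reduce to simple comparisons on block time ranges, as the following sketch shows. The Block type, its field names, and the millisecond unit are assumptions for illustration and do not reflect Prometheus's internal data structures.

```python
from dataclasses import dataclass

HOUR_MS = 3_600_000  # one hour in milliseconds


@dataclass(frozen=True)
class Block:
    name: str      # hypothetical block directory name
    min_time: int  # inclusive start, in ms
    max_time: int  # exclusive end, in ms


def blocks_for_query(blocks, start, end):
    """Blocks whose [min_time, max_time) range overlaps the query [start, end]."""
    return [b for b in blocks if b.min_time <= end and b.max_time > start]


def split_by_retention(blocks, now, retention):
    """Split blocks into (keep, drop).

    A block is dropped only when it lies entirely before now - retention;
    a block that straddles the boundary is kept whole.
    """
    cutoff = now - retention
    keep = [b for b in blocks if b.max_time > cutoff]
    drop = [b for b in blocks if b.max_time <= cutoff]
    return keep, drop
```

Because deletion works on whole blocks, no individual samples have to be rewritten, which is what makes retention cheap in this layout.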

Prometheus queries generally use the metric name and labels as keys, and these are free-form, user-defined strings, so a conventional SQL database is a poor fit. Instead, the Prometheus storage layer borrows the inverted index concept from full-text search: each time series is treated as a small document, and its metric name and labels are the words in that document.

For example, requests_total{path="/status", method="GET", instance=""} is a document containing the following words:

  • __name__="requests_total"
  • path="/status"
  • method="GET"
  • instance=""
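
A minimal sketch of such an inverted index follows. The class InvertedIndex, the integer series IDs, and the omission of the instance label are assumptions for illustration; the real index is an on-disk format with sorted postings lists. The core idea is the same: map each label pair to the set of series that contain it, then intersect those sets to answer a query.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index from (label name, label value) to series IDs."""

    def __init__(self):
        self.postings = defaultdict(set)  # (name, value) -> set of series IDs
        self.series = {}                  # series ID -> labels

    def add(self, series_id, labels):
        self.series[series_id] = dict(labels)
        for pair in labels.items():
            self.postings[pair].add(series_id)

    def select(self, matchers):
        """Sorted IDs of the series that match every label=value matcher."""
        if not matchers:
            return sorted(self.series)
        sets = [self.postings.get(pair, set()) for pair in matchers.items()]
        return sorted(set.intersection(*sets))


def build_example_index():
    """Index the three example series; the IDs 1-3 are arbitrary."""
    idx = InvertedIndex()
    idx.add(1, {"__name__": "requests_total", "path": "/status", "method": "GET"})
    idx.add(2, {"__name__": "requests_total", "path": "/status", "method": "POST"})
    idx.add(3, {"__name__": "requests_total", "path": "/", "method": "GET"})
    return idx
```

A selector such as requests_total{method="GET"} then becomes the intersection of the postings for __name__="requests_total" and method="GET".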

Benchmark test

CPU usage, memory usage, and query efficiency are all greatly improved compared with version 1.x.

Specific test results reference:…

Fault recovery

If you suspect that problems are caused by corruption in the database, you can force crash recovery by starting the server with the storage.local.dirty flag.

If that does not help, or if you simply want to discard the existing database, you can start over by deleting the contents of the storage directory:

  • 1. Stop the service: stop Prometheus.
  • 2. Delete the data directory: rm -r <storage path>/*
  • 3. Start the service: start Prometheus.

Remote storage

By default, Prometheus keeps data in its own local storage for 15 days. However, relying on local storage means that Prometheus cannot persist data durably, cannot store large amounts of historical data, and cannot scale flexibly.
To keep Prometheus simple, it does not try to solve these problems by clustering itself. Instead, it defines two interfaces, remote_write and remote_read, which hand the data over so that you can process it yourself.

Prometheus remote storage actually goes through an adapter. Prometheus does not care what kind of time series database sits on the other end of the adapter, and you can write your own adapter if you like.

For example, the storage path can be: Prometheus --sends data--> remote_storage_adapter --stores data--> InfluxDB.

Prometheus integrates with other remote storage systems in the following two ways:

  • Prometheus writes metrics to remote storage in a standard format
  • Prometheus reads metrics from a remote URL in a standard format

Remote reading

In remote reading, when a user issues a query, Prometheus sends a query request (matchers, time ranges) to the URL configured in remote_read. The adapter fetches the matching data from the third-party storage service according to the request conditions, converts it into raw Prometheus samples, and returns it to the Prometheus server.

Once the sample data has been retrieved, Prometheus processes it locally with PromQL.

Remote writing

Users can specify the remote write URL in the Prometheus configuration file. Once it is set, Prometheus sends sample data to the adapter over HTTP. Inside the adapter, users can connect to any external service, which may be an actual storage system, a public cloud storage service, or any kind of message queue.

Configuration

Configuration is very simple: you only need to set the corresponding addresses.

remote_write:
  - url: "http://localhost:9201/write"

remote_read:
  - url: "http://localhost:9201/read"

Community support

The community has implemented the following remote storage integrations:

  • AppOptics: write
  • Chronix: write
  • Cortex: read and write
  • CrateDB: read and write
  • Elasticsearch: write
  • Gnocchi: write
  • Graphite: write
  • InfluxDB: read and write
  • OpenTSDB: write
  • PostgreSQL/TimescaleDB: read and write
  • SignalFx: write

We use InfluxDB, which supports both reading and writing. With multiple Prometheus servers reading and writing remotely at the same time, its speed has proven acceptable in our use. InfluxDB also has a mature ecosystem with many management tools.

Capacity planning

In general, each sample stored in Prometheus takes about 1-2 bytes. To plan the local disk space for a Prometheus server, you can use the following formula:

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample

If the retention time (retention_time_seconds) and the sample size (bytes_per_sample) stay unchanged, the local disk requirement can only be reduced by lowering the number of samples ingested per second (ingested_samples_per_second).

There are two ways to do that: reduce the number of time series, or increase the scrape interval.

Because Prometheus compresses the samples within each time series, reducing the number of time series has the more noticeable effect.
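
As a worked example of the capacity formula, the sketch below plugs in assumed numbers: 100,000 time series, a 10-second scrape interval, 15 days of retention, and 2 bytes per sample. All four values and the function names are illustrative assumptions; measure your own ingestion rate before sizing a disk.

```python
def samples_per_second(series_count, scrape_interval_seconds):
    """Ingestion rate if every series yields one sample per scrape."""
    return series_count / scrape_interval_seconds


def needed_disk_bytes(retention_seconds, ingested_samples_per_second, bytes_per_sample):
    """Disk size = retention time * samples per second * sample size."""
    return retention_seconds * ingested_samples_per_second * bytes_per_sample


# Assumed inputs, for illustration only.
RETENTION_SECONDS = 15 * 24 * 3600  # 15 days
SERIES_COUNT = 100_000
BYTES_PER_SAMPLE = 2                # upper end of the 1-2 byte estimate

rate = samples_per_second(SERIES_COUNT, scrape_interval_seconds=10)
estimate = needed_disk_bytes(RETENTION_SECONDS, rate, BYTES_PER_SAMPLE)
```

With these assumed inputs, the estimate comes to roughly 25.9 GB. Doubling the scrape interval to 20 seconds, or halving the number of series, halves the result.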


Remote read and write solve the problem of data persistence in Prometheus and allow storage to scale elastically. In addition, Prometheus supports a federation model to address horizontal scaling and network partitioning (for example, monitoring data from regions A, B, and C is aggregated into D). Federation configuration will be described in detail in the article on Prometheus high availability.

Appendix: the Prometheus 2.0 talk at KubeCon 2018

There is also a book on Prometheus: Prometheus: Up & Running (more than 600 pages…).

We have not found it for sale in China, but we did find an English PDF. We are still translating and digesting it, and new content will continue to be added to this series of blog posts.

This article is part of a series on container monitoring practice. For the complete content, see container-monitor-book.