Performance Comparison Test Report of DolphinDB and TimescaleDB

Date: 2021-01-28

1、 Overview

DolphinDB

DolphinDB is a high-performance distributed analytical time-series database written in C++. It uses a columnar in-memory engine with high throughput and low latency, and integrates a powerful programming language with a high-capacity, high-speed streaming data analysis system, so complex programming and computation can run inside the database, significantly reducing the time spent on data migration.

DolphinDB achieves high-speed distributed computing through its in-memory engine, data locality, fine-grained data partitioning and parallel computing. It has built-in computing frameworks such as pipelines, MapReduce and iterative computation, and uses an embedded distributed file system to automatically manage partitioned data and its replicas, providing load balancing and fault tolerance for distributed computing.

DolphinDB supports standard SQL-like syntax, provides a Python-like scripting language for manipulating data, and offers APIs for other common programming languages. It performs well in historical data analysis and modeling and real-time stream processing in finance, as well as in massive sensor data processing and real-time analytics in the Internet of Things.

TimescaleDB

TimescaleDB is the only open-source time-series database on the market with full SQL support. It is built on top of PostgreSQL and is essentially a PostgreSQL extension.

TimescaleDB fully supports SQL and inherits PostgreSQL's rich ecosystem. It optimizes fast inserts and complex queries over time-series data, supports automatic chunking and automatic partitioning along time and space dimensions, supports parallel queries across multiple servers and multiple chunks, and optimizes the write path internally (batched commits, in-memory indexes, transaction support, data backfill).

However, TimescaleDB still does not support horizontal scaling (clustering), i.e. dynamically adding data nodes to share writes (write clustering for multi-node TimescaleDB deployments is under active development: github.com/timescale/t…). It only supports read-only clustering through PostgreSQL streaming replication.

In this report, we compare the performance of TimescaleDB and DolphinDB on time-series data sets. The tests cover CSV data file import and export, disk space usage and query performance. DolphinDB performed better in every test we ran. The main conclusions are:

  • Data import: DolphinDB imports more than 10 times faster than TimescaleDB on the small data set and more than 100 times faster on the large data set. Moreover, TimescaleDB's import rate declines as the import progresses, while DolphinDB's stays stable.
  • Data export: DolphinDB is about 3 times faster than TimescaleDB.
  • Disk space: DolphinDB occupies only 1/6 of TimescaleDB's space on the small data set and only 1/17 on the large data set.
  • Query performance: DolphinDB outperforms TimescaleDB by more than 50 times in 4 test cases, by 10 to 50 times in 15 test cases, and by several times in 10 test cases; it is slower than TimescaleDB in only 2 test cases.

2、 Test environment

Since TimescaleDB does not currently support clusters that accept writes, we test on a single machine with the following configuration.

Host: Dell OptiPlex 7060
CPU: Intel Core i7-8700 (6 cores, 12 threads, 3.20 GHz)
Memory: 32 GB (8 GB × 4, 2666 MHz)
Hard disk: 2 TB HDD (222 MB/s read; 210 MB/s write)
OS: Ubuntu 16.04 LTS

The DolphinDB version tested is Linux v0.89 (January 31, 2019), with maximum memory set to 28 GB. The PostgreSQL version is Ubuntu 10.6-1 on x86_64, with TimescaleDB plugin v1.1.1. Following the performance tuning method recommended by the official TimescaleDB guide and the actual hardware of the test machine, we generated a configuration file on pgtune.leopard.in.ua/ and optimized it further according to the official tuning guide at wiki.postgresql.org/wiki/Tuning…: shared_buffers and effective_cache_size were set to 16 GB, the parallel worker settings were matched to the 12-thread CPU, and effective_io_concurrency was set to 1 because only a single mechanical disk is used. See postgresql_test.conf in the appendix for the full modified configuration.
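For reference, a sketch of the corresponding postgresql.conf entries (the two 16 GB values follow the text; the parallel-worker parameter names and counts are assumptions matched to the 12-thread CPU; postgresql_test.conf in the appendix is the authoritative file):

shared_buffers = 16GB
effective_cache_size = 16GB
max_worker_processes = 12            # assumption: one per hardware thread
max_parallel_workers_per_gather = 6  # assumption
effective_io_concurrency = 1         # single mechanical disk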

3、 Data set

This report tests the performance of DolphinDB and TimescaleDB at a small data scale (4.2 GB) and a large data scale (270 GB).

In the small-data-scale tests, we load all partitioned tables into memory in advance, i.e. with loadTable(memoryMode=true) in DolphinDB and with the pg_prewarm plugin in PostgreSQL, which loads them into shared_buffers.

In the large-data-scale tests, we do not preload the partitioned tables, so the measured query times include disk I/O. To ensure fairness, before each launch of the test program we clear the system page cache, the dentry/inode caches and the hard disk cache with the Linux commands sketched below.
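A sketch of the cache-clearing sequence (writing 1, 2 and 3 to drop_caches clears the page cache, the dentry/inode caches, and both, respectively; the exact invocation in our test scripts may differ):

sync
echo 1 | sudo tee /proc/sys/vm/drop_caches
echo 2 | sudo tee /proc/sys/vm/drop_caches
echo 3 | sudo tee /proc/sys/vm/drop_caches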

The table structures and partitioning methods of the two data sets are as follows:

4.2 GB device sensor record small data set (CSV format, 30 million rows)

We selected devices_big as the small test data set. It covers 3,000 devices over 10,000 time points (2016.11.15-2016.11.19), with fields such as sensor time, device ID, battery, memory and CPU.

Source: docs.timescale.com/v1.1/tutori…
Download: timescaledata.blob.core.windows.net/datasets/de…

The data set contains 30 million rows (4.2 GB). The archive contains a device information table and a device sensor readings table; their structures and partitioning methods are as follows:

device_info table

(image: device_info table structure and partitioning)

readings table

(image: readings table structure and partitioning)

The device_id field in the data set has 3,000 distinct values, which recur throughout the readings table. Storing it as text both wastes space and queries poorly, yet in TimescaleDB an enum type is impractical for this field. DolphinDB's SYMBOL type solves both the storage-space and query-efficiency problems simply and efficiently.
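As an illustration, a minimal sketch of a DolphinDB table schema declaring device_id as SYMBOL (the column list is abridged and the names are assumptions based on the dataset description):

// SYMBOL stores each distinct string once in a dictionary and keeps an
// integer code per row, saving space and speeding up comparisons
schema = table(1:0, `time`device_id`battery_level, [DATETIME, SYMBOL, DOUBLE])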

The same applies to the bssid and ssid fields, which record the WiFi network a device is connected to: although they contain many duplicate values, the data is not known in advance, so an enum type is not suitable either.

Our partitioning scheme in DolphinDB uses time as the first partition dimension, split by day into four ranges with boundaries [2016.11.15 00:00:00, 2016.11.16 00:00:00, 2016.11.17 00:00:00, 2016.11.18 00:00:00, 2016.11.19 00:00:00], and device_id as the second dimension with 10 partitions per day, so each partition holds roughly 100 MB of raw data.
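A minimal sketch of this composite partition in DolphinDB script (the database path is an assumption, and HASH is one plausible way to obtain the 10 device_id sub-partitions; the exact script is in the appendix):

// first dimension: RANGE partition on time, one partition per day
dbTime = database("", RANGE, 2016.11.15T00:00:00 2016.11.16T00:00:00 2016.11.17T00:00:00 2016.11.18T00:00:00 2016.11.19T00:00:00)
// second dimension: 10 sub-partitions on device_id
dbDevice = database("", HASH, [SYMBOL, 10])
db = database("/data/devices/db", COMPO, [dbTime, dbDevice])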

In TimescaleDB we also tried device_id as a second partition dimension, but 90% of the query samples performed worse than with time-only partitioning. We therefore partition only by time, by day into four chunks, the same as DolphinDB's time dimension, and instead index the device_id dimension in the officially recommended way (see docs.timescale.com/v1.0/using-…) to speed up queries, as shown below.

create index on readings (device_id, time desc);
create index on readings (ssid, time desc);
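For completeness, the hypertable itself is created with the one-day chunk interval mentioned above (the same call is quoted again in Section 5):

SELECT create_hypertable('readings', 'time', chunk_time_interval => interval '1 day');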

270 GB stock trading big data set (CSV format, 23 CSV files, 6.5 billion rows)

As the big data set, we test the level-1 quote data of the US stock market for August 1-31, 2007, provided by the New York Stock Exchange (NYSE). It covers more than 8,000 stocks over one month, with fields including trading time, ticker symbol, bid price, ask price, bid size and ask size.

The data set holds 6.5 billion (6,561,693,704) quote records. Each trading day is stored in one CSV file; the month has 23 trading days, and the uncompressed CSV files total 270 GB. Source: www.nyse.com/market-data…

taq table

We partition by date and symbol: each day is further split into 100 partitions by symbol, so each partition holds about 120 MB.
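A sketch of the corresponding composite partition in DolphinDB (the path and the use of HASH for the 100 symbol sub-partitions are assumptions; the exact script is in the appendix):

// first dimension: one VALUE partition per calendar day of August 2007
dbDate = database("", VALUE, 2007.08.01..2007.08.31)
// second dimension: 100 sub-partitions on symbol
dbSymbol = database("", HASH, [SYMBOL, 100])
db = database("/data/TAQ/db", COMPO, [dbDate, dbSymbol])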

4、 Data import and export test

Importing data from CSV files

DolphinDB imports the data with the following script:

timer {
    for (fp in fps) {    // fps: vector of CSV file paths
        loadTextEx(db, `taq, `date`symbol, fp, , schema)
        print(now() + ": imported " + fp)
    }
}

The 4.2 GB device sensor data set (30 million rows) imported in 20 seconds, an average rate of 1,500,000 rows/second.

The 270 GB stock trading data set (6,561,693,704 rows, 23 files TAQ20070801-TAQ20070831) imported in 38 minutes.

For the TimescaleDB import, the timescaledb-parallel-copy tool cannot handle the column-name header on the first line of a CSV, so we first use tail -n +2 to skip the first line and pipe the file stream into its standard input.

for f in /data/TAQ/csv/*.csv ; do
    tail -n +2 $f | timescaledb-parallel-copy \
        --workers 12 \
        --reporting-period 1s \
        --copy-options "CSV" \
        --connection "host=localhost user=postgres password=postgres dbname=test sslmode=disable" \
        --db-name test \
        --table taq \
        --batch-size 200000

    echo "file $f import complete"
done

The 4.2 GB device sensor data set (30 million rows) imported in 5 minutes 45 seconds, an average rate of 87,000 rows/second.

For the 270 GB stock trading data set, importing the five files TAQ20070801, TAQ20070802, TAQ20070803, TAQ20070806 and TAQ20070807 (70 GB in total, 1.67 billion rows) took 24 hours, a rate of 19,400 rows/second. At that rate, importing the full 270 GB would take an estimated 92 hours.

The import performance is shown in the following table:

(image: import performance comparison table)

The results show that DolphinDB's import rate is far higher than TimescaleDB's, and the gap widens with data volume. During the import, TimescaleDB's rate was observed to decline over time, while DolphinDB's remained stable.

In addition, TimescaleDB needs roughly two more minutes to build its indexes after importing the small data set.

Exporting data to a CSV file

In DolphinDB, we export the data with saveText((select * from readings), '/data/devices/readings_dump.csv').

In TimescaleDB, we export the data with time psql -d test -c "COPY (SELECT * FROM readings) TO '/data/devices/devices_dump.csv' DELIMITER ',' CSV".

The export performance of small data sets is shown in the following table:

(image: export performance comparison table)

5、 Comparison of disk space usage

After the data import, the disk space occupied by the TimescaleDB and DolphinDB databases is shown in the following table:

(image: disk space usage comparison table)

DolphinDB's space efficiency is far higher than TimescaleDB's; in TimescaleDB the database even occupies more storage than the original CSV files:

  • TimescaleDB only compresses large fields automatically (TOAST); it does not compress the data tables themselves. That is, when fields are small, rows are short and row counts are large, the tables are not compressed at all, and using a compressing file system such as ZFS would noticeably hurt query performance. DolphinDB, by contrast, uses LZ4 compression by default.
  • TimescaleDB uses SELECT create_hypertable('readings', 'time', chunk_time_interval => interval '1 day') to turn the raw table into a hypertable, an abstraction that provides a unified query and operation interface over the underlying partitions, which are stored as hyperchunks. Analysis shows that within the hyperchunks the index on the time field takes 0.8 GB, and the indexes on device_id and ssid take 2.3 GB.
  • The device_id, ssid and bssid fields contain a large number of duplicate values, but since bssid and ssid record the WiFi connection of a device and the data is not known in advance, they cannot use an enum type and can only be stored as repeated strings. DolphinDB's SYMBOL type adapts dynamically to the actual data and solves the storage problem simply and efficiently.

6、 Query test

We compared the following eight categories of queries:

  • Point query: look up records by a specified field value.
  • Range query: query single or multiple fields over a given time interval.
  • Precision query: aggregate data along different tag dimension columns, supporting range queries over high- or low-dimensional fields.
  • Aggregate query: use the aggregation APIs the time-series database provides, such as count, average, sum, max, min, moving average, standard deviation and normalization.
  • Comparison query: reshape the contents of a field into a table along two dimensions (the first as columns, the second as rows).
  • Sampling query: use the database's sampling API to thin the data, with the sampling method specifiable per query, to avoid overload when the query time range is large.
  • Association query: filter different fields over the same precision and time range, then join and group on the related fields.
  • Classic query: queries commonly used in real business scenarios.

4.2 GB device sensor record small data set query test

For the test of small data sets, we first load all the data tables into memory.

In DolphinDB, loadTable(memoryMode=true) loads the partitioned tables into memory.
In TimescaleDB, select pg_prewarm('_hyper_2_41_chunk') loads each chunk into shared_buffers.

The query performance is shown in the following table. See the appendix for the query script.

(image: small data set query performance table)

For sampling queries, TimescaleDB provides the TABLESAMPLE clause, whose parameter is the sampling percentage, but only two sampling methods (SYSTEM and BERNOULLI) are available. SYSTEM samples by data block: it performs well, but every row in a selected block is returned, so randomness is poor. BERNOULLI samples the whole table uniformly but is slow. Neither method supports sampling by a field. DolphinDB, on the other hand, does not support whole-table sampling, only sampling by partition. Since the implementations differ, we do not compare their performance. Both variants are illustrated below.
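For illustration, the two TABLESAMPLE variants look like this in PostgreSQL (the 1-percent figure is an arbitrary example):

-- block-level sampling: fast, but returns every row of each chosen block
SELECT * FROM readings TABLESAMPLE SYSTEM (1);
-- row-level Bernoulli sampling: more uniform, but scans the whole table
SELECT * FROM readings TABLESAMPLE BERNOULLI (1);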

For interpolation queries, TimescaleDB (PostgreSQL) has no built-in interpolation support and needs hundreds of lines of code to implement it (see wiki.postgresql.org/wiki/Linear…). DolphinDB supports four interpolation methods: ffill (fill forward with the previous non-null value), bfill (fill backward with the next non-null value), lfill (linear interpolation) and nullFill (fill with a specified value), sketched below.
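A minimal sketch of the four DolphinDB fill functions on a vector with nulls (illustrative data):

x = 1 NULL 3 NULL 5
ffill(x)         // forward fill:         1 1 3 3 5
bfill(x)         // backward fill:        1 3 3 5 5
lfill(x)         // linear interpolation: 1 2 3 4 5
nullFill(x, 0)   // fill nulls with 0:    1 0 3 0 5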

For comparison queries, TimescaleDB relies on the crosstab() function from PostgreSQL's built-in tablefunc extension, and the example queries show that this function has serious limitations.
First, the user must manually hard-code every possible value of the second dimension (rows) and its data type; the rows cannot be generated dynamically from the data, which is very cumbersome, so it cannot be used for dynamic data or fields with many distinct values.
Second, it can only sort on a category dimension of type text, or on one converted to text in advance; with large data volumes the conversion is inefficient and wastes space.
DolphinDB natively supports the pivot by statement, which reshapes and sorts automatically once the two dimensions are specified, as sketched below.
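A hedged sketch of pivot by on the readings table (the battery_level column name is assumed from the dataset description):

// reshape into a matrix with time as rows and device_id as columns,
// without hand-enumerating the 3000 device_id values
select last(battery_level) from readings pivot by time, device_id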

For association queries, asof join is extremely convenient for time-series analysis. DolphinDB supports asof join natively, while PostgreSQL does not (see github.com/timescale/t…). A toy example follows.
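A toy illustration of DolphinDB's asof join (aj); the tables and column names are invented for the example:

// for each row of t1, aj picks the most recent row of t2 whose time is
// less than or equal to t1.time within the same sym group
t1 = table(take(`A, 3) as sym, 09:30:00 09:31:00 09:32:00 as time, 10.0 10.1 10.2 as bid)
t2 = table(take(`A, 2) as sym, 09:29:30 09:31:30 as time, 100 200 as vol)
aj(t1, t2, `sym`time)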

When count(*) is used to get the total number of records, TimescaleDB scans the entire table, which is very inefficient. A common workaround is sketched below.
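If an estimate suffices, a common PostgreSQL workaround (not part of our benchmark) is to read the planner's row estimate from the catalog; for a hypertable, the estimates of the individual chunk tables would need to be summed:

-- reltuples is the planner's row-count estimate, refreshed by VACUUM/ANALYZE
SELECT reltuples::bigint AS approx_rows FROM pg_class WHERE relname = 'readings';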

270 GB stock trading big data set query test

In the large-data-scale tests, we do not preload the partitioned tables, so the measured query times include disk I/O. To ensure fairness, before each launch of the test program we clear the system page cache, the dentry/inode caches and the hard disk cache with the same sync and drop_caches commands as in Section 3. After starting the program, all test samples are executed one by one.

The query performance is shown in the following table. See the appendix for the query script.

(image: big data set query performance table)

7、 Appendix

  • CSV data format preview (first 20 lines)

device_info: devices.csv

readings: readings.csv

TAQ: TAQ.csv

  • DolphinDB

Installation, configuration and startup script: test_dolphindb.sh

Configuration file: dolphindb.cfg

Complete small data set test script: test_dolphindb_small.txt

Complete big data set test script: test_dolphindb_big.txt

  • TimescaleDB

Installation, configuration and startup script: test_timescaledb.sh

Complete small data set test script: test_timescaledb_small.sql

Complete big data set test script: test_timescaledb_big.sql

PostgreSQL modified configuration: postgresql_test.conf

PostgreSQL complete configuration: postgresql.conf

PostgreSQL permission configuration: pg_hba.conf

All possible values of the stock symbol: symbols.txt

SQL statement creating the symbol enum type: make_symbol_enum.sql

Script generating the symbol enum type: make_symbol_enum.coffee

  • Test result processing script

REPL.coffee

Author: DolphinDB
Link: https://juejin.cn/post/691408…
Source: Juejin
The copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.