DolphinDB Time Series Database: Data Import Tutorial

Time: 2021-07-21

When an enterprise adopts a big data analytics platform, the first task is to migrate massive amounts of data from multiple data sources onto that platform.

Before importing data, we need to understand the basic concepts and features of the DolphinDB database.

By storage medium, DolphinDB data tables can be divided into three types (a short sketch of each follows the list):

  • In-memory table: the data is stored only in the memory of a node, so access is the fastest, but the data is lost once the node shuts down.
  • Local disk table: the data is saved on the local disk. Even after the node restarts, the data can easily be reloaded into memory by script.
  • Distributed table: the data is physically distributed across different nodes. Through DolphinDB's distributed computing engine, it can still be queried logically as if it were a local table.
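
As a minimal sketch (the paths, database names and column values below are illustrative, not taken from the original tutorial), the three kinds of tables can be created as follows:

//In-memory table: exists only in the node's memory
t = table(1..10 as id, rand(10.0, 10) as price)

//Local disk table: persist the table to a local database directory and reload it by script
dbDisk = database("/home/data/localDB")
saveTable(dbDisk, t, "t_disk")
tDisk = loadTable("/home/data/localDB", "t_disk")

//Distributed table: data is partitioned and stored across the nodes of the cluster
dbDfs = database("dfs://demoDB", VALUE, 1..10)
pt = dbDfs.createPartitionedTable(t, "pt", "id")
pt.append!(t)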

By whether they are partitioned, DolphinDB data tables fall into two types:

  • Regular (non-partitioned) table
  • Partitioned table

In a traditional database, partitioning applies to individual tables, that is, each table in the same database can have its own partition scheme. In DolphinDB, partitioning applies to the database: a database can use only one partition scheme, so two tables with different partition schemes cannot be placed in the same database.
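
For example (a hedged sketch with illustrative names), two partitioned tables created in the same DolphinDB database must both follow that database's single partition scheme:

//One database, one partition scheme: VALUE partition on calendar days
db = database("dfs://demoDateDB", VALUE, 2018.01.01..2018.01.31)
t = table(100:0, `tradingDay`symbol`price, [DATE, SYMBOL, DOUBLE])
//Both tables are partitioned by tradingDay, the scheme defined by the database
quotes = db.createPartitionedTable(t, "quotes", "tradingDay")
trades = db.createPartitionedTable(t, "trades", "tradingDay")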

DolphinDB provides three flexible data import methods:

  • Import through CSV text files
  • Import through HDF5 files
  • Import through the ODBC interface

1. Import through CSV text file

Transferring data through CSV files is a common way of migrating data. DolphinDB provides three functions, loadText, ploadText and loadTextEx, to import a CSV file. We use a sample CSV file, candle_201801.csv, to illustrate the usage of these three functions.

1.1 loadText

Syntax: loadText(filename, [delimiter=','], [schema])

Parameters:

filename is the name of the file to import.

delimiter and schema are optional parameters.

delimiter specifies the separator between fields. The default is ",".

schema specifies the data type of each field after import; it is a table. DolphinDB can recognize field types automatically, but in some cases the automatically recognized types do not meet the requirements. For example, when importing the sample CSV file candle_201801.csv, the volume field is recognized as INT, while we actually need LONG. In such cases the schema parameter is needed.

Script for creating schema table:

nameCol = `symbol`exchange`cycle`tradingDay`date`time`open`high`low`close`volume`turnover`unixTime
typeCol = [SYMBOL,SYMBOL,INT,DATE,DATE,INT,DOUBLE,DOUBLE,DOUBLE,DOUBLE,INT,DOUBLE,LONG]
schemaTb = table(nameCol as name,typeCol as type)

When a table has many fields, the script to create the schema table becomes very lengthy. To avoid this, DolphinDB provides the extractTextSchema function, which extracts the table structure from the text file. We then only need to modify the types of the fields that must be specified.

dataFilePath = "/home/data/candle_201801.csv"
schemaTb=extractTextSchema(dataFilePath)
update schemaTb set type=`LONG where name=`volume        
tt=loadText(dataFilePath,,schemaTb)

1.2 ploadText

ploadText loads a data file into memory in parallel as a partitioned in-memory table. Its syntax is exactly the same as loadText, but ploadText is faster. It is mainly used to load large files quickly: it makes full use of multiple cores to load a file in parallel, and the degree of parallelism depends on the number of CPU cores on the server and the localExecutors configuration of the node.

Let's compare the performance of loadText and ploadText.

First, generate a CSV file of about 4 GB with the following script:

filePath = "/home/data/testFile.csv"
appendRows = 100000000
dateRange = 2010.01.01..2018.12.30
ints = rand(100, appendRows)
symbols = take(string('A'..'Z'), appendRows)
dates = take(dateRange, appendRows)
floats = rand(float(100), appendRows)
times = 00:00:00.000 + rand(86400000, appendRows)
t = table(ints as int, symbols as symbol, dates as date, floats as float, times as time)
t.saveText(filePath)

Then import the file with loadText and ploadText respectively. The test node has a 4-core, 8-thread CPU.

timer loadText(filePath);
//Time elapsed: 39728.393 ms
timer ploadText(filePath);
//Time elapsed: 10685.838 ms

The results show that ploadText is almost four times faster than loadText.

1.3 loadTextEx

Syntax: loadTextEx(dbHandle, tableName, [partitionColumns], fileName, [delimiter=','], [schema])

Parameters:

dbHandle is a database handle.

tableName is the name of the distributed table that holds the data.

partitionColumns, delimiter and schema are optional parameters.

When the partition scheme is not sequential (SEQ) partitioning, partitionColumns must be specified; it indicates the partition column(s).

fileName is the name of the file to import.

delimiter specifies the separator between fields. The default is ",".

schema specifies the data type of each field after import; it is a table.

The loadText function always imports the whole file into memory; when the data file is very large, the memory of the machine easily becomes a bottleneck. loadTextEx solves this problem well: it imports and saves at the same time, streaming a static CSV file into a DolphinDB distributed table instead of loading everything into memory first and then saving it as a partitioned table, which greatly reduces the memory requirement.

First, create a distributed table to store the data:

dataFilePath = "/home/data/candle_201801.csv"
tb = loadText(dataFilePath)
db=database("dfs://dataImportCSVDB",VALUE,2018.01.01..2018.01.31)  
db.createPartitionedTable(tb, "cycle", "tradingDay")

Then import the file into the distributed table:

loadTextEx(db, "cycle", "tradingDay", dataFilePath)

When the data is needed for analysis, the partition metadata is first loaded into memory with the loadTable function. When a query is actually executed, DolphinDB loads the data into memory on demand.

tb = database("dfs://dataImportCSVDB").loadTable("cycle")
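
For example (a hedged sketch; the trading day used is illustrative), only the partitions relevant to the query predicate are read from disk when the query runs:

//Only the partition for 2018.01.02 is loaded into memory by this query
select top 10 * from tb where tradingDay = 2018.01.02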

2. Import through HDF5 file

HDF5 is a binary file format that is more efficient than CSV and is widely used in data analysis. DolphinDB also supports importing data from HDF5 files.

DolphinDB accesses HDF5 files through the HDF5 plug-in, which provides the following methods (a short usage sketch follows the list):

  • hdf5::ls: lists all group and dataset objects in the H5 file.
  • hdf5::lsTable: lists all dataset objects in the H5 file.
  • hdf5::hdf5DS: returns the metadata of a dataset in the H5 file.
  • hdf5::loadHdf5: imports an H5 file into an in-memory table.
  • hdf5::loadHdf5Ex: imports an H5 file into a partitioned table.
  • hdf5::extractHdf5Schema: extracts the table structure from an H5 file.
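
For instance (a hedged sketch, assuming the plug-in described below has already been loaded), a file can be inspected before importing it:

dataFilePath = "/home/data/candle_201801.h5"
//List all group and dataset objects in the file
hdf5::ls(dataFilePath)
//List only the dataset objects
hdf5::lsTable(dataFilePath)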

When calling a plug-in method, the namespace must be prefixed to the method name; for example, loadHdf5 is called as hdf5::loadHdf5. If you don't want to write the namespace every time, you can use the use keyword:

use hdf5
loadHdf5(filePath,tableName)

To use the DolphinDB plug-in, first download the HDF5 plug-in and deploy it to the plugins directory of the node. The plug-in must be loaded before use, with the following script:

loadPlugin("plugins/hdf5/PluginHdf5.txt")

Importing an HDF5 file is similar to importing a CSV file. For example, to import the sample HDF5 file candle_201801.h5, which contains a dataset candle_201801, the simplest way is as follows:

dataFilePath = "/home/data/candle_201801.h5"
datasetName = "candle_201801"
tmpTB = hdf5::loadHdf5(dataFilePath,datasetName)

If you need to specify data types, you can use hdf5::extractHdf5Schema. The script is as follows:

dataFilePath = "/home/data/candle_201801.h5"
datasetName = "candle_201801"
schema=hdf5::extractHdf5Schema(dataFilePath,datasetName)
update schema set type=`LONG where name=`volume        
tt=hdf5::loadHdf5(dataFilePath,datasetName,schema)

If the HDF5 file is very large and the machine's memory cannot hold it in full, you can use hdf5::loadHdf5Ex to load the data.

First, create a distributed table to store the data:

dataFilePath = "/home/data/candle_201801.h5"
datasetName = "candle_201801"
dfsPath = "dfs://dataImportHDF5DB"
tb = hdf5::loadHdf5(dataFilePath,datasetName)
db=database(dfsPath,VALUE,2018.01.01..2018.01.31)  
db.createPartitionedTable(tb, "cycle", "tradingDay")

Then import the HDF5 file through the hdf5::loadHdf5Ex function:

hdf5::loadHdf5Ex(db, "cycle", "tradingDay", dataFilePath,datasetName)

3. Import through ODBC interface

DolphinDB supports connecting to third-party databases through the ODBC interface and reading their tables directly into DolphinDB in-memory tables. With the ODBC plug-in provided by DolphinDB, data can easily be migrated from any ODBC-supported database to DolphinDB.

The ODBC plug-in provides the following four methods for manipulating third-party data sources:

  • odbc::connect: opens a connection.
  • odbc::close: closes a connection.
  • odbc::query: executes the given SQL query and returns the result as a DolphinDB in-memory table.
  • odbc::execute: executes the given SQL statement in the third-party database without returning data.

Before using the ODBC plug-in, you need to install the ODBC driver. Please refer to the ODBC plug-in tutorial.

Taking a connection to SQL Server as an example, the configuration of the existing database is as follows:

Server: 172.18.0.15

Default port: 1433

User name: sa

Password: 123456

Database name: SZ_TAQ

The database table contains the data of January 1, 2016; the table name is candle_201801, and its fields are the same as those of the CSV file.

To connect to the SQL Server database with the ODBC plug-in, first download the plug-in, extract it, copy all files in its plugins/odbc directory to the plugins/odbc directory of the DolphinDB server, and then initialize the plug-in with the following script:

//Loading plug-ins
loadPlugin("plugins/odbc/odbc.cfg")
//Connect to SQL Server
conn=odbc::connect("Driver=ODBC Driver 17 for SQL Server;Server=172.18.0.15;Database=SZ_TAQ;Uid=sa;Pwd=123456;")

Before importing the data, create a distributed database on disk to save it:

//Use the table structure from SQL Server as the template for the DolphinDB table
tb = odbc::query(conn,"select top 1 * from candle_201801")
db=database("dfs://dataImportODBC",VALUE,2018.01.01..2018.01.31)
db.createPartitionedTable(tb, "cycle", "tradingDay")

Import the data from SQL Server and save it into the DolphinDB partitioned table:

data = odbc::query(conn,"select * from candle_201801")
tb = database("dfs://dataImportODBC").loadTable("cycle")
tb.append!(data);
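
After the import finishes, the connection can be released with odbc::close (listed above):

//Close the ODBC connection
odbc::close(conn)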

Importing data through ODBC avoids the export-then-import file round trip. Moreover, combined with DolphinDB's scheduled job mechanism, it can also serve as a data channel for the timed synchronization of time series data.
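
As a hedged sketch of such timed synchronization (the job name, schedule time, and the idea of re-running the same query are illustrative assumptions, not part of the original tutorial), DolphinDB's scheduleJob function can run the ODBC import periodically:

def syncFromSqlServer(){
    //Re-open the connection, pull the data and append it to the partitioned table;
    //in practice the SQL would typically select only the newly added rows
    conn = odbc::connect("Driver=ODBC Driver 17 for SQL Server;Server=172.18.0.15;Database=SZ_TAQ;Uid=sa;Pwd=123456;")
    data = odbc::query(conn, "select * from candle_201801")
    database("dfs://dataImportODBC").loadTable("cycle").append!(data)
    odbc::close(conn)
}
//Run the synchronization every day at 08:00
scheduleJob("syncOdbc", "daily ODBC sync", syncFromSqlServer, 08:00m, 2018.01.01, 2030.12.31, 'D')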

4. Financial data import cases

The following example imports the daily K-line data files of the stock market. The data is saved on disk as CSV files covering about 10 years, organized in annual subdirectories, about 100 GB in total. An example of the directory layout:

2008
    ---- 000001.csv
    ---- 000002.csv
    ---- 000003.csv
    ---- 000004.csv
    ---- ...
2009
...
2018

Each file has the same structure.


4.1 Partition planning

Before importing the data, it is necessary to plan the partitioning, which involves two aspects:

  • Determine the partition field.
  • Determine the granularity of the partition.

First, based on how queries are usually executed, we build a composite (COMPO) partition of two RANGE partitions on the tradingDay and symbol fields. Partitioning on the commonly queried fields greatly improves the efficiency of data retrieval and analysis.

The next step is to define the granularity of the two partitions.

The time span of the existing data is from 2008 to 2018, so we divide the data according to the year. When planning the time partition, we should consider leaving enough space for the subsequent data, so we set the time range as 2008-2030.

yearRange =date(2008.01M + 12*0..22)

There are thousands of stock codes. If we partitioned by stock code with VALUE partitioning, each partition would only be a few megabytes and there would be a huge number of partitions. When a distributed system executes a query, it splits the query into several subtasks and distributes them to different partitions, so VALUE partitioning here would produce a large number of very short tasks, and the system would spend more time managing tasks than executing them. This is clearly unreasonable. Instead, we divide all stock codes into 100 ranges, each range being one partition, so the final partition size is about 100 MB. Considering that new stock data will arrive later, a virtual code 999999 is appended to form a partition together with the last stock code, to hold the data of subsequently listed stocks.

Get the partition ranges of the symbol field with the following script:

//Traverse all annual directories, collect the stock code list, and split it into 100 ranges with cutPoints
symbols = array(SYMBOL, 0, 100)
yearDirs = files(rootDir)[`filename]
for(yearDir in yearDirs){
    path = rootDir + "/" + yearDir
    symbols.append!(files(path)[`filename].upper().strReplace(".CSV",""))
}
//Deduplicate and append a virtual code to reserve room for new stocks
symbols = symbols.distinct().sort!().append!("999999");
//Split the codes into 100 ranges
symRanges = symbols.cutPoints(100)

Define a two-level COMPO partition and create the database and the partitioned table with the following script:

columns=`symbol`exchange`cycle`tradingDay`date`time`open`high`low`close`volume`turnover`unixTime
types =  [SYMBOL,SYMBOL,INT,DATE,DATE,TIME,DOUBLE,DOUBLE,DOUBLE,DOUBLE,LONG,DOUBLE,LONG]

dbDate=database("", RANGE, yearRange)
dbID=database("", RANGE, symRanges)
db = database(dbPath, COMPO, [dbDate, dbID])

pt=db.createPartitionedTable(table(1000000:0,columns,types), tableName, `tradingDay`symbol)

Note that a partition is the smallest unit of storage in DolphinDB, and writes to a partition are exclusive. When tasks run in parallel, multiple tasks must not write to the same partition at the same time. In this case, each year's data is handled by a single task and the data ranges of the tasks do not overlap, so it is impossible for multiple tasks to write to the same partition.

4.2 Import data

The main idea of the import script is simple: loop over the directory tree, read the CSV files one by one, and write them to the distributed database table dfs://SAMPLE_TRDDB. However, there are still many details in the actual import process.

The first problem is that the data format in the CSV files differs from DolphinDB's internal formats, for example the time field: in the file, a value such as 9390100000 represents a time accurate to milliseconds. If read directly, it is recognized as a numeric type rather than a time type, so during import we need to convert it with the conversion function datetimeParse combined with the formatting function format. The key script is as follows:

datetimeParse(format(time,"000000000"),"HHmmssSSS")
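
For instance (a hedged sketch; the raw value 93901000, meaning 09:39:01.000, is illustrative):

rawTime = 93901000
//format pads the integer to the 9-digit string "093901000",
//which datetimeParse then interprets with the pattern HHmmssSSS
datetimeParse(format(rawTime, "000000000"), "HHmmssSSS")
//Expected result: 09:39:01.000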

Although a simple loop would work, the 100 GB of data consists of a great many small files of roughly 5 MB each, and a single-threaded import would take a long time. To make full use of the cluster's resources, we split the import into multiple subtasks by year and send them in turn to the task queues of the nodes for parallel execution, which greatly improves import efficiency. This process takes the following two steps:

(1) Define a function whose job is to import all the files under a given year directory:

//Loop over all data files in the year directory
def loadCsvFromYearPath(path, dbPath, tableName){
    symbols = files(path)[`filename]
    for(sym in symbols){
        filePath = path + "/" + sym
        t=loadText(filePath)
        database(dbPath).loadTable(tableName).append!(select symbol, exchange,cycle, tradingDay,date,datetimeParse(format(time,"000000000"),"HHmmssSSS"),open,high,low,close,volume,turnover,unixTime from t )            
    }
}

(2) Submit the function defined above to the nodes for execution through the rpc and submitJob functions:

nodesAlias="NODE" + string(1..4)
years= files(rootDir)[`filename]

index = 0;
for(year in years){    
    yearPath = rootDir + "/" + year
    des = "loadCsv_" + year
    rpc(nodesAlias[index%nodesAlias.size()],submitJob,des,des,loadCsvFromYearPath,yearPath,dbPath,tableName)
    index=index+1
}

During the import, you can use pnodeRun(getRecentJobs) to observe the completion status of the background tasks on each node.
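
For example:

//Check the status of the background import jobs on every node
pnodeRun(getRecentJobs)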

Full script of this case